
9 December 2019

Live language modelling in Overleaf (feat GPT2)

by Matthew Baas

Think Gmail smart compose, but for technical LaTeX writing on Overleaf.

TL;DR: I wanted a smart autocomplete for the usual LaTeX editor I use (Overleaf) to make my time writing various documents much more productive. So, I combined a nice open source pytorch implementation of GPT2 with a Chrome browser extension to add smart text generation from the GPT2 language model to the editor. This article is less about the ins-and-outs of neural language models & GPT2, and more about how to deploy such a model and use it to modify the existing functionality of a website in a way that is reasonably simple, effective, and sandboxed (read: this article has more javascript/python deploy code, less neural net theory stuff). Here is a peek at the results in the Overleaf editor for an electromagnetics practical from my undergraduate courses:

Background

Language models are functions mapping an arbitrary sequence of \(n\) words to a probability between 0 and 1. Or, in probability theory terms, a language model is a random variable defined on sequences of \(n\) words (outcomes), which is described by some associated probability distribution and mass function. As a side note: usually language models predict \(\log \frac{p}{1-p}\) (a logit) instead of the actual probability \(p\). For practical applications, however, it is sufficient to say that they calculate the most likely next word given a history of \(n\) previous words (or more generally \(n\) previous characters, sub-words or even byte pair encodings).
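To make the 'most likely next word' idea concrete, the usual autoregressive factorisation that models like GPT2 implement is just the standard chain rule of probability (nothing specific to this project):

\[ p(w_1, w_2, \dots, w_n) = \prod_{t=1}^{n} p(w_t \mid w_1, \dots, w_{t-1}) \]

so at generation time the model repeatedly scores \( p(w \mid w_1, \dots, w_n) \) over its vocabulary, we pick (or sample) a likely next token, append it to the history, and repeat.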

In recent times language modelling with neural networks has really taken off and become more accessible. In particular, Huggingface has released a great open source implementation and pretrained models of GPT2, a model originally designed at OpenAI. And given the usefulness of neural auto-complete applications like Gmail smart compose, I thought it would be really cool to have a neural auto-complete for LaTeX and to make it super accessible by having it simply plug into one of the biggest online LaTeX editors already around – Overleaf.

Note: If you are mainly interested in using/funding this and not how it was made, skip to the FAQ at the end.

Setup

So the plan to achieve this goal is:

  1. Fine tune a GPT2 language model on LaTeX documents to get a language model that works well on technical LaTeX text.
  2. Make a Chrome browser extension that can add auto-complete suggestions to the Overleaf text editor, as well as pull out recent text history near the current position in the document.
  3. Link step 2's functionality to an API call: we pass in the recent text history, and get back a few completions generated by GPT2.
  4. Make the backend deploy code to serve this API endpoint that the browser extension will use.

Design

I will now discuss a bit of each step, and the tricks I had to use to make them work nicely and seamlessly.

1. Fine tuning a GPT2 language model

This first step is super easy thanks to Huggingface's GPT2 pytorch implementation. I simply adapted their pytorch fine-tuning script for the training, using nearly the same defaults they provide for GPT2 fine-tuning and starting training from their pretrained GPT2 weights.

Data preparation

For the data, since this project is more of a first prototype, I didn't go too big and used a fairly small corpus of LaTeX documents – namely, all the .tex source files for all my past LaTeX documents. I simply collated all my .tex files into two .txt files (one for training, one for validation), where the text of each source document is separated by an end-of-stream (eos) token and two new lines in the final .txt file. For Huggingface's library the eos token is "<|endoftext|>".

I randomly separated the .tex files into a training and validation set with an 80-20% split (so technically it's an 80-20% split in terms of .tex files, not individual words). Splitting at the file level also ensures that the validation text does not come from the same source document as the training text. That matters because when you actually use the chrome extension, it will be on new documents not seen during training (unless a future update does training on earlier parts of a large document while one is working on it? 🙂).
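As a concrete sketch of this preparation step (the directory layout and file names here are placeholders, not my exact script):

import glob
import random

EOS = "<|endoftext|>"  # Huggingface's GPT2 end-of-stream token

# gather all .tex source files and shuffle them
tex_files = glob.glob("latex_corpus/**/*.tex", recursive=True)
random.shuffle(tex_files)

# 80-20% split at the level of whole .tex files
split = int(0.8 * len(tex_files))
splits = {"train.txt": tex_files[:split], "valid.txt": tex_files[split:]}

for out_name, files in splits.items():
    with open(out_name, "w", encoding="utf-8") as out:
        for path in files:
            with open(path, encoding="utf-8") as f:
                out.write(f.read().strip())
            # separate documents with the eos token and two new lines
            out.write("\n" + EOS + "\n\n")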

Using these two large .txt documents, the fine-tuning script is run for a while to fine-tune the small GPT2 model (117M parameters) – the choice of the small model being purely down to cost and GPU memory constraints. Once training completes and the validation error looks reasonable, we are left with a few important files: the GPT2 model itself (a large pytorch_model.bin of the neural net weights), and a few files describing the config and language tokenizer (a description of how to convert text to numerical tensors). The tokenizer files are required since the LaTeX corpus has its own byte-pair encodings that might not be the same as the original pretrained model.

2. Making a Chrome extension that can interface with Overleaf

Chrome (and Firefox) extensions are surprisingly capable and flexible when it comes to modifying the DOM and javascript of any arbitrary website. With some fiddling, you can grab existing javascript functions and DOM elements defined in the website, and pass these off to a background script which can interact with an external API.

2.1 A primer: chrome extensions

Chrome extensions (the case is similar for Firefox) allow a few kinds of scripts which affect the browsing experience, namely:

  - Background scripts, which run for the lifetime of the browser independently of any one tab, and which may talk to external servers and to the other parts of the extension.
  - Content scripts, which the extension attaches to particular webpages. They can read and modify that page's DOM, but they run in an isolated javascript context, so they cannot directly touch the javascript variables defined by the page itself.
  - Popup scripts, which drive the small UI shown when you click the extension's icon in the toolbar.

All of these are declared in the extension's manifest.json, which also lists the permissions the extension needs. This is just a few of the files and features of chromium extensions, but they will be sufficient for the purposes of this project. See more about chrome extensions here.

2.2 Overleaf and the Ace editor

The code editor in Overleaf actually uses a great open source project called the Ace editor, which exposes a wide-ranging API that allows one to get text from certain regions of the document, add and initiate autocompletions, and even set themes. Overleaf uses this API to make a great LaTeX editor, and we will try to hook into that editor to pull out pieces of text and add additional autocompletions. This way, all the other setup of the Ace editor is handled by Overleaf's code, so the whole project will work seamlessly with different themes, font sizes, contexts and other unique Overleaf functionality.

Extracting snippets of historical text

The Ace editor is encapsulated in a javascript Editor object. This editor object is created and configured by Overleaf's code when the IDE first loads on a page. Assuming for now that we can access this javascript object, we can access the full range of functions it exposes. Namely, getLines(start, end) returns the text between rows start and end (inclusive). Well, that was quite easy. The editor object also allows a call to get the current cursor position. So to get the recent history, let's grab the last 10 lines, which is as easy as making end the current cursor row and start = end - 10.
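In other words, something like the following sketch, where editor stands for whatever variable ends up holding the Ace Editor object:

// current cursor position, an object like {row: ..., column: ...}
var pos = editor.getCursorPosition();
// grab up to the last 10 lines, clamped to the start of the document
var start = Math.max(0, pos.row - 10);
var lines = editor.session.getLines(start, pos.row);
// trim the final line so we only keep text up to the cursor
lines[lines.length - 1] = lines[lines.length - 1].substring(0, pos.column);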

Adding autocompletions

This is a little more tricky, since Overleaf's code modifies how autocompletions are done quite heavily to make them work nicely with LaTeX's syntax. Autocompletes in Ace are handled by running .execCommand("startAutocomplete") on the editor object. When this is run, the editor runs a getCompletions() function on each one of its completers. A completer is an object that defines a function getCompletions, which takes in the editor and some context about the text surrounding the current cursor position, and returns a list of text completions, each with an associated score used to rank them in the final UI autocomplete display. For example, Overleaf has a file completer, which suggests completions of particular file names defined in the project. Note that because of javascript's asynchronous style, getCompletions doesn't quite 'return' the completions, but passes them to a callback function; that is not too important, however, since Ace handles the calling and callbacks of the getCompletions function.

Ok, seems simple enough. We just need to define an object with a getCompletions function and add that object to the editor's .completers attribute (which is just a list of all its completers). Our getCompletions will get the last 10 lines or so of history, pass them off to the API that returns the completions generated by the language model, and then pass these completions to the callback function mentioned earlier.

One problem, however, is that the completions given by each completer are not the final list of autocompletes shown to the user - Ace first filters the autocompletions to those that start with a specified prefix, where prefix is calculated by Overleaf's customization of the startAutocomplete function. That is, Overleaf has set up the editor's .execCommand("startAutocomplete") functionality to make prefix the text between the last backslash on the current line and the current cursor position (which of course is how just about all LaTeX autocompletions start). Another problem is that if we simply add the new completer, it will trigger on every ordinary autocomplete as well, which we probably don't want. Ideally, the neural auto-complete should only be invoked when some hotkey is pressed.

To solve these issues, a flag system works nicely together with invoking the .execCommand("startAutocomplete") command on some hotkey (I used shift+tab). Whenever the user hits shift+tab, we run .execCommand("startAutocomplete") to start an autocomplete operation and set a flag lm_active to true. The completer we discussed earlier then simply returns no completions if the lm_active flag is false. When the autocompletion is done, we set lm_active to false again to ensure that the completer does not trigger on regular LaTeX syntax autocompletes.

Finally, to fix the prefix filtering problem, I further modified the piece of code which calculates the prefix to set a blank prefix if the lm_active flag is true, and to do what it usually does otherwise. And to make matters more complicated there is yet another problem: Overleaf modifies how Ace inserts autocompletions to work nicely with some nested LaTeX operations. Luckily solving this is not too bad, since we can further modify the function which does this and override its behaviour when lm_active is true to be the default Ace 'insert completion' behaviour.
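The hotkey and flag part of this looks roughly like the sketch below (with editor again standing in for the Ace editor object; the prefix and insertion overrides are left out since they patch Overleaf's own functions):

var lm_active = false; // true only while a neural autocomplete is in flight

// bind shift+tab to trigger a language-model autocomplete
editor.commands.addCommand({
    name: "startLmAutocomplete",
    bindKey: {win: "Shift-Tab", mac: "Shift-Tab"},
    exec: function(editor) {
        lm_active = true;
        editor.execCommand("startAutocomplete");
    }
});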

2.3 Connecting the chrome extension to the Ace editor and Overleaf

So from the previous section it is quite clear that, if we have access to the javascript memory of the webpage, we can access and modify the Ace editor object, and thus extract text and add/trigger autocompletions. But as described in (2.1), none of Chrome's extension machinery gives us access to the javascript memory of a webpage directly. There are a few ways to set up the chrome extension to overcome this. The way I found worked quite well is to have a background script attach a specific content script to a website if the website's URL is an Overleaf project URL. Then, the content script injects javascript into the DOM, and that injected javascript does have access to the page's existing javascript memory. Thus the extension code setup looks like:

Chrome extension diagram

Now this injected script can do all the things I mentioned in section (2.2) earlier, since it can access the existing editor object in the javascript memory and override its functions. The specific javascript code for each part is not that long. The background script has a few things to start the content script and enable the popup icon if the url is an Overleaf project url:

chrome.runtime.onInstalled.addListener(function() {
    chrome.declarativeContent.onPageChanged.removeRules(undefined, function() {
        chrome.declarativeContent.onPageChanged.addRules([{
            conditions: [new chrome.declarativeContent.PageStateMatcher({
                pageUrl: {urlMatches: '.*overleaf\\.com/project/.*'},
            })
            ],
            actions: [
                new chrome.declarativeContent.ShowPageAction(),
            ]
        }]);
      });
  });

The background script also contains a listener for messages from the injected script, but more on that later. The content script is just a couple of lines to add the injected javascript as a <script> DOM element (a sketch of it is shown after the next code block). The injected javascript performs all the code and logic discussed in section (2.2). The main piece of logic is the definition of getCompletions for the completer, which looks something like:

var languageModelCompleter = {
    getCompletions: function(editor, session, pos, prefix, callback) {
        // don't return any completions if we aren't busy predicting
        if(lm_active == false) {
            return
        }
        // gather last 10 lines
        var lines = editor_proxy.session.getLines(pos["row"] - 10, pos["row"])
        lines[lines.length - 1] = lines[lines.length - 1].substring(0, pos["column"]+1)
        // dispatch message to background script
        var resp = "None";
        chrome.runtime.sendMessage("<the chrome extension id>", 
            {lines: lines}, // the message data is the last 10 lines
            function(response) {
                if(response === null) {
                    callback(null, []); // don't add completions if nothing returned
                    lm_active = false; // reset neural autocomplete prediction flag
                    return;
                }
                resp = response.prediction;
                const result = resp.map(function(x) { 
                    return {
                        caption: x, // what is shown in the preview bar
                        value: x, // what goes onto the line if u smash tab
                        score: 90 // some score to rank the predictions
                    }
                })
                callback(null, result)
                lm_active = false; // reset neural autocomplete prediction flag
            });        
    }
};
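And for reference, the content script that injects this code into the page can be as small as the sketch below (assuming the injected file is named injected.js and is listed under web_accessible_resources in manifest.json):

// content script: inject injected.js into the page so that it runs in the
// page's own javascript context and can reach the existing Ace editor object
var s = document.createElement('script');
s.src = chrome.runtime.getURL('injected.js');
s.onload = function() { this.remove(); }; // the code has run, the tag itself is no longer needed
(document.head || document.documentElement).appendChild(s);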

Cool beans. Nearly there.

3. Linking the Chrome extension to an API endpoint

So the injected code sends the last 10 lines to the background script. To communicate with a server api that will take in these 10 lines and return the predictions from GPT2, we just need a listener in the background script which performs an HTTP request to the server api endpoint:

chrome.runtime.onMessageExternal.addListener(
    function(request, sender, sendResponse) {
        const lines = request.lines; // history of 10 lines of text around cursor

        // send request to server api with lines, get prediction pack
        var xhr = new XMLHttpRequest();
        xhr.open("POST", "<server api endpoint url>", true);
        xhr.setRequestHeader("Content-Type", "application/json;charset=UTF-8");
        xhr.onload = function (e) {
            if (xhr.readyState === 4) {
                if (xhr.status === 200) {
                    var response = JSON.parse(xhr.responseText);
                    // if api responded with predictions, send them to completor
                    sendResponse({prediction: response.prediction});
                } else {
                    // else return no autocompletions
                    sendResponse(null);
                    console.error(xhr.statusText);
                }
            }
        };
        // sends the url request
        xhr.send(JSON.stringify({lines: lines}));
        // return true so the message channel stays open for the asynchronous sendResponse above
        return true;
});

And one last thing which might catch one out in similar applications is that we need to add permission for the extension to communicate with an external server url. Namely, in the extension’s manifest.json, we need to add the lines:

"externally_connectable": {"matches": ["https://*.overleaf.com/project/*"]},
"permissions": [..., "<server api endpoint url>"],

Cool, chrome extension done!

4. Backend deployment of pytorch GPT2 model

Lastly, we need to make the backend server which quickly (<200ms) computes the autocompletions using the pytorch model. Here is where you will probably have the most freedom with which server framework/services you use. For the sake of simplicity and familiarity, I've used Django together with its REST framework, which is quite nice for building APIs.

So to start, a new python environment is needed with 4 main dependencies:

  - Django
  - Django REST framework (djangorestframework)
  - pytorch (torch)
  - Huggingface's transformers

Then, since we are only going to serve 1 model from 1 endpoint, and (for privacy first, simplicity second) we will not store any of the historical text sent to the server, we don't actually need a database and can just use the REST endpoint functionality. So, in the new python environment we can run:

>> django-admin startproject coolproject
>> cd coolproject
>> python manage.py startapp endpoints
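One bit of glue worth mentioning: with the standard project layout created above, the new app and the REST framework also need to be registered in coolproject/settings.py, roughly like this:

# coolproject/settings.py (excerpt)
INSTALLED_APPS = [
    # ... the default Django apps ...
    'rest_framework',  # Django REST framework
    'endpoints',       # the app created by `startapp endpoints`
]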

Now the next part is to add the view to the endpoints. So in coolproject/endpoints/views.py, we add:

from rest_framework import status
from rest_framework.response import Response
from rest_framework.views import APIView

import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from transformers import WEIGHTS_NAME, CONFIG_NAME

output_dir = "<path to where the GPT2 model save is stored>"
device = torch.device('cuda')

Here I have assumed the presence of a GPU, since running GPT2 inference on a CPU with a reasonably long text history takes ~5-20 seconds, compared to ~100-300ms on a GPU. Only the latter is at all reasonable for an autocomplete feature. Next, we load the GPT2 model into memory only once at the beginning:

# load the saved fine-tuned model and tokenizer vocabulary
model = GPT2LMHeadModel.from_pretrained(output_dir)
tokenizer = GPT2Tokenizer.from_pretrained(output_dir)
model.eval() # make sure dropout and other train time stuff is turned off
model.to(device)

And then the big thing we need to add is the view which invokes the prediction, which will be similar to the code below:

class Infer(APIView):
    def post(self, request, format=None):
        lines = request.data['lines']
        pred_length = int(request.data.get('pred_length', 10)) # optional, defaults to 10 tokens
        # in any production environment, you should check and carefully parse
        # these post parameters, to catch any errors or weird things thrown at the API.
        pred_text = gpt2_infer('\n'.join(lines), pred_length=pred_length) # get predictions
        response = {
            'message': 'Success',
            'prediction': pred_text,
        } # build and send response
        return Response(response, status=status.HTTP_200_OK)

Cool. The last piece of code we need is the gpt2_infer function, which is pretty much the only piece of pytorch code here. It is an adaptation of Huggingface's GPT2 language generation script, which actually straight up throws a tensor indexing error if used with any model other than GPT2 small (as of the writing of this post). In general, to predict num_samples samples of pred_length words (or more accurately, byte-pair encodings in the vocab), the code is:

def gpt2_infer(historical_context, pred_length=10, repetition_penalty=1.0, num_samples=3):
    top_p = 0.5
    temperature = 0.9 # more temperature -> more entropy
    # tokenize  historical context
    original_context_tokens = torch.tensor(tokenizer.encode(historical_context)).to(device)
    generated = original_context_tokens.unsqueeze(0).repeat(num_samples, 1)
    context = generated
    past = None
    # generate `num_samples` prediction sequences of length `pred_length`. 
    for i in range(pred_length):
        output, past = model(context, past=past)

        next_token_logits = output[:, -1, :]
        next_token_logits /=  (temperature if temperature > 0 else 1.)
        # the top_k_top_p_filtering function is taken without alteration from 
        # https://github.com/huggingface/transformers/blob/master/examples/run_generation.py 
        filtered_logits = top_k_top_p_filtering(next_token_logits, top_p=top_p)

        next_token = torch.multinomial(F.softmax(filtered_logits, dim=-1), num_samples=1)
        generated = torch.cat((generated, next_token), dim=1)
        context = next_token
        # WATCH OUT: the shape of past grows a lot as u generate more tokens.
        # See https://github.com/huggingface/transformers/issues/1916

    gen_seqs = []
    gen_lists = generated[:, len(original_context_tokens):].tolist()
    for o in gen_lists:
        sequence = tokenizer.decode(o, clean_up_tokenization_spaces=True)
        if historical_context[-1] == ' ' and sequence[0] == ' ':
            gen_seqs.append(sequence[1:])
        else:
            gen_seqs.append(sequence)

    return gen_seqs

Now the final step is to add the api endpoint to the Django project’s urls.py, which can be done with:

from django.conf.urls import url
from rest_framework.urlpatterns import format_suffix_patterns
from endpoints import views
...
urlpatterns += format_suffix_patterns([url(r'^infer$', views.Infer.as_view())], allowed=['json', 'html', 'api'])
...
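Before wiring it into the extension, the endpoint can be sanity-checked with a small request from python (a sketch assuming the default local runserver address and the /infer route defined above):

import requests

# a fake snippet of recent LaTeX history, as the extension would send it
lines = ["The far-field radiation pattern of the dipole is", "given by"]
r = requests.post("http://127.0.0.1:8000/infer", json={"lines": lines})
print(r.json()["prediction"])  # a list of a few generated continuations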

Done! The setup I have shown here for hosting is a bare-bones local one, assuming your computer has a CUDA enabled GPU. When running the server with python manage.py runserver, you will see the url it is hosted at, and you can then just use that as the <server api endpoint url> discussed earlier. Then the full end-to-end system should work nice and seamlessly. If everything works correctly, the end experience while editing an Overleaf document will be like below (remember each prediction is initiated by a hit of shift+tab):

I hope this was insightful. If you are curious about anything and you don't feel it's answered in the FAQ below, just send me a message using one of the ways in the About part of this website.

FAQ:

PROJECT STATUS: on-request alpha. If you are interested in testing it out, please get in contact :).

Can I use it?

TL;DR: not publicly yet; it is currently an on-request alpha, so get in contact if you would like to try it out.

Why not make the extension for Firefox or Brave first?

For Brave and other less developed browsers, there is not as much developer support and documentation as for chrome, and it is not quite clear which parts of the chromium extension API are available in each. It might very well be that the extension made here works without modification in Brave and other chromium browsers, but I am uncertain. Firefox is a great choice and I would like to expand support to it soon, but I chose chrome first because Firefox has some nasty PDF bugs. Namely, viewing a PDF in Firefox (which uses pdf.js) behaves extremely badly with high resolution graphics and text (see this), which is often a problem for high resolution embedded .pdf or .eps graphics in papers. I am not too sure about the details of google chrome's pdf renderer, but it does not have this problem.

Why pytorch instead of tensorflow/ONNX/other deployment framework for this inference application?

I like pytorch more these days. It is still simple to use in deployment and didn't require a completely new inference framework to work well.

Are you affiliated with Overleaf?

No, not in any way except that I’m a frequent user of the site and think it’s a pretty neat service they offer. This chrome extension just modifies the existing HTML and javascript in your local browser, and has no interaction with anything Overleaf related except for the javascript/HTML loaded into the local browser window.

tags: NLP - deep learning - LaTeX - Chrome extension