- Original article
- 2006.03511 Unsupervised Translation of Programming Languages
- Lample et al.
- Guillaume Lample
Facebook AI Research
Fully unsupervised neural transcompiler
- Translates between programming languages; trained on monolingual source code obtained from GitHub
- it looks for previously undetected patterns in data sets without labels and with a minimal amount of human supervision
- outperforms rule-based baselines by a “significant” margin.
- composed of
- transformer architecture
- the same model is used for all programming languages.
Trained using the three principles of unsupervised machine translation
- initialization, LMing, and back-translation
In this section, we summarize these principles and detail how we instantiate them to translate programming languages.
Cross Programming Language Model pretraining
Originally, pretraining was done by initializing the model with cross-lingual word representations.
In the context of unsupervised English-French translation, the embedding of the word “cat” will be close to the embedding of its French translation “chat”.
Cross-lingual word embeddings can be obtained by training monolingual word embeddings and aligning them in an unsupervised manner.
Subsequent work showed that pretraining the entire model (and not only word representations) in a cross-lingual way could lead to significant improvements in unsupervised machine translation.
The pretraining strategy of Lample and Conneau
In particular, we follow the pretraining strategy of Lample and Conneau, where an XLM is pretrained with a masked LMing objective on monolingual source code datasets.
- First principle: initialization
- The first principle initializes the model with cross-lingual masked LM pretraining. As a result, pieces of code that express the same instructions are mapped to the same representation, regardless of the programming language.
- Second principle: denoising
- Trains the decoder to always generate valid sequences, even when fed with noisy data, and increases the encoder's robustness to input noise.
- Third (and last) principle: back-translation
- Allows the model to generate parallel data which can be used for training.
Whenever the Python → C++ model becomes better, it generates more accurate data for the C++ → Python model, and vice versa (see the sketch below).
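A minimal sketch of one back-translation step, assuming hypothetical `translate` and `train_step` helpers on a shared seq2seq model (the names are illustrative, not the paper's released code):

```python
# Back-translation sketch: two directions of the same shared model train each other.
# `py2cpp` / `cpp2py` and their methods are hypothetical, for illustration only.

def back_translation_step(py_batch, cpp_batch, py2cpp, cpp2py):
    # Translate monolingual Python into (possibly noisy) C++ with the current model.
    synthetic_cpp = py2cpp.translate(py_batch)
    # Train the C++ -> Python direction to reconstruct the original Python,
    # treating (synthetic_cpp, py_batch) as a parallel pair.
    cpp2py.train_step(src=synthetic_cpp, tgt=py_batch)

    # Symmetric update: C++ -> synthetic Python -> train the Python -> C++ direction.
    synthetic_py = cpp2py.translate(cpp_batch)
    py2cpp.train_step(src=synthetic_py, tgt=cpp_batch)
```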
We obtain cross-lingual embeddings after training.
The cross-lingual nature of the resulting model comes from the significant number of common tokens (anchor points) that exist across languages.
In the context of English-French translation, the anchor points consist essentially of digits and city and people names.
In programming languages, these anchor points come from common keywords (e.g. for, while, if, try), and also digits, mathematical operators, and English strings that appear in the source code.
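A toy illustration of anchor points (the token lists are hand-made for illustration, not produced by the paper's tokenizer): tokenizing a C++ and a Python function and intersecting their vocabularies surfaces the shared keywords, identifiers, digits, and operators.

```python
# Hand-made token lists for illustration; a real tokenizer would differ slightly.
cpp_tokens = ["int", "total", "(", "int", "n", ")", "{", "int", "s", "=", "0", ";",
              "for", "(", "int", "i", "=", "0", ";", "i", "<", "n", ";", "i", "++", ")",
              "s", "+=", "i", ";", "return", "s", ";", "}"]
py_tokens = ["def", "total", "(", "n", ")", ":", "s", "=", "0",
             "for", "i", "in", "range", "(", "n", ")", ":", "s", "+=", "i",
             "return", "s"]

anchor_points = set(cpp_tokens) & set(py_tokens)
print(anchor_points)  # e.g. {'total', '(', ')', 'n', 's', '=', '0', 'for', 'i', '+=', 'return'}
```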
For the masked LMing (MLM) objective, at each iteration we consider an input stream of source code sequences, randomly mask out some of the tokens, and train TransCoder to predict the tokens that have been masked out based on their contexts.
We alternate between streams of batches of different languages.
This allows the model to create high quality, cross-lingual sequence representations.
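A simplified sketch of the masking step of the MLM objective (the probability and the single `<mask>` replacement are illustrative simplifications of the standard BERT/XLM scheme):

```python
import random

MASK = "<mask>"

def mask_tokens(tokens, mask_prob=0.15):
    """Randomly mask tokens; the model is trained to recover the originals from context."""
    inputs, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            inputs.append(MASK)
            targets.append(tok)    # loss is computed only on masked positions
        else:
            inputs.append(tok)
            targets.append(None)   # ignored by the loss
    return inputs, targets

# Example on a tokenized Python function:
tokens = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"]
masked_inputs, gold = mask_tokens(tokens)
```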
We initialize the encoder and decoder of the seq2seq model with the pretrained XLM model.
The initialization is straightforward for the encoder, as it has the same architecture as the XLM model.
The transformer decoder, however, has extra parameters related to the source attention mechanism.
Following Lample and Conneau, we initialize these parameters randomly.
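A sketch of this initialization in PyTorch, assuming hypothetical `encoder`/`decoder` modules whose parameter names match the pretrained XLM checkpoint wherever the layers are shared:

```python
import torch

def init_from_xlm(encoder, decoder, xlm_checkpoint_path):
    """Initialize a seq2seq model from a pretrained XLM (module/key names are hypothetical)."""
    xlm_state = torch.load(xlm_checkpoint_path, map_location="cpu")

    # The encoder has the same architecture as the XLM model: copy every weight.
    encoder.load_state_dict(xlm_state, strict=True)

    # The decoder shares embeddings, self-attention and feed-forward weights with XLM,
    # but its source-attention layers have no pretrained counterpart: with strict=False
    # those parameters are simply left at their random initialization.
    decoder.load_state_dict(xlm_state, strict=False)
```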
- XLM pretraining
- allows the seq2seq model to generate high quality representations of input sequences.
However, the decoder lacks the capacity to translate, as it has never been trained to decode a sequence based on a source representation.
To address this issue, we train the model to encode and decode sequences with a Denoising Auto-Encoding (DAE) objective. The DAE objective operates like a supervised machine translation algorithm, where the model is trained to predict a sequence of tokens given a corrupted version of that sequence.
To corrupt a sequence, we use the same noise model as the one described in Lample et al.
Namely, we randomly mask, remove and shuffle input tokens.
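A sketch of such a corruption function, with illustrative noise probabilities and shuffle window (the exact noise model follows Lample et al.):

```python
import random

def corrupt(tokens, p_drop=0.1, p_mask=0.1, shuffle_k=3):
    """Corrupt a token sequence for the DAE objective: drop, mask, and locally shuffle tokens.

    The probabilities and the shuffle window are illustrative, not the paper's exact values.
    """
    noisy = []
    for tok in tokens:
        r = random.random()
        if r < p_drop:
            continue                                  # remove the token
        noisy.append("<mask>" if r < p_drop + p_mask else tok)

    # Local shuffle: each surviving token can move by at most `shuffle_k` positions.
    keys = [i + random.uniform(0, shuffle_k) for i in range(len(noisy))]
    return [tok for _, tok in sorted(zip(keys, noisy))]
```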
In practice, the “cross-linguality” of the model highly depends on the amount of anchor points across languages.
As a result, an XLM model trained on English-French will provide better cross-lingual representations than a model trained on English-Chinese, because the different alphabets reduce the number of anchor points.
In programming languages, the majority of strings are composed of English words, which results in a fairly high number of anchor points, and the model naturally becomes cross-lingual.
GPT-3 summary of Arxiv summary
We train a neural transcompiler using monolingual source code from GitHub, and show that it can translate functions between C++, Java, and Python with high accuracy.