
Pytorch implementation of TransCoder in Unsupervised Translation of Programming Languages.

Release: pre-trained models

We used the validation set to select the best checkpoint for each language pair, and chose the model used to compute the test scores.

Models used in the original TransCoder paper are the following (directions selected using the validation set):

TransCoder_model_1 for C++ -> Java, Java -> C++ and Java -> Python, Python -> C++.
TransCoder_model_2 for C++ -> Python, Python -> Java.

Better model for translating between Java and Python (pretrained with our new model DOBF, 2021): its computational accuracy scores are 39.5% for Python -> Java (44.7% with beam size 10) and 49.2% for Java -> Python (52.5% with beam size 10).

Note: if you really want the output of these models to be exactly right, you need to change the constant LAYER_NORM_EPSILON to 1e-12 instead of 1e-5. If you don't, the result will be the same in more than 99% of the cases and only slightly different otherwise.
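If you would rather patch a reloaded checkpoint than edit the constant in the code, a minimal sketch along these lines works, assuming the reloaded model is a standard PyTorch nn.Module whose layer norms are torch.nn.LayerNorm instances (the helper and its usage are illustrative, not part of the repository):

```python
import torch.nn as nn

def set_layer_norm_eps(model: nn.Module, eps: float = 1e-12) -> int:
    """Set the epsilon of every LayerNorm submodule; return how many layers were patched."""
    patched = 0
    for module in model.modules():
        if isinstance(module, nn.LayerNorm):
            module.eps = eps
            patched += 1
    return patched

# Illustrative usage, once `model` holds the reloaded TransCoder encoder/decoder:
# n = set_layer_norm_eps(model, eps=1e-12)
# print(f"patched {n} LayerNorm layers")
```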
You can use these models to translate functions with this script:

    python -m codegen_ --src_lang python --tgt_lang java --model_path <model_path> --beam_size 1

You can detokenize them with the script preprocessing/detokenize.py.

You can extract the function id and use it to find the corresponding test script in data/evaluation/geeks_for_geeks_successful_test_scripts/ if it exists. For instance, for the line COUNT_SET_BITS_IN_AN_INTEGER_3 | in the .tok file, the corresponding test script can be found in data/evaluation/geeks_for_geeks_successful_test_scripts/cpp/COUNT_SET_BITS_IN_AN_INTEGER_3.cpp. If the script is missing, it means there was an issue with our automatically created tests for the corresponding function. The code generated by your model can be tested by injecting it where the TO_FILL comment is in the test script.
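To make that workflow concrete, here is a minimal sketch of looking up a test script from a function id and injecting generated code at the TO_FILL marker. The '|' separator, the marker spelling and the helper itself are assumptions for illustration, not the repository's own evaluation code:

```python
from pathlib import Path
from typing import Optional

TEST_SCRIPT_DIR = Path("data/evaluation/geeks_for_geeks_successful_test_scripts/cpp")
TO_FILL_MARKER = "//TO_FILL"  # assumed spelling of the marker in the C++ test scripts

def fill_test_script(tok_line: str, generated_code: str) -> Optional[str]:
    """Return the test script with the generated code injected, or None if no script exists.

    A missing script means the automatic test creation failed for that function.
    """
    function_id = tok_line.split("|")[0].strip()          # e.g. "COUNT_SET_BITS_IN_AN_INTEGER_3"
    script_path = TEST_SCRIPT_DIR / f"{function_id}.cpp"
    if not script_path.exists():
        return None
    return script_path.read_text().replace(TO_FILL_MARKER, generated_code)
```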

Training Dataset Overview

Data you need to pretrain a model with MLM:

training data (monolingual): source code in each language (the data is split across GPUs).
test / valid data (monolingual): source code in each language, to test the perplexity of the model.

The path is given as the --data_path argument. In our case we use NGPU=8.

Get Training Data

First get raw data from Google BigQuery. Then run the following command to get the monolingual data for MLM:

    # folder containing raw data, i.e. json.gz
    --langs cpp java python      # languages to preprocess
    --local=True                 # run on your local machine if True; if False, run on a cluster (requires submitit setup)
    --train_splits=NGPU          # nb of splits for training data - corresponds to the number of GPUs you have

To get the monolingual functions data for DAE and BT, change the corresponding option in the command above.

Note that if your data is small enough to fit on a single GPU, then NGPU=1 and loading this single split on all GPUs is the normal thing to do. Note also that if you run your training on multiple machines, each with NGPU GPUs, splitting in NGPU is fine as well. You will just have to specify --split_data_accross_gpu local in your training parameters. In our case, with 4 machines of 8 GPUs each, we set NGPU=8 and --split_data_accross_gpu local.

Simply download the binarized data transcoder_test_set.zip and add them to the same folder as the data you preprocessed above.
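To illustrate the NGPU / train_splits logic described above, here is a small sketch of how a GPU could be mapped to its training shard; the file-naming pattern and the helper are illustrative assumptions, not the repository's data loader:

```python
def split_path_for_gpu(data_path: str, lang: str, gpu_rank: int, n_splits: int) -> str:
    """Return the training split a given GPU would load.

    n_splits == 1: the data fits on one GPU, so every GPU loads the same split.
    n_splits == NGPU: one shard per GPU (with --split_data_accross_gpu local,
    the rank used here would be the local, per-machine rank).
    """
    split_id = gpu_rank % n_splits
    return f"{data_path}/train.{lang}.{split_id}.pth"  # assumed naming scheme

# 8 GPUs, 8 splits -> one shard per GPU; 8 GPUs, 1 split -> the same file everywhere.
print(split_path_for_gpu("/data/mlm", "python", gpu_rank=3, n_splits=8))
print(split_path_for_gpu("/data/mlm", "python", gpu_rank=3, n_splits=1))
```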

Train

The training command includes, among others, the following flags:

    --bt_steps 'python_sa-java_sa-python_sa,java_sa-python_sa-java_sa,python_sa-cpp_sa-python_sa,java_sa-cpp_sa-java_sa,cpp_sa-python_sa-cpp_sa,cpp_sa-java_sa-cpp_sa' \
    --lgs_mapping 'cpp_sa:cpp,java_sa:java,python_sa:python' \
    --has_sentence_ids "valid|para,test|para" \

Evaluate

Evaluation is done after each training epoch. But if you want to evaluate a model without training it, run the same command as the training command and add these flags:

    --validation_metrics 'valid_python_sa-java_sa_mt_comp_acc'

You do not need to have the training data in your data_path, only the validation and test sets.

Our results with the models we provide are reported for beam size 1 (i.e. greedy decoding) and beam size 10 (using length_penalty = 0.5).
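For context, the computational accuracy used above is, roughly, the fraction of translated functions whose generated code passes its unit tests. A minimal sketch of that aggregation (the input format is an assumption):

```python
from typing import Iterable

def computational_accuracy(test_results: Iterable[bool]) -> float:
    """Fraction of translated functions whose generated code passes all of its unit tests.

    `test_results` contains one boolean per evaluated function (True = tests passed).
    """
    results = list(test_results)
    return sum(results) / len(results) if results else 0.0

# Toy example: 2 correct translations out of 4 -> 50.0% computational accuracy.
print(f"{computational_accuracy([True, False, True, False]):.1%}")
```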
