CombTransformer belongs to a family of architectures called comb transformers, which draw substantial inspiration from the transformer architecture while mitigating its quadratic memory requirements. To this end, comb transformers partition the input sequence into discrete segments, each processed individually; cross-segment information is then combined through the X-word embedding technique. For instructions on replicating or using the experiments, please refer to this page's replication package. The replication package is available here.
The comb transformer and the baseline architectures are assessed on three tasks: Method Name Generation (using the JavaLarge dataset), Code Search (using the CodeSearchNet dataset), and Code Summarization (using the Funcom dataset). To keep the package small, we include only the test split of each task and 12 of the 51 trained baselines. Note that the included datasets are already preprocessed.
We also provide a Colab notebook with the steps needed to test the provided models on the provided datasets.
This section outlines the requirements for running the replication package. Note that we list specific package versions, but newer versions may also be compatible.
You will need a Linux machine or a Linux virtual machine to run the replication package.
You will need the following installed:
You will need the following Python 3 packages installed:
You may also need to apply a patch to spiral.
You will need to update the environment variable LD_LIBRARY_PATH with the installation path of jep. For example:
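A minimal sketch of the update; the jep path below is a placeholder, not the path on your system — locate yours first (for instance via `python3 -c 'import jep, os; print(os.path.dirname(jep.__file__))'`):

```shell
# Placeholder path -- replace with the jep installation path on your machine
JEP_PATH=/path/to/site-packages/jep

# Append it to the dynamic linker search path so jep's native library is found
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:${JEP_PATH}"
```

Add the export to your shell profile if you want it to persist across sessions.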
To process the data for each architecture baseline, a specific procedure needs to be followed. In total, five different views can be generated from each dataset. Each view is utilized by one or more baselines. The instructions to generate all views from each dataset (JavaLarge, CodeSearchNet, and Funcom) are provided in the following makefiles:
For instance, running the command make -f searchtask-make/dataset.mk serialize generates all the views for the CodeSearchNet dataset. Note that these commands may take several hours to complete. If you are interested in the intermediate steps, refer to the respective makefile, which also contains the recipes for building intermediate versions of the datasets.
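As a concrete example, the CodeSearchNet views can be generated as follows (this is the one command confirmed above; consult the other makefiles listed for the JavaLarge and Funcom equivalents):

```shell
# Generate all five views for the CodeSearchNet dataset.
# This can take several hours; run it from the package root.
make -f searchtask-make/dataset.mk serialize
```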
If you intend to retrain a baseline, you must first regenerate the training data, since only the test splits are included in the package. The makefiles provide recipes for training 51 models in total. For example, the recipe make --always-make -f nametask-make/train.mk data/models/nametask-f2s-stmtwise-hct1/model2.pt trains the HCT-1 architecture for three epochs. The trained model, intermediate checkpoints, and log files are saved in the data/models/nametask-f2s-stmtwise-hct1/ directory. Note that all recipes use wandb for logging; if you do not have a wandb account or are not interested in wandb logs, you can simply set WANDB_MODE=dryrun.
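Putting the pieces together, a training run without a wandb account can be invoked like this (recipe and paths taken from the example above):

```shell
# Train the HCT-1 baseline for three epochs; WANDB_MODE=dryrun keeps
# wandb logging local instead of syncing to a wandb account.
WANDB_MODE=dryrun make --always-make -f nametask-make/train.mk \
  data/models/nametask-f2s-stmtwise-hct1/model2.pt
```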
The following makefiles contain the recipes for training all the baselines:
Please be aware that training time varies across models, ranging from a few hours to several, depending on your hardware.
As mentioned earlier, we have provided 12 out of the 51 trained architectures, which correspond to the best trained baselines. If you wish to test these architectures as described in our work, you can execute the recipes found in the following makefiles:
For example, the recipe MODEL_NAME=<FILE_NAME> make --always-make -f nametask-make/test.mk data/models/nametask-f2s-stmtwise-hct1/log-test.log tests the trained HCT-1 model located at data/models/nametask-f2s-stmtwise-hct1/<FILE_NAME>. Again, if you are not interested in wandb logs, you can simply set WANDB_MODE=dryrun. Testing time varies from a few minutes to several hours depending on your machine's capabilities. We also provide a Colab notebook for testing the 12 provided architectures.
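The full test invocation from the example above, with local-only wandb logging, looks like this (<FILE_NAME> is the name of one of the provided checkpoint files and must be filled in by you):

```shell
# Test the provided HCT-1 checkpoint named <FILE_NAME>, which must be
# located in data/models/nametask-f2s-stmtwise-hct1/
WANDB_MODE=dryrun MODEL_NAME=<FILE_NAME> make --always-make \
  -f nametask-make/test.mk \
  data/models/nametask-f2s-stmtwise-hct1/log-test.log
```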