CodeReviewer Model Inference#
Let’s generate code reviews using the microsoft/codereviewer model [LLG+22].
from pathlib import Path
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from tqdm.autonotebook import tqdm
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import utils
1 Tokenizers and Datasets#
P.S. Enormous thanks to the authors of [p4v] for providing open-source code for working with the tokenizer and the dataset.
# download tokenizer from huggingface
tokenizer = AutoTokenizer.from_pretrained("microsoft/codereviewer")
# add required special tokens to the tokenizer
tokenizer = utils.process_tokenizer(tokenizer)
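`utils.process_tokenizer` and `utils.encode_diff` come from the author's helper module, so their exact implementation is not shown here. As a rough illustration only (not the actual `utils.encode_diff`), CodeReviewer-style preprocessing marks each line of a diff hunk with special tokens such as `<add>`, `<del>`, and `<keep>` before tokenization:

```python
def tag_diff_lines(diff_hunk: str) -> str:
    """Illustrative sketch: prefix each diff line with a
    CodeReviewer-style special token. Hypothetical helper, not
    the real utils.encode_diff."""
    tagged = []
    for line in diff_hunk.splitlines():
        if line.startswith("+"):
            tagged.append("<add>" + line[1:])
        elif line.startswith("-"):
            tagged.append("<del>" + line[1:])
        else:
            tagged.append("<keep>" + line)
    return "".join(tagged)

print(tag_diff_lines("-x = 1\n+x = 2\n print(x)"))
# <del>x = 1<add>x = 2<keep> print(x)
```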
class ReviewsDataset(Dataset):
    def __init__(self, df: pd.DataFrame, tokenizer):
        self.y = df["human_review"]
        self.code = df["diff_hunk"]
        # encode each diff hunk; .tolist() turns the Series of
        # token-id lists into a shape torch.tensor can stack
        self.x = torch.tensor(
            df.apply(lambda row: utils.encode_diff(tokenizer, row["diff_hunk"], '', ''), axis=1).tolist(),
            dtype=torch.long,
        )

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]
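For `torch.tensor` to stack the encoded hunks into one matrix, `encode_diff` must return sequences of a fixed length. A minimal sketch of that pad-or-truncate step (pad id 0 assumed purely for illustration):

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    # Truncate to max_length, then right-pad with pad_id so every
    # example has the same width and can be stacked into a tensor.
    token_ids = token_ids[:max_length]
    return token_ids + [pad_id] * (max_length - len(token_ids))

print(pad_or_truncate([5, 8, 2], 5))           # [5, 8, 2, 0, 0]
print(pad_or_truncate([5, 8, 2, 7, 9, 1], 5))  # [5, 8, 2, 7, 9]
```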
2 Load data#
Here we load the data and create a dataloader for each project.
filenames = ['../data/msg-test.csv', '../data/JetBrains_kotlin_1000.csv', '../data/microsoft_vscode_1000.csv', '../data/transloadit_uppy_1000.csv']
datasets = []
dataloaders = []
for filename in filenames:
    df = pd.read_csv(filename)
    dataset = ReviewsDataset(df, tokenizer)
    datasets.append(dataset)
    dataloader = DataLoader(dataset, batch_size=16, shuffle=False)  # batch_size=6 for an 8 GB GPU
    dataloaders.append(dataloader)
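With `batch_size=16` and `shuffle=False`, each 1000-example project file yields ceil(1000 / 16) = 63 batches, which is the batch count you will see in the progress bars:

```python
import math

n_examples, batch_size = 1000, 16
# DataLoader keeps the last, smaller batch by default (drop_last=False)
n_batches = math.ceil(n_examples / batch_size)
print(n_batches)  # 63
```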
3 Predict#
Now we can generate code reviews for each project. We will use two models:
- the pre-trained model from HuggingFace, provided by the authors of [LLG+22]
- a model fine-tuned on the CodeReviewer dataset
Predict function#
def predict(model, dataloader, device='cuda'):
    model = model.to(device)
    model.eval()
    result = []
    with torch.no_grad():  # inference only, no gradients needed
        for X, y in tqdm(dataloader):
            inputs_mask = X.ne(tokenizer.pad_id)
            preds = model.generate(
                X.to(device),
                attention_mask=inputs_mask.to(device),
                use_cache=True,
                num_beams=5,
                early_stopping=True,
                max_length=512,
                num_return_sequences=1,
            )
            # decode the predictions, dropping the two leading special tokens
            preds_np = preds.detach().cpu().numpy()
            preds_decoded = [
                tokenizer.decode(row[2:],
                                 skip_special_tokens=True,
                                 clean_up_tokenization_spaces=False)
                for row in preds_np
            ]
            # add the decoded predictions to the result
            result += preds_decoded
    return result
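The attention mask `X.ne(tokenizer.pad_id)` simply marks which positions hold real tokens rather than padding. The same computation in NumPy (pad id 0 assumed for illustration):

```python
import numpy as np

pad_id = 0  # assumed pad id, for illustration only
X = np.array([[5, 8, 2, 0, 0],
              [7, 3, 9, 4, 0]])
# True where a real token sits, False over right-padding
mask = X != pad_id
print(mask.astype(int))
# [[1 1 1 0 0]
#  [1 1 1 1 0]]
```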
HuggingFace pre-trained checkpoint#
The model is available on the HuggingFace model hub: https://huggingface.co/microsoft/codereviewer
# download the pretrained model from huggingface
hf_model = AutoModelForSeq2SeqLM.from_pretrained("microsoft/codereviewer")
for filename, dataset, dataloader in zip(filenames, datasets, dataloaders):
    preds = predict(hf_model, dataloader)
    df_pred = pd.DataFrame({'code': dataset.code, 'target': dataset.y, 'prediction': preds})
    df_pred.to_csv(Path(filename).with_suffix('.hf_pred.csv'))
df_pred.head()
100%|██████████| 636/636 [11:27<00:00, 1.08s/it]
100%|██████████| 63/63 [03:37<00:00, 3.45s/it]
100%|██████████| 63/63 [02:01<00:00, 1.93s/it]
100%|██████████| 63/63 [02:46<00:00, 2.64s/it]
Fine-tuned CodeReviewer#
I fine-tuned the model on the CodeReviewer dataset for the msg task, following the instructions from the authors of [LLG+22].
For the fine-tuning I used the following parameters:
- batch_size=16
- learning_rate=3e-4
- max_source_length=512
The fine-tuning took about 12 hours on a single NVIDIA A100 GPU. The model was fine-tuned for 3 epochs.
I have made the checkpoint available on the HuggingFace model hub: https://huggingface.co/waleko/codereviewer-finetuned-msg
# download the fine-tuned model
ft_model = AutoModelForSeq2SeqLM.from_pretrained("waleko/codereviewer-finetuned-msg")
for filename, dataset, dataloader in zip(filenames, datasets, dataloaders):
    preds = predict(ft_model, dataloader)
    df_pred = pd.DataFrame({'code': dataset.code, 'target': dataset.y, 'prediction': preds})
    df_pred.to_csv(Path(filename).with_suffix('.finetuned_pred.csv'))
df_pred.head()
Some weights of the model checkpoint at waleko/codereviewer-finetuned-msg were not used when initializing T5ForConditionalGeneration: ['cls_head.weight', 'cls_head.bias']
- This IS expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
100%|██████████| 636/636 [15:51<00:00, 1.50s/it]
100%|██████████| 63/63 [01:40<00:00, 1.59s/it]
100%|██████████| 63/63 [01:32<00:00, 1.48s/it]
100%|██████████| 63/63 [01:26<00:00, 1.38s/it]