CodeReviewer Model Inference#
Let’s generate code reviews using the microsoft/codereviewer model [LLG+22].
from pathlib import Path
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from tqdm.autonotebook import tqdm
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import utils
1 Tokenizers and Datasets#
P.S. Enormous thanks to the authors of [p4v] for providing open-source code for working with the tokenizer and the dataset.
# download tokenizer from huggingface
tokenizer = AutoTokenizer.from_pretrained("microsoft/codereviewer")
# add required special tokens to the tokenizer
tokenizer = utils.process_tokenizer(tokenizer)
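`utils.process_tokenizer` and `utils.encode_diff` come from the author's helper module, so their exact implementation is not shown here. As a rough illustration only (not the actual `utils.encode_diff`), CodeReviewer-style preprocessing marks each line of a diff hunk with special tokens such as `<add>`, `<del>`, and `<keep>` before tokenization:

```python
def tag_diff_lines(diff_hunk: str) -> str:
    """Illustrative sketch: prefix each diff line with a
    CodeReviewer-style special token. Hypothetical helper, not
    the real utils.encode_diff."""
    tagged = []
    for line in diff_hunk.splitlines():
        if line.startswith("+"):
            tagged.append("<add>" + line[1:])
        elif line.startswith("-"):
            tagged.append("<del>" + line[1:])
        else:
            tagged.append("<keep>" + line)
    return "".join(tagged)

print(tag_diff_lines("-x = 1\n+x = 2\n print(x)"))
# <del>x = 1<add>x = 2<keep> print(x)
```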
class ReviewsDataset(Dataset):
    def __init__(self, df: pd.DataFrame, tokenizer):
        self.y = df["human_review"]
        self.code = df["diff_hunk"]
        # encode each diff hunk; .tolist() turns the Series of
        # token-id lists into a shape torch.tensor can stack
        self.x = torch.tensor(
            df.apply(lambda row: utils.encode_diff(tokenizer, row["diff_hunk"], '', ''), axis=1).tolist(),
            dtype=torch.long,
        )

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]
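For `torch.tensor` to stack the encoded hunks into one matrix, `encode_diff` must return sequences of a fixed length. A minimal sketch of that pad-or-truncate step (pad id 0 assumed purely for illustration):

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    # Truncate to max_length, then right-pad with pad_id so every
    # example has the same width and can be stacked into a tensor.
    token_ids = token_ids[:max_length]
    return token_ids + [pad_id] * (max_length - len(token_ids))

print(pad_or_truncate([5, 8, 2], 5))           # [5, 8, 2, 0, 0]
print(pad_or_truncate([5, 8, 2, 7, 9, 1], 5))  # [5, 8, 2, 7, 9]
```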
2 Load data#
Here we load the data and create a dataloader for each project.
filenames = ['../data/msg-test.csv', '../data/JetBrains_kotlin_1000.csv', '../data/microsoft_vscode_1000.csv', '../data/transloadit_uppy_1000.csv']
datasets = []
dataloaders = []
for filename in filenames:
    df = pd.read_csv(filename)
    dataset = ReviewsDataset(df, tokenizer)
    datasets.append(dataset)
    dataloader = DataLoader(dataset, batch_size=16, shuffle=False)  # batch_size=6 for an 8 GB GPU
    dataloaders.append(dataloader)
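With `batch_size=16` and `shuffle=False`, each 1000-example project file yields ceil(1000 / 16) = 63 batches, which is the batch count you will see in the progress bars:

```python
import math

n_examples, batch_size = 1000, 16
# DataLoader keeps the last, smaller batch by default (drop_last=False)
n_batches = math.ceil(n_examples / batch_size)
print(n_batches)  # 63
```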
3 Predict#
Now we can generate code reviews for each project. We will use two models:
- the pre-trained model from HuggingFace, provided by the authors of [LLG+22]
- a model fine-tuned on the CodeReviewer dataset
Predict function#
def predict(model, dataloader, device='cuda'):
    model = model.to(device)
    model.eval()
    result = []
    with torch.no_grad():  # inference only, no gradients needed
        for X, y in tqdm(dataloader):
            inputs_mask = X.ne(tokenizer.pad_id)
            preds = model.generate(
                X.to(device),
                attention_mask=inputs_mask.to(device),
                use_cache=True,
                num_beams=5,
                early_stopping=True,
                max_length=512,
                num_return_sequences=1,
            )
            # decode the predictions, dropping the two leading special tokens
            preds_np = preds.detach().cpu().numpy()
            preds_decoded = [
                tokenizer.decode(row[2:],
                                 skip_special_tokens=True,
                                 clean_up_tokenization_spaces=False)
                for row in preds_np
            ]
            # add the decoded predictions to the result
            result += preds_decoded
    return result
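The attention mask `X.ne(tokenizer.pad_id)` simply marks which positions hold real tokens rather than padding. The same computation in NumPy (pad id 0 assumed for illustration):

```python
import numpy as np

pad_id = 0  # assumed pad id, for illustration only
X = np.array([[5, 8, 2, 0, 0],
              [7, 3, 9, 4, 0]])
# True where a real token sits, False over right-padding
mask = X != pad_id
print(mask.astype(int))
# [[1 1 1 0 0]
#  [1 1 1 1 0]]
```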
HuggingFace pre-trained checkpoint#
The model is available on the HuggingFace model hub: https://huggingface.co/microsoft/codereviewer
# download the pretrained model from huggingface
hf_model = AutoModelForSeq2SeqLM.from_pretrained("microsoft/codereviewer")
for filename, dataset, dataloader in zip(filenames, datasets, dataloaders):
    preds = predict(hf_model, dataloader)
    df_pred = pd.DataFrame({'code': dataset.code, 'target': dataset.y, 'prediction': preds})
    df_pred.to_csv(Path(filename).with_suffix('.hf_pred.csv'))
df_pred.head()
100%|██████████| 636/636 [11:27<00:00, 1.08s/it]
100%|██████████| 63/63 [03:37<00:00, 3.45s/it]
100%|██████████| 63/63 [02:01<00:00, 1.93s/it]
100%|██████████| 63/63 [02:46<00:00, 2.64s/it]
Fine-tuned CodeReviewer#
I fine-tuned the model on the CodeReviewer dataset for the msg task, following the instructions from the authors of [LLG+22].
For the fine-tuning I used the following parameters:
- batch_size=16
- learning_rate=3e-4
- max_source_length=512
The fine-tuning took about 12 hours on a single NVIDIA A100 GPU. The model was fine-tuned for 3 epochs.
I have made the checkpoint available on the HuggingFace model hub: https://huggingface.co/waleko/codereviewer-finetuned-msg
# download the fine-tuned model
ft_model = AutoModelForSeq2SeqLM.from_pretrained("waleko/codereviewer-finetuned-msg")
for filename, dataset, dataloader in zip(filenames, datasets, dataloaders):
    preds = predict(ft_model, dataloader)
    df_pred = pd.DataFrame({'code': dataset.code, 'target': dataset.y, 'prediction': preds})
    df_pred.to_csv(Path(filename).with_suffix('.finetuned_pred.csv'))
df_pred.head()
Some weights of the model checkpoint at waleko/codereviewer-finetuned-msg were not used when initializing T5ForConditionalGeneration: ['cls_head.weight', 'cls_head.bias']
- This IS expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
100%|██████████| 636/636 [15:51<00:00, 1.50s/it]
100%|██████████| 63/63 [01:40<00:00, 1.59s/it]
100%|██████████| 63/63 [01:32<00:00, 1.48s/it]
100%|██████████| 63/63 [01:26<00:00, 1.38s/it]