Collecting Code Review Data
In this notebook, we will collect code review data from GitHub, using the PyGithub library to interact with the GitHub API.
from getpass import getpass
from github import Auth, Github
import pandas as pd
from tqdm.autonotebook import tqdm
Although we can use the GitHub API without authentication, authenticating raises the rate limit. We can authenticate with a GitHub access token, which you can enter below. If you do not enter a token, the code will run unauthenticated, but you will be rate limited to 60 requests per hour.
token = getpass("Enter your Github Access Token: ")
if token:
    # authenticate with the token
    g = Github(auth=Auth.Token(token))
else:
    # anonymous client: heavily rate limited
    g = Github()
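To see why the token matters for this notebook in particular, note that the crawler below issues roughly one extra request per comment (to fetch the commit author) on top of one request per page of comments. A back-of-the-envelope sketch, assuming GitHub's documented limits of 60 requests/hour unauthenticated versus 5,000 requests/hour with a token, and the default page size of 30; the helper name and the request accounting are illustrative, not part of the crawler itself:

```python
PER_PAGE = 30  # GitHub's default page size


def max_comments_per_hour(rate_limit: int, per_page: int = PER_PAGE) -> int:
    """Estimate how many comments fit in one hour's request budget,
    assuming ~1 request per page of comments plus ~1 request per
    comment for the commit-author lookup."""
    # pages * 1 + pages * per_page <= rate_limit
    pages = rate_limit // (per_page + 1)
    return pages * per_page


print(max_comments_per_hour(60))    # anonymous budget
print(max_comments_per_hour(5000))  # authenticated budget
```

Under these assumptions, an anonymous client can process only around 30 comments per hour, while an authenticated one can handle several thousand, which is why a token is effectively required for the collection runs below.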
Next, we will define a function to collect code review data from a Github repository.
def collect_reviews(repo_name: str, num_comments: int = 1000, skip_author: bool = True, allow_threads: bool = False, save: bool = True, max_length: int = 512):
    """
    Crawl a repo for code review data
    :param repo_name: Repo name in format "owner/repo"
    :param num_comments: Number of comments to load
    :param skip_author: Skip comments made by the author of the commit under review
    :param allow_threads: Allow comments that are replies to other comments
    :param save: Save the data to a csv file
    :param max_length: Maximum length of the diff hunk
    :return: Returns a pandas dataframe with columns diff_hunk, human_review, created_at
    """
    data = []
    # set of diff hunks seen so far, used for counting and deduplication
    hunks = set()
    # load repo
    repo = g.get_repo(repo_name)
    # load review comments (paginated)
    comment_pages = repo.get_pulls_review_comments()
    # iterate over comments
    progress_bar = tqdm(total=num_comments)
    for comment in comment_pages:
        if len(hunks) >= num_comments:
            # if we have enough comments, stop
            break
        if comment.diff_hunk in hunks:
            # if we already have this diff hunk, skip
            continue
        if len(comment.diff_hunk) > max_length:
            # if the diff hunk is too long, skip
            continue
        # get commit author
        commit_author = repo.get_git_commit(comment.commit_id).author
        if skip_author and comment.user == commit_author:
            # if the comment was made by the author of the commit, skip
            continue
        # add the diff hunk, the ground-truth review and the timestamp
        data.append({'diff_hunk': comment.diff_hunk, 'human_review': comment.body, 'created_at': comment.created_at})
        # record the diff hunk so duplicates are skipped
        progress_bar.update(1)
        hunks.add(comment.diff_hunk)
    df = pd.DataFrame(data)
    if not allow_threads:
        # remove replies to other comments, keeping the earliest comment per diff hunk
        df = df.loc[df.groupby('diff_hunk').created_at.idxmin()]
    if save:
        df.to_csv(f'../data/{repo_name.replace("/", "_")}_{len(df)}.csv')
    return df
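The `allow_threads=False` branch deserves a closer look: `groupby('diff_hunk').created_at.idxmin()` finds, for each diff hunk, the row index of the earliest comment, and `df.loc[...]` keeps only those rows, discarding later replies in the same thread. A minimal sketch with toy data (the diff hunks and comment texts are made up for illustration):

```python
import pandas as pd

# Toy data: two comments on the same diff hunk (a thread) and one on another.
df = pd.DataFrame({
    'diff_hunk': ['@@ -1,2 +1,2 @@', '@@ -1,2 +1,2 @@', '@@ -5,1 +5,2 @@'],
    'human_review': ['first comment', 'reply in thread', 'other comment'],
    'created_at': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03']),
})

# Keep only the earliest comment per diff hunk, as in collect_reviews
deduped = df.loc[df.groupby('diff_hunk').created_at.idxmin()]
print(deduped)
```

The reply is dropped, leaving one review per diff hunk, which is what we want for a dataset that pairs each hunk with a single human review.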
Finally, we will collect code review data from the following repositories:
I have chosen these repositories because they are popular, and they have a large number of pull requests with code review comments. The authors of [LLG+22] have also used similar criteria to select repositories for their study.
The data will be saved to the data folder.
repos = ['microsoft/vscode', 'JetBrains/kotlin', 'transloadit/uppy']
for repo in repos:
    collect_reviews(repo)
Additionally, we will be using the test data from [LLG+22] and their dataset on Zenodo. This dataset is available at data/msg-test.csv.