Collecting Code Review Data

In this notebook, we will collect code review data from Github. We will use the PyGithub library to interact with the Github API.

from getpass import getpass

from github import Auth, Github
import pandas as pd
from tqdm.autonotebook import tqdm

Although we can use the Github API without authentication, authenticating raises the rate limit substantially. We can authenticate with a Github Access Token; enter it below. If you do not enter a token, the code will still run, but unauthenticated requests are limited to 60 per hour.

token = getpass("Enter your Github Access Token: ")
if token:
    # using token
    g = Github(auth=Auth.Token(token))
else:
    # no token
    # warn: possibly rate limited
    g = Github()
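To see why authentication matters in practice, here is a rough back-of-envelope estimate. The per-hour limits (60 unauthenticated, 5,000 with a token) are GitHub's documented REST API defaults; the assumption of roughly one request per comment is an illustration only, since actual usage depends on pagination and commit lookups.

```python
# Back-of-envelope: crawling ~1000 comments costs at least one API request
# per commit-author lookup, plus paginated comment listing.
# Assume ~1000 requests total (a rough estimate, not an exact count).
requests_needed = 1000

hours_unauthenticated = requests_needed / 60    # 60 requests/hour without a token
hours_authenticated = requests_needed / 5000    # 5000 requests/hour with a token

print(f"without token: ~{hours_unauthenticated:.1f} hours")
print(f"with token:    ~{hours_authenticated:.2f} hours")
```

Under these assumptions, an unauthenticated crawl of a single repository would take the better part of a day, while an authenticated one finishes in minutes.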

Next, we will define a function to collect code review data from a Github repository.

def collect_reviews(repo_name: str, num_comments: int = 1000, skip_author: bool = True, allow_threads: bool = False, save: bool = True, max_length: int = 512):
    """
    Crawl a repo for code review data
    :param repo_name: Repo name in format "owner/repo"
    :param num_comments: Number of comments to load
    :param skip_author: Skip comments made by the author of the pull request
    :param allow_threads: Allow comments that are replies to other comments
    :param save: Save the data to a csv file
    :param max_length: Maximum length of the diff hunk
    :return: Returns a pandas dataframe with columns diff_hunk, human_review, created_at
    """
    data = []
    # diff hunk for counting
    hunks = set()
    # load repo
    repo = g.get_repo(repo_name)
    # load comments
    comment_pages = repo.get_pulls_review_comments()
    # iterate over comments
    progress_bar = tqdm(total=num_comments)
    for comment in comment_pages:
        if len(hunks) >= num_comments:
            # if we have enough comments, stop
            break
        if comment.diff_hunk in hunks:
            # if we already have this diff hunk, skip
            continue
        if len(comment.diff_hunk) > max_length:
            # if the diff hunk is too long, skip
            continue
        # look up the commit's author as a NamedUser so it compares with comment.user
        # (get_git_commit returns a GitAuthor, which never equals a NamedUser)
        commit_author = repo.get_commit(comment.commit_id).author
        if skip_author and comment.user == commit_author:
            # if the comment is made by the author of the pull request, skip
            continue
        # add comment to data, along with diff hunk, created_at and ground truth review
        data.append({'diff_hunk': comment.diff_hunk, 'human_review': comment.body, 'created_at': comment.created_at})
        # add diff hunk to set for counting
        progress_bar.update(1)
        hunks.add(comment.diff_hunk)
    progress_bar.close()
    df = pd.DataFrame(data)
    if not allow_threads:
        # remove comments that are replies to other comments, keeping the first comment
        df = df.loc[df.groupby('diff_hunk').created_at.idxmin()]
    if save:
        df.to_csv(f'../data/{repo_name.replace("/", "_")}_{len(df)}.csv')
    return df
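The thread-filtering step above is compact, so here is a minimal illustration of how it works on toy data (the rows below are made up): `df.groupby('diff_hunk').created_at.idxmin()` returns, for each diff hunk, the row index of the earliest comment, so `.loc` keeps only the comments that started a thread.

```python
import pandas as pd

# Toy data: two comments on hunk_a (a thread) and one on hunk_b.
df = pd.DataFrame({
    'diff_hunk': ['hunk_a', 'hunk_a', 'hunk_b'],
    'human_review': ['thread start on a', 'reply on a', 'only comment on b'],
    'created_at': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-01']),
})

# For each hunk, keep only the row with the earliest created_at.
first_only = df.loc[df.groupby('diff_hunk').created_at.idxmin()]
print(first_only.human_review.tolist())  # ['thread start on a', 'only comment on b']
```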

Finally, we will collect code review data from the following repositories:

I have chosen these repositories because they are popular and have a large number of pull requests with code review comments. The authors of [LLG+22] used similar criteria to select repositories for their study.

The data will be saved to the data folder.

repos = ['microsoft/vscode', 'JetBrains/kotlin', 'transloadit/uppy']
for repo in repos:
    collect_reviews(repo)
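Each call above writes a CSV whose name encodes the owner/repo (with `/` replaced by `_`) and the number of rows collected. As a sketch, assuming the default 1000 comments were gathered, the resulting path would look like this:

```python
# Reconstruct the filename collect_reviews would write for one repo,
# assuming 1000 rows were collected (a hypothetical count for illustration).
repo_name = 'microsoft/vscode'
n_rows = 1000
path = f'../data/{repo_name.replace("/", "_")}_{n_rows}.csv'
print(path)  # ../data/microsoft_vscode_1000.csv
```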

Additionally, we will use the test data from [LLG+22], published as a dataset on Zenodo. This dataset is available locally at data/msg-test.csv.