Predictions Evaluation#

from pathlib import Path
import numpy as np
import pandas as pd

from utils.smooth_bleu import bleu_fromstr


def analyze_preds(base_file, sample_size=5):
    # read files
    hf_preds_file = Path(base_file).with_suffix('.hf_pred.csv')
    fine_tuned_file = Path(base_file).with_suffix('.finetuned_pred.csv')
    hf_preds = pd.read_csv(hf_preds_file)
    fine_tuned = pd.read_csv(fine_tuned_file)
    # put in df
    df = pd.DataFrame({'code': fine_tuned['code'],
                       'hf_pred': hf_preds['prediction'],
                       'fine_tuned_pred': fine_tuned['prediction'],
                       'target': fine_tuned['target']})
    # replace() is not in-place by default; assign the result back so NaNs become ''
    df = df.replace(np.nan, '', regex=True)
    # print sample with predictions
    sample = df.sample(sample_size, random_state=42)
    for code, hf_pred, fine_tuned_pred, target in sample.to_numpy():
        print('-------------------')
        print(code)
        print(f'## Human Review: {target}')
        print(f'## HF Pred: {hf_pred}')
        print(f'## Fine Tuned Pred: {fine_tuned_pred}')
    return df
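Note that `with_suffix` handles both naming styles used below: it appends the new suffix when the base file has no extension (`msg-test`) and swaps out `.csv` when it has one. A quick illustrative check:

```python
from pathlib import Path

# base file without an extension: suffix is appended
print(Path('../data/msg-test').with_suffix('.hf_pred.csv'))
# ../data/msg-test.hf_pred.csv

# base file ending in .csv: the .csv suffix is replaced
print(Path('../data/microsoft_vscode_1000.csv').with_suffix('.hf_pred.csv'))
# ../data/microsoft_vscode_1000.hf_pred.csv
```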

def calc_bleu(df):
    refs = list(df['target'])
    preds = list(df['prediction'])
    # pad punctuation with spaces so tokens like `foo.bar()` are split,
    # then collapse runs of whitespace into single spaces
    for i in range(len(preds)):
        for c in "(_)`.":
            preds[i] = preds[i].replace(c, " " + c + " ")
            refs[i] = refs[i].replace(c, " " + c + " ")
        preds[i] = " ".join(preds[i].split())
        refs[i] = " ".join(refs[i].split())
    return bleu_fromstr(preds, refs, rmstop=False)
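To see what this padding does to a single string, here is a standalone sketch of the same transformation (`pad_punct` is a helper name introduced here purely for illustration):

```python
# Illustrative sketch of the punctuation padding performed in calc_bleu:
# each of the characters (_)`. is surrounded by spaces, then whitespace
# runs are collapsed, so code tokens are split apart before BLEU scoring.
def pad_punct(text, chars="(_)`."):
    for c in chars:
        text = text.replace(c, " " + c + " ")
    return " ".join(text.split())

print(pad_punct("use `hocon.load()` here"))
# use ` hocon . load ( ) ` here
```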

def calc_bleu_score(base_file):
    hf_preds_file = Path(base_file).with_suffix('.hf_pred.csv')
    fine_tuned_file = Path(base_file).with_suffix('.finetuned_pred.csv')
    hf_preds = pd.read_csv(hf_preds_file)
    ft_preds = pd.read_csv(fine_tuned_file)
    hf_preds.replace(np.nan, '', regex=True, inplace=True)
    ft_preds.replace(np.nan, '', regex=True, inplace=True)
    hf_bleu = calc_bleu(hf_preds)
    ft_bleu = calc_bleu(ft_preds)
    print(f'HF BLEU: {hf_bleu}')
    print(f'Fine Tuned BLEU: {ft_bleu}')
    return hf_bleu, ft_bleu

Qualitative Evaluation#

We now compare the predictions of the HF model and the fine-tuned model on samples from the four datasets.

For each sample, we print the code diff, the human review comment, and the predictions of both models.

df = {}
df['msg'] = analyze_preds('../data/msg-test')
-------------------
@@ -1,6 +1,7 @@
 # frozen_string_literal: true

 require 'hocon'
+require 'bolt/error'

 class TransportConfig
   attr_accessor :host, :port, :ssl_cert, :ssl_key, :ssl_ca_cert, :ssl_cipher_suites,
## Human Review: Looks like this isn't used in this file?
## HF Pred: Why do we need this?
## Fine Tuned Pred: I am surprised you don't use `hocon` here.
-------------------
@@ -92,7 +92,7 @@ public final class OAuth2AuthorizedClientArgumentResolver implements HandlerMeth
 							clientRegistrationId.flatMap(id -> Mono.error(new IllegalStateException(
 									"Unable to resolve the Authorized Client with registration identifier \""
 											+ id
-											+ "\". An \"authenticated\" or \"unauthenticated\" session is required. To allow for unauthenticated access, ensure ServerHttpSecurity.anonymous() is configured."))))
+											+ "\". An \"authenticated\" or \"anonymous\" request is required. To allow for anonymous access, ensure ServerHttpSecurity.anonymous() is configured."))))
 					.flatMap(zipped -> {
 						String registrationId = zipped.getT1();
 						String username = zipped.getT2();
## Human Review: I don't think this should be changed since on the reactive side we don't support anonymous users.
## HF Pred: I don't think we should change this message. It's not clear to me why this is an error.
## Fine Tuned Pred: I think this message should be updated. `An "anonymous" request is required. To allow for anonymous access, ServerHttpSecurity.anonymous() is configured.`
-------------------
@@ -3,7 +3,7 @@ const test = require('./shared').assert,
   setupDatabase = require('./shared').setupDatabase,
   Script = require('vm'),
   expect = require('chai').expect,
-  normalizedFunctionString = require('bson/lib/bson/parser/utils').normalizedFunctionString,
+  normalizedFunctionString = require('bson/lib/parser/utils').normalizedFunctionString,
   Buffer = require('safe-buffer').Buffer;

 const {
## Human Review: feel free when editing sections like this to introduce modern features like object destructuring.
## HF Pred: Why this change?
## Fine Tuned Pred: whats with the whitespace changes in this file?
-------------------
@@ -0,0 +1,17 @@
+<?php
+/**
+ * Copyright © Bold Brand Commerce Sp. z o.o. All rights reserved.
+ * See LICENSE.txt for license details.
+ */
+
+declare(strict_types=1);
+
+namespace Ergonode\Core\Infrastructure\Exception;
+
+class SerializationException extends SerializerException
+{
+    public function __construct(string $message, \Throwable $previous = null)
+    {
+        parent::__construct($message, $previous);
+    }
+}
## Human Review: I think it should been in `SharedKernel` module.
## HF Pred: Missing license header.
## Fine Tuned Pred: Why do we need a separate exception class here? Seems very specific to Elgg\Core\Infrastructure\Exception
-------------------
@@ -473,6 +473,7 @@ describe('GridFS Stream', function () {
                     // Fail if user tries to abort an aborted stream
                     uploadStream.abort().then(null, function (error) {
                       expect(error.toString()).to.equal(
+                        // TODO(NODE-3405): Replace with MongoStreamClosedError
                         'MongoDriverError: Cannot call abort() on a stream twice'
                       );
                       client.close(done);
## Human Review: You can remove these if they've been resolved in NODE-3405 and this isn't depending on it
## HF Pred: Is this TODO still relevant?
## Fine Tuned Pred: What is the problem here?
df['vscode'] = analyze_preds('../data/microsoft_vscode_1000.csv')
-------------------
@@ -340,6 +341,17 @@ export class DebugEditorContribution implements IDebugEditorContribution {
 		return new RunOnceScheduler(() => this.hoverWidget.hide(), HOVER_DELAY);
 	}
 
+	@memoize
+	private get provideNonDebugHoverScheduler(): RunOnceScheduler {
+		return new RunOnceScheduler(() => {
## Human Review: Where is this disposed?
## HF Pred: Why is this memoized?
## Fine Tuned Pred: This is not `Runnable` but `RunOnceScheduler`, right?
-------------------
@@ -53,6 +53,11 @@ export class SassParser extends cssParser.Parser {
 
 	// Sass variables: $font-size: 12px;
 	public _parseVariableDeclaration(panic:scanner.TokenType[]=[]): nodes.VariableDeclaration {
+		var cssVariableDeclaration= super._parseCssVariableDeclaration(panic);
## Human Review: That looks wrong. Not all places where you can declare a sass variable are also suited to declare a css variable.

## HF Pred: Nit: space after `=`
## Fine Tuned Pred: Minor style issue: missing space before the `=`
-------------------
@@ -537,6 +537,23 @@ export interface CompletionList {
 	 */
 	items: CompletionItem[];
 }
+
+/**
+ * Contains additional information about the context in which
+ * [completion provider](#CompletionItemProvider.provideCompletionItems) is triggered.
+ */
+export interface CompletionContext {
## Human Review: This also needs the following information  *manual* invocation, *24x7* IntelliSense, completing incomplete result set 
## HF Pred: I don't think we need this interface. We can just use `CompletionItemProvider`.
## Fine Tuned Pred: Is there a reason this is not an object?
-------------------
@@ -439,6 +441,16 @@ export class DebugService implements debug.IDebugService {
 		});
 	}
 
+	private debouncedDisplayThreads(session: RawDebugSession) {
+		const timer = this.displayThreadsTimer.get(session.getId());
+		if (timer) {
+			clearTimeout(timer);
+		}
+		this.displayThreadsTimer.set(session.getId(), setTimeout(() => {
+			this.fetchThreads(session).done(undefined, errors.onUnexpectedError);
## Human Review: This will make 
## HF Pred: I'm not sure if this is the best way to do this. I think it would be better to do this in the `fetchThreads` function.
## Fine Tuned Pred: Is there a reason we can't just use `clearTimeout` here? Is there any difference in the code between `clearTimeout` and `set`?
-------------------
@@ -265,4 +266,33 @@ class MDDocumentContentProvider implements TextDocumentContentProvider {
 			}, 300);
 		}
 	}
+}
+
+class DocumentHeadingsProvider implements vscode.DocumentSymbolProvider {
+
+	// http://daringfireball.net/projects/markdown/syntax#header
+	private static _atxPattern = /^(#){1,6}\s+.+(\s+\1)?/;
+	private static _settext = /^\s*[-=]+\s*$/;
## Human Review: Make sure to also test that this does not pick up separators

## HF Pred: nit: `_atxPattern` -> `_atxRegex`
## Fine Tuned Pred: Remove this and use directly the `SymbolProvider` below.
df['kotlin'] = analyze_preds('../data/JetBrains_kotlin_1000.csv')
-------------------
@@ -24,10 +24,13 @@ abstract class AbstractIrLineNumberTest : AbstractLineNumberTest() {
 
     override fun compareCustom(psiFile: KtFile, wholeFile: File) {
         val fileText = psiFile.text
-        val expectedLineNumbers = normalize(
-            fileText.substring(fileText.indexOf("//") + 2)
-                .trim().split(" ").map { it.trim() }.toMutableList()
-        )
+        val expectedLineNumbers = fileText.split("\n".toRegex()).filter { line ->
## Human Review: How about simplifying this to:

        val expectedLineNumbers = normalize(
            fileText.substring(Regex("// \\d+").find(fileText)!!.range.start + 2)
                .trim().split(" ").map { it.trim() }.toMutableList()
        )

Then we are looking for a line that starts with "// " with a number of digits after it. That should work to exclude the comment lines with text. That looks consistent with what the old backend expects from these tests.
## HF Pred: Why did you change this?
## Fine Tuned Pred: Remove the `toRegex()` call here.
-------------------
@@ -45,7 +45,6 @@ import org.jetbrains.kotlin.types.KotlinType
 import org.jetbrains.kotlin.types.typeUtil.isPrimitiveNumberType
 import org.jetbrains.kotlin.types.typeUtil.makeNotNullable
 import org.jetbrains.kotlin.types.typeUtil.makeNullable
-import java.lang.AssertionError
## Human Review: Pls remove this file from pr
## HF Pred: import org.jetbrains.kotlin.util.KotlinConstants;
## Fine Tuned Pred: Why did you remove this import?
-------------------
@@ -46,6 +46,7 @@ abstract class AbstractCheckLocalVariablesTableTest : CodegenTestCase() {
             assertNotNull("Couldn't find class file for pattern $classFileRegex in: $pathsString", outputFile)
 
             val actualLocalVariables = readLocalVariable(ClassReader(outputFile.asByteArray()), methodName)
+            checkLocalVariableTypes(ClassReader(outputFile.asByteArray()), methodName, actualLocalVariables)
## Human Review: Could we reuse class reader from previous line?
## HF Pred: assertEquals(expectedLocalVariables, actualLocalVariables)
## Fine Tuned Pred: What's the purpose of this line?
-------------------
@@ -37,3 +37,17 @@ public fun Char.equals(other: Char, ignoreCase: Boolean = false): Boolean {
  * Returns `true` if this character is a Unicode surrogate code unit.
  */
 public fun Char.isSurrogate(): Boolean = this in Char.MIN_SURROGATE..Char.MAX_SURROGATE
+
+/**
+ * Minimum value for character
+ * @see <a href="https://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html">
## Human Review: I don't think this way of linking is supported in `@see` tag.  I'm not sure what is the purpose of these links anyway.
## HF Pred: This should be `Char.isSurrogate()`.
## Fine Tuned Pred: Please remove the `<a>` tag.
-------------------
@@ -175,7 +195,9 @@ class ExpressionCodegen(
 
     override fun visitBlockBody(body: IrBlockBody, data: BlockInfo): StackValue {
         return body.statements.fold(none()) { _, exp ->
-            exp.accept(this, data)
+            val result = exp.accept(this, data)
+            (exp as? IrExpression)?.markEndOfStatementIfNeeded()
## Human Review: ```
exp.accept(this, data).also {
       (exp as? IrExpression)?.markEndOfStatementIfNeeded()
}
```
## HF Pred: Why is this needed?
## Fine Tuned Pred: Unrelated to this PR, but this doesn't seem like the best place to markEndOfStatementIfNeeded. Should the markEndOfStatementIfNeeded be in some other place, or should the markEndOfStatementIfNeeded be in BlockInfoResolver?
df['uppy'] = analyze_preds('../data/transloadit_uppy_1000.csv')
-------------------
@@ -28,6 +28,7 @@
     "@uppy/thumbnail-generator": "0.29.1",
     "@uppy/utils": "0.29.1",
     "classnames": "^2.2.6",
+    "cuid": "^2.1.1",
     "drag-drop": "2.13.3",
## Human Review: In other places we are using uuid, I think, is there a reason to use different packages? I was actually gonna suggest we switch to https://github.com/ai/nanoid everywhere, it’s smaller and ids are smaller.
## HF Pred: This should be `^2.1.0`, right?
## Fine Tuned Pred: Why are we using a fixed version here?
-------------------
@@ -49,7 +49,8 @@ class Tabs extends Component {
             tabindex="-1"
             type="file"
             name="files[]"
-            multiple="true"
+            multiple={this.props.maxNumberOfFiles !== 1 || !this.props.maxNumberOfFiles}
## Human Review: same here
## HF Pred: `multiple={this.props.maxNumberOfFiles !== 1}`
## Fine Tuned Pred: shouldnt it be only true if `this.props.maxNumberOfFiles` is defined ?
-------------------
@@ -496,6 +494,25 @@ module.exports = class ProviderView {
     }
   }
 
+  listAllFiles (path, files = null) {
+    files = files || []
+    return new Promise((resolve) => {
+      this.provider.list(path).then((res) => {
## Human Review: should we add a `catch` for this call. Otherwise what will happen when the call to `this.provider.list` fails?
## HF Pred: `files = files || []`
## Fine Tuned Pred: `resolve` is unnecessary here, you can just call `this.provider.list(path, files)`
-------------------
@@ -422,7 +422,7 @@ module.exports = class ProviderView {
   }
 
   handleAuth () {
-    const authState = btoa(JSON.stringify({ origin: location.origin }))
+    const authState = btoa(JSON.stringify({ origin: location.origin, redirect: 'http://localhost:3' }))
     const link = `${this.provider.authUrl()}?state=${authState}`
## Human Review: `http://localhost:3` 🤔 
## HF Pred: I think this should be `https://localhost:3`
## Fine Tuned Pred: I think it would be better if we could use `location.protocol` here instead of hardcoding `localhost:3`
-------------------
@@ -2,6 +2,14 @@
 
 echo "Preparing for end to end test: copying static HTML and CSS, building JS"
 rm -rf ./test/endtoend/dist && mkdir ./test/endtoend/dist
-cp ./dist/uppy.min.css ./test/endtoend/dist 
-cp ./test/endtoend/src/index.html ./test/endtoend/dist 
+rm -rf ./test/endtoend/node_modules
+
+UPPY_VERSION=$(cat package.json | grep version | head -1 | awk -F= "{ print $2 }" | sed 's/[version:,\",]//g' | tr -d '[[:space:]]')
+# archive the uppy package
+npm pack
## Human Review: I think we should also run the `npm run prepublishOnly` script first here, so that the package is exactly the same as the published one (I thought this would happen automatically inside `npm pack` but it doesn't seem to)
## HF Pred: Why do we need this?
## Fine Tuned Pred: Do we want to keep the old version or is there a way to upgrade to a newer version?

As we can see, both models are fairly mediocre at understanding the actual issue. Still, the HF model tends to produce generic comments, while the fine-tuned model's predictions are more specific to the code and show a better grasp of its context.

Both models usually predict something sensible, but they struggle to pinpoint the problem the human reviewer raised.

Quantitative Evaluation#

For each dataset, we calculate the BLEU-4 score for the predictions of the HF model and the fine-tuned model. BLEU measures the n-gram overlap between a prediction and its reference (here, the human review comment); the higher the score, the closer the predictions are to the targets.
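To make the metric concrete, here is a minimal sketch of sentence-level BLEU-4 with add-one smoothing. This is not the implementation used here (`bleu_fromstr` comes from the CodeReviewer utilities and applies its own corpus-level smoothing), so the numbers would differ slightly, but the idea is the same: a geometric mean of 1- to 4-gram precisions times a brevity penalty.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # multiset of all n-grams in the token list
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(pred, ref):
    """Sentence-level BLEU-4 sketch with add-one smoothing (illustration only)."""
    pred_t, ref_t = pred.split(), ref.split()
    log_prec = 0.0
    for n in range(1, 5):
        p, r = ngrams(pred_t, n), ngrams(ref_t, n)
        overlap = sum((p & r).values())  # clipped n-gram matches
        log_prec += math.log((overlap + 1) / (sum(p.values()) + 1)) / 4
    # brevity penalty: punish predictions shorter than the reference
    bp = min(1.0, math.exp(1 - len(ref_t) / max(len(pred_t), 1)))
    return bp * math.exp(log_prec)

print(bleu4("why do we need this", "why do we need this"))
# 1.0
```

An identical prediction scores 1.0, while an unrelated one scores close to 0; the scores reported below are scaled to 0–100.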

calc_bleu_score('../data/msg-test')
Total: 10169
Total: 10169
HF BLEU: 5.14
Fine Tuned BLEU: 5.34
(5.14, 5.34)
calc_bleu_score('../data/microsoft_vscode_1000.csv')
Total: 1000
Total: 1000
HF BLEU: 3.39
Fine Tuned BLEU: 4.02
(3.39, 4.02)
calc_bleu_score('../data/JetBrains_kotlin_1000.csv')
Total: 1000
HF BLEU: 3.39
Fine Tuned BLEU: 4.32
Total: 1000
(3.39, 4.32)
calc_bleu_score('../data/transloadit_uppy_1000.csv')
Total: 1000
HF BLEU: 4.55
Fine Tuned BLEU: 4.84
Total: 1000
(4.55, 4.84)

As we can see, the fine-tuned model scores slightly higher than the HF model on all four datasets.

Nevertheless, the absolute scores remain quite low (as the authors of [LLG+22] put it: “it is a hard task”).