GitHub Diffs #31

@herbiebradley

Description

The dataset is available on BigQuery as a table of commit hashes and commit messages.

Procedure

From each commit hash and message, produce a dict containing:

  • Raw files before changes
  • Commit message
  • Diff file

For each commit, this requires downloading the files as they are after the changes and applying the reverse patch to obtain the files as they were before the changes.
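The reverse-patch step can be sketched in pure Python as below. This is a minimal illustration, not the code from the linked gist: the function name `reverse_apply` is hypothetical, it assumes a unified diff that touches a single file, and it works on lines without their trailing newlines.

```python
import re

# Matches unified-diff hunk headers such as "@@ -3,7 +3,7 @@".
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def _consume(after_lines, pos, expected):
    """Check that the post-commit file has `expected` at index `pos`."""
    if pos >= len(after_lines) or after_lines[pos] != expected:
        raise ValueError(f"diff does not match file at line {pos + 1}")
    return pos + 1


def reverse_apply(after_lines, diff_text):
    """Rebuild the pre-commit lines of one file from its post-commit lines.

    Assumption: `diff_text` is a unified diff for a single file.
    """
    before = []
    pos = 0  # 0-based index into after_lines
    remaining_src = remaining_tgt = 0  # lines still expected in the current hunk
    for line in diff_text.splitlines():
        if remaining_src == 0 and remaining_tgt == 0:
            match = HUNK_HEADER.match(line)
            if match is None:
                continue  # file headers or other text outside a hunk
            remaining_src = 1 if match.group(2) is None else int(match.group(2))
            tgt_start = int(match.group(3))
            remaining_tgt = 1 if match.group(4) is None else int(match.group(4))
            # With a zero-length target range, the start is the line *before* the hunk.
            start = tgt_start - 1 if remaining_tgt > 0 else tgt_start
            before.extend(after_lines[pos:start])  # untouched lines between hunks
            pos = start
            continue
        tag, body = line[:1], line[1:]
        if tag == "\\":
            continue  # "\ No newline at end of file" marker
        if tag == "+":
            # Added by the commit: present after, absent before.
            pos = _consume(after_lines, pos, body)
            remaining_tgt -= 1
        elif tag == "-":
            # Removed by the commit: absent after, restored before.
            before.append(body)
            remaining_src -= 1
        else:
            # Context line: identical in both versions.
            pos = _consume(after_lines, pos, body)
            before.append(body)
            remaining_src -= 1
            remaining_tgt -= 1
    before.extend(after_lines[pos:])  # untouched tail of the file
    return before
```

Raising on a mismatch, rather than silently continuing, makes it easy to drop commits whose downloaded files do not correspond to the stored diff.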

We also need to decide on a suitable length threshold to filter on: most or all of the before file has to fit in the context window, which significantly restricts how many lines a file can have.
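One way to inform the choice of threshold is to measure how much of the data survives at several candidate cut-offs. The sketch below is only an illustration: `survival_rates` is a hypothetical helper, the candidate thresholds are placeholders, and the line counts are made-up values rather than measurements from the dataset.

```python
def survival_rates(line_counts, thresholds):
    """Map each threshold to the fraction of files with at most that many lines."""
    total = len(line_counts)
    if total == 0:
        return {t: 0.0 for t in thresholds}
    return {
        t: sum(1 for n in line_counts if n <= t) / total
        for t in thresholds
    }


# Made-up line counts for illustration only.
sample_line_counts = [12, 40, 85, 150, 900]
rates = survival_rates(sample_line_counts, thresholds=[50, 100, 1000])
for threshold, rate in rates.items():
    print(f"<= {threshold} lines: {rate:.0%} of files kept")
```

The same helper works with token counts instead of line counts if the context window is budgeted in tokens.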

Minimal working example here: https://gist.github.com/herbiebradley/b08d2e13775384fe4b5353e831dac43a

  • Minimal working example
  • Decide on length threshold
  • Parquet output
  • Inherit from dataset.py base classes
  • Parallel processing
  • Bitbucket modifications - see Bitbucket diffs #5
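For the parallel-processing item, a thread pool from the standard library is one plausible starting point, since downloading files is mostly I/O-bound. This is a sketch under assumptions: `process_commit` is a hypothetical stand-in for the real per-commit work (download, reverse patch, diff parsing), and the default worker count is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def process_commit(commit):
    """Hypothetical stand-in for the per-commit download and reverse-patch step."""
    commit_hash, message = commit
    return {"commit_hash": commit_hash, "commit_message": message}


def process_all(commits, worker=process_commit, max_workers=8):
    """Run `worker` over all commits concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, commits))
```

`pool.map` returns results in the same order as the input, so the output rows stay aligned with the BigQuery table rows.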

Example

Give an example of the columns and data:

  • before_file: ['from setuptools import setup, find_packages\n', '\n', 'setup(\n', ... ]
  • commit_message: Change version
  • diff: [{'addition_count': 1, 'deletion_count': 1, 'hunks': [[[3, 7], [3, 7], '', ' setup(', " name = 'denoising-diffusion-pytorch',", ' packages = find_packages(),', "- version = '0.26.1',", "+ version = '0.26.3',", " license='MIT',", " description = 'Denoising Diffusion Probabilistic " "Models - Pytorch',", " author = 'Phil Wang',"]], 'patch_info': <PatchInfo: diff --git a/setup.py b/setup.py>, 'src_file': 'a/setup.py', 'tgt_file': 'b/setup.py'}]

Metadata

  • Labels: dataset-request (Request for addition of new dataset)
  • Status: Done