Labels: dataset-request (Request for addition of new dataset)
GitHub Diffs
Description
Dataset is on BigQuery as a table of commit hashes and messages.
Procedure
From each commit hash and message, produce a dict containing:
- Raw files before changes
- Commit message
- Diff file
For each commit, this requires downloading the files as they are after the change and applying the patch in reverse to recover the files as they were before it.
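The reverse-patch step can be sketched as below. This is a minimal illustration, not the code from the gist: it assumes the diff is available as unified-diff text for a single file, and the helper name `reverse_apply` and the toy `setup.py` contents are made up for the example. A real pipeline might instead call `git apply -R` or `patch -R`.

```python
import difflib
import re

# Matches a unified-diff hunk header such as "@@ -3,7 +3,7 @@".
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def reverse_apply(after_lines, diff_lines):
    """Rebuild the pre-commit file from the post-commit file and a unified diff.

    Hypothetical helper: both arguments are lists of lines without trailing
    newlines, and the diff is assumed to describe a single file.
    """
    before = []
    pos = 0  # index of the next unconsumed line in after_lines
    i = 0
    while i < len(diff_lines):
        match = HUNK_HEADER.match(diff_lines[i])
        if match is None:
            i += 1  # skip "diff --git", "---", "+++" and other non-hunk lines
            continue
        old_len = int(match.group(2)) if match.group(2) is not None else 1
        new_start = int(match.group(3))
        new_len = int(match.group(4)) if match.group(4) is not None else 1
        # With a zero-length range the start is the line *preceding* the hunk.
        hunk_start = new_start - 1 if new_len > 0 else new_start
        before.extend(after_lines[pos:hunk_start])  # lines untouched by the diff
        pos = hunk_start
        i += 1
        old_seen = new_seen = 0
        while i < len(diff_lines) and (old_seen < old_len or new_seen < new_len):
            tag, text = diff_lines[i][:1], diff_lines[i][1:]
            if tag == "-":  # removed by the commit, so it belongs to "before"
                before.append(text)
                old_seen += 1
            elif tag == "+":  # added by the commit, so drop it
                if after_lines[pos] != text:
                    raise ValueError(f"diff does not match file at line {pos + 1}")
                pos += 1
                new_seen += 1
            elif tag == "\\":  # "\ No newline at end of file" marker
                pass
            else:  # context line, present in both versions
                before.append(after_lines[pos])
                pos += 1
                old_seen += 1
                new_seen += 1
            i += 1
    before.extend(after_lines[pos:])  # remainder after the last hunk
    return before


# Toy input loosely modelled on a version bump in setup.py.
before_file = ["setup(", "  name = 'pkg',", "  version = '0.26.1',", "  license = 'MIT',", ")"]
after_file = ["setup(", "  name = 'pkg',", "  version = '0.26.3',", "  license = 'MIT',", ")"]
diff = list(difflib.unified_diff(before_file, after_file, "a/setup.py", "b/setup.py", lineterm=""))
recovered = reverse_apply(after_file, diff)
```

The sketch fails loudly when an added line in the diff does not match the downloaded file, which is a cheap way to catch commits whose diff and file contents are out of sync.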
We also need to decide on a suitable length threshold to filter on: most or all of the before file must fit in the context window, which significantly restricts the number of lines a file can have.
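One possible shape for the length filter is sketched below. The threshold is still undecided, so `max_lines` and `max_chars` are placeholder values, and the function names are hypothetical. Character count is used here only as a rough stand-in for token count.

```python
def within_length_budget(before_file, max_lines=200, max_chars=8000):
    """Return True if the before file is small enough to keep.

    before_file is a list of lines, as in the example table. Both limits
    are placeholders and not agreed values.
    """
    n_chars = sum(len(line) for line in before_file)
    return len(before_file) <= max_lines and n_chars <= max_chars


def filter_records(records, **limits):
    """Keep only the records whose before file fits the budget."""
    return [r for r in records if within_length_budget(r["before_file"], **limits)]


# Toy records: one short file and one far too long.
records = [
    {"before_file": ["setup(\n", ")\n"], "commit_message": "Change version"},
    {"before_file": ["x = 1\n"] * 5000, "commit_message": "Huge file"},
]
kept = filter_records(records)
```

Filtering on both lines and characters guards against files with few but extremely long lines, such as minified sources.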
Minimal working example here: https://gist.github.com/herbiebradley/b08d2e13775384fe4b5353e831dac43a
- Minimal working example
- Decide on length threshold
- parquet output
- Inherit from `dataset.py` base classes
- Parallel processing
- Bitbucket modifications - see Bitbucket diffs #5
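For the parallel-processing item, one option is a thread pool, since the per-commit work is mostly network-bound downloading. The sketch below assumes that; `process_commit` is a hypothetical stand-in for the real download and reverse-patch step, and the row keys `commit` and `message` are assumed column names, not confirmed ones.

```python
from concurrent.futures import ThreadPoolExecutor


def process_commit(row):
    """Hypothetical per-commit worker.

    A real version would download the changed files and reverse-apply the
    diff; here it only normalises the row so the sketch stays runnable.
    """
    return {
        "commit_hash": row["commit"],
        "commit_message": row["message"].strip(),
    }


def process_all(rows, max_workers=8):
    """Process rows concurrently; pool.map keeps results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_commit, rows))


rows = [
    {"commit": "abc123", "message": "Change version\n"},
    {"commit": "def456", "message": "Fix typo\n"},
]
results = process_all(rows)
```

If the reverse-patch step turns out to be CPU-bound, `ProcessPoolExecutor` from the same module offers the same `map` interface.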
Example
Give an example of the columns and data:
| before_file | commit_message | diff |
|---|---|---|
| ['from setuptools import setup, find_packages\n', '\n', 'setup(\n', ... ] | Change version | [{'addition_count': 1, 'deletion_count': 1, 'hunks': [[[3, 7], [3, 7], '', ' setup(', " name = 'denoising-diffusion-pytorch',", ' packages = find_packages(),', "- version = '0.26.1',", "+ version = '0.26.3',", " license='MIT',", " description = 'Denoising Diffusion Probabilistic " "Models - Pytorch',", " author = 'Phil Wang',"]], 'patch_info': <PatchInfo: diff --git a/setup.py b/setup.py>, 'src_file': 'a/setup.py', 'tgt_file': 'b/setup.py'}] |
ncoop57
Status: Done