Bitbucket Code

## Title

Dataset URL - [here](https://bitbucket.org/)

Does the dataset exist in a scraped format?   
URL if Yes - [here](https://drive.google.com/file/d/13QsJRhhpL64m3jhsH4up0CBtxDIalO-A/view?usp=sharing)

## Description
We collected metadata for 1,261,420 Bitbucket repositories that are available to download. Each record includes the following fields: `type`, `full_name`, `links`, `name`, `slug`, `description`, `scm`, `website`, `owner`, `workspace`, `is_private`, `project`, `fork_policy`, `created_on`, `updated_on`, `size`, `language`, `has_issues`, `has_wiki`, `uuid`, `mainbranch`, `override_settings`, `parent`.
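As a rough illustration of how these fields could drive the cloning step, the sketch below picks an HTTPS clone URL out of a single record. It assumes that `links` holds a `clone` list of `{"name", "href"}` entries and that `scm` is `"git"` for Git repositories; both should be checked against the actual Parquet file, where `links` may be stored as a JSON string or a struct. The function name `https_clone_url` and the example record are hypothetical.

```python
from typing import Optional


def https_clone_url(record: dict) -> Optional[str]:
    """Return the HTTPS clone URL for a public Git repo record, else None."""
    # Skip private repos and anything that is not Git (e.g. legacy Mercurial).
    if record.get("is_private") or record.get("scm") != "git":
        return None
    # Assumed layout: links["clone"] is a list of {"name": ..., "href": ...}.
    links = record.get("links") or {}
    for entry in links.get("clone", []):
        if entry.get("name") == "https":
            return entry.get("href")
    return None


# Hypothetical record, shaped like the fields listed in the description.
example = {
    "full_name": "some-workspace/some-repo",
    "scm": "git",
    "is_private": False,
    "links": {
        "clone": [
            {"name": "https", "href": "https://bitbucket.org/some-workspace/some-repo.git"},
            {"name": "ssh", "href": "git@bitbucket.org:some-workspace/some-repo.git"},
        ]
    },
}

print(https_clone_url(example))
```

Returning `None` instead of raising keeps a bulk job simple: records without a usable URL can be counted and skipped.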

## Procedure
- [ ] Attempt to clone each repo using the information in the Parquet file above
- [ ] Filter by license, keeping only repos under one of the following licenses:
```
MIT-0
MIT
MIT-feh
Apache-2.0
BSD-3-Clause
BSD-3-Clause-Clear
BSD-3-Clause-No-Nuclear-License-2014
BSD-2-Clause
CC0-1.0
EPL-1.0
MPL-2.0
Unlicense
ISC
Artistic-2.0
deprecated_LGPL-3.0+
deprecated_LGPL-2.1+
ECL-2.0
SHL-0.51
MPL-2.0-no-copyleft-exception
```
- [ ] Preprocess the code following the same steps as the GitHub [CodeParrot](https://github.com/huggingface/transformers/tree/main/examples/research_projects/codeparrot#preprocessing) pipeline
- [ ] Convert to lm_dataformat
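
The license filtering step could look roughly like the sketch below. It assumes a license identifier has already been detected for each cloned repo (the metadata fields listed in the description do not include one) and stored under a hypothetical `license` key. Matching is exact and case-sensitive against the list above; normalizing detector output is left out.

```python
from typing import Iterable, List, Optional

# Identifiers copied verbatim from the license list in this issue.
ALLOWED_LICENSES = frozenset({
    "MIT-0", "MIT", "MIT-feh", "Apache-2.0",
    "BSD-3-Clause", "BSD-3-Clause-Clear",
    "BSD-3-Clause-No-Nuclear-License-2014", "BSD-2-Clause",
    "CC0-1.0", "EPL-1.0", "MPL-2.0", "Unlicense", "ISC",
    "Artistic-2.0", "deprecated_LGPL-3.0+", "deprecated_LGPL-2.1+",
    "ECL-2.0", "SHL-0.51", "MPL-2.0-no-copyleft-exception",
})


def is_allowed_license(license_id: Optional[str]) -> bool:
    """True if the detected identifier is on the allow list."""
    if not license_id:
        # No detected license: treat the repo as not allowed.
        return False
    return license_id.strip() in ALLOWED_LICENSES


def filter_repos(repos: Iterable[dict]) -> List[dict]:
    """Keep only repos whose hypothetical 'license' key is allowed."""
    return [r for r in repos if is_allowed_license(r.get("license"))]


# Hypothetical repos, for illustration only.
repos = [
    {"full_name": "a/one", "license": "MIT"},
    {"full_name": "b/two", "license": "GPL-3.0"},
    {"full_name": "c/three", "license": None},
]
print([r["full_name"] for r in filter_repos(repos)])
```

Repos with no detected license are dropped here, which is the conservative choice when the goal is a dataset of clearly licensed code.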


## Tests

Include a dummy_dataset.parquet file to test your code against. This dummy_dataset should include the columns for the data and metadata associated with the dataset, which will then be converted into the final format for language model consumption, along with an example row or rows that you can verify your code correctly collects. In addition to this file, include the unit test that evaluates your code against this dummy_dataset.

Give an example of the columns and data:

| col1 | col2 | .... |
| ---- | ---- | ---- |
| row1 | row1 | .... |
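
A unit test against the dummy dataset might be shaped like the sketch below. The column names (`repo_name`, `path`, `license`, `content`) and the `to_document` helper are hypothetical placeholders, not the final schema. The rows are built in memory so the snippet runs with the standard library only; a real test would read them from `dummy_dataset.parquet` instead.

```python
# Hypothetical rows standing in for the contents of dummy_dataset.parquet.
DUMMY_ROWS = [
    {
        "repo_name": "some-workspace/some-repo",
        "path": "src/main.py",
        "license": "MIT",
        "content": "print('hello')\n",
    },
]


def to_document(row: dict) -> dict:
    """Split a row into the text to train on and its metadata."""
    meta = {key: value for key, value in row.items() if key != "content"}
    return {"text": row["content"], "meta": meta}


def test_to_document_splits_text_and_meta():
    doc = to_document(DUMMY_ROWS[0])
    assert doc["text"] == "print('hello')\n"
    assert doc["meta"]["license"] == "MIT"
    assert "content" not in doc["meta"]
```

The test function follows the `test_*` naming convention, so a runner such as pytest can collect it without extra configuration.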

