Data Processing

We should follow a process similar to the BigScience workshop's dataset processing. Their tooling already covers much of what we need: exact-match and near-duplicate deduplication, filtering of low-information-content examples, removal of potentially hateful documents, and removal of PII.

Their tools, along with discussions of them, are available here: https://github.com/bigscience-workshop/data_tooling

Here is an initial set of tasks to perform:
- [ ] Filtering of low quality documents
- [ ] Filtering of documents containing specific removal words
- [ ] Filtering of exact duplicate content
- [ ] Filtering of near duplicate content
- [ ] Removal of PII
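
To make the tasks above concrete, here is a minimal standard-library sketch of how the five steps could be chained. It is not the BigScience tooling: the thresholds, the `REMOVAL_WORDS` list, the regexes, and every function name are hypothetical placeholders chosen for illustration. The near-duplicate check compares word-shingle sets pairwise with Jaccard similarity, which is quadratic in the number of kept documents, so at real scale a MinHash/LSH approach would likely replace it.

```python
import hashlib
import re

# Hypothetical thresholds and word list, chosen only for illustration.
MIN_WORDS = 5               # documents shorter than this are dropped
MIN_UNIQUE_RATIO = 0.3      # unique words / total words below this is "repetitive"
NEAR_DUP_THRESHOLD = 0.8    # Jaccard similarity at or above this is a near duplicate
REMOVAL_WORDS = {"badword"}

# Deliberately simple patterns; real PII detection needs far more care.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def is_low_quality(text):
    """Flag documents that are too short or too repetitive."""
    words = normalize(text).split()
    if len(words) < MIN_WORDS:
        return True
    return len(set(words)) / len(words) < MIN_UNIQUE_RATIO


def has_removal_word(text):
    """True if any word of the document is in the removal list."""
    return bool(set(re.findall(r"\w+", text.lower())) & REMOVAL_WORDS)


def content_hash(text):
    """Hash of the normalized text, used for exact-duplicate detection."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def shingles(text, n=3):
    """Set of overlapping n-word shingles of the normalized text."""
    words = normalize(text).split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def jaccard(a, b):
    """Jaccard similarity of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def scrub_pii(text):
    """Replace e-mail addresses and phone-like numbers with placeholders."""
    return PHONE_RE.sub("<PHONE>", EMAIL_RE.sub("<EMAIL>", text))


def process(docs):
    """Apply quality, word, exact-dup, and near-dup filters, then scrub PII."""
    seen_hashes, kept_shingles, kept = set(), [], []
    for doc in docs:
        if is_low_quality(doc) or has_removal_word(doc):
            continue
        digest = content_hash(doc)
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        doc_shingles = shingles(doc)
        if any(jaccard(doc_shingles, s) >= NEAR_DUP_THRESHOLD for s in kept_shingles):
            continue
        kept_shingles.append(doc_shingles)
        kept.append(scrub_pii(doc))
    return kept
```

Filtering runs before deduplication so that discarded documents never enter the comparison sets, and PII scrubbing runs last so that it only touches documents that are actually kept. The order is a design choice for this sketch, not something the issue prescribes.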


Data Processing #37
