Open
Description
We should follow a process similar to the BigScience workshop's dataset processing. Their tooling covers much of what we need and is ready to use: data deduplication (both exact-match and near-duplicate), filtering of low-information-content examples, removal of potentially hateful documents, and removal of PII.
The tools, and the discussions around them, are available here: https://github.com/bigscience-workshop/data_tooling
Here is an initial set of tasks to perform:
- Filtering of low quality documents
- Filtering of documents with specific removal words
- Filtering of exact duplicate content
- Filtering of near duplicate content
- Removal of PII
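
To make the task list concrete, here is a minimal sketch of how the five steps could be chained. It is not the BigScience tooling and does not use its API. Every name in it (`clean_corpus`, `REMOVAL_WORDS`, the thresholds, the email-only PII pattern) is a hypothetical placeholder chosen for illustration. Near-duplicate detection is done with word-shingle Jaccard similarity compared pairwise, which is quadratic in the number of documents. A real pipeline would more likely use MinHash with LSH.

```python
import hashlib
import re

# Hypothetical placeholders: a real run would load a curated word list
# and use a much broader set of PII patterns than email addresses.
REMOVAL_WORDS = {"badword"}
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def is_low_quality(text, min_words=5, min_unique_ratio=0.3):
    """Flag very short documents and documents dominated by repeated words."""
    words = normalize(text).split()
    if len(words) < min_words:
        return True
    return len(set(words)) / len(words) < min_unique_ratio


def has_removal_word(text):
    """True if any token of the document is in the removal-word list."""
    return any(w in REMOVAL_WORDS for w in re.findall(r"\w+", text.lower()))


def shingles(text, n=3):
    """Set of overlapping n-word windows used for near-duplicate comparison."""
    words = normalize(text).split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def redact_pii(text):
    """Replace email addresses with a placeholder token."""
    return EMAIL_RE.sub("<EMAIL>", text)


def clean_corpus(docs, near_dup_threshold=0.8):
    """Apply quality, removal-word, exact-dup, near-dup, and PII steps in order."""
    seen_hashes = set()
    kept_shingles = []
    cleaned = []
    for doc in docs:
        if is_low_quality(doc) or has_removal_word(doc):
            continue
        # Exact duplicates: hash of the normalized text.
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        # Near duplicates: pairwise Jaccard against documents already kept.
        sh = shingles(doc)
        if any(jaccard(sh, other) >= near_dup_threshold for other in kept_shingles):
            continue
        seen_hashes.add(digest)
        kept_shingles.append(sh)
        cleaned.append(redact_pii(doc))
    return cleaned
```

The order of the steps is a design choice in this sketch: cheap per-document filters run first so that the more expensive near-duplicate comparison sees fewer documents, and PII redaction runs last so that hashing and similarity are computed on the original text.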