Data Processing #37

@ncoop57

We should follow a process similar to the BigScience workshop's dataset processing. Their tooling already covers much of what we need: data deduplication (both exact-match and near-duplicate), filtering of low-information-content examples, removal of potentially hateful documents, and removal of PII.

Their tools, along with discussions of them, are available here: https://github.com/bigscience-workshop/data_tooling

Here is an initial set of tasks to perform:

  • Filtering of low quality documents
  • Filtering of documents with specific removal words
  • Filtering of exact duplicate content
  • Filtering of near duplicate content
  • Removal of PII
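
To make the tasks above concrete, here is a minimal sketch of how the five steps could be chained together. It is not the BigScience tooling and does not use its API. Every name, threshold, and the removal word list (`MIN_WORDS`, `MIN_UNIQUE_RATIO`, `REMOVAL_WORDS`, `NEAR_DUP_THRESHOLD`, `process`, and so on) is a hypothetical placeholder. Near-duplicate detection is done here with pairwise Jaccard similarity over word trigrams, which is quadratic in the number of documents; a real pipeline would likely need something like MinHash with LSH. PII removal is reduced to email redaction only.

```python
import hashlib
import re

# Hypothetical thresholds and word list, chosen only for illustration.
MIN_WORDS = 5
MIN_UNIQUE_RATIO = 0.3
REMOVAL_WORDS = {"badword"}
NEAR_DUP_THRESHOLD = 0.8

# Deliberately simple email pattern; real PII removal covers far more.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def is_low_quality(text):
    """Flag very short or highly repetitive documents."""
    words = text.lower().split()
    if len(words) < MIN_WORDS:
        return True
    return len(set(words)) / len(words) < MIN_UNIQUE_RATIO


def has_removal_word(text):
    """True if any token of the document is in the removal word list."""
    words = set(re.findall(r"\w+", text.lower()))
    return not words.isdisjoint(REMOVAL_WORDS)


def content_hash(text):
    """Hash of the case- and whitespace-normalized text, for exact dedup."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def shingles(text, n=3):
    """Set of word n-grams used as the document's near-dedup signature."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def redact_pii(text):
    """Replace email addresses with a placeholder token."""
    return EMAIL_RE.sub("<EMAIL>", text)


def process(docs):
    """Apply quality, word, exact-dup, and near-dup filters, then redact."""
    seen_hashes = set()
    kept_shingles = []
    output = []
    for doc in docs:
        if is_low_quality(doc) or has_removal_word(doc):
            continue
        digest = content_hash(doc)
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        sig = shingles(doc)
        # O(n^2) comparison against everything kept so far (sketch only).
        if any(jaccard(sig, other) >= NEAR_DUP_THRESHOLD for other in kept_shingles):
            continue
        kept_shingles.append(sig)
        output.append(redact_pii(doc))
    return output
```

The order is a design choice in this sketch: the cheap filters run first so that the more expensive near-duplicate comparison only sees documents that survived them, and redaction runs last so that deduplication compares the original text.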
