This issue collects ideas for, and aims to formalize, the postprocessing steps and the formatting of data instances for datasets in different categories, such as forums, articles, and books.
Initial draft of postprocessing:
- Exact duplicate removal
- Near-duplicate removal
- Removal of specific HTML tags
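
To make the three draft steps concrete for discussion, here is a minimal Python sketch using only the standard library. Everything in it is an assumption rather than an agreed design: the function names (`remove_exact_duplicates`, `remove_near_duplicates`, `strip_html`) are hypothetical, exact matching is done on whitespace-normalized, lowercased text, near-duplicates are detected with Jaccard similarity over word 3-grams at an arbitrary threshold, and the tags dropped together with their content default to `script` and `style`. The pairwise near-duplicate comparison is quadratic, so a real pipeline would likely need something like MinHash/LSH instead.

```python
import hashlib
import re
from html.parser import HTMLParser


def normalize(text):
    """Lowercase and collapse whitespace so trivial variations compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def remove_exact_duplicates(docs):
    """Keep the first occurrence of each document (compared after normalization)."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def shingles(text, n=3):
    """Return the set of word n-grams of a document."""
    words = normalize(text).split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def remove_near_duplicates(docs, threshold=0.8, n=3):
    """Drop a document if it is too similar to one already kept.

    Assumption: threshold and n are placeholders to be tuned per dataset.
    """
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc, n)
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


class TagStripper(HTMLParser):
    """Collect text content, skipping everything inside the dropped tags."""

    def __init__(self, drop_with_content=("script", "style")):
        super().__init__(convert_charrefs=True)
        self.drop = set(drop_with_content)
        self.depth = 0  # nesting depth inside dropped tags
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.drop:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.drop and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)


def strip_html(text, drop_with_content=("script", "style")):
    """Remove all tags; also remove the content of the listed tags."""
    parser = TagStripper(drop_with_content)
    parser.feed(text)
    parser.close()
    return "".join(parser.parts)
```

A possible order of application is `strip_html` first, then `remove_exact_duplicates`, then `remove_near_duplicates`, so that documents differing only in markup are also caught by deduplication.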
Questions for formatting:
- How to format forums?
- How to format general website articles?
- How to format books?