Repository Details
Inspired by Google's C4 (Colossal Clean Crawled Corpus), this is a series of data cleaning scripts focused on processing CommonCrawl data. It includes Chinese data processing and cleaning methods from MassiveText.