arXiv cs.CL
Removing Noise, not Finding Gold: Quality Filtering for Large-Scale Pretraining
arXiv:2510.00866v3 (Announce Type: replace-cross)
Abstract: Large-scale models are pretrained on massive web-crawled datasets containing documents of mixed quality, making data filtering essential. A popular method is Classifier-based Quality Filtering (CQF), which trains a binary classifier to distinguish between pretra