Removing Noise, not Finding Gold: Quality Filtering for Large-Scale Pretraining
arXiv:2510.00866v3
Abstract: Large-scale models are pretrained on massive web-crawled datasets containing documents of mixed quality, making data filtering essential. A popular method is…