FineWeb: 15 trillion tokens of the highest-quality data the web has to offer.
The dataset consists of more than 15T tokens of cleaned and deduplicated English web data from CommonCrawl.