Criteo is pleased to announce the release of a new dataset, an extended version of our Kaggle click prediction dataset. With over 4 billion lines and more than 1 TB in size, this is the largest public machine learning dataset ever released.
As large-scale problems become more prevalent, we believe it is important to make such a dataset available to the academic community, and we hope it will serve as a useful benchmark for distributed learning algorithms.
The dataset is hosted on Microsoft Azure, so researchers can run MapReduce jobs on it directly on that platform. Details on how to access or download the dataset can be found here.
Drop us a line at firstname.lastname@example.org if you are curious.