Download Criteo 1TB Click Logs dataset

This dataset contains feature values and click feedback for millions of display
ads. Its purpose is to benchmark algorithms for clickthrough rate (CTR) prediction.
It is similar to, but larger than, the dataset released for the Display Advertising
Challenge hosted by Kaggle: https://www.kaggle.com/c/criteo-display-ad-challenge


Full description:

This dataset contains 24 files, each one corresponding to one day of data.


Dataset construction:

The training dataset consists of a portion of Criteo’s traffic over a period
of 24 days. Each row corresponds to a display ad served by Criteo and the first
column indicates whether this ad has been clicked or not.
The positive (clicked) and negative (non-clicked) examples have both been
subsampled (but at different rates) in order to reduce the dataset size.
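Because the two classes are subsampled at different rates, the click rate observed in the files is not the click rate of the original traffic. The sketch below shows how a probability estimated on subsampled data could be mapped back to the original scale. The keep rates keep_pos and keep_neg are hypothetical inputs, since the actual rates are not given here, and the correction assumes that rows were dropped independently of their features within each class.

```python
def correct_click_probability(p_sampled: float, keep_pos: float, keep_neg: float) -> float:
    """Map a click probability estimated on subsampled data to the unsampled scale.

    keep_pos / keep_neg: fractions of clicked / non-clicked rows that were kept.
    Both are hypothetical parameters; the real rates are not published here.
    """
    if not 0.0 <= p_sampled <= 1.0:
        raise ValueError("p_sampled must lie in [0, 1]")
    if not (0.0 < keep_pos <= 1.0 and 0.0 < keep_neg <= 1.0):
        raise ValueError("keep rates must lie in (0, 1]")
    if p_sampled == 1.0:
        return 1.0
    # Subsampling multiplies the odds of a click by keep_pos / keep_neg,
    # so multiplying by the inverse ratio recovers the original odds.
    odds = p_sampled / (1.0 - p_sampled) * (keep_neg / keep_pos)
    return odds / (1.0 + odds)
```

With equal keep rates the function returns its input unchanged, which is a quick sanity check on the formula.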

There are 13 features taking integer values (mostly count features) and 26
categorical features. The values of the categorical features have been hashed
onto 32 bits for anonymization purposes.
The semantics of these features are undisclosed. Some features may have missing values.

The rows are chronologically ordered.


Format:

The columns are tab separated with the following schema:
<label> <integer feature 1> … <integer feature 13> <categorical feature 1> … <categorical feature 26>

When a value is missing, the field is just empty.
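As an illustration of the schema above, here is a minimal Python sketch that parses one line and streams a downloaded day file. It assumes that the label is written as an integer (0 or 1), that lines end with a newline character, and that hashed categorical values can be kept as raw strings. The names Row, parse_row and iter_rows are made up for this example.

```python
import gzip
from typing import Iterator, List, NamedTuple, Optional

NUM_INT = 13  # integer features, per the schema
NUM_CAT = 26  # categorical features, per the schema


class Row(NamedTuple):
    label: int
    ints: List[Optional[int]]  # None marks a missing value
    cats: List[Optional[str]]  # hashed values kept as raw strings


def parse_row(line: str) -> Row:
    """Split one tab-separated line into label, integer and categorical features."""
    # Strip only the newline: trailing tabs carry empty (missing) fields.
    fields = line.rstrip("\n").split("\t")
    expected = 1 + NUM_INT + NUM_CAT
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")
    label = int(fields[0])  # assumption: label is encoded as 0 or 1
    ints = [int(v) if v else None for v in fields[1:1 + NUM_INT]]
    cats = [v if v else None for v in fields[1 + NUM_INT:]]
    return Row(label, ints, cats)


def iter_rows(path: str) -> Iterator[Row]:
    """Stream rows from a gzip-compressed day file without loading it into memory."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            yield parse_row(line)
```

For instance, sum(row.label for row in iter_rows("day_0.gz")) would count the clicked rows of the first day, assuming that file sits in the working directory.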

Differences from the Kaggle challenge dataset:

– The dataset is not over the same time period;
– The subsampling ratios are different;
– The ordering of the features is not the same and the computation of some of them has changed;
– The hash function for categorical features is different.


Download:

The dataset is hosted on Azure ML.
Each day can be downloaded at http://azuremlsampleexperiments.blob.core.windows.net/criteo/day_XX.gz, where XX goes from 0 to 23.

The following command downloads all days:
curl -O http://azuremlsampleexperiments.blob.core.windows.net/criteo/day_{`seq -s ',' 0 23`}.gz
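For users who prefer to script the download, the same URL pattern can be generated in Python. This is only a sketch: the helper names day_urls and download_days are invented for this example, and it assumes the files are still served at the address given above.

```python
import os
import urllib.request
from typing import List

BASE_URL = "http://azuremlsampleexperiments.blob.core.windows.net/criteo"


def day_urls(first: int = 0, last: int = 23) -> List[str]:
    """Return the URLs of day_<first>.gz through day_<last>.gz."""
    if not 0 <= first <= last <= 23:
        raise ValueError("days must satisfy 0 <= first <= last <= 23")
    return [f"{BASE_URL}/day_{day}.gz" for day in range(first, last + 1)]


def download_days(dest_dir: str = ".", first: int = 0, last: int = 23) -> None:
    """Download the selected day files into dest_dir, one after another."""
    os.makedirs(dest_dir, exist_ok=True)
    for url in day_urls(first, last):
        # Keep the remote file name, e.g. day_7.gz
        target = os.path.join(dest_dir, url.rsplit("/", 1)[-1])
        urllib.request.urlretrieve(url, target)
```

Calling download_days("criteo", 0, 0) would fetch only the first day, which is a reasonable way to check the connection before committing to the full download.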


Direct access from Azure:

The data can be used directly by Azure Machine Learning and Azure HDInsight (Hadoop) users.
In Hive queries on HDInsight, use LOCATION 'wasb://criteo@azuremlsampleexperiments.blob.core.windows.net/raw/'
(the first 21 days are stored in the “count” subfolder, the last three days in the “train” and “test” subfolders).
Azure Machine Learning users can run Hive queries directly via the Reader module, or use Learning with Counts modules.

For further details, please check here.
