This dataset contains feature values and click feedback for millions of display
ads. Its purpose is to benchmark algorithms for clickthrough rate (CTR) prediction.
It is hosted on HuggingFace: https://huggingface.co/datasets/criteo/CriteoClickLogs
It is similar, but larger, to the dataset released for the Display Advertising
Challenge hosted by Kaggle: https://www.kaggle.com/c/criteo-display-ad-challenge
Full description:
This dataset contains 24 files, each one corresponding to one day of data.
Dataset construction:
The training dataset consists of a portion of Criteo’s traffic over a period
of 24 days. Each row corresponds to a display ad served by Criteo and the first
column is indicates whether this ad has been clicked or not.
The positive (clicked) and negatives (non-clicked) examples have both been
subsampled (but at different rates) in order to reduce the dataset size.
There are 13 features taking integer values (mostly count features) and 26
categorical features. The values of the categorical features have been hashed
onto 32 bits for anonymization purposes.
The semantic of these features is undisclosed. Some features may have missing values.
The rows are chronologically ordered.
Format:
The columns are tab separated with the following schema:
<label> <integer feature 1> … <integer feature 13> <categorical feature 1> … <categorical feature 26>
When a value is missing, the field is just empty.
Difference with the Kaggle challenge dataset:
– The dataset is not over the same time period;
– The subsampling ratios are different;
– The ordering of the features is not the same and the computation of some of them has changed;
– The hash function for categorical features is different.