Criteo Uplift Prediction Dataset

By: Criteo AI Lab / 31 May 2018

Criteo Uplift Modeling Dataset

This dataset is released along with the paper:

A Large Scale Benchmark for Uplift Modeling
Eustache Diemert, Artem Betlei, Christophe Renaudin; (Criteo AI Lab), Massih-Reza Amini (LIG, Grenoble INP)

This work was published in: AdKDD 2018 Workshop, in conjunction with KDD 2018.

When using this dataset, please cite the paper with the following BibTeX entry:

@inproceedings{Diemert2018,
author = {Diemert, Eustache and Betlei, Artem and Renaudin, Christophe and Amini, Massih-Reza},
title = {A Large Scale Benchmark for Uplift Modeling},
publisher = {ACM},
booktitle = {Proceedings of the AdKDD and TargetAd Workshop, KDD, London, United Kingdom, August 20, 2018},
year = {2018}
}

We would love to hear from you if you use this data or plan to use it. See the contact details below.

Data description

This dataset is constructed by assembling data resulting from several incrementality tests, a particular randomized trial procedure where a random part of the population is prevented from being targeted by advertising. It consists of 25M rows, each one representing a user with 12 features, a treatment indicator and 2 labels (visits and conversions).


For privacy reasons the data has been sub-sampled non-uniformly so that the original incrementality level cannot be deduced from the dataset while preserving a realistic, challenging benchmark. Feature names have been anonymized and their values randomly projected so as to keep predictive power while making it practically impossible to recover the original features or user context.


Here is a detailed description of the fields (they are comma-separated in the file):

  • f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f10, f11: feature values (dense, float)
  • treatment: treatment group (1 = treated, 0 = control)
  • conversion: whether a conversion occurred for this user (binary, label)
  • visit: whether a visit occurred for this user (binary, label)
  • exposure: whether the user was effectively exposed to the treatment (binary)
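
As a minimal sketch of reading these fields, the following Python snippet parses comma-separated rows using only the standard library. It assumes that the file starts with a header line carrying the field names listed above and that the binary fields are written as 0/1; `read_rows` is a hypothetical helper, and the two sample rows are made-up values for illustration, not taken from the dataset.

```python
import csv
import io

FEATURES = [f"f{i}" for i in range(12)]  # f0 .. f11
BINARY = ["treatment", "conversion", "visit", "exposure"]


def read_rows(fileobj):
    """Yield one dict per user, with float features and int indicators."""
    for raw in csv.DictReader(fileobj):
        row = {name: float(raw[name]) for name in FEATURES}
        # int(float(...)) tolerates both "1" and "1.0" in the file.
        row.update({name: int(float(raw[name])) for name in BINARY})
        yield row


# Synthetic two-row sample (made-up values, assumed header line).
sample = io.StringIO(
    ",".join(FEATURES + BINARY) + "\n"
    + ",".join(["0.5"] * 12 + ["1", "0", "1", "0"]) + "\n"
    + ",".join(["1.5"] * 12 + ["0", "0", "0", "0"]) + "\n"
)
rows = list(read_rows(sample))
```

Because `read_rows` is a generator, the same function could be pointed at an opened file and consumed row by row without holding the whole dataset in memory.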

Key figures

  • Format: CSV
  • Size: 459MB (compressed)
  • Rows: 25,309,483
  • Average Visit Rate: .04132
  • Average Conversion Rate: .00229
  • Treatment Ratio: .846
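
The figures above could be recomputed from parsed rows along the following lines. This is a sketch under assumptions: `key_figures` is a hypothetical helper, rows are assumed to be dicts keyed by the field names, and the four demo rows are synthetic.

```python
def key_figures(rows):
    """Return row count, visit rate, conversion rate and treatment ratio."""
    n = visits = conversions = treated = 0
    for row in rows:  # single pass, so a generator of rows also works
        n += 1
        visits += row["visit"]
        conversions += row["conversion"]
        treated += row["treatment"]
    if n == 0:
        raise ValueError("no rows to summarise")
    return {
        "rows": n,
        "visit_rate": visits / n,
        "conversion_rate": conversions / n,
        "treatment_ratio": treated / n,
    }


# Synthetic rows (made-up values) just to exercise the helper.
demo = [
    {"treatment": 1, "visit": 1, "conversion": 0},
    {"treatment": 1, "visit": 0, "conversion": 0},
    {"treatment": 1, "visit": 0, "conversion": 0},
    {"treatment": 0, "visit": 0, "conversion": 0},
]
figures = key_figures(demo)
```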


The dataset was collected and prepared with uplift prediction in mind as the main task. Additionally, we foresee related uses such as, but not limited to:

  • benchmark for causal inference
  • uplift modeling
  • interactions between features and treatment
  • heterogeneity of treatment
  • benchmark for observational causality methods
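
To make the main task concrete: because treatment was assigned at random, a simple baseline estimate of uplift is the difference between the outcome rate of treated users and that of control users, and computing it on a subset of users gives a first look at heterogeneity. The sketch below assumes rows are dicts keyed by the field names; `uplift` and `make` are hypothetical helpers, the f0 threshold is arbitrary, and the eight rows are synthetic. Given the non-uniform sub-sampling described above, such an estimate describes this dataset rather than the original campaigns.

```python
def uplift(rows, label="visit"):
    """Difference in mean outcome between treated and control rows."""
    counts = {0: [0, 0], 1: [0, 0]}  # treatment -> [n_users, n_positive]
    for row in rows:
        group = counts[row["treatment"]]
        group[0] += 1
        group[1] += row[label]
    (n_c, pos_c), (n_t, pos_t) = counts[0], counts[1]
    if n_c == 0 or n_t == 0:
        raise ValueError("need at least one treated and one control row")
    return pos_t / n_t - pos_c / n_c


def make(treatment, f0, visit):
    """Build a synthetic row with only the fields used here."""
    return {"treatment": treatment, "f0": f0, "visit": visit}


# Synthetic rows: 4 treated, 4 control (made-up values).
demo = [
    make(1, 1.0, 1), make(1, 1.0, 1), make(1, 0.0, 0), make(1, 0.0, 0),
    make(0, 1.0, 1), make(0, 1.0, 0), make(0, 0.0, 0), make(0, 0.0, 0),
]

overall = uplift(demo)
# Uplift within two segments defined by an arbitrary threshold on f0.
high = uplift([r for r in demo if r["f0"] > 0.5])
low = uplift([r for r in demo if r["f0"] <= 0.5])
```

In this toy data the uplift is concentrated in the high-f0 segment, which is the kind of feature and treatment interaction an uplift model is meant to capture.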


Contact

For any questions, feel free to contact:

  • The authors of the paper directly (emails in the paper)
  • Criteo AI Lab team:
  • Criteo AI Lab twitter account: @CriteoResearch


Non-uniformity of the incrementality level across advertisers caused the first version of the dataset to have a leak: uplift prediction could be artificially improved by differentiating advertisers using individual features (the distribution of features being advertiser-dependent).

For this reason, we release an unbiased version of the dataset containing the same fields. Here are its corresponding key figures:

Key figures

  • Format: CSV
  • Size: 297MB (compressed)
  • Rows: 13,979,592
  • Average Visit Rate: .046992
  • Average Conversion Rate: .00292
  • Treatment Ratio: .85

Download instructions

To download the unbiased dataset, click here