Criteo Uplift Prediction Dataset

By: Criteo AI Lab / 31 May 2018

Criteo Uplift Modeling Dataset

This dataset is released along with the paper:

A Large Scale Benchmark for Uplift Modeling
Eustache Diemert, Artem Betlei (Criteo AI Lab), Christophe Renaudin (Criteo), Massih-Reza Amini (LIG, Grenoble INP)

This work was published in: AdKDD 2018 Workshop, in conjunction with KDD 2018.

When using this dataset, please cite the paper with the following BibTeX entry:

author = {{Diemert Eustache, Betlei Artem} and Renaudin, Christophe and Massih-Reza, Amini},
title={A Large Scale Benchmark for Uplift Modeling},
publisher = {ACM},
booktitle = {Proceedings of the AdKDD and TargetAd Workshop, KDD, London, United Kingdom, August, 20, 2018},
year = {2018}

We would love to hear from you if you use this data or plan to use it. Refer to the Contact section below.

Data description

This dataset is constructed by assembling data resulting from several incrementality tests, a particular randomized trial procedure where a random part of the population is prevented from being targeted by advertising. It consists of 25M rows, each one representing a user with 12 features (f0 to f11), a treatment indicator and 2 labels (visits and conversions).


For privacy reasons, the data has been sub-sampled non-uniformly so that the original incrementality level cannot be deduced from the dataset, while still preserving a realistic, challenging benchmark. Feature names have been anonymized and their values randomly projected, so as to keep predictive power while making it practically impossible to recover the original features or user context.


Here is a detailed description of the fields (they are comma-separated in the file):

  • f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f10, f11: feature values (dense, float)
  • treatment: treatment group (1 = treated, 0 = control)
  • conversion: whether a conversion occurred for this user (binary, label)
  • visit: whether a visit occurred for this user (binary, label)
  • exposure: treatment effect, whether the user has been effectively exposed (binary)
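
As an illustration of the field layout above, here is a minimal sketch of a row parser using only the Python standard library. It assumes the CSV starts with a header line whose column names match the field names listed above, and that the binary fields are written as integers; neither is stated on this page. The function name read_rows and the sample values are made up for the example.

```python
import csv
import io

FEATURES = [f"f{i}" for i in range(12)]  # f0 .. f11, as listed above
BINARY = ["treatment", "conversion", "visit", "exposure"]


def read_rows(handle):
    """Yield one dict per user: features as float, other fields as int.

    Assumes a header line with the column names above (an assumption,
    not something this page confirms).
    """
    for raw in csv.DictReader(handle):
        row = {name: float(raw[name]) for name in FEATURES}
        row.update({name: int(raw[name]) for name in BINARY})
        yield row


# Tiny in-memory sample with made-up values, standing in for the real file.
sample = io.StringIO(
    ",".join(FEATURES + BINARY) + "\n"
    + ",".join(["0.5"] * 12 + ["1", "0", "1", "1"]) + "\n"
    + ",".join(["1.5"] * 12 + ["0", "0", "0", "0"]) + "\n"
)
rows = list(read_rows(sample))
```

For the real file, the same function should work on any text-mode handle, for example one returned by open(path, newline="") on the decompressed CSV, where path is wherever you saved the download. Because read_rows is a generator, the 25M rows do not need to fit in memory at once.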

Key figures

  • Format: CSV
  • Size: 459MB (compressed)
  • Rows: 25,309,483
  • Average Visit Rate: 0.04132
  • Average Conversion Rate: 0.00229
  • Treatment Ratio: 0.846
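
The rates above can be recomputed from the rows. The sketch below assumes that the visit and conversion rates are the means of the visit and conversion columns, and that the treatment ratio is the share of rows with treatment = 1; this page does not spell out those definitions. The function name key_figures and the toy rows are hypothetical.

```python
def key_figures(rows):
    """Return (visit rate, conversion rate, treatment ratio) for an iterable of rows.

    Each row is a mapping with integer 'visit', 'conversion' and
    'treatment' entries (0 or 1).
    """
    n = visits = conversions = treated = 0
    for row in rows:
        n += 1
        visits += row["visit"]
        conversions += row["conversion"]
        treated += row["treatment"]
    if n == 0:
        raise ValueError("no rows given")
    return visits / n, conversions / n, treated / n


# Made-up rows, only to show the call; not taken from the dataset.
toy = [
    {"treatment": 1, "visit": 1, "conversion": 0},
    {"treatment": 1, "visit": 0, "conversion": 0},
    {"treatment": 1, "visit": 1, "conversion": 1},
    {"treatment": 0, "visit": 0, "conversion": 0},
]
visit_rate, conversion_rate, treatment_ratio = key_figures(toy)
```

The function makes a single pass and keeps only four counters, so it can be fed a streaming reader over the full file.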


The dataset was collected and prepared with uplift prediction in mind as the main task. Additionally, we can foresee related usages such as, but not limited to:

  • benchmark for causal inference
  • uplift modeling
  • interactions between features and treatment
  • heterogeneity of treatment
  • benchmark for observational causality methods
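
As a starting point for the tasks listed above, the simplest uplift estimate is the difference in mean outcome between treated and control users, which is meaningful here because treatment assignment comes from randomized tests. The sketch below computes it overall and per segment, as a crude look at heterogeneity of treatment. It is a baseline illustration, not the method of the paper; the function names, the segment rule on f0 and the toy rows are all hypothetical. Since the data was sub-sampled non-uniformly, the value obtained on the real file should not be read as the original incrementality level.

```python
from collections import defaultdict


def average_uplift(rows, label="visit"):
    """Mean of `label` among treated rows minus its mean among control rows."""
    totals = {0: 0, 1: 0}
    counts = {0: 0, 1: 0}
    for row in rows:
        group = row["treatment"]
        totals[group] += row[label]
        counts[group] += 1
    if counts[0] == 0 or counts[1] == 0:
        raise ValueError("need at least one treated and one control row")
    return totals[1] / counts[1] - totals[0] / counts[0]


def uplift_by_segment(rows, segment_of, label="visit"):
    """Apply average_uplift separately within each segment of the rows."""
    groups = defaultdict(list)
    for row in rows:
        groups[segment_of(row)].append(row)
    return {seg: average_uplift(members, label) for seg, members in groups.items()}


# Made-up rows: (f0, treatment, visit). Not taken from the dataset.
toy = [
    {"f0": f0, "treatment": t, "visit": v}
    for f0, t, v in [
        (0.0, 1, 1), (0.0, 1, 0), (0.0, 0, 0), (0.0, 0, 0),
        (1.0, 1, 1), (1.0, 1, 1), (1.0, 0, 1), (1.0, 0, 1),
    ]
]
overall = average_uplift(toy)
per_segment = uplift_by_segment(toy, lambda r: "high" if r["f0"] > 0.5 else "low")
```

In the toy data, the treated and control visit rates are identical in the "high" segment and differ only in the "low" segment, which is the kind of feature and treatment interaction an uplift model tries to learn.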


Contact

For any questions, feel free to contact:

  • The authors of the paper directly (emails in the paper)
  • Criteo AI Lab team:
  • Criteo AI Lab twitter account: @CriteoResearch

Download instructions

To download the dataset, click here