This conference gathered over 3000 attendants at the interface of academia and industry. More than 10 scientists and engineers from Criteo were there! As in most top tier conferences, we were one of the sponsors and our booth showcased our current AI initiatives.
Unsupervised learning highlight
One of the most classical clustering algorithms is k-means; we were delighted to discover “Scalable k-Means Clustering via Lightweight Coresets“, which proposes a simple variant of it. Just generate a coreset of arbitrary size in only two parallelizable passes on the data, run the k-means algorithm on this weighted sub-sample of data, and you’re set. This coreset generation approach is much more reliable than uniform sampling, and much faster than k-means++-like procedures. Simple and efficient: the kind of algorithms we love!
We had a full standing room for the AdKDD & TargetAd workshop, co-organized by our Research VP Suju Rajan and other leaders of the Ad Science landscape. Among the interesting papers, we liked the Deep Neural Net with Attention for Multi-channel Multi-touch Attribution that pushes attribution modeling (a major research area in AdTech) by blending in deep RL model and modern explainability techniques. In this context, Criteo published a A Large Scale Benchmark for Uplift Modeling, hoping to foster research in causal inference and provable advertising optimization by releasing a new dataset and bridging the two major metrics of the task (dataset can be downloaded here).
Ensemble methods refer to the practice of mixing different models in order to achieve greater accuracy than each individual model. At first the authors explained why ensemble methods work: As every model has an error, taking the average of two models, if their error are independent will divide by two the variance of the error and therefore the model accuracy. A simple way to use ensemble methods would therefore be to find the linear combination of the model outputs which minimizes the variance of the errors. It is also possible to add Bayesian prior to this minimization (for instance to give more weight to the more accurate model). Additionally, an option is to use the outputs of the different models as input to a new model in order to use the “stacked” prediction. It is important to remember that for ensemble methods to work, it is not necessary to have two models with exactly the same performance, but you need their error to be as uncorrelated as possible. Ensemble methods with two linear models are therefore likely to be less performant than ensemble methods using a linear and a non-linear method.
Experimental Design highlight
An interesting line of work for people doing A/B tests is the “Experiment of Experiment” design proposed by J. Pouget-Abadie et al. It allows one to measure the gap between different split methods and is especially useful in the case of network interference that violates the classical SUTVA assumption. This situation happens in many online applications: social networks, cross-device identification and so on. Under a monotonicity assumption it even allows to choose the least-biased split!
Stein’s Paradox in Statistics and ML
Ali Rahimi gave an interesting lecture shading new light into regularization. He (re)introduced Stein’s phenomena which is well-known to statisticians but is relatively unknown to machine learning practitioners. Stein showed in the 70’s that a maximum likelihood estimator of an average is not optimal, and gave particular examples where by shrinking the maximum likelihood estimator one can achieve equal or better estimators (in terms of mean square error). This discovery stunned the statistical world at the time. Rahimi emphasized that maybe the ML community has been studying Stein’s phenomenon all along by what is called regularization, or intervention (by operation on layers, regularizing weights, etc.). As known, regularization is a common way to control the model capacity, to stabilize learning, to generalize better. However, as Rahimi repeated “No one ever proved empirical risk minimization was optimal”. The point of Rahimi is that: “There is a stein effect in empirical risk minimization”, and maybe we’ve been trying to correct for it all along without talking about it explicitly, by instead referring to “overfitting”, or “reducing generalization gaps”, or “reducing model capacity” or “introducing side information” .
Tim Roughgarden gave a great introductory talk to auction theory, and on what’s going on behind the scene when ads space are sold on the internet. He introduced Myerson’s optimal auction design (of a single item auction in a symmetric setting) which states how the seller should design the auction such that his expected revenue is maximized. By doing so, in a second price auction mechanism, the auction design moves the auction from a welfare-maximizing auction (where no reserve price is set, or set to 0), to an auction optimizing for the seller revenue (i.e. where the seller constraints the buyer to a reserve price (i.e. minimum accepted bid)). In such a scenario, a recent work from Criteo shows that bidding truthfully is not an optimal strategy for the buyer and details different optimal bidding strategies. Tim’s talk also considered more complex auction mechanisms with multiple items and multiple reserve prices.
Graph-based algorithms have proven to be strongly efficient in the common situation where the data cannot be simply described with “flat” features (e.g. when most of the information comes from pairwise relations between entities). They are also very powerful when the data exhibits a low-dimension manifold structure that prevents us from using simpler algorithms based on Euclidean proximity (e.g. k-means). However, the high computational cost of many graph methods, such as the Spectral Clustering algorithm, limits its practicality. The graph learning community has been quite active at overcoming these limitations in the past years and we observed the continuation of this trend at KDD 2018. Among the interesting works presented on the subject this year, the highlights were the Fast Sylvester equation solver, the improvements to the classical Spectral Clustering methods presented (see Wu et al. and Chen et al.) and the new approximated method introduced by Cohen-Steiner to compute the graph spectrum, “almost regardless” of the size of the graph.
Distilling knowledge in recommender systems
Quality of a recommender system often comes at the price of complexity. In real-world applications where infrastructure considerations come into play, it is often hard to obtain a good model with a reasonable inference time. Knowledge Distillation was first introduced in 2015 by Hinton et al. and provides an efficient way of reducing the complexity of a classifier by adopting a “student / teacher” paradigm: a simple model is trained with the help of a (pre-trained) complex model, which improves its final performance. The work presented at KDD this year by Tang et al. extended the distillation method to recommender systems.
Personalization for search rankings
The recommendation system at Airbnb faces some issues related to scarcity of data. As explained by Grbovic et at, it is rare for a user to book the same place twice and their availability constraints during a search varies a lot. In order to work through those constraints, they presented a solution leveraging listing embeddings that would be more focused on the attributes of the listing itself than location or specific ids. As a benefit, it reduces impact of cold start for newer items. They apply the same type of embeddings to the users’ past bookings to have personalized results. Finally, when combined with their localized search, it allows them to have realtime personalization of search rankings.
OpenTag: attribute value extraction from product profile
When working with product catalogs, having complete profiles with detailed attributes can be key. A lot of the existing work is done by leveraging hand crafted lexicons or annotations. OpenTag stands out because the main focus is to break this closed world assumption and try to come up with a solution that would be able to extract new attribute values with minimal supervision or human interactions. They held an interesting session on their combination of CRF, bidirectional LSTM, word embedding and active learning.
Lot of learning, fun, and social interactions. Hope to see you all for the next edition in Alaska!
Post written by: