The 35th International Conference on Machine Learning (ICML) in 2018 just finished in Stockholm, Sweden and Criteo was there!
This year, one of the world's leading machine learning conferences had a record-breaking 618 accepted research papers, presented across 10 conceptually different tracks, from reinforcement learning to Gaussian processes to optimization and deep learning. On the participation side, more than 30 scientists and engineers from Criteo attended ICML, and on the publication side, we presented two research papers. As at most top-tier conferences, we were one of the sponsors, and our booth showcased our current AI initiatives. Here is a photo with some of the Criteos at ICML.
Criteo AI Lab
Two weeks before ICML, Criteo announced the launch of the Criteo AI Lab, a new milestone in our history, aimed at fostering innovation and improving the performance of digital advertising. With a major investment of 20M euros over three years, the Criteo AI Lab sets the goal of establishing a better understanding of the models used in online advertising and pushing the interpretability, robustness and effectiveness of computational advertising to a new level. If you are up to the challenge, take a look at our careers page and join this exciting adventure!
Below are what we think were some of the highlights of the conference, by area of interest.
Reinforcement learning for continuous control
At ICML this year, a whole tutorial session by Benjamin Recht was dedicated to RL for continuous control problems. One of the key messages was that the first step in this direction is to study the Linear Quadratic Regulator (LQR) problem. Indeed, while LQR is by now a standard optimal control problem (with continuous states and actions), addressing the unknown-dynamics case illuminates most of the difficulties at stake in general. Several papers at ICML this year focused on RL for LQR! Marc Abeille, researcher at Criteo, presented his work on a theoretical analysis of Thompson Sampling (TS) for LQR. The motivation behind this work is that TS offers an efficient way of addressing the exploration-exploitation trade-off in this setting (as opposed to algorithms based on optimism in the face of uncertainty, OFU), but so far has come with very poor frequentist performance guarantees, even though numerical experiments suggest it performs at least as well as OFU. Marc's work aims at closing the gap between experimental and theoretical performance, showing that in the one-dimensional case, TS suffers at most O(sqrt(T)) cumulative regret, thus matching the guarantee previously derived for OFU. Finally, the derivation of this novel bound reveals an interesting property of TS: due to its randomized nature, it seems necessary to update the control policy very frequently, in stark contrast with how other RL algorithms are usually implemented. As a result, the analysis required a novel way of controlling the regret induced by frequent updates, by showing that one can afford to update the policy at each time step as long as the updates remain 'smooth enough', a condition that the TS algorithm ultimately satisfies.
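To make the setting concrete, here is a minimal, illustrative sketch of Thompson Sampling on a one-dimensional LQR problem. This is our own toy example, not Marc's actual algorithm or analysis; the dynamics parameters `a_true`, `b_true`, the costs and the posterior model are all made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# True (unknown to the agent) 1-D dynamics: x_{t+1} = a*x + b*u + noise
a_true, b_true, noise_std = 0.9, 0.5, 0.1
q, r = 1.0, 1.0  # state cost q*x^2, control cost r*u^2

def riccati_gain(a, b, n_iter=200):
    """Fixed-point iteration on the scalar discrete Riccati equation;
    returns the optimal feedback gain K, i.e. u = K*x."""
    p = q
    for _ in range(n_iter):
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    return -(a * b * p) / (r + b * b * p)

# Bayesian linear regression posterior over theta = (a, b)
prec = np.eye(2)          # posterior precision, identity prior
theta_stat = np.zeros(2)  # sufficient statistic: sum of z * x_next

x, T = 0.0, 500
for t in range(T):
    # Thompson Sampling: draw ONE model from the posterior, act optimally for it
    mean = np.linalg.solve(prec, theta_stat)
    cov = np.linalg.inv(prec) * noise_std ** 2
    a_s, b_s = rng.multivariate_normal(mean, cov)
    u = riccati_gain(a_s, b_s) * x
    # Step the true system, then update the posterior with the observed transition
    x_next = a_true * x + b_true * u + noise_std * rng.normal()
    z = np.array([x, u])
    prec += np.outer(z, z)
    theta_stat += z * x_next
    x = x_next
```

Note that the policy is re-sampled and re-computed at every time step, which is exactly the frequent-update regime that Marc's analysis has to control.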
Embeddings
Vector representations, a.k.a. embeddings, replace old one-hot encoded vectors with distributed versions learned by neural networks. Embeddings are studied extensively at Criteo (if you are curious, take a look at our previous work on product and user-timeline embeddings), and this year Sergey Ivanov, one of our research scientists in Paris, presented his work on embedding networks into a latent space. His paper, Anonymous Walk Embeddings, describes a method for embedding graph structure, for graph classification purposes, based on a recently introduced graph object: the anonymous walk. What is interesting about these embeddings is their reconstruction properties: in particular, he shows that the embeddings obtained by his method guarantee a unique representation of each graph, i.e. any two different graphs have two different embeddings. This stands in stark contrast with previous methods, which offer no bijection between the space of all graphs and the latent space of embeddings.
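To give a flavor of the core object, here is a small sketch (our own illustration, not the paper's code) of the feature-based variant: relabel each random walk by the order in which nodes first appear, and represent a graph by its empirical distribution over these anonymous patterns. The example graphs and sampling parameters are made up; the paper also has a learned, doc2vec-style variant not shown here.

```python
import random
from collections import Counter

def anonymous_walk(walk):
    """Relabel nodes by order of first appearance: (a, b, a, c) -> (0, 1, 0, 2)."""
    first_seen = {}
    return tuple(first_seen.setdefault(v, len(first_seen)) for v in walk)

def awe(adj, walk_len=4, n_samples=10_000, seed=0):
    """Embed a graph as its empirical distribution over anonymous walk
    patterns of a fixed length, estimated from uniform random walks."""
    rng = random.Random(seed)
    nodes = list(adj)
    counts = Counter()
    for _ in range(n_samples):
        v = rng.choice(nodes)
        walk = [v]
        for _ in range(walk_len - 1):
            v = rng.choice(adj[v])
            walk.append(v)
        counts[anonymous_walk(walk)] += 1
    total = sum(counts.values())
    return {pattern: c / total for pattern, c in counts.items()}

# A triangle and a 4-cycle are distinguished by their anonymous-walk profiles:
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
square = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
```

For instance, the pattern (0, 1, 2, 0) (a walk closing a 3-cycle) occurs on the triangle but never on the 4-cycle, which is bipartite and has no odd closed walks.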
Another highlight on graph embeddings: the trend of using hyperbolic rather than Euclidean space for embedding models continued at ICML 2018. Nickel and Kiela presented their work on using the Lorentz model of hyperbolic space to learn hierarchical relationships from unstructured similarity scores. Their contribution is the design of Riemannian optimization in the Lorentz model, which proves more efficient and more numerically stable than in the Poincaré model. We will continue watching this body of work with interest, especially from the point of view of product embeddings for recommendation.
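For the curious, the Lorentz (hyperboloid) model is easy to state in code. Here is a small sketch, our own and with made-up points, of the lift onto the hyperboloid and the geodesic distance that this line of work optimizes:

```python
import numpy as np

def lorentz_inner(x, y):
    """Lorentzian inner product <x, y>_L = -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def to_hyperboloid(v):
    """Lift a point v in R^n onto the hyperboloid {x : <x, x>_L = -1, x0 > 0}."""
    x0 = np.sqrt(1.0 + np.dot(v, v))
    return np.concatenate(([x0], v))

def lorentz_distance(x, y):
    """Geodesic distance on the hyperboloid: arccosh(-<x, y>_L)."""
    # Clip for numerical safety: -<x, y>_L >= 1 holds exactly in infinite precision
    return np.arccosh(np.maximum(-lorentz_inner(x, y), 1.0))
```

The appeal is that this distance involves no division by quantities that vanish near the boundary, which is one source of the numerical instabilities of the Poincaré ball.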
Interpretability
There is a growing concern in the machine learning community around understanding the decisions taken by models. From a theoretical perspective, researchers are using sparse models and dictionary learning as an elegant way to explain the efficiency of deep networks. More practically, decision trees, prototypes and nearest-neighbor methods are good candidates for explaining the decisions automatically taken by a system, as shown, for instance, by Cynthia Rudin in her invited talk at Saturday's workshop on Human Interpretability in ML, featuring a deep model that learns to classify images based on patches from the training set (the "This looks like that" paper).
Another important way of understanding machine learning models is through visualization. At the same workshop, Fernanda Viégas and Martin Wattenberg showcased their work at Google. We wish more people published efficient tools and methods for data visualization.
Temporal point processes
Temporal point processes are a relatively niche topic in machine learning, falling outside the usual classification, regression and density estimation paradigms, but they have important applications in modeling the temporal aspects of how users respond to marketing stimuli. Rodriguez and Valera gave an accessible and very interesting tutorial on this important area.
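For readers new to the area, the classic example is the self-exciting (Hawkes) process, where each event temporarily raises the rate of future events, a natural fit for bursty user activity. Here is a minimal simulation sketch via Ogata's thinning algorithm; it is our own illustration and all parameter values are made up:

```python
import math
import random

def simulate_hawkes(mu, alpha, beta, horizon, seed=0):
    """Simulate a Hawkes process by Ogata's thinning.
    Intensity: lambda(t) = mu + sum over past events t_i of alpha*exp(-beta*(t - t_i)).
    Between events the intensity only decays, so its value at the current time
    is a valid upper bound for the thinning step."""
    rng = random.Random(seed)
    events, t = [], 0.0
    while t < horizon:
        lam_bar = mu + sum(alpha * math.exp(-beta * (t - ti)) for ti in events)
        t += rng.expovariate(lam_bar)        # candidate inter-arrival time
        if t >= horizon:
            break
        lam_t = mu + sum(alpha * math.exp(-beta * (t - ti)) for ti in events)
        if rng.random() <= lam_t / lam_bar:  # accept with prob lambda(t)/lam_bar
            events.append(t)
    return events
```

With `alpha / beta < 1` the process is stationary; setting `alpha = 0` recovers a plain Poisson process of rate `mu`.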
Approximate inference
There were many exciting developments in approximate inference. Inference in generative models remained a hot topic, with many papers presenting impressive results on generative adversarial networks and variational autoencoders. It is impossible to summarise all the exciting results here, but a few highlights: Neural Processes by Garnelo et al. develops a new model combining the advantages of Gaussian processes with deep learning and shows impressive performance on a range of tasks. Many papers used implicit distributions, both for modelling and as variational approximations; a highlight among these was Neural Autoregressive Flows by Huang et al., which introduced an implicit distribution that is a universal approximator for continuous probability distributions, allowing for more accurate variational approximations.
Optimising variational approximations (as well as reinforcement learning) requires the ability to estimate derivatives from Monte Carlo samples. The "re-parameterisation trick" and the "score function estimator" have been the workhorses behind dozens of machine learning results. The extension of the score function estimator to derivatives of arbitrary order in The Infinitely Differentiable Monte Carlo Estimator by Foerster et al. is an exciting technical development that we expect to have far-reaching impact.
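As a quick illustration of why the choice of estimator matters, here is a toy comparison of the two first-order workhorses on d/dmu E_{x~N(mu,1)}[x^2] = 2*mu. This is our own example, unrelated to the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n = 1.5, 100_000
eps = rng.normal(size=n)
x = mu + eps  # x ~ N(mu, 1)

# Target derivative: d/dmu E[x^2] = 2*mu = 3.0

# Score function (REINFORCE) estimator: f(x) * d/dmu log N(x; mu, 1) = x^2 * (x - mu)
score_est = x ** 2 * (x - mu)

# Re-parameterisation estimator: differentiate f(mu + eps) directly: 2*(mu + eps)
reparam_est = 2.0 * (mu + eps)

print(score_est.mean(), reparam_est.mean())  # both ~ 3.0 (both estimators are unbiased)
print(score_est.var(), reparam_est.var())    # the score estimator's variance is far larger
```

Both estimators are unbiased, but the re-parameterised one has dramatically lower variance here, which is why it is preferred whenever the sampling path is differentiable.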
Closer to our core business problems, Criteo must often process large-cardinality categorical variables, e.g. when modelling the products in large catalogues. Traditional maximum likelihood and Bayesian estimation can become expensive due to the normalization over the large number of states; the traditional response is to use alternatives to maximum likelihood. Augment and Reduce, presented by Ruiz et al., is a new latent-variable representation and variational EM algorithm that addresses this problem, allowing fast maximum likelihood estimation.
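To see where the cost comes from, here is a small sketch of the bottleneck and of one generic workaround, subsampling the partition function by uniform importance sampling. Note this is NOT the Augment-and-Reduce bound, just a simple estimator we use to illustrate the problem; all sizes and parameters are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 100_000, 16                       # e.g. number of products in a catalogue
W = rng.normal(scale=0.1, size=(V, d))   # per-class embedding vectors
h = rng.normal(size=d)                   # context representation
target = 42

def exact_logprob(h, W, target):
    """Exact softmax log-likelihood: needs a dot product with ALL V classes."""
    s = W @ h                            # O(V * d) -- the normalization bottleneck
    m = s.max()
    return s[target] - (m + np.log(np.exp(s - m).sum()))

def sampled_logprob(h, W, target, k=2000):
    """Cheap alternative with only k << V dot products: estimate the partition
    function via Z = V * E_{i~Uniform}[exp(s_i)]. This is unbiased for Z but
    slightly biased for log Z."""
    idx = rng.integers(0, V, size=k)
    s = W[idx] @ h
    return W[target] @ h - (np.log(V) + np.log(np.mean(np.exp(s))))
```

Augment and Reduce replaces this kind of crude estimator with a principled latent-variable bound, but the motivation, avoiding the O(V) sum, is the same.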
Ethics in AI
Ethics in machine learning is emerging as an important topic. A message that came through both in the tutorial by Corbett-Davies and Goel and in some of the papers (e.g. Liu et al.) is that naïve applications of fairness criteria can be counterproductive. We are closely following these issues to understand whether there are findings that impact our massive-scale recommender systems.
Causality
The CausalML workshop took place at the end of a whole week of conference. Because of this, and the sunny Sunday weather outside, we were a bit afraid that attendance would be rather low on this last day. Fortunately, by 9am the room was packed, and Don Rubin gave a very insightful walkthrough of the history of causality.
It was interesting to see the different points of view on the links between causality and learning throughout the day. Jonas Peters and Suchi Saria explored links between causality and the robustification of learnt models to changes of conditions at inference time. Victor Chernozhukov and Stefan Wager provided insights on the role of machine learning in inferring causality from observational data.
One of the highlights of the day was definitely Csaba Szepesvári's proposal to formalize the relationship between causality and reinforcement learning, with the intent of showing they are two sides of the same coin. He may not have convinced the entire audience, but his talk opened this long-overdue discussion. Unfortunately, he could not stay for the panel to continue the debate.
We concluded the day with the panel. In terms of attendance, we feel the event was quite a success, especially given that, for the people who stayed the entire day, there was an extremely compelling competing event: the World Cup final! Then again, maybe the population coming to a causality workshop suffers from a slight selection bias! As always in real-world situations, we are faced with an intractable confounder! 🙂
In the end, it was a great conference, with lots of bold ideas and interesting people.
We are looking forward to ICML 2019. Au revoir!
Post written by:
Senior Staff Researcher