ICML 2017 highlights

By: Criteo AI Lab / 05 Sep 2017

The 34th International Conference on Machine Learning (ICML) took place in Sydney, Australia on August 6-11, 2017. ICML is one of the most prestigious conferences in machine learning, covering a wide range of topics from both practical and theoretical viewpoints. This year it brought together researchers and practitioners for 434 talks in 9 parallel tracks, 9 tutorial sessions, and 22 workshops. We were there!

Machine learning is about learning from historical data. There are three distinct ways that machine learning systems can learn: supervised learning, unsupervised learning, and reinforcement learning. Recently, machine learning has seen tremendous success thanks to the end-to-end training capabilities of deep neural networks – so-called deep learning – which learn representations and predictor parameters at the same time. Deep learning architectures were originally developed for supervised learning problems such as classification, but have since been extended to a wide range of problems: from regression in supervised learning to unsupervised domains such as generative methods (e.g., GANs), as well as reinforcement learning (deep RL).

As expected, deep learning was one of the hottest topics this year, along with continuous optimization, reinforcement learning, GANs, and online learning. Deep learning continues to be a very active research area – over 20% of all sessions were devoted to it. Here is a selection of the observations, topics, and papers that captured our attention.

Deep learning 

The most widely discussed new challenges for deep learning were transfer learning, attention, and memory. There was a heavy emphasis on understanding how and why deep learning works. Several papers and workshops tried to address theoretical aspects in order to improve our understanding and to interpret results, which is crucial for many real-world applications. For example, there were workshops devoted to visualization for deep learning and to interpretability in machine learning, along with many studies on the interpretability of predictive models, on methodology for interpreting black-box machine learning models, and even on interpretable machine learning algorithms (e.g., architectures designed for interpretability).

The theory still seems far from being able to explain the effectiveness of current deep learning solutions. On this subject, the paper Sharp Minima Can Generalize For Deep Nets shows why the widely held hypothesis – that flatness of the local minimum found by stochastic gradient descent leads to good generalization – does not necessarily hold for deep nets: sharp minima can also generalize well.
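The core of the paper's argument is that "sharpness" is not invariant to reparameterization: rescaling the weights of a ReLU network can leave the function (and hence generalization) unchanged while making the minimum arbitrarily sharper. A minimal numpy sketch of this rescaling trick, using a toy one-hidden-unit network of our own invention (not the paper's experimental setup):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def net(x, w1, w2):
    # Toy one-hidden-unit ReLU network: f(x) = w2 * relu(w1 * x)
    return w2 * relu(w1 * x)

def loss(x, y, w1, w2):
    return np.mean((net(x, w1, w2) - y) ** 2)

x = np.linspace(0.1, 1.0, 20)
y = 2.0 * x                       # target realized exactly when w1 * w2 = 2

w1, w2 = 1.0, 2.0                 # one global minimum
alpha = 100.0
v1, v2 = alpha * w1, w2 / alpha   # rescaled minimum: identical function
assert np.allclose(net(x, w1, w2), net(x, v1, v2))

# Sharpness proxy: loss increase after the same fixed-size perturbation
# of the output weight. The rescaled minimum looks far "sharper", yet
# both parameterizations compute exactly the same predictor.
eps = 0.1
print(loss(x, y, w1, w2 + eps))   # small increase
print(loss(x, y, v1, v2 + eps))   # much larger increase
```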

At Criteo, we use deep learning for product recommendation and user selection.

Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs)

Following the success of CNNs on independent and identically distributed (IID) datasets such as ImageNet for classification tasks, recurrent neural networks (RNNs) have had some success in tackling problems whose datasets are not IID – in other words, time-dependent. For example, sequence-to-sequence prediction on text data has used RNNs with some success. However, current RNN architectures have not been as influential as CNNs yet, and face limitations in training and speed of convergence. The paper Convolutional Sequence to Sequence Learning introduces a CNN architecture that outperforms state-of-the-art RNN methods (LSTM-style) on sequence-to-sequence prediction. The advantage of this architecture is that it can be fully parallelized, leading to faster convergence on both GPU and CPU.
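The parallelism argument can be made concrete with a toy numpy sketch (our own illustration, not code from the paper): in a causal convolution every output position is an independent dot product over a fixed window, so all timesteps can be computed at once, whereas an RNN's hidden state forces a sequential loop.

```python
import numpy as np

rng = np.random.default_rng(0)

def causal_conv(x, w):
    """Causal 1-D convolution: output at step t sees only x[t-k+1 .. t].
    Each output position is an independent dot product, so the whole
    sequence can be processed in parallel."""
    k = len(w)
    x_pad = np.concatenate([np.zeros(k - 1), x])
    return np.array([x_pad[t:t + k] @ w for t in range(len(x))])

x = rng.normal(size=8)    # toy input sequence
w = rng.normal(size=3)    # kernel of width 3
out = causal_conv(x, w)

# Contrast: a minimal RNN-style recurrence is inherently sequential --
# step t cannot start before step t-1 has produced its hidden state.
h, prev = np.zeros(len(x)), 0.0
for t in range(len(x)):
    prev = np.tanh(0.5 * prev + x[t])
    h[t] = prev
```

Changing a late input never affects earlier convolution outputs, which is the causality property seq2seq decoders rely on.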

Generative Adversarial Networks (GANs)

On the unsupervised learning side, generative models such as GANs continue to interest the community. The paper Wasserstein Generative Adversarial Networks (WGANs) introduces an alternative to traditional GAN training that can improve the stability of learning. Another unsupervised approach is described in the paper Learning the Structure of Generative Models without Labeled Data, whose authors present a fast and accurate architecture tested on real-world data.
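The WGAN critic replaces the usual log-loss discriminator objective with a difference of expectations, kept Lipschitz by weight clipping. A heavily simplified sketch of one critic's training loop, using a linear critic and Gaussian blobs as stand-ins for a real network and real data (all names and constants here are our own toy choices):

```python
import numpy as np

rng = np.random.default_rng(1)

# Linear "critic" f(x) = x @ w, standing in for a neural network.
w = rng.normal(size=2) * 0.01

def critic(x, w):
    return x @ w

real = rng.normal(loc=[2.0, 2.0], size=(256, 2))   # "data" samples
fake = rng.normal(loc=[0.0, 0.0], size=(256, 2))   # "generator" samples

lr, clip = 0.05, 0.01
for _ in range(50):
    # WGAN critic objective: maximize E[f(real)] - E[f(fake)].
    # For a linear critic the gradient is just the difference of means.
    grad = real.mean(axis=0) - fake.mean(axis=0)
    w = w + lr * grad
    # Weight clipping enforces a (crude) Lipschitz constraint.
    w = np.clip(w, -clip, clip)

# The trained critic's value gap approximates (up to scale) the
# Wasserstein distance between the two sample distributions.
gap = critic(real, w).mean() - critic(fake, w).mean()
```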

While deep neural networks have shown their ability to learn hierarchical features through their hidden layers, GANs have benefited less from hierarchical architectures. The paper Learning Hierarchical Features from Generative Models presents an alternative hierarchical architecture that learns highly interpretable features.

Finally, there was an interesting paper, Learning to Discover Cross-Domain Relations with Generative Adversarial Networks, addressing pair discovery in unpaired data.

Reinforcement Learning (RL)

As expected, deep RL was a popular topic. It was mainly centered around computer game benchmarks; however, there was some emphasis on real-world applications of RL as well, such as robotics, recommendation systems, advertising, and even optimizing device placement for TensorFlow computational graphs.

The tutorial on real-world interactive learning highlighted the fact that off-policy evaluation and learning are needed to make reinforcement learning applicable to recommendation systems and computational advertising; however, the talk focused only on the contextual bandit setting (a very special case of RL with generalization). Off-policy learning is needed because the data is normally generated according to one policy, while evaluation or learning is carried out on another (a target policy that may differ from the one used to generate the data). It also decouples the exploration phase from the learning phase (for the learning agent), which is a remarkable property.
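The standard tool for this is the inverse propensity scoring (IPS) estimator: reweight each logged reward by the likelihood ratio between the target and logging policies. A small self-contained sketch on synthetic bandit data (the policies, action set, and reward rates below are our own toy assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
n_actions, n = 3, 100_000

# Logging (behavior) policy: uniform over the actions.
logging_probs = np.full(n_actions, 1.0 / n_actions)
actions = rng.integers(0, n_actions, size=n)

# True per-action reward rates -- unknown to the estimator.
true_reward = np.array([0.1, 0.5, 0.3])
rewards = rng.binomial(1, true_reward[actions])

# Target policy we want to evaluate offline: always pick action 1.
target_probs = np.array([0.0, 1.0, 0.0])

# IPS: reweight logged rewards by the target/logging likelihood ratio.
weights = target_probs[actions] / logging_probs[actions]
ips_estimate = np.mean(weights * rewards)

print(ips_estimate)   # should land near 0.5, the true value of action 1
```

The estimate is unbiased as long as the logging policy gives nonzero probability to every action the target policy might take.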

Policy-gradient methods are among the most popular techniques in RL for policy improvement, especially in the case of continuous actions. In the off-policy setting, the policy-gradient update comes with a likelihood ratio between the target and behavior policies, which can cause variance issues in the estimators and reduce overall performance. A few talks and papers (including tutorials and workshops) addressed this problem, such as Optimal and Adaptive Off-Policy Evaluation in Contextual Bandits and Evaluating the Variance of Likelihood-Ratio Gradient Estimators. Also, as a step towards RL applications, the paper Device Placement Optimization with Reinforcement Learning (still in research and not yet in production at Google) proposes a policy-gradient method that learns to optimize device placement for TensorFlow computational graphs.
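The variance problem is easy to reproduce numerically: the further the logging policy is from the target policy, the larger the likelihood ratios, and the variance of the reweighted estimator blows up. A toy two-action illustration of this effect (our own construction, not from any of the papers above):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000

def ips_variance(p_target, p_logging):
    """Empirical variance of the likelihood-ratio-weighted (IPS) reward
    estimate when actions are drawn from a mismatched logging policy.
    Two actions; p_* is the probability of choosing action 1."""
    actions = rng.binomial(1, p_logging, size=n)       # 0/1 actions
    rewards = rng.binomial(1, 0.2 + 0.6 * actions)     # action 1 is better
    probs_t = np.where(actions == 1, p_target, 1 - p_target)
    probs_l = np.where(actions == 1, p_logging, 1 - p_logging)
    ratios = probs_t / probs_l
    return np.var(ratios * rewards)

# Same target policy, increasingly mismatched logging policies:
well_matched = ips_variance(p_target=0.9, p_logging=0.9)
mismatched = ips_variance(p_target=0.9, p_logging=0.1)
# mismatched is dramatically larger: rare target actions get huge weights.
```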

Human Interpretability in Machine Learning

The importance of interpretability and explainability of machine learning algorithms, their results, and the machine-learned relationship between the inputs and the dependent variables has recently been an area of active interest. People making critical decisions with profound consequences are increasingly supported by machine learning algorithms. However, those algorithms are becoming highly complex systems (such as deep neural networks), and their black-box nature offers little comfort to decision makers. Therefore, it is imperative to develop machine learning methods that produce interpretable models with excellent predictive accuracy.

The focus of interpretability has been two-fold: first on the receiver of the decision ("why was I rejected for this job?") and second on the model creator ("why is my model giving these answers?"). As explained by Adrian Weller in his paper Challenges for Transparency, there are different types and goals of transparency that require different sorts of explanation and measures of efficacy. For a developer, it can be useful to understand how their system is working, in order to debug or improve it, and to monitor and test for safety standards. That way, they can see what is working well or badly, and get a sense for why. For a user, it might be necessary to convey what the system is doing and why, to enable prediction of what it might do in unforeseen circumstances and to build a sense of trust in the technology.

A great presentation was made on Interpretable Active Learning. Active learning is a special case of semi-supervised machine learning in which a learning algorithm is able to interactively query the user (or some other information source) to obtain the desired outputs at new data points. In this domain, an explanation is of interest to the labeler: "Why am I being asked these questions and why is it worth it to answer?". These algorithms are often applied in drug discovery, where it is expensive (in terms of time or money) to label a query; the labeler in these contexts is often a domain expert in their own right (e.g., a chemist). Therefore, a query explanation can serve both to justify an expensive request and to let the domain expert give feedback to the model. Another great presentation was made on Interpreting Classifiers through Attribute Interactions in Datasets. The authors introduce a method called ASTRID (Automatic STRucture IDentification), which investigates which attribute interactions classifiers exploit when making predictions.
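A common query strategy in active learning, and one whose queries are easy to explain, is uncertainty sampling: ask for the label of the point the current model is least sure about. A minimal sketch with a fixed toy logistic model standing in for the trained classifier (all values below are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)

# Pool of unlabeled 1-D points and a toy logistic "current model".
pool = rng.uniform(-3, 3, size=(100, 1))
w, b = 1.5, 0.0

def predict_proba(x):
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

# Uncertainty sampling: query the point whose predicted probability is
# closest to 0.5 -- the point the model is least sure about. The margin
# |p - 0.5| itself doubles as a simple query explanation for the labeler.
probs = predict_proba(pool).ravel()
uncertainty = -np.abs(probs - 0.5)      # higher = more uncertain
query_idx = int(np.argmax(uncertainty))
query_point = pool[query_idx]
```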

Optimization

A really interesting tutorial was given by Zeyuan Allen-Zhu (Microsoft Research) on Recent Advances in Stochastic Convex and Non-Convex Optimization. It was a great overview of state-of-the-art optimization methods based on stochastic gradient descent (SGD), for both convex and non-convex tasks. It answered different questions with theoretical support, such as "How can we properly use momentum to speed up SGD?" or "What is the maximum parallel speed-up we can achieve for SGD?".
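To make the momentum question concrete, here is a small numpy sketch (our own example, not from the tutorial) comparing plain gradient descent with heavy-ball momentum on an ill-conditioned quadratic, where momentum's acceleration is easiest to see:

```python
import numpy as np

# Ill-conditioned quadratic f(x) = 0.5 * x^T A x; condition number 100.
A = np.diag([1.0, 100.0])

def grad(x):
    return A @ x

def run(momentum, steps=100, lr=0.009):
    """Gradient descent with heavy-ball momentum (momentum=0 -> plain GD)."""
    x = np.array([1.0, 1.0])
    v = np.zeros(2)
    for _ in range(steps):
        v = momentum * v - lr * grad(x)   # velocity accumulates past gradients
        x = x + v
    return 0.5 * x @ A @ x                # final loss

plain = run(momentum=0.0)
with_momentum = run(momentum=0.9)
# with_momentum reaches a much lower loss in the same number of steps:
# momentum damps oscillation along the steep axis while accelerating
# progress along the shallow one.
```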

Paper Awards

The Test of Time Award recognizes research that has had long-lasting influence. This year's award went to Sylvain Gelly and David Silver for the paper Combining Online and Offline Knowledge in UCT, presented at ICML 2007. They showed how their work influenced the later success of AlphaGo – which won against a professional Go player – using UCT and, later, deep RL, an event that has had a significant impact on the whole field. Such retrospective awards and the discussions that follow are always very interesting, and they can give us hints on what to expect in the future.

The paper Understanding Black-box Predictions via Influence Functions received the ICML 2017 Best Paper Award. It addresses the question of how one can explain the output of a black-box model – a key question for many machine learning systems deployed in production. One could address this problem by perturbing the data and retraining, which would be very costly. Instead, the authors apply a classic technique from robust statistics (influence functions) that can answer this question to a good extent.
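The key identity is that the effect of removing a training point on a test loss can be approximated from gradients and the Hessian, with no retraining. A sketch on linear least squares, where both the influence-function prediction and the exact leave-one-out answer are cheap to compute (the data and indices below are our own toy setup, not the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 50, 2
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=n)

# Fit least squares; per-point loss is 0.5 * (x @ theta - y)^2.
theta = np.linalg.solve(X.T @ X, X.T @ y)
H = X.T @ X / n                          # Hessian of the average loss

x_test, y_test = X[0], y[0]              # treat one point as the "test" query

def influence(i):
    """Influence-function estimate of the change in test loss if training
    point i is removed: grad_test^T H^{-1} grad_i / n (no retraining)."""
    grad_train = (X[i] @ theta - y[i]) * X[i]
    grad_test = (x_test @ theta - y_test) * x_test
    return grad_test @ np.linalg.solve(H, grad_train) / n

# Compare against actual leave-one-out retraining for one point.
j = 3
mask = np.arange(n) != j
theta_loo = np.linalg.solve(X[mask].T @ X[mask], X[mask].T @ y[mask])
actual_change = 0.5 * (x_test @ theta_loo - y_test) ** 2 \
              - 0.5 * (x_test @ theta - y_test) ** 2
predicted_change = influence(j)   # first-order approximation of the above
```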

Overall, it was a very interesting conference. It is an exciting time for us to be in this field. We are looking forward to ICML 2018!

Post written by:

Fanny Riols – Software Engineer R&D, Grzegorz Haranczyk – Senior Software Engineer R&D, Hamid Maei – Staff Research Scientist R&D, Tarik Berrada Hmima – Senior Software Engineer R&D