Criteo was once again very active at KDD 2017. The Conference on Knowledge Discovery and Data Mining was held last August in Halifax, Canada. This post highlights the subjects that caught our attention during the 5 days.
To put everyone on the same page, this workshop started from the basics, from what a neuron is to how training is done, and also covered the essential notions: Word2Vec, convolution, RNNs and Long Short-Term Memory (LSTM).
It then described the recent architecture changes to the Job Recommendation model at LinkedIn, which is an extension of the Wide & Deep architecture. The basic ingredients of this model are the user features on the one hand and the job features on the other: they train a user embedding and a job embedding jointly, and they juxtapose this with a Gradient Boosted Decision Trees (GBDT) transformation; on top is the good old Logistic Regression. The embedding part makes it possible to account for high-dimensional features (text!); the GBDT part produces high-order feature crossings without going through a tedious cross-feature selection process. The Neural Network model, the GBDT model and the Logistic Regression are trained jointly via successive optimization steps on each of the three blocks. For the Neural Network part, hyperparameter tuning is achieved through distributed grid search.
Finally we had a description of the architecture used at LinkedIn for Job Search; an interesting takeaway was the use of tri-letter hashing to overcome the well-known challenges of query processing: misspellings, word inflections and free text search.
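To make the tri-letter idea concrete, here is a minimal sketch of letter-trigram hashing. It is only an illustration under our own assumptions, not LinkedIn's implementation: the function names, the `#` boundary marker, the CRC32 hash and the number of buckets are all hypothetical choices.

```python
import zlib


def letter_trigrams(word):
    """Return the letter trigrams of a word, with '#' marking its boundaries."""
    padded = f"#{word.lower()}#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def trigram_hash_vector(text, n_buckets=4096):
    """Map free text to a sparse {bucket: count} vector of hashed trigrams.

    n_buckets is an arbitrary, illustrative size for the hashed space.
    """
    vector = {}
    for word in text.split():
        for trigram in letter_trigrams(word):
            bucket = zlib.crc32(trigram.encode("utf-8")) % n_buckets
            vector[bucket] = vector.get(bucket, 0) + 1
    return vector


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v.get(bucket, 0) for bucket, count in u.items())
    norm_u = sum(c * c for c in u.values()) ** 0.5
    norm_v = sum(c * c for c in v.values()) ** 0.5
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


query = trigram_hash_vector("engineer")
misspelled = trigram_hash_vector("enginer")
unrelated = trigram_hash_vector("nurse")
```

Because a misspelled or inflected word still shares most of its trigrams with the original, `cosine(query, misspelled)` stays high while `cosine(query, unrelated)` stays close to zero, which is the property that makes this representation robust for query processing.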
Although Deep Learning makes it possible to automate the crafting of new features, it requires tons of data and plenty of GPU power, as well as carefully engineered model architectures; this is why “classical” Feature Selection remains a crucial topic. Furthermore, selecting features among the set of existing features has many benefits: it helps preserve the readability and interpretability of the model, it reduces the learning time and it speeds up online prediction; the last point is a challenge for us in the context of real-time auctions for online advertising.
This tutorial presented the broad categories of methods, and described the pros and cons for each of them: whether they handle feature redundancy, whether they handle numerical data as well as discretized data, or whether they yield interpretable results. But the true focus was on the recent advances to tackle the following challenges: volume, heterogeneous data, structures between features, and finally streaming data and features (the number of features can be time-dependent, unknown, and even infinite).
Finally, we were pointed to a Python toolkit developed by Arizona State University, which already implements 40 Feature Selection methods.
AdKDD & Target Ad workshop
Criteo AI Lab co-organized a Workshop on Advertising that we introduced in our earlier post. It was a huge success, with an overbooked room all day long and excellent talks by our invited speakers. Thanks again to the organizers for their hard work!
Causal Inference and Counterfactual Learning were in the spotlight, with 3 invited speakers presenting various facets of these topics. Measuring the causal impact of advertising events is key to shaping the online advertising ecosystem in the coming years, and all the actors in the industry will benefit from such a move. This involves important business changes (such as conversion attribution), but research on causal inference has a huge role to play in this area.
Another interesting aspect of this workshop was the diversity of the contributions: optimal reserve price on the publisher side, conversion modeling and attribution considerations on the DSP side, and several contributions on the metrics side. Each entity seems to be optimizing for its own objectives. While the use of GSP auctions was supposed to introduce some stability in the ecosystem, things are moving (header bidding is a good example), and each entity has to anticipate and react to the changes in the environment. Improving click or conversion predictions, even if there are still significant improvements to be made in this area (e.g. scaling up deep nets while keeping latency low), might not be the main research focus in the coming years. Game Theory, Causal Inference and Reinforcement Learning might be the next directions to look at.
To conclude on the workshop, let’s congratulate Yeming Shi, Ori Stitelman and Claudia Perlich from Dstillery. They won the Best Paper award (based on peer review scores) for their paper “Blacklisting the Blacklist in Online Advertising”. This paper relies on the following observation: publishers sometimes blacklist brands, for example when they have exclusive deals with other brands. This translates into an unusually low win rate for this brand in the auctions on this publisher. The paper describes an explore-exploit strategy to detect blacklisted brand-publisher pairs. When a pair is discovered, one can stop bidding for it for some time and bid for the next best candidate instead; this results in a better win rate while significantly reducing server load. As not every publisher reports its blacklisted brands, this method is simple and impactful enough to be highlighted.
Let’s also congratulate our Criteo team who won the Runner-up prize for the paper “Attribution Modeling Increases Efficiency of Bidding in Display Advertising”. You can find more context about the latter in one of our previous blogposts.
Also a huge shout out to the organizers of this workshop: Kun Liu (LinkedIn), Kuang-Chih Lee (Yahoo Research), Abraham Bagherjeiran (A9), Nemanja Djuric (Uber), Vladan Radosavljevic (Uber), Mihajlo Grbovic (AirBnB) and Suju Rajan (VP research at Criteo).
Online A/B testing
Everyone knows about A/B testing: split your user base into two independent groups, and apply a different treatment for each of them. The challenging part in today’s world is to be able to get fast, reliable and unbiased results.
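As a reminder of what the split can look like in practice, here is a minimal sketch of a deterministic assignment based on hashing. The function name, the experiment label and the 50/50 share are hypothetical; this is one common way to do it, not a description of any specific platform.

```python
import hashlib


def assign_group(user_id, experiment, treatment_share=0.5):
    """Deterministically assign a user to group 'A' or 'B'.

    Hashing the experiment name together with the user id keeps the
    assignment stable for a given user and decorrelated across experiments.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Turn the first 8 bytes of the digest into a number in [0, 1).
    fraction = int.from_bytes(digest[:8], "big") / 2 ** 64
    return "B" if fraction < treatment_share else "A"


groups = [assign_group(f"user-{i}", "new-banner") for i in range(10000)]
share_b = groups.count("B") / len(groups)
```

With this kind of scheme a user always lands in the same group without any stored state, and `share_b` stays close to the requested treatment share.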
This is precisely what Ramesh Johari, Leonid Pekelis, Pete Koomen and David Walsh, from Stanford University and Optimizely, Inc., achieve in their paper “Peeking at A/B Tests: Why it matters, and what to do about it”. They remind us that for traditional statistical tests and p-values to be valid, the time at which one decides to stop the experiment and draw conclusions must be independent of the results themselves. However, most people continuously look at A/B test results until they become statistically significant, which breaks the independence assumption. The paper defines “always valid” p-values, whose guarantees hold whenever the experiment is stopped, even when the stopping time depends on the observed results, so continuous monitoring no longer invalidates the conclusions. This way, one gets results faster than by fixing a time window ahead of time, and more reliably than by stopping as soon as the traditional statistical test is conclusive.
In another paper, “A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments”, Pavel Dmitriev, Somit Gupta, Dong Woo Kim and Garnet Vaz from Microsoft Corporation showed how easy it is to get biased A/B test results without knowing it. Assuming that the A/B test obviously has a neutral impact on a given metric when it does not, ignoring the presence of bots or outliers in the data, or stopping early as discussed in the paper above, are clear examples of cases where the results would be biased. These are not pitfalls you would fall into, would you?
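The effect of peeking is easy to reproduce with a small simulation. The sketch below runs A/A tests (both groups share the same conversion rate, so any detection is a false positive) and compares three decision rules: testing once at the end, stopping at the first significant peek, and peeking with a p-value built from a mixture sequential probability ratio test, the construction behind always valid p-values. It is an illustration under our own assumptions and not the authors' implementation: it uses a normal approximation with a plug-in variance, and the function names, the prior variance `tau2` and the simulation sizes are arbitrary choices.

```python
import math
import random

ALPHA = 0.05
Z_CRIT = 1.96  # two-sided 5% threshold of the classical z-test


def z_statistic(conv_a, conv_b, n):
    """Two-proportion z statistic for two arms of n users each."""
    pooled = (conv_a + conv_b) / (2 * n)
    variance = 2 * pooled * (1 - pooled) / n
    return 0.0 if variance == 0 else (conv_a - conv_b) / n / math.sqrt(variance)


def mixture_likelihood_ratio(conv_a, conv_b, n, tau2):
    """Normal-approximation mixture likelihood ratio against 'no difference'.

    tau2 is the variance of the N(0, tau2) prior mixed over the true effect.
    """
    pooled = (conv_a + conv_b) / (2 * n)
    sigma2 = pooled * (1 - pooled)
    if sigma2 == 0:
        return 1.0
    diff = (conv_a - conv_b) / n
    denom = 2 * sigma2 + n * tau2
    exponent = n * n * tau2 * diff * diff / (4 * sigma2 * denom)
    return math.sqrt(2 * sigma2 / denom) * math.exp(min(exponent, 700.0))


def simulate_aa_tests(n_tests=400, n_users=2000, peek_every=100,
                      rate=0.1, tau2=0.0025, seed=0):
    """Return the false positive rates of three rules on A/A tests:
    fixed horizon, naive peeking, and peeking with a sequential p-value.
    n_users must be a multiple of peek_every so the last peek is the horizon.
    """
    rng = random.Random(seed)
    fixed = peeking = always_valid = 0
    for _ in range(n_tests):
        conv_a = conv_b = 0
        z = 0.0
        crossed = False
        p_value = 1.0
        for n in range(1, n_users + 1):
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if n % peek_every == 0:
                z = z_statistic(conv_a, conv_b, n)
                crossed = crossed or abs(z) > Z_CRIT
                ratio = mixture_likelihood_ratio(conv_a, conv_b, n, tau2)
                # The sequential p-value can only decrease over time.
                p_value = min(p_value, 1.0 / ratio)
        fixed += abs(z) > Z_CRIT
        peeking += crossed
        always_valid += p_value <= ALPHA
    return fixed / n_tests, peeking / n_tests, always_valid / n_tests
```

Calling `simulate_aa_tests()` returns the three false positive rates in that order. Since the final peek is also the fixed-horizon test, the naive peeking rate can never be lower than the fixed-horizon one, and in practice it is several times larger, while the sequential p-value is designed to keep the error rate under control despite the repeated looks.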
Survival analysis (SA) is a field of statistics that focuses on modeling time to events. While it is widely used in medical applications, e.g. to predict time to hospital readmission, its use in other domains is still limited. However, the potential of SA, for example in online services, is huge: next user visit/click/sale prediction, user churn modeling, dwell time prediction in search engines, etc. Several recent techniques help bridge the gap between SA and classical Machine Learning techniques, but we think there are still many open research areas in this field. A detailed survey of SA was presented during KDD by Chandan Reddy (Virginia Tech) and Yan Li (University of Michigan).
In his talk during the AdKDD workshop, Alex Smola (Amazon) proposed a deep net using LSTMs to model user engagement, which largely outperforms the state-of-the-art Cox survival model. This shows that there are still significant improvements to be made in that area.
There were also several relevant papers at KDD, either directly involving SA or addressing related problems:
Huayu Li et al., in “Prospecting the Career Development of Talents: A Survival Analysis Perspective”, proposed a new SA approach to model turnover in career progressions. They use a multi-task approach to model both recurrent and non-recurrent events in the timeline. Ben Chamberlain et al. proposed to use embeddings to estimate customer lifetime value in “Customer Life Time Value (CLTV) Prediction Using Embeddings”. Finally, in their paper “RUSH! Targeted Time-limited Coupons via Purchase Forecasts”, Emaad Manzoor et al. proposed to model the timing and duration of online discounts. They do not explicitly use an SA approach but propose an interesting time-varying Hawkes process. It would be interesting to see how both approaches compare on this task.
At Criteo we can use SA-like techniques to model conversion delays after clicks: a conversion might occur up to 30 days after the click, and we cannot wait for full feedback to update the models. We can also model the time to the next visit on an advertiser’s webpage. Stay tuned!
We look forward to participating in and contributing to next year’s edition!
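For readers new to SA, the key difficulty is censoring: for some users the event has simply not happened yet when we look at the data. A classical building block that handles this is the Kaplan-Meier estimator of the survival curve. The sketch below is a plain-Python illustration with hypothetical names and toy data; it is not taken from the survey or from any of the papers mentioned here.

```python
def kaplan_meier(durations, observed):
    """Kaplan-Meier estimate of the survival function.

    durations: time until the event or until censoring, one per subject.
    observed:  True if the event was seen, False if the subject was censored.
    Returns a list of (time, survival probability) at each event time.
    """
    # Count events and censorings at each distinct time.
    counts = {}
    for time, event in zip(durations, observed):
        events, censored = counts.get(time, (0, 0))
        counts[time] = (events + 1, censored) if event else (events, censored + 1)

    at_risk = len(durations)
    survival = 1.0
    curve = []
    for time in sorted(counts):
        events, censored = counts[time]
        if events:
            # Fraction of the subjects still at risk that survive this time.
            survival *= 1 - events / at_risk
            curve.append((time, survival))
        # Both events and censored subjects leave the risk set afterwards.
        at_risk -= events + censored
    return curve


# Toy data: days until the next visit; False means "not yet returned".
curve = kaplan_meier([1, 2, 2, 3, 4], [True, True, False, True, False])
```

Censored subjects never count as events, but they do count in the risk set up to their censoring time, which is what distinguishes this estimator from a naive empirical distribution of the observed delays.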
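To illustrate the conversion delay problem, here is a minimal sketch of one possible model, not a description of our production system. It assumes that each click converts with probability `p` and that the delay of converting clicks is exponential with rate `lam`; clicks without a conversion so far are treated as censored, and both parameters are fitted with a few EM iterations. All names and the simulated data are hypothetical.

```python
import math
import random


def fit_delayed_conversions(delays, elapsed_unconverted, n_iter=200):
    """Fit (p, lam) by EM.

    delays: observed click-to-conversion delays of converted clicks.
    elapsed_unconverted: time elapsed since each click with no conversion yet.
    """
    n_conv = len(delays)
    sum_delays = sum(delays)
    if n_conv == 0 or sum_delays <= 0:
        raise ValueError("need at least one conversion with a positive delay")
    n_total = n_conv + len(elapsed_unconverted)
    p, lam = 0.5, n_conv / (n_conv + sum_delays)  # arbitrary starting point
    for _ in range(n_iter):
        # E-step: probability that an unconverted click will still convert.
        weights = []
        for elapsed in elapsed_unconverted:
            pending = p * math.exp(-lam * elapsed)
            weights.append(pending / (1.0 - p + pending))
        # M-step: closed-form updates of the two parameters.
        p = (n_conv + sum(weights)) / n_total
        exposure = sum(w * e for w, e in zip(weights, elapsed_unconverted))
        lam = n_conv / (sum_delays + exposure)
    return p, lam


def simulate_clicks(n_clicks, p, lam, max_age, seed=0):
    """Simulate clicks of random age; only conversions that already happened
    are observed, the others look like non-converting clicks."""
    rng = random.Random(seed)
    delays, elapsed_unconverted = [], []
    for _ in range(n_clicks):
        age = rng.uniform(0.0, max_age)
        delay = rng.expovariate(lam)
        if rng.random() < p and delay <= age:
            delays.append(delay)
        else:
            elapsed_unconverted.append(age)
    return delays, elapsed_unconverted
```

On simulated data, e.g. `fit_delayed_conversions(*simulate_clicks(5000, 0.3, 0.2, 30.0))`, the naive conversion rate `len(delays) / n_clicks` underestimates `p` because recent clicks have not had time to convert, while the fitted `p` corrects for the pending conversions.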
Post written by:
Julien Meynet – Senior R&D Scientist R&D, Clotilde Guinard – Software Engineer R&D, Julien Weber – Engineering Manager R&D