Session 3: Metrics and Evaluation

Date: Monday 18:00-19:30
Chair: Dietmar Jannach (University of Klagenfurt)

  • PATowards Unified Metrics for Accuracy and Diversity for Recommender Systems
    by Javier Parapar (University of A Coruña, Spain) and Filip Radlinski (Google, United Kingdom)

    Recommender systems evaluation has evolved rapidly in recent years. However, for offline evaluation, accuracy is the de facto standard for assessing the superiority of one method over another, with most research comparisons focused on tasks ranging from rating prediction to ranking metrics for top-n recommendation. Simultaneously, recommendation diversity and novelty have become recognized as critical to users’ perceived utility, with several new metrics recently proposed for evaluating these aspects of recommendation lists. Consequently, the accuracy-diversity dilemma frequently shows up as a choice to make when creating new recommendation algorithms.
    We propose a novel adaptation of a unified metric, derived from one commonly used for search system evaluation, to Recommender Systems. The proposed metric combines topical diversity and accuracy, and we show it to satisfy a set of desired properties that we formulate axiomatically. These axioms are defined as fundamental constraints that a good unified metric should always satisfy. Moreover, beyond the axiomatic analysis, we present an experimental evaluation of the metric with collaborative filtering data. Our analysis shows that the metric respects the desired theoretical constraints and behaves as expected when performing offline evaluation.

    Full text in ACM Digital Library

  • PAValues of User Exploration in Recommender Systems
    by Minmin Chen (Google, United States), Yuyan Wang (Google Research, Brain Team, United States), Can Xu (Google Inc, United States), Ya Le (Google AI, United States), Mohit Sharma (Google, United States), Lee Richardson (Google, United States), Su-Lin Wu (Google, USA), and Ed Chi (Google, United States)

    Reinforcement Learning (RL) has been sought after to bring next-generation recommender systems to further improve user experience on recommendation platforms. While the exploration-exploitation tradeoff is the foundation of RL research, the value of exploration in (RL-based) recommender systems is less well understood. Exploration, commonly seen as a tool to reduce model uncertainty in regions of sparse user interaction/feedback, is believed to cost user experience in the short term, while the indirect benefit of better model quality arrives at a later time. We focus on another aspect of exploration, which we refer to as user exploration to help discover new user interests, and argue it can improve user experience even in the more imminent term.
    We examine the role of user exploration in changing different facets of recommendation quality that more directly impact user experience. To do so, we introduce a series of methods inspired by exploration research in RL to increase user exploration in an RL-based recommender system, and study their effect on the end recommendation quality, more specifically, on accuracy, diversity, novelty and serendipity. We propose a set of metrics to measure (RL based) recommender systems in these four aspects and evaluate the impact of exploration-induced methods against these metrics. In addition to the offline measurements, we conduct live experiments on an industrial recommendation platform serving billions of users to showcase the benefit of user exploration. Moreover, we use conversion of casual users to core users as an indicator of the holistic long-term user experience and study the values of user exploration in helping platforms convert users. Through offline analyses and live experiments, we study the correlation between these four facets of recommendation quality and long term user experience, and connect serendipity to improved long term user experience.

    Full text in ACM Digital Library

  • PAOnline Evaluation Methods for the Causal Effect of Recommendations
    by Masahiro Sato (independent researcher, Japan)

    Evaluating the causal effect of recommendations is an important objective because the causal effect on user interactions can directly leads to an increase in sales and user engagement. To select an optimal recommendation model, it is common to conduct A/B testing to compare model performance. However, A/B testing of causal effects requires a large number of users, making such experiments costly and risky. We therefore propose the first interleaving methods that can efficiently compare recommendation models in terms of causal effects. In contrast to conventional interleaving methods, we measure the outcomes of both items on an interleaved list and items not on the interleaved list, since the causal effect is the difference between outcomes with and without recommendations. To ensure that the evaluations are unbiased, we either select items with equal probability or weight the outcomes using inverse propensity scores. We then verify the unbiasedness and efficiency of online evaluation methods through simulated online experiments. The results indicate that our proposed methods are unbiased and that they have superior efficiency to A/B testing.

    Full text in ACM Digital Library

  • REPReenvisioning the comparison between Neural Collaborative Filtering and Matrix Factorization
    by Vito Walter Anelli (Politecnico di Bari, Italy), Alejandro Bellogín (Universidad Autónoma de Madrid, Spain), Tommaso Di Noia (Politecnico di Bari, Italy), and Claudio Pomo (Politecnico di Bari, Italy)

    Collaborative filtering models based on matrix factorization and learned similarities using Artificial Neural Networks (ANNs) have gained significant attention in recent years. This is, in part, because ANNs have demonstrated very good results in a wide variety of recommendation tasks. However, the introduction of ANNs within the recommendation ecosystem has been recently questioned, raising several comparisons in terms of efficiency and effectiveness. One aspect most of these comparisons have in common is their focus on accuracy, neglecting other evaluation dimensions important for the recommendation, such as novelty, diversity, or accounting for biases. In this work, we replicate experiments from three different papers that compare Neural Collaborative Filtering (NCF) and Matrix Factorization (MF), to extend the analysis to other evaluation dimensions. First, our contribution shows that the experiments under analysis are entirely reproducible, and we extend the study including other accuracy metrics and two statistical hypothesis tests. Second, we investigated the Diversity and Novelty of the recommendations, showing that MF provides a better accuracy also on the long tail, although NCF provides a better item coverage and more diversified recommendation lists. Lastly, we discuss the bias effect generated by the tested methods. They show a relatively small bias, but other recommendation baselines, with competitive accuracy performance, consistently show to be less affected by this issue. This is the first work, to the best of our knowledge, where several complementary evaluation dimensions have been explored for an array of state-of-the-art algorithms covering recent adaptations of ANNs and MF. Hence, we aim to show the potential these techniques may have on beyond-accuracy evaluation while analyzing the effect on reproducibility these complementary dimensions may spark. The code to reproduce the experiments is publicly available on GitHub at

    Full text in ACM Digital Library

  • PAEvaluating the Robustness of Off-Policy Evaluation
    by Yuta Saito (Hanjuku-kaso Co., Ltd., Japan), Takuma Udagawa (Sony Group Corporation, Japan), Haruka Kiyohara (Tokyo Institute of Technology, Japan), Kazuki Mogi (Stanford University, United States), Yusuke Narita (Yale University, United States), and Kei Tateno (Sony Group Corporation, Japan)

    Off-policy Evaluation (OPE), or offline evaluation in general, evaluates the performance of hypothetical policies leveraging only offline log data. It is particularly useful in applications where the online interaction involves high stakes and expensive setting such as precision medicine and recommender systems. Since many OPE estimators have been proposed and some of them have hyperparameters to be tuned, there is an emerging challenge for practitioners to select and tune OPE estimators for their specific application. Unfortunately, identifying a reliable estimator from results reported in research papers is often difficult because the current experimental procedure evaluates and compares the estimators’ performance on a narrow set of hyperparameters and evaluation policies. Therefore, it is difficult to know which estimator is safe and reliable to use. In this work, we develop Interpretable Evaluation for Offline Evaluation (IEOE), an experimental procedure to evaluate OPE estimators’ robustness to changes in hyperparameters and/or evaluation policies in an interpretable manner. Then, using the IEOE procedure, we perform extensive evaluation of a wide variety of existing estimators on Open Bandit Dataset, a large-scale public real-world dataset for OPE. We demonstrate that our procedure can evaluate the estimators’ robustness to the hyperparamter choice, helping us avoid using unsafe estimators. Finally, we apply IEOE to real-world e-commerce platform data and demonstrate how to use our protocol in practice.

    Full text in ACM Digital Library

Platinum Supporters
Gold Supporters
Silver Supporters
Special Supporter