Session 12: Evaluation

Date: Thursday, September 21, 4:05 PM – 5:25 PM (GMT+8)
Room: Hall 406D
Session Chair: Christine Bauer
Parallel with: Session 11: Sequential Recommendation 2

  • [RES] How Should We Measure Filter Bubbles? A Regression Model and Evidence for Online News
    by Lien Michiels (University of Antwerp), Jorre Vannieuwenhuyze (Statistiek Vlaanderen), Jens Leysen (University of Antwerp), Robin Verachtert (Froomle NV), Annelien Smets (imec-SMIT, Vrije Universiteit Brussel) and Bart Goethals (University of Antwerp).

    News media play an important role in democratic societies. Central to fulfilling this role is the premise that users should be exposed to diverse news. However, news recommender systems are gaining popularity on news websites, which has sparked concerns over filter bubbles. Editors, policy-makers and scholars are worried that news recommender systems may expose users to less diverse content over time. To the best of our knowledge, this hypothesis has not been tested in a longitudinal observational study of real users that interact with a real news website. Such observational studies require the use of research methods that are robust and can account for the many covariates that may influence the diversity of recommendations at any given time. In this work, we propose an analysis model to study whether the variety of articles recommended to a user decreases over time, in observational studies of real news websites with real users. Further, we present results from two case studies using aggregated and anonymized data that were collected by two western European news websites employing a collaborative filtering-based news recommender system to serve (personalized) recommendations to their users. Through these case studies we validate empirically that our modeling assumptions are sound and supported by the data, and that our model obtains more reliable and interpretable results than analysis methods used in prior empirical work on filter bubbles. Our case studies provide evidence of a small decrease in the topic variety of a user’s recommendations in the first weeks after they sign up, but no evidence of a decrease in political variety.

    Full text in ACM Digital Library
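
    The analysis model in this paper is a regression fitted to longitudinal observational data. As a hypothetical illustration only (not the authors' specification), the sketch below fits a mixed-effects regression of a per-user weekly topic-variety score on weeks since signup, with a user-activity covariate and user-level random intercepts via statsmodels; the file name and all column names are assumptions.

    ```python
    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical longitudinal data: one row per user-week.
    # Assumed columns:
    #   topic_variety       - variety of topics among that week's recommendations
    #   weeks_since_signup  - user tenure at the time of the observation
    #   n_reads             - user activity level (an example covariate)
    #   user_id             - grouping variable for random intercepts
    df = pd.read_csv("recommendation_weeks.csv")

    # Mixed-effects regression: does topic variety change with tenure,
    # after accounting for how much the user reads?
    model = smf.mixedlm(
        "topic_variety ~ weeks_since_signup + n_reads",
        data=df,
        groups=df["user_id"],
    )
    result = model.fit()
    print(result.summary())

    # A negative, significant coefficient on weeks_since_signup would be
    # consistent with recommendation variety decreasing over time.
    ```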

  • [REP] Everyone’s a Winner! On Hyperparameter Tuning of Recommendation Models
    by Faisal Shehzad (University of Klagenfurt) and Dietmar Jannach (University of Klagenfurt).

    The performance of a recommender system algorithm in terms of common offline accuracy measures often strongly depends on the chosen hyperparameters. Therefore, when comparing algorithms in offline experiments, we can obtain reliable insights regarding the effectiveness of a newly proposed algorithm only if we compare it to a number of state-of-the-art baselines that are carefully tuned for each of the considered datasets. While this fundamental principle of any area of applied machine learning is undisputed, we find that the tuning process for the baselines is barely documented in much of today’s published research. Ultimately, if the baselines are not carefully tuned, reported progress may remain unclear. In this paper, we showcase how every method in such an unsound comparison can be reported to be outperforming the state-of-the-art. Finally, we reiterate appropriate research practices to avoid unreliable algorithm comparisons in the future.

    Full text in ACM Digital Library
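
    The paper’s argument is that offline comparisons are only meaningful if every baseline is tuned per dataset with a comparable budget. The snippet below is a generic random-search sketch of such a protocol, not the authors’ code; build_model, evaluate_ndcg, and the search space are hypothetical placeholders.

    ```python
    import random

    def tune(build_model, evaluate_ndcg, train, valid, search_space, n_trials=50, seed=0):
        """Random-search tuning; give every algorithm in a comparison the same budget."""
        rng = random.Random(seed)
        best_score, best_params = float("-inf"), None
        for _ in range(n_trials):
            params = {name: rng.choice(values) for name, values in search_space.items()}
            model = build_model(**params)          # hypothetical model constructor
            model.fit(train)
            score = evaluate_ndcg(model, valid)    # hypothetical validation metric
            if score > best_score:
                best_score, best_params = score, params
        return best_params, best_score

    # Illustrative search space for a matrix-factorization baseline:
    search_space = {
        "factors": [16, 32, 64, 128],
        "regularization": [1e-4, 1e-3, 1e-2, 1e-1],
        "learning_rate": [1e-3, 1e-2, 1e-1],
    }
    ```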

  • [RES] What We Evaluate When We Evaluate Recommender Systems: Understanding Recommender Systems’ Performance using Item Response Theory
    by Yang Liu (University of Helsinki), Alan Medlar (University of Helsinki) and Dorota Glowacka (University of Helsinki).

    Current practices in offline evaluation use rank-based metrics to measure the quality of recommendation lists. This approach has practical benefits as it centers assessment on the output of the recommender system and, therefore, measures performance from the perspective of end-users. However, this methodology neglects how recommender systems more broadly model user preferences, which is not captured by only considering the top-n recommendations. In this article, we use item response theory (IRT), a family of latent variable models used in psychometric assessment, to gain a comprehensive understanding of offline evaluation. We used IRT to jointly estimate the latent abilities of 51 recommendation algorithms and the characteristics of 3 commonly used benchmark data sets. For all data sets, the latent abilities estimated by IRT suggest that higher scores from traditional rank-based metrics do not reflect improvements in modeling user preferences. Furthermore, we show that the top-n recommendations with the most discriminatory power are biased towards lower-difficulty items, leaving much room for improvement. Lastly, we highlight the role of popularity in evaluation by investigating how user engagement and item popularity influence recommendation difficulty.

    Full text in ACM Digital Library
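
    Item response theory models each (algorithm, test item) outcome as a binary response governed by a per-algorithm ability and per-item difficulty and discrimination parameters. The sketch below fits a plain two-parameter logistic (2PL) model by maximum likelihood on a toy response matrix; it is an assumed, simplified formulation for illustration, not the estimation procedure used in the paper.

    ```python
    import numpy as np
    from scipy.optimize import minimize
    from scipy.special import expit  # logistic sigmoid

    # Toy binary response matrix: rows = algorithms, cols = evaluation items,
    # R[a, i] = 1 if algorithm a "succeeds" on item i (e.g., a top-n hit).
    rng = np.random.default_rng(0)
    R = (rng.random((51, 60)) < 0.4).astype(float)
    n_alg, n_item = R.shape

    def unpack(theta):
        ability = theta[:n_alg]
        difficulty = theta[n_alg:n_alg + n_item]
        discrimination = np.exp(theta[n_alg + n_item:])  # keep positive
        return ability, difficulty, discrimination

    def neg_log_likelihood(theta):
        ability, difficulty, discrimination = unpack(theta)
        # 2PL: P(success) = sigmoid(discrimination_i * (ability_a - difficulty_i))
        p = expit(discrimination * (ability[:, None] - difficulty))
        eps = 1e-9
        return -np.sum(R * np.log(p + eps) + (1 - R) * np.log(1 - p + eps))

    # In practice 2PL models need identification constraints or priors;
    # this unconstrained fit is only meant to show the model's structure.
    theta0 = np.zeros(n_alg + 2 * n_item)
    fit = minimize(neg_log_likelihood, theta0, method="L-BFGS-B")
    ability, difficulty, discrimination = unpack(fit.x)
    print("Estimated abilities (first 5 algorithms):", np.round(ability[:5], 2))
    ```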

  • [IND] Identifying Controversial Pairs in Item-to-Item Recommendations
    by Junyi Shen (Apple), Dayvid Rodrigues de Oliveira (Apple), Jin Cao (Apple), Brian Knott (Apple), Goodman Gu (Apple), Sindhu Vijaya Raghavan (Apple) and Rob Monarch (Apple).

    Recommendation systems in large-scale online marketplaces are essential to aiding users in discovering new content. However, state-of-the-art systems for item-to-item recommendation tasks are often based on a shallow level of contextual relevance, which can make the system insufficient for tasks where item relationships are more nuanced. Contextually relevant item pairs can sometimes have controversial or problematic relationships, and they could degrade user experiences and brand perception when recommended to users. For example, a recommendation of a divorce and co-parenting book can create a disturbing experience for someone who is downloading or viewing a marriage therapy book. In this paper, we propose a classifier to identify and prevent such problematic item-to-item recommendations and to enhance overall user experiences. The proposed approach utilizes active learning to sample hard examples effectively across sensitive item categories and uses human raters for data labeling. We also perform offline experiments to demonstrate the efficacy of this system for identifying and filtering controversial recommendations while maintaining recommendation quality.

    Full text in ACM Digital Library
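
    The described pipeline pairs a classifier over item pairs with active learning, so that human raters label the examples the model is least certain about. Below is a generic uncertainty-sampling sketch using scikit-learn with made-up pair features; it illustrates the sampling idea only, not the production system described in the paper.

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def select_for_labeling(model, X_unlabeled, batch_size=100):
        """Uncertainty sampling: pick the pairs the classifier is least sure about."""
        proba = model.predict_proba(X_unlabeled)[:, 1]
        uncertainty = np.abs(proba - 0.5)            # 0 = maximally uncertain
        return np.argsort(uncertainty)[:batch_size]  # indices to send to raters

    # Assumed setup: each row of X is a feature vector for an item pair
    # (e.g., embedding similarity, category flags); y marks controversial pairs.
    rng = np.random.default_rng(1)
    X_labeled = rng.normal(size=(500, 16))
    y_labeled = (rng.random(500) < 0.1).astype(int)
    X_pool = rng.normal(size=(10_000, 16))

    clf = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
    to_label = select_for_labeling(clf, X_pool)
    print("Sending", len(to_label), "hard pairs to raters this round.")
    ```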
