
Session 4: Metrics and Evaluation
Date: Wednesday, Oct 8, 8:30-10:00
Moderator: Pablo Castells
- Beyond Clicks: Dwell Time for Personalization
by Xing Yi, Liangjie Hong, Erheng Zhong, Nathan Liu and Suju Rajan
Many internet companies, such as Yahoo, Facebook, Google and Twitter, rely on content recommendation systems to deliver the most relevant content items to individual users through personalization. Delivering such personalized user experiences is believed to increase the long-term engagement of users. While there has been a lot of progress in designing effective personalized recommender systems, by exploiting user interests and historical interaction data through implicit (item click) or explicit (item rating) feedback, directly optimizing for users' satisfaction with the system remains challenging. In this paper, we explore the idea of using item-level dwell time as a proxy to quantify how relevant a content item is to a particular user. We describe a novel method to compute accurate dwell time based on client-side and server-side logging and demonstrate how to normalize dwell time across different devices and contexts. In addition, we demonstrate how to incorporate dwell time into state-of-the-art learning-to-rank techniques and collaborative filtering models and obtain competitive performance in both offline and online settings. This paper is the first work to go beyond "clicks" in optimizing large-scale content recommendation, and outlines a number of interesting future research directions.
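For readers who want to experiment with the idea, the sketch below shows one way item-level dwell time could be computed from open/close timestamps and z-score normalized per device type. It is a minimal illustration under assumed log field names (user_id, item_id, device, open_ts, close_ts), not the normalization procedure described in the paper.

```python
"""Illustrative sketch (not the authors' code): compute item-level dwell
time from pageview events and z-score normalize it within each device type,
assuming a simple log schema with hypothetical field names."""
from collections import defaultdict
from statistics import mean, stdev


def dwell_seconds(event):
    # Raw dwell time: time between opening and leaving the item page.
    return max(0.0, event["close_ts"] - event["open_ts"])


def normalize_per_device(events):
    """Z-score dwell time within each device type so that, for example,
    slower reading on mobile does not dominate the signal from desktop."""
    by_device = defaultdict(list)
    for e in events:
        by_device[e["device"]].append(dwell_seconds(e))

    stats = {
        dev: (mean(vals), stdev(vals) if len(vals) > 1 else 1.0)
        for dev, vals in by_device.items()
    }

    normalized = []
    for e in events:
        mu, sigma = stats[e["device"]]
        z = (dwell_seconds(e) - mu) / (sigma or 1.0)
        normalized.append({**e, "dwell_z": z})
    return normalized


if __name__ == "__main__":
    logs = [
        {"user_id": 1, "item_id": "a", "device": "mobile", "open_ts": 0, "close_ts": 45},
        {"user_id": 1, "item_id": "b", "device": "mobile", "open_ts": 100, "close_ts": 130},
        {"user_id": 2, "item_id": "a", "device": "desktop", "open_ts": 0, "close_ts": 20},
        {"user_id": 2, "item_id": "c", "device": "desktop", "open_ts": 50, "close_ts": 110},
    ]
    for row in normalize_per_device(logs):
        print(row["item_id"], row["device"], round(row["dwell_z"], 2))
```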
- Evaluating Recommender Behavior For New Users
by Daniel Kluver and Joseph Konstan
The new user experience is one of the important problems in recommender systems. Past work on recommending for new users has focused on the process of gathering information from the user. Our work focuses on how different algorithms behave for new users. We describe a methodology that we use to compare representatives of three common families of algorithms along eleven different metrics. We find that for the first few ratings a baseline algorithm performs better than three common collaborative filtering algorithms. Once we have a few ratings, we find that Funk’s SVD algorithm has the best overall performance. We also find that ItemItem, a very commonly deployed algorithm, performs very poorly for new users. Our results can inform the design of interfaces and algorithms for new users.
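As a rough illustration of this kind of new-user study, the sketch below runs a "given-n" style protocol: a simulated new user reveals ratings one at a time, and prediction error is recorded at each profile size. The recommender interface and the global-mean baseline are hypothetical stand-ins, not the algorithms or toolkit used in the paper.

```python
"""Illustrative sketch of a given-n new-user evaluation: measure RMSE on
held-out items as a simulated new user reveals 1..n ratings. The predict()
interface below is a hypothetical stand-in for real recommenders."""
import random
from statistics import mean


def evaluate_new_user(recommender, user_ratings, max_given=10, seed=0):
    """Return RMSE after the user has revealed 1..max_given ratings."""
    rng = random.Random(seed)
    items = list(user_ratings)
    rng.shuffle(items)

    results = {}
    for n in range(1, min(max_given, len(items) - 1) + 1):
        given = {i: user_ratings[i] for i in items[:n]}
        held_out = {i: user_ratings[i] for i in items[n:]}
        errors = [
            (recommender.predict(given, item) - true_rating) ** 2
            for item, true_rating in held_out.items()
        ]
        results[n] = mean(errors) ** 0.5  # RMSE at profile size n
    return results


class GlobalMeanBaseline:
    """Hypothetical baseline: predicts the global mean of training ratings,
    ignoring the new user's revealed profile."""

    def __init__(self, training_ratings):
        self.mu = mean(training_ratings)

    def predict(self, given_profile, item):
        return self.mu


if __name__ == "__main__":
    baseline = GlobalMeanBaseline([4, 3, 5, 2, 4, 4, 3])
    profile = {"m1": 4, "m2": 2, "m3": 5, "m4": 3, "m5": 4, "m6": 1}
    print(evaluate_new_user(baseline, profile, max_given=4))
```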
- Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks
by Alan Said and Alejandro Bellogin
Recommender systems research is often based on comparisons of predictive accuracy: the better the evaluation scores, the better the recommender. However, it is difficult to compare results from different recommender systems due to the many options in design and implementation of an evaluation strategy. Additionally, algorithm implementations can diverge from the standard formulation due to manual tuning and modifications that work better in some situations. In this work we compare common recommendation algorithms as implemented in three popular recommendation frameworks. To provide a fair comparison, we have complete control of the evaluation dimensions being benchmarked: dataset, data splitting, evaluation strategies, and metrics. We also include results using the internal evaluation mechanisms of these frameworks. Our analysis points to large differences in recommendation accuracy across frameworks and strategies, i.e. the same baselines may perform orders of magnitude better or worse across frameworks. Our results show the necessity of clear guidelines when reporting evaluation of recommender systems to ensure reproducibility and comparison of results.
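The sketch below illustrates the kind of controlled harness the paper argues for: one shared data split and one shared metric implementation, with each framework hidden behind a hypothetical adapter callable. It is a minimal illustration of the evaluation-control idea, not the benchmarking setup used in the study; the relevance threshold and adapter interface are assumptions.

```python
"""Illustrative sketch of a controlled benchmarking harness: the split and
the metric are fixed outside the frameworks, and each framework is wrapped
in a hypothetical adapter mapping (train, users) -> ranked item lists."""
import random


def single_split(ratings, test_fraction=0.2, seed=42):
    """One shared random split used for every framework under test."""
    rng = random.Random(seed)
    shuffled = ratings[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def precision_at_k(recommended, relevant, k=10):
    """Shared metric implementation, so scores are comparable across frameworks."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k


def benchmark(adapters, ratings, k=10):
    train, test = single_split(ratings)
    relevant_by_user = {}
    for user, item, rating in test:
        if rating >= 4:  # assumed relevance threshold
            relevant_by_user.setdefault(user, set()).add(item)

    scores = {}
    for name, adapter in adapters.items():
        recommendations = adapter(train, list(relevant_by_user))
        per_user = [
            precision_at_k(recommendations.get(u, []), rel, k)
            for u, rel in relevant_by_user.items()
        ]
        scores[name] = sum(per_user) / max(len(per_user), 1)
    return scores


if __name__ == "__main__":
    data = [("u1", "i1", 5), ("u1", "i2", 2), ("u2", "i1", 4), ("u2", "i3", 5)]

    def popularity_adapter(train, users):
        # Dummy "framework": rank items by training popularity for everyone.
        counts = {}
        for _, item, _ in train:
            counts[item] = counts.get(item, 0) + 1
        ranked = sorted(counts, key=counts.get, reverse=True)
        return {u: ranked for u in users}

    print(benchmark({"popularity": popularity_adapter}, data, k=2))
```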
- Social Influence Bias in Recommender Systems: A Methodology for Learning, Analyzing, and Mitigating Bias in Ratings
by Sanjay Krishnan, Jay Patel, Michael Franklin, and Ken Goldberg
To facilitate browsing and selection, almost all recommender systems display an aggregate statistic (the average/mean or median rating value) for each item. This value has the potential to influence a participant's individual rating for an item due to what is known in the survey and psychology literature as Social Influence Bias: the tendency for individuals to conform to what they perceive as the norm in a community. As a result, ratings can be closer to the average and less diverse than they would be otherwise. We propose a methodology to 1) learn, 2) analyze, and 3) mitigate the effect of social influence bias in recommender systems. In the Learning phase, a baseline dataset is established with an initial set of participants by allowing them to rate items twice: before seeing the median rating, and again after seeing it. In the Analysis phase, a new non-parametric significance test based on the Wilcoxon statistic can quantify the extent of social influence bias in this data. If this bias is significant, we propose a Mitigation phase where mathematical models are constructed from this data using polynomial regression and the Bayesian Information Criterion (BIC) and then inverted to produce a filter that can reduce the effect of social influence bias. As a case study, we apply this methodology to the California Report Card (CRC), a new recommender system that encourages political engagement. After the Learning phase collected 9390 ratings, the non-parametric test in the Analysis phase rejected the null hypothesis, identifying significant social influence bias: ratings after display of the median were on average 19.3% closer to the median value. In the Mitigation phase, the learned polynomial models were able to predict changed ratings with a normalized RMSE of 12.8% and reduce bias by 76.3%. Results suggest that social influence bias can be significant in recommender systems and that this bias can be substantially reduced with machine learning.
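The sketch below illustrates, in simplified form, the statistical machinery the abstract describes: a Wilcoxon signed-rank test on the distance of ratings from the displayed median before and after display, and a polynomial de-biasing model whose degree is chosen by BIC. It uses SciPy and NumPy on toy data and is not the CRC implementation; the rating scale, data layout, and degree range are assumptions.

```python
"""Illustrative sketch (not the CRC system) of the Analysis and Mitigation
steps: test whether distances to the displayed median change after display,
then fit a polynomial mapping post-display ratings back toward pre-display
ratings, with the degree selected by BIC."""
import numpy as np
from scipy.stats import wilcoxon


def social_influence_test(before, after, shown_median):
    """Two-sided Wilcoxon signed-rank test comparing each rating's distance
    to the displayed median before vs. after the median is shown."""
    dist_before = np.abs(np.asarray(before) - shown_median)
    dist_after = np.abs(np.asarray(after) - shown_median)
    stat, p_value = wilcoxon(dist_before, dist_after)
    return stat, p_value


def fit_debias_model(after, before, max_degree=5):
    """Fit polynomials of increasing degree from post-display to pre-display
    ratings and return the coefficients of the BIC-minimizing model."""
    after = np.asarray(after, dtype=float)
    before = np.asarray(before, dtype=float)
    n = len(after)
    best = None
    for degree in range(1, max_degree + 1):
        coeffs = np.polyfit(after, before, degree)
        rss = float(np.sum((np.polyval(coeffs, after) - before) ** 2))
        k = degree + 1  # number of fitted coefficients
        bic = n * np.log(rss / n + 1e-12) + k * np.log(n)
        if best is None or bic < best[0]:
            best = (bic, coeffs)
    return best[1]


if __name__ == "__main__":
    # Toy data: ratings drift toward a displayed median of 5 on a 0-10 scale.
    before = [2, 8, 3, 9, 1, 7, 4, 10]
    after = [3, 7, 4, 8, 2, 6, 4, 9]
    print("Wilcoxon p-value:", social_influence_test(before, after, shown_median=5)[1])
    print("De-bias polynomial:", fit_debias_model(after, before, max_degree=3))
```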