Paper Session 6: Does it Work? Metrics and Evaluation

Date: Friday, Oct 5, 2018, 11:00-12:30
Location: Parq D/E/F
Chair: Denis Parra

  • [LP] Get Me The Best: Predicting Best Answerers in Community Question Answering Sites
    by Rohan Ravindra Tondulkar, Manisha Dubey, Maunendra Sankar Desarkar

    There has been a massive rise in the use of Community Question and Answering (CQA) forums to get solutions to various technical and non-technical queries. One common problem faced in CQA is the small number of experts, which leaves many questions unanswered. This paper addresses the challenging problem of predicting the best answerer for a new question and thereby recommending the best expert for the same. Although there is work in the literature that aims to find possible answerers for questions posted in CQA, very few algorithms exist for finding the best answerer whose answer will satisfy the information need of the original poster. For finding answerers, existing approaches mostly use features based on content and tags associated with the questions. A few approaches additionally consider the users’ history. In this paper, we propose an approach that considers a comprehensive set of features, including but not limited to text representation, tag-based similarity, and multiple user-based features that target users’ availability, agility and expertise, for predicting the best answerer for a given question. We also include features that give incentives to users who answer fewer but more important questions over those who answer many questions of less importance. A learning to rank algorithm is used to find the weight for each feature. Experiments conducted on a real dataset from Stack Exchange show the efficacy of the proposed method in terms of multiple evaluation metrics for accuracy, robustness and real time performance.

    Full text in ACM Digital Library
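
    Illustrative note on this entry: a minimal sketch of the general idea behind feature-weighted answerer ranking. Candidate answerers are scored by a weighted combination of question-candidate features; the feature names and weights below are hypothetical hand-set values standing in for what a learning to rank method would fit, and are not taken from the paper.

    def score_answerer(features, weights):
        """Linear scoring: dot product of feature values and learned weights."""
        return sum(weights[name] * value for name, value in features.items())

    # Hypothetical per-candidate features: tag overlap with the question,
    # recent activity (availability), inverse response delay (agility),
    # and accepted-answer ratio (expertise).
    candidates = {
        "user_a": {"tag_similarity": 0.8, "availability": 0.4, "agility": 0.6, "expertise": 0.9},
        "user_b": {"tag_similarity": 0.5, "availability": 0.9, "agility": 0.7, "expertise": 0.4},
    }
    weights = {"tag_similarity": 1.2, "availability": 0.5, "agility": 0.8, "expertise": 1.5}

    ranking = sorted(candidates, key=lambda u: score_answerer(candidates[u], weights), reverse=True)
    print(ranking)  # candidate answerers, highest predicted score first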

  • [LP] On the Robustness and Discriminative Power of IR Metrics for Top-N Recommendation
    by Daniel Valcarce, Alejandro Bellogin, Javier Parapar, Pablo Castells

    The evaluation of Recommender Systems is still an open issue in the field. Despite its limitations, offline evaluation usually constitutes the first step in assessing recommendation methods due to its reduced costs and high reproducibility. Selecting the appropriate metric is a central issue in offline evaluation. Among the properties of recommendation systems, ranking accuracy attracts the most attention nowadays. In this paper, we aim to shed light on the advantages of different ranking metrics which were previously used in Information Retrieval and are now typically used for assessing top-N recommender systems. We propose methodologies for comparing the robustness and the discriminative power of different metrics. On the one hand, we study the influence of cut-offs and we find that deeper cut-offs offer greater robustness and discriminative power. On the other hand, we find that precision offers high robustness and Normalised Discounted Cumulative Gain provides the best discriminative power.

    Full text in ACM Digital Library
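
    Illustrative note on this entry: a minimal, self-contained sketch of the two metrics compared above, Precision@k and nDCG@k, evaluated at two cut-offs on a single toy recommendation list with binary relevance labels. The data is invented and the formulation is a generic textbook one, not the paper's evaluation code.

    import math

    def precision_at_k(ranked_relevance, k):
        """Fraction of the top-k recommended items that are relevant (0/1 labels)."""
        return sum(ranked_relevance[:k]) / k

    def ndcg_at_k(ranked_relevance, k):
        """Normalised Discounted Cumulative Gain at cut-off k for binary relevance."""
        dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked_relevance[:k]))
        ideal = sorted(ranked_relevance, reverse=True)
        idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
        return dcg / idcg if idcg > 0 else 0.0

    # One user's top-N recommendation list, 1 = relevant, 0 = not relevant.
    relevance = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]
    for k in (5, 10):  # deeper cut-offs use more of the ranking
        print(k, precision_at_k(relevance, k), round(ndcg_at_k(relevance, k), 3))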

  • [SP] StreamingRec: A Framework for Benchmarking Stream-based News Recommenders
    by Michael Jugovac, Dietmar Jannach, Mozhgan Karimi

    News is one of the earliest application domains of recommender systems, and recommending items from a virtually endless stream of news is still a relevant problem today. News recommendation is different from other application domains in a variety of ways, e.g., because new items constantly become available for recommendation. To be effective, news recommenders therefore have to continuously consider the latest items in the incoming stream of news in their recommendation models. However, today’s public software libraries for algorithm benchmarking mostly do not consider these particularities of the domain. As a result, authors often rely on proprietary protocols, which hampers the comparability of the obtained results. In this paper, we present StreamingRec as a framework for evaluating streaming-based news recommenders in a replicable way. The open-source framework implements a replay-based evaluation protocol that allows algorithms to update the underlying models in real-time when new events are recorded and new articles are available for recommendation. Furthermore, a variety of baseline algorithms for session-based recommendation are part of StreamingRec. For these, we also report a number of performance results for two datasets, which confirm the importance of immediate model updates.

    Full text in ACM Digital Library
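
    Illustrative note on this entry: a minimal sketch of a replay-style evaluation loop over a chronologically ordered event log, where the model is updated immediately as each event is replayed. The event schema, the most-recent baseline and the hit-rate bookkeeping are illustrative assumptions, not the StreamingRec API.

    from collections import deque

    class MostRecentRecommender:
        """Toy baseline: recommend the most recently published articles."""
        def __init__(self, n=3):
            self.recent = deque(maxlen=50)
            self.n = n

        def update(self, event):
            if event["type"] == "publish":
                self.recent.appendleft(event["article"])

        def recommend(self, event):
            # Exclude the article the user is currently reading.
            return [a for a in self.recent if a != event["article"]][: self.n]

    events = [  # hypothetical interleaved stream of publications and clicks
        {"type": "publish", "article": "a1"},
        {"type": "publish", "article": "a2"},
        {"type": "click", "user": "u1", "article": "a1"},
        {"type": "publish", "article": "a3"},
        {"type": "click", "user": "u1", "article": "a3"},
    ]

    rec, hits, trials = MostRecentRecommender(), 0, 0
    for i, event in enumerate(events):
        if event["type"] == "click":
            # Evaluate first: was the user's next click among the current recommendations?
            upcoming = [e["article"] for e in events[i + 1:] if e.get("user") == event["user"]]
            if upcoming:
                trials += 1
                hits += int(upcoming[0] in rec.recommend(event))
        rec.update(event)  # model sees every event as soon as it is replayed

    print(f"hit rate: {hits}/{trials}")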

  • [LP] Unbiased Offline Recommender Evaluation for Missing-Not-At-Random Implicit Feedback
    by Longqi Yang, Yin Cui, Yuan Xuan, Chenyang Wang, Serge Belongie, Deborah Estrin

    Implicit-feedback recommenders (ImplicitRec) leverage positive-only user-item interactions, such as clicks, to learn personalized user preferences. Recommenders are often evaluated and compared offline using datasets collected from online platforms. These platforms are subject to popularity bias (i.e., popular items are more likely to be presented and interacted with), and therefore logged ground truth data is Missing-Not-At-Random (MNAR). As a result, the existing Average-Over-All (AOA) evaluation is biased towards accurately recommending trendy items. Prior research on debiasing MNAR data for explicit-rating recommenders (ExplicitRec) is not directly applicable because negative user opinions are not available in implicit feedback. In this paper, we (a) show that existing offline evaluations for ImplicitRec are biased and (b) develop an unbiased and practical offline evaluator for implicit MNAR datasets using the inverse-propensity-weighting technique. Through extensive experiments using three real world datasets and four classical and state-of-the-art algorithms, we show that (a) popularity bias is widely manifested in item presentation and interaction; (b) evaluation bias due to MNAR data pervasively exists in most ImplicitRec; and (c) the unbiased estimator can correct the potentially inaccurate judgements of algorithms’ relative utilities.

    Full text in ACM Digital Library
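
    Illustrative note on this entry: a hedged sketch of the inverse-propensity-weighting idea, in which each held-out item is weighted by the inverse of an estimated exposure probability so that hits on rarely shown long-tail items count more than hits on popular ones. The popularity-based propensity model, the exponent and the toy data are assumptions for illustration, not the paper's exact estimator.

    from collections import Counter

    interactions = [  # hypothetical (user, item) positive-only log
        ("u1", "i1"), ("u2", "i1"), ("u3", "i1"), ("u1", "i2"), ("u2", "i3"),
    ]

    # Popularity-based propensity estimate: popular items are more likely to be observed.
    counts = Counter(item for _, item in interactions)
    total = sum(counts.values())
    propensity = {item: (c / total) ** 0.5 for item, c in counts.items()}

    def ipw_recall(recommended, held_out):
        """Recall where each held-out hit is up-weighted by 1/propensity,
        so recovering a rarely exposed item counts more than a trendy one."""
        weight_hits = sum(1.0 / propensity[i] for i in held_out if i in recommended)
        weight_all = sum(1.0 / propensity[i] for i in held_out)
        return weight_hits / weight_all if weight_all else 0.0

    # Naive AOA recall scores both lists equally (one hit out of two held-out items);
    # the IPW estimate rewards recovering the less popular item i3 more.
    print(ipw_recall({"i1", "i2"}, {"i1", "i3"}))
    print(ipw_recall({"i2", "i3"}, {"i1", "i3"}))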

  • [LP] Judging Similarity: A User-Centric Study of Related Item Recommendations
    by Yuan Yao, F. Maxwell Harper

    Related item recommenders operate in the context of a particular item. For instance, a music system’s page about the artist Radiohead might recommend other similar artists such as The Flaming Lips. Often central to these recommendations is the computation of similarity between pairs of items. Prior work has explored many algorithms and features that allow for the computation of similarity scores, but little work has evaluated these approaches from a user-centric perspective. In this work, we build and evaluate six similarity scoring algorithms that span a range of activity- and content-based approaches. We evaluate the performance of these algorithms using both offline metrics and a new set of more than 22,000 user-contributed evaluations. We integrate these results with a survey of more than 700 participants concerning their expectations about item similarity and related item recommendations. We find that content-based algorithms outperform ratings- and clickstream-based algorithms in terms of how well they match user expectations for similarity and recommendation quality. Our results yield a number of implications to guide the construction of related item recommendation algorithms.

    Full text in ACM Digital Library
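
    Illustrative note on this entry: a minimal sketch contrasting a content-based and an activity-based similarity score for one item pair, in the spirit of the algorithm families the study compares. The toy tag and rating data and the plain cosine formulation are assumptions for illustration, not the paper's algorithms.

    import math

    def cosine(u, v):
        """Cosine similarity between two sparse vectors given as dicts."""
        dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
        norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
        return dot / norm if norm else 0.0

    # Content-based view: items described by tag/genre weights.
    content = {
        "radiohead":    {"alt_rock": 1.0, "electronic": 0.6, "experimental": 0.8},
        "flaming_lips": {"alt_rock": 0.9, "psychedelic": 0.7, "experimental": 0.6},
    }

    # Activity-based view: items described by the ratings users gave them.
    ratings = {
        "radiohead":    {"u1": 5, "u2": 4, "u3": 5},
        "flaming_lips": {"u2": 5, "u3": 4, "u4": 3},
    }

    print("content similarity:", round(cosine(content["radiohead"], content["flaming_lips"]), 3))
    print("ratings similarity:", round(cosine(ratings["radiohead"], ratings["flaming_lips"]), 3))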
