Paper Session P2: Evaluating and Explaining Recommendations

Session A: 18:30–20:00, chaired by Kim Falk and Nava Tintarev
Session B: 5:30–7:00, chaired by Ludovico Boratto and Li Chen

  • [LP] Ensuring Fairness in Group Recommendations by Rank-Sensitive Balancing of Relevance
    by Mesut Kaya (TU Delft), Derek Bridge (Insight Centre for Data Analytics, University College Cork), Nava Tintarev (TU Delft)

    “For group recommendations, one objective is to recommend an ordered set of items, a top-N, to a group such that each individual recommendation is relevant for everyone. A common way to do this is to select items on which the group can agree, using so-called ‘aggregation strategies’. One weakness of these aggregation strategies is that they select items independently of each other. They therefore cannot guarantee properties such as fairness, that apply to the set of recommendations as a whole.
    In this paper, we give a definition of fairness that ‘balances’ the relevance of the recommended items across the group members in a rank-sensitive way. Informally, an ordered set of recommended items is considered fair to a group if the relevance of the items in the top-N is balanced across the group members for each prefix of the top-N. In other words, the first item in the top-N should, as far as possible, balance the interests of all group members; the first two items taken together must do the same; also the first three; and so on up to N. In this paper, we formalize this notion of rank-sensitive balance and provide a greedy algorithm (GFAR) for finding a top-N set of group recommendations that satisfies our definition.
    We compare the performance of GFAR to five approaches from the literature on two datasets, one from each of the movie and music domains. We evaluate performance for 42 different configurations (two datasets, seven different group sizes, three different group types) and for ten evaluation metrics. We find that GFAR performs significantly better than all other algorithms around 43% of the time; in only 10% of cases are there algorithms that are significantly better than GFAR. Furthermore, GFAR performs particularly well in the most difficult cases, where groups are large and interests within the group diverge. We attribute GFAR’s success both to its rank-sensitivity and its way of balancing relevance. Current methods do not define fairness in a rank-sensitive way (although some achieve a degree of rank-sensitivity through the use of greedy algorithms) and none define balance in the way that we do.”
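The rank-sensitive balancing idea can be sketched in a few lines. The snippet below is an illustrative simplification, not the paper's actual GFAR algorithm: at each rank it greedily adds the item that maximizes the minimum cumulative relevance across group members, so that every prefix of the top-N stays balanced.

```python
# Illustrative greedy selection for rank-sensitive group fairness.
# A simplified stand-in for GFAR: each step adds the item that most
# improves the least-satisfied group member, prefix by prefix.

def greedy_fair_top_n(relevance, n):
    """relevance: {user: {item: score}}; returns an ordered top-n list."""
    users = list(relevance)
    items = set().union(*(relevance[u] for u in users))
    chosen, gained = [], {u: 0.0 for u in users}
    for _ in range(n):
        candidates = items - set(chosen)
        if not candidates:
            break
        # Pick the item maximizing the minimum cumulative relevance,
        # so every prefix of the ranking balances the members' interests.
        best = max(candidates,
                   key=lambda i: min(gained[u] + relevance[u].get(i, 0.0)
                                     for u in users))
        chosen.append(best)
        for u in users:
            gained[u] += relevance[u].get(best, 0.0)
    return chosen
```

With two members whose tastes diverge, such a greedy pass prefers a compromise item at rank 1 over an item that is perfect for one member and worthless to the other.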

  • [LP] What does BERT know about books, movies and music? Probing BERT for Conversational Recommendation
    by Gustavo Penha (Delft University of Technology), Claudia Hauff (Delft University of Technology)

    “Heavily pre-trained transformer models such as BERT have recently been shown to be remarkably powerful at language modelling, achieving impressive results on numerous downstream tasks. It has also been shown that they implicitly store factual knowledge in their parameters after pre-training. Understanding what the pre-training procedure of LMs actually learns is a crucial step for using and improving them for Conversational Recommender Systems (CRS). We first study how much off-the-shelf pre-trained BERT “knows” about recommendation items such as books, movies and music. In order to analyze the knowledge stored in BERT’s parameters, we use different probes (i.e., tasks to examine a trained model regarding certain properties) that require different types of knowledge to solve, namely content-based and collaborative-based. Content-based knowledge is knowledge that requires the model to match the titles of items with their content information, such as textual descriptions and genres. In contrast, collaborative-based knowledge requires the model to match items with similar ones, according to community interactions such as ratings. We resort to BERT’s Masked Language Modelling (MLM) head to probe its knowledge about the genre of items, with cloze style prompts. In addition, we employ BERT’s Next Sentence Prediction (NSP) head and representations’ similarity (SIM) to compare relevant and non-relevant search and recommendation query-document inputs to explore whether BERT can, without any fine-tuning, rank relevant items first. Finally, we study how BERT performs in a conversational recommendation downstream task. To this end, we fine-tune BERT to act as a retrieval-based CRS. Overall, our experiments show that: (i) BERT has knowledge stored in its parameters about the content of books, movies and music; (ii) it has more content-based knowledge than collaborative-based knowledge; and (iii) it fails on conversational recommendation when faced with adversarial data.”
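A cloze-style genre probe of the kind described can be sketched as follows. The prompt template and the `score_fn` stand-in for the MLM head are assumptions for illustration, not the paper's actual prompts or model.

```python
# A minimal sketch of a cloze-style genre probe. `score_fn` stands in for a
# masked language model that returns a probability for each candidate
# filler of the [MASK] slot; the template below is hypothetical.

def genre_cloze_prompt(title, mask_token="[MASK]"):
    return f"{title} is a {mask_token} movie."

def probe_accuracy(examples, candidates, score_fn):
    """examples: [(title, true_genre)]; score_fn(prompt, filler) -> float.
    Returns the fraction of items whose top-scored filler is the true genre."""
    hits = 0
    for title, genre in examples:
        prompt = genre_cloze_prompt(title)
        pred = max(candidates, key=lambda g: score_fn(prompt, g))
        hits += pred == genre
    return hits / len(examples)
```

Plugging a real MLM head in as `score_fn` turns this into a zero-shot probe of what the pre-trained model "knows" about item genres.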

  • [LP] Keeping Dataset Biases out of the Simulation: A Debiased Simulator for Reinforcement Learning based Recommender Systems
    by Jin Huang (University of Amsterdam), Harrie Oosterhuis (University of Amsterdam), Maarten de Rijke (University of Amsterdam & Ahold Delhaize), Herke van Hoof (University of Amsterdam)

    “Reinforcement learning for recommendation (RL4Rec) methods are increasingly receiving attention as an effective way to improve long-term user engagement. However, applying RL4Rec online comes with risks: exploration may lead to periods of detrimental user experience. Moreover, few researchers have access to real-world recommender systems. Simulations have been put forward as a solution where user feedback is simulated based on logged historical user data, thus enabling optimization and evaluation without being run online. While simulators do not risk the user experience and are widely accessible, we identify an important limitation of existing simulation methods. They ignore the interaction biases present in logged user data, and consequently, these biases affect the resulting simulation. As a solution to this issue, we introduce a debiasing step in the simulation pipeline, which corrects for the biases present in the logged data before it is used to simulate user behavior. To evaluate the effects of bias on RL4Rec simulations, we propose a novel evaluation approach for simulators that considers the performance of policies optimized with the simulator. Our results reveal that the biases from logged data negatively impact the resulting policies, unless corrected for with our debiasing method. While our debiasing methods can be applied to any simulator, we make our complete pipeline publicly available as the Simulator for OFfline leArning and evaluation (SOFA): the first simulator that accounts for interaction biases prior to optimization and evaluation.”
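The debiasing step can be illustrated with a textbook inverse-propensity estimator (an assumption for illustration; SOFA's actual correction may differ): logged values are reweighted by the inverse of their probability of having been logged, so frequently-shown items no longer dominate the simulated user model.

```python
# Illustrative inverse-propensity-scoring (IPS) estimator, a standard way
# to correct for observation bias in logged interaction data.

def ips_mean(observed, propensity, n_total):
    """observed: {key: value} for the logged entries only.
    propensity[key]: the probability that this entry was logged.
    n_total: number of entries in the full (mostly unobserved) population.
    Returns an estimate of the population mean that is unbiased in
    expectation, unlike the naive mean over the observed entries."""
    return sum(v / propensity[k] for k, v in observed.items()) / n_total
```

Entries that were unlikely to be logged count proportionally more, which is what removes the popularity-style bias before the data feeds the simulator.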

  • [LP] Making Neural Networks Interpretable with Attribution: Application to Implicit Signals Prediction
    by Darius Afchar (Deezer Research), Romain Hennequin (Deezer Research)

    “Explaining recommendations enables users to understand whether recommended items are relevant to their needs and has been shown to increase their trust in the system. More generally, while designing explainable machine learning models is key to checking the sanity and robustness of a decision process and improving their efficiency, it remains a challenge for complex architectures, especially deep neural networks that are often deemed “black-box”. In this paper, we propose a novel formulation of interpretable deep neural networks for the attribution task. Unlike popular post-hoc methods, our approach is interpretable by design. Using masked weights, hidden features can be deeply attributed, split into several input-restricted sub-networks and trained as a boosted mixture of experts. Experimental results on synthetic data and real-world recommendation tasks demonstrate that our method enables building models that achieve predictive performance close to their non-interpretable counterparts, while providing informative attribution interpretations.”
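A loose, minimal reading of input-restricted sub-networks (an illustrative assumption, far simpler than the paper's masked-weight boosted mixture of experts): when each expert sees only a subset of the inputs and the prediction is the sum of expert outputs, each expert's output is by construction an attribution to its input subset.

```python
# Toy interpretable-by-design predictor: each "expert" is restricted to a
# feature subset, so its contribution directly attributes the prediction.

def predict_with_attribution(x, experts):
    """x: {feature: value}; experts: [(feature_subset, weight_dict)].
    Returns (prediction, per-expert contributions)."""
    contribs = []
    for subset, w in experts:
        # This expert only ever reads features in its subset.
        contribs.append(sum(w[f] * x[f] for f in subset))
    return sum(contribs), contribs
```

No post-hoc explanation step is needed: the attribution falls out of the architecture, which is the design property the abstract emphasizes.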

  • [LP] On Target Item Sampling in Offline Recommender System Evaluation
    by Rocío Cañamares (Universidad Autónoma de Madrid), Pablo Castells (Universidad Autónoma de Madrid)

    “Target selection is a basic yet often implicit decision in the configuration of offline recommendation experiments. In this paper we research the impact of target sampling on the outcome of comparative recommender system evaluation. Specifically, we undertake a detailed analysis considering the informativeness and consistency of experiments across the target size axis. We find that comparative evaluation using reduced target sets contradicts in many cases the corresponding outcome using large targets, and we provide a principled explanation for these disagreements. We further seek to determine which among the contradicting results may be more reliable. Through comparison to unbiased evaluation, we find that minimum target sets incur substantial distortion in pairwise system comparisons, while maximum sets may not be ideal either, and better options may lie in between the extremes. We further find means for informing the target size setting in the common case where unbiased evaluation is not possible, through an assessment of the discriminative power of evaluation, which remarkably aligns with the agreement with unbiased evaluation.”
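The target-size effect is easy to demonstrate with a toy ranking check (illustrative names and protocol, not the paper's experimental setup): the same recommender can score a hit against a small sampled target set yet miss once the target set is enlarged to include a stronger distractor.

```python
# Toy offline-evaluation check: does the held-out item rank in the top k
# among the chosen target set? The outcome depends on which candidates
# the target set happens to contain.

def hit_at_k(score, true_item, targets, k):
    """score: item -> float; targets: candidate items including true_item."""
    ranked = sorted(targets, key=score, reverse=True)
    return true_item in ranked[:k]
```

With a scorer that ranks item 10 highest among {10, 1, 2, 3}, hit@1 succeeds; add a distractor the scorer prefers (say, 99) and the very same system records a miss. This is the sense in which reduced and full target sets can contradict each other.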

  • [LP] Recommendations as Graph Explorations
    by Marialena Kyriakidi (University of Athens and Athena Research Center), Georgia Koutrika (Athena Research Center), Yannis Ioannidis (University of Athens and Athena Research Center)

    “We argue that most recommendation approaches can be abstracted as a graph exploration problem. In particular, we describe a graph-theoretic framework with two primary parts: (a) a recommendation graph, modeling all the elements of an (application) domain from a recommendation perspective, including the subjects and objects of recommendations as well as the relationships between them; (b) a set of path operations, inferring new edges, i.e., implicit or unknown relationships, by traversing and combining paths on the graph. The resulting path algebra model provides an abstraction and a common foundation that is beneficial to three aspects of recommendations: (a) expressive power – expression and subsequent use of several significantly different, existing but also novel recommendation approaches is reduced to parameterizing a unique model; (b) usability – by capturing part of the recommendation mechanisms in the underlying path algebra semantics, specification of recommendation approaches becomes easier and less tedious; (c) processing speed – implementing recommender systems on top of graph engines opens up the door for several optimizations that speed up execution. We demonstrate the above benefits by expressing several categories of recommendation approaches in the path algebra model and benchmarking some of them in a recommender system implemented on top of Neo4J, a widely used graph system.”
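The path-composition idea can be sketched as a relational join over adjacency sets (an illustrative toy, not the paper's path algebra or its Neo4j implementation): composing a `likes` relation with an item-to-item `similar` relation infers new user-to-item edges, i.e., recommendations.

```python
from collections import defaultdict

# Toy path operation: edges are stored per relation as adjacency sets, and
# composing two relations along two-step paths infers new (implicit) edges.

def compose(edges_a, edges_b):
    """Compose relation A (x -> y) with relation B (y -> z) into x -> z:
    an edge exists wherever a two-step path A then B exists."""
    out = defaultdict(set)
    for x, ys in edges_a.items():
        for y in ys:
            out[x] |= edges_b.get(y, set())
    return dict(out)
```

Different recommendation approaches then reduce to different choices of which relations to compose and how to weight the resulting paths, which is the parameterization the abstract describes.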
