Thursday Poster Session: Reproducibility + Demos + LBR
Date: Thursday, September 25
Reproducibility Papers
- SPOT #1: Are We Really Making Recommendations Robust? Revisiting Model Evaluation for Denoising Recommendation
by Guohang Zeng, Jie Lu, Guangquan Zhang
Implicit feedback data has emerged as a fundamental component of modern recommender systems due to its scalability and availability. However, the presence of noisy interactions—such as accidental clicks and position bias—can potentially degrade recommendation performance. Recently, denoising recommendation has emerged as a popular research topic, aiming to identify and mitigate the impact of noisy samples to train robust recommendation models in the presence of noisy interactions. Although denoising recommendation methods have become a promising solution, our systematic evaluation reveals critical reproducibility issues in this growing research area. We observe inconsistent performance across different experimental settings and a concerning misalignment between validation metrics and test performance caused by distribution shifts. Through extensive experiments testing 6 representative denoising methods across 4 recommender models and 3 datasets, we find that no single denoising approach consistently outperforms others, and simple improvements to evaluation strategies can sometimes match or exceed state-of-the-art denoising methods. Our analysis further reveals concerns about denoising recommendation in high-noise scenarios. We identify key factors contributing to reproducibility defects and propose pathways toward more reliable denoising recommendation research. This work serves as both a cautionary examination of current practices and a constructive guide for the development of more reliable evaluation methodologies in denoising recommendation.
- SPOT #2: Context Trails: A Dataset to Study Contextual and Route Recommendation
by Pablo Sánchez, Alejandro Bellogin, José L. Jorro-Aragoneses
Recommender systems in the tourism domain are gaining increasing attention, yet the development of diverse recommendation tasks remains limited, largely due to the scarcity of comprehensive public datasets. This paper introduces Context Trails, a novel tourism dataset addressing this gap. Context Trails distinguishes itself by including not only user interactions with touristic venues, but also the itineraries (trails or routes) followed by users. Furthermore, it enriches existing item features (e.g., category, coordinates) with contextual attributes related to the interaction moment (e.g., weather) and the venue itself (e.g., opening hours). Beyond a detailed description of the dataset’s characteristics, we evaluate the performance of several baseline algorithms across three distinct recommendation tasks: classical recommendation, route recommendation, and contextual recommendation. We believe this dataset will foster further research and development of advanced recommender systems within the tourism domain.
- SPOT #3: DistillRecDial: A Knowledge-Distilled Dataset Capturing User Diversity in Conversational Recommendation
by Alessandro Francesco Maria Martina, Alessandro Petruzzelli, Cataldo Musto, Marco de Gemmis, Pasquale Lops, Giovanni Semeraro
Conversational Recommender Systems (CRSs) facilitate item discovery through multi-turn dialogues that elicit user preferences via natural language interaction. This field has gained significant attention following advancements in Natural Language Processing (NLP) enabled by Large Language Models (LLMs). However, current CRS research remains constrained by datasets with fundamental limitations. Human-generated datasets suffer from inconsistent dialogue quality, limited domain expertise, and insufficient scale for real-world application, while synthetic datasets created with proprietary LLMs ignore the diversity of real-world user behavior and present significant barriers to accessibility and reproducibility. The development of effective CRSs depends critically on addressing these deficiencies. To this end, we present DistillRecDial, a novel conversational recommendation dataset generated through a knowledge distillation pipeline that leverages smaller, more accessible open LLMs. Crucially, DistillRecDial simulates a range of user types with varying intentions, preference expression styles, and initiative levels, capturing behavioral diversity that is largely absent from prior work. Human evaluation demonstrates that our dataset significantly outperforms widely adopted CRS datasets in dialogue coherence and domain-specific expertise, indicating its potential to advance the development of more realistic and effective conversational recommender systems.
- SPOT #4: Exploring the Potential of LLMs for Serendipity Evaluation in Recommender Systems
by Li Kang, Yuhan Zhao, Li Chen
Serendipity plays a pivotal role in enhancing user satisfaction within recommender systems, yet its evaluation poses significant challenges due to its inherently subjective nature and conceptual ambiguity. Current algorithmic approaches predominantly rely on proxy metrics for indirect assessment, often failing to align with real user perceptions and thereby creating a gap. With large language models (LLMs) increasingly revolutionizing evaluation methodologies across various human annotation tasks, we are inspired to explore a core research proposition: Can LLMs effectively simulate human users for serendipity evaluation? To address this question, we conduct a meta-evaluation on two datasets derived from real user studies in the e-commerce and movie domains, focusing on three key aspects: the accuracy of LLMs compared to conventional proxy metrics, the influence of auxiliary data on LLM comprehension, and the efficacy of recent popular multi-LLM techniques. Our findings indicate that even the simplest zero-shot LLMs achieve parity with, or surpass, conventional metrics. Furthermore, multi-LLM techniques and the incorporation of auxiliary data further enhance alignment with human perspectives. Based on our findings, the best-performing LLM evaluation setup yields a Pearson correlation coefficient of 21.5% with the results of the user study. This research establishes that LLMs have the potential to serve as accurate, reproducible, reliable, and cost-effective evaluators, introducing a new paradigm for serendipity evaluation in recommender systems.
- SPOT #5: Fashion-AlterEval: A Dataset for Improved Evaluation of Conversational Recommendation Systems with Alternative Relevant Items
by Maria Vlachou
In Conversational Recommendation Systems (CRS), a user provides feedback on recommended items at each turn, leading the CRS towards improved recommendations. Due to the need for a large amount of data, a user simulator is employed for both training and evaluating the CRS. Such user simulators typically critique the current retrieved items based on knowledge of a single target item. However, the evaluation of such systems in offline settings with simulators is limited by the focus on a single target item and their unlimited patience over a large number of turns. To overcome these limitations of existing simulators, we propose Fashion-AlterEval, a dataset that contains human judgments for a selection of alternative items by adding new annotations in common fashion CRS datasets. Consequently, we propose two novel meta-user simulators that use the collected judgments and allow simulated users not only to express their preferences about alternative items to their original target, but also to change their mind and level of patience. In our experiments using Shoes and Fashion IQ as the original datasets and three CRS models, we find that using the knowledge of alternatives by the simulator can have a considerable impact on the evaluation of existing CRS models: the existing single-target evaluation underestimates their effectiveness, and when simulated users are allowed to consider alternative relevant items instead, the system can adapt to satisfy the user more quickly. Importantly, we observe that a probabilistic switch to alternatives based on the estimation of gains and losses (with a probability threshold) in most cases leads to better performance estimation than a meta-simulator with a fixed switch to alternatives.
- SPOT #6: GreenFoodLens: Sustainability Labels for Food Recommendation
by Giacomo Balloccu, Ludovico Boratto, Gianni Fenu, Mirko Marras, Giacomo Medda, Giovanni Murgia
Most food recommendation systems aim to increase user engagement by looking at recipe ingredients and past choices. Even though consumers are paying more attention to sustainability, such as carbon and water footprints, there remains a notable lack of public corpora that combine detailed user–recipe interactions with reliable environmental impact data. This gap makes it hard to build recommendation tools that both match people’s tastes and help reduce ecological damage. To this end, we present GreenFoodLens, a resource that enriches HUMMUS, one of the largest corpora for food recommendation, with environmental impact estimates derived from the hierarchical taxonomy of the SU-EATABLE-LIFE project. We achieved this result through a multi-step process involving human annotations, iterative labeling assessments, knowledge refinement, and constrained generation techniques with large language models. Finally, we evaluate recommendation baselines on HUMMUS augmented with GreenFoodLens labels and find that models are driven by popularity signals, which may exacerbate the environmental impact of users’ recipe choices. These experiments demonstrate the practical benefit of GreenFoodLens for benchmarking and advancing sustainability-aware recommendation research.
- SPOT #7: How Powerful are LLMs to Support Multimodal Recommendation? A Reproducibility Study of LLMRec
by Maria Lucia Fioretti, Nicola Laterza, Alessia Preziosa, Daniele Malitesta, Claudio Pomo, Fedelucio Narducci, Tommaso Di Noia
Large language models (LLMs) have been exploited as standalone recommender systems (RSs) learning to recommend from historical user-item data and, more recently, as support tools for already existing RSs. Within this second research line, LLMRec prompts an LLM with the user-item data, the items’ metadata, and the candidate items generated by other multimodal RSs to obtain an augmented version of the original dataset on which a final RS is trained. Despite its remarkable performance, concerns may arise regarding the accountability of this model. In this regard, a few recent studies have proposed reproducing and rigorously evaluating LLM-based recommender systems (RSs) as standalone approaches (first research line). However, little to no attention has been devoted to exploring the use of LLMs as supportive components within existing RSs, particularly in the context of multimodal recommendation (second research line). To this end, in this work, we propose the first reproducibility study of an LLM-based RS belonging to the second research line, LLMRec, in the multimodal recommendation domain. First, we try to replicate the results of LLMRec with the authors’ provided data and our own reconstructed data, outlining critical issues in the measured recommendation performance. Then, we benchmark LLMRec: (i) with unimodal and multimodal LLMs, showing how the latter may be more beneficial in a multimodal scenario; (ii) against other competitive multimodal RSs, LLM-based solutions, and an additional dataset, demonstrating inconsistencies with the trends emerging in the original paper. Finally, in an attempt to disentangle the observed performance trends, we evaluate (for the first time in the literature) the topological differences of the original user-item interaction graph with respect to LLMRec’s augmented one.
- SPOT #8: Impacts of Mainstream-Driven Algorithms on Recommendations for Children Across Domains: A Reproducibility Study
by Robin Ungruh, Alejandro Bellogín, Dominik Kowald, Maria Pera
Children access varied media across many online platforms, where they are often exposed to items curated by recommendation algorithms. Yet, research seldom considers children as a user group, and when it does, it is anchored on datasets where children are underrepresented, risking overlooking their inherent traits, favoring those of the majority, i.e., mainstream users. Recently, Ungruh et al. demonstrated that children’s consumption patterns and preferences differ from those of mainstream users, resulting in inconsistent recommendation algorithm performance and behavior for this user group. These findings, however, are based on two datasets with a limited child user sample. To advance this line of work, we reproduce this study on a wider range of datasets in the movie, music, and book domains, uncovering interaction patterns and aspects of child-recommender interactions that are consistent across domains, as well as those specific to some user samples in the data. We also extend insights from the original study by analyzing popularity bias metrics, given the interpretation of results from the original study. This reproduction and extension allow us to uncover consumption patterns and differences between age groups stemming from intrinsic differences between children and others, and those unique to specific datasets or domains. We share data samples from our exploration and associated code in a public repository.
- SPOT #9: Model Meets Knowledge: Analyzing Knowledge Types for Conversational Recommender Systems
by Jujia Zhao, Yumeng Wang, Zhaochun Ren, Suzan Verberne
Conversational Recommender Systems (CRSs) often integrate external knowledge to enhance user preference modeling and item representation learning, addressing the challenge of sparse conversational contexts. Traditional methods primarily utilize structured knowledge graphs (KGs) to model entity relationships and capture deep, multi-hop relationships among items. More recent studies employing pre-trained language models (PLMs), however, leverage unstructured text (e.g., customer reviews) to enrich contextual understanding of users and items. Despite reported performance gains from both knowledge types, a question remains: What is the compatibility between specific CRS model architectures and types of external knowledge, and how do different knowledge sources complement each other? We present a reproducibility study evaluating 9 state-of-the-art CRSs, including KG-based and PLM-based paradigms, to systematically investigate model–knowledge compatibility and complementarity. Through a comprehensive evaluation on three datasets, we uncover three key findings: (1) Different model architectures have different compatibility with knowledge types: decoder-only models excel with structured knowledge, whereas encoder-decoder models better utilize unstructured knowledge. (2) Combining multiple knowledge sources isn’t always superior to using a single type, but merging similar knowledge types is generally more effective than mixing different ones. (3) Unstructured knowledge broadly benefits all scenario-specific conversations, particularly in genre-specific and descriptive scenarios, whereas structured knowledge demonstrates superior performance in comparative recommendation scenarios. Our study serves as an inspiration for future research on maximizing the benefits of external knowledge across different models in CRSs.
- SPOT #10: Privacy Risks of LLM-Empowered Recommender Systems: An Inversion Attack Perspective
by Yubo Wang, Min Tang, Nuo Shen, Shujie Cui, Weiqing Wang
The large language model (LLM) powered recommendation paradigm has been proposed to address the limitations of traditional recommender systems, which often struggle to handle cold-start users or items with new IDs. Despite its effectiveness, this study uncovers that LLM-empowered recommender systems are vulnerable to reconstruction attacks that can expose both system and user privacy. To thoroughly examine this threat, we present the first systematic study on inversion attacks targeting LLM-empowered RecSys, wherein adversaries attempt to reconstruct original user prompts that contain personal preferences, interaction histories, and demographic attributes by exploiting the output logits of recommendation models. We propose an optimized inversion framework that integrates a vec2text generation engine with Similarity-Guided Refinement to accurately recover textual prompts from logits. Extensive experiments across two domains (movies and books) and two representative LLM-based recommendation models demonstrate that our method achieves high-fidelity reconstructions. Specifically, we can recover nearly 65% of the user-interacted items and correctly infer age and gender in 87% of cases. The experiments also reveal that privacy leakage is largely insensitive to the victim model’s performance but highly dependent on domain consistency and prompt complexity. These findings expose critical and unique privacy vulnerabilities in LLM-powered recommender systems.
- SPOT #11: Revisiting Prompt Engineering: A Comprehensive Evaluation for LLM-based Personalized Recommendation
by Genki Kusano, Kosuke Akimoto, Kunihiro Takeoka
Large language models (LLMs) can perform recommendation tasks by taking prompts written in natural language as input. Compared to traditional methods such as collaborative filtering, LLM-based recommendation offers advantages in handling cold-start, cross-domain, and zero-shot scenarios, as well as supporting flexible input formats and generating explanations of user behavior. In this paper, we focus on a single-user setting, where no information from other users is used. This setting is practical for privacy-sensitive or data-limited applications. In such cases, prompt engineering becomes especially important for controlling the output generated by the LLM. We conduct a large-scale comparison of 23 prompt types across 8 public datasets and 12 LLMs. We use statistical tests and linear mixed-effects models to evaluate both accuracy and inference cost. Our results show that for cost-efficient LLMs, three types of prompts are especially effective: those that rephrase instructions, consider background knowledge, and make the reasoning process easier to follow. For high-performance LLMs, simple prompts often outperform more complex ones while reducing cost. In contrast, commonly used prompting styles in natural language processing, such as step-by-step reasoning, or the use of reasoning models often lead to lower accuracy. Based on these findings, we provide practical suggestions for selecting prompts and LLMs depending on the required balance between accuracy and cost.
- SPOT #12: Revisiting the Performance of Graph Neural Networks for Session-based Recommendation
by Faisal Shehzad, Dietmar Jannach
Graph Neural Networks (GNNs) have shown impressive performance in various domains. Motivated by this success, several GNN-based session-based recommender systems (SBRS) have been proposed over the past few years. The literature suggests that these algorithms can achieve strong performance and outperform well-established baseline neural models. However, some recent reproducibility studies suggest that the performance achieved by more complex GNN-based models may sometimes be overstated and that these models may not be as impactful as expected. Moreover, an inconsistent choice of datasets, preprocessing steps, and evaluation protocols across published works makes it difficult to reliably assess progress in the field. In the present study, we reassess the performance of three well-established baseline models—GRU4Rec, NARM, and STAMP—and compare them to six more recent GNN-based SBRS within a standardized evaluation framework. Experiments on commonly used datasets for SBRS reveal that in particular the GRU4Rec model, if properly tuned, is still highly competitive and leads to the best results on two out of three datasets. Furthermore, we find that the performance of the GNN-based models varies largely across datasets. Interestingly, only the quite early SR-GNN model turns out to be superior in terms of accuracy metrics on one of the datasets. We speculate that the reasons for our surprising result may lie in insufficient hyperparameter tuning processes for the baselines in the original papers.
- SPOT #13: The XITE Million Sessions Dataset
by Ralvi Isufaj, Ruslan Tsygankov, Zoltán Szlávik
We present the XITE Million Sessions Dataset, a collection of one million music video streaming sessions from an interactive TV platform. This dataset addresses a significant gap in music recommendation research by capturing sequential user interactions with music video content. Each session contains sequences of videos watched by anonymised users, along with metadata including artist information, title, genre and subgenre classifications from XITE’s expert-curated taxonomy, and watch-time metrics. The dataset also includes XITE’s genre hierarchy and subgenre correlation matrix, representing musical relationships established by music experts. We provide MusicBrainz identifiers where possible to enable connections with external music resources. While we do not include the video content itself, the dataset documents how users engage with music in a video-based environment, which may exhibit interaction patterns that differ from audio-only consumption. To demonstrate the dataset’s research utility, we benchmark a standard playlist continuation task using transformer-based and graph-based models. This contribution allows researchers to develop and evaluate recommendation algorithms for music video consumption and examine how existing methods generalise beyond audio-only datasets to screen-based music experiences.
- SPOT #14: TIM-Rec: Explicit Sparse Feedback on Multi-Item Upselling Recommendations in an Industrial Dataset of Telco Calls
by Alessandro Sbandi, Federico Siciliano, Fabrizio Silvestri
Upselling recommendations play a critical role in improving customer engagement and maximizing revenue in the telecommunications industry. However, real-world data on such interactions often presents unique challenges, including multiple recommendations per call and sparse customer feedback, which complicates the evaluation of recommender systems. Our review of the existing literature reveals a critical gap in publicly available datasets that reflect these challenges, limiting progress in developing and evaluating upselling strategies. This work introduces a novel dataset that captures these complexities, offering valuable insights into customer behavior and recommendation effectiveness. The dataset, derived from real-world interactions between customers and service providers, contains multiple recommendations provided in individual calls and sparse feedback, reflecting typical user behavior where interest may be low or unrecorded. To aid in the development of more effective recommendation systems, we provide detailed statistics on recommendation distributions, user engagement, and feedback patterns. Furthermore, we benchmark various recommendation models, from classical approaches to state-of-the-art neural networks, allowing for a comprehensive assessment of their recommendation accuracy in this challenging setting. The dataset, along with the preprocessing implementations, is publicly available in our GitHub repository.
- SPOT #15: “We Share Our Code Online”: Why This Is Not Enough to Ensure Reproducibility and Progress in Recommender Systems Research
by Faisal Shehzad, Timo Breuer, Maria Maistro, Dietmar Jannach
Issues with reproducibility have been identified as a major factor hampering progress in recommender systems research. In response, researchers increasingly share the code of their models. However, the provision of only the code of the proposed model is usually not sufficient to ensure reproducibility. In many works, the central claim is that a new model is advancing the state-of-the-art. Thus, it is crucial that the entire experiment is reproducible, including the configuration and the results of the considered baselines. With this work, our goal is to gauge the level of reproducibility in algorithms research in recommender systems. We systematically analyzed the reproducibility level of 65 papers published at a top-ranked conference during the last three years. Our results are sobering. While the model code is shared in about two thirds of the papers, the code of the baselines is provided only in eight cases. The hyperparameters of the baselines are reported even less frequently, and how exactly these were determined is not explained in any paper. As a result, it is commonly not only impossible to reproduce the full result tables reported in the papers, but also unclear whether the claimed improvements over the state-of-the-art were actually achieved. Overall, we conclude that the research community has not reached the required level of reproducibility yet. We therefore call for more rigorous reproducibility standards to ensure progress in this field.
Demo Papers
- SPOT #16: APS Explorer: Navigating Algorithm Performance Spaces for Informed Dataset Selection
by Tobias Vente, Michael Heep, Abdullah Abbas, Theodor Sperle, Joeran Beel, Bart Goethals
Dataset selection is crucial for offline recommender system experiments, as mismatched data (e.g., sparse interaction scenarios require datasets with low user-item density) can lead to unreliable results. Yet, 86% of ACM RecSys 2024 papers provide no justification for their dataset choices, with most relying on just four datasets: Amazon (38%), MovieLens (34%), Yelp (15%), and Gowalla (12%). While Algorithm Performance Spaces (APS) were proposed to guide dataset selection, their adoption has been limited due to the absence of an intuitive, interactive tool for APS exploration. Therefore, we introduce the APS Explorer, a web-based visualization tool for interactive APS exploration, enabling data-driven dataset selection. The APS Explorer provides three interactive features: (1) an interactive PCA plot showing dataset similarity via performance patterns, (2) a dynamic meta-feature table for dataset comparisons, and (3) a specialized visualization for pairwise algorithm performance.
- SPOT #17: ArtAICare: An End-to-End Platform for Personalized Art Therapy
by Bereket A. Yilma, Saravanakumar Duraisamy, Stefan Penchev, Tudor Pristav, Luis A. Leiva
We introduce a platform powered by Visual Art recommender systems (VA RecSys) to support art therapy for patients with Post-Intensive Care Syndrome (PICS) or experiencing psychiatric sequelae symptoms such as anxiety, depression, and Post Traumatic Stress Disorder (PTSD). The contribution is threefold: (1) integration of unimodal, multimodal, and cross-domain VA RecSys engines as plug-and-play external APIs for therapeutic art recommendations; (2) development of an end-to-end platform with desktop/mobile/tablet and immersive VR interfaces to connect therapists and patients; and (3) a therapist dashboard providing post-session analytics, including objective and subjective measures, to inform future recommendations. A pilot test with licensed art therapists and patients with PICS demonstrated that the platform enables therapist-supervised personalized therapy, reducing preparation time by 50% and improving affective states by 70.5%.
- SPOT #18: ArtEx: A User-Controllable Web Interface for Visual Art Recommendations
by Rully Agus Hendrawan, Peter Brusilovsky, Luis A. Leiva, Bereket A. Yilma
We introduce a web-based interface for visual art recommendations, empowering users to adjust popularity and diversity through intuitive sliders. Built on the SemArt dataset and leveraging multimodal BLIP features, ArtEx allows users to fine-tune recommendations across dimensions like genre, time period, and artist. This demo paper presents ArtEx’s interactive interface, showcasing its ability to enhance user engagement and satisfaction through transparent, user-driven personalization.
- SPOT #19: Blooming Beats: An Interactive Music Recommender System Grounded in TRACE Principles and Data Humanism
by Ibrahim Al-Hazwani, Daniel Lutziger, Carlos Kirchdorfer, Luca Huber, Oliver Robin Aschwanden, Jürgen Bernard, Ludovico Boratto
Music streaming platforms reduce rich listening experiences to algorithmic black boxes, overlooking personal narratives that make music meaningful. We present Blooming Beats, an explainable recommender system that transforms Spotify listening data into visual narratives using Data Humanism principles. The system embodies TRACE principles: Transparency through visual explanations, Context-awareness by integrating personal context, and Empathy by matching listening stories rather than user profiles. A user study with 8 participants exploring a decade of listening data shows that narrative-driven visualization suggests potential for enhancing transparency and engagement.
- SPOT #20: PRISM: From Individual Preferences to Group Consensus through Conversational AI-Mediated and Visual Explanations
by Ibrahim Al-Hazwani, Oliver Aschwanden, Oana Inel, Jürgen Bernard, Ludovico Boratto
Group accommodation booking forces travelers to coordinate externally through messaging apps and informal voting, missing opportunities for transparent preference alignment. We present PRISM, an interactive group recommender system that transforms opaque recommendation processes into transparent collaborative visual experiences. PRISM employs a two-phase interaction paradigm: individual preference elicitation through conversational AI, followed by collaborative decision-making via bivariate map preference visualization. A controlled user study with 6 pairs shows PRISM enhances transparency (+1.83 on 5-point scale), consensus building (+2.0), and reduces conformity pressure compared to traditional approaches and interfaces.
- SPOT #21: Interactive Playlist Generation from Titles
by Eléa Vellard, Enzo Charolois-Pasqua, Youssra Rebboud, Pasquale Lisena, Raphaël Troncy
This demo presents an interactive playlist recommendation system that relies exclusively on playlist titles. By fine-tuning a transformer-based language model on clustered playlists, we enable real-time playlist generation for a given title, relying on the semantic meaning of known playlists’ and tracks’ titles. The playlist title provided in input is freely expressed in natural language in a user-friendly web interface. The system is lightweight, fast, and fully accessible through a simple web page.
- SPOT #22: Large Language Model-based Recommendation System Agents
by Tommaso Carraro, Brijraj Singh, Niranjan Pedanekar
A Large Language Model-based agent is an AI assistant that makes use of advanced Tool Calling (TC) and Retrieval Augmented Generation (RAG) techniques to access external tools (e.g., Python code, databases). This allows the agent to consult additional sources of information that are complementary to its pre-trained knowledge. By doing so, re-training or fine-tuning of the LLM each time new knowledge becomes available can be avoided, as the assistant can access this information thanks to the available tools. In this demo, we investigate this idea in the Recommendation Systems (RSs) scenario. In particular, we design an AI assistant for recommendation that can access (i) a pre-trained recommender system, (ii) a database, and (iii) a vector store. The demo shows how the assistant is able to interact with these tools to reply to complex recommendation and explanation queries that require reasoning on the tools’ results. To the best of our knowledge, this is the first attempt at designing LLM-based recommendation system agents. The code for this demo paper is available online.
- SPOT #23: Flights Pricelock Fee Recommendation on Online Travel Agent Platform
by Akash Khetan, Narasimha Medeme, Deepak Yadav, Anmol Porwal
In this study, we present a neural network (NN) based recommender system with a novel custom loss function developed to recommend the fee for a pricelock product. Pricelock is a popular add-on product that allows users to lock a flight price and book it later at the same locked price, even if the price increases before the booking is made. The core challenge in enabling this product lies in predicting the magnitude of future price changes over multiple time horizons. We formulate this problem as a multi-task learning (MTL) setup, where price change magnitudes are represented as ordinal categories across several time intervals, each modeled as a separate head. Crucially, we address the ordinal nature of price change buckets by introducing a novel loss function called Learnable Soft Ordinal Regression (L-SORD). Our demo showcases how this system improves both predictive accuracy and revenue performance, enabling more effective price recommendations in a high-stakes, real-world environment. This work highlights the potential of combining MTL architectures with custom loss functions in production-grade pricing recommender systems.
- SPOT #24RecViz: Intuitive Graph-based Visual Analytics for Dataset Exploration and Recommender System Evaluation
by Jackson Dam, Zixuan Yi, Iadh OunisWe present RecViz, a novel web application designed to support qualitative analysis of recommender system performance on large datasets. RecViz offers real-time, interactive graph visualisation of recommendation data, enabling side-by-side comparisons of models through dual graph views. Leveraging GPU acceleration via CUDA and WebGL, it delivers fast, responsive force-directed layouts, even at scale. Unlike prior tools limited to small datasets, RecViz shows the potential to handle large datasets efficiently. For example, it maintains an average of 28 FPS while visualising the full MovieLens-1M dataset, with all 1 million interactions. RecViz is open-source and available on GitHub under the Apache-2.0 licence [7].
- SPOT #25Travel Together, Play Together: Gamifying a Group Recommender System for Tourism
by Patrícia Alves, Joana Neto, Jorge Lima, José Silva, Luís Conceição, Goreti MarreirosGamification is increasingly used in a variety of domains: in education to motivate students’ learning, in healthcare to help patients follow medical indications or build healthy habits, and in tourism to enrich tourists’ experiences. Recommender Systems (RS) are one application area where gamification has been added to motivate and challenge tourists while visiting a destination, but only a few systems use gamification to motivate use of the RS itself, and, to the best of our knowledge, no Group RS (GRS) uses gamification. Psychological aspects, such as personality, are also being studied to enhance recommendations, since they have been shown to produce better results than generic approaches; however, acquiring personality without the social desirability bias associated with questionnaires, or without a great amount of user interaction, remains a challenge. In previous studies, we showed that serious games can be the leverage needed to implicitly acquire tourists’ personality and improve recommendations without the observer’s bias. In this demo, we show how we gamified a GRS-for-tourism prototype using rewards, a virtual pet, and serious games.
- SPOT #26VisualReF: Interactive Image Search Prototype with Visual Relevance Feedback
by Bulat Khaertdinov, Mirela Popa, Nava TintarevIn the absence of interaction history, image recommendations often depend on content-based approaches. Prompted by user queries in natural language, such systems rank items based on the similarity between textual and visual features. However, these approaches typically rely on static queries and do not offer alternative feedback mechanisms. In this paper, we present VisualReF: an interactive image retrieval prototype that introduces visual relevance feedback through fine-grained user annotations. Built on vision-language models (VLMs) for retrieval, our system allows users to label relevant and irrelevant regions in retrieved images. These regions are captioned using a generative vision-language model to refine the query vector. Our work bridges the gap between conventional static image retrieval and interactive, user-guided search by introducing visual relevance feedback. Finally, our prototype contributes to the field of visual recommendation by empowering researchers with practical tools for: (i) collecting region-level visual relevance signals from users, (ii) supporting integration of human feedback into interactive search pipelines, and (iii) explaining how the relevance feedback model perceives user input.
LBR Papers
- SPOT #27A Dual-Key Attention Framework for Sequential Recommendation with Side Information
by Minje Kim, Wooseung Kang, Gun-Woo Kim, Chie Hoon Song, Suwon Lee, Sang-Min ChoiSequential recommendation (SR) aims to predict users’ future interactions based on their historical behavior. Recently, deep learning-based SR models leveraging side information have gained considerable attention. Within these systems, items can be viewed from relation-based and attribute-based perspectives. The relation-based perspective characterizes items based on implicit relationships and contextual dependencies derived from user interactions. The attribute-based perspective defines items using inherent properties, such as category or genre. However, these perspectives are inherently entangled, making separate learning challenging. To address this issue, we propose a dual-key attention framework for sequential recommendation (DK-SR), which effectively learns both relation-based and attribute-based representations. DK-SR employs an attention mechanism with dual keys: one for item-level attention, facilitating relation-based representation learning, and another for attribute-level attention, enhancing attribute-based representation. Extensive experiments on four real-world datasets demonstrate that our model outperforms six state-of-the-art SR models leveraging side information. Additionally, an ablation study validates the contribution of the dual-key mechanism.
- SPOT #28Addressing Multiple Hypothesis Bias in CTR Prediction for Ad Selection
by Oren Sar Shalom, Neil DaftaryPredicting click-through rates (CTR) for candidate advertisements is central to many online recommendation and ad-serving systems. However, selecting top-ranked ads based on predicted CTR (pCTR) inherently introduces a systematic bias: since each pCTR contains random estimation error, ads ranked highest tend to exhibit positive error, leading to overestimation of true CTR and miscalibration. Furthermore, as the number of candidates grows, the extreme order statistics amplify this so-called Multiple Hypothesis Bias. Proper calibration of pCTR ensures that estimated probabilities match observed click frequencies, which is essential for setting accurate bids and maximizing revenue in ad auctions. Without reliable calibration, high-accuracy models can still misprice impressions, resulting in both lost revenue and inefficient budget allocation. In this paper, we (1) formally define the bias arising from ranking by noisy estimates and (2) derive an estimator to correct pCTR by subtracting the expected error under mild distributional assumptions. Experiments on large-scale ad data show significant improvements in calibration metrics across multiple ad settings.
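The overestimation this paper formalizes is easy to reproduce in simulation: rank candidates by a noisy pCTR and the winner’s estimate systematically exceeds its true CTR, increasingly so as the candidate pool grows. The following toy illustration (all settings are assumptions; it is not the paper’s correction estimator) makes the effect concrete:

```python
import random
import statistics

def selection_bias_demo(n_candidates, n_auctions=500, noise=0.02, seed=0):
    """Toy simulation of Multiple Hypothesis Bias: pick the ad with the
    highest noisy pCTR and record how much its estimate exceeds its true
    CTR. The mean gap is positive (the winner's noise is biased upward)
    and grows with the number of competing candidates."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_auctions):
        true_ctrs = [rng.uniform(0.01, 0.05) for _ in range(n_candidates)]
        pctrs = [c + rng.gauss(0.0, noise) for c in true_ctrs]
        winner = max(range(n_candidates), key=lambda i: pctrs[i])
        gaps.append(pctrs[winner] - true_ctrs[winner])  # winner's estimation error
    return statistics.mean(gaps)
```

With these toy settings the top-ranked ad’s pCTR overshoots its true CTR on average, which is exactly the miscalibration a correction term must subtract.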
- SPOT #29Balancing Accuracy and Novelty with Sub-Item Popularity
by Chiara Mallamaci, Aleksandr Vladimirovich Petrov, Alberto Carlo Maria Mancino, Vito Walter Anelli, Tommaso Di Noia, Craig MacdonaldIn the realm of music recommendation, sequential recommenders have shown promise in capturing the dynamic nature of music consumption. A key characteristic of this domain is repetitive listening, where users frequently replay familiar tracks. To capture these repetition patterns, recent research has introduced Personalised Popularity Scores (PPS), which quantify user-specific preferences based on historical frequency. While PPS enhances relevance in recommendation, it often reinforces already-known content, limiting the system’s ability to surface novel or serendipitous items—key elements for fostering long-term user engagement and satisfaction. To address this limitation, we build upon RecJPQ, a Transformer-based framework initially developed to improve scalability in large-item catalogues through sub-item decomposition. We repurpose RecJPQ’s sub-item architecture to model personalised popularity at a finer granularity. This allows us to capture shared repetition patterns across sub-embeddings—latent structures not accessible through item-level popularity alone. We propose a novel integration of sub-ID-level personalised popularity within the RecJPQ framework, enabling explicit control over the trade-off between accuracy and personalised novelty. Our sub-ID-level PPS method (sPPS) consistently outperforms item-level PPS by achieving significantly higher personalised novelty without compromising recommendation accuracy. Code and experiments are publicly available at https://github.com/sisinflab/Sub-id-Popularity.
- SPOT #30Benefiting from Negative yet Informative Feedback by Contrasting Opposing Sequential Patterns
by Veronika Ivanova, Evgeny Frolov, Alexey VasilevWe consider the task of learning from both positive and negative feedback in a sequential recommendation scenario, as both types of feedback are often present in user interactions. Meanwhile, conventional sequential learning models usually focus on considering and predicting positive interactions, ignoring that reducing items with negative feedback in recommendations improves user satisfaction with the service. Moreover, negative feedback can potentially provide a useful signal for more accurate identification of true user interests. In this work, we propose to train two transformer encoders on separate positive and negative interaction sequences. We incorporate both types of feedback into the training objective of the sequential recommender using a composite loss function that includes positive and negative cross-entropy as well as a cleverly crafted contrastive term that helps better model opposing patterns. We demonstrate the effectiveness of this approach in terms of increased true-positive metrics compared to state-of-the-art sequential recommendation methods, while reducing the number of wrongly promoted negative items.
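The composite objective described above can be sketched schematically. The contrastive term below (softplus of the cosine similarity between a user’s positive-sequence and negative-sequence representations) and the weight `lam` are plausible assumptions for illustration, not the paper’s exact formulation:

```python
import math

def cosine(u, v):
    """Cosine similarity between two representation vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def opposing_contrastive(pos_repr, neg_repr):
    """Contrastive term: shrinks as the two encoders' views of the same
    user are pushed apart (lower similarity -> lower loss)."""
    return math.log(1.0 + math.exp(cosine(pos_repr, neg_repr)))

def composite_loss(ce_pos, ce_neg, pos_repr, neg_repr, lam=0.1):
    """Positive and negative cross-entropy plus the weighted contrastive
    term; `lam` is an assumed trade-off hyperparameter."""
    return ce_pos + ce_neg + lam * opposing_contrastive(pos_repr, neg_repr)
```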
- SPOT #31Beyond Clicks: Eye-Tracking Insights into User Responses to Different Recommendation Types
by Georgios Koutroumpas, Matteo Mazzini, Sebastian Idesis, Mireia Masias, Joemon Jose, Sergi Abadal, Ioannis ArapakisModern recommender systems increasingly rely on implicit human feedback to enhance recommendation quality, personalization, and user engagement. In e-commerce, eye-tracking has emerged as a valuable tool for capturing attention and preference, yet little work has explored how users behave across different recommendation categories. In this study, we analyse eye-tracking data from users exposed to four recommendation types—Exact, Substitute, Complement, and Irrelevant—in a query-based setting. Our results reveal consistent patterns: users exhibit predictable, text-focused viewing for Exact and Substitute items, while Complement and Irrelevant items trigger more distributed, exploratory behaviour. Notably, Irrelevant items elicit higher emotional arousal associated with disengagement—a pattern not seen with Complement items, suggesting the latter may increase diversity without harming user experience. These findings highlight the importance of considering recommendation context in user modelling, and provide a foundation for future work on context-aware recommender systems and the use of eye-tracking data.
- SPOT #32Debiasing Implicit Feedback Recommenders via Sliced Wasserstein Distance-based Regularization
by Gustavo Escobedo, David Penz, Markus SchedlRecommendation models often encode users’ sensitive attributes (e.g., gender or age) in their learned representations during training, leading to biased (e.g., stereotypical) recommendations and potential privacy risks. To address this, previous research has predominantly focused on adversarial training to make user representations invariant to sensitive attributes. However, adversarial methods can be unstable and computationally expensive due to additional network parameters. An alternative approach is the use of regularization losses that minimize distributional discrepancies between different demographic groups during training. In particular, the Sliced Wasserstein Distance (SWD) provides a computationally efficient and stable solution for mitigating bias by directly aligning the distributions of user representations across groups. We follow this alternative strategy and propose an in-processing approach to mitigate encoded biases in user representations of implicit feedback-based recommender systems by using SWD-based regularization. We perform extensive experiments targeting the debiasing of the users’ gender on three datasets, ML-1M, LFM2b-DB, and EB-NeRD, from the movie, music, and news domains, respectively. Our results indicate that SWD-based regularization is an effective approach for mitigating encoded biases in user representations while keeping competitive recommendation accuracy.
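The Sliced Wasserstein Distance itself is straightforward to compute: project both point sets onto random directions and compare the sorted one-dimensional projections. A minimal Monte-Carlo sketch of the regularizer applied between demographic groups’ user embeddings (equal group sizes assumed for simplicity):

```python
import math
import random

def sliced_wasserstein(x, y, n_projections=64, seed=0):
    """Monte-Carlo Sliced Wasserstein-2 distance between two equally sized
    sets of d-dimensional points: average squared distance between sorted
    1-D projections over random unit directions, then take the root."""
    rng = random.Random(seed)
    d = len(x[0])
    assert len(x) == len(y)  # equal group sizes keep the 1-D matching trivial
    total = 0.0
    for _ in range(n_projections):
        theta = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(t * t for t in theta))
        theta = [t / norm for t in theta]  # random unit direction
        px = sorted(sum(a * b for a, b in zip(v, theta)) for v in x)
        py = sorted(sum(a * b for a, b in zip(v, theta)) for v in y)
        total += sum((a - b) ** 2 for a, b in zip(px, py)) / len(px)
    return math.sqrt(total / n_projections)
```

Used as a regularizer, this quantity is added to the recommendation loss so that minimizing it pulls the two groups’ embedding distributions together.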
- SPOT #33Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations
by Marco De Nadai, Andreas Damianou, Mounia LalmasExisting video recommender systems rely primarily on user-defined metadata or on low-level visual and acoustic signals extracted by specialised encoders. These low-level features describe what appears on the screen but miss deeper semantics such as intent, humour, and world knowledge that make clips resonate with viewers. For example, is a 30‑second clip simply a singer on a rooftop, or an ironic parody filmed amid the fairy chimneys of Cappadocia, Turkey? Such distinctions are critical to personalised recommendations yet remain invisible to traditional encoding pipelines. In this paper, we introduce a simple, recommendation system‑agnostic zero-finetuning framework that injects high‑level semantics into the recommendation pipeline by prompting an off‑the‑shelf Multimodal Large Language Model (MLLM) to summarise each clip into a rich natural‑language description (e.g. “a superhero parody with slapstick fights and orchestral stabs”), bridging the gap between raw content and user intent. We use MLLM output with a state‑of‑the‑art text encoder and feed it into standard collaborative, content‑based, and generative recommenders. On the MicroLens‑100K dataset, which emulates user interactions with TikTok‑style videos, our framework consistently surpasses conventional video, audio, and metadata features in five representative models. Our findings highlight the promise of leveraging MLLMs as on‑the‑fly knowledge extractors to build more intent‑aware video recommenders.
- SPOT #34Don’t Get Ahead of Yourself: A Critical Study on Data Leakage in Offline Evaluation of Sequential Recommenders
by Huy Hoang Le, Yang Liu, Alan Medlar, Dorota GlowackaWhile previous studies have investigated data leakage in recommendation, their findings have had little impact on research practice. These studies show that data leakage exists, it can inflate evaluation metrics, and may cause pathological outcomes, such as models predicting items from the future. However, temporal leave-one-out, the data splitting strategy most widely used to evaluate sequential recommenders, remains prevalent even though it is known to suffer from data leakage. We found ourselves asking the question: if so many researchers appear unconcerned with data leakage, maybe it’s not such a big deal? In this article, we investigate data leakage in offline evaluation of sequential recommenders. We compare temporal leave-one-out with split-by-timepoint leave-one-out, a comparable data splitting strategy that prevents data leakage. Across four data sets, we show that sampled nDCG@10 drops by 21.7 with split-by-timepoint leave-one-out. This performance drop is primarily due to the absence of data leakage, as controlling for training set size between data splitting strategies yields similar results. Our work highlights the severity of data leakage in sequential recommendation studies and suggests a need to reconsider current research practices and to question the veracity of prior studies.
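The two splitting strategies compared above can be sketched as follows. This is a minimal illustration in which interactions are `(item_id, timestamp)` tuples sorted by time; protocol details beyond the split itself are assumptions:

```python
def temporal_leave_one_out(user_seqs):
    """Hold out each user's last interaction as the test target. Because
    'last' is per-user, an early-acting user's test item can predate other
    users' training interactions -- the leakage the paper critiques."""
    train, test = {}, {}
    for u, seq in user_seqs.items():  # seq sorted by timestamp
        train[u], test[u] = seq[:-1], seq[-1]
    return train, test

def split_by_timepoint(user_seqs, t_split):
    """Leakage-free alternative: all interactions before a single global
    timepoint form the training data; each user's first interaction at or
    after it becomes the test target (users lacking either side are dropped)."""
    train, test = {}, {}
    for u, seq in user_seqs.items():
        before = [x for x in seq if x[1] < t_split]
        after = [x for x in seq if x[1] >= t_split]
        if before and after:
            train[u], test[u] = before, after[0]
    return train, test
```

In the temporal leave-one-out split, nothing prevents one user’s held-out item from occurring before another user’s training interactions; the global timepoint split rules this out by construction.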
- SPOT #35End-to-End Time Interval-wise Segmentation for Sequential Recommendation
by Minje Kim, Wooseung Kang, Gun-Woo Kim, Chie Hoon Song, Suwon Lee, Sang-Min ChoiSequential recommendation aims to predict a user’s next interaction based on their historical behavior. While recent models have achieved remarkable success, they often overlook time intervals between interactions or rely on fixed thresholds for session segmentation, which can lead to suboptimal results. To address these limitations, several approaches incorporate time intervals via relative positional embeddings or session segmentation based on fixed thresholds. However, these methods are highly sensitive to threshold selection and are prone to inaccurate segmentation. Motivated by these challenges, we propose TiSRec, a Time Interval-wise Segmentation framework that dynamically divides user sequences into Local Preference Blocks (LPBs) by selecting significant time intervals. TiSRec captures evolving user preferences through intra-block and inter-block encoders. Experiments on four real-world datasets demonstrate that TiSRec consistently outperforms state-of-the-art methods, and ablation studies confirm the effectiveness of LPB-based modeling.
- SPOT #36eSASRec: Enhancing Transformer-based Recommendations in a Modular Fashion
by Daria Tikhonovich, Nikita Zelinskiy, Aleksandr V. Petrov, Mayya Spirina, Andrei Semenov, Andrey V. Savchenko, Sergei KulievSince their introduction, Transformer-based models, such as SASRec and BERT4Rec, have become common baselines for sequential recommendations, surpassing earlier neural and non-neural methods. A number of following publications have shown that the effectiveness of these models can be improved by, for example, slightly updating the architecture of the Transformer layers, using better training objectives, and employing improved loss functions. However, the additivity of these modular improvements has not been systematically benchmarked – this is the gap we aim to close in this paper. Through our experiments, we identify a very strong model that uses SASRec’s training objective, LiGR Transformer layers, and Sampled Softmax Loss. We call this combination eSASRec (Enhanced SASRec). While we primarily focus on realistic, production-like evaluation, in our preliminary study we find that common academic benchmarks show eSASRec to be 23% more effective than the most recent state-of-the-art models, such as ActionPiece. In our main production-like benchmark, eSASRec resides on the Pareto frontier in terms of the accuracy–coverage tradeoff (alongside the recent industrial models HSTU and FuXi-α). As the modifications compared to the original SASRec are relatively straightforward and no extra features are needed (such as timestamps in HSTU), we believe that eSASRec can be easily integrated into existing recommendation pipelines and can serve as a strong yet very simple baseline for emerging complicated algorithms. To facilitate this, we provide the open-source implementations for our models and benchmarks in the repository https://github.com/blondered/transformer_benchmark
- SPOT #37Evaluating Podcast Recommendations with Profile-Aware LLM-as-a-Judge
by Francesco Fabbri, Gustavo Penha, Edoardo D’Amico, Alice Wang, Marco De Nadai, Jackie Doremus, Paul Gigioli, Andreas Damianou, Oskar Stål, Mounia LalmasEvaluating personalized recommendations remains a central challenge, especially in long-form audio domains like podcasts, where traditional offline metrics suffer from exposure bias and online methods such as A/B testing are costly and operationally constrained. In this paper, we propose a novel framework that leverages Large Language Models (LLMs) as offline judges to assess the quality of podcast recommendations in a scalable and interpretable manner. Our two-stage profile-aware approach first constructs natural-language user profiles distilled from 90 days of listening history. These profiles summarize both topical interests and behavioral patterns, serving as compact, interpretable representations of user preferences. Rather than prompting the LLM with raw data, we use these profiles to provide high-level, semantically rich context—enabling the LLM to reason more effectively about alignment between a user’s interests and recommended episodes. This reduces input complexity and improves interpretability. The LLM is then prompted to deliver fine-grained pointwise and pairwise judgments based on the profile-episode match. In a controlled study with 47 participants, our profile-aware judge matched human judgments with high fidelity and outperformed or matched a variant using raw listening histories. The framework enables efficient, profile-aware evaluation for iterative testing and model selection in recommender systems.
- SPOT #38Fine-tuning for Inference-efficient Calibrated Recommendations
by Oleg Lesota, Adrian Bajko, Max Walder, Matthias Wenzel, Antonela Tommasel, Markus SchedlCalibration is the degree to which a recommender system is able to match the distribution of a certain item attribute among the items consumed by a user with their respective recommendations. Recent work suggests that many recommenders tend to provide miscalibrated recommendations. Furthermore, most approaches aimed at improving calibration adopt the post-processing paradigm, making them computationally costly at inference time. This work proposes CaliTune, a fine-tuning approach applied to collaborative filtering-based recommenders to allow them to generate better-calibrated recommendations without relying on costly post-processing. We compare CaliTune to an established post-processing approach on two backbone models and datasets from the movie and music domains, focusing on popularity calibration. Our results suggest that CaliTune can offer a competitive accuracy–calibration trade-off in several settings, particularly when the backbone model exhibits high miscalibration and accuracy remains important, making it a promising inference-efficient alternative in such cases.
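Popularity calibration is typically quantified by comparing the attribute distribution of a user’s history against that of their recommendation list. A common Steck-style miscalibration measure, shown here over popularity buckets as one plausible choice (not necessarily the paper’s exact metric):

```python
import math

def calibration_kl(p_hist, q_rec, alpha=0.01):
    """KL(p || q~) between the popularity-bucket distribution of a user's
    history (p) and of their recommendations (q), with the smoothing
    q~ = (1 - alpha) * q + alpha * p to avoid division by zero.
    Lower values mean better-calibrated recommendations."""
    total = 0.0
    for p, q in zip(p_hist, q_rec):
        if p == 0.0:
            continue  # 0 * log(0/...) contributes nothing
        q_tilde = (1.0 - alpha) * q + alpha * p
        total += p * math.log(p / q_tilde)
    return total
```

A recommender whose lists mirror the user’s historical popularity mix scores near zero; one that floods a niche-leaning user with popular items scores high, which is the gap fine-tuning aims to close without post-processing.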
- SPOT #39From Previous Plays to Long-Term Tastes
by Robin Ungruh, Alejandro Bellogín, Maria Soledad PeraStudying the interplay of children and recommender systems (RS) is ethically and practically challenging, making simulation a promising alternative for exploration. However, recent simulation approaches that aim to model natural user-RS interactions typically rely on behavioral data and assume that user preferences remain consistent over time—an assumption that may not hold for children who undergo continuous developmental changes. With that in mind, we explore the extent to which simulations based on historical data can meaningfully reflect children’s long-term consumption patterns. We do this via a simulation study using real-world data in which user behavior is modeled from observed listening preferences. Specifically, we probe whether simulation mirrors user preferences over time by comparing with organic (i.e., real) consumption patterns. Our findings offer a critical reflection on the reliability of simulation-based RS research for children and question the reliability of using behavioral assumptions to model users.
- SPOT #40How Fair is Your Diffusion Recommender Model?
by Daniele Malitesta, Giacomo Medda, Erasmo Purificato, Mirko Marras, Fragkiskos Malliaros, Ludovico BorattoDiffusion-based learning has established itself as a rising paradigm in generative recommendation, outperforming traditional approaches built upon variational autoencoders and generative adversarial networks. Despite their effectiveness, concerns have been raised that diffusion models – widely adopted in other machine-learning domains – could potentially lead to unfair outcomes, since they are trained to recover data distributions that often encode inherent biases. Motivated by the related literature, and acknowledging the extensive discussion around bias and fairness aspects in recommendation, we propose, to the best of our knowledge, the first empirical study of fairness for DiffRec, chronologically the pioneer technique in diffusion-based recommendation. Our empirical study involves DiffRec and its variant L-DiffRec, tested against nine recommender systems on two benchmarking datasets to assess recommendation utility and fairness from both consumer and provider perspectives. Specifically, we first evaluate the utility and fairness dimensions separately and, then, within a multi-criteria setting to investigate whether, and to what extent, these approaches can achieve a trade-off between the two. While showing worrying trends in alignment with the more general machine-learning literature on diffusion models, our results also indicate promising directions to address the unfairness issue in future work. The source code is available at https://github.com/danielemalitesta/FairDiffRec.
- SPOT #41Investigating Carbon Footprint of Recommender Systems Beyond Training Time
by Josef Schodl, Oleg Lesota, Antonela Tommasel, Markus SchedlThe environmental footprint of recommender systems has received growing attention in the research community. While recent work has examined the trade-off between model accuracy and the estimated carbon emissions during training, we argue that a comprehensive evaluation should also account for the emissions produced during inference time, especially in applications where models are deployed for extended periods with frequent inference cycles. In this study, we extend previous carbon footprint analyses from the literature by incorporating the inference phase into the carbon footprint assessment and exploring how variations in training configurations affect emissions. Our findings reveal that models with higher training emissions can, in some cases, offer lower environmental costs at inference time. Moreover, we show that minimizing the number of validation metrics computed during training can lead to significant reductions in overall carbon footprint, highlighting the importance of thoughtful experimental design in sustainable machine learning.
- SPOT #42Learning geometry-aware recommender systems with manifold regularization
by Zaira Zainulabidova, Julia Borisova, Alexander HvatovRecent work shows that hyperbolic geometry may be a better option for recommendation systems in some cases due to the natural hierarchy present in user demands. However, the choice of geometry often determines the model architecture by fixing the type of embedding. This paper discusses the manifold regularization problem statement, which allows for preserving the original architecture and standard embeddings while imposing a non-strict geometry constraint. We demonstrate the use of hyperbolic geometry for neural collaborative filtering in two distinct recommendation tasks based on multilayer perceptron (MLP) networks: top-k recommendation and explicit rating prediction. For a more comprehensive architecture, we also test SASRec. All tasks are evaluated on the Amazon Reviews and MovieLens-1M datasets. Experiments show that manifold regularization achieves performance comparable to hyperbolic embeddings on datasets with hierarchical structure without requiring changes to the model architecture and thus leaves initial model inference unaffected.
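One way to impose a non-strict geometry constraint while keeping standard Euclidean embeddings is a soft penalty on deviation from a hyperbolic manifold, added to the task loss with a weight. The Lorentz-hyperboloid form below is a hypothetical sketch of such a regularizer, not necessarily the paper’s formulation:

```python
def hyperboloid_penalty(emb):
    """Soft manifold regularizer: penalize deviation of a Euclidean
    embedding (x0, x1, ..., xd) from the Lorentz hyperboloid
    <x, x>_L = -x0^2 + sum_i xi^2 = -1. Points on the manifold incur
    zero penalty; off-manifold points are pulled toward it without any
    change to the model architecture. Hypothetical sketch."""
    minkowski = -emb[0] ** 2 + sum(x * x for x in emb[1:])
    return (minkowski + 1.0) ** 2
```

During training the total loss would be something like `task_loss + weight * hyperboloid_penalty(embedding)`, so inference over the unchanged architecture is unaffected.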
- SPOT #43Leveraging Geometric Insights in Hyperbolic Triplet Loss for Improved Recommendations
by Viacheslav Yusupov, Maxim Rakhuba, Evgeny FrolovRecent studies have demonstrated the potential of hyperbolic geometry for capturing complex patterns from interaction data in recommender systems. In this work, we introduce a novel hyperbolic recommendation model that uses geometric insights to improve representation learning while increasing computational stability. We reformulate the notion of hyperbolic distances to unlock additional representation capacity over conventional Euclidean space and learn more expressive user and item representations. To better capture user-item interactions, we construct a triplet loss that models ternary relations between users and their corresponding preferred and non-preferred choices through a mix of pairwise interaction terms driven by the geometry of data. Our hyperbolic approach not only outperforms existing Euclidean and hyperbolic models but also reduces popularity bias, leading to more diverse and personalized recommendations.
- SPOT #44Lift It Up Right: A Recommender System for Safer Lifting Postures
by Gaetano Dibenedetto, Pasquale Lops, Marco Polignano, Helma TorkamaanWork-related musculoskeletal disorders, often caused by poor lifting posture and unsafe manual handling, continue to pose a significant threat to worker health and safety. This paper presents a health recommender system designed to prevent injury by assessing and correcting posture for lifting techniques. Leveraging monocular video input, our method estimates key ergonomic parameters to compute the Lifting Index based on the Revised NIOSH Lifting Equation. When the computed Lifting Index exceeds a predefined safety threshold, the system automatically generates graphical and textual recommendations to guide the worker towards safer postural strategies. This safety-aware recommender system provides interpretable and actionable feedback without requiring wearable sensors or multi-camera setups, making it suitable for deployment in real-world workplace environments. By integrating ergonomics with recommender system design, we contribute to a new class of context-aware, safety-oriented recommendation technologies tailored for occupational health.
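The Revised NIOSH Lifting Equation behind the Lifting Index is publicly specified. A minimal metric-form sketch follows; the frequency (FM) and coupling (CM) multipliers come from lookup tables and are passed in directly here, and input ranges are only lightly guarded:

```python
def revised_niosh_rwl(h_cm, v_cm, d_cm, a_deg, fm=1.0, cm=1.0):
    """Recommended Weight Limit (kg), metric form of the Revised NIOSH
    Lifting Equation: RWL = LC * HM * VM * DM * AM * FM * CM, where
    H is horizontal distance, V vertical origin height, D travel
    distance (all cm) and A the asymmetry angle in degrees."""
    lc = 23.0                                # load constant, kg
    hm = 25.0 / max(h_cm, 25.0)              # horizontal multiplier (H >= 25 cm)
    vm = 1.0 - 0.003 * abs(v_cm - 75.0)      # vertical multiplier
    dm = 0.82 + 4.5 / max(d_cm, 25.0)        # distance multiplier (D >= 25 cm)
    am = 1.0 - 0.0032 * a_deg                # asymmetric multiplier
    return lc * hm * vm * dm * am * fm * cm

def lifting_index(load_kg, rwl_kg):
    """LI = load / RWL; LI > 1 indicates elevated injury risk, which is
    when a system like the one described would trigger recommendations."""
    return load_kg / rwl_kg
```

For an ideal lift (H=25 cm, V=75 cm, D=25 cm, A=0°, good coupling, low frequency) every multiplier is 1 and the RWL equals the 23 kg load constant.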
- SPOT #45The Hidden Cost of Defaults in Recommender System Evaluation
by Hannah Berling, Robin Svahn, Alan SaidHyperparameter optimization is critical for improving the performance of recommender systems, yet its implementation is often treated as a neutral or secondary concern. In this work, we shift focus from model benchmarking to auditing the behavior of RecBole, a widely used recommendation framework. We show that RecBole’s internal defaults, particularly an undocumented early-stopping policy, can prematurely terminate Random Search and Bayesian Optimization. This limits search coverage in ways that are not visible to users. Using six models and two datasets, we compare search strategies and quantify both performance variance and search path instability. Our findings reveal that hidden framework logic can introduce variability comparable to the differences between search strategies. These results highlight the importance of treating frameworks as active components of experimental design and call for more transparent, reproducibility-aware tooling in recommender systems research. We provide actionable recommendations for researchers and developers to mitigate hidden configuration behaviors and improve the transparency of hyperparameter tuning workflows.
- SPOT #46Mitigating Popularity Bias in Counterfactual Explanations using Large Language Models
by Arjan Hasami, Masoud MansouryCounterfactual explanations (CFEs) offer a tangible and actionable way to explain recommendations by showing users a “what-if” scenario that demonstrates how small changes in their history would alter the system’s output. However, existing CFE methods are susceptible to bias, generating explanations that might misalign with the user’s actual preferences. In this paper, we propose a pre-processing step that leverages large language models to filter out-of-character history items before generating an explanation. In experiments on two public datasets, we focus on popularity bias and apply our approach to ACCENT, a neural CFE framework. We find that it creates counterfactuals that are more closely aligned with each user’s popularity preferences than ACCENT alone.
- SPOT #47Opening the Black Box: Interpretable Remedies for Popularity Bias in Recommender Systems
by Parviz Ahmadov, Masoud MansouryPopularity bias is a well-known challenge in recommender systems, where a small number of popular items receive disproportionate attention, while the majority of less popular items are largely overlooked. This imbalance often results in reduced recommendation quality and unfair exposure of items. Although existing mitigation techniques address this bias to some extent, they typically lack transparency in how they operate. In this paper, we propose a post-hoc method using a Sparse Autoencoder (SAE) to interpret and mitigate popularity bias in deep recommendation models. The SAE is trained to replicate a pre-trained model’s behavior while enabling neuron-level interpretability. By introducing synthetic users with clear preferences for either popular or unpopular items, we identify neurons encoding popularity signals based on their activation patterns. We then adjust the activations of the most biased neurons to steer recommendations toward fairer exposure. Experiments on two public datasets using a sequential recommendation model show that our method significantly improves fairness with minimal impact on accuracy. Moreover, it offers interpretability and fine-grained control over the fairness–accuracy trade-off.
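The neuron-identification step described above can be illustrated schematically: rank hidden units by their mean activation gap between the two groups of synthetic users, then dampen the most biased units. This is an illustrative sketch only; the paper’s SAE training and steering specifics are not reproduced here:

```python
def popularity_neurons(acts_popular_users, acts_unpopular_users, top_k=1):
    """Rank hidden units by the absolute gap in mean activation between
    synthetic users preferring popular items and those preferring
    unpopular items; return the indices of the top_k most biased units."""
    n = len(acts_popular_users[0])
    gaps = []
    for j in range(n):
        mp = sum(a[j] for a in acts_popular_users) / len(acts_popular_users)
        mu = sum(a[j] for a in acts_unpopular_users) / len(acts_unpopular_users)
        gaps.append((abs(mp - mu), j))
    return [j for _, j in sorted(gaps, reverse=True)[:top_k]]

def dampen(activations, neurons, scale=0.0):
    """Steer recommendations by scaling down the identified neurons;
    `scale` gives fine-grained control over the fairness-accuracy trade-off."""
    out = list(activations)
    for j in neurons:
        out[j] *= scale
    return out
```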
- SPOT #48: Normative Alignment of Recommender Systems via Internal Label Shift
by Johannes Kruse, Kasper Lindskow, Michael Riis Andersen, Ryotaro Shimizu, Julian McAuley, Pierre-Alexandre Mattei, Jes Frellsen
Recommender systems optimized solely for user engagement often fail to meet broader normative objectives such as fairness, diversity, or editorial values. We introduce NAILS (Normative Alignment of recommender systems via Internal Label Shift), a simple and scalable method for aligning recommendation outputs with target distributions over item-level attributes (e.g., categories). NAILS modifies the user-conditional item distribution to induce a specified marginal distribution over attributes, leveraging existing user–item preferences without retraining the model. To achieve this, we recast the problem as a form of label shift applied internally within a hierarchical classification framework. Adopting a stakeholder-centric perspective, NAILS enables alignment with global normative goals. Empirically, we show that NAILS consistently improves attribute-level alignment with minimal impact on user engagement, providing a practical mechanism for value-driven recommendation. Our code is available at https://github.com/johanneskruse/nails.
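The core reweighting idea can be illustrated with a deliberately simplified sketch: rescale each item's user-conditional probability by the ratio of the target attribute mass to the currently induced attribute mass, then renormalize. This is my own flattened rendering of the label-shift intuition, not NAILS itself (which operates inside a hierarchical classification framework):

```python
def align_to_target(scores, attr_of, target):
    """Reweight p(item | user) so the induced marginal over item attributes
    matches a target distribution, then renormalize.
    scores:  {item: probability}
    attr_of: {item: attribute}
    target:  {attribute: desired mass} (should sum to 1 over present attributes)"""
    # Current marginal over attributes induced by the scores.
    marginal = {}
    for item, p in scores.items():
        a = attr_of[item]
        marginal[a] = marginal.get(a, 0.0) + p
    # Label-shift-style importance ratio: target mass / current mass.
    adjusted = {item: p * target[attr_of[item]] / marginal[attr_of[item]]
                for item, p in scores.items()}
    z = sum(adjusted.values())
    return {item: p / z for item, p in adjusted.items()}
```

Within each attribute, the relative ordering of items (the existing user–item preferences) is preserved; only the mass allocated to each attribute changes, which matches the abstract's claim of alignment without retraining.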
- SPOT #49: PAIRSAT: Integrating Preference-Based Signals for User Satisfaction Estimation in Dialogue Systems
by Eran Fainman, Adir Solomon, Osnat Mokryn
User satisfaction estimation in dialogue systems is a fundamental measure for assessing and improving conversational-AI quality and user experience. Current approaches rely on users’ satisfaction annotations, referred to as supervised labels. Yet these labels are scarce, costly to collect, and often domain-specific. Another form of feedback arises when a user selects one of two offered responses in a conversation, usually called a preference signal. In this work, we propose PAIRSAT, a new model for user-satisfaction estimation that integrates both satisfaction labels and preference signals. We reformulate satisfaction prediction as a bounded regression task on a continuous scale, enabling fine-grained modeling of satisfaction levels. To exploit the preference data, we incorporate a pairwise ranking loss that encourages higher predicted satisfaction for accepted conversation responses over rejected ones. PAIRSAT jointly optimizes regression on labeled data and ranking on preference pairs using a Transformer-based encoder. Experiments demonstrate that our model outperforms baselines that rely solely on supervised satisfaction labels, highlighting the value of adding preference signals. Further, our results underscore the value of leveraging additional signals for satisfaction estimation in dialogue systems.
- SPOT #50: Parameter-Efficient Single Collaborative Branch for Recommendation
by Marta Moscati, Shah Nawaz, Markus Schedl
Recommender Systems (RS) often rely on representations of users and items in a joint embedding space and on a similarity metric to compute relevance scores. In modern RS, the modules to obtain user and item representations consist of two distinct and separate neural networks (NN). In multimodal representation learning, weight sharing has been proven effective in reducing the distance between multiple modalities of the same item. Inspired by these approaches, we propose a novel RS that leverages weight sharing between the user and item NN modules used to obtain the latent representations in the shared embedding space. The proposed framework consists of a single Collaborative Branch for Recommendation (CoBraR). We evaluate CoBraR by means of quantitative experiments on e-commerce and movie recommendation. Our experiments show that by reducing the number of parameters and improving beyond-accuracy aspects without compromising accuracy, CoBraR has the potential to be applied and extended for real-world scenarios.
- SPOT #51: Probabilistic Modeling, Learnability and Uncertainty Estimation for Interaction Prediction in Movie Rating Datasets
by Jennifer Poernomo, Nicole Gabrielle Lee Tan, Rodrigo Alves, Antoine Ledent
In this paper, we examine the hypothesis that the interactions recorded in many Recommendation Systems datasets are distributed according to a low-rank distribution, i.e. a mixture of factorizable distributions. Surprisingly, we find that on several popular datasets, a simple non-negative matrix factorization method equals or outperforms more modern methods such as LightGCN, which indicates that the sampling distribution over interactions is indeed low-rank. Furthermore, we mathematically prove that low-rank distributions are learnable with a sparse number of observations (where m/n and r refer to the number of users/items and the non-negative rank respectively) both in terms of the total variation norm and in terms of the expected recall at k, arguably providing some of the first generalization bounds for recommender systems in the implicit feedback setting. We also provide a modified version of the NMF algorithm which provides further performance improvements compared to the standard NMF baseline on the smaller datasets considered. Finally, we propose the theoretically grounded concept of empirical expected recall as an uncertainty estimate for probabilistic models of the recommendation task, and demonstrate its success in a setting where user-wise abstentions are allowed.
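The "simple non-negative matrix factorization" baseline referenced above is typically the classic multiplicative-update scheme of Lee and Seung. A plain-Python sketch of that standard baseline (not the paper's modified variant) looks like this:

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def nmf(V, rank, iters=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V ~ W @ H via multiplicative updates,
    which keep W and H elementwise non-negative throughout."""
    rng = random.Random(seed)
    m, n = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(rank)]
    for _ in range(iters):
        WT = transpose(W)
        num, den = matmul(WT, V), matmul(matmul(WT, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)]
             for i in range(rank)]
        HT = transpose(H)
        num, den = matmul(V, HT), matmul(W, matmul(H, HT))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(m)]
    return W, H
```

On an exact rank-1 non-negative matrix the updates recover the factorization almost immediately, which is what makes this such a strong sanity check against heavier graph-based models.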
- SPOT #52: Recommendation Is a Dish Better Served Warm
by Danil Gusak, Nikita Sukhorukov, Evgeny Frolov
In modern recommender systems, experimental settings typically include filtering out cold users and items based on a minimum interaction threshold. However, these thresholds are often chosen arbitrarily and vary widely across studies, leading to inconsistencies that can significantly affect the comparability and reliability of evaluation results. In this paper, we systematically explore the cold-start boundary by examining the criteria used to determine whether a user or an item should be considered cold. Our experiments incrementally vary the number of interactions for different items during training, and gradually update the length of user interaction histories during inference. We investigate the thresholds across several widely used datasets, commonly used in recent papers from top-tier conferences, and on multiple established recommender baselines. Our findings show that inconsistent selection of cold-start thresholds can either result in the unnecessary removal of valuable data or lead to the misclassification of cold instances as warm, introducing more noise into the system.
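The thresholding discussed above is usually applied as iterative "k-core" filtering, since removing a cold user can push an item below its own threshold and vice versa. A minimal sketch of that common preprocessing step (the paper varies `min_user_inter` / `min_item_inter`, names chosen here for illustration):

```python
from collections import Counter

def core_filter(interactions, min_user_inter, min_item_inter):
    """Iteratively drop users/items below their interaction thresholds
    until the remaining set of (user, item) pairs is stable."""
    data = set(interactions)
    while True:
        user_counts = Counter(u for u, _ in data)
        item_counts = Counter(i for _, i in data)
        kept = {(u, i) for u, i in data
                if user_counts[u] >= min_user_inter
                and item_counts[i] >= min_item_inter}
        if kept == data:
            return kept
        data = kept  # removals may cascade, so repeat
```

The cascade is exactly why arbitrary thresholds matter: raising `min_user_inter` by one can silently discard items (and then further users) well beyond the directly filtered ones.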
- SPOT #53: Recurrent Autoregressive Linear Model for Next-Basket Recommendation
by Tereza Zmeškalová, Antoine Ledent, Martin Spišák, Pavel Kordík, Rodrigo Alves
Next-basket recommendation aims to predict the (sets of) items that a user is most likely to purchase during their next visit, capturing both short-term sequential patterns and long-term user preferences. However, effectively modeling these dynamics remains a challenge for traditional methods, which often struggle with interpretability and computational efficiency, particularly when dealing with intricate temporal dependencies and inter-item relationships. In this paper, we propose ReALM, a Recurrent Autoregressive Linear Model that explicitly captures temporal item-to-item dependencies across multiple time steps. By leveraging a recurrent loss function and a closed-form optimization solution, our approach offers both interpretability and scalability while maintaining competitive accuracy. Experimental results on real-world datasets demonstrate that ReALM outperforms several state-of-the-art baselines in both recommendation quality and efficiency, offering a robust and interpretable solution for modern personalization systems.
- SPOT #54: Rethinking Subjective Features in Recommender Systems: Personal Views Over Aggregated Values
by Arsen Matej Golubovikj, Marko Tkalčič
Subjective features of content items, such as emotional resonance and aesthetic quality, have become increasingly important in recommender systems (RecSys), as the field moves beyond objective content and behavioral signals. Traditionally, such features were treated as fixed item-level properties, aggregated across users. However, emerging evidence suggests that subjective features are inherently user-dependent, shaped by individual interpretations and personal perspectives. This paper presents the first direct comparison between fixed (aggregated) and user-specific (subjective) item representations for modeling subjective features in RecSys. Using three datasets spanning movies, videos, and images, with subjective features, such as eudaimonia, hedonia, emotion, and aesthetics, we evaluate the impact of the representation strategy (i.e. fixed vs. user-specific) on recommendation performance across multiple algorithms. Our findings show that user-specific representations consistently outperform aggregate ones, often with statistically significant improvements. These results underscore the importance of modeling subjectivity at the user level, offering concrete guidance for more personalized and effective recommendation systems.
- SPOT #55: RicciFlowRec: A Geometric Root Cause Recommender Using Ricci Curvature on Financial Graphs
by Zhongtian Sun, Anoushka Harit
We propose RicciFlowRec, a geometric recommendation framework that performs root cause attribution via Ricci curvature and flow on dynamic financial graphs. By modelling evolving interactions among stocks, macroeconomic indicators, and news, we quantify local stress using discrete Ricci curvature and trace shock propagation via Ricci flow. Curvature gradients reveal causal substructures, informing a structural risk-aware ranking function. Preliminary results on S&P 500 data with FinBERT-based sentiment show improved robustness and interpretability under synthetic perturbations. This ongoing work supports curvature-based attribution and early-stage risk-aware ranking, with plans for portfolio optimization and return forecasting. To our knowledge, RicciFlowRec is the first recommender to apply geometric flow-based reasoning in financial decision support.
- SPOT #56: SAGEA: Sparse Autoencoder-based Group Embeddings Aggregation for Fairness-Preserving Group Recommendations
by Vít Koštejn, Ladislav Peška, Martin Spišák
Group recommender systems (GRS) deliver suggestions to users who plan to engage in activities together, rather than individually. To be effective, they must reflect shared group interests while maintaining fairness by accounting for the preferences of individual members. Traditional approaches address fairness through post-processing, aggregating the recommendations after they are generated for each group member. However, this strategy adds significant complexity and offers only limited impact due to its late position in the GRS pipeline. In contrast, we propose an efficient in-processing method combining (1) monosemantic sparse user representations generated via a sparse autoencoder (SAE) bridge module, and (2) fairness-preserving group profile aggregation strategies. By leveraging disentangled representations, our Sparse Autoencoder-based Group Embeddings Aggregation (SAGEA) approach enables transparent, fairness-preserving profile aggregation within the GRS process. Experiments show that SAGEA improves both recommendation accuracy and fairness over profile and results aggregation baselines, while being more efficient than post-processing techniques.
- SPOT #57: Semantic IDs for Joint Generative Search and Recommendation
by Gustavo Penha, Edoardo D’Amico, Marco De Nadai, Enrico Palumbo, Alexandre Tamborrino, Ali Vardasbi, Max Lefarov, Shawn Lin, Timothy Heath, Francesco Fabbri, Hugues Bouchard
Generative models powered by Large Language Models (LLMs) are emerging as a unified solution for powering both recommendation and search tasks. A key design choice in these models is how to represent items, traditionally through unique identifiers (IDs) and more recently with Semantic IDs composed of discrete codes, obtained from embeddings. While task-specific embedding models can improve performance for individual tasks, they may not generalize well in a joint setting. In this paper, we explore how to construct Semantic IDs that perform well both in search and recommendation when using a unified model. We compare a range of strategies to construct Semantic IDs, looking into task-specific and cross-task approaches, and also whether each task should have its own semantic ID tokens in a joint search and recommendation generative model. Our results show that using a bi-encoder model fine-tuned on both search and recommendation tasks to obtain item embeddings, followed by the construction of a unified Semantic ID space, provides an effective trade-off, enabling strong performance in both tasks. We hope these findings spark follow-up work on generalisable, semantically grounded ID schemes and inform the next wave of unified generative recommender architectures.
- SPOT #58: SlateLLM: Distilling LLM Semantics into Session-Aware Slate Recommendation without Inference Overhead
by Aayush Roy, Elias Tragos, Aonghus Lawlor, Neil Hurley
Session-based slate recommendation systems curate ranked sets of items in real-time, adapting to evolving user interactions. Balancing relevance, diversity, and novelty remains challenging for reinforcement learning (RL) methods. Recent advances in large language models (LLMs) offer a new possibility to leverage their semantic reasoning capabilities to refine slate composition. In this work, we examine the impact of LLM-driven reasoning on slate generation by integrating LLMs with an RL-based slate recommender and evaluating in terms of accuracy, similarity, diversity, and novelty. We extend the RecSim framework with real-world interaction data and introduce a session-aware evaluation protocol that captures long-term engagement. Our analysis reveals that LLM reasoning enhances subcategory-level diversity while maintaining relevance, leading to increased user engagement. By visualizing category-level shifts in slate composition, we uncover systematic patterns in how LLMs refine recommendation diversity. Although direct LLM use during inference may be hampered by computational demands and latency concerns, our experimental results demonstrate that integrating LLM modifications during training enables the model to internalize the nuanced characteristics of LLM reasoning without incurring inference overhead, thereby improving recommendation performance, serving time efficiency, and deployability.
- SPOT #59: Meta Off-Policy Estimation
by Olivier Jeunen
Off-policy estimation (OPE) methods enable unbiased offline evaluation of recommender systems, directly estimating the online reward some target policy would have obtained, from offline data and with statistical guarantees. The theoretical elegance of the framework combined with practical successes have led to a surge of interest, with many competing estimators now available to practitioners and researchers. Among these, Doubly Robust methods provide a prominent strategy to combine value- and policy-based estimators. In this work, we take an alternative perspective to combine a set of OPE estimators and their associated confidence intervals into a single, more accurate estimate. Our approach leverages a correlated fixed-effects meta-analysis framework, explicitly accounting for dependencies among estimators that arise due to shared data. This yields a best linear unbiased estimate (BLUE) of the target policy’s value, along with an appropriately conservative confidence interval that reflects inter-estimator correlation. We validate our method on both simulated and real-world data, demonstrating improved statistical efficiency over existing individual estimators.
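The fixed-effects combination has a standard closed form: given estimates θ and their covariance Σ, the BLUE weights are w = Σ⁻¹1 / (1ᵀΣ⁻¹1), with combined variance 1 / (1ᵀΣ⁻¹1). A self-contained sketch of that textbook formula (not the paper's full pipeline, which must also estimate Σ from the shared log data):

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def blue_combine(estimates, cov):
    """Best linear unbiased combination of correlated estimators:
    weights proportional to the rows of inv(cov) summed, i.e. inv(cov) @ 1."""
    x = solve(cov, [1.0] * len(estimates))   # Sigma^{-1} 1
    denom = sum(x)                           # 1' Sigma^{-1} 1
    weights = [xi / denom for xi in x]
    combined = sum(w * t for w, t in zip(weights, estimates))
    return combined, weights, 1.0 / denom    # estimate, weights, variance
```

Note how positive correlation inflates the combined variance: two estimators with unit variance give variance 0.5 when independent, but 0.75 at correlation 0.5, which is exactly the "appropriately conservative" behaviour the abstract describes.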
- SPOT #60: t-Testing the Waters: Empirically Validating Assumptions for Reliable A/B-Testing
by Olivier Jeunen
A/B-tests are a cornerstone of experimental design on the web, with wide-ranging applications and use-cases. The statistical t-test comparing differences in means is the most commonly used method for assessing treatment effects, often justified through the Central Limit Theorem (CLT). The CLT guarantees that, as the sample size grows, the sampling distribution of the Average Treatment Effect converges to normality, making the t-test valid for sufficiently large sample sizes. When outcome measures are skewed or non-normal, quantifying what “sufficiently large” entails is not straightforward. To ensure that confidence intervals maintain proper coverage and that p-values accurately reflect the false positive rate, it is critical to validate this normality assumption. We propose a practical method to test this, by analysing repeatedly resampled A/A-tests. When the normality assumption holds, the resulting p-value distribution should be uniform, and this property can be tested using the Kolmogorov-Smirnov test. This provides an efficient and effective way to empirically assess whether the t-test’s assumptions are met, and the A/B-test is valid. We demonstrate our methodology and highlight how it helps to identify scenarios prone to inflated Type-I errors. Our approach provides a practical framework to ensure and improve the reliability and robustness of A/B-testing practices.
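The resampled A/A-test recipe is simple enough to sketch end-to-end: repeatedly split null data into two arms, collect the t-test p-values, and check them for uniformity with a Kolmogorov-Smirnov statistic. A minimal stdlib-only sketch, assuming a large-sample z-approximation to Welch's t-test and a hand-rolled one-sample KS statistic against the uniform CDF:

```python
import math
import random

def welch_z_pvalue(a, b):
    """Two-sided p-value for a difference in means, using the
    large-sample normal approximation to Welch's t-test."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    z = (ma - mb) / math.sqrt(va / len(a) + vb / len(b))
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

def aa_pvalues(draw, n_per_arm, n_tests, seed=0):
    """Simulate repeated A/A-tests: both arms drawn from the same outcome
    distribution, so every p-value reflects a true null."""
    rng = random.Random(seed)
    return [welch_z_pvalue([draw(rng) for _ in range(n_per_arm)],
                           [draw(rng) for _ in range(n_per_arm)])
            for _ in range(n_tests)]

def ks_uniform_stat(ps):
    """One-sample Kolmogorov-Smirnov statistic vs. the Uniform(0,1) CDF."""
    ps = sorted(ps)
    n = len(ps)
    return max(max(p - i / n, (i + 1) / n - p) for i, p in enumerate(ps))
```

With well-behaved (e.g. Gaussian) outcomes the KS statistic stays small; re-running `aa_pvalues` with a heavily skewed `draw` and a small `n_per_arm` is the kind of scenario the paper flags as prone to inflated Type-I errors.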
- SPOT #61: Unobserved Negative Items in Recommender Systems: Challenges and Solutions for Evaluation and Learning
by Masahiro Sato
Properly conducting offline evaluation is crucial for recommender systems. While sampling negative items has traditionally been employed for its efficiency in evaluation, recent studies have highlighted the limitations of this approach, prompting researchers to adopt a more cautious stance toward item-sampling evaluation. However, even in the absence of intentional sampling, negative items may still be missing. This issue arises because typical implicit feedback datasets contain only items that have been interacted with by at least one user in the dataset. Consequently, the included items may not encompass the entire catalog of items that serve as true candidate items during online deployment. In this paper, we investigate the impact of missing candidate items on both the evaluation and learning processes of recommender systems. Our findings demonstrate that missing candidate items lead to the overestimation of model performance and inconsistencies in identifying superior models. Moreover, their absence significantly impairs model training. To address this challenge, we propose evaluation and learning methods based on inverse probability weighting, complemented by a novel protocol for estimating the probabilities of missing items. We show that the proposed evaluation methods recover metrics that closely approximate their true values. Furthermore, the proposed learning method yields a more robust model, even when candidate items are missing from the training data.
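To make the inverse-probability-weighting idea concrete, here is a generic self-normalized IPW recall sketch: each held-out positive is upweighted by the inverse of its probability of appearing in the dataset at all. This is an illustration of the general technique, not the paper's exact estimators or its protocol for estimating those probabilities:

```python
def ipw_recall_at_k(ranked, held_out_positives, obs_prob, k):
    """Self-normalized IPW recall@k.
    ranked:             items sorted by predicted relevance
    held_out_positives: true positives in the test split
    obs_prob:           {item: P(item observed in the dataset)}"""
    top_k = set(ranked[:k])
    num = sum(1.0 / obs_prob[i] for i in held_out_positives if i in top_k)
    den = sum(1.0 / obs_prob[i] for i in held_out_positives)
    return num / den
```

Items that were unlikely to be observed count for more, correcting the optimism of a naive recall computed only over observed items; with all `obs_prob` equal to 1 the estimator reduces to plain recall@k.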
Extra Papers
- SPOT #62: LEAF: Lightweight, Efficient, Adaptive and Flexible Embedding for Large-Scale Recommendation Models
by Chaoyi Jiang, Abdulla Alshabanah, Murali Annavaram
Deep Learning Recommendation Models (DLRMs) are central to modeling user behavior, enhancing user experience, and boosting revenues for internet companies. DLRMs rely heavily on embedding tables, which scale to tens of terabytes as the number of users and features grows, presenting challenges in training and storage. These models typically require substantial GPU memory, as embedding operations are not compute-intensive but occupy significant storage. We introduce LEAF, a multi-level hashing framework that compresses the large embedding tables based on access frequency. In particular, LEAF leverages a streaming algorithm to estimate access distributions on the fly without relying on model gradients or requiring a priori knowledge of access distribution. By using multiple hash functions, LEAF minimizes collision rates of feature instances. Experiments show that LEAF outperforms state-of-the-art compression methods on Criteo Kaggle, Avazu, KDD12, and Criteo Terabyte datasets, with testing AUC improvements of 1.411%, 1.885%, 2.761%, and 1.243%, respectively. While some solutions have explored CPU storage, this approach still demands terabytes of memory.
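The frequency-aware, multi-hash lookup idea can be sketched as follows. This is a loose illustration of the general technique only: the class structure and parameter names are invented for this sketch, and LEAF's streaming frequency estimation and training dynamics are not reproduced:

```python
import random

class MultiHashEmbedding:
    """Frequent ("hot") IDs get dedicated rows; long-tail IDs share a small
    table, combining several hashed rows so that two IDs rarely collide on
    every hash simultaneously."""

    def __init__(self, hot_ids, shared_rows, dim, n_hashes=2, seed=0):
        rng = random.Random(seed)
        self.hot = {fid: idx for idx, fid in enumerate(hot_ids)}
        self.hot_table = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                          for _ in hot_ids]
        self.shared = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                       for _ in range(shared_rows)]
        self.salts = list(range(1, n_hashes + 1))  # one salt per hash function
        self.rows = shared_rows

    def lookup(self, fid):
        if fid in self.hot:
            return self.hot_table[self.hot[fid]]
        # Tail ID: average the rows selected by each salted hash.
        vecs = [self.shared[hash((salt, fid)) % self.rows] for salt in self.salts]
        return [sum(vals) / len(vecs) for vals in zip(*vecs)]
```

Memory drops from one row per feature value to `len(hot_ids) + shared_rows` rows, while hot features, which dominate both traffic and model quality, keep full-fidelity embeddings.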
RecSys 2025 (Prague)