Accepted Contributions
List of all full papers accepted for RecSys 2025 (in alphabetical order).
- A Language Model-Based Playlist Generation Recommender System
by Enzo Charolois-Pasqua (EURECOM), Eléa Vellard (EURECOM), Youssra Rebboud (EURECOM), Pasquale Lisena (EURECOM), Raphaël Troncy (EURECOM)
The title of a playlist reflects its intended mood or theme, allowing creators to easily locate their content and enabling other users to discover music that matches specific situations and needs. This study introduces a novel approach to playlist generation using language models to leverage the thematic coherence between a playlist title and its tracks. Our method involves creating semantic clusters from text embeddings, followed by fine-tuning a transformer model on these thematic clusters. Playlists are generated by evaluating cosine similarity scores between known and unknown titles and applying a voting mechanism. Performance evaluation, combining quantitative and qualitative metrics, demonstrates that using the playlist title as a seed provides useful recommendations, even in a zero-shot scenario.
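The cluster-and-vote generation step lends itself to a compact illustration. The sketch below is not the authors' implementation; the toy embeddings, the `k` parameter, and the helper names are assumptions. It only shows the general mechanics of scoring known titles by cosine similarity and letting their playlists vote on tracks:

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def recommend_tracks(title_emb, known_titles, playlists, k=3, n_tracks=5):
    """Assign tracks to an unseen playlist title by a majority vote among
    the k most similar known titles (hypothetical helper, for illustration)."""
    nearest = sorted(known_titles.items(),
                     key=lambda kv: cosine_sim(title_emb, kv[1]),
                     reverse=True)[:k]
    votes = {}
    for name, _ in nearest:
        for track in playlists[name]:
            votes[track] = votes.get(track, 0) + 1
    # Most-voted tracks first; ties keep insertion order (sorted is stable).
    return [t for t, _ in sorted(votes.items(), key=lambda kv: -kv[1])][:n_tracks]
```

In practice the title embeddings would come from the fine-tuned transformer described in the abstract; here any fixed-length vectors work.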
- A Multi-Factor Collaborative Prediction for Review-based Recommendation
by Junrui Liu (Beijing University of Technology), Tong Li (Beijing University of Technology), Mingliang Yu (TravelSky Technology Limited), Shiqiu Yang (Beijing University of Technology), Zifang Tang (Beijing University of Technology), Zhen Yang (Beijing University of Technology)
In user behavior, a higher click-through rate tends to accompany a higher rating. Existing recommendation methods therefore implicitly model click behaviors by learning user preferences, achieving accurate predictions on rating prediction tasks. However, they ignore how rating behaviors can help the click-through rate (CTR) prediction task. Although rating behavior occurs after click behavior, we can still extract helpful information about clicks from ratings. In this paper, we propose a multi-factor collaborative prediction method (MFC), which mines the complex relationship between click and rating behaviors to achieve accurate prediction on CTR tasks. Specifically, we factorize the complex relationship into three simple relationships, i.e., linear, sharing, and cross-correlation relationships. MFC first extracts click factors, rating factors, and their sharing factor from user click and rating behaviors together with user reviews, since review-based methods have achieved strong results on rating prediction. Then, a rating factor regularization method is used to learn rating factors accurately, helping to model the true relationships between click and rating behavior. Finally, MFC combines these three factors to make predictions, where the click and rating factors model the linear and cross-correlation relationships, and the sharing factor corresponds to the sharing relationship. Experiments on five real-world datasets demonstrate that MFC outperforms the best baseline by 9.19%, 9.80%, 0.69%, and 7.95% in terms of Accuracy, Precision, Recall, and F1-score, respectively. MFC also reduces the MAE of the rating prediction task by 1.92%.
- A Non-Parametric Choice Model That Learns How Users Choose Between Recommended Options
by Thorsten Krause (Radboud University), Harrie Oosterhuis (Radboud University)
Choice models predict which items users choose from presented options. In recommendation settings, they can infer user preferences while countering exposure bias. In contrast with traditional univariate recommendation models, choice models consider which competitors appeared with the chosen item. This ability allows them to distinguish whether a user chose an item due to preference, i.e., they liked it; or competition, i.e., it was the best available option. Each choice model assumes specific user behavior, e.g., the multinomial logit model. However, it is currently unclear how accurately these assumptions capture actual user behavior, how wrong assumptions impact inference, and whether better models exist. In this work, we propose the learned choice model for recommendation (LCM4Rec), a non-parametric method for estimating the choice model. By applying kernel density estimation, LCM4Rec infers the most likely error distribution that describes the effect of inter-item cannibalization and thereby characterizes the users’ choice model. Thus, it simultaneously infers what users prefer and how they make choices. Our experimental results indicate that our method (i) can accurately recover the choice model underlying a dataset; (ii) provides robust user preference inference, in contrast with existing choice models that are only effective when their assumptions match user behavior; and (iii) is more resistant against exposure bias than existing choice models. Thereby, we show that learning choice models, instead of assuming them, can produce more robust predictions. We believe this work provides an important step towards better understanding users’ choice behavior.
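Kernel density estimation, the core tool named above, can be illustrated in a few lines. This is a generic 1-D Gaussian-kernel KDE, not LCM4Rec itself; the bandwidth and the choice of Gumbel noise (the error distribution that the multinomial logit model assumes a priori) are illustrative assumptions:

```python
import numpy as np

def kde(samples, x, bandwidth=0.3):
    # Gaussian-kernel density estimate at points x from 1-D samples.
    diffs = (x[:, None] - samples[None, :]) / bandwidth
    return np.exp(-0.5 * diffs**2).sum(axis=1) / (
        len(samples) * bandwidth * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
# Draw noise from a Gumbel distribution -- the error model the multinomial
# logit assumes in advance, and which a learned KDE could instead recover
# from observed choices without committing to it beforehand.
samples = rng.gumbel(loc=0.0, scale=1.0, size=5000)
density = kde(samples, np.array([-2.0, 0.0, 6.0]))
```

The estimated density peaks near the Gumbel mode at 0 and decays in both tails, recovering the distribution's shape directly from samples.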
- Affect-aware Cross-Domain Recommendation for Art Therapy via Music Preference Elicitation
by Bereket A. Yilma (University of Luxembourg), Luis A. Leiva (University of Luxembourg)
Art Therapy (AT) is an established practice that facilitates emotional processing and recovery through creative expression. Recently, Visual Art Recommender Systems (VA RecSys) have emerged to support AT, demonstrating their potential by personalizing therapeutic artwork recommendations. Nonetheless, current VA RecSys rely on visual stimuli for user modeling, limiting their ability to capture the full spectrum of emotional responses during preference elicitation. Previous studies have shown that music stimuli elicit unique affective reflections, presenting an opportunity for cross-domain recommendation (CDR) to enhance personalization in AT. Since CDR has not yet been explored in this context, we propose a family of CDR methods for AT based on music-driven preference elicitation. A large-scale study with 200 users demonstrates the efficacy of music-driven preference elicitation, outperforming the classic visual-only elicitation approach.
- An Off-Policy Learning Approach for Steering Sentence Generation towards Personalization
by Haruka Kiyohara (Cornell University), Daniel Cao (Cornell University), Yuta Saito (Cornell University), Thorsten Joachims (Cornell University)
We study the problem of personalizing the output of a large language model (LLM) by training on logged bandit feedback (e.g., personalizing movie descriptions based on likes). While one may naively treat this as a standard off-policy contextual bandit problem, the large action space and the large parameter space make naive applications of off-policy learning (OPL) infeasible. We overcome this challenge by learning a prompt policy for a frozen LLM that has only a modest number of parameters. The proposed Direct Sentence Off-policy gradient (DSO) effectively propagates the gradient to the prompt policy space by leveraging the smoothness and overlap in the sentence space. Consequently, DSO substantially reduces variance while also suppressing bias. Empirical results on our newly established suite of benchmarks, called OfflinePrompts, demonstrate the effectiveness of the proposed approach in generating personalized descriptions for movie recommendations, particularly when the number of candidate prompts and reward noise are large.
- Auditing Recommender Systems for User Empowerment in Very Large Online Platforms under the Digital Services Act
by Matteo Fabbri (IMT School for Advanced Studies Lucca), Ludovico Boratto (University of Cagliari)
The governance of recommender systems (RSs) in very large online platforms (VLOPs) is expected to undergo a major transformation under the Digital Services Act (DSA), which imposes new obligations on transparency and user control. However, beyond legal compliance, a critical question remains: How can RSs be reimagined to genuinely empower users and foster meaningful personalization? This paper addresses this question by analyzing how three major short-video platforms—Instagram, TikTok, and YouTube—have implemented the DSA requirements for RSs. By reviewing their audit reports, systemic risk assessments and compliance strategies, we evaluate the extent to which current approaches enhance user autonomy and control over content exposure. Building on this analysis, we outline a perspective for the future of VLOPs’ RSs grounded in speculative design. We argue that meaningful personalization should integrate algorithmic choice, balancing proportionality and granularity in RS customization, and content curation, ensuring authoritativeness and diversity to mitigate systemic risks. By bridging legal analysis, platform governance, and user-centered design, this paper outlines actionable pathways for aligning technical developments with regulatory objectives. Our findings contribute to interdisciplinary research on RSs by highlighting how platforms can move beyond minimal compliance toward a model that prioritizes user empowerment and content pluralism.
- Beyond Immediate Click: Engagement-Aware and MoE-Enhanced Transformers for Sequential Movie Recommendation
by Haotian Jiang (Amazon Prime Video), Sibendu Paul (Amazon Prime Video), Haiyang Zhang (Amazon Prime Video), Caren Chen (Amazon Prime Video)
Modern video streaming services heavily rely on recommender systems. Although there are many methods for content personalization and recommendation, sequential recommendation models stand out due to their ability to summarize user behavior over time. We propose a novel sequential recommendation framework to address the following key issues: suboptimal negative sampling strategies, fixed user-history context lengths, single-task optimization objectives, insufficient engagement-aware learning, and short-sighted prediction horizons, ultimately improving both immediate and multi-step next-title prediction for video streaming services. In this work, we propose a novel approach to capture patterns of interaction at different time scales. We also align long-term user happiness with instantaneous intent signals using multi-task learning with an engagement-aware personalized loss. Finally, we extend traditional next-item prediction into a next-K forecasting task using a training strategy with soft positive labels. Extensive experiments on large-scale streaming data validate the effectiveness of our approach. Our best model outperforms the baseline in NDCG@1 by up to 3.52% under realistic ranking scenarios, showing the effectiveness of our engagement-aware and MoE-enhanced designs. Results also show that soft-label Multi-K training is a practical and scalable extension, and that a balanced personalized negative sampling strategy generalizes well. Our framework outperforms baselines across all ranking metrics, providing a robust solution for production-scale streaming recommendations.
- Breaking Knowledge Boundaries: Cognitive Distillation-enhanced Cross-Behavior Course Recommendation Model
by Ruoyu Li (Xidian University), Yangtao Zhou (Xidian University), Chenzhang Li (Xidian University), Hua Chu (Xidian University), Jianan Li (Xidian University), Yuhan Bian (Xidian University)
Online Course Recommendation (CR) stands as a promising educational strategy within online education platforms, with the goal of providing personalized learning experiences for learners and enhancing their learning efficiency. Existing CR methods focus on modeling learners’ learning needs from their historical course interactions by adopting general recommendation techniques, but fail to consider the shifts in course preferences caused by cognitive states. While Cognitive Diagnosis (CD) techniques are adept at tracking the evolution of cognitive states by mining learner-exercise interactions and can benefit the CR task, it is non-trivial to integrate CD and CR properly due to several challenges, including accurate diagnosis, divergent task objectives, and inconsistent data magnitude. To address these challenges, we propose a Cognitive Distillation-enhanced Cross-Behavior Course Recommendation model (C3Rec), which aims to transfer the knowledge of learners’ cognitive states to enhance the CR task. Specifically, for accurate diagnosis, we introduce a dual-granularity cognitive diagnosis module to capture learner representations at both coarse and fine granularities, thereby achieving a comprehensive construction of learners’ cognitive states. For divergent task objectives, we design a cross-behavior course recommendation module to jointly profile dynamic course preferences from two temporally interleaved learning behaviors, achieving seamless semantic alignment between the two tasks. For inconsistent data magnitude, we introduce a triple-stage distillation mechanism to exploit cognitive state features as prior knowledge, enhancing the CR task by further profiling learners’ course preferences. Experimental comparisons with multiple state-of-the-art methods on two real-world educational datasets demonstrate the effectiveness of our model.
- Enhancing Online Video Recommendation via a Coarse-to-fine Dynamic Uplift Modeling Framework
by Chang Meng (Kuaishou Technology), Chenhao Zhai (Tsinghua University), Xueliang Wang (Kuaishou Technology), Shuchang Liu (Kuaishou Technology), Xiaoqiang Feng (Kuaishou Technology), Lantao Hu (Kuaishou Technology), Xiu Li (Tsinghua University), Han Li (Kuaishou Technology), Kun Gai (Kuaishou Technology)
The popularity of short video applications has brought new opportunities and challenges to video recommendation. In addition to the traditional ranking-based pipeline, industrial solutions usually introduce additional distribution management components to guarantee a diverse and content-rich user experience. However, existing solutions are either non-personalized or fail to generalize well to ever-changing user preferences. Inspired by the success of uplift modeling in online marketing, we attempt to implement uplift modeling in the video recommendation scenario to mitigate these problems. However, we face two main challenges when migrating the technique: 1) the complex-response causal relation in the distribution management problem, and 2) the modeling of long-term and real-time user preferences. To address these challenges, we correspond each treatment to a specific adjustment of the distribution over video types, then propose a Coarse-to-fine Dynamic Uplift Modeling (CDUM) framework for real-time video recommendation scenarios. Specifically, CDUM consists of two modules: a coarse-grained module that utilizes offline user features to model long-term preferences, and a fine-grained module that leverages online real-time contextual features and request-level candidates to model users’ real-time interests. These two modules collaboratively and dynamically identify and target specific user groups, and then apply treatments effectively. We conduct comprehensive experiments on two offline public datasets, an industrial offline dataset, and an online A/B test, demonstrating the superiority and effectiveness of CDUM. The proposed method is fully deployed on a large-scale short video platform, serving hundreds of millions of users every day. We plan to make the source code available after the paper is accepted.
- Enhancing Sequential Recommender with Large Language Models for Joint Video and Comment Recommendation
by Bowen Zheng (Renmin University of China), Zihan Lin (Kuaishou Technology), Enze Liu (Renmin University of China), Chen Yang (Renmin University of China), Enyang Bai (Kuaishou Technology), Cheng Ling (Kuaishou Technology), Han Li (Kuaishou Technology), Wayne Xin Zhao (Renmin University of China), Ji-Rong Wen (Renmin University of China)
Nowadays, reading or writing comments on captivating videos has emerged as a critical part of the viewing experience on online video platforms. However, existing recommender systems primarily focus on users’ interaction behaviors with videos, neglecting comment content and interaction in user preference modeling. In this paper, we propose a novel recommendation approach called LSVCR that utilizes user interaction histories with both videos and comments to jointly perform personalized video and comment recommendation. Specifically, our approach comprises two key components: a sequential recommendation (SR) model and a supplemental large language model (LLM) recommender. The SR model functions as the primary recommendation backbone (retained in deployment) of our method for efficient user preference modeling. Concurrently, we employ an LLM as the supplemental recommender (discarded in deployment) to better capture underlying user preferences derived from heterogeneous interaction behaviors. In order to integrate the strengths of the SR model and the supplemental LLM recommender, we introduce a two-stage training paradigm. The first stage, personalized preference alignment, aims to align the preference representations from both components, thereby enhancing the semantics of the SR model. The second stage, recommendation-oriented fine-tuning, involves fine-tuning the alignment-enhanced SR model according to specific objectives. Extensive experiments in both video and comment recommendation tasks demonstrate the effectiveness of LSVCR. Moreover, online A/B testing on a real-world video platform verifies the practical benefits of our approach. In particular, we attain a cumulative gain of 4.13% in comment watch time.
- Enhancing Transferability and Consistency in Cross-Domain Recommendations via Supervised Disentanglement
by Yuhan Wang (Wuhan University of Technology), Qing Xie (Wuhan University of Technology), Zhifeng Bao (School of Computing Technologies, RMIT University), Mengzi Tang (Wuhan University of Technology), Lin Li (Wuhan University of Technology), Yongjian Liu (Wuhan University of Technology)
Cross-domain recommendation (CDR) aims to alleviate data sparsity by transferring knowledge across domains. Disentangled representation learning provides an effective solution to model complex user preferences by separating intra-domain features (domain-shared and domain-specific features), thereby enhancing robustness and interpretability. However, disentanglement-based CDR methods employing generative modeling or GNNs with contrastive objectives face two key challenges: (i) pre-separation strategies decouple features before extracting collaborative signals, disrupting intra-domain interactions and introducing noise; (ii) unsupervised disentanglement objectives lack explicit task-specific guidance, resulting in limited consistency and suboptimal alignment. To address these challenges, we propose DGCDR, a GNN-enhanced encoder-decoder framework. For challenge (i), DGCDR first applies a GNN to extract high-order collaborative signals, providing enriched representations as a robust foundation for disentanglement. The encoder then dynamically disentangles features into domain-shared and -specific spaces, preserving collaborative information during the separation process. To handle challenge (ii), the decoder introduces an anchor-based supervision mechanism that leverages hierarchical feature relationships to enhance intra-domain consistency and cross-domain alignment. Extensive experiments on real-world datasets demonstrate that DGCDR achieves state-of-the-art performance, with improvements of up to 11.59% across key metrics. Qualitative analyses further validate its superior disentanglement quality and transferability.
- Exploring Scaling Laws of CTR Model for Online Performance Improvement
by Weijiang Lai (Institute of Software, Chinese Academy of Sciences), Beihong Jin (Institute of Software, Chinese Academy of Sciences), Jiongyan Zhang (Meituan), Yiyuan Zheng (Institute of Software, Chinese Academy of Sciences), Jian Dong (Meituan), Jia Cheng (Meituan), Jun Lei (Meituan), Xingxing Wang (Meituan)
Click-Through Rate (CTR) models play a vital role in improving user experience and boosting business revenue in many online personalized services. However, current CTR models generally encounter bottlenecks in performance improvement. Inspired by the scaling law phenomenon of Large Language Models (LLMs), we propose a new paradigm for improving CTR prediction: first, constructing a CTR model whose accuracy scales with model grade and data size, and then distilling the knowledge implied in this model into a lightweight model that can serve online users. To put this into practice, we construct a CTR model named SUAN (Stacked Unified Attention Network). In SUAN, we propose the unified attention block (UAB) as a behavior sequence encoder. A single UAB unifies the modeling of sequential and non-sequential features and also measures the importance of each user behavior feature from multiple perspectives. Stacked UABs elevate the configuration to a high grade, paving the way for performance improvement. In order to benefit from the high performance of the high-grade SUAN while avoiding its long inference time, we modify SUAN with sparse self-attention and parallel inference strategies to form LightSUAN, and then adopt online distillation to train a low-grade LightSUAN with a high-grade SUAN as the teacher. The distilled LightSUAN has superior performance but the same inference time as an undistilled LightSUAN, making it well suited for online deployment. Experimental results show that SUAN performs exceptionally well and exhibits scaling laws spanning three orders of magnitude in model grade and data size, and that the distilled LightSUAN outperforms a SUAN configured one grade higher. More importantly, the distilled LightSUAN has been integrated into an online service, increasing CTR by 2.81% and CPM by 1.69% while keeping the average inference time acceptable.
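The online-distillation objective described above follows a standard pattern: the student is fit against both ground-truth clicks and the teacher's predicted probabilities. The numpy sketch below shows only that generic pattern; the weighting `alpha` and the helper names are assumptions, not SUAN's actual training loss:

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    # Binary cross-entropy; `target` may be a hard click label (0/1)
    # or a soft probability produced by the teacher.
    pred = np.clip(pred, eps, 1 - eps)
    return float(-(target * np.log(pred) + (1 - target) * np.log(1 - pred)).mean())

def distill_loss(student_p, teacher_p, labels, alpha=0.5):
    # Hard-label term keeps the student grounded in observed clicks;
    # soft-label term transfers the high-grade teacher's knowledge.
    return alpha * bce(student_p, labels) + (1 - alpha) * bce(student_p, teacher_p)
```

A student whose predictions track both the labels and the teacher incurs a lower loss than one that ignores them, which is what drives the lightweight model toward the teacher's behavior.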
- GRACE: Generative Recommendation via Journey-Aware Sparse Attention on Chain-of-Thought Tokenization
by Luyi Ma (Walmart Global Tech), Wanjia Zhang (Walmart Global Tech), Kai Zhao (Walmart Global Tech), Abhishek Kulkarni (Walmart Global Tech), Lalitesh Morishetti (Walmart Global Tech), Anjana Ganesh (Walmart Global Tech), Ashish Ranjan (Walmart Global Tech), Aashika Padmanabhan (Walmart Global Tech), Jianpeng Xu (Walmart Global Tech), Jason H.D. Cho (Walmart Global Tech), Praveen Kumar Kanumala (Walmart Inc), Kaushiki Nag (Walmart), Sumit Dutta (Walmart Global Tech), Kamiya Motwani (Walmart Global Tech), Malay Patel (Walmart Global Tech), Evren Korpeoglu (Walmart), Sushant Kumar (Walmart Global Tech), Kannan Achan (Walmart Global Tech)
Generative models have recently demonstrated strong potential in multi-behavior recommendation systems, leveraging the expressive power of transformers and tokenization to generate personalized item sequences. However, their adoption is hindered by (1) the lack of explicit information for token reasoning, (2) high computational costs due to quadratic attention complexity and dense sequence representations after tokenization, and (3) limited multi-scale modeling over user history. In this work, we propose GRACE (Generative Recommendation via journey-aware sparse Attention on Chain-of-thought tokEnization), a novel generative framework for multi-behavior sequential recommendation. GRACE introduces a hybrid Chain-of-Thought (CoT) tokenization method that encodes user-item interactions with explicit attributes from product knowledge graphs (e.g., category, brand, price) on top of semantic tokenization, enabling interpretable and behavior-aligned generation. To address the inefficiency of standard attention, we design a Journey-Aware Sparse Attention (JSA) mechanism, which selectively attends to compressed, intra-, inter-, and current-context segments in the tokenized sequence. Experiments on two real-world datasets show that GRACE significantly outperforms state-of-the-art baselines, achieving improvements of up to +106.9% in HR@10 and +106.7% in NDCG@10 on the Home domain, and +22.1% in HR@10 on the Electronics domain. GRACE also reduces attention computation by up to 48% on long sequences.
- GenSAR: Unifying Balanced Search and Recommendation with Generative Retrieval
by Teng Shi (Renmin University of China), Jun Xu (Renmin University of China), Xiao Zhang (Renmin University of China), Xiaoxue Zang (Kuaishou Technology Co., Ltd.), Kai Zheng (Kuaishou Technology Co., Ltd.), Yang Song (Kuaishou Technology Co., Ltd.), Enyun Yu (Independent)
Many commercial platforms provide both search and recommendation (S&R) services to meet different user needs. This creates an opportunity for joint modeling of S&R. Although many joint S&R studies have demonstrated the advantages of integrating S&R, they have also identified a trade-off between the two tasks. That is, when recommendation performance improves, search performance may decline, or vice versa. This trade-off stems from their different information requirements: search prioritizes the semantic relevance between queries and items, while recommendation relies heavily on the collaborative relationship between users and items. To balance semantic and collaborative information and mitigate this trade-off, two main challenges arise: (1) how to incorporate both semantic and collaborative information in item representations, and (2) how to train the model to understand the different information requirements of S&R. The recent rise of generative retrieval based on Large Language Models (LLMs) for S&R offers a potential solution. Generative retrieval represents each item as an identifier, allowing us to assign multiple identifiers to each item to capture both semantic and collaborative information. Additionally, generative retrieval formulates both S&R as sequence-to-sequence tasks, enabling us to unify different tasks through varied prompts, thereby helping the model better understand the requirements of each task. Based on this, we propose GenSAR, a method that unifies balanced S&R through generative retrieval. We design joint S&R identifiers and training tasks to address the above challenges, mitigate the trade-off between S&R, and further improve both tasks. Experimental results on a public dataset and a commercial dataset validate the effectiveness of GenSAR.
- Heterogeneous User Modeling for LLM-based Recommendation
by Honghui Bao (National University of Singapore), Wenjie Wang (University of Science and Technology of China), Xinyu Lin (National University of Singapore), Fengbin Zhu (National University of Singapore), Teng Sun (Shandong University), Fuli Feng (University of Science and Technology of China), Tat-Seng Chua (National University of Singapore)
Leveraging Large Language Models (LLMs) for recommendation has demonstrated notable success in various domains, showcasing their potential for open-domain recommendation. A key challenge to advancing open-domain recommendation lies in effectively modeling user preferences within users’ heterogeneous behaviors across multiple domains. Existing approaches, including ID-based and semantic-based modeling, struggle with poor generalization, an inability to compress noisy interactions effectively, and the domain seesaw phenomenon. To address these challenges, we propose a Heterogeneous User Modeling (HUM) method, which incorporates a compression enhancer and a robustness enhancer for LLM-based recommendation. The compression enhancer uses a customized prompt to compress heterogeneous behaviors into a tailored token, while a masking mechanism enhances cross-domain knowledge extraction and understanding. The robustness enhancer introduces a domain importance score to mitigate the domain seesaw phenomenon by guiding domain optimization. Extensive experiments on heterogeneous datasets validate that HUM effectively models user heterogeneity by achieving both high efficacy and robustness, leading to superior performance in open-domain recommendation.
- Hierarchical Graph Information Bottleneck for Multi-Behavior Recommendation
by Hengyu Zhang (The Chinese University of Hong Kong), Chunxu Shen (WeChat, Tencent), Xiangguo Sun (The Chinese University of Hong Kong), Jie Tan (The Chinese University of Hong Kong), Yanchao Tan (Fuzhou University), Yu Rong (The Chinese University of Hong Kong), Hong Cheng (The Chinese University of Hong Kong), Lingling Yi (WeChat, Tencent)
In real-world recommendation scenarios, users typically engage with platforms through multiple types of behavioral interactions. Multi-behavior recommendation algorithms aim to leverage various auxiliary user behaviors to enhance prediction for target behaviors of primary interest (e.g., buy), thereby overcoming performance limitations caused by data sparsity in target behavior records. Current state-of-the-art approaches typically employ hierarchical designs following either cascading (e.g., view→cart→buy) or parallel (unified→behavior-specific components) paradigms to capture behavioral relationships. However, these methods still face two critical challenges: (1) severe distribution disparities across behaviors, and (2) negative transfer effects caused by noise in auxiliary behaviors. In this paper, we propose a novel model-agnostic Hierarchical Graph Information Bottleneck (HGIB) framework for multi-behavior recommendation to effectively address these challenges. Following information bottleneck principles, our framework learns compact yet sufficient representations that preserve essential information for target behavior prediction while eliminating task-irrelevant redundancies. To further mitigate interaction noise, we introduce a Graph Refinement Encoder (GRE) that dynamically prunes redundant edges through learnable edge dropout mechanisms. We conduct comprehensive experiments on three real-world public datasets, which demonstrate the superior effectiveness of our framework. Beyond these widely used academic datasets, we further expand our evaluation to several real industrial scenarios, again showing significant improvements in multi-behavior recommendation.
- How Do Users Perceive Recommender Systems’ Objectives?
by Patrik Dokoupil (Faculty of Mathematics and Physics, Charles University), Ludovico Boratto (University of Cagliari), Ladislav Peska (Faculty of Mathematics and Physics, Charles University)
Multi-objective recommender systems (MORS) aim to optimize multiple criteria while generating recommendations, such as relevance, novelty, diversity, or exploration. These algorithms assume that an operationalization of these criteria (i.e., translating abstract goals into measurable metrics) will reflect how users perceive them. Nevertheless, such beliefs are rarely rigorously evaluated, which can lead to a mismatch between algorithmic goals and user satisfaction. Moreover, if users are allowed to control the RS via their propensities towards such objectives, any misconceptions may further impact users’ trust and engagement. To characterize this problem, we conduct a large user study focusing on recommender systems in two domains: books and movies. Part of the study focuses on how users perceive different recommendation objectives, which we compared with well-established metrics aiming at the same objectives. We found that although such metrics correlate to some extent with users’ perceptions, the mapping is far from perfect. Moreover, we also report on conceptual-level differences in users’ understanding of RS objectives and how this affects the results.
- IP2: Entity-Guided Interest Probing for Personalized News Recommendation
by Youlin Wu (Dalian University of Technology), Yuanyuan Sun (Dalian University of Technology), Xiaokun Zhang (City University of Hong Kong), Haoxi Zhan (Dalian University of Technology), Bo Xu (Dalian University of Technology), Liang Yang (Dalian University of Technology), Hongfei Lin (Dalian University of Technology)
News recommender systems aim to deliver personalized news articles to users based on their reading history. Previous behavioral studies suggest that screen-based news reading involves three successive steps: scanning, title reading, and then clicking. Adhering to these steps, we find that intra-news entity interest dominates the scanning stage, while inter-news entity interest guides title reading and influences click decisions. Unfortunately, current methods overlook the unique utility of entities in news recommendation. To this end, we propose a novel method, IP2, to probe entity-guided reading interest at both the intra- and inter-news levels. At the intra-news level, a transformer-based entity encoder is devised to aggregate the entities mentioned in a news title into one signature entity. Then, signature entity-title contrastive pre-training is adopted to initialize entities with proper meanings in the news context, which at the same time enables probing for intra-news entity interest. At the inter-news level, a dual-tower user encoder is presented to capture inter-news reading interest from both the title-meaning and entity sides. In addition, to highlight the contribution of inter-news entity guidance, a cross-tower attention link is adopted to calibrate title reading interest using inter-news entity interest, further aligning with real-world behavior. Extensive experiments on two real-world datasets demonstrate that IP2 achieves state-of-the-art performance in news recommendation.
- RESIntegrating Individual and Group Fairness for Recommender Systems through Social Choice
by Amanda Aird (University of Colorado Boulder), Elena Štefancová (Comenius University Bratislava), Anas Buhayh (University of Colorado Boulder), Cassidy All (Department of Information Science, University of Colorado Boulder), Martin Homola (Comenius University Bratislava), Nicholas Mattei (Tulane University), Robin Burke (University of Colorado Boulder)Fairness in recommender systems is a complex concept, involving multiple definitions of fairness, different parties for whom fairness is sought, and various scopes over which fairness might be measured. Researchers have derived a variety of solutions, usually highly tailored to specific choices along each of these dimensions, and typically aimed at tackling a single fairness concern. However, in practical contexts, we find a multiplicity of fairness concerns within a given recommendation application. We explore a general solution to recommender system fairness using social choice methods to integrate multiple heterogeneous fairness definitions. In this paper, we extend group-fairness results from prior research to provider-side individual fairness, demonstrating in multiple datasets that both individual and group fairness objectives can be integrated and optimized jointly. We identify both synergies and tensions among different fairness objectives, with individual fairness correlated with group fairness for some groups and anti-correlated for others.
- RESLANCE: Exploration and Reflection for LLM-based Textual Attacks on News Recommender Systems
by Yuyue Zhao (University of Science and Technology of China), Jin Huang (University of Cambridge), Shuchang Liu (Rutgers University), Jiancan Wu (University of Science and Technology of China), Xiang Wang (University of Science and Technology of China), Maarten de Rijke (University of Amsterdam)News recommender systems rely on rich textual information from news articles to generate user-specific recommendations. This reliance may expose these systems to potential vulnerabilities through textual attacks. To explore this vulnerability, we propose LANCE, a LArge language model-based News Content rEwriting framework, designed to influence news rankings and highlight the unintended promotion of manipulated news. LANCE consists of two key components: an explorer and a reflector. The explorer first generates rewritten news using diverse prompts, incorporating different writing styles, sentiments, and personas. We then collect these rewrites, evaluate their ranking impact within news recommender systems, and apply a filtering mechanism to retain effective rewrites. Next, the reflector fine-tunes an open-source LLM using the successful rewrites, enhancing its ability to generate more effective textual attacks. Experimental results demonstrate the effectiveness of LANCE in manipulating rankings within news recommender systems. Unlike attacks in other recommendation domains, negative and neutral rewrites consistently outperform positive ones, revealing a unique vulnerability specific to news recommendation. Once trained, LANCE successfully attacks unseen news recommender systems, highlighting its generalization ability and exposing shared vulnerabilities across different systems. Our work underscores the urgent need for research on textual attacks and paves the way for future studies on defense strategies.
- RESLEAF: Lightweight, Efficient, Adaptive and Flexible Embedding for Large-Scale Recommendation Models
by Chaoyi Jiang (University of Southern California), Abdulla Alshabanah (University of Southern California), Murali Annavaram (University of Southern California)Deep Learning Recommendation Models (DLRMs) are central to modeling user behavior, enhancing user experience, and boosting revenues for internet companies. DLRMs rely heavily on embedding tables, which scale to tens of terabytes as the number of users and features grows, presenting challenges in training and storage. These models typically require substantial GPU memory, as embedding operations are not compute-intensive but occupy significant storage. While some solutions have explored CPU storage, this approach still demands terabytes of memory. We introduce LEAF, a multi-level hashing framework that compresses the large embedding tables based on access frequency. In particular, LEAF leverages a streaming algorithm to estimate access distributions on the fly without relying on model gradients or requiring a priori knowledge of access distribution. By using multiple hash functions, LEAF minimizes collision rates of feature instances. Experiments show that LEAF outperforms state-of-the-art compression methods on Criteo Kaggle, Avazu, KDD12, and Criteo Terabyte datasets, with testing AUC improvements of 1.411%, 1.885%, 2.761%, and 1.243%, respectively.
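The abstract does not detail LEAF's architecture; as a rough, hypothetical sketch of the general idea it describes — dedicated embedding rows for frequently accessed features, and a small shared table addressed through multiple hash functions for the long tail — one might write (all names and the two-level split are illustrative, not the paper's design):

```python
import random

class FrequencyAwareHashEmbedding:
    """Toy two-level hash embedding: 'hot' (frequent) feature IDs get their own
    rows, while rare IDs share a smaller table via several hash functions, so a
    single collision never fully aliases two features."""

    def __init__(self, hot_ids, dim=8, hash_buckets=1000, num_hashes=2, seed=0):
        rng = random.Random(seed)
        self.hot = {fid: i for i, fid in enumerate(hot_ids)}  # frequent features
        self.hot_table = [[rng.gauss(0, 1) for _ in range(dim)] for _ in hot_ids]
        self.hash_table = [[rng.gauss(0, 1) for _ in range(dim)]
                           for _ in range(hash_buckets)]
        self.num_hashes = num_hashes
        self.buckets = hash_buckets

    def lookup(self, feature_id):
        if feature_id in self.hot:  # level 1: exact, uncompressed row
            return self.hot_table[self.hot[feature_id]]
        # level 2: average the rows selected by several hash functions,
        # which dilutes the impact of any single bucket collision
        rows = [hash((k, feature_id)) % self.buckets
                for k in range(self.num_hashes)]
        cols = zip(*(self.hash_table[r] for r in rows))
        return [sum(c) / self.num_hashes for c in cols]
```

In a trained model the tables would be learned parameters and the hot set would come from the streamed frequency estimates; here both are random placeholders.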
- RESLLM-RecG: A Semantic Bias-Aware Framework for Zero-Shot Sequential Recommendation
by Yunzhe Li (University of Illinois at Urbana-Champaign), Junting Wang (University of Illinois at Urbana-Champaign), Hari Sundaram (University of Illinois at Urbana-Champaign), Zhining Liu (University of Illinois at Urbana-Champaign)Zero-shot cross-domain sequential recommendation (ZCDSR) enables predictions in unseen domains without additional training or fine-tuning, addressing the limitations of traditional models in sparse data environments. Recent advancements in large language models (LLMs) have significantly enhanced ZCDSR by facilitating cross-domain knowledge transfer through rich, pretrained representations. Despite this progress, domain semantic bias—arising from differences in vocabulary and content focus between domains—remains a persistent challenge, leading to misaligned item embeddings and reduced generalization across domains. To address this, we propose a novel semantic bias-aware framework that enhances LLM-based ZCDSR by improving cross-domain alignment at both the item and sequential levels. At the item level, we introduce a generalization loss that aligns the embeddings of items across domains (inter-domain compactness), while preserving the unique characteristics of each item within its own domain (intra-domain diversity). This ensures that item embeddings can be transferred effectively between domains without collapsing into overly generic or uniform representations. At the sequential level, we develop a method to transfer user behavioral patterns by clustering source domain user sequences and applying attention-based aggregation during target domain inference. We dynamically adapt user embeddings to unseen domains, enabling effective zero-shot recommendations without requiring target-domain interactions. Extensive experiments across multiple datasets and domains demonstrate that our framework significantly enhances the performance of sequential recommendation models on the ZCDSR task.
By addressing domain bias and improving the transfer of sequential patterns, our method offers a scalable and robust solution for better knowledge transfer, enabling improved zero-shot recommendations across domains.
- RESLONGER: Scaling Up Long Sequence Modeling in Industrial Recommenders
by Zheng Chai (ByteDance), Qin Ren (ByteDance), Xijun Xiao (ByteDance), Huizhi Yang (ByteDance), Bo Han (ByteDance), Sijun Zhang (ByteDance), Di Chen (ByteDance), Hui Lu (ByteDance), Wenlin Zhao (ByteDance), Lele Yu (ByteDance), Xionghang Xie (ByteDance), Shiru Ren (ByteDance), Xiang Sun (ByteDance), Yaocheng Tan (ByteDance), Peng Xu (ByteDance), Yuchao Zheng (ByteDance), Di Wu (ByteDance)Modeling ultra-long user behavior sequences is critical for capturing both long- and short-term preferences in industrial recommender systems. Existing solutions typically rely on two-stage retrieval or indirect modeling paradigms, incurring upstream-downstream inconsistency and computational inefficiency. In this paper, we present LONGER, a Long-sequence Optimized traNsformer for GPU-Efficient Recommenders. LONGER incorporates (i) a global token mechanism for stabilizing attention over long contexts, (ii) a token merge module with lightweight InnerTransformers and a hybrid attention strategy to reduce quadratic complexity, and (iii) a series of engineering optimizations, including training with mixed-precision and activation recomputation, KV cache serving, and a fully synchronous model training and serving framework for unified GPU-based dense and sparse parameter updates. LONGER consistently outperforms strong baselines in both offline metrics and online A/B testing in both advertising and e-commerce services at ByteDance, validating its consistent effectiveness and industrial-level scaling laws. Currently, LONGER has been fully deployed in more than 10 influential scenarios at ByteDance, serving billions of users.
- RESLasso: Large Language Model-based User Simulator for Cross-Domain Recommendation
by Yue Chen (College of Computer Science, Sichuan University), Susen Yang (Kuaishou Technology), Tong Zhang (College of Computer Science, Sichuan University), Chao Wang (Kuaishou Technology), Mingyue Cheng (State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China), Chenyi Lei (Kuaishou Technology), Han Li (Kuaishou Technology)Cross-Domain Recommendation (CDR) aims to mitigate the cold-start problem in target domains by leveraging user interactions from source domains. However, existing CDR methods often suffer from low data efficiency, as they require a substantial number of historical interactions from overlapping users for training, which is impractical in real-world scenarios. To address this challenge, we propose Lasso, a novel framework that leverages a large language model (LLM) as a user simulator to capture cross-domain user preferences based on the remarkable internal knowledge of the LLM. Specifically, we introduce a cross-domain training paradigm to fine-tune the LLM-based simulator, enabling it to simulate user behaviors in the target domain using historical interactions from the source domain. Furthermore, to enhance the efficiency and accuracy of Lasso, we propose two effective modules: Personalized Candidate Pool (PCP) and Confidence-Guided Inference (CGI). The PCP module employs cross-domain collaborative filtering to construct a tailored set of candidate items for simulating interactions of each cold-start user in the target domain, thereby improving the inference efficiency of the LLM. The CGI module utilizes confidence scores from the LLM to reduce noise in the simulated data, ensuring more accurate estimations. During the application phase, the simulated interactions serve as additional inputs for downstream recommendation models, effectively alleviating cold-start problems for users. Extensive experiments on public benchmark datasets and a real-world industrial dataset demonstrate that Lasso achieves superior accuracy while requiring fewer historical interactions from overlapping users.
- RESLeave No One Behind: Fairness-Aware Cross-Domain Recommender Systems for Non-Overlapping Users
by Weixin Chen (Hong Kong Baptist University), Yuhan Zhao (Hong Kong Baptist University), Li Chen (Hong Kong Baptist University), Weike Pan (Shenzhen University)Cross-domain recommendation (CDR) methods predominantly leverage overlapping users to transfer knowledge from a source domain to a target domain. However, through empirical studies, we uncover a critical bias inherent in these approaches: while overlapping users experience significant enhancements in recommendation quality, non-overlapping users benefit minimally and even face performance degradation. This unfairness may erode user trust, and, consequently, negatively impact business engagement and revenue. To address this issue, we propose a novel solution that generates virtual source-domain users for non-overlapping target-domain users. Our method utilizes a dual attention mechanism to discern similarities between overlapping and non-overlapping users, thereby synthesizing realistic virtual user embeddings. We further introduce a limiter component that ensures the generated virtual users align with real-data distributions while preserving each user’s unique characteristics. Notably, our method is model-agnostic and can be seamlessly integrated into any CDR model. Comprehensive experiments conducted on three public datasets with five CDR baselines demonstrate that our method effectively mitigates the CDR non-overlapping user bias, without loss of overall accuracy.
- RESMDSBR: Multimodal Denoising for Session-based Recommendation
by Yutong Li (University College London), Xinyi Zhang (Imperial College London)Multimodal session-based recommendation (SBR) has emerged as a promising direction for capturing user intent using visual and textual item content. However, existing methods often overlook a fundamental issue: the modality features extracted from pre-trained models (e.g., BERT, CLIP) are inherently noisy and misaligned with user-specific preferences. This noise arises from label errors, task mismatch, and over-inclusion of irrelevant content, ultimately degrading recommendation quality. In this work, we propose a diffusion-based denoising framework that explicitly refines noisy pre-trained representations without full fine-tuning. By progressively removing noise through a structured denoising process, our Multimodal Denoising Diffusion Layer enhances task-specific semantics. Furthermore, we introduce two auxiliary modules: an Interest-Guided Denoising Layer that filters modality features using session context, and a Multimodal Alignment Layer that enforces cross-modal coherence. Extensive experiments on real-world datasets demonstrate that our model significantly outperforms state-of-the-art methods while maintaining practical training efficiency.
- RESMapping Stakeholder Needs to Multi-Sided Fairness in Candidate Recommendation for Algorithmic Hiring
by Mesut Kaya (Jobindex A/S), Toine Bogers (IT University of Copenhagen)Already before the enactment of the EU AI Act, candidate or job recommendation for algorithmic hiring—semi-automatically matching CVs to job postings—was used as an example of a high-risk application where unfair treatment could result in serious harms to job seekers. Recommending candidates to jobs or jobs to candidates, however, is also a fitting example of a multi-stakeholder recommendation problem. In such multi-stakeholder systems, the end user is not the only party whose interests should be considered when generating recommendations. In addition to job seekers, other parties—such as recruiters, the organizations behind the job postings, and the recruitment agency itself—also have a stake in the process and deserve to have their perspectives included in the design of relevant fairness metrics. Nevertheless, past analyses of fairness in algorithmic hiring have been restricted to single-side fairness, ignoring the perspectives of the other stakeholders. In this paper, we address this gap and present a multi-stakeholder approach to fairness in a candidate recommender system that recommends relevant candidate CVs to human recruiters in a human-in-the-loop algorithmic hiring scenario. We conducted semi-structured interviews with 40 different stakeholders (job seekers, companies, recruiters, and other job portal employees). We used these interviews to explore their lived experiences of unfairness in hiring, and to co-design definitions of fairness as well as metrics that might capture these experiences. Finally, we attempt to reconcile and map these different (and sometimes conflicting) perspectives and definitions to existing (categories of) fairness metrics that are relevant for our candidate recommendation scenario.
- RESMeasuring Interaction-Level Unlearning Difficulty for Collaborative Filtering
by Haocheng Dou (Taiyuan University of Technology), Tao Lian (Taiyuan University of Technology), Xin Xin (Shandong University)The growing emphasis on data privacy and user controllability mandates that recommendation models support the removal of specified data, known as recommendation unlearning (RU). Although model retraining is often regarded as the gold standard for machine unlearning, it is inadequate to attain complete unlearning in collaborative filtering recommendation due to the interdependencies among user-item interactions. To this end, we introduce the concept of interaction-level unlearning difficulty, which serves as a foresighted indicator of the unlearning incompleteness or actual unlearning effectiveness after forgetting each interaction. Through extensive experiments with retraining and model-agnostic unlearning methods, we identify two interpretable data characteristics that can serve as useful unlearning difficulty indicators: Embedding Entanglement Index (EEI) and Subgraph Average Degree (AD). They correlate strongly with existing membership inference metrics focused on data removal, as well as with our proposed unlearning effectiveness metrics from the recommendation perspective—Score Shift, UnlearnMRR, and UnlearnRecall. In addition, we investigate the efficacy of an unlearning enhancement technique named Extra Deletion in handling unlearning requests of different difficulty levels. The results show that more related interactions need to be extra deleted to achieve acceptable unlearning effectiveness for difficult unlearning requests, while fewer or no extra deletions are needed for easier-to-forget requests. This study provides a novel perspective for advancing the development of more tailored RU methods.
- RESMoRE: A Mixture of Reflectors Framework for Large Language Model-Based Sequential Recommendation
by Weicong Qin (Gaoling School of Artificial Intelligence, Renmin University of China), Yi Xu (Gaoling School of Artificial Intelligence, Renmin University of China), Weijie Yu (School of Information Technology and Management, University of International Business and Economics), Chenglei Shen (Gaoling School of Artificial Intelligence, Renmin University of China), Xiao Zhang (Gaoling School of Artificial Intelligence, Renmin University of China), Ming He (AI Lab at Lenovo Research, Lenovo Group Limited), Jianping Fan (AI Lab at Lenovo Research, Lenovo Group Limited), Jun Xu (Gaoling School of Artificial Intelligence, Renmin University of China)Large language models (LLMs) have emerged as a cutting-edge approach in sequential recommendation, leveraging historical interactions to model dynamic user preferences. Current methods mainly focus on learning processed recommendation data in the form of sequence-to-sequence text. While effective, they exhibit three key limitations: 1) failing to decouple intra-user explicit features (e.g., product titles) from implicit behavioral patterns (e.g., brand loyalty) within interaction histories; 2) underutilizing cross-user collaborative filtering (CF) signals; and 3) relying on inefficient reflection update strategies. To address this, we propose MoRE (Mixture of REflectors), which introduces three perspective-aware offline reflection processes to address these gaps. This decomposition directly resolves Challenges 1 (explicit/implicit ambiguity) and 2 (CF underutilization). Furthermore, MoRE’s meta-reflector employs a self-improving strategy and a dynamic selection mechanism (Challenge 3) to adapt to evolving user preferences. First, two intra-user reflectors decouple explicit and implicit patterns from a user’s interaction sequence, mimicking traditional recommender systems’ ability to distinguish surface-level and latent preferences.
A third cross-user reflector captures CF signals by analyzing user similarity patterns from multiple users’ interactions. To optimize reflection quality, MoRE’s meta-reflector employs an offline self-improving strategy that evaluates reflection impacts through presence/absence comparisons and iterative refinement of old/new versions, while an online contextual bandit mechanism dynamically selects the optimal recommendation perspective for each user. Experiments on three benchmarks show MoRE outperforms both traditional recommenders and LLM-based methods with minimal computational overhead, validating its effectiveness in bridging LLMs’ semantic understanding with multidimensional recommendation principles.
- RESModeling Long-term User Behaviors with Diffusion-driven Multi-interest Network for CTR Prediction
by Weijiang Lai (Institute of Software, Chinese Academy of Sciences), Beihong Jin (Institute of Software, Chinese Academy of Sciences), Yapeng Zhang (Meituan), Yiyuan Zheng (Institute of Software, Chinese Academy of Sciences), Rui Zhao (Institute of Software, Chinese Academy of Sciences), Jian Dong (Meituan), Jun Lei (Meituan), Xingxing Wang (Meituan)CTR (Click-Through Rate) prediction, crucial for recommender systems, online advertising, and beyond, has been shown to benefit from modeling long-term user behaviors. Nonetheless, the vast number of behaviors and complexity of noise interference pose challenges to prediction efficiency and effectiveness. Recent solutions have evolved from single-stage models to two-stage models. However, current two-stage models often filter out significant information, resulting in an inability to capture diverse user interests and build the complete latent space of user interests. Inspired by multi-interest and generative modeling, we propose DiffuMIN (Diffusion-driven Multi-Interest Network) to model long-term user behaviors and thoroughly explore the user interest space. Specifically, we propose a target-oriented multi-interest extraction method that begins by orthogonally decomposing the target to obtain interest channels. This is followed by modeling the relationships between interest channels and user behaviors to disentangle and extract multiple user interests. We then introduce a diffusion module guided by contextual interests and interest channels, which anchor users’ personalized and target-oriented interest types, enabling the generation of augmented interests that align with the latent spaces of user interests, thereby further exploring the restricted interest space. Finally, we leverage contrastive learning to ensure that the generated augmented interests align with users’ genuine preferences.
Extensive offline experiments are conducted on two public datasets and one industrial dataset, yielding results that demonstrate the superiority of DiffuMIN. Moreover, DiffuMIN increased CTR by 1.52% and CPM by 1.10% in online A/B testing.
- RESMulti-Granularity Distribution Modeling for Video Watch Time Prediction via Exponential-Gaussian Mixture Network
by Xu Zhao (Xiaohongshu), Ruibo Ma (Xiaohongshu), Jiaqi Chen (Xiaohongshu), Weiqi Zhao (Xiaohongshu), Ping Yang (Xiaohongshu), Yao Hu (Xiaohongshu)Accurate watch time prediction is crucial for enhancing user engagement in streaming short-video platforms, although it is challenged by complex distribution characteristics across multi-granularity levels. Through systematic analysis of real-world industrial data, we uncover two critical challenges in watch time prediction from a distribution aspect: (1) coarse-grained skewness induced by a significant concentration of quick-skips, and (2) fine-grained diversity arising from various user-video interaction patterns. Consequently, we assume that the watch time follows the Exponential-Gaussian Mixture (EGM) distribution, where the exponential and Gaussian components respectively characterize the skewness and diversity. Accordingly, an Exponential-Gaussian Mixture Network (EGMN) is proposed for the parameterization of the EGM distribution, which consists of two key modules: a hidden representation encoder and a mixture parameter generator. We conduct extensive offline experiments and online A/B tests on our industrial short-video platform to validate the superiority of EGMN compared with existing state-of-the-art methods. Comprehensive experimental results show that EGMN exhibits excellent distribution fitting ability across coarse-to-fine-grained levels.
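The abstract does not give EGMN's exact parameterization; one standard way to write a two-component exponential-Gaussian mixture over watch time $t$, with an assumed mixture weight $w$ on the quick-skip (exponential) component, is:

```latex
p(t) = w\,\lambda e^{-\lambda t}\,\mathbf{1}[t \ge 0]
     \;+\; (1 - w)\,\frac{1}{\sqrt{2\pi\sigma^{2}}}
     \exp\!\left(-\frac{(t-\mu)^{2}}{2\sigma^{2}}\right),
\qquad 0 \le w \le 1.
```

Under this reading, the mixture parameter generator would predict $(w, \lambda, \mu, \sigma)$ per user-video pair, trained by maximizing the log-likelihood of observed watch times; the exponential term captures the quick-skip mass near zero, while the Gaussian term models the diversity of longer views.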
- RESNLGCL: Naturally Existing Neighbor Layers Graph Contrastive Learning for Recommendation
by Jinfeng Xu (The University of Hong Kong), Zheyu Chen (The Hong Kong Polytechnic University), Shuo Yang (The University of Hong Kong), Jinze Li (The University of Hong Kong), Hewei Wang (Carnegie Mellon University), Wei Wang (Shenzhen MSU-BIT University), Xiping Hu (Beijing Institute of Technology), Edith Ngai (The University of Hong Kong)Graph Neural Networks (GNNs) are widely used in collaborative filtering to capture high-order user-item relationships. To address the data sparsity problem in recommendation systems, Graph Contrastive Learning (GCL) has emerged as a promising paradigm that maximizes mutual information between contrastive views. However, existing GCL methods rely on augmentation techniques that introduce semantically irrelevant noise and incur significant computational and storage costs, limiting effectiveness and efficiency. To overcome these challenges, we propose NLGCL, a novel contrastive learning framework that leverages naturally contrastive views between neighbor layers within GNNs. By treating each node and its neighbors in the next layer as positive pairs, and other nodes as negatives, NLGCL avoids augmentation-based noise while preserving semantic relevance. This paradigm eliminates costly view construction and storage, making it computationally efficient and practical for real-world scenarios. Extensive experiments on four public datasets demonstrate that NLGCL outperforms state-of-the-art baselines in effectiveness and efficiency.
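The core idea above — a node's embeddings at adjacent GNN layers form the positive pair, while other nodes' next-layer embeddings act as negatives — can be sketched as an InfoNCE-style loss. This is a generic illustration, not the paper's loss; function and argument names are invented:

```python
import math

def neighbor_layer_infonce(layer_k, layer_k1, node, temperature=0.2):
    """InfoNCE-style loss using two adjacent GNN layers as contrastive views:
    (layer_k[node], layer_k1[node]) is the positive pair; the layer-(k+1)
    embeddings of every node serve as the contrast set. Embeddings are plain
    lists of floats; no augmented views are ever constructed or stored."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    def cosine(u, v):
        return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

    pos = math.exp(cosine(layer_k[node], layer_k1[node]) / temperature)
    denom = sum(math.exp(cosine(layer_k[node], layer_k1[other]) / temperature)
                for other in layer_k1)
    return -math.log(pos / denom)
```

Because the "views" are just intermediate layer outputs the GNN already computes, the augmentation step of standard GCL disappears entirely, which is the efficiency argument the abstract makes.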
- RESNon-parametric Graph Convolution for Re-ranking in Recommendation Systems
by Zhongyu Ouyang (Dartmouth College), Mingxuan Ju (Snap Inc.), Soroush Vosoughi (Dartmouth College), Yanfang Ye (University of Notre Dame)Graph knowledge has been proven effective in enhancing item rankings in recommender systems (RecSys), particularly during the retrieval stage. However, its application in the ranking stage, where richer contextual information (e.g., user, item, and interaction features) is available, remains underexplored. A major challenge lies in the substantial computational cost associated with repeatedly retrieving neighborhood information from billions of items stored in distributed systems. This resource-intensive requirement makes it difficult to scale graph-based methods during model training, and apply them in practical RecSys. To bridge this gap, we first demonstrate that incorporating graphs in the ranking stage improves ranking qualities. Notably, while the improvement is evident, we show that the substantial computational overheads entailed by graphs are prohibitively expensive for real-world recommendations. In light of this, we propose a non-parametric strategy that utilizes graph convolution for re-ranking only during test time. Our strategy circumvents the notorious computational overheads from graph convolution during training, and utilizes structural knowledge hidden in graphs on-the-fly during testing. It can be used as a plug-and-play module and easily employed to enhance the ranking ability of various ranking layers of a real-world RecSys with significantly reduced computational overhead. Through comprehensive experiments across four benchmark datasets with varying levels of sparsity, we demonstrate that our strategy yields noticeable improvements (i.e., 8.1% on average) during testing time with little to no additional computational overheads (i.e., 0.5% on average).
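As a generic illustration of non-parametric test-time graph smoothing for re-ranking (not the paper's exact formulation; the data structures and the blend weight `alpha` are assumptions), one might blend each candidate's base ranking score with the mean score of its interaction-graph neighbors:

```python
def rerank_with_graph(scores, neighbors, alpha=0.5):
    """Non-parametric re-ranking: smooth each item's ranking-layer score with
    the average score of its graph neighbors, then sort. No parameters are
    trained; graph knowledge is injected only at test time.
    scores: {item: base score}; neighbors: {item: [neighbor items]}."""
    smoothed = {}
    for item, score in scores.items():
        nbr_scores = [scores[n] for n in neighbors.get(item, []) if n in scores]
        if nbr_scores:
            smoothed[item] = (1 - alpha) * score + alpha * sum(nbr_scores) / len(nbr_scores)
        else:
            smoothed[item] = score  # isolated item keeps its base score
    return sorted(smoothed, key=smoothed.get, reverse=True)
```

Since this runs only over the small candidate set produced by the ranking layer, it avoids the training-time cost of full graph convolution while still exploiting structural signal, which mirrors the plug-and-play claim above.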
- RESOff-Policy Evaluation and Learning for Matching Markets
by Yudai Hayashi (Wantedly, inc.), Shuhei Goda (Independent Researcher), Yuta Saito (Cornell University)Matching users based on mutual preferences is a fundamental aspect of services driven by reciprocal recommendations, such as job search and dating applications. Although A/B testing remains the gold standard for evaluating new policies in recommender systems for matching markets, it is costly and impractical for frequent policy updates. Off-Policy Evaluation (OPE) thus plays a crucial role by enabling the evaluation of recommendation policies using only offline logged data naturally collected on the platform. However, unlike conventional recommendation settings, the bidirectional nature of user interactions in matching platforms introduces complex biases and exacerbates reward sparsity, making standard OPE methods unreliable. To address these challenges and facilitate effective offline evaluation, we propose novel OPE estimators, DiPS and DPR, specifically designed for matching markets. Our methods combine elements of the Direct Method (DM), Inverse Propensity Score (IPS), and Doubly Robust (DR) estimators while incorporating intermediate labels, such as initial engagement signals, to achieve better bias-variance control, particularly in sparse-reward environments. Theoretically, we derive the bias and variance of the proposed estimators and demonstrate their advantages over conventional methods. Furthermore, we show that these estimators can be seamlessly extended to offline policy learning methods for improving recommendation policies to make more matches. We empirically evaluate our methods through experiments on both synthetic data and real-world A/B testing logs from the job-matching platform Wantedly Visit. The empirical results highlight the superiority of our approach over existing methods in both off-policy evaluation and policy learning tasks, particularly when match labels are sparse and existing methods tend to collapse.
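DiPS and DPR themselves are not specified in the abstract; they build on the standard estimators it names. For orientation, a minimal sketch of plain IPS (the baseline, not the proposed estimators) over logged bandit feedback might look like this, with the log format and `target_prob` callback as assumptions:

```python
def ips_estimate(logs, target_prob):
    """Standard inverse-propensity-scoring OPE estimate of a target policy's
    value: average reward reweighted by pi_e(a|x) / pi_0(a|x).
    logs: iterable of (context, action, reward, logging_prob) tuples, where
    logging_prob is the probability the logging policy assigned to `action`;
    target_prob(context, action): the evaluation policy's probability."""
    total = 0.0
    count = 0
    for context, action, reward, logging_prob in logs:
        weight = target_prob(context, action) / logging_prob  # importance weight
        total += weight * reward
        count += 1
    return total / count
```

In the sparse-reward matching setting the abstract describes, such rewards (successful matches) are rare, which inflates the variance of this estimator; that is the gap the intermediate engagement labels in DiPS/DPR are said to address.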
- RESOff-Policy Evaluation of Candidate Generators in Two-Stage Recommender Systems
by Peiyao Wang (Amazon.com), Zhan Shi (Amazon.com), Amina Shabbeer (Amazon.com), Ben London (Amazon.com)We study offline evaluation of two-stage recommender systems, focusing on the first stage, candidate generation. Traditionally, candidate generators have been evaluated in terms of standard information retrieval metrics, using curated or heuristically labeled data, which does not always reflect their true impact on user experience or business metrics. We instead take a holistic view, measuring their effectiveness with respect to the downstream recommendation task, using data logged from past user interactions with the system. Using the contextual bandit formalism, we frame this evaluation task as off-policy evaluation (OPE) with a new action set induced by a new candidate generator. To the best of our knowledge, ours is the first study to examine evaluation of candidate generators through the lens of OPE. We propose two importance-weighting methods to measure the impact of a new candidate generator using data collected from a downstream task. We analyze the asymptotic properties of these methods and derive expressions for their respective biases and variances. This analysis illuminates a procedure to optimize the estimators so as to reduce bias. Finally, we present empirical results that demonstrate the estimators’ efficacy on synthetic and benchmark data. We find that our proposed methods achieve lower bias with comparable or reduced variance relative to baseline approaches that do not account for the new action set.
- RESOn the Reliability of Sampling Strategies in Offline Recommender Evaluation
by Bruno Pereira (Universidade Federal de Minas Gerais), Alan Said (University of Gothenburg), Rodrygo Santos (Universidade Federal de Minas Gerais)Offline evaluation plays a central role in benchmarking recommender systems when online testing is impractical or risky. However, it is susceptible to two key sources of bias: exposure bias, where users only interact with items they are shown, and sampling bias, introduced when evaluation is performed on a subset of logged items rather than the full catalog. While prior work has proposed methods to mitigate sampling bias, these are typically assessed on fixed logged datasets rather than for their ability to support reliable model comparisons under varying exposure conditions or relative to true user preferences. In this paper, we investigate how different combinations of logging and sampling choices affect the reliability of offline evaluation. Using a fully observed dataset as ground truth, we systematically simulate diverse exposure biases and assess the reliability of common sampling strategies along four dimensions: discriminative power (recommender model separability), fidelity (agreement with full evaluation), robustness (stability under exposure bias), and predictive power (alignment with ground truth). Our findings highlight when and how sampling distorts evaluation outcomes and offer practical guidance for selecting strategies that yield faithful and robust offline comparisons.
- RESParagon: Parameter Generation for Controllable Multi-Task Recommendation
by Chenglei Shen (Renmin University of China), Jiahao Zhao (Renmin University of China), Xiao Zhang (Renmin University of China), Weijie Yu (University of International Business and Economics), Ming He (AI Lab at Lenovo Research), Jianping Fan (AI Lab at Lenovo Research)Commercial recommender systems face the challenge that task requirements from platforms or users often change dynamically (e.g., varying preferences for accuracy or diversity). Ideally, the model should be re-trained after resetting a new objective function, adapting to these changes in task requirements. However, in practice, the high computational costs associated with retraining make this process impractical for models already deployed to online environments. This raises a new challenging problem: how to efficiently adapt the learned model to different task requirements by controlling the model parameters after deployment, without the need for retraining. To address this issue, we propose a novel controllable learning approach via parameter generation for controllable multi-task recommendation (Paragon), which allows the customization and adaptation of recommendation model parameters to new task requirements without retraining. Specifically, we first obtain the optimized model parameters through adapter tuning based on the feasible task requirements. Then, we utilize the generative model as a parameter generator, employing classifier-free guidance in conditional training to learn the distribution of optimized model parameters under various task requirements. Finally, the parameter generator is applied to effectively generate model parameters in a test-time adaptation manner given task requirements. Moreover, Paragon seamlessly integrates with various existing recommendation models to enhance their controllability. Extensive experiments indicate that Paragon can effectively enhance controllability for recommendation through efficient model parameter generation.
- RESPinFM: Foundation Model for User Activity Sequences at a Billion-scale Visual Discovery Platform
by Xiangyi Chen (Pinterest), Kousik Rajesh (Pinterest), Matthew Lawhon (Pinterest), Zelun Wang (Pinterest), Hanyu Li (Pinterest), Haomiao Li (Pinterest), Saurabh Vishwas Joshi (Pinterest), Pong Eksombatchai (Pinterest), Jaewon Yang (Pinterest), Yi-Ping Hsu (Pinterest), Jiajing Xu (Pinterest), Charles Rosenberg (Pinterest)User activity sequences have emerged as one of the most important signals in recommender systems. We present a foundational model, PinFM, for understanding user activity sequences across multiple applications at a billion-scale visual discovery platform. We pretrain a transformer model with 20B+ parameters using extensive user activity data, then fine-tune it for specific applications, efficiently coupling it with existing models. While this pretraining-and-fine-tuning approach has been popular in other domains, such as Vision and NLP, its application in industrial recommender systems presents numerous challenges. The foundational model must be scalable enough to score millions of items every second while meeting tight cost and latency constraints imposed by these systems. Additionally, it should capture the interactions between user activities and other features and handle new items that were not present during the pretraining stage. We developed innovative techniques to address these challenges. Our infrastructure and algorithmic optimizations, such as the Deduplicated Cross-Attention Transformer (DCAT), improved our throughput by 600%. We demonstrate that PinFM can learn interactions between user sequences and candidate items by altering input sequences, leading to a 20% increase in engagement with new items. PinFM is now deployed to help improve the experience of more than a half billion users across various applications.
- RESPrivacy-Preserving Social Recommendation: Privacy Leakage and Countermeasure
by Yuyue Chen (Harbin Institute of Technology, Shenzhen), Peng Yang (The University of Hong Kong), Zoe Lin Jiang (Harbin Institute of Technology, Shenzhen), Wenhao Wu (Harbin Institute of Technology, Shenzhen), Junbin Fang (Jinan University), Xuan Wang (Harbin Institute of Technology, Shenzhen), Chuanyi Liu (Harbin Institute of Technology, Shenzhen)Social recommendation systems generally utilize two types of data, user-item interaction matrices (R) from a rating platform (P0), and user-user social graphs (S) from a social platform (P1). Considering user privacy that neither R nor S can be directly shared, Chen et al. introduced the Secure Social Recommendation (SeSoRec) framework with the Secret Sharing-based Matrix Multiplication (SSMM) protocol. However, we find that the leakage of intermediate information introduced by SSMM will eventually lead to the leakage of S to P0, which challenges the privacy guarantees of SeSoRec.
This work firstly identifies that the claimed “innocuous” leakage in SeSoRec originates from reusing the same One-Time Pad key during two randomization phases in SSMM, with formal proof that SSMM violates semi-honest security. Secondly, this work proposes the Two-Time Pad Attack with two reconstruction algorithms to evaluate the severity of the leakage. The Two-Time Pad Attack can extract the column-wise sums and row-wise differences of intermediate matrices that are closely related to R or S. The Sparse Matrix Reconstruction (SMR) algorithm can achieve 99.35%, 83.83%, and 77.14% reconstruction rates for non-zero entries in S on the FilmTrust, Epinions, and Douban datasets, respectively. The Grayscale Image Reconstruction (GIR) algorithm can successfully recover MNIST image contours. Thirdly, when the number of columns/rows of the input matrix A/B in SSMM is odd (requiring zero-padding to an even dimension), this work proposes the Zero-Padding Attack which can directly expose the last column/row of A/B. Finally, this work proposes the Privacy-Preserving Matrix Multiplication (PPMM) protocol with experimental demonstration as a replacement for SSMM, which eliminates such leakage while maintaining efficiency.
- RESPrompt-to-Slate: Diffusion Models for Prompt-Conditioned Slate Generation
by Federico Tomasi (Spotify), Francesco Fabbri (Spotify), Justin Carter (Spotify), Elias Kalomiris (Spotify), Mounia Lalmas (Spotify), Zhenwen Dai (Spotify)Slate generation is a common task in streaming and e-commerce platforms, where multiple items are presented together as a list or “slate”. Traditional systems focus mostly on item-level ranking and often fail to capture the coherence of the slate as a whole. A key challenge lies in the combinatorial nature of selecting multiple items jointly. To manage this, conventional approaches often assume users interact with only one item at a time, an assumption that breaks down when items are meant to be consumed together. In this paper, we introduce DMSG, a generative framework based on diffusion models for prompt-conditioned slate generation. DMSG learns high-dimensional structural patterns and generates coherent, diverse slates directly from natural language prompts. Unlike retrieval-based or autoregressive models, DMSG models the joint distribution over slates, enabling greater flexibility and diversity. We evaluate DMSG in two key domains: music playlist generation and e-commerce bundle creation. In both cases, DMSG produces high-quality slates from textual prompts without explicit personalization signals. Offline and online results show that DMSG outperforms strong baselines in both relevance and diversity, offering a scalable, low-latency solution for prompt-driven recommendation. A live A/B test on a production playlist system further demonstrates increased user engagement and content diversity.
- RESRecPS: Privacy Risk Scoring for Recommender Systems
by Jiajie He (University of Maryland, Baltimore County), Yuechun Gu (University of Maryland, Baltimore County), Keke Chen (University of Maryland, Baltimore County)Recommender systems (RecSys) have become an essential component of many web applications. The core of the system is a recommendation model trained on highly sensitive user-item interaction data. While privacy-enhancing techniques are actively studied in the research community, the real-world model development still depends on minimum privacy protection, e.g., via controlled access. Users of such systems should have the right to choose not to share highly sensitive interactions. However, there is no method allowing the user to know which interactions are more sensitive than others. Thus, quantifying the privacy risk of RecSys training data is a critical step to enabling privacy-aware RecSys model development and deployment. We propose a membership-inference-attack (MIA) based privacy scoring method, RecPS, to measure privacy risks at the interaction and the user levels. The RecPS interaction-level score definition is motivated and derived from differential privacy, which is then extended to the user-level scoring method. A critical component is the interaction-level MIA method RecLiRA, which gives high-quality membership estimation. We have conducted extensive experiments on well-known benchmark datasets and RecSys models to show the unique features and benefits of RecPS scoring in risk assessment and RecSys model unlearning.
- RESRecommendation and Temptation
by Md Sanzeed Anwar (University of Michigan), Paramveer Dhillon (University of Michigan), Grant Schoenebeck (University of Michigan)Traditional recommender systems relying on revealed preferences often fail to capture users’ dual-self nature, where consumption choices are driven by both long-term benefits (enrichment) and desire for instant gratification (temptation). Consequently, these systems may generate recommendations that fail to provide long-lasting satisfaction to users. To address this issue, we propose a reimagination of recommender design paradigms. We begin by introducing a novel user model that accounts for dual-self behaviors and the existence of off-platform options. We then propose a novel recommendation objective aligned with long-lasting user satisfaction, and develop the optimal recommendation strategy for this objective. Finally, we present an estimation framework that makes minimal assumptions and leverages the distinction between explicit user feedback and implicit choice data to implement this strategy in practice. We evaluate our approach through both synthetic simulations and simulations based on real-world data from the MovieLens dataset. Results demonstrate that our proposed recommender can deliver superior enrichment compared to several competitive baseline algorithms that operate under the revealed preferences assumption and do not account for dual-self behaviors. Our work opens the door to more nuanced and user-centric recommender design, with significant implications for the development of responsible AI systems.
- RESR⁴ec: A Reasoning, Reflection, and Refinement Framework for Recommendation Systems
by Hao Gu (Institute of Automation, Chinese Academy of Sciences), Rui Zhong (Kuaishou Technology), Yu Xia (University of Chinese Academy of Sciences), Wei Yang (Kuaishou Technology), Chi Lu (Kuaishou Technology), Peng Jiang (Kuaishou Technology), Kun Gai (Kuaishou Technology)Harnessing Large Language Models (LLMs) for recommendation systems has emerged as a prominent avenue, drawing substantial research interest. However, existing approaches primarily involve basic prompt techniques for knowledge acquisition, which resemble System-1 thinking. This makes these methods highly sensitive to errors in the reasoning path, where even a small mistake can lead to an incorrect inference. To this end, in this paper, we propose R⁴ec, a reasoning, reflection and refinement framework that evolves the recommendation system into a weak System-2 model. Specifically, we introduce two models: an actor model that engages in reasoning, and a reflection model that judges these responses and provides valuable feedback. Then the actor model will refine its response based on the feedback, ultimately leading to improved responses. We employ an iterative reflection and refinement process, enabling LLMs to facilitate slow and deliberate System-2-like thinking. Ultimately, the final refined knowledge will be incorporated into a recommendation backbone for prediction. We conduct extensive experiments on Amazon-Book and MovieLens-1M datasets to demonstrate the superiority of R⁴ec. We also deploy R⁴ec on a large-scale online advertising platform, showing a 2.2% increase in revenue. Furthermore, we investigate the scaling properties of the actor model and reflection model.
- RESScalable Data Debugging for Neighborhood-based Recommendation with Data Shapley Values
by Barrie Kersbergen (Bol & University of Amsterdam), Olivier Sprangers (Nixtla), Bojan Karlaš (Harvard University), Maarten de Rijke (University of Amsterdam), Sebastian Schelter (BIFOLD & TU Berlin)Machine learning-powered recommendation systems help users find items they like. Issues in the interaction data processed by these systems frequently lead to problems, e.g., to the accidental recommendation of low-quality products or dangerous items. Such data issues are hard to anticipate upfront, and are typically detected post-deployment after they have already impacted the user experience. We argue that a principled data debugging process is required during which human experts identify potentially hurtful data issues and preemptively mitigate them. Recent notions of “data importance”, such as the Data Shapley value (DSV), represent a promising direction to identify training data points likely to cause issues. However, the scale of real-world interaction datasets makes it infeasible to apply existing techniques to compute the DSV in recommendation scenarios. We tackle this problem by introducing the KMC-Shapley algorithm for the scalable estimation of Data Shapley values in neighborhood-based recommendation on sparse interaction data. We conduct an experimental evaluation of the efficiency and scalability of our algorithm on both public and proprietary datasets with millions of interactions, and showcase that the DSV identifies impactful data points for two recommendation tasks in e-commerce. Furthermore, we discuss applications of the DSV on real-world click and purchase data in e-commerce from CompanyX, such as identifying dangerous and low-quality products as well as improving the ecological sustainability of product recommendations.
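For context, Data Shapley values are commonly approximated by Monte-Carlo sampling over random permutations of the training data; the paper's KMC-Shapley algorithm is not specified in the abstract, but a minimal permutation-sampling sketch (with a hypothetical `utility` function standing in for recommender quality on held-out interactions) might look like this:

```python
import random

def monte_carlo_shapley(points, utility, num_permutations=200, seed=0):
    """Estimate Data Shapley values by averaging each point's marginal
    utility contribution over random permutations of the training set.

    `utility(subset)` scores a model trained on `subset` (e.g., held-out
    accuracy of a neighborhood-based recommender); higher is better.
    """
    rng = random.Random(seed)
    shapley = {p: 0.0 for p in points}
    for _ in range(num_permutations):
        perm = list(points)
        rng.shuffle(perm)
        prev_utility = utility(frozenset())
        subset = set()
        for p in perm:
            subset.add(p)
            cur_utility = utility(frozenset(subset))
            shapley[p] += cur_utility - prev_utility
            prev_utility = cur_utility
    return {p: v / num_permutations for p, v in shapley.items()}
```

Points with large negative estimated values are natural candidates for the human debugging pass the paper argues for; the scalability contribution of KMC-Shapley lies precisely in avoiding this generic retrain-per-subset loop.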
- RESTag-augmented Dual-target Cross-domain Recommendation
by Mingfan Pan (University of Science and Technology of China), Qingyang Mao (University of Science and Technology of China), Xu An (University of Science and Technology of China), Jianhui Ma (University of Science and Technology of China), Gang Zhou (Information Engineering University), Mingyue Cheng (State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China), Enhong Chen (University of Science and Technology of China)Cross-domain recommendation (CDR) has been proposed to alleviate the data sparsity issue in recommendation systems and has garnered substantial research interest. In recent years, dual-target CDR has been an increasingly prevalent research topic that emphasizes simultaneous enhancement in both the source and target domains. Many existing approaches rely on overlapping users as bridges between domains, yet in real-world scenarios, the number of such users is often severely limited, restricting their practical applicability. To overcome this limitation, alternative methods for cross-domain connections are needed, and item tags serve as a promising solution. However, real-world tags suffer from severe deficiencies in terms of both quantity and diversity, and existing studies have not fully exploited their potential. In this paper, we introduce Tag-augmented Dual-target Cross-domain Recommendation (TA-DTCDR), which is the first to apply LLM-distilled tag information to CDR. TA-DTCDR utilizes item tags distilled by large language models (LLMs) as an additional channel to facilitate information transfer, thereby mitigating performance decline caused by the lack of overlapping users. Furthermore, to fully leverage the natural language information carried by the distilled tags, we design a series of training tasks to align tag semantics across domains while preserving their semantic independence. 
The proposed method is validated on multiple tasks using public datasets, showing significant improvements over existing state-of-the-art approaches.
- RESTest-Time Alignment with State Space Model for Tracking User Interest Shifts in Sequential Recommendation
by Changshuo Zhang (Gaoling School of Artificial Intelligence, Renmin University of China), Xiao Zhang (Gaoling School of Artificial Intelligence, Renmin University of China), Teng Shi (Gaoling School of Artificial Intelligence, Renmin University of China), Jun Xu (Gaoling School of Artificial Intelligence, Renmin University of China), Ji-Rong Wen (Gaoling School of Artificial Intelligence, Renmin University of China)Sequential recommendation is essential in modern recommender systems, aiming to predict the next item a user may interact with based on their historical behaviors. However, real-world scenarios are often dynamic and subject to shifts in user interests. Conventional sequential recommendation models are typically trained on static historical data, limiting their ability to adapt to such shifts and resulting in significant performance degradation during testing. Recently, Test-Time Training (TTT) has emerged as a promising paradigm, enabling pre-trained models to dynamically adapt to test data by leveraging unlabeled examples during testing. However, applying TTT to effectively track and address user interest shifts in recommender systems remains an open and challenging problem. Key challenges include how to capture temporal information effectively and explicitly identify shifts in user interests during the testing phase. To address these issues, we propose T2ARec, a novel model leveraging a state space model for TTT by introducing two Test-Time Alignment modules tailored for sequential recommendation, effectively capturing the distribution shifts in user interest patterns over time. Specifically, T2ARec aligns absolute time intervals with model-adaptive learning intervals to capture temporal dynamics and introduces an interest state alignment mechanism to effectively and explicitly identify the user interest shifts with theoretical guarantees.
These two alignment modules enable efficient and incremental updates to model parameters in a self-supervised manner during testing, enhancing predictions for online recommendation. Extensive evaluations on three benchmark datasets demonstrate that T2ARec achieves state-of-the-art performance and robustly mitigates the challenges posed by user interest shifts.
- RESUSB-Rec: An Effective Framework for Improving Conversational Recommendation Capability of Large Language Model
by Jianyu Wen (Harbin Institute of Technology), Jingyun Wang (Beihang University), Cilin Yan (Xiaohongshu Inc.), Jiayin Cai (Xiaohongshu Inc.), Xiaolong Jiang (Xiaohongshu Inc.), Ying Zhang (Harbin Institute of Technology)Recently, Large Language Models (LLMs) have been widely employed in Conversational Recommender Systems (CRSs). Unlike traditional language model approaches that focus on training, all existing LLM-based approaches are mainly centered around how to leverage the summarization and analysis capabilities of LLMs while ignoring the issue of training. Therefore, in this work, we propose an integrated training-inference framework, the User-Simulator-Based framework (USB-Rec), for improving the performance of LLMs in conversational recommendation at the model level. Firstly, we design an LLM-based Preference Optimization (PO) dataset construction strategy for RL training, which helps the LLMs understand the strategies and methods in conversational recommendation. Secondly, we propose a Self-Enhancement Strategy (SES) at the inference stage to further exploit the conversational recommendation potential obtained from RL training. Extensive experiments on various datasets demonstrate that our method consistently outperforms previous state-of-the-art methods.
- RESVL-CLIP: Enhancing Multimodal Recommendations via Visual Grounding and LLM-Augmented CLIP Embeddings
by Ramin Giahi (Walmart Global Tech), Kehui Yao (Walmart Global Tech), Sriram Kollipara (Walmart Global Tech), Kai Zhao (Walmart Global Tech), Vahid Mirjalili (Walmart Global Tech), Jianpeng Xu (Walmart Global Tech), Topojoy Biswas (Walmart Global Tech), Evren Korpeoglu (Walmart Global Tech), Kannan Achan (Walmart Global Tech)Multimodal learning plays a critical role in e-commerce recommendation platforms today, enabling accurate recommendations and product understanding. However, existing vision-language models, such as CLIP, face key challenges in e-commerce recommendation systems: 1) Weak object-level alignment, where global image embeddings fail to capture fine-grained product attributes, leading to suboptimal retrieval performance; 2) Ambiguous textual representations, where product descriptions often lack contextual clarity, affecting cross-modal matching; and 3) Domain mismatch, as generic vision-language models may not generalize well to e-commerce-specific data. To address these limitations, we propose a framework, VL-CLIP, that enhances CLIP embeddings by integrating Visual Grounding for fine-grained visual understanding and an LLM-based agent for generating enriched text embeddings. Visual Grounding refines image representations by localizing key products, while the LLM agent enhances textual features by disambiguating product descriptions. Our approach significantly improves retrieval accuracy, multimodal retrieval effectiveness, and recommendation quality across tens of millions of items on one of the largest e-commerce platforms in the U.S., increasing CTR by 18.6%, ATC by 15.5%, and GMV by 4.0%. Additional experimental results show that our framework outperforms vision-language models, including CLIP, FashionCLIP, and GCL, in both precision and semantic alignment, demonstrating the potential of combining object-aware visual grounding and LLM-enhanced text representation for robust multimodal recommendations.
- RESYou Don’t Bring Me Flowers: Mitigating Unwanted Recommendations Through Conformal Risk Control
by Giovanni De Toni (Fondazione Bruno Kessler (FBK)), Erasmo Purificato (European Commission, Joint Research Centre (JRC)), Emilia Gomez (European Commission, Joint Research Centre (JRC)), Andrea Passerini (University of Trento), Bruno Lepri (Fondazione Bruno Kessler (FBK)), Cristian Consonni (European Commission, Joint Research Centre (JRC))Recommenders are significantly shaping online information consumption. While effective at personalizing content, these systems increasingly face criticism for propagating irrelevant, unwanted, and even harmful recommendations. Such content degrades user satisfaction and contributes to significant societal issues, including misinformation, radicalization, and erosion of user trust. Although platforms offer mechanisms to mitigate exposure to undesired content, these mechanisms are often insufficiently effective and slow to adapt to users’ feedback. This paper introduces an intuitive, model-agnostic, and distribution-free method that uses conformal risk control to provably bound unwanted content in personalized recommendations by leveraging simple binary feedback on items. We also address a limitation of traditional conformal risk control approaches, i.e., the fact that the recommender can provide a smaller set of recommended items, by leveraging implicit feedback on consumed items to expand the recommendation set while ensuring robust risk mitigation. Our experimental evaluation on data coming from a popular online video-sharing platform demonstrates that our approach ensures an effective and controllable reduction of unwanted recommendations with minimal effort.
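The calibration step that conformal risk control relies on can be sketched as follows: given logged sessions with binary "unwanted" feedback, pick the most permissive filtering cutoff whose finite-sample-inflated empirical risk stays below a tolerance alpha. The session structure, score convention (higher score means more likely unwanted), and threshold grid below are illustrative assumptions, not the authors' exact procedure:

```python
def crc_threshold(cal_sessions, alpha, taus):
    """Calibrate a filtering cutoff via conformal risk control.

    cal_sessions : list of (scores, flags) pairs, one per user session;
                   scores[i] is an unwanted-ness score for item i and
                   flags[i] is 1 if the user marked it unwanted, else 0.
    Items with score <= tau are kept; a session's loss at tau is the
    fraction of kept items flagged unwanted (0 if nothing is kept).
    Returns the largest tau in `taus` whose inflated risk is <= alpha.
    """
    n = len(cal_sessions)
    best = None
    for tau in sorted(taus):
        losses = []
        for scores, flags in cal_sessions:
            kept = [f for s, f in zip(scores, flags) if s <= tau]
            losses.append(sum(kept) / len(kept) if kept else 0.0)
        # CRC-style bound for losses in [0, 1]: (n * risk + B) / (n + 1), B = 1
        if (sum(losses) + 1.0) / (n + 1) <= alpha:
            best = tau
    return best
```

At serving time, recommended items whose unwanted-score exceeds the calibrated cutoff would be filtered out; the distribution-free guarantee is what makes the approach model-agnostic, as the abstract emphasizes.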
List of all short papers accepted for RecSys 2025 (in alphabetical order).
- RESA Multistakeholder Approach to Value-Driven Co-Design of Recommender Systems Evaluation Metrics in Digital Archives
by Florian Atzenhofer-Baumgartner (Graz University of Technology), Georg Vogeler (University of Graz), Dominik Kowald (Know Center Research GmbH)This paper presents the first multistakeholder approach for translating diverse stakeholder values into an evaluation metric setup for Recommender Systems (RecSys) in digital archives. While commercial platforms mainly rely on engagement metrics, cultural heritage domains require frameworks that balance competing priorities among archivists, platform owners, researchers, and other stakeholders. To address this challenge, we conducted high-profile focus groups (5 groups × 5 persons) with upstream, provider, system, consumer, and downstream stakeholders, identifying value priorities across critical dimensions: visibility/representation, expertise adaptation, and transparency/trust. Our analysis shows that stakeholder concerns naturally align with four sequential research funnel stages: discovery, interaction, integration, and impact. The resulting framework addresses domain-specific challenges including collection representation imbalances, non-linear research patterns, and tensions between specialized expertise and broader accessibility. We propose tailored metrics for each stage in this research journey, such as research path quality for discovery, contextual appropriateness for interaction, metadata-weighted relevance for integration, and cross-stakeholder value alignment for impact assessment. Our contributions extend beyond digital archives to the broader RecSys community, offering transferable evaluation approaches for domains where value emerges through sustained engagement rather than immediate consumption.
- RES‘Beyond the past’: Leveraging Audio and Human Memory for Sequential Music Recommendation
by Viet Anh Tran (Deezer), Bruno Sguerra (Deezer), Gabriel Meseguer-Brocal (Deezer), Lea Briand (Deezer), Manuel Moussallam (Deezer)On music streaming services, listening sessions are often composed of a balance of familiar and new tracks. Recently, sequential recommender systems have adopted psychology-informed approaches based on human memory models, such as Adaptive Control of Thought—Rational (ACT-R), to successfully improve the prediction of the most relevant tracks for the next user session. However, one limitation of using a model based on human memory (or the past) is that it struggles to recommend new tracks that users have not previously listened to. To bridge this gap, here we propose a model that leverages audio information to predict in advance the ACT-R-like activation of new tracks and incorporates them into the recommendation scoring process. We demonstrate the empirical effectiveness of the proposed model using proprietary data from a global music streaming service, which we publicly release along with the model’s source code to foster future research in this field.
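For readers unfamiliar with ACT-R, the memory component such systems build on is typically the base-level learning equation, where a track's activation grows with the frequency and recency of the user's past listens. A minimal sketch of the classic formulation (the paper's exact variant may differ):

```python
import math

def base_level_activation(listen_times, now, decay=0.5):
    """Classic ACT-R base-level activation of a track.

    listen_times : timestamps of the user's past plays of the track
    now          : current time, in the same units (e.g., days)
    decay        : power-law forgetting rate (0.5 in standard ACT-R)

    Recent and frequent listens yield higher activation, which is why a
    purely memory-based score cannot surface never-heard tracks: with no
    past listens the sum is empty and no activation is defined.
    """
    return math.log(sum((now - t) ** -decay for t in listen_times))
```

Predicting this quantity from audio features, as the paper proposes, effectively gives brand-new tracks a stand-in activation so they can compete in the same scoring process.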
- RESBeyond Top-1: Addressing Inconsistencies in Evaluating Counterfactual Explanations for Recommender Systems
by Amir Reza Mohammadi (University of Innsbruck), Andreas Peintner (University of Innsbruck), Michael Müller (University of Innsbruck), Eva Zangerle (University of Innsbruck)Explainability in recommender systems remains a pivotal yet challenging research frontier. Among state-of-the-art techniques, counterfactual explanations stand out for their effectiveness as they show how small changes to input data can alter recommendations, providing actionable insights that build user trust and enhance transparency. Despite their growing prominence, the evaluation of counterfactual explanations in recommender systems is far from standardized. Specifically, existing metrics exhibit inconsistency, being influenced by the performance of the recommender system being explained. Hence, we critically examine the evaluation of counterfactual explainers through consistency as the key principle of effective evaluation. Through extensive experiments, we assess how going beyond top-1 recommendation and incorporating top-k recommendations impacts the consistency of existing evaluation metrics. Our findings address existing consistency gaps in the evaluation of counterfactual explainers and highlight an important step toward fair evaluation of counterfactual explanations.
- RESBeyond Visit Trajectories: Enhancing POI Recommendation via LLM-Augmented Text and Image Representations
by Zehui Wang (University of Applied Sciences Ravensburg-Weingarten), Wolfram Höpken (University of Applied Sciences Ravensburg-Weingarten), Dietmar Jannach (University of Klagenfurt)Recommender systems often rely on user visit trajectories, but the integration and representation of diverse side information remains a key challenge. Recent advances in large language models (LLMs) have enabled new strategies for enhancing this process. This study investigates how different types of side information support next Point-of-Interest (POI) recommendation, using a business-level dataset derived from Yelp. An LLM-based summarization pipeline is introduced to convert unstructured reviews and visual content into structured text via instruction-tuned models. These summaries, together with other business features, are each encoded into fixed-length embeddings. Based on these embeddings, four input configurations are constructed for BERT4Rec: trajectory-only, single feature categories, pairwise category combinations, and full combination. Our results show that side information consistently improves performance over the trajectory-only baseline, and their combinations exhibit useful synergies. These findings highlight the importance of modality-aware design and point toward adaptive fusion and selective use of side information. To support further research, we publicly release a multimodal POI recommendation dataset based on the Yelp Open Dataset.
- RESBiases in LLM-Generated Musical Taste Profiles for Recommendation
by Bruno Sguerra (Deezer Research), Elena Epure (Deezer Research), Harin Lee (Max Planck Institute for Human Cognitive and Brain Science), Manuel Moussallam (Deezer Research)One particularly promising use case of Large Language Models (LLMs) for recommendation is the automatic generation of Natural Language (NL) user taste profiles from consumption data. These profiles offer interpretable and editable alternatives to opaque collaborative filtering representations, enabling greater transparency and user control. However, it remains unclear whether users identify these profiles to be an accurate representation of their taste, which is crucial for trust and usability. Moreover, because LLMs inherit societal and data-driven biases, profile quality may systematically vary across user and item characteristics. In this paper, we study this issue in the context of music streaming, where personalization is challenged by a large and culturally diverse catalog. We conduct a user study in which participants rate NL profiles generated from their own listening histories. We analyze whether identification with the profiles is biased by user attributes (e.g., mainstreamness, taste diversity) and item features (e.g., genre, country of origin). We also compare these patterns to those observed when using the profiles in a downstream recommendation task. Our findings highlight both the potential and limitations of scrutable, LLM-based profiling in personalized systems.
- RESCollaborative Interest Modeling in Recommender Systems
by Yu-Ting Cheng (National Taiwan Normal University), Yu-Yen Ho (National Taiwan University), Jyun-Yu Jiang (Amazon)In this paper, we introduce Collaborative Interest Modeling (COIN), a novel approach to tackle interest entanglement and sparse interest representations within multi-interest learning for recommender systems. COIN leverages collaborative signals from behaviorally similar interests to refine interest embeddings and enhance recommendation quality, unlike existing methods that primarily focus on individual user-item interactions. The approach aligns collaborative neighbors with sparse interests, employs a structured routing mechanism to distinguish multiple interests, and avoids routing collapse. Experimental results on three real-world datasets demonstrate that COIN outperforms state-of-the-art models by 4.71% to 15.13% in key recommendation metrics, such as recall, NDCG, and hit ratio.
- RES: Consistent Explainers or Unreliable Narrators? Understanding LLM-generated Group Recommendations
by Cedric Waterschoot (Maastricht University), Nava Tintarev (Maastricht University), Francesco Barile (Maastricht University)
Large Language Models (LLMs) are increasingly being implemented as joint decision-makers and explanation generators for Group Recommender Systems (GRS). In this paper, we evaluate these recommendations and explanations by comparing them to social choice-based aggregation strategies. Our results indicate that LLM-generated recommendations often resembled those produced by Additive Utilitarian (ADD) aggregation. However, the explanations typically referred to averaging ratings (resembling but not identical to ADD aggregation). Group structure, uniform or divergent, did not impact the recommendations. Furthermore, LLMs regularly claimed additional criteria such as user or item similarity, diversity, or used undefined popularity metrics or thresholds. Our findings have important implications for LLMs in the GRS pipeline as well as standard aggregation strategies. Additional criteria in explanations were dependent on the number of ratings in the group scenario, indicating potential inefficiency of standard aggregation methods at larger item set sizes. Additionally, inconsistent and ambiguous explanations undermine transparency and explainability, which are key motivations behind the use of LLMs for GRS.
- RES: Correcting the LogQ Correction: Revisiting Sampled Softmax for Large-Scale Retrieval
by Kirill Khrylchenko (Yandex), Vladimir Baikalov (Yandex), Sergei Makeev (Yandex), Artem Matveev (Yandex), Sergei Liamaev (Yandex)
Two-tower neural networks are a popular architecture for the retrieval stage in recommender systems. These models are typically trained with a softmax loss over the item catalog. However, in web-scale settings, the item catalog is often prohibitively large, making full softmax infeasible. A common solution is sampled softmax, which approximates the full softmax using a small number of sampled negatives. One practical and widely adopted approach is to use in-batch negatives, where negatives are drawn from items in the current mini-batch. However, this introduces a bias: items that appear more frequently in the batch (i.e., popular items) are penalized more heavily. To mitigate this issue, a popular industry technique known as logQ correction adjusts the logits during training by subtracting the log-probability of an item appearing in the batch. This correction is derived by analyzing the bias in the gradient and applying importance sampling, effectively twice, using the in-batch distribution as a proposal distribution. While this approach improves model quality, it does not fully eliminate the bias. In this work, we revisit the derivation of logQ correction and show that it overlooks a subtle but important detail: the positive item in the denominator is not Monte Carlo-sampled — it is always present with probability 1. We propose a refined correction formula that accounts for this. Notably, our loss introduces an interpretable sample weight that reflects the model’s uncertainty — the probability of misclassification under the current parameters. We evaluate our method on both public and proprietary datasets, demonstrating consistent improvements over the standard logQ correction.
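The standard logQ correction discussed in this abstract is compact enough to sketch. The following minimal pure-Python version (an illustration, not the paper's refined formula) corrects each sampled negative's logit by subtracting the log-probability of the item appearing in the batch; the `correct_positive` flag contrasts the standard treatment of the positive with the paper's observation that the positive is always present:

```python
import math

def sampled_softmax_logq(pos_logit, neg_logits, pos_q, neg_qs, correct_positive=True):
    """Sampled softmax NLL with logQ correction for in-batch negatives.

    Each sampled logit is corrected by subtracting log q(item), the
    probability of that item appearing in the batch, which de-biases the
    over-penalization of popular items. With correct_positive=False the
    positive is left uncorrected, reflecting the paper's observation that
    it is present with probability 1 rather than Monte Carlo-sampled; the
    paper's full refined formula and sample weights are not reproduced here.
    """
    pos = pos_logit - (math.log(pos_q) if correct_positive else 0.0)
    negs = [l - math.log(q) for l, q in zip(neg_logits, neg_qs)]
    log_denom = math.log(sum(math.exp(x) for x in [pos] + negs))
    return log_denom - pos  # negative log-likelihood of the positive item
```

With uniform sampling probabilities the correction vanishes and the loss reduces to the plain sampled softmax NLL.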
- RES: Counterfactual Inference under Thompson Sampling
by Olivier Jeunen (Aampe)
Recommender systems exemplify sequential decision-making under uncertainty, strategically deciding what content to serve to users, to optimise a range of potential objectives. To balance the explore-exploit trade-off successfully, Thompson sampling provides a natural and widespread paradigm to probabilistically select which action to take. Questions of causal and counterfactual inference, which underpin use-cases like off-policy evaluation, are not straightforward to answer in these contexts. Specifically, whilst most existing estimators rely on action propensities, these are not readily available under Thompson sampling procedures. In this work, we derive exact and efficiently computable expressions for action propensities under a variety of parameter and outcome distributions, enabling the use of off-policy estimators in such settings. This opens up a range of practical use-cases where counterfactual inference is crucial, including unbiased offline evaluation of recommender systems, as well as general applications of causal inference in online advertising, personalisation, and beyond.
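The quantity at stake here, the action propensity under Thompson sampling, can be made concrete with a naive Monte Carlo estimator, assuming independent Gaussian posteriors over arm rewards. The paper derives exact, efficiently computable expressions; this sketch only shows what those expressions compute:

```python
import random

def ts_propensities_mc(means, stds, n_samples=20000, seed=0):
    """Monte Carlo estimate of Thompson sampling action propensities:
    the probability that each arm's sampled value is the maximum, under
    independent Gaussian posteriors (an assumed posterior family for
    illustration). Closed-form expressions, as derived in the paper,
    avoid this sampling loop entirely."""
    rng = random.Random(seed)
    counts = [0] * len(means)
    for _ in range(n_samples):
        draws = [rng.gauss(m, s) for m, s in zip(means, stds)]
        counts[draws.index(max(draws))] += 1  # which arm would TS pick?
    return [c / n_samples for c in counts]
```

These propensities are exactly what inverse-propensity-style off-policy estimators need as importance weights.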
- RES: D-RDW: Diversity-Driven Random Walks for News Recommender Systems
by Runze Li (University of Zurich), Lucien Heitz (University of Zurich), Oana Inel (University of Zurich), Abraham Bernstein (University of Zurich)
This paper introduces Diversity-Driven Random Walks (D-RDW), a lightweight algorithm and re-ranking technique that generates diverse news recommendations. D-RDW is a societal recommender that combines the diversification capabilities of traditional random walk algorithms with customizable target distributions of news article properties. In doing so, our model provides a transparent approach for editors to incorporate norms and values into the recommendation process. D-RDW shows enhanced performance across key diversity metrics that consider the articles’ sentiment and political party mentions when compared to state-of-the-art neural models. Furthermore, D-RDW proves to be more computationally efficient than existing approaches.
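As background for the abstract above, the classical building block that D-RDW extends, a random walk with restart over the user-item interaction graph, can be sketched as follows (graph shape and parameters are illustrative assumptions, not the paper's configuration):

```python
import random

def random_walk_scores(adj, start, steps=10000, restart=0.15, seed=0):
    """Random walk with restart on a user-item interaction graph. `adj`
    maps each node to its list of neighbours; visit counts serve as
    recommendation scores. In a full D-RDW-style pipeline, item nodes
    would then be filtered and re-ranked against target distributions
    of article properties (sentiment, party mentions, etc.)."""
    rng = random.Random(seed)
    counts = {}
    node = start
    for _ in range(steps):
        if rng.random() < restart or not adj[node]:
            node = start  # teleport back to the target user
        else:
            node = rng.choice(adj[node])
        counts[node] = counts.get(node, 0) + 1
    return counts
```

Items reachable through many short paths from the target user accumulate more visits and thus higher scores.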
- RES: Determinants of Users’ Chance-Seeking Behavior in Search-Based Recommendation
by Yuki Ninomiya (Nagoya University), Yutaro Sone (Nagoya University), Kazuhisa Miwa (Nagoya University), Yuichiro Sumi (Frontier Research Center, Toyota Motor Corporation), Ryosuke Nakanishi (Frontier Research Center, Toyota Motor Corporation), Eiji Mitsuda (Frontier Research Center, Toyota Motor Corporation), Koji Sato (Frontier Research Center, Toyota Motor Corporation), Tadashi Odashima (Frontier Research Center, Toyota Motor Corporation)
Serendipity in retrieval and recommendation systems has been recognized as a promising approach to mitigating the problem of overspecialization. However, previous research has mainly focused on algorithmic implementations of serendipity in recommended items, with limited attention to the extent to which users themselves desire chance. This study explores the determinants of chance-seeking behavior in retrieval contexts through two experiments. Experiment 1 showed that higher goal specificity suppresses serendipitous behavior. Experiment 2 showed that extraversion, diverse curiosity, enjoyment of ambiguity, and maximization tendencies promoted chance seeking, while neuroticism and specific curiosity inhibited it. These findings suggest that users actively regulate the degree of chance in response to their goal states and individual characteristics. The results indicate the importance of considering users’ chance-seeking traits when designing serendipitous recommendation systems.
- RES: Disentangling User and Item Sequence Patterns in Sequential Recommendation Data Sets
by Kaiyue Liu (University of Helsinki), Yang Liu (University of Helsinki), Alan Medlar (University of Helsinki), Dorota Glowacka (University of Helsinki)
Sequential recommenders use the ordering of user-item interactions to perform next-item prediction. Several studies have attempted to estimate how much sequential information is available in data sets used in the offline evaluation of sequential recommenders by randomly shuffling users’ interaction histories and breaking the sequential dependencies between interactions. However, random shuffling fails to distinguish between sequential patterns from user behaviour (i.e., users consuming items based on previous interactions, such as watching a movie and its sequel) and item availability (when items enter the system and become available for user consumption, e.g., the release date of a movie or song). In this article, we analyse several widely used data sets in sequential recommendation studies using two shuffling techniques: random shuffling and constrained shuffling. While random shuffling reorders interactions arbitrarily, constrained shuffling does not allow user-item interactions to occur prior to the item’s first appearance in the data set. Our experiments show that sequential information can come exclusively from user behaviour patterns, exclusively from item availability, or from a combination of the two. These findings have implications for understanding evaluation results in sequential recommendation and highlight why some data sets may be less appropriate for offline evaluation given how little sequential information comes from user behaviour.
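One plausible implementation of the constrained shuffling described above (the exact procedure used in the paper may differ): permute a user's items over their original timestamps while never placing an item before its first appearance anywhere in the data set.

```python
import random

def constrained_shuffle(items, timestamps, first_seen, rng=None):
    """Reassign a user's items to their original (sorted) timestamps at
    random, subject to the constraint that an item is never placed before
    its first global appearance. `first_seen` maps item -> earliest
    timestamp anywhere in the data set. Greedy slot filling; falls back
    to an unconstrained pick if a slot has no eligible item."""
    rng = rng or random.Random(0)
    slots = sorted(timestamps)
    remaining = list(items)
    out = []
    for t in slots:
        eligible = [i for i in remaining if first_seen[i] <= t]
        if not eligible:  # infeasible slot: relax the constraint
            eligible = remaining
        pick = rng.choice(eligible)
        remaining.remove(pick)
        out.append(pick)
    return out
```

When items only ever become available in release order, the constraint pins the permutation down entirely, which is exactly the "item availability" signal the paper isolates.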
- RES: Do We Really Need Specialization? Evaluating Generalist Text Embeddings for Zero-Shot Recommendation and Search
by Matteo Attimonelli (Politecnico di Bari), Alessandro De Bellis (Politecnico di Bari), Claudio Pomo (Politecnico di Bari), Dietmar Jannach (University of Klagenfurt), Eugenio Di Sciascio (Politecnico di Bari), Tommaso Di Noia (Politecnico di Bari)
Pre-trained language models (PLMs) are widely used to derive semantic representations from item metadata in recommendation and search. In sequential recommendation, PLMs enhance ID-based embeddings through textual metadata, while in product search, they align item characteristics with user intent. Recent studies suggest that task- and domain-specific fine-tuning is needed to improve representational power. This paper challenges this assumption, showing that Generalist Text Embedding Models (GTEs), pre-trained on large-scale corpora, can achieve strong zero-shot performance without specialized adaptation. Our experiments demonstrate that GTEs outperform traditional and fine-tuned models in both sequential recommendation and product search. We attribute this to superior representational power, as they distribute features more evenly across the embedding space. Finally, we show that compressing embedding dimensions by focusing on the most informative directions (e.g., via PCA) effectively reduces noise and improves the performance of specialized models.
- RES: Emotion Vector-Based Fine-Tuning of Large Language Models for Age-Aware Teenage Book Recommendations
by Kate Hill (Brigham Young University), Yiu-Kai Ng (Brigham Young University), Joey Sherrill (Brigham Young University)
Reading is a vital skill for teenagers; as the National Institute of Child Health and Human Development puts it, “Reading is the single most important skill necessary for a happy, productive, and successful life.” Yet teens and their parents often struggle to find engaging books amid an overwhelming number of options. Moreover, existing book recommender systems rely heavily on user data such as profiles, reviews, or browsing behavior, information often restricted for minors due to privacy laws. To address this, we propose a privacy-conscious teenage book recommender system that analyzes the emotional content of books using the NRC Emotion Intensity Lexicon (NRC-EIL). By extracting emotion vectors from book descriptions, we capture each book’s emotional tone and intensity. Our system then uses patterns in emotional preferences across age groups to recommend books that align with teen readers’ developmental and emotional needs. While LLMs can also make content-based book recommendations for teenagers, they still face challenges such as training bias, limited sensitivity to age-specific nuances, and lack of transparency. By integrating our emotion vector approach, we fine-tune LLMs to better detect age-relevant emotional cues, enhancing their ability to suggest meaningful and appropriate content for teen audiences. Experimental results confirm that fine-tuning LLMs with our emotion vector approach significantly enhances their performance in generating accurate and age-appropriate book recommendations for teenagers.
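The emotion-vector extraction described above can be illustrated with a toy lexicon (the real NRC Emotion Intensity Lexicon must be obtained separately, and the averaging scheme here is an assumption, not the paper's exact aggregation):

```python
def emotion_vector(description, lexicon, emotions):
    """Build a book's emotion vector by averaging NRC-EIL-style
    (word -> {emotion: intensity}) scores over the words of its
    description. `lexicon` is a toy stand-in dict; words not in the
    lexicon are ignored, and a description with no lexicon hits maps
    to the zero vector."""
    totals = {e: 0.0 for e in emotions}
    hits = 0
    for word in description.lower().split():
        if word in lexicon:
            hits += 1
            for e, v in lexicon[word].items():
                totals[e] += v
    if hits == 0:
        return [0.0] * len(emotions)
    return [totals[e] / hits for e in emotions]
```

The resulting fixed-length vectors can then be compared across age groups or fed into fine-tuning, as the abstract describes.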
- RES: Estimating Quantum Execution Requirements for Feature Selection in Recommender Systems Using Extreme Value Theory
by Jiayang Niu (RMIT University), Qihan Zou (University of Melbourne), Jie Li (RMIT University), Ke Deng (RMIT University), Mark Sanderson (RMIT University), Yongli Ren (RMIT University)
Recent advances in quantum computing have significantly accelerated research into quantum-assisted information retrieval and recommender systems, particularly in solving feature selection problems by formulating them as Quadratic Unconstrained Binary Optimization (QUBO) problems executable on quantum hardware. However, while existing work primarily focuses on effectiveness and efficiency, it often overlooks the probabilistic and noisy nature of real-world quantum hardware. In this paper, we propose a solution based on Extreme Value Theory (EVT) to quantitatively assess the usability of quantum solutions. Specifically, given a fixed problem size, the proposed method estimates the number of executions (shots) required on a quantum computer to reliably obtain a high-quality solution, which is comparable to or better than that of classical baselines on conventional computers. Experiments conducted across multiple quantum platforms (including two simulators and two physical quantum processors) demonstrate that our method effectively estimates the number of required runs to obtain satisfactory solutions on two widely used benchmark datasets.
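For intuition about the estimation target: if each shot independently returned a satisfactory solution with a known probability p, the required number of shots would follow from a simple geometric argument. The paper's EVT-based method addresses the realistic case where no such closed-form p is available; this sketch is only the idealized baseline:

```python
import math

def shots_needed(p_success, confidence=0.95):
    """Number of independent shots n so that, with the given confidence,
    at least one shot yields a satisfactory solution, assuming each shot
    succeeds independently with known probability p_success:
        1 - (1 - p)^n >= c   =>   n >= log(1 - c) / log(1 - p).
    An idealized baseline for the quantity the paper estimates via EVT."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - p_success))
```

The rarer a high-quality solution is under hardware noise, the faster this shot budget grows, which is why a principled tail estimate matters.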
- RES: Exploring the Effect of Context-Awareness and Popularity Calibration on Popularity Bias in POI Recommendations
by Andrea Forster (Graz University of Technology), Simone Kopeinik (Know Center Research GmbH), Denis Helic (Graz University of Technology), Stefan Thalmann (University of Graz), Dominik Kowald (Know Center Research GmbH)
Point-of-interest (POI) recommender systems help users discover relevant locations, but their effectiveness is often compromised by popularity bias, which disadvantages less popular yet potentially meaningful places. This paper addresses this challenge by evaluating the effectiveness of context-aware models and calibrated popularity techniques as strategies for mitigating popularity bias. Using four real-world POI datasets (Brightkite, Foursquare, Gowalla, Yelp), we analyze the individual and combined effects of these approaches on recommendation accuracy and popularity bias. Our results reveal that context-aware models cannot be considered a uniform solution, as the models studied exhibit divergent impacts on accuracy and bias. In contrast, calibration techniques can effectively align recommendation popularity with user preferences, provided there is a careful balance between accuracy and bias mitigation. Notably, the combination of calibration and context-awareness yields recommendations that balance accuracy and close alignment with the users’ popularity profiles, i.e., popularity calibration.
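A simplified sketch of popularity calibration as a greedy re-ranking step, in the spirit of the calibration techniques evaluated above (the scalar `target_pop` profile and the blending objective are illustrative assumptions; calibration objectives in the literature are typically divergence-based):

```python
def calibrated_rerank(candidates, rel, pop, target_pop, k, lam=0.5):
    """Greedily pick k items, each step maximizing a blend of relevance
    and how closely the list's mean popularity would match the user's
    popularity profile. `rel` and `pop` map item -> relevance score and
    item -> popularity; `lam` trades accuracy against calibration."""
    chosen, pool = [], list(candidates)
    while pool and len(chosen) < k:
        def gain(c):
            pops = [pop[i] for i in chosen] + [pop[c]]
            return (1 - lam) * rel[c] - lam * abs(sum(pops) / len(pops) - target_pop)
        best = max(pool, key=gain)
        pool.remove(best)
        chosen.append(best)
    return chosen
```

With `lam=0` the re-ranker degenerates to pure relevance ranking; with `lam=1` it chases the user's popularity profile alone.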
- RES: Failure Prediction in Conversational Recommendation Systems
by Maria Vlachou (University of Glasgow)
In a Conversational Image Recommendation task, users can provide natural language feedback on a recommended image item, which leads to an improved recommendation in the next turn. While typical instantiations of this task assume that the user’s target item will (eventually) be returned, this is often not true, for example when the item the user seeks is not in the item catalogue. Failing to return a user’s desired item can lead to user frustration, as the user needs to interact with the system for an increased number of turns. To mitigate this issue, we introduce the task of Supervised Conversational Performance Prediction, inspired by Query Performance Prediction (QPP), which predicts effectiveness in response to a search engine query. In this regard, we propose predictors for conversational performance that detect conversation failures using multi-turn semantic information contained in the embedded representations of retrieved image items. Specifically, our AutoEncoder-based predictor learns a compressed representation of the top-retrieved items of the training turns and uses the classification labels to predict the evaluation turn. Our evaluation addresses two failure scenarios, differentiating between system failure, where the system is unable to find the target, and catalogue failure, where the target does not exist in the item catalogue. In experiments on the Shoes and FashionIQ Dresses datasets, we measure the accuracy of predictors for both system and catalogue failures. Our results demonstrate the promise of our proposed predictors for predicting system failures (the existing evaluation scenario), while we observe a considerable decrease in predictive performance for catalogue failure prediction (when inducing a missing-item scenario) compared to system failures.
- RES: Feedback-Driven Gradual Discovery for Expanding Musical Preferences
by Alec Nonnemaker (Delft University of Technology), Ralvi Isufaj (XITE), Zoltán Szlávik (XITE), Cynthia Liem (Delft University of Technology)
Many current recommender system techniques reinforce established tastes, leaving little room for venturing into unfamiliar music. A key challenge is our uncertainty about user preferences for previously unconsumed content, making it safer to build upon known preferences. To address this, we propose an incremental, feedback-driven method that gradually introduces users to new genres. By dynamically balancing recommendations between verified preferences and content with uncertain appeal, our approach maintains engagement while progressively expanding musical horizons. Adopting a Bayesian active learning approach, we update belief states iteratively as users provide feedback on new items. In a user study with data from a commercial music video platform, participants gradually discovered a previously unfamiliar music genre of their choosing. Comparing our method to both immediate genre introduction and passive small-step strategies without real-time adaptation, we observed significant improvements. Participants showed higher engagement with new music, stronger affinity for unfamiliar genres, and a greater sense of control, demonstrating the effectiveness of our iterative, feedback-informed strategy for broadening musical tastes.
- RES: HiDePCC: A Novel Dual-Pronged Untargeted Attack on Federated Recommendation via Gradient Perturbation and Cluster Crafting
by Yamini Jha (Indian Institute of Technology (BHU) Varanasi), Krishna Tewari (Indian Institute of Technology (BHU) Varanasi), Sukomal Pal (Indian Institute of Technology (BHU) Varanasi)
Federated recommender systems offer privacy benefits by decentralizing user data and preventing direct data sharing among clients. Although this architecture limits the effectiveness of traditional attack strategies, it remains susceptible to subtle adversarial attacks that can significantly degrade the accuracy of recommendations. To expose these vulnerabilities, we propose a novel untargeted attack (HiDePCC) that degrades overall system performance through a dual-pronged strategy combining adaptive gradient perturbation and hierarchical cluster-based embedding manipulation. We apply adaptive perturbations to item gradients during training and employ hierarchical clustering with several linkage methods to form coherent item clusters. Within these clusters, we converge item embeddings and manipulate boundary points to induce item misclassification. This causes the system to assign similar scores to clustered items and misrank them. We evaluated our attack on two benchmark datasets, MovieLens (with 0.5% and 1% malicious users) and Gowalla (1%), using Matrix Factorization as the base recommendation model and assessing the impact under various robust aggregation techniques. We also examined several configurations combining hierarchical clustering, adaptive gradient perturbation, and boundary-point misclassification. Our results show that the complete setup outperforms existing state-of-the-art untargeted attacks, with performance drops for HR@5 ranging from 13.93% to 68.02% on MovieLens and from 40.02% to 99.76% on Gowalla. These findings reveal important vulnerabilities in federated recommendation systems.
- RES: Just Ask for Music (JAM): Multimodal and Personalized Natural Language Music Recommendation
by Alessandro B. Melchiorre (Johannes Kepler University), Elena Epure (Deezer Research), Shahed Masoudian (Johannes Kepler University), Gustavo Escobedo (Johannes Kepler University), Anna Hausberger (Johannes Kepler University), Manuel Moussallam (Deezer), Markus Schedl (Johannes Kepler University)
Natural language interfaces offer a compelling approach for music recommendation, enabling users to express complex preferences conversationally. While Large Language Models (LLMs) show promise in this direction, their scalability is limited by high costs and latency. Retrieval-based approaches using smaller language models mitigate these issues but often rely on single-modal item representations, overlook long-term user preferences, and require full model retraining, posing challenges for real-world deployment. In this paper, we present JAM (Just Ask for Music), a lightweight and intuitive framework for natural language music recommendation. JAM models user–query–item interactions as vector translations in a shared latent space, inspired by knowledge graph embedding methods like TransE. To capture the complexity of music and user intent, JAM aggregates multimodal item features via cross-attention and sparse mixture-of-experts. We also introduce JAMSessions, a new dataset of over 100k user–query–item triples with anonymized user/item embeddings, uniquely combining conversational queries and user long-term preferences. Our results show that JAM provides accurate recommendations, produces intuitive representations suitable for practical use cases, and can be easily integrated with existing music recommendation stacks.
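The translation-style scoring that JAM borrows from TransE can be sketched in one line: a user-query-item triple is plausible when user plus query lands near the item in the shared latent space (the dimensionality and the exact composition function are assumptions for illustration):

```python
def jam_score(user_vec, query_vec, item_vec):
    """Translation-style relevance score in the spirit of TransE, as the
    abstract describes: score a user-query-item triple by the negative
    squared Euclidean distance between (user + query) and the item.
    Higher (closer to zero) means more plausible."""
    return -sum((u + q - i) ** 2 for u, q, i in zip(user_vec, query_vec, item_vec))
```

Retrieval then reduces to a nearest-neighbour search around the translated point `user + query`.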
- RES: Large Scale E-Commerce Model for Learning and Analyzing Long-Term User Preferences
by Yonatan Hadar (eBay), Yotam Eshel (eBay), Tal Franji (eBay), Bracha Shapira (Ben-Gurion University), Michelle Hwang (eBay), Guy Feigenblat (eBay)
Understanding long-term user preferences is critical for delivering consistent and personalized recommendations that go beyond short-term behavioral cues in large-scale e-commerce platforms. We present NILUS (Neural Inference for Long-Term User Signals), a content-based transformer model trained to predict user behavior over a ????-day future window using up to one year of historical interaction data. NILUS learns user embeddings end-to-end via contrastive learning, using item representations from a fine-tuned sentence encoder. We introduce a novel evaluation framework to assess the model’s ability to capture enduring user interests, and demonstrate that NILUS delivers higher accuracy than strong baselines on a large-scale offline dataset spanning millions of users and diverse product verticals. When combined with short-term signals, NILUS further improves recommendation accuracy and diversity. Finally, a large-scale online A/B test on a multinational e-commerce platform confirms statistically significant gains in user engagement.
- RES: Let It Go? Not Quite: Addressing Item Cold Start in Sequential Recommendations with Content-Based Initialization
by Anton Pembek, Artem Fatkulin, Anton Klenitskiy, Alexey Vasilev
Many sequential recommender systems suffer from the cold start problem, where items with few or no interactions cannot be effectively used by the model due to the absence of a trained embedding. Content-based approaches, which leverage item metadata, are commonly used in such scenarios. One possible way is to use embeddings derived from content features such as textual descriptions as initialization for the model embeddings. However, directly using frozen content embeddings often results in suboptimal performance, as they may not fully adapt to the recommendation task. On the other hand, fine-tuning these embeddings can degrade performance for cold-start items, as item representations may drift far from their original structure after training. We propose a novel approach to address this limitation. Instead of fully freezing the content embeddings or fine-tuning them extensively, we introduce a small trainable delta to frozen embeddings that enables the model to adapt item representations without letting them go too far from their original semantic structure. This approach demonstrates consistent improvements across multiple datasets and modalities, including e-commerce datasets with textual descriptions and a music dataset with audio-based representation.
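The trainable-delta idea above can be sketched as follows; only `delta` would receive gradient updates during training, and a small regularizer (hypothetical here, the paper may constrain the delta differently) keeps representations near their content-based initialization:

```python
def item_embedding(frozen, delta):
    """Cold-start-friendly item representation: a frozen content
    embedding plus a small trainable delta. The frozen part carries the
    semantic structure from text or audio features; the delta adapts it
    to the recommendation task."""
    return [f + d for f, d in zip(frozen, delta)]

def delta_penalty(delta, weight=0.1):
    """Hypothetical L2 regularizer on the delta, discouraging item
    representations from drifting far from their semantic initialization."""
    return weight * sum(d * d for d in delta)
```

A brand-new item with no interactions simply uses `delta = 0`, i.e. its raw content embedding, so cold items stay in the same space as warm ones.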
- RES: Mitigating Latent User Biases in Pre-trained VAE Recommendation Models via On-demand Input Space Transformation
by David Penz (Technische Universität Wien), Gustavo Junior Escobedo Ticona (Johannes Kepler University Linz), Markus Schedl (Johannes Kepler University Linz)
Recommender systems can unintentionally encode protected attributes (e.g., gender, country, or age) in their learned latent user representations. Current in-processing debiasing approaches, notably adversarial training, effectively reduce the encoded information on private user attributes. These approaches modify the model parameters during training; thus, to alternate between a biased and a debiased model, two separate models have to be trained. In contrast, we propose a novel method to debias recommendation models post-training, which allows switching between the biased and debiased models at inference time. Focusing on state-of-the-art variational autoencoder (VAE) architectures, our method reduces bias at the input level (user-item interactions) by learning a transformation from the input space to a debiased subspace. As the output of this transformation lies in the same space as the original input vector, we can use transformed (debiased) input vectors without fine-tuning the pre-trained model. We evaluate the effectiveness of our method on three datasets, MovieLens-1M, LFM2b-DemoBias, and EB-NeRD, from the movie, music, and news domains, respectively. Our experiments show that the proposed method achieves task performance (in terms of NDCG) and debiasing strength (in terms of the balanced accuracy of an attacker network) comparable to applying adversarial training during the initial training procedure, while providing the added functionality of alternating between the biased and debiased models at inference time.
- RES: Not Just What, But When: Integrating Irregular Intervals to LLM for Sequential Recommendation
by Wei-Wei Du (Sony Group Corporation), Takuma Udagawa (Sony Group Corporation), Kei Tateno (Sony Group Corporation)
Time intervals between item purchases are a crucial factor in sequential recommendation tasks, yet existing approaches focus on item sequences and often overlook intervals by assuming they are static. However, dynamic intervals provide a dimension of user profiling that distinguishes not only the history within a single user but also different users who share the same item history. In this work, we propose IntervalLLM, a novel framework that integrates interval information into an LLM and incorporates a novel interval-infused attention to jointly consider information from items and intervals. Furthermore, unlike prior studies that address the cold-start scenario only from the perspectives of users and items, we introduce a new viewpoint: the interval perspective, which serves as an additional metric for evaluating recommendation methods in warm and cold scenarios. Extensive experiments on 3 benchmarks with both traditional and LLM-based baselines demonstrate that our IntervalLLM achieves not only a 3.6% improvement on average but also the best performance in the warm and cold scenarios across the user, item, and proposed interval perspectives. In addition, our proposed method exhibits the smallest performance degradation between the warm and cold scenarios. Notably, we observe that the cold scenario from the interval perspective experiences the most significant performance drop among all recommendation methods. This finding underscores the necessity of further research on interval-based cold-start challenges and our integration of interval information in sequential recommendation tasks.
- RES: Not One News Recommender To Fit Them All: How Different Recommender Strategies Serve Various User Segments
by Hanne Vandenbroucke (imec-SMIT, Vrije Universiteit Brussel), Ulysse Maes (imec-SMIT, Vrije Universiteit Brussel), Lien Michiels (University of Antwerp), Annelien Smets (imec-SMIT, Vrije Universiteit Brussel)
Many news recommender systems (NRS) adopt a one-recommender-for-all approach, overlooking that users engage with news in fundamentally different ways. In this work, we identify user segments based on engagement metrics that go beyond clicks by employing cluster analysis on two real-world datasets: EB-NeRD and Adressa. In addition, we evaluate the performance of common recommendation strategies, popularity, collaborative filtering (EASE and ItemKNN), and a content-based model, across these user segments, which exhibit varying reading behaviors and information needs. Our findings show that different recommendation strategies are effective to varying degrees depending on the user profile. This study contributes to NRS research by providing a grounded segmentation of users derived from real-world datasets and emphasizes the importance of user-centered evaluations in advancing our understanding of how NRS designs serve audiences with varying levels of news engagement.
- RES: On Inherited Popularity Bias in Cold-Start Item Recommendation
by Gregor Meehan (Queen Mary University of London), Johan Pauwels (Queen Mary University of London)
Collaborative filtering (CF) recommender systems struggle in the item cold-start scenario, i.e., with recommending new or unseen items. Cold-start item recommenders, designed to address this challenge, are typically trained with supervision from warm CF models, so that collaborative and content information from the available interaction data can also be leveraged for cold items. However, since they learn to replicate the behavior of CF methods, cold-start systems may also learn to imitate their predictive biases. In this paper, we examine how cold-start models can inherit popularity bias, a common cause of recommender system unfairness arising when CF models overfit to more popular items to maximize overall accuracy, leaving rarer items underrepresented. We show that cold-start recommenders not only mirror the popularity biases of warm models, but are in fact affected more severely because they cannot infer popularity from interaction data, so instead attempt to estimate it based solely on content features. Through experiments on three real-world datasets, we analyze the impact of this issue on several cold-start methods across multiple training paradigms. We then describe a simple post-processing bias mitigation method which, by using embedding magnitude as a proxy for popularity, can produce more balanced recommendations with limited harm to cold-start accuracy.
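The post-processing mitigation described above can be sketched by using embedding magnitude as the popularity proxy and dividing it out of the score (the `alpha` knob is an illustrative assumption, not a parameter named in the paper):

```python
import math

def magnitude_debiased_scores(user_vec, item_vecs, alpha=1.0):
    """Post-processing popularity-bias mitigation sketch: treat the norm
    of an item embedding as a proxy for popularity and dampen it.
    alpha=0 recovers the raw dot product; alpha=1 scores the user vector
    against unit-normalized items, removing the magnitude advantage of
    popular items."""
    scores = []
    for vec in item_vecs:
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        dot = sum(u * x for u, x in zip(user_vec, vec))
        scores.append(dot / norm ** alpha)
    return scores
```

Because the model's popularity estimate is baked into embedding magnitude, rescaling at ranking time requires no retraining of the cold-start model.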
- RES: Personalized Persuasion-Aware Explanations in Recommender Systems
by Havva Alizadeh Noughabi (University of Guelph), Behshid Behkamal (Western University), Fattane Zarrinkalam (University of Guelph), Mohsen Kahani (Ferdowsi University of Mashhad)
With the increasing accuracy of recommender systems (RSs) in providing recommendations based on user preferences and past behaviors, there is a growing need for generating appropriate explanations to facilitate effective decision-making. Motivated by the recent trend of integrating social science theories into explainable RSs, this paper addresses the challenge of generating and evaluating personalized persuasion-aware explanations. While prior work mainly explores how users with different characteristics respond to persuasion-aware explanations, we build on these insights to construct a persuasion profile for each user and generate personalized persuasive explanations for items recommended by various RS baselines. We then evaluate these explanations from an explainability perspective, including metrics such as model fidelity. Additionally, we incorporate the persuasiveness degrees of generated explanations to re-order the recommendation list and investigate its impact on recommendation utility. Our experimental results on a real-world movie recommendation dataset demonstrate that the proposed approach effectively generates persuasive explanations for recommended items, while enhancing recommendation utility.
- RESPopularity-Bias Vulnerability: Semi-Supervised Label Inference Attack on Federated Recommender Systems
by Kenji Shinoda (Toyota Motor Corporation), Takeyuki Sasai (Toyota Motor Corporation), Shintaro Fukushima (Toyota Motor Corporation)Organizations are increasingly applying Vertical Federated Learning (VFL) to enhance recommender systems without sharing raw data among themselves. However, the partial outputs exchanged in VFL still introduce significant privacy risks. In this study, we propose a novel label inference attack specifically tailored for VFL-based recommender systems, leveraging two common characteristics: (1) item popularity often follows a power-law distribution, and (2) random negative sampling is commonly used for implicit feedback as a substitute for missing true labels. By combining partial local information from VFL with this prior knowledge, a malicious party can construct a semi-supervised learning pipeline. Experimental results on three real-world datasets demonstrate that our approach achieves higher label inference performance than existing attacks. These findings underscore the need for more robust privacy-preserving mechanisms in federated recommender systems.
- RESRethinking Overconfidence in VAEs: Can Label Smoothing Help?
by Woo-Seong Yun (Chung-Ang University), YeoJun Choi (Chung-Ang University), Yoon-Sik Cho (Chung-Ang University)By leveraging the expressive power of deep generative models, Variational Autoencoder (VAE)-based recommender models have demonstrated competitive performance. However, deep neural networks (DNNs) tend to exhibit overconfidence in their predictive distributions as training progresses. This issue is further exacerbated by two inherent characteristics of collaborative filtering (CF): (1) extreme data sparsity and (2) implicit feedback. Despite its importance, this problem has lacked systematic study. To fill this gap, this paper explores the above limitations with label smoothing (LS) from both theoretical and empirical aspects. Our extensive analysis demonstrates that overconfidence leads to embedding collapse, where latent representations collapse into a narrow subspace. Furthermore, we investigate the conditions under which LS helps recommendation, and observe that the optimal LS factor decreases proportionally with data sparsity. To the best of our knowledge, this is the first study in VAE-based CF to uncover the relationship between overconfidence and embedding collapse and to highlight the necessity of explicitly addressing them.
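Label smoothing for implicit feedback, as discussed in this abstract, can be illustrated in a few lines. The uniform-smoothing form below is a common textbook variant and an assumption on our part; the paper's exact formulation may differ:

```python
def smooth_labels(interactions, eps=0.1):
    """Turn a binary implicit-feedback vector into a smoothed
    multinomial target: (1 - eps) of the probability mass stays on
    the observed items, eps is spread uniformly over all items."""
    n = len(interactions)
    total = sum(interactions)
    return [(1 - eps) * (x / total) + eps / n for x in interactions]

# Two observed items out of four; eps plays the role of the LS factor.
target = smooth_labels([1, 0, 0, 1], eps=0.1)
```

The smoothed target keeps probability mass on unobserved items, which counteracts the overconfident, peaked predictive distributions the paper analyzes.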
- RESSGCL: Unifying Self-Supervised and Supervised Learning for Graph Recommendation
by Weizhi Zhang (University of Illinois Chicago), Liangwei Yang (Salesforce AI Research), Zihe Song (University of Illinois Chicago), Henry Peng Zou (University of Illinois Chicago), Ke Xu (University of Illinois Chicago), Yuanjie Zhu (University of Illinois Chicago), Philip S. Yu (University of Illinois Chicago)Recommender systems (RecSys) are essential for online platforms, providing personalized suggestions to users within a vast sea of information. Self-supervised graph learning seeks to harness high-order collaborative filtering signals through unsupervised augmentation on the user-item bipartite graph, primarily leveraging a multi-task learning framework that includes both supervised recommendation loss and self-supervised contrastive loss. However, this separate design introduces additional graph convolution processes and creates inconsistencies in gradient directions due to disparate losses, resulting in prolonged training times and sub-optimal performance. In this study, we introduce a unified framework of Supervised Graph Contrastive Learning for recommendation (SGCL) to address these issues. SGCL uniquely combines the training of recommendation and unsupervised contrastive losses into a cohesive supervised contrastive learning loss, aligning both tasks within a single optimization direction for exceptionally fast training. Extensive experiments on three real-world datasets show that SGCL outperforms state-of-the-art methods, achieving superior accuracy and efficiency.
- RESStairway to Fairness: Connecting Group and Individual Fairness
by Theresia Veronika Rampisela (University of Copenhagen), Maria Maistro (University of Copenhagen), Tuukka Ruotsalo (University of Copenhagen), Falk Scholer (RMIT University), Christina Lioma (University of Copenhagen)Fairness in recommender systems (RSs) is commonly categorised into group fairness and individual fairness. However, there is no established scientific understanding of the relationship between the two fairness types, as prior work on both types has used different evaluation measures or evaluation objectives for each fairness type, thereby not allowing for a proper comparison of the two. As a result, it is currently not known how increasing one type of fairness may affect the other. To fill this gap, we study the relationship of group and individual fairness through a comprehensive comparison of evaluation measures that can be used for both fairness types. Our experiments with 8 RSs across 3 datasets show that recommendations that are highly fair for groups can be very unfair for individuals. Our finding is novel and useful for RS practitioners aiming to improve the fairness of their systems.
- RESTowards Personality-Aware Explanations for Music Recommendations Using Generative AI
by Gabrielle Alves (Universidade de São Paulo), Dietmar Jannach (University of Klagenfurt), Luan Soares de Souza (Universidade de São Paulo), Marcelo Garcia Manzato (Universidade de São Paulo)It is well established that the provision of explanations can positively impact the effectiveness of a recommender system. In many proposals in the literature, these explanations are personalized in that they refer to a user’s known individual preferences. Some recent works, however, indicate that personalization should also happen at a higher level, where the system, in a first step, decides in which specific way an explanation should be provided, depending, for example, on the user’s expertise. In this research, we take the first steps towards personality-aware explanations by exploring how users perceive explanations designed to match a given personality trait. For this purpose, we leverage the capabilities of modern Generative AI tools to create personality-based explanations at scale in the context of a music recommendation scenario. A linguistic analysis of the generated explanations confirms that they properly reflect expected language patterns associated with individual personality traits. Furthermore, a user study shows that certain forms of explaining are preferred over others, for example, ones that match low-neuroticism linguistic patterns. In addition, we find that some explanation forms are more effective than others regarding persuasiveness and perceived overall quality.
- RESTreatRAG: A Framework for Personalized Treatment Recommendation
by Chao-Chin Liu (Georgetown University), Hao-Ren Yao (Carnegie Mellon University), Der-Chen Chang (Georgetown University), Ophir Frieder (Georgetown University)Medication recommendation is a critical function of clinical decision support systems, directly influencing patient safety and treatment efficacy. While large language models (LLMs) show promise in clinical tasks such as summarization and question answering, their ability to make accurate treatment predictions remains limited due to a lack of specialized medical knowledge and exposure to real-world patient data. We introduce TreatRAG, a retrieval-augmented generation (RAG) framework designed to enhance treatment recommendation by integrating structured electronic health record (EHR) data with pretrained LLMs. TreatRAG retrieves similar patient cases, so-called “digital twins”, using interpretable N-gram Jaccard similarity and augments the input prompt to ground LLM predictions in real clinical scenarios. We evaluate our framework on the MIMIC-IV dataset using BioGPT, BioMistral, Phi-3, and Flan-T5. In all cases, TreatRAG statistically significantly improves medication prediction performance. TreatRAG-enhanced BioGPT improves its F1-score from 0.14 to 0.34, BioMistral from 0.22 to 0.54, Phi-3 from 0.09 to 0.16, and Flan-T5 from 0.23 to 0.30. Our model-agnostic framework offers a flexible, effective, and interpretable solution to advance the reliability of LLMs in clinical decision support.
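The "digital twin" retrieval step described here, ranking patients by N-gram Jaccard similarity over their code sequences, can be sketched as follows. The patient IDs, code tokens, and bigram choice are hypothetical; only the similarity measure comes from the abstract:

```python
def ngrams(tokens, n=2):
    """Set of contiguous n-grams of a code sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a or b else 0.0

def retrieve_twins(query_codes, cohort, n=2, k=1):
    """Rank cohort patients by n-gram Jaccard similarity between their
    code sequences and the query patient's; return the top-k most
    similar patients ('digital twins')."""
    q = ngrams(query_codes, n)
    ranked = sorted(cohort,
                    key=lambda pid: jaccard(q, ngrams(cohort[pid], n)),
                    reverse=True)
    return ranked[:k]

# Hypothetical diagnosis/prescription code sequences.
cohort = {"p1": ["dx1", "dx2", "rx9", "rx3"], "p2": ["dx7", "dx8"]}
twins = retrieve_twins(["dx1", "dx2", "rx9"], cohort, k=1)
```

The retrieved twins would then be serialized into the LLM prompt; the set-based overlap also makes the retrieval step easy to inspect, which is the interpretability argument the abstract makes.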
List of all reproducibility papers accepted for RecSys 2025 (in alphabetical order).
- REPRA Reproducibility Study of Product-side Fairness in Bundle Recommendation
by Huy-Son Nguyen (Delft University of Technology), Yuanna Liu (University of Amsterdam), Masoud Mansoury (Delft University of Technology), Mohammad Aliannejadi (University of Amsterdam), Alan Hanjalic (Delft University of Technology), Maarten de Rijke (University of Amsterdam)Recommender systems are known to exhibit fairness issues, particularly on the product side, where products and their associated suppliers receive unequal exposure in recommended results. While this problem has been widely studied in traditional recommendation settings, its implications for bundle recommendation (BR) remain largely unexplored. This emerging task introduces additional complexity: recommendations are generated at the bundle level, yet user satisfaction and product (or supplier) exposure depend on both the bundle and the individual items it contains. Existing fairness frameworks and metrics designed for traditional recommender systems may not directly translate to this multi-layered setting. In this paper, we conduct a comprehensive reproducibility study of product-side fairness in BR across three real-world datasets using four state-of-the-art BR methods. We analyze exposure disparities at both the bundle and item levels using multiple fairness metrics, uncovering important patterns. Our results show that exposure patterns differ notably between bundles and items, revealing the need for fairness interventions that go beyond bundle-level assumptions. We also find that fairness assessments vary considerably depending on the metric used, reinforcing the need for multi-faceted evaluation. Furthermore, user behavior plays a critical role: when users interact more frequently with bundles than with individual items, BR systems tend to yield fairer exposure distributions across both levels. Overall, our findings offer actionable insights for building fairer bundle recommender systems and establish a vital foundation for future research in this emerging domain.
- REPRAre We Really Making Recommendations Robust? Revisiting Model Evaluation for Denoising Recommendation
by Guohang Zeng (University of Technology Sydney), Jie Lu (University of Technology Sydney), Guangquan Zhang (University of Technology Sydney)Implicit feedback data has emerged as a fundamental component of modern recommender systems due to its scalability and availability. However, the presence of noisy interactions, such as accidental clicks and position bias, can potentially degrade recommendation performance. Recently, denoising recommendation has emerged as a popular research topic, aiming to identify and mitigate the impact of noisy samples to train robust recommendation models in the presence of noisy interactions. Although denoising recommendation methods have become a promising solution, our systematic evaluation reveals critical reproducibility issues in this growing research area. We observe inconsistent performance across different experimental settings and a concerning misalignment between validation metrics and test performance caused by distribution shifts. Through extensive experiments testing 6 representative denoising methods across 4 recommender models and 3 datasets, we find that no single denoising approach consistently outperforms others, and simple improvements to evaluation strategies can sometimes match or exceed state-of-the-art denoising methods. Our analysis further reveals concerns about denoising recommendation in high-noise scenarios. We identify key factors contributing to reproducibility defects and propose pathways toward more reliable denoising recommendation research. This work serves as both a cautionary examination of current practices and a constructive guide for the development of more reliable evaluation methodologies in denoising recommendation.
- REPRContext Trails: A Dataset to Study Contextual and Route Recommendation
by Pablo Sánchez (Instituto de Investigación Tecnológica (IIT), Universidad Pontificia Comillas), Alejandro Bellogin (Universidad Autónoma de Madrid), José L. Jorro-Aragoneses (Universidad Autónoma de Madrid)Recommender systems in the tourism domain are gaining increasing attention, yet the development of diverse recommendation tasks remains limited, largely due to the scarcity of comprehensive public datasets. This paper introduces Context Trails, a novel tourism dataset addressing this gap. Context Trails distinguishes itself by including not only user interactions with touristic venues, but also the itineraries (trails or routes) followed by users. Furthermore, it enriches existing item features (e.g., category, coordinates) with contextual attributes related to the interaction moment (e.g., weather) and the venue itself (e.g., opening hours). Beyond a detailed description of the dataset’s characteristics, we evaluate the performance of several baseline algorithms across three distinct recommendation tasks: classical recommendation, route recommendation, and contextual recommendation. We believe this dataset will foster further research and development of advanced recommender systems within the tourism domain.
- REPRDistillRecDial: A Knowledge-Distilled Dataset Capturing User Diversity in Conversational Recommendation
by Alessandro Francesco Maria Martina (University of Bari Aldo Moro), Alessandro Petruzzelli (University of Bari Aldo Moro), Cataldo Musto (University of Bari Aldo Moro), Marco de Gemmis (University of Bari Aldo Moro), Pasquale Lops (University of Bari Aldo Moro), Giovanni Semeraro (University of Bari Aldo Moro)Conversational Recommender Systems (CRSs) facilitate item discovery through multi-turn dialogues that elicit user preferences via natural language interaction. This field has gained significant attention following advancements in Natural Language Processing (NLP) enabled by Large Language Models (LLMs). However, current CRS research remains constrained by datasets with fundamental limitations. Human-generated datasets suffer from inconsistent dialogue quality, limited domain expertise, and insufficient scale for real-world application, while synthetic datasets created with proprietary LLMs ignore the diversity of real-world user behavior and present significant barriers to accessibility and reproducibility. The development of effective CRSs depends critically on addressing these deficiencies. To this end, we present DistillRecDial, a novel conversational recommendation dataset generated through a knowledge distillation pipeline that leverages smaller, more accessible open LLMs. Crucially, DistillRecDial simulates a range of user types with varying intentions, preference expression styles, and initiative levels, capturing behavioral diversity that is largely absent from prior work. Human evaluation demonstrates that our dataset significantly outperforms widely adopted CRS datasets in dialogue coherence and domain-specific expertise, indicating its potential to advance the development of more realistic and effective conversational recommender systems.
- REPRExploitation Over Exploration: Unmasking the Bias in Linear Bandit Recommender Offline Evaluation
by Pedro Pires (Federal University of São Carlos (UFSCar)), Gregorio Azevedo (Federal University of São Carlos (UFSCar)), Pietro Campos (Federal University of São Carlos (UFSCar)), Rafael Sereicikas (Federal University of São Carlos (UFSCar)), Tiago Almeida (Federal University of São Carlos (UFSCar))Multi-Armed Bandit (MAB) algorithms are widely used in recommender systems that require continuous, incremental learning. A core aspect of MABs is the exploration–exploitation trade-off: choosing between exploiting items likely to be enjoyed and exploring new ones to gather information. In contextual linear bandits, this trade-off is particularly central, as many variants share the same linear regression backbone and differ primarily in their exploration strategies. Despite its prevalent use, offline evaluation of MABs is increasingly recognized for its limitations in reliably assessing exploration behavior. This study conducts an extensive offline empirical comparison of several linear MABs. Strikingly, across over 90% of the evaluated datasets, a greedy linear model, with no exploration at all, consistently achieves top-tier performance, often outperforming or matching its exploratory counterparts. This observation is further corroborated by hyperparameter optimization, which consistently favors configurations that minimize exploration, suggesting that pure exploitation is the dominant strategy within these evaluation settings. Our results expose significant inadequacies in offline evaluation protocols for bandits, particularly concerning their capacity to reflect true exploratory efficacy. Consequently, this research underscores the urgent necessity for developing more robust assessment methodologies, guiding future investigations into alternative evaluation frameworks for interactive learning in recommender systems. The source code for our experiments is publicly available on GITHUB-LINK.
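The greedy baseline this study highlights, a shared ridge-regression backbone with the exploration bonus removed, can be sketched in two dimensions. This is a toy illustration; the feature dimension, regularizer, and reward values are assumptions:

```python
class GreedyLinearBandit:
    """Exploration-free contextual linear bandit in 2-D: maintain a
    ridge-regression estimate theta = A^{-1} b and always pick the
    arm with the highest predicted reward (pure exploitation)."""

    def __init__(self, lam=1.0):
        self.A = [[lam, 0.0], [0.0, lam]]  # lam*I + sum of x x^T
        self.b = [0.0, 0.0]                # sum of r * x

    def _theta(self):
        (a11, a12), (a21, a22) = self.A
        det = a11 * a22 - a12 * a21        # explicit 2x2 inverse
        inv = [[a22 / det, -a12 / det], [-a21 / det, a11 / det]]
        return [inv[0][0] * self.b[0] + inv[0][1] * self.b[1],
                inv[1][0] * self.b[0] + inv[1][1] * self.b[1]]

    def select(self, arms):
        th = self._theta()
        return max(range(len(arms)),
                   key=lambda i: arms[i][0] * th[0] + arms[i][1] * th[1])

    def update(self, x, r):
        for i in range(2):
            for j in range(2):
                self.A[i][j] += x[i] * x[j]
            self.b[i] += r * x[i]

# A few hypothetical rounds: arm [1,0] pays off, arm [0,1] does not.
bandit = GreedyLinearBandit()
for x, r in [([1.0, 0.0], 1.0), ([1.0, 0.0], 1.0), ([0.0, 1.0], 0.0)]:
    bandit.update(x, r)
chosen = bandit.select([[1.0, 0.0], [0.0, 1.0]])
```

Exploratory variants such as LinUCB would add a confidence bonus to the `select` score; the study's point is that in offline replay this bonus often does not pay off.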
- REPRExploring the Potential of LLMs for Serendipity Evaluation in Recommender Systems
by Li Kang (Hong Kong Baptist University), Yuhan Zhao (Hong Kong Baptist University), Li Chen (Hong Kong Baptist University)Serendipity plays a pivotal role in enhancing user satisfaction within recommender systems, yet its evaluation poses significant challenges due to its inherently subjective nature and conceptual ambiguity. Current algorithmic approaches predominantly rely on proxy metrics for indirect assessment, often failing to align with real user perceptions and thereby creating a gap. With large language models (LLMs) increasingly revolutionizing evaluation methodologies across various human annotation tasks, we are inspired to explore a core research proposition: Can LLMs effectively simulate human users for serendipity evaluation? To address this question, we conduct a meta-evaluation on two datasets derived from real user studies in the e-commerce and movie domains, focusing on three key aspects: the accuracy of LLMs compared to conventional proxy metrics, the influence of auxiliary data on LLM comprehension, and the efficacy of recent popular multi-LLM techniques. Our findings indicate that even the simplest zero-shot LLMs achieve parity with, or surpass, conventional metrics. Furthermore, multi-LLM techniques and the incorporation of auxiliary data further enhance alignment with human perspectives. In our experiments, the best-performing LLM-based evaluation achieves a Pearson correlation coefficient of 21.5% with the results of the user study. This research establishes that LLMs have the potential to serve as accurate, reproducible, reliable, and cost-effective evaluators, introducing a new paradigm for serendipity evaluation in recommender systems.
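The meta-evaluation here boils down to correlating simulated (LLM) scores with human judgments. A minimal sketch of that alignment check, with made-up ratings standing in for the study's data:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between simulated (LLM) serendipity scores
    and human ratings for the same items."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

human = [1, 2, 3, 4, 5]  # hypothetical human serendipity ratings
llm = [2, 1, 4, 3, 5]    # hypothetical LLM-assigned scores
r = pearson(human, llm)
```

A coefficient near 1 would mean the simulator ranks items the way users do; the paper reports 21.5% for its best configuration, so meaningful but far from perfect alignment.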
- REPRFashion-AlterEval: A Dataset for Improved Evaluation of Conversational Recommendation Systems with Alternative Relevant Items
by Maria Vlachou (University of Glasgow)In Conversational Recommendation Systems (CRS), a user provides feedback on recommended items at each turn, leading the CRS towards improved recommendations. Due to the need for a large amount of data, a user simulator is employed for both training and evaluating the CRS. Such user simulators typically critique the current retrieved items based on knowledge of a single target item. However, the evaluation of such systems in offline settings with simulators is limited by the focus on a single target item and their unlimited patience over a large number of turns. To overcome these limitations of existing simulators, we propose Fashion-AlterEval, a dataset that contains human judgments for a selection of alternative items, created by adding new annotations to common fashion CRS datasets. Consequently, we propose two novel meta-user simulators that use the collected judgments and allow simulated users not only to express their preferences about alternative items to their original target, but also to change their mind and level of patience. In our experiments using Shoes and Fashion IQ as the original datasets and three CRS models, we find that the simulator’s knowledge of alternatives can have a considerable impact on the evaluation of existing CRS models: the existing single-target evaluation underestimates their effectiveness, and when simulated users are allowed to consider alternative relevant items, the system can satisfy them more quickly. Importantly, we observe that a probabilistic switch to alternatives, based on an estimation of gains and losses (with a probability threshold), in most cases leads to better performance estimation than a meta-simulator with a fixed switch to alternatives.
- REPRGreenFoodLens: Sustainability Labels for Food Recommendation
by Giacomo Balloccu (Meta), Ludovico Boratto (University of Cagliari), Gianni Fenu (University of Cagliari), Mirko Marras (University of Cagliari), Giacomo Medda (University of Cagliari), Giovanni Murgia (University of Cagliari)Most food recommendation systems aim to increase user engagement by looking at recipe ingredients and past choices. Even though consumers are paying more attention to sustainability aspects such as carbon and water footprints, there remains a notable lack of public corpora that combine detailed user–recipe interactions with reliable environmental impact data. This gap makes it hard to build recommendation tools that both match people’s tastes and help reduce ecological damage. To this end, we present GreenFoodLens, a resource that enriches HUMMUS, one of the largest corpora for food recommendation, with environmental impact estimates derived from the hierarchical taxonomy of the SU-EATABLE-LIFE project. We achieved this result through a multi-step process involving human annotations, iterative labeling assessments, knowledge refinement, and constrained generation techniques with large language models. Finally, we evaluate recommendation baselines on HUMMUS augmented with GreenFoodLens labels and find that models are driven by popularity signals, which may exacerbate the environmental impact of users’ recipe choices. These experiments demonstrate the practical benefit of GreenFoodLens for benchmarking and advancing sustainability-aware recommendation research.
- REPRHow Powerful are LLMs to Support Multimodal Recommendation? A Reproducibility Study of LLMRec
by Maria Lucia Fioretti (Politecnico di Bari), Nicola Laterza (Politecnico di Bari), Alessia Preziosa (Politecnico di Bari), Daniele Malitesta (CentraleSupélec, Inria, Université Paris-Saclay), Claudio Pomo (Politecnico di Bari), Fedelucio Narducci (Politecnico di Bari), Tommaso Di Noia (Politecnico di Bari)Large language models (LLMs) have been exploited as standalone recommender systems (RSs) learning to recommend from the historical user-item data and, more recently, as support tools for already existing RSs. Within this second research line, LLMRec prompts an LLM with the user-item data, the items’ metadata, and the candidate items generated by other multimodal RSs to obtain an augmented version of the original dataset on which a final RS is then trained. Despite its remarkable performance, concerns may arise regarding the accountability of this model. In this regard, a few recent studies have proposed reproducing and rigorously evaluating LLM-based recommender systems (RSs) as standalone approaches (first research line). However, little to no attention has been devoted to exploring the use of LLMs as supportive components within existing RSs, particularly in the context of multimodal recommendation (second research line). To this end, in this work, we propose the first reproducibility study of an LLM-based RS belonging to the second research line, LLMRec, in the multimodal recommendation domain. First, we try to replicate the results of LLMRec with the authors’ provided data and our own reconstructed data, outlining critical issues in the measured recommendation performance. Then, we benchmark LLMRec: (i) with unimodal and multimodal LLMs, showing how the latter may be more beneficial in a multimodal scenario; (ii) against other competitive multimodal RSs, LLM-based solutions, and an additional dataset, demonstrating inconsistencies with the trends emerging in the original paper. Finally, in an attempt to disentangle the observed performance trends, we evaluate (for the first time in the literature) the topological differences of the original user-item interaction graph with respect to LLMRec’s augmented one.
- REPRImpacts of Mainstream-Driven Algorithms on Recommendations for Children Across Domains: A Reproducibility Study
by Robin Ungruh (Delft University of Technology), Alejandro Bellogín (Universidad Autónoma de Madrid), Dominik Kowald (Know Center Research GmbH), Maria Pera (Delft University of Technology)Children access varied media across many online platforms, where they are often exposed to items curated by recommendation algorithms. Yet, research seldom considers children as a user group, and when it does, it is anchored on datasets where children are underrepresented, risking overlooking their inherent traits, favoring those of the majority, i.e., mainstream users. Recently, Ungruh et al. demonstrated that children’s consumption patterns and preferences differ from those of mainstream users, resulting in inconsistent recommendation algorithm performance and behavior for this user group. These findings, however, are based on two datasets with a limited child user sample. To advance this line of work, we reproduce this study on a wider range of datasets in the movie, music, and book domains, uncovering interaction patterns and aspects of child-recommender interactions that are consistent across domains, as well as those specific to some user samples in the data. We also extend insights from the original study by analyzing popularity bias metrics, given the interpretation of results from the original study. This reproduction and extension allow us to uncover consumption patterns and differences between age groups stemming from intrinsic differences between children and others, and those unique to specific datasets or domains. We share data samples from our exploration and associated code in a public repository.
- REPRInformfully Recommenders – Reproducibility Framework for Diversity-aware Intra-session Recommendations
by Lucien Heitz (University of Zurich), Runze Li (University of Zurich), Oana Inel (University of Zurich), Abraham Bernstein (University of Zurich)In recent years, norm-aware recommender systems have gained increased attention, especially for diversity optimization. The recommender systems community has well-established experimentation pipelines that support reproducible evaluations by facilitating models’ benchmarking and comparisons against state-of-the-art methods. However, to the best of our knowledge, there is currently no reproducibility framework that supports thorough norm-driven experimentation at the four stages of the recommender pipeline: pre-processing, in-processing, post-processing, and evaluation stages. To address this gap, we present Informfully Recommenders, a first step towards a normative reproducibility framework that focuses on diversity-aware design built on Cornac. Our extension provides an end-to-end solution for implementing and experimenting with normative and general-purpose diverse recommender systems that cover 1) dataset pre-processing, 2) diversity-optimized models, 3) dedicated intra-session item re-ranking, and 4) an extensive set of diversity metrics together with item visualization for offline and online evaluation. We demonstrate the capabilities of our diversity-aware extension, and in particular of our diversity-driven recommendation models, by providing an extensive offline experiment in the news domain.
- REPRModel Meets Knowledge: Analyzing Knowledge Types for Conversational Recommender Systems
by Jujia Zhao (Leiden University), Yumeng Wang (Leiden University), Zhaochun Ren (Leiden University), Suzan Verberne (Leiden University)Conversational Recommender Systems (CRSs) often integrate external knowledge to enhance user preference modeling and item representation learning, addressing the challenge of sparse conversational contexts. Traditional methods primarily utilize structured knowledge graphs (KGs) to model entity relationships and capture deep, multi-hop relationships among items. More recent studies employing pre-trained language models (PLMs), however, leverage unstructured text (e.g., customer reviews) to enrich contextual understanding of users and items. Despite reported performance gains from both knowledge types, a question remains: What is the compatibility between specific CRS model architectures and types of external knowledge, and how do different knowledge sources complement each other? We present a reproducibility study evaluating 9 state-of-the-art CRSs, including KG-based and PLM-based paradigms, to systematically investigate model–knowledge compatibility and complementarity. Through a comprehensive evaluation on three datasets, we uncover three key findings: (1) Different model architectures have different compatibility with knowledge types: decoder-only models excel with structured knowledge, whereas encoder-decoder models better utilize unstructured knowledge. (2) Combining multiple knowledge sources is not always superior to using a single type, but merging similar knowledge types is generally more effective than mixing different ones. (3) Unstructured knowledge broadly benefits all scenario-specific conversations, particularly in genre-specific and descriptive scenarios, whereas structured knowledge demonstrates superior performance in comparative recommendation scenarios. Our study serves as an inspiration for future research on maximizing the benefits of external knowledge across different models in CRSs.
- REPRPrivacy Risks of LLM-Empowered Recommender Systems: An Inversion Attack Perspective
by Yubo Wang (Monash University), Min Tang (Monash University), Nuo Shen (Monash University), Shujie Cui (Monash University), Weiqing Wang (Monash University)The large language model (LLM) powered recommendation paradigm has been proposed to address the limitations of traditional recommender systems, which often struggle to handle cold-start users or items with new IDs. Despite its effectiveness, this study uncovers that LLM-empowered recommender systems are vulnerable to reconstruction attacks that can expose both system and user privacy. To thoroughly examine this threat, we present the first systematic study on inversion attacks targeting LLM-empowered RecSys, wherein adversaries attempt to reconstruct original user prompts that contain personal preferences, interaction histories, and demographic attributes by exploiting the output logits of recommendation models. We propose an optimized inversion framework that integrates a vec2text generation engine with Similarity-Guided Refinement to accurately recover textual prompts from logits. Extensive experiments across two domains (movies and books) and two representative LLM-based recommendation models demonstrate that our method achieves high-fidelity reconstructions. Specifically, we can recover nearly 65% of the user-interacted items and correctly infer age and gender in 87% of cases. The experiments also reveal that privacy leakage is largely insensitive to the victim model’s performance but highly dependent on domain consistency and prompt complexity. These findings expose critical and unique privacy vulnerabilities in LLM-powered recommender systems.
- REPR Rethinking the Privacy of Text Embeddings: A Reproducibility Study of “Text Embeddings Reveal (Almost) As Much As Text”
by Dominykas Seputis (University of Amsterdam), Yongkang Li (University of Amsterdam), Karsten Langerak (University of Amsterdam), Serghei Mihailov (University of Amsterdam)Text embeddings are fundamental to many natural language processing (NLP) tasks, extensively applied in domains such as recommendation systems and information retrieval (IR). Traditionally, transmitting embeddings instead of raw text has been seen as privacy-preserving. However, recent methods such as Vec2Text challenge this assumption by demonstrating that controlled decoding can successfully reconstruct original texts from black-box embeddings. The unexpectedly strong results reported by Vec2Text motivated us to conduct further verification, particularly considering the typically non-intuitive and opaque structure of high-dimensional embedding spaces. In this work, we reproduce the Vec2Text framework and evaluate it from two perspectives: (1) validating the original claims, and (2) extending the study through targeted experiments. First, we successfully replicate the original key results in both in-domain and out-of-domain settings, with only minor discrepancies arising due to missing artifacts, such as model checkpoints and dataset splits. Furthermore, we extend the study by conducting a parameter sensitivity analysis, evaluating the feasibility of reconstructing sensitive inputs (e.g., passwords), and exploring embedding quantization as a lightweight privacy defense. Our results show that Vec2Text is effective under ideal conditions, capable of reconstructing even password-like sequences that lack clear semantics. However, we identify key limitations, including its sensitivity to input sequence length. We also find that Gaussian noise and quantization techniques can mitigate the privacy risks posed by Vec2Text, with quantization offering a simpler and more widely applicable solution.
Our findings emphasize the need for caution in using text embeddings and highlight the importance of further research into robust defense mechanisms for NLP systems.
- REPR Revisiting Prompt Engineering: A Comprehensive Evaluation for LLM-based Personalized Recommendation
by Genki Kusano (NEC Corporation), Kosuke Akimoto (NEC Corporation), Kunihiro Takeoka (NEC Corporation)Large language models (LLMs) can perform recommendation tasks by taking prompts written in natural language as input. Compared to traditional methods such as collaborative filtering, LLM-based recommendation offers advantages in handling cold-start, cross-domain, and zero-shot scenarios, as well as supporting flexible input formats and generating explanations of user behavior. In this paper, we focus on a single-user setting, where no information from other users is used. This setting is practical for privacy-sensitive or data-limited applications. In such cases, prompt engineering becomes especially important for controlling the output generated by the LLM. We conduct a large-scale comparison of 23 prompt types across 8 public datasets and 12 LLMs. We use statistical tests and linear mixed-effects models to evaluate both accuracy and inference cost. Our results show that for cost-efficient LLMs, three types of prompts are especially effective: those that rephrase instructions, consider background knowledge, and make the reasoning process easier to follow. For high-performance LLMs, simple prompts often outperform more complex ones while reducing cost. In contrast, commonly used prompting styles in natural language processing, such as step-by-step reasoning, or the use of reasoning models often lead to lower accuracy. Based on these findings, we provide practical suggestions for selecting prompts and LLMs depending on the required balance between accuracy and cost.
- REPR Revisiting the Performance of Graph Neural Networks for Session-based Recommendation
by Faisal Shehzad (University of Klagenfurt), Dietmar Jannach (University of Klagenfurt)Graph Neural Networks (GNNs) have shown impressive performance in various domains. Motivated by this success, several GNN-based session-based recommender systems (SBRS) have been proposed over the past few years. The literature suggests that these algorithms can achieve strong performance and outperform well-established baseline neural models. However, some recent reproducibility studies suggest that the performance achieved by more complex GNN-based models may sometimes be overstated and that these models may not be as impactful as expected. Moreover, an inconsistent choice of datasets, preprocessing steps, and evaluation protocols across published works makes it difficult to reliably assess progress in the field. In the present study, we reassess the performance of three well-established baseline models—GRU4Rec, NARM, and STAMP—and compare them to six more recent GNN-based SBRS within a standardized evaluation framework. Experiments on commonly used datasets for SBRS reveal that in particular the GRU4Rec model, if properly tuned, is still highly competitive and leads to the best results on two out of three datasets. Furthermore, we find that the performance of the GNN-based models varies considerably across datasets. Interestingly, only the quite early SR-GNN model turns out to be superior in terms of accuracy metrics on one of the datasets. We speculate that the reasons for our surprising result may lie in insufficient hyperparameter tuning processes for the baselines in the original papers.
- REPR See the Movie, Hear the Song, Read the Book: Extending MovieLens-1M, Last.fm-2K, and DBbook with Multimodal Data
by Giuseppe Spillo (University of Bari Aldo Moro), Elio Musacchio (University of Pisa), Cataldo Musto (University of Bari Aldo Moro), Marco de Gemmis (University of Bari Aldo Moro), Pasquale Lops (University of Bari Aldo Moro), Giovanni Semeraro (University of Bari Aldo Moro)The last few years have seen an increasing interest of the RecSys community in the multimodal recommendation research field, as shown by the numerous contributions proposed in the literature. Our paper falls in this research line, as we released a multimodal extension of three state-of-the-art datasets (MovieLens-1M, DBbook, Last.fm-2K) in the movie, book, and music recommendation domains, respectively. Although these datasets have been widely adopted for classical recommendation tasks (e.g., collaborative filtering), the absence of multimodal information has made their use in multimodal recommendation impossible. To fill this gap, we have manually collected multimodal item raw files from different modalities (text, images, audio, and video, when available) for each dataset. Specifically, we have collected, for MovieLens-1M, movie plots (textual information), movie posters (images) and movie trailers (audio and video); for Last.fm-2K, we have collected, for each artist, the tags provided by users (textual information), the most popular album covers (images), and the most popular songs (audio); finally, for DBbook we have collected book abstracts (textual information) and book covers (images). We encoded all this information through state-of-the-art feature encoders, and we released the extended datasets, including the mappings to the raw multimodal information and the encoded features. Finally, we run a benchmark analysis of different recommendation models using MMRec as a multimodal recommendation framework. Our results show that multimodal information can further improve the quality of the recommendation in these domains compared to collaborative filtering alone.
We release the multimodal version of such datasets to foster this research line, including links to download the raw multimodal files and the encoded item features.
- REPR TIM-Rec: Explicit Sparse Feedback on Multi-Item Upselling Recommendations in an Industrial Dataset of Telco Calls
by Alessandro Sbandi (TIM S.p.A.), Federico Siciliano (Sapienza University of Rome), Fabrizio Silvestri (Sapienza University of Rome)Upselling recommendations play a critical role in improving customer engagement and maximizing revenue in the telecommunications industry. However, real-world data on such interactions often presents unique challenges, including multiple recommendations per call and sparse customer feedback, which complicates the evaluation of recommender systems. Our review of the existing literature reveals a critical gap in publicly available datasets that reflect these challenges, limiting progress in developing and evaluating upselling strategies. This work introduces a novel dataset that captures these complexities, offering valuable insights into customer behavior and recommendation effectiveness. The dataset, derived from real-world interactions between customers and service providers, contains multiple recommendations provided in individual calls and sparse feedback, reflecting typical user behavior where interest may be low or unrecorded. To aid in the development of more effective recommendation systems, we provide detailed statistics on recommendation distributions, user engagement, and feedback patterns. Furthermore, we benchmark various recommendation models, from classical approaches to state-of-the-art neural networks, allowing for a comprehensive assessment of their recommendation accuracy in this challenging setting. The dataset, along with the preprocessing implementations, is publicly available in our GitHub repository.
- REPR The XITE Million Sessions Dataset
by Ralvi Isufaj (XITE), Ruslan Tsygankov (XITE), Zoltán Szlávik (XITE)We present the XITE Million Sessions Dataset, a collection of one million music video streaming sessions from an interactive TV platform. This dataset addresses a significant gap in music recommendation research by capturing sequential user interactions with music video content. Each session contains sequences of videos watched by anonymised users, along with metadata including artist information, title, genre and subgenre classifications from XITE’s expert-curated taxonomy, and watch-time metrics. The dataset also includes XITE’s genre hierarchy and subgenre correlation matrix, representing musical relationships established by music experts. We provide MusicBrainz identifiers where possible to enable connections with external music resources. While we do not include the video content itself, the dataset documents how users engage with music in a video-based environment, which may exhibit interaction patterns that differ from audio-only consumption. To demonstrate the dataset’s research utility, we benchmark a standard playlist continuation task using transformer-based and graph-based models. This contribution allows researchers to develop and evaluate recommendation algorithms for music video consumption and examine how existing methods generalise beyond audio-only datasets to screen-based music experiences.
- REPR Time to Split: Exploring Data Splitting Strategies for Offline Evaluation of Sequential Recommenders
by Danil Gusak (AIRI), Anna Volodkevich, Anton Klenitskiy, Alexey Vasilev, Evgeny Frolov (AIRI)Modern sequential recommender systems, ranging from lightweight transformer-based variants to large language models, have become increasingly prominent in academia and industry due to their strong performance in the next-item prediction task. Yet common evaluation protocols for sequential recommendations remain insufficiently developed: they often fail to reflect the corresponding recommendation task accurately, or are not aligned with real-world scenarios. Although the widely used leave-one-out split matches next-item prediction, it permits the overlap between training and test periods, which leads to temporal leakage and an unrealistically long test horizon, ultimately limiting real-world relevance. Global temporal splitting addresses these issues by evaluating on distinct future periods. However, its applications to sequential recommendations remain loosely defined, particularly in terms of selecting target interactions and constructing a validation subset that provides necessary consistency between validation and test metrics. In this paper, we demonstrate that evaluation outcomes can vary significantly across splitting strategies, influencing model rankings and practical deployment decisions. To improve reproducibility in both academic and industrial settings, we systematically compare different splitting strategies for sequential recommendations across multiple datasets and established baselines. Our findings show that the prevalent leave-one-out split often poorly aligns with more realistic evaluation strategies.
- REPR ‘We Share Our Code Online’: Why This Is Not Enough to Ensure Reproducibility and Progress in Recommender Systems Research
by Faisal Shehzad (University of Klagenfurt), Timo Breuer (TH Köln (University of Applied Sciences)), Maria Maistro (University of Copenhagen), Dietmar Jannach (University of Klagenfurt)Issues with reproducibility have been identified as a major factor hampering progress in recommender systems research. In response, researchers increasingly share the code of their models. However, the provision of only the code of the proposed model is usually not sufficient to ensure reproducibility. In many works, the central claim is that a new model is advancing the state-of-the-art. Thus, it is crucial that the entire experiment is reproducible, including the configuration and the results of the considered baselines. With this work, our goal is to gauge the level of reproducibility in algorithms research in recommender systems. We systematically analyzed the reproducibility level of 65 papers published at a top-ranked conference during the last three years. Our results are sobering. While the model code is shared in about two thirds of the papers, the code of the baselines is provided only in eight cases. The hyperparameters of the baselines are reported even less frequently, and how exactly these were determined is not explained in any paper. As a result, not only is it commonly impossible to reproduce the full result tables reported in the papers, but it is also unclear whether the claimed improvements over the state-of-the-art were actually achieved. Overall, we conclude that the research community has not reached the required level of reproducibility yet. We therefore call for more rigorous reproducibility standards to ensure progress in this field.
- REPR Yambda-5B — A Large-Scale Multi-Modal Dataset for Ranking and Retrieval
by Alexander Ploshkin (Yandex), Vladislav Tytskiy (Yandex), Alexey Pismenny (Yandex), Vladimir Baikalov (Yandex), Evgeny Taychinov (Yandex), Artem Permiakov (Yandex), Daniil Burlakov (AIM HIGH TECHNOLOGY), Eugene Krofto (Yandex)We present Yambda-5B, a large-scale open dataset sourced from the Yandex.Music streaming platform. Yambda-5B contains 4.79 billion user-item interactions from 1 million users across 9.39 million tracks. The dataset includes two primary types of interactions: implicit feedback (listening events) and explicit feedback (likes, dislikes, unlikes and undislikes). In addition, we provide audio embeddings for most tracks, generated by a convolutional neural network trained on audio spectrograms. A key distinguishing feature of Yambda-5B is the inclusion of the is_organic flag, which separates organic user actions from recommendation-driven events. This distinction is critical for developing and evaluating machine learning algorithms, as Yandex.Music relies on recommender systems to personalize track selection for users. To support rigorous benchmarking, we introduce an evaluation protocol based on a Global Temporal Split, allowing recommendation algorithms to be assessed in conditions that closely mirror real-world use. We report benchmark results for standard baselines (ItemKNN, iALS) and advanced models (SANSA, SASRec) using a variety of evaluation metrics. By releasing Yambda-5B to the community, we aim to provide a readily accessible, industrial-scale resource to advance research, foster innovation, and promote reproducible results in recommender systems.
List of all industry papers accepted for RecSys 2025 (in alphabetical order).
- IND A Media Content Recommendation Method for Playlist Curators using LLM-Based Query Expansion
by Yuta Hagio (Japan Broadcasting Corporation), Chigusa Yamamura (Japan Broadcasting Corporation), Hiromu Ogawa (Japan Broadcasting Corporation), Hisayuki Ohmata (Japan Broadcasting Corporation), Arisa Fujii (Japan Broadcasting Corporation)Playlist curation is a key factor in media content discovery services, yet efficiently finding diverse, relevant content is challenging for curators owing to time-consuming manual query crafting. We propose a recommendation method that uses large language models (LLMs) for query expansion to assist curators. The proposed system generates multiple diverse queries from a playlist theme (title and optional description) using an LLM. The vectors derived from these expanded queries, along with the original theme vector, retrieve candidates by a vector search of a content database (using multilingual embeddings), enhancing discovery comprehensiveness and diversity. Experiments on Japanese TV programs show that the proposed method significantly improves precision (e.g., P@50 +22 points) compared to a baseline using only the theme vector. This approach enhances curator efficiency, improves playlist quality, and promotes more comprehensive content discovery.
- IND Agentic Personalisation of Cross-Channel Marketing Experiences
by Sami Abboud (Aampe), Eleanor Hanna (Aampe), Olivier Jeunen (Aampe), Vineesha Raheja (Aampe), Schaun Wheeler (Aampe)Consumer applications provide ample opportunities to surface and communicate various forms of content to users. These range from promotional campaigns for new features or subscriptions, to evergreen nudges for engagement, to personalised recommendations, delivered across e-mails, push notifications, and in-app surfaces. The conventional approach to communication orchestration relies heavily on labour-intensive manual marketer work, and inhibits effective personalisation of content, timing, frequency, and copy-writing. We formulate this task under a sequential decision-making framework, where we aim to optimise a modular decision-making policy that maximises incremental engagement for any funnel event. Our approach leverages a Difference-in-Differences design for Individual Treatment Effect estimation, and Thompson sampling to balance the explore-exploit trade-off. We present results from a multi-service application, where our methodology has resulted in significant increases to a variety of goal events across several product features, and is currently deployed across 150 million users.
- IND An Analysis of Learned Product Embeddings in an E-Commerce Context
by Mate Hartstein (IKEA Retail (Ingka Group)), Eva Giannatou (IKEA Retail (Ingka Group)), Martin Tegner (IKEA Retail (Ingka Group))Recommender systems often represent products with learnable embeddings. Yet, we seldom examine the structure of the embedding space, and what implications it has for the recommendation task at hand. In contrast, embeddings in natural language processing are well-understood and offer intuitive properties through word analogies (e.g. “queen – king = woman – man”). In this work, we present a corresponding approach that reveals latent knowledge in the structure of product embeddings. We demonstrate their relevance by evaluating several embeddings learned from different data modalities in a home-furnishing context. Our findings evince distinct embedding strengths: visual embeddings capture explicit attributes like colour and shape; textual embeddings encode abstract concepts like style and functionality; while behavioural embeddings offer versatile representations driven by user interactions. We also highlight trade-offs, and link our evaluations to practical considerations in embedding development within the e-commerce domain.
- IND Balanced Public Service Media Recommendation Trade-offs with a Light Carbon Footprint
by Marcel Hauck (ARD Online and Mainz University of Applied Sciences), Michael Huber (ARD Online), Juri Diels (ARD Online), David Wittenberg (ARD Online), Dietmar Jannach (University of Klagenfurt)Public service media (PSM) providers commonly face the challenge of balancing user engagement metrics and public value. In this case study, we report on the insights obtained at ARD, Germany’s largest PSM provider, when investigating the effectiveness of different collaborative filtering techniques on their video-on-demand platform ARD Mediathek. While an offline evaluation indicated that a modern model based on a denoising auto-encoder might lead to the best prediction accuracy, A/B testing revealed that an item-based nearest-neighbor technique excelled both in terms of engagement and public value metrics. Our findings thus suggest that traditional, lightweight methods should not be easily dismissed, not least given their comparably limited resource requirements and light carbon footprint. To enable future research on this topic, we provide a real-world dataset with usage data from our platform.
- IND Balancing Fine-tuning and RAG: A Hybrid Strategy for Dynamic LLM Recommendation Updates
by Changping Meng (Google), Hongyi Ling (Google), Jianling Wang (Google DeepMind), Yifan Liu (Google), Shuzhou Zhang (Google), Dapeng Hong (Google), Mingyan Gao (Google), Onkar Dalal (Google), Ed Chi (Google DeepMind), Lichan Hong (Google DeepMind), Haokai Lu (Google DeepMind), Ningren Han (Google)Large Language Models (LLMs) empower recommendation systems through their advanced reasoning and planning capabilities. However, the dynamic nature of user interests and content poses a significant challenge: While initial fine-tuning aligns LLMs with domain knowledge and user preferences, it fails to capture such real-time changes, necessitating robust update mechanisms. This paper investigates strategies for updating LLM-powered recommenders, focusing on the trade-offs between ongoing fine-tuning and Retrieval-Augmented Generation (RAG). Using an LLM-powered user interest exploration system as a case study, we perform a comparative analysis of these methods across dimensions like cost, agility, and knowledge incorporation. We propose a hybrid update strategy that leverages the long-term knowledge adaptation of periodic fine-tuning with the agility of low-cost RAG. We demonstrate through live A/B experiments on a billion-user platform that this hybrid approach yields statistically significant improvements in user satisfaction, offering a practical and cost-effective framework for maintaining high-quality LLM-powered recommender systems.
- IND Closing the Online-Offline Gap: A Scalable Framework for Composed Model Evaluation
by Mahanth Kumar Beeraka (Meta Platforms, Inc.), Chen Chen (Meta Platforms, Inc.), Yining Lu (Meta Platforms, Inc.), Briac Marcatte (Meta Platforms, Inc.), Weikun Lyu (Meta Platforms, Inc.), Brooke Bian (Meta Platforms, Inc.), Enriko Aryanto (Meta Platforms, Inc.), Ellie Wen (Meta Platforms, Inc.), Mohamed Radwan (Meta Platforms, Inc.), Tianshan Cui (Meta Platforms, Inc.), Wenjing Lu (Meta Platforms, Inc.), Mohsen Malmir (Meta Platforms, Inc.), Yang Li (Meta Platforms, Inc.)We propose iPCF (Intelligent Prediction Composition Framework), a platform for training and evaluating ranking models. Unlike traditional approaches (such as frequent retraining, robust feature selection, or output calibration) that focus solely on a model’s standalone prediction quality, iPCF evaluates the model’s performance in a production-like environment where multiple models are composed together to estimate the final conversion probability (eCVR). This framework is especially critical in Meta’s Lattice based modeling stack, where multi-task models produce several predictions used downstream in business logic. By introducing a new metric based on the simulated recomposed final eCVR, iPCF enables more accurate offline evaluation and informed candidate selection. In production use, the framework has led to up to an 18% improvement in L1 distance correlation with final top-line results. Beyond evaluation, iPCF brings serving-awareness into the model development cycle, improving the robustness, efficiency, and impact of ranking models.
- IND Cold Starting a New Content Type: A Case Study with Netflix Live
by Yunan Hu (Netflix), Mark Thornburg (Netflix), Mario Garcia Armas (Netflix), Vito Ostuni (Netflix), Anne Cocos (Netflix), Kriti Kohli (Netflix), Christoph Kofler (Netflix), Rob Saltiel (Netflix)Industrial recommender systems often face challenges when personalizing content under an ever-changing, heterogeneous item catalog. On Netflix, for example, members can watch TV shows and movies on demand, play the latest games, or tune in to thrilling live events. The difficulty of recommending new items with limited historical interaction data is often referred to as “the cold start problem.” This problem becomes exacerbated when an entirely new type of content is introduced into a recommender system, requiring the cold-start of a new content type. The purpose of this work is to review an algorithmic approach we implemented at Netflix to efficiently cold-start live events. We validated this approach through a series of online experiments that resulted in increased live engagement (+20%) across Netflix’s global member base without negatively impacting core business metrics.
- IND Contrastive Conditional Embeddings for Item-based Recommendation at E-commerce Scale
by Akira Fukumoto (Rakuten Group, Inc.), Aghiles Salah (Rakuten Tech Center Europe), Sarthak Shrivastava (Rakuten Group, Inc.), Alexandru Tatar (Rakuten Tech Center Europe), Yannick Schwartz (Rakuten Tech Center Europe), Vincent Michel (Rakuten Tech Center Europe), Lee Xiong (Meta Platforms)Item-based recommendation is crucial in e-commerce for helping users navigate the myriad of options available to them. While embedding-based methods are standard, learning high-quality item representations from sparse co-occurrence data is challenging. Deployment at scale is even harder, with a lack of well-documented real-world successes. The two main obstacles are the model size, which scales linearly with the number of items, and the co-occurrence-based training data, which is massive and sparse, leading to significant memory, storage, and compute demands. In this work, we propose a conditional factor model combining item co-occurrences and textual information to generate effective embeddings through a contrastive loss with mixed negative sampling for e-commerce recommendations. Our production model exceeds 10 billion parameters, half of which are trained daily on over 2 billion item-item co-occurrence pairs. We detail key implementation choices that allowed us to overcome the above challenges and successfully deploy the model on Rakuten Group, Inc.’s large-scale e-commerce platform in Japan. A/B tests show strong impact, with purchase rate gains of +16.38% and +4.01% across two major recommendation widgets.
- IND Cross-Batch Aggregation for Streaming Learning from Label Proportions in Industrial-Scale Recommendation Systems
by Jonathan Valverde (Google DeepMind), Tiansheng Yao (Google DeepMind), Xiang Li (Google LLC), Yuan Gao (Google LLC), Yin Zhang (Google DeepMind), Andrew Evdokimov (Google LLC), Adam Kraft (Google DeepMind), Samuel Ieong (Google LLC), Jerry Zhang (Google LLC), Ed Chi (Google DeepMind), Derek Cheng (Google DeepMind), Ruoxi Wang (Google DeepMind)Recent controls over user data have diluted user signals essential to train industrial recommendation systems, replacing traditional event-level labels with aggregated item-level labels. Fitting these noisy aggregates into the event-level paradigm used by industrial recommendation systems causes models to be biased and miscalibrated, hurting critical business metrics. Learning from Label Proportions (LLP), a framework where instance-level prediction models are trained from aggregated signals, offers a principled solution to this problem — as long as all samples from an aggregate are present within the same training batch. Unfortunately, industry-scale recommender systems impose infrastructure constraints that fail this critical assumption because (1) they are trained in a sequential streaming framework that spreads aggregates across batches, (2) aggregates often exceed the size of a single batch, and (3) label noise makes it difficult to identify the time boundaries that correspond to the aggregated label. To address these issues, we propose a novel technique called Cross Batch Aggregate (XBA) Loss to adapt LLP to the streaming setting. We design the loss to have a gradient that mimics the true aggregated loss gradient, approximating the distribution of the aggregate by using cumulative statistics across each aggregate. This enables (1) optimizing for model calibration and (2) learning a conversion model from the aggregate signals. 
We have deployed this technique to a Google Ads system impacted by conversion signal loss due to privacy constraints, delivering significant improvements in model calibration (48.8% reduction in online bias), advertiser value, and business metrics. Our key contribution is extending LLP to the streaming setting, providing a practical solution that bridges the gap between LLP research and industrial applications.
- IND Decoupled Entity Representation Learning for Pinterest Ads Ranking
by Jie Liu (Pinterest, Inc), Yinrui Li (Pinterest, Inc), Jiankai Sun (Pinterest, Inc), Kungang Li (Pinterest), Han Sun (Pinterest), Sihan Wang (Pinterest, Inc), Huasen Wu (Pinterest, Inc), Siyuan Gao (Pinterest, Inc), Paulo Soares (Pinterest, Inc), Nan Li (Pinterest, Inc), Zhifang Liu (Pinterest, Inc), Haoyang Li (Pinterest, Inc), Siping Ji (Pinterest), Ling Leng (Pinterest), Prathibha Deshikachar (Pinterest)In this paper, we introduce a novel framework following an upstream-downstream paradigm to construct user and item (Pin) embeddings from diverse data sources, which are essential for Pinterest to deliver personalized Pins and ads effectively. Our upstream models are trained on extensive data sources featuring varied signals, utilizing complex architectures to capture intricate relationships between users and Pins on Pinterest. To ensure scalability of the upstream models, entity embeddings are learned and regularly refreshed rather than computed in real time, allowing for asynchronous interaction between the upstream and downstream models. These embeddings are then integrated as input features in numerous downstream tasks, including ad retrieval and ranking models for CTR and CVR predictions. We demonstrate that our framework achieves notable performance improvements in both offline and online settings across various downstream tasks. This framework has been deployed in Pinterest’s production ad ranking systems, resulting in significant gains in online metrics.
- IND Deep Reinforcement Learning for Ranking Utility Tuning in the Ad Recommender System at Pinterest
by Xiao Yang (Pinterest), Mehdi Ayed (Pinterest), Longyu Zhao (Pinterest), Fan Zhou (Pinterest), Yuchen Shen (Pinterest), Abe Engle (Pinterest), Jinfeng Zhuang (Pinterest), Ling Leng (Pinterest), Jiajing Xu (Pinterest), Charles Rosenberg (Pinterest), Prathibha Deshikachar (Pinterest)The ranking utility function in an ad recommender system, which linearly combines predictions of various business goals, plays a central role in balancing values across the platform, advertisers, and users. Traditional manual tuning, while offering simplicity and interpretability, often yields suboptimal results due to its unprincipled tuning objectives, the vast amount of parameter combinations, and its lack of personalization and adaptability to seasonality. In this work, we propose a general Deep Reinforcement Learning framework for Personalized Utility Tuning (DRL-PUT) to address the challenges of multi-objective optimization within ad recommender systems. Our key contributions include: 1) Formulating the problem as a reinforcement learning task: given the state of an ad request, we predict the optimal hyperparameters to maximize a pre-defined reward. 2) Developing an approach to directly learn an optimal policy model using online serving logs, avoiding the need to estimate a value function, which is inherently challenging due to the high variance and unbalanced distribution of immediate rewards. We evaluated DRL-PUT through an online A/B experiment in Pinterest’s ad recommender system. Compared to the baseline manual utility tuning approach, DRL-PUT improved the click-through rate by 9.7% and the long click-through rate by 7.7% on the treated segment.
- INDEfficient Off-Policy Evaluation of Content Blending in Station-Based Music Experiences
by Chelsea Weaver (Amazon Music), Arvind Balasubramanian (Amazon Music), Juan Borgnino (Amazon Music), Ben London (Amazon Music)Audio streaming services, on both voice assistants and visual apps, often field requests such as “play more like Foo Fighters.” The service then returns a sequence of tracks that is both relevant to the request and personalized to the requester. While it is natural to evaluate the policies that produce these sequences in terms of customer engagement, such metrics do not assess their performance on other key business goals. We present our work to implement a content blending strategy to increase the prevalence of specific strategically-important content in these sequences and show how it allowed us to meet the needs of our artist and record label customers while minimizing harm to playback rates. In particular, we describe our efficient extension of off-policy evaluation to evaluate how blending impacts both engagement and the number of successful new release plays. We demonstrate how we used this work to choose blend rates for new policies so as to maximize our engagement metric while preserving the new release metric baseline set by the current production policy. We also investigate the accuracy of these methods by comparing our estimates to online results.
- INDEnhancing Embedding Representation Stability in Recommendation Systems with Semantic ID
by Carolina Zheng (Columbia University), Minhui Huang (Meta), Dmitrii Pedchenko (Meta), Kaushik Rangadurai (Meta), Siyu Wang (Meta), Fan Xia (Meta), Gaby Nahum (Meta), Jie Lei (Meta), Yang Yang (Meta), Tao Liu (Meta), Zutian Luo (Meta), Xiaohan Wei (Meta), Dinesh Ramasamy (Meta), Jiyan Yang (Meta), Yiping Han (Meta), Lin Yang (Meta), Hangjun Xu (Meta), Rong Jin (Meta), Shuang Yang (Meta)The exponential growth of online content has posed significant challenges to ID-based models in industrial recommendation systems, ranging from extremely high cardinality and dynamically growing ID space, to highly-skewed engagement distributions, to prediction instability as a result of natural ID life cycles. This paper examines these challenges and introduces Semantic ID prefix-ngram, a novel token parameterization technique that significantly improves the performance of the original Semantic ID. Semantic ID prefix-ngram creates semantically meaningful collisions by hierarchically clustering items based on their content embeddings, as opposed to random assignments. Through extensive experimentation, we demonstrate that Semantic ID prefix-ngram not only addresses embedding instability but also significantly improves tail ID modeling, and mitigates representation shifts. We report our experience of integrating Semantic ID into Meta’s production Ads Ranking system, leading to notable performance gains.
- INDEnhancing Online Ranking Systems via Multi-Surface Co-Training for Content Understanding
by Gwendolyn Zhao (Google), Yilin Zheng (Google), Raghu Keshavan (Google), Lukasz Heldt (Google), Qian Sun (Google), Fabio Soldo (Google), Li Wei (Google), Aniruddh Nath (Google), Nikhil Khani (Google), Weilong Yang (Google), Dapo Omidiran (Google), Rein Zhang (Google), Mei Chen (gNucleus AI, Inc), Lichan Hong (Google Deepmind), Xinyang Yi (Google Deepmind)Content understanding is an important part of real-world recommendation systems. This paper introduces a Multi-surface Co-training (MulCo) system, designed to enhance online ranking systems by improving content understanding. The model is trained through a task-aligned co-training approach, leveraging objectives and data from multiple video discovery feed surfaces and various pre-trained embeddings. It separates video content understanding into an offline model, enabling scalability and efficient resource use. Experiments demonstrate that MulCo significantly outperforms non-task-aligned pre-trained embeddings and achieves substantial gains in online user value, e.g., satisfied engagement and freshness metrics. This system presents a practical solution to improve content understanding in multi-surface large-scale recommender systems.
- INDGeneralized User Representations for Large-Scale Recommendations and Downstream Tasks
by Ghazal Fazelnia (Spotify), Sanket Gupta (Spotify), Claire Keum (Spotify), Mark Koh (Spotify), Timothy Heath (Spotify), Guillermo Carrasco Hernández (Spotify), Stephen Xie (Spotify), Nandini Singh (Spotify), Ian Anderson (Spotify), Maya Hristakeva (Spotify), Petter Skiden (Spotify), Mounia Lalmas (Spotify)Accurately capturing diverse user preferences at scale is a core challenge for large-scale recommender systems like Spotify’s, given the complexity and variability of user behavior. To address this, we propose a two-stage framework that combines representation learning and transfer learning to produce generalized user embeddings. In the first stage, an autoencoder compresses rich user features into a compact latent space. In the second, task-specific models consume these embeddings via transfer learning, removing the need for manual feature engineering. This approach enhances flexibility by allowing dynamic updates to input features, enabling near-real-time responsiveness to user behavior. The framework has been deployed in production at Spotify with an efficient infrastructure that allows downstream models to operate independently. Extensive online experiments in a live setting show significant improvements in metrics such as consumption share, content discovery, and search success. Additionally, our method achieves these gains while substantially reducing infrastructure costs.
- INDIdentifying Offline Metrics that Predict Online Impact: A Pragmatic Strategy for Real-World Recommender Systems
by Timo Wilm (OTTO (GmbH & Co. KGaA)), Philipp Normann (TU Wien)A critical challenge in recommender systems is to establish reliable relationships between offline and online metrics that predict real-world performance. Motivated by recent advances in Pareto front approximation, we introduce a pragmatic strategy for identifying offline metrics that align with online impact. A key advantage of this approach is its ability to simultaneously serve multiple test groups, each with distinct offline performance metrics, in an online experiment controlled by a single model. The method is model-agnostic for systems with a neural network backbone, enabling broad applicability across architectures and domains. We validate the strategy through a large-scale online experiment in the field of session-based recommender systems on the OTTO e-commerce platform. The online experiment identifies significant alignments between offline metrics and real-world click-through rate, post-click conversion rate and units sold. Our strategy provides industry practitioners with a valuable tool for understanding offline-to-online metric relationships and making informed, data-driven decisions.
- INDImprove the Personalization of Large-Scale Ranking Systems by Integrating User Survey Feedback
by Mengxi Lv (Meta), Drew Hogg (Meta), Thomas Grubb (Meta), Shashank Bassi (Meta), Min Li (Meta), Cayman Simpson (Meta), Senthil Rajagopalan (Meta)Learning user interests is a crucial aspect of personalized recommendation, as it can create a more personal experience for users to drive deep engagement, satisfaction, and loyalty. In this work, we focus on improving users’ interest relevance experience, making users truly feel “this app knows me!” and thus leading to long-term user retention. However, accurately capturing users’ interest remains a significant challenge. Traditional approaches using users’ historical engagements with interest clusters lack sensitivity and accuracy, because such heuristic rules on predefined clusters can easily fall into the ranking feedback loop and thus poorly align with users’ true interest preferences. In this paper, we built a User True Interest Survey (UTIS) model to directly train on user survey data and predict a user’s interest affinity on any given piece of content. The UTIS model is added to the main ranking system to reduce feedback bias and leads to better relevance towards users’ core interests. The UTIS model demonstrates high offline accuracy and high generalization capability in online experiments. On a commercial video platform serving billions of users, we observed significant metric wins, including tier 0 user retention and engagements, higher quality and more trustworthy content recommendations, and higher user satisfaction in surveys. Overall, this work demonstrates that improving the relevance of a ranking system by leveraging direct user survey feedback can be a promising solution to enhance the personalization of large-scale ranking systems and improve user satisfaction.
- INDImproving Visual Recommendation on E-commerce Platforms Using Vision-Language Models
by Yuki Yada (Mercari, Inc.), Sho Akiyama (Mercari, Inc.), Ryo Watanabe (Mercari, Inc.), Yuta Ueno (Mercari, Inc.), Yusuke Shido (Mercari, Inc.), Andre Rusli (Mercari, Inc.)On large-scale e-commerce platforms with tens of millions of active monthly users, recommending visually similar products is essential for enabling users to efficiently discover items that align with their preferences. This study presents the application of a vision-language model (VLM)—which has demonstrated strong performance in image recognition and image-text retrieval tasks—to product recommendations on Mercari, a major consumer-to-consumer marketplace used by more than 20 million monthly users in Japan. Specifically, we fine-tuned SigLIP, a VLM employing a sigmoid-based contrastive loss, using one million product image-title pairs from Mercari collected over a three-month period, and developed an image encoder for generating item embeddings used in the recommendation system. Our evaluation comprised an offline analysis of historical interaction logs and an online A/B test in a production environment. In offline analysis, the model achieved a 9.1% improvement in nDCG@5 compared with the baseline. In the online A/B test, the click-through rate improved by 50% whereas the conversion rate improved by 14% compared with the existing model. These results demonstrate the effectiveness of VLM-based encoders for e-commerce product recommendations and provide practical insights into the development of visual similarity-based recommendation systems.
- INDIn-context Learning for Addressing User Cold-start in Sequential Movie Recommenders
by Xurong Liang (Amazon), Vu Nguyen (Amazon), Vuong Le (Amazon), Paul Albert (Amazon), Julien Monteil (Amazon)The user cold-start problem remains a fundamental challenge for sequential recommender systems, particularly in large-scale video streaming services where a substantial portion of users have limited or no historical interaction data. In this work, we address this issue by proposing a framework that leverages Large Language Models (LLMs) to enrich interaction histories using user metadata. Our approach generates a set of imaginary video items relevant to a given user’s demographic, represented through structured item key-value attributes. The generated items are inserted into users’ interaction sequences using early or late fusion strategies. We find that the generated user histories enable better initial user profiling for absolute cold users and enhanced preference modeling for nearly cold users. Experimental results on the public ML-1M dataset and an internal dataset from Amazon MX Player streaming service demonstrate the effectiveness of our LLM-based augmentation method in mitigating cold-start challenges.
- INDIndustry Insights from Comparing Deep Learning and GBDT Models for E-Commerce Learning-to-Rank
by Yunus Lutz (OTTO (GmbH & Co. KGaA)), Timo Wilm (OTTO (GmbH & Co. KGaA)), Philipp Duwe (OTTO (GmbH & Co. KGaA))In e-commerce recommender and search systems, tree-based models, such as LambdaMART, have set a strong baseline for Learning-to-Rank (LTR) tasks. Despite their effectiveness and widespread adoption in industry, the debate continues whether deep neural networks (DNNs) can outperform traditional tree-based models in this domain. To contribute to this discussion, we systematically benchmark DNNs against our production-grade LambdaMART model. We evaluate multiple DNN architectures and loss functions on a proprietary dataset from OTTO and validate our findings through an 8-week online A/B test. The results show that a simple DNN architecture outperforms a strong tree-based baseline in terms of total clicks and revenue, while achieving parity in total units sold.
- INDItem-centric Exploration for Cold Start Problem
by Dong Wang (Google LLC), Junyi Jiao (Google LLC), Arnab Bhadury (Google), Yaping Zhang (Google), Mingyan Gao (Google), Onkar Dalal (Google)Recommender systems face a critical challenge in the item cold-start problem, which limits content diversity and exacerbates popularity bias by struggling to recommend new items. While existing solutions often rely on auxiliary data, this paper illuminates a distinct, yet equally pressing, issue stemming from the inherent user-centricity of many recommender systems. We argue that in environments with large and rapidly expanding item inventories, the traditional focus on finding the “best item for a user” can inadvertently obscure the ideal audience for nascent content. To counter this, we introduce the concept of item-centric recommendations, shifting the paradigm to identify the optimal users for new items. Our initial realization of this vision involves an item-centric control integrated into an exploration system. This control employs a Bayesian model with Beta distributions to assess candidate items based on a predicted balance between user satisfaction and the item’s inherent quality. Empirical online evaluations reveal that this straightforward control markedly improves cold-start targeting efficacy, enhances user satisfaction with newly explored content, and significantly increases overall exploration efficiency.
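A minimal sketch of the Bayesian control described above, assuming a Beta posterior per item over the probability that a matched user is satisfied, scored by Thompson sampling so uncertain new items still receive exploration traffic. Class and method names are illustrative, not from the paper.

```python
import random


class ItemCentricControl:
    """Per-item Beta(alpha, beta) posterior over the satisfaction rate.

    Illustrative stand-in for an item-centric exploration control:
    new items start from a uniform Beta(1, 1) prior, and each candidate
    is scored by a Thompson sample from its posterior.
    """

    def __init__(self):
        self.params = {}  # item_id -> [alpha, beta]

    def update(self, item_id, satisfied):
        # One Bernoulli observation: a satisfied view bumps alpha, else beta.
        counts = self.params.setdefault(item_id, [1.0, 1.0])
        counts[0 if satisfied else 1] += 1.0

    def score(self, item_id, rng=random):
        # Thompson sample from the posterior; unseen items fall back to the prior.
        alpha, beta = self.params.get(item_id, (1.0, 1.0))
        return rng.betavariate(alpha, beta)
```

In an item-centric setting, this sampled satisfaction estimate would be balanced against predicted item quality to pick the best users for a new item, rather than the best item for a user.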
- INDKamae: Bridging Spark and Keras for Seamless ML Preprocessing
by George Barrowclough (Expedia Group), Marian Andrecki (Expedia Group), James Shinner (Expedia Group), Daniele Donghi (Expedia Group)In production recommender systems, feature preprocessing must be faithfully replicated across training and inference environments. This often requires duplicating logic between offline and online environments, increasing engineering effort and introducing risks of dataset shift. We present Kamae, an open-source Python library that bridges this gap by translating PySpark preprocessing pipelines into equivalent Keras models. Kamae provides a suite of configurable Spark transformers and estimators, each mapped to a corresponding Keras layer, enabling consistent, end-to-end preprocessing across the ML lifecycle. The framework’s utility is illustrated on real-world use cases, including the MovieLens dataset and Expedia’s Learning-to-Rank pipelines. The code is available at https://github.com/ExpediaGroup/kamae.
- INDLADDER: LLM-Annotated Data for Dogfooded Evaluation of Rankings
by Mattia Ottoborgo (TrustPilot)In this paper we showcase the implementation of LADDER: a method that utilizes a Large Language Model to annotate thousands of consumer reviews to train a point-wise learning-to-rank algorithm. By applying LADDER, we significantly improved the relevance of the top 4 reviews presented to users, demonstrably reducing the need to access the full review collection by 5%. This outcome highlights LADDER’s ability to enhance user experience by providing sufficient information within the initial review set, thereby streamlining the decision-making process. We discuss the efficiency gains in large-scale data labeling, the positive impact on trust and relevance in review presentation without sacrificing usability, and key insights into effectively integrating domain expertise into LLM annotation for high-quality results.
- INDLLM-Powered Nuanced Video Attribute Annotation for Enhanced Recommendations
by Boyuan Long (Google), Yueqi Wang (Google), Hiloni Mehta (Google), Mick Zomnir (Google), Omkar Pathak (Google), Changping Meng (Google), Ruolin Jia (Google), Yajun Peng (Google), Dapeng Hong (Google), Xia Wu (Google), Mingyan Gao (Google), Onkar Dalal (Google), Ningren Han (Google)This paper presents a case study on deploying Large Language Models (LLMs) as an advanced “annotation” mechanism to achieve nuanced content understanding (e.g., discerning content “vibe”) at scale within a large-scale industrial short-form video recommendation system. Traditional machine learning classifiers for content understanding face protracted development cycles and a lack of deep, nuanced comprehension. The “LLM-as-annotators” approach addresses these by significantly shortening development times and enabling the annotation of subtle attributes. This work details an end-to-end workflow encompassing: (1) iterative definition and robust evaluation of target attributes, refined by offline metrics and online A/B testing; (2) scalable offline bulk annotation of video corpora using LLMs with multimodal features, optimized inference, and knowledge distillation for broad application; and (3) integration of these rich annotations into the online recommendation serving system, for example, through personalized restrict retrieval. Experimental results demonstrate the efficacy of this approach, with LLMs outperforming human raters in offline annotation quality for nuanced attributes and yielding significant improvements of user participation and satisfied consumption in online A/B tests. The study provides insights into designing and scaling production-level LLM pipelines for rich content evaluation, highlighting the adaptability and benefits of LLM-generated nuanced understanding for enhancing content discovery, user satisfaction, and the overall effectiveness of modern recommendation systems.
- INDLeveraging Explicit Negative Feedback in Large-Scale Recommendation Systems: A Case Study
by Madhura Raju (TikTok Inc), Manisha Sharma (TikTok Inc.), Hongyu Xiong (TikTok, Inc.), Bingfeng Deng (TikTok, Inc.), Meng Na (TikTok Inc)What users dislike can be just as important as what they engage with, yet explicit negative user feedback remains underutilized in most recommendation systems. This paper presents practical approaches for capturing such feedback through lightweight, context-aware surveys and in-feed interactions. Referencing a case study on large-scale implementations at TikTok, we demonstrate how incorporating user feedback signals, once denoised and modeled, can improve feed quality, content relevance, and long-term user engagement. Our findings highlight that even small, well-designed feedback mechanisms can meaningfully improve user experience and trust.
- INDLocation Matters: Leveraging Multi-Resolution Geo-Embeddings for Housing Search
by Ivo Silva (QuintoAndar), Guilherme Bonaldo (QuintoAndar), Pedro Nogueira (QuintoAndar)QuintoAndar Group is Latin America’s largest housing platform, revolutionizing property rentals and sales. Headquartered in Brazil, it simplifies the housing process by eliminating paperwork and enhancing accessibility for tenants, buyers, and landlords. With thousands of houses available in each city, users struggle to find the ideal home. In this context, location plays a pivotal role, as it significantly influences property value, access to amenities, and quality of life. A great location can make even a modest home highly desirable. Therefore, incorporating location into recommendations is essential for their effectiveness. We propose a geo-aware embedding framework to address sparsity and spatial nuances in housing recommendations on digital rental platforms. Our approach integrates a hierarchical H3 [3] grid at multiple levels into a two-tower neural architecture. We compare our method with a traditional matrix factorization baseline and a single-resolution variant using interaction data from our platform. Embedding-specific evaluation reveals richer and more balanced embedding representations, while offline ranking simulations demonstrate a substantial uplift in recommendation quality.
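The multi-resolution idea above can be sketched as summing learned cell embeddings at several grid levels. A simple lat/lng rounding grid stands in for H3 here, and all names, sizes, and resolutions are illustrative assumptions.

```python
import numpy as np


class MultiResolutionGeoEmbedding:
    """Sum a location's cell embeddings across several grid resolutions.

    Stand-in sketch: the paper uses hierarchical H3 cells; here each
    resolution is a plain lat/lng rounding grid with a hashed vocabulary.
    Coarse cells are shared by nearby listings, fine cells separate them.
    """

    def __init__(self, resolutions=(1.0, 0.1, 0.01), dim=8, vocab=1024, seed=0):
        rng = np.random.default_rng(seed)
        self.resolutions = resolutions
        self.vocab = vocab
        # One table per resolution (randomly initialized; learned in practice).
        self.tables = [rng.normal(size=(vocab, dim)) for _ in resolutions]

    def _cell(self, lat, lng, res):
        # Integer-tuple hashes are deterministic, so lookups are stable.
        return hash((round(lat / res), round(lng / res))) % self.vocab

    def embed(self, lat, lng):
        return sum(table[self._cell(lat, lng, res)]
                   for table, res in zip(self.tables, self.resolutions))
```

In a two-tower setup, a vector like this would feed the item tower (and a user's location history the user tower), letting the model trade off coarse neighborhood signal against fine street-level signal.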
- INDMetadata Generation and Evaluation using LLMs – Case Study on Canonical Titles
by Sinan Zhu (Indeed.com), Sanja Simonovikj (Indeed.com), Darren Edmonds (Indeed.com), Yang Sun (Indeed.com)In online job search platforms, autocomplete plays a crucial role in providing immediate, structured suggestions that guide users through their query process. However, inconsistencies in job title expressions, such as ‘sr data scientist’ versus ‘data scientist senior’, or embellished forms such as ‘superstar software engineer’, can undermine the quality of autocomplete suggestions and diminish user satisfaction. Traditional normalization methods rely on manually curated vocabularies, which are labor intensive and often insufficient to capture the diverse variations in raw job titles. In this work, we present an automated and scalable framework for canonical job title generation that leverages large language models (LLMs) alongside embedding-based similarity measures to derive normalized job titles directly from raw data. Our approach systematically removes irrelevant information, enforces a consistent format, and eliminates overly generic or redundant titles by combining LLM-generated normalization with a two-stage deduplication process. Evaluated on a dataset labeled via a human/LLM mix, our method demonstrates significant improvements in normalization quality, with offline accuracy gains of 18.6% over baseline methods and online A/B tests showing over 160% enhancement in user engagement metrics.
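One stage of the deduplication described above can be sketched as a greedy cosine-similarity filter over title embeddings. The threshold, function name, and greedy scan order are illustrative assumptions, not the paper's exact two-stage procedure.

```python
import numpy as np


def dedupe_titles(titles, embeddings, threshold=0.95):
    """Keep a title only if it is not a near-duplicate of any kept title.

    Greedy sketch: titles are scanned in order, and a candidate is dropped
    when its cosine similarity to an already-kept title reaches `threshold`.
    """
    kept = []  # list of (title, unit-norm embedding)
    for title, emb in zip(titles, np.asarray(embeddings, dtype=float)):
        unit = emb / np.linalg.norm(emb)
        if all(float(unit @ kept_emb) < threshold for _, kept_emb in kept):
            kept.append((title, unit))
    return [title for title, _ in kept]
```

Run after LLM normalization, a filter like this collapses variants such as "sr data scientist" and "data scientist senior" once they map to nearby embeddings.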
- INDMinimize Negative Experiences in Video Recommendation Systems with Multimodal Large Language Models
by Suman Malani (Google, Inc), Youwei Zhang (Google), Liang Liu (Google)Detecting and limiting negative user experiences in recommendation systems with survey feedback modeling is difficult due to ultra-sparse, imbalanced, and noisy data. The proposed approach fine-tunes a multimodal Large Language Model (MLLM) on survey data enriched with contextual information, such as post engagement features and community data, as a teacher model to generate silver labels. A highly negative ranking model (HNRM) is trained via knowledge distillation using both the original sparse survey labels and the generated silver labels. This approach significantly improves model generalization, decreases calibration error rate, increases engagement while reducing negative experiences measured by survey negative experience rates in online A/B tests, and allows the model to scale beyond the limitations imposed by the original sparse and noisy dataset.
- INDNever Miss an Episode: How LLMs are Powering Serial Content Discovery on YouTube
by Aditee Kumthekar (Google Inc), Li Wei (Google), Andrea Bettale (Google Inc), Mahesh Sathiamoorthy (Bespoke Labs, Ex-Google), Zrinka Puljiz (Google Inc), Aditya Mahajan (Google Inc)Leveraging large language models (LLMs) through prompting presents a cost-effective approach to build scalable systems without traditional model training. This paper showcases the effectiveness of using a simple few-shot LLM prompt to develop a scalable and easily maintainable system that addresses a real-world user need. A critical user journey in video recommendation is watching serial content, which requires viewing episodes in a specific sequence. The existing method on YouTube for identifying serial content relied on manual creator tagging of playlists or inflexible regular expressions. These methods proved difficult to maintain and scale, limiting the system’s ability to effectively identify and recommend serial content. This paper demonstrates that a carefully designed few-shot LLM prompt can accurately identify serial playlists at scale, improving user experience with minimal engineering. The paper details the challenges and lessons learned in developing and deploying this prompting-based system.
- INDNot All Impressions Are Created Equal: Psychology-Informed Retention Optimization for Short-Form Video Recommendation
by Yuyan Wang (Stanford University), Jing Zhong (Meta Platforms Inc.), Yuxin Cui (Meta Platforms Inc.), Zhaohui Guo (Meta Platforms Inc.), Chuanqi Wei (Meta Platforms Inc.), Yanchen Wang (Meta Platforms Inc.), Zellux Wang (Meta Platforms Inc.)Recommender systems that are optimized only for short-term engagement can lead to undesirable outcomes and hurt long-term consumer experience. In response, researchers and practitioners have proposed to incorporate retention signals into recommender systems. Existing retention models are built on item-level interactions where every impression is weighted equally. However, on short-form video platforms where content is presented sequentially and passively consumed, users are unlikely to engage equally with every video, and it is hard to establish any meaningful relationships between a short video watch and long-term retention behaviors. In this work, we propose a psychology-informed retention modeling approach grounded in the peak–end rule, which suggests that people evaluate past experiences largely based on the most intense moment (“peak”) and the final moment (“end”). Specifically, we train a retention model that predicts user return based on the peak and end moments of each session, which is then incorporated into a multi-stage recommender system. We implemented our approach on Facebook Reels, one of the world’s largest short-form video recommendation platforms. In a long-term A/B test against the production system, our model delivered significant improvements in Daily Active Users and total sessions, suggesting an improved long-term user experience.
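The peak-end session summary described above is simple to state in code; a minimal sketch, assuming per-impression engagement scores are already available (the function name is illustrative):

```python
def peak_end_features(session_scores):
    """Summarize a session by its most intense moment and its final one.

    Sketch of the peak-end rule: only these two values, not every
    impression weighted equally, feed the session representation.
    """
    if not session_scores:
        return 0.0, 0.0
    return max(session_scores), session_scores[-1]
```

A retention model would then predict user return from these peak and end features per session, instead of treating every short-video impression as equally informative.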
- INDOperational Twin–Driven AI Recommender for Strategic Service Planning
by Vivek Singh (Siemens Healthineers), Sarith Mohan (Siemens Healthineers), Chetan Srinidhi (Siemens Healthineers), Santosh Pai (Siemens Healthineers), Ullaskrishnan Poikavila (Siemens Healthineers), Codruta Ene (Siemens Healthineers), Ankur Kapoor (Siemens Healthineers), Neil Biehn (Siemens Healthineers), Dorin Comaniciu (Siemens Healthineers)Traditional service management relies heavily on manual processes due to data complexity and human involvement, limiting the impact of AI in strategic planning. We present an AI recommender system that leverages an operational twin of service operations to optimize long-term KPIs using Monte Carlo search and mixed-integer programming. Focusing on personnel allocation for large healthcare equipment, the system accounts for domain-specific constraints like specialization and continuity. We deployed the system at Siemens Healthineers to support over 300,000 pieces of equipment across the U.S. and report productivity gains from over a year of real-world use and key lessons for adoption at scale.
- INDOrthogonal Low Rank Embedding Stabilization
by Kevin Zielnicki (Netflix), Ko-Jen Hsiao (Netflix)The instability of embedding spaces across model retraining cycles presents significant challenges to downstream applications using user or item embeddings derived from recommendation systems as input features. This paper introduces a novel orthogonal low-rank transformation methodology designed to stabilize the user/item embedding space, ensuring consistent embedding dimensions across retraining sessions. Our approach leverages a combination of efficient low-rank singular value decomposition and orthogonal Procrustes transformation to map embeddings into a standardized space. This transformation is computationally efficient, lossless, and lightweight, preserving the dot product and inference quality while reducing operational burdens. Unlike existing methods that modify training objectives or embedding structures, our approach maintains the integrity of the primary model application and can be seamlessly integrated with other stabilization techniques.
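The core alignment step described above can be sketched with a classic orthogonal Procrustes solve. Because the learned map is orthogonal, dot products between embeddings are preserved exactly, consistent with the abstract's losslessness claim; the function name is illustrative.

```python
import numpy as np


def procrustes_align(new_emb, ref_emb):
    """Orthogonal matrix R minimizing ||new_emb @ R - ref_emb||_F.

    Classic orthogonal Procrustes solution via SVD of the cross-covariance.
    Since R is orthogonal, (x @ R) . (y @ R) == x . y for all rows x, y,
    so dot-product retrieval scores are unchanged by the transformation.
    """
    u, _, vt = np.linalg.svd(new_emb.T @ ref_emb)
    return u @ vt
```

After each retraining, embeddings would be mapped through R into the standardized reference space before being served to downstream consumers, so those consumers never see an arbitrary rotation of the space.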
- INDPareto-Optimal Solution: Optimizing Engagement and Revenue
by Shaghayegh Agah (Comcast Technology AI), Shaun Schaeffer (Comcast Technology AI), Maria Peifer (Comcast Technology AI), Neeraj Sharma (Comcast Technology AI), Ankit Maheshwari (Comcast Technology AI), Sardar Hamidian (Comcast Technology AI)This paper introduces a multi-objective ranking framework deployed on a large-scale entertainment platform to jointly optimize user engagement, revenue, and content pricing. Unlike prior work, our system addresses a critical real-world challenge: extreme label imbalance across objectives, with monetization signals being over 100× sparser than engagement. To overcome this, we adopt an output aggregation strategy that supports runtime tuning of objective weights, enabling fast iteration and dynamic prioritization without retraining. We further introduce a robust offline evaluation pipeline based on Pareto analysis and distribution-aware test datasets, exposing trade-offs that would otherwise remain hidden. Beyond engagement and revenue, we incorporate a third price-based objective optimized via constrained Bayesian search over a high-dimensional simplex, demonstrating how monetization goals can be achieved without degrading user experience. Our approach is validated both offline and through online A/B tests, showing measurable revenue improvements with minimal impact on engagement. This work provides a novel, end-to-end blueprint for scalable multi-objective optimization under production constraints, where business trade-offs must be explicit, tunable, and validated.
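Two mechanisms mentioned above are easy to sketch: runtime-tunable output aggregation of per-objective scores, and a Pareto filter for offline trade-off analysis. All names and weights are illustrative assumptions, not the production implementation.

```python
def aggregate(scores, weights):
    """Combine per-objective model scores with runtime-tunable weights,
    so objective priorities can change without retraining any model."""
    return sum(weights[name] * score for name, score in scores.items())


def pareto_front(points):
    """Return the non-dominated points, maximizing every objective."""
    def dominates(q, p):
        # q dominates p if it is at least as good everywhere, better somewhere.
        return (all(qi >= pi for qi, pi in zip(q, p))
                and any(qi > pi for qi, pi in zip(q, p)))
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

Offline, each candidate weight configuration would be scored on the objectives and only configurations on the Pareto front kept as candidates for online testing.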
- INDPersonalized Interest Graphs for Theme-Driven User Behavior
by Oded Zinman (eBay Inc.), Nazmul Chowdhury (eBay Inc.), Leandro Fiaschetti (eBay Inc.), Yuri Brovman (eBay Inc.), Guy Feigenblat (eBay Inc.), Yotam Eshel (eBay Inc.)Many eBay users turn to our platform to pursue theme-centric interests that span diverse product categories—for example, a Star Wars fan might search for related video games, toys, memorabilia, and artwork. Existing recommendation systems, typically optimized for short-term engagement, often fail to surface cross-category items aligned with these deeper interests. We present an end-to-end recommendation framework built around a user-interest graph generated by an LLM chain. The graph captures user preferences at multiple levels of granularity, enabling a balance between relevance-driven and serendipity-driven recommendations. The system has been deployed at scale, serving millions of users across billions of items. An online A/B test on the eBay homepage showed a significant improvement in engagement with previously unseen categories, alongside gains in purchases and buyer count.
- INDPractical Multi-Task Learning for Rare Conversions in Ad Tech
by Yuval Dishi (Teads), Ophir Friedler (Teads), Yonatan Karni (Teads), Natalia Silberstein (Teads), Yulia Stolin (Teads)We present a Multi-Task Learning (MTL) approach for improving predictions for rare (e.g., less than 1%) conversion events in online advertising. The conversions are classified into ‘rare’ or ‘frequent’ types based on historical statistics. The model learns shared representations across all signals while specializing through separate task towers for each type. The approach was tested and fully deployed to production, demonstrating consistent improvements in both offline metrics (0.69% AUC lift) and online KPIs (2% Cost per Action reduction).
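The shared-bottom forward pass with separate task towers described above can be sketched in a few lines. Layer sizes, names, and the single-layer towers are illustrative assumptions; the production model is not public.

```python
import numpy as np


def shared_bottom_mtl(x, shared_w, towers):
    """Shared representation plus one sigmoid tower per conversion type.

    `towers` maps a task name ('rare' / 'frequent') to its head weights;
    all tasks read the same ReLU representation, so the rare-conversion
    head benefits from signal learned on frequent conversions.
    """
    hidden = np.maximum(x @ shared_w, 0.0)               # shared ReLU bottom
    return {name: 1.0 / (1.0 + np.exp(-(hidden @ w)))    # per-task sigmoid head
            for name, w in towers.items()}
```

During training, each tower would only receive gradients from its own conversion type, while the shared bottom is updated by both.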
- INDRADAR: Recall Augmentation through Deferred Asynchronous Retrieval
by Amit Jaspal (Meta Platforms, Inc.), Qian Dang (Meta Platforms, Inc.), Ajantha Ramineni (Meta)Modern large-scale recommender systems employ a multi-stage ranking funnel (Retrieval, Pre-ranking, Ranking) to balance engagement and computational constraints (latency, CPU). However, the initial retrieval stage, often relying on efficient but less precise methods like K-Nearest Neighbors (KNN), struggles to effectively surface the most engaging items from billion-scale catalogs, particularly distinguishing highly relevant and engaging candidates from merely relevant ones. We introduce Recall Augmentation through Deferred Asynchronous Retrieval (RADAR), a novel framework that leverages asynchronous, offline computation to pre-rank a significantly larger candidate set for users using the full complexity ranking model. These top-ranked items are stored and utilized as a high-quality retrieval source during online inference, bypassing online retrieval and pre-ranking stages for these candidates. We demonstrate through offline experiments that RADAR significantly boosts recall (2X Recall@200 vs DNN retrieval baseline) by effectively combining a larger retrieved candidate set with a more powerful ranking model. Online A/B tests confirm a +0.8% lift in topline engagement metrics, validating RADAR as a practical and effective method to improve recommendation quality under strict online serving constraints.
- INDRankGraph: Unified Heterogeneous Graph Learning for Cross-Domain Recommendation
by Renzhi Wu (Meta Platforms, Inc.), Junjie Yang (Meta Platforms, Inc.), Li Chen (Meta Platforms, Inc.), Hong Li (Meta Platforms, Inc.), Li Yu (Meta Platforms, Inc.), Hong Yan (Meta Platforms, Inc.)Cross-domain recommendation systems face the challenge of integrating fine-grained user and item relationships across various product domains. To address this, we introduce RankGraph, a scalable graph learning framework designed to serve as a core component in recommendation foundation models (FMs). By constructing and leveraging graphs composed of heterogeneous nodes and edges across multiple products, RankGraph enables the integration of complex relationships between users, posts, ads, and other entities. Our framework employs a GPU-accelerated Graph Neural Network and contrastive learning, allowing for dynamic extraction of subgraphs such as item-item and user-user graphs to support similarity-based retrieval and real-time clustering. Furthermore, RankGraph integrates graph-based pretrained representations as contextual tokens into FM sequence models, enriching them with structured relational knowledge. RankGraph has demonstrated improvements in click (+0.92%) and conversion rates (+2.82%) in online A/B tests, showcasing its effectiveness in cross-domain recommendation scenarios.
- INDSASRec in Action: Real-World Adaptations for ZDF Streaming Service
by Venkata Harshit Koneru (ZDF (Zweites Deutsches Fernsehen)), Xenija Neufeld (Accso – Accelerated Solutions GmbH), Sebastian Loth (ZDF (Zweites Deutsches Fernsehen)), Andreas Grün (ZDF (Zweites Deutsches Fernsehen))The ZDF streaming platform uses SASRec (Self-attentive Sequential Recommendation Model) for generating personalized recommendations. In the present study, we tested a novel combination of a) negative-item sampling strategies and b) augmenting the model’s input data with Repeated Padding (RepPad). We compared different model variants across three use cases in an A/B test. Depending on the use case, the modifications affected viewing volume and popularity in different ways.
- INDSEMORec: A Scalarized Efficient Multi-Objective Recommendation Framework
by Sofia Maria Nikolakaki (Apple), Siyong Ma (Apple), Srivas Chennu (Apple), Humeyra Topcu Altintas (Apple)Recommendation systems in multi-stakeholder environments often require optimizing for multiple objectives simultaneously to meet supplier and consumer demands. Serving recommendations in these settings relies on efficiently combining the objectives to address each stakeholder’s expectations, often through a scalarization function with pre-determined, fixed weights. In practice, selecting these weights becomes a problem in its own right. Recent work has developed algorithms that adapt these weights based on application-specific needs by using RL to train a model. While this solves automatic weight computation, such approaches are not efficient for frequent weight adaptation. They also do not allow for the human intervention that is often required by business needs. To bridge this gap, we propose a novel multi-objective recommendation framework that is highly efficient for a small number of objectives. It also enables business decision makers to easily tune the optimization by assigning different importance to each objective. Through online experiments, we demonstrate the efficacy and efficiency of our framework through improvements in online business metrics.
- INDScaling Generative Recommendations with Context Parallelism on Hierarchical Sequential Transducers
by Yue Dong (Meta Platforms), Han Li (Meta Platforms), Shen Li (Meta Platforms), Nikhil Patel (Meta Platforms), Xing Liu (Meta Platforms), Xiaodong Wang (Meta Platforms), Chuanhao Zhuge (Meta Platforms)Large-scale recommendation systems must process an immense volume of daily user interactions, requiring effective modeling of high-cardinality and heterogeneous features to ensure accurate predictions. In prior work, we introduced Hierarchical Sequential Transducers (HSTU), an attention-based architecture for modeling high-cardinality, non-stationary streaming recommendation data, which exhibits a good scaling law in the generative recommender (GR) framework. Recent studies and experiments demonstrate that attending to longer user history sequences yields significant metric improvements. However, scaling sequence length is activation-heavy, necessitating parallelism solutions to effectively shard activation memory. In transformer-based LLMs, context parallelism (CP) is a commonly used technique that distributes computation along the sequence-length dimension across multiple GPUs, effectively reducing memory usage from attention activations. In contrast, production ranking models typically utilize jagged input tensors to represent user interaction features, introducing unique CP implementation challenges. In this work, we introduce context parallelism with jagged tensor support for HSTU attention, establishing foundational capabilities for scaling up the sequence dimension. Our approach enables a 5.3× increase in supported user interaction sequence length, while achieving a 1.55× scaling factor when combined with Distributed Data Parallelism (DDP).
- INDScaling Image Variant Optimization Through Customer Bucketing and Response Caching: A Large-Scale Implementation at Amazon Prime Video
by Haiyun Jin (Amazon Prime Video), Bobby Patel (Amazon Prime Video)Multi-Armed Bandit (MAB) models are widely used in industrial recommender systems to manage the ongoing trade-off between exploration and exploitation. At scale, the computational cost of running inference for every incoming request can become prohibitively high. In this paper, we describe a practical solution deployed at Amazon Prime Video to address the cost challenges of a production MAB-based image-ranking system known as Summer. Our method combines two key strategies: (1) caching the ranking results and (2) bucketing users to distribute the inference workload across customer cohorts. We show that these strategies reduce hourly inference calls by up to 77.8% relative to an uncached fully user-specific baseline, leading to significant operational and infrastructural savings. Despite lowering inference volume, the approach maintained user engagement and improved specific outcomes such as video streams and Amazon Video (AV) purchases. A 21-day global online experiment showed a 0.02% increase in video streams (p = 0.031) and a 0.19% increase in AV purchase units (p = 0.003), demonstrating that the technique reduces inference costs without compromising user experience. We describe the system design, experimental findings, and practical considerations for applying caching and bucketing strategies at scale.
- INDScaling Retrieval for Web-Scale Recommenders: Lessons from Inverted Indexes to Embedding Search
by Yuchin Juan (LinkedIn), Jianqiang Shen (LinkedIn), Shaobo Zhang (LinkedIn), Qianqi Shen (LinkedIn), Caleb Johnson (LinkedIn), Luke Simon (LinkedIn), Liangjie Hong (LinkedIn), Wenjing Zhang (LinkedIn)Web-scale search and recommendation systems depend on efficient retrieval to manage massive datasets and user traffic. This paper chronicles our evolutionary path in building the retrieval layer at LinkedIn, progressing from a CPU-based inverted index system to a GPU-accelerated embedding-based retrieval system. Initially anchored by traditional term-based retrieval, we enhanced relevance and productivity through learning-to-retrieve approaches by generating mappings among inferred attributes. As these early efforts encountered limitations in inferring and matching attributes at scale, we transitioned to embedding-based retrieval for greater flexibility and performance, but found that existing infrastructure couldn’t support large-scale production needs. This led us to develop a GPU-based retrieval system designed for high performance, flexible modeling, and multi-objective business optimization. We present the infrastructure innovations, optimizations, and key lessons learned throughout this transition, offering practical insights for building scalable, flexible retrieval systems.
- INDSemantic IDs for Music Recommendation
by M. Jeffrey Mei (SiriusXM Radio Inc.), Florian Henkel (Spotify), Samuel E. Sandberg (SiriusXM Radio Inc.), Oliver Bembom (SiriusXM Radio Inc.), Andreas F. Ehmann (SiriusXM Radio Inc.)Training recommender systems for next-item recommendation often requires unique embeddings to be learned for each item, which may take up most of the trainable parameters for a model. Shared embeddings, such as using content information, can reduce the number of distinct embeddings to be stored in memory. This allows for a more lightweight model; correspondingly, model complexity can be increased due to having fewer embeddings to store in memory. We show the benefit of using shared content-based features (‘semantic IDs’) in improving recommendation accuracy and reducing model size for two music recommendation datasets, including an online A/B test.
- INDSimulating Discoverability for Upcoming Content in TV Entertainment Platforms
by Adeep Hande (Applied AI Research, Comcast), Kishorekumar Sundararajan (Applied AI Research, Comcast), Yidnekachew Endale (Applied AI Research, Comcast), Sardar Hamidian (Applied AI Research, Comcast)In entertainment platforms, search and browse are critical entry points for content discovery. Yet, newly ingested titles often fail to surface at the moment of highest user interest due to a range of practical issues: lack of user-item interaction data, cold-start sparsity, or filtering strategies that deprioritize fresh content. These visibility gaps are difficult to detect before user complaints or engagement drops emerge. We present a simulation-based evaluation framework that assesses the discoverability of upcoming content that is about to be released or has just been ingested into our catalog. Our system uses large language models, grounded in item metadata and historical query patterns, to generate realistic search queries that reflect how users are likely to look for content. These synthetic queries are executed in a staging environment that mirrors production, capturing UI-level responses to compute a discoverability score for each entity. The score identifies visibility risks without modifying the search engine itself, enabling proactive editorial and QA interventions. Integrated into Comcast’s daily workflows, this framework scales to thousands of titles and supports operational search quality assurance. While built for voice- and text-based entertainment search, the approach generalizes to other recommendation and retrieval systems that face similar black-box surfacing challenges.
- INDSocRipple: A Two-Stage Framework for Cold-Start Video Recommendations
by Amit Jaspal (Meta Platforms, Inc.), Kapil Dalwani (Meta Platforms, Inc.), Ajantha Ramineni (Meta Platforms, Inc.)Most industry-scale recommender systems face critical cold-start challenges—new items lack interaction history, making it difficult to distribute them in a personalized manner. Standard collaborative filtering models underperform due to sparse engagement signals, while content-only approaches lack user-specific relevance. We propose SocRipple, a novel two-stage retrieval framework tailored for cold-start item distribution on social graph-based platforms. Stage 1 leverages the creator’s social connections for targeted initial exposure. Stage 2 builds on early engagement signals and stable user embeddings—learned from historical interactions—to “ripple” outwards via K-Nearest Neighbor (KNN) search. Large-scale experiments on a major video platform show that SocRipple boosts cold-start item distribution by +36% while maintaining the user engagement rate on cold-start items, effectively balancing new-item exposure with personalized recommendations.
- INDStream Normalization for CTR Prediction
by Yizhou Sang (JD.COM), Congcong Liu (JD.COM), Yuying Chen (The Hong Kong University of Science and Technology), Zhiwei Fang (JD.COM), Xue Jiang (JD.COM), Changping Peng (JD.COM), Zhangang Lin (JD.COM), Ching Law (JD.COM), Jingping Shao (JD.COM)Deep learning models often encounter significant challenges when dealing with non-i.i.d. and non-stationary data, particularly in incremental learning tasks such as CTR prediction. Traditional normalization techniques, including Batch Normalization and Layer Normalization, are limited in their ability to maintain stability and adaptability under rapidly evolving data distributions, leading to degraded model performance. To address these limitations, we propose Stream Normalization (SN), a dynamic and adaptive normalization framework designed to continuously align normalization statistics with shifting data distributions in real-time. SN leverages specialized normalization modules, each optimized to capture distinct statistical patterns inherent in streaming data. Such design enhances model robustness and mitigates the risk of catastrophic forgetting by continuously adapting its normalization strategy. The SN layer is a versatile plugin that enhances model robustness across various normalization settings. Extensive experiments demonstrate that SN achieves state-of-the-art performance on offline datasets, representing a significant advancement in incremental learning for streaming data.
- INDStreaming Trends: A Low-Latency Platform for Dynamic Video Grouping and Trending Corpora Building
by Yang Gu (Google), Caroline Zhou (Google), Qiao Zhang (Google), Scott Wang (Google), Yongzhe Wang (Google), Li Zhang (Google), Nikos Parotsidis (Google), Cj Carey (Google), Ashkan Fard (Google), Mingyan Gao (Google), Yaping Zhang (Google), Sourabh Bansod (Google)This paper presents Streaming Trends, a real-time system deployed on a short-form video platform that enables dynamic content grouping, tracking videos from upload to their identification as part of a trend. Addressing the latency inherent in traditional batch processing for short-form video, Streaming Trends utilizes online clustering and flexible similarity measures to associate new uploads with relevant groups in near real-time. The system combines online processing for immediate updates triggered by uploads and seed queries with offline processes for similarity modeling and cluster quality maintenance. By facilitating the rapid identification and association of trending videos, Streaming Trends significantly enhances content discovery and user value on the platform.
- INDSuggest, Complement, Inspire: Story of Two-Tower Recommendations at Allegro.com
by Aleksandra Osowska-Kurczab (Allegro.com), Klaudia Nazarko (Allegro.com), Mateusz Marzec (Allegro.com), Lidia Wojciechowska (Allegro.com), Eliška Kremeňová (Allegro.com)Building large-scale e-commerce recommendation systems requires addressing three key technical challenges: (1) designing a universal recommendation architecture across dozens of placements, (2) decreasing excessive maintenance costs, and (3) managing a highly dynamic product catalogue. This paper presents a unified content-based recommendation system deployed at Allegro.com, the largest e-commerce platform of European origin. The system is built on a prevalent Two Tower retrieval framework, representing products using textual and structured attributes, which enables efficient retrieval via Approximate Nearest Neighbour search. We demonstrate how the same model architecture can be adapted to serve three distinct recommendation tasks: similarity search, complementary product suggestions, and inspirational content discovery, by modifying only a handful of components in either the model or the serving logic. Extensive A/B testing over two years confirms significant gains in engagement and profit-based metrics across desktop and mobile app channels. Our results show that a flexible, scalable architecture can serve diverse user intents with minimal maintenance overhead.
- INDThe Future is Sparse: Embedding Compression for Scalable Retrieval in Recommender Systems
by Petr Kasalický (Czech Technical University in Prague), Martin Spišák (Recombee), Vojtěch Vančura (Recombee), Daniel Bohuněk (Recombee), Rodrigo Alves (Recombee), Pavel Kordík (Recombee)Industry-scale recommender systems face a core challenge: representing entities with high cardinality, such as users or items, using dense embeddings that must be accessible during both training and inference. However, as embedding sizes grow, memory constraints make storage and access increasingly difficult. We describe a light-weight, learnable embedding compression technique that projects dense embeddings into a high-dimensional, sparsely activated space. Designed for retrieval tasks, our method reduces memory requirements while preserving retrieval performance, enabling scalable deployment under strict resource constraints. Our results demonstrate that leveraging sparsity is a promising approach for improving the efficiency of large-scale recommenders.
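The core idea of this abstract, projecting dense embeddings into a higher-dimensional but sparsely activated space, can be sketched in a few lines of NumPy. This is not the paper's learnable compression: a fixed random projection with top-k magnitude sparsification stands in for it, and the dimensions are illustrative.

```python
import numpy as np

def sparsify(dense: np.ndarray, proj: np.ndarray, k: int) -> np.ndarray:
    """Project a dense embedding into a higher-dimensional space and keep
    only the k largest-magnitude activations (all others set to zero)."""
    high = dense @ proj                      # (d,) @ (d, D) -> (D,)
    out = np.zeros_like(high)
    top = np.argsort(np.abs(high))[-k:]      # indices of the top-k activations
    out[top] = high[top]
    return out

rng = np.random.default_rng(0)
d, D, k = 64, 1024, 32                       # illustrative sizes
proj = rng.standard_normal((d, D)) / np.sqrt(d)
emb = rng.standard_normal(d)
sparse = sparsify(emb, proj, k)
```

The memory benefit comes from storing only the k index/value pairs per entity instead of a dense D-dimensional (or even d-dimensional) vector, while inverted-index style retrieval can operate directly on the active coordinates.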
- INDUSD: A User-Intent-Driven Sampling and Dual-Debiasing Framework for Large-Scale Homepage Recommendations
by Jiaqi Zheng (Taobao & Tmall Group of Alibaba), Cheng Guo (Taobao & Tmall Group of Alibaba), Yi Cao (Taobao & Tmall Group of Alibaba), Chaoqun Hou (Taobao & Tmall Group of Alibaba), Tong Liu (Taobao & Tmall Group of Alibaba), Bo Zheng (Taobao & Tmall Group of Alibaba)Large-scale homepage recommendations face critical challenges from pseudo-negative samples caused by exposure bias, where non-clicks may indicate inattention rather than disinterest. Existing work lacks thorough analysis of invalid exposures and typically addresses isolated aspects (e.g., sampling strategies), overlooking the critical impact of pseudo-positive samples — such as homepage clicks merely to visit marketing portals. We propose a unified framework for large-scale homepage recommendation sampling and debiasing. Our framework consists of two key components: (1) a user intent-aware negative sampling module to filter invalid exposure samples, and (2) an intent-driven dual-debiasing module that jointly corrects exposure bias and click bias. Extensive online experiments on Taobao demonstrate the efficacy of our framework, achieving significant improvements in user click-through rates (UCTR) by 35.4% and 14.5% in two variants of the marketing block on the Taobao homepage, Baiyibutie and Taobaomiaosha.
- INDUnified Survey Modeling to Limit Negative User Experiences in Recommendation Systems
by Chenghui Yu (TikTok, Inc.), Haoze Wu (TikTok, Inc.), Jian Ding (TikTok, Inc.), Bingfeng Deng (TikTok, Inc.), Hongyu Xiong (TikTok, Inc.)Reducing negative user experiences is crucial for the success of recommendation platforms. Exposure to inappropriate content can not only harm users’ psychological well-being but also drive them away, ultimately undermining the platform’s long-term growth. However, recommendation algorithms often prioritize positive feedback signals due to the relative scarcity of negative ones, which may lead to the oversight of valuable negative user feedback. In this paper, we propose a method that leverages in-feed surveys to collect user feedback, models this feedback, and integrates the predictions into the recommendation system. We enhance the personalized survey model based on the HoME framework. Our experiments demonstrate that the proposed method significantly outperforms the baseline model, with an average 0.52% AUC increase and a 1.38% LogLoss decline across all heads. After deploying the model in the TikTok app, we observe 0.82% and 0.67% increases in survey_like_rate and Likes, and 4.08%, 2.51%, and 2.59% reductions in survey_inappropriate_rate, Reports, and Dislikes, respectively, illustrating an improvement in overall recommendation quality and a decline in negative signals.
- INDUser Long-Term Multi-Interest Retrieval Model for Recommendation
by Yue Meng (Taobao & Tmall Group of Alibaba), Cheng Guo (Taobao & Tmall Group of Alibaba), Xiaohui Hu (Taobao & Tmall Group of Alibaba), Honghu Deng (Tsinghua University), Yi Cao (Taobao & Tmall Group of Alibaba), Tong Liu (Taobao & Tmall Group of Alibaba), Bo Zheng (Taobao & Tmall Group of Alibaba)User behavior sequence modeling, which captures user interest from rich historical interactions, is pivotal for industrial recommendation systems. Despite breakthroughs in ranking-stage models capable of leveraging ultra-long behavior sequences scaling up to thousands of behaviors, existing retrieval models remain constrained to sequences of hundreds of behaviors due to two main challenges. One is the strict latency budget imposed by real-time serving over a large-scale candidate pool. The other is the absence of target-aware mechanisms and cross-interaction architectures, which prevents utilizing ranking-like techniques to simplify long-sequence modeling. To address these limitations, we propose a new framework named User Long-term Multi-Interest Retrieval Model (ULIM), which enables thousand-scale behavior modeling in the retrieval stage. ULIM includes two novel components: 1) Category-Aware Hierarchical Dual-Interest Learning partitions long behavior sequences into multiple category-aware sub-sequences representing multiple interests and jointly optimizes long-term and short-term interests within each interest cluster. 2) Pointer-Enhanced Cascaded Category-to-Item Retrieval introduces a Pointer-Generator Interest Network (PGIN) for next-category prediction, followed by next-item retrieval over the top-K predicted categories. Comprehensive experiments on the Taobao dataset show that ULIM achieves substantial improvements over state-of-the-art methods, and brings lifts of 5.54% in clicks, 11.01% in orders, and 4.03% in GMV for Taobaomiaosha, a notable mini-app of Taobao.
- INDYou Say Search, I Say Recs: A Scalable Agentic Approach to Query Understanding and Exploratory Search at Spotify
by Enrico Palumbo (Spotify), Marcus Isaksson (Spotify), Alexandre Tamborrino (Spotify), Maria Movin (Spotify), Catalin Dincu (Spotify), Ali Vardasbi (Spotify), Lev Nikeshkin (Spotify), Oksana Gorobets (Spotify), Anders Nyman (Spotify), Poppy Newdick (Spotify), Hugues Bouchard (Spotify), Paul Bennett (Spotify), Mounia Lalmas (Spotify), Dani Doro (Spotify), Christine Doig Cardet (Spotify), Ziad Sultan (Spotify)On online content platforms, users often aim to explore the catalog and discover new, personalized content through exploratory searches—such as “new releases for me.” Traditional search systems, which prioritize lexical and semantic matching over personalized retrieval, have historically struggled to support this type of intent. In contrast, recommendation services that leverage user-item and item-item signals tend to be more effective for addressing exploratory queries. Agentic technologies offer a promising opportunity to enhance exploratory search by harnessing large language models (LLMs) to interpret complex query intents and route them to the most suitable downstream services. However, deploying such agentic systems at scale remains a significant challenge. In this paper, we present a scalable agentic approach to query understanding and exploratory search at Spotify. Our system combines an LLM router, post-training adaptation techniques, search and recommendation APIs, and specialized sub-agents to interpret user intent and deliver personalized results at scale. We outline the high-level system design and share key experimental results. By addressing the limitations of conventional search, our approach yields substantial improvements across several exploratory use cases, including discovering similar artists (+115%), broad podcast searches (+15%), new music releases (+91%), and broad music searches (+25%).
- INDZero-shot Cross-domain Knowledge Distillation: A Case study on YouTube Music
by Srivaths Ranganathan (Google LLC), Chieh Lo (Google LLC), Bernardo Cunha (Google LLC), Nikhil Khani (Google), Li Wei (Google), Aniruddh Nath (Google), Shawn Andrews (Google LLC), Gergo Varady (Google LLC), Yanwei Song (Google LLC), Jochen Klingenhoefer (Google LLC), Tim Steele (Google LLC)Knowledge Distillation (KD) has been widely used to improve the quality of latency-sensitive models serving live traffic. However, applying KD in production recommender systems with low traffic is challenging: the limited amount of data restricts the teacher model size, and the cost of training a large dedicated teacher may not be justified. Cross-domain KD offers a cost-effective alternative by leveraging a teacher from a data-rich source domain, but introduces unique technical difficulties, as the features, user interfaces, and prediction tasks can differ significantly. We present a case study of using zero-shot cross-domain KD for multi-task ranking models, transferring knowledge from a (100x) larger-scale video recommendation platform (YouTube) to a music recommendation application with significantly lower traffic. We share offline and live experiment results and present findings evaluating different KD techniques in this setting across two ranking models on the music app. Our results demonstrate that zero-shot cross-domain KD is a practical and effective approach to improving the performance of ranking models on low-traffic surfaces.
List of all Late-breaking Results papers accepted for RecSys 2025 (in alphabetical order).
- LBRA Dual-Key Attention Framework for Sequential Recommendation with Side Information
by Minje Kim (Gyeongsang National University), Wooseung Kang (Gyeongsang National University), Gun-Woo Kim (Gyeongsang National University), Chie Hoon Song (Gyeongsang National University), Suwon Lee (Gyeongsang National University), Sang-Min Choi (Gyeongsang National University)Sequential recommendation (SR) aims to predict usersʼ future interactions based on their historical behavior. Recently, deep learning-based SR models leveraging side information have gained considerable attention. Within these systems, items can be viewed from relation-based and attribute-based perspectives. The relation-based perspective characterizes items based on implicit relationships and contextual dependencies derived from user interactions. The attribute-based perspective defines items using inherent properties such as category or genre. However, these perspectives are inherently entangled, making separate learning challenging. To address this issue, we propose a dual-key attention framework for sequential recommendation (DK-SR), which effectively learns both relation-based and attribute-based representations. DK-SR employs an attention mechanism with dual keys: one for item-level attention facilitating relation-based representation learning, and another for attribute-level attention enhancing attribute-based representation. Extensive experiments on four real-world datasets demonstrate that our model outperforms six state-of-the-art SR models leveraging side information. Additionally, an ablation study validates the contribution of the dual-key mechanism.
- LBRAddressing Multiple Hypothesis Bias in CTR Prediction for Ad Selection
by Oren Sar Shalom (Linkedin), Neil Daftary (Linkedin)Predicting click-through rates (CTR) for candidate advertisements is central to many online recommendation and ad-serving systems. However, selecting top-ranked ads based on predicted CTR (pCTR) inherently introduces a systematic bias: since each pCTR contains random estimation error, ads ranked highest tend to exhibit positive error, leading to overestimation of true CTR and miscalibration. Furthermore, as the number of candidates grows, the extreme order statistics amplify this so-called Multiple Hypothesis Bias. Proper calibration of pCTR ensures that estimated probabilities match observed click frequencies, which is essential for setting accurate bids and maximizing revenue in ad auctions. Without reliable calibration, high-accuracy models can still misprice impressions, resulting in both lost revenue and inefficient budget allocation. In this paper, we (1) formally define the bias arising from ranking by noisy estimates and (2) derive an estimator to correct pCTR by subtracting the expected error under mild distributional assumptions. Experiments on large-scale ad data show significant improvements in calibration metrics across multiple ad settings.
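The overestimation this abstract formalizes is easy to reproduce in a toy simulation: when ads are ranked by noisy pCTR estimates, the winner's estimation error is positive in expectation, and grows with the candidate count. The CTR range, noise level, and candidate counts below are illustrative, not taken from the paper.

```python
import random

random.seed(42)

def selection_bias(n_candidates: int, noise: float, trials: int) -> float:
    """Average (pCTR - true CTR) of the top-ranked ad when ranking by noisy
    estimates; a positive value indicates systematic overestimation."""
    total = 0.0
    for _ in range(trials):
        ads = []
        for _ in range(n_candidates):
            true_ctr = random.uniform(0.01, 0.05)
            pctr = true_ctr + random.gauss(0.0, noise)  # noisy estimate
            ads.append((pctr, true_ctr))
        best_pctr, best_true = max(ads)  # select the ad with the highest pCTR
        total += best_pctr - best_true
    return total / trials

small = selection_bias(n_candidates=5, noise=0.01, trials=2000)
large = selection_bias(n_candidates=100, noise=0.01, trials=2000)
```

The simulation shows both effects the paper targets: the selected ad's pCTR overestimates its true CTR on average, and the gap widens as the candidate set grows, matching the order-statistics argument for the Multiple Hypothesis Bias.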
- LBRBalancing Accuracy and Novelty with Sub-Item Popularity
by Chiara Mallamaci (Politecnico di Bari), Aleksandr V. Petrov (University of Glasgow / Viator, TripAdvisor), Alberto Carlo Maria Mancino (Politecnico di Bari), Vito Walter Anelli (Politecnico di Bari), Tommaso Di Noia (Politecnico di Bari), Craig Macdonald (University of Glasgow)In the realm of music recommendation, sequential recommenders have shown promise in capturing the dynamic nature of music consumption. A key characteristic of this domain is repetitive listening, where users frequently replay familiar tracks. To capture these repetition patterns, recent research has introduced Personalised Popularity Scores (PPS), which quantify user-specific preferences based on historical frequency. While PPS enhances relevance in recommendation, it often reinforces already-known content, limiting the system’s ability to surface novel or serendipitous items—key elements for fostering long-term user engagement and satisfaction. To address this limitation, we build upon RecJPQ, a Transformer-based framework initially developed to improve scalability in large-item catalogues through sub-item decomposition. We repurpose RecJPQ’s sub-item architecture to model personalised popularity at a finer granularity. This allows us to capture shared repetition patterns across sub-embeddings—latent structures not accessible through item-level popularity alone. We propose a novel integration of sub-ID-level personalised popularity within the RecJPQ framework, enabling explicit control over the trade-off between accuracy and personalised novelty. Our sub-ID-level PPS method (sPPS) consistently outperforms item-level PPS by achieving significantly higher personalised novelty without compromising recommendation accuracy.
- LBRBenefiting from Negative yet Informative Feedback by Contrasting Opposing Sequential Patterns
by Veronika Ivanova (Skolkovo Institute of Science and Technology), Evgeny Frolov (AIRI), Alexey VasilevWe consider the task of learning from both positive and negative feedback in a sequential recommendation scenario, as both types of feedback are often present in user interactions. Meanwhile, conventional sequential learning models usually focus on modeling and predicting positive interactions, ignoring the fact that reducing the number of negatively received items in recommendations improves user satisfaction with the service. Moreover, negative feedback can potentially provide a useful signal for more accurate identification of true user interests. In this work, we propose to train two transformer encoders on separate positive and negative interaction sequences. We incorporate both types of feedback into the training objective of the sequential recommender using a composite loss function that includes positive and negative cross-entropy as well as a cleverly crafted contrastive term that helps better model opposing patterns. We demonstrate the effectiveness of this approach in terms of increasing true-positive metrics compared to state-of-the-art sequential recommendation methods while reducing the number of wrongly promoted negative items.
- LBRBeyond Clicks: Eye-Tracking Insights into User Responses to Different Recommendation Types
by Georgios Koutroumpas (Telefónica Scientific Research), Matteo Mazzini (University of Padua), Sebastian Idesis (Telefónica Scientific Research), Mireia Masias Bruns (Telefónica Scientific Research), Joemon Jose (University of Glasgow), Sergi Abadal (Universitat Politècnica de Catalunya), Ioannis Arapakis (Telefónica Scientific Research)Modern recommender systems increasingly rely on implicit human feedback to enhance recommendation quality, personalization, and user engagement. In e-commerce, eye-tracking has emerged as a valuable tool for capturing attention and preference, yet little work has explored how users behave across different recommendation categories. In this study, we analyze eye-tracking data from users exposed to four recommendation types—Exact, Substitute, Complement, and Irrelevant—in a query-based setting. Our results reveal consistent patterns: users exhibit predictable, text-focused viewing for Exact and Substitute items, while Complement and Irrelevant items trigger more distributed, exploratory behavior. Notably, Irrelevant items elicit higher emotional arousal associated with disengagement—a pattern not seen with Complement items, suggesting the latter may increase diversity without harming user experience. These findings highlight the importance of considering recommendation context in user modeling, and provide a foundation for future work on context-aware recommender systems and the use of eye-tracking data.
- LBRDebiasing Implicit Feedback Recommenders via Sliced Wasserstein Distance-based Regularization
by Gustavo Escobedo (Johannes Kepler University Linz), David Penz (TU Wien), Markus Schedl (Johannes Kepler University Linz)Recommendation models often encode users’ sensitive attributes (e.g., gender or age) in their learned representations during training, leading to biased (e.g., stereotypical) recommendations and potential privacy risks. To address this, previous research has predominantly focused on adversarial training to make user representations invariant to sensitive attributes. However, adversarial methods can be unstable and computationally expensive due to additional network parameters. An alternative approach is the use of regularization losses that minimize distributional discrepancies between different demographic groups during training. In particular, the Sliced Wasserstein Distance (SWD) provides a computationally efficient and stable solution for mitigating bias by directly aligning the distributions of user representations across groups. We follow this alternative strategy and propose an in-processing approach to mitigate encoded biases in user representations of implicit feedback-based recommender systems by using SWD-based regularization. We perform extensive experiments targeting the debiasing of the users’ gender on three datasets ML-1M, LFM2b-DB, and EB-NeRD from the movie, music, and news domains, respectively. Our results indicate that SWD-based regularization is an effective approach for mitigating encoded biases in user representations while keeping competitive recommendation accuracy.
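The Sliced Wasserstein Distance used as a regularizer in this paper has a simple Monte-Carlo form: project both groups of user representations onto random directions, then solve the resulting one-dimensional transport problems by sorting. The sketch below assumes equal group sizes and uses a hypothetical function name; a training loop would add this value, scaled by a coefficient, to the recommendation loss.

```python
import math
import random

def sliced_wasserstein(xs, ys, n_projections=50, seed=0):
    """Monte-Carlo SWD between two equal-sized sets of d-dim vectors (sketch)."""
    rng = random.Random(seed)
    d = len(xs[0])
    total = 0.0
    for _ in range(n_projections):
        # sample a random direction on the unit sphere
        theta = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(t * t for t in theta))
        theta = [t / norm for t in theta]
        # project each group onto the direction; sorting solves 1-D transport
        px = sorted(sum(a * b for a, b in zip(x, theta)) for x in xs)
        py = sorted(sum(a * b for a, b in zip(y, theta)) for y in ys)
        total += sum((a - b) ** 2 for a, b in zip(px, py)) / len(px)
    return total / n_projections
```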
- LBRDescribe What You See with Multimodal Large Language Models to Enhance Video Recommendations
by Marco De Nadai (Spotify), Andreas Damianou (Spotify), Mounia Lalmas (Spotify)Existing video recommender systems rely primarily on user-defined metadata or on low-level visual and acoustic signals extracted by specialised encoders. These low-level features describe what appears on the screen but miss deeper semantics such as intent, humour, and world knowledge that make clips resonate with viewers. For example, is a 30‑second clip simply a singer on a rooftop, or an ironic parody filmed amid the fairy‑chimneys of Cappadocia, Turkey? Such distinctions are critical to personalised recommendations yet remain invisible to traditional encoding pipelines. In this paper, we introduce a simple, recommendation system‑agnostic zero-finetuning framework that injects high‑level semantics into the recommendation pipeline by prompting an off‑the‑shelf Multimodal Large Language Model (MLLM) to summarise each clip into a rich natural‑language description (e.g. “a superhero parody with slapstick fights and orchestral stabs”), bridging the gap between raw content and user intent. We use MLLM output with a state‑of‑the‑art text encoder and feed it into standard collaborative, content‑based, and generative recommenders. On the MicroLens‑100K dataset, which emulates user interactions with TikTok‑style videos, our framework consistently surpasses conventional video, audio, and metadata features in five representative models. Our findings highlight the promise of leveraging MLLMs as on‑the‑fly knowledge extractors to build more intent‑aware video recommenders.
- LBRDon’t Get Ahead of Yourself: A Critical Study on Data Leakage in Offline Evaluation of Sequential Recommenders
by Huy Hoang Le (University of Helsinki), Yang Liu (University of Helsinki), Alan Medlar (University of Helsinki), Dorota Glowacka (University of Helsinki)While previous studies have investigated data leakage in recommendation, their findings have had little impact on research practice. These studies show that data leakage exists, it can inflate evaluation metrics, and may cause pathological outcomes, such as models predicting items from the future. However, temporal leave-one-out, the data splitting strategy most widely used to evaluate sequential recommenders, remains prevalent, even though it is known to suffer from data leakage. We found ourselves asking the question: if so many researchers appear unconcerned with data leakage, maybe it’s not such a big deal? In this article, we investigate data leakage in offline evaluation of sequential recommenders. We compare temporal leave-one-out with split-by-timepoint leave-one-out, a comparable data splitting strategy that prevents data leakage. Across four data sets, we show that sampled nDCG@10 drops by 21.7-73.4% with split-by-timepoint leave-one-out. This performance drop is primarily due to the absence of data leakage as controlling for training set size between data splitting strategies yields similar results. Our work highlights the severity of data leakage in sequential recommendation studies and suggests a need to reconsider current research practices and to question the veracity of prior studies.
- LBREnd-to-End Time Interval-wise Segmentation for Sequential Recommendation
by Minje Kim (Gyeongsang National University), Wooseung Kang (Gyeongsang National University), Gun-Woo Kim (Gyeongsang National University), Chie Hoon Song (Gyeongsang National University), Suwon Lee (Gyeongsang National University), Sang-Min Choi (Gyeongsang National University)Sequential recommendation aims to predict a user’s next interaction based on their historical behavior. While recent models have achieved remarkable success, they often overlook time intervals between interactions or rely on fixed thresholds for session segmentation, which can lead to suboptimal results. To address these limitations, several approaches incorporate time intervals via relative positional embeddings or session segmentation based on fixed thresholds. However, these methods are highly sensitive to threshold selection and are prone to inaccurate segmentation. Inspired by these challenges, we propose TiSRec, a Time Interval-wise Segmentation framework that dynamically divides user sequences into Local Preference Blocks (LPBs) by selecting significant time intervals. TiSRec captures evolving user preferences through intra-block and inter-block encoders. Experiments on four real-world datasets demonstrate that TiSRec consistently outperforms state-of-the-art methods, and ablation studies confirm the effectiveness of LPB-based modeling.
- LBREvaluating Podcast Recommendations with Profile-Aware LLM-as-a-Judge
by Francesco Fabbri (Spotify), Gustavo Penha (Spotify), Edoardo D’Amico (Spotify), Alice Wang (Spotify), Marco De Nadai (Spotify), Jackie Doremus (Spotify), Paul Gigioli (Spotify), Andreas Damianou (Spotify), Oskar Stål (Spotify), Mounia Lalmas (Spotify)Evaluating personalized recommendations remains a central challenge, especially in long-form audio domains like podcasts, where traditional offline metrics suffer from exposure bias and online methods such as A/B testing are costly and operationally constrained. In this paper, we propose a novel framework that leverages Large Language Models (LLMs) as offline judges to assess the quality of podcast recommendations in a scalable and interpretable manner. Our two-stage profile-aware approach first constructs natural-language user profiles distilled from 90 days of listening history. These profiles summarize both topical interests and behavioral patterns, serving as compact, interpretable representations of user preferences. Rather than prompting the LLM with raw data, we use these profiles to provide high-level, semantically rich context—enabling the LLM to reason more effectively about alignment between a user’s interests and recommended episodes. This reduces input complexity and improves interpretability. The LLM is then prompted to deliver fine-grained pointwise and pairwise judgments based on the profile-episode match. In a controlled study with 47 participants, our profile-aware judge matched human judgments with high fidelity and outperformed or matched a variant using raw listening histories. The framework enables efficient, profile-aware evaluation for iterative testing and model selection in recommender systems.
- LBRFine-tuning for Inference-efficient Calibrated Recommendations
by Oleg Lesota (Johannes Kepler University Linz), Adrian Bajko (Johannes Kepler University Linz), Max Walder (Johannes Kepler University Linz), Matthias Wenzel (Johannes Kepler University Linz), Antonela Tommasel (CONICET-UNCPBA), Markus Schedl (Johannes Kepler University Linz)Calibration is the degree to which a recommender system is able to match the distribution of a certain item attribute among the items consumed by a user with their respective recommendations. Recent work suggests that many recommenders tend to provide miscalibrated recommendations. Furthermore, most approaches aimed at improving calibration adopt the post-processing paradigm, making them computationally costly at inference time. This work proposes CaliTune, a fine-tuning approach for collaborative filtering-based recommenders that allows them to generate better-calibrated recommendations without relying on costly post-processing. We compare CaliTune to an established post-processing approach on two backbone models and datasets from movie and music domains, focusing on popularity calibration. Our results suggest that CaliTune can offer a competitive accuracy–calibration trade-off in several settings, particularly when the backbone model exhibits high miscalibration and accuracy remains important, making it a promising inference-efficient alternative in such cases.
- LBRFrom Previous Plays to Long-Term Tastes
by Robin Ungruh (Delft University of Technology), Alejandro Bellogín (Universidad Autónoma de Madrid), Maria Soledad Pera (Delft University of Technology)Studying the interplay of children and recommender systems (RS) is ethically and practically challenging, making simulation a promising alternative for exploration. However, recent simulation approaches that aim to model natural user-RS interactions typically rely on behavioral data and assume that user preferences remain consistent over time—an assumption that may not hold for children who undergo continuous developmental changes. With that in mind, we explore the extent to which simulations based on historical data can meaningfully reflect children’s long-term consumption patterns. We do this via a simulation study using real-world data in which user behavior is modeled from observed listening preferences. Specifically, we probe whether simulation mirrors user preferences over time by comparing with organic (i.e., real) consumption patterns. Our findings offer a critical reflection on the reliability of simulation-based RS research for children and question the validity of using behavioral assumptions to model users.
- LBRHow Fair is Your Diffusion Recommender Model?
by Daniele Malitesta (CentraleSupélec, Inria, Université Paris-Saclay), Giacomo Medda (University of Cagliari), Erasmo Purificato (European Commission, Joint Research Centre (JRC)), Mirko Marras (University of Cagliari), Fragkiskos Malliaros (CentraleSupélec, Inria, Université Paris-Saclay), Ludovico Boratto (University of Cagliari)Diffusion-based learning has settled as a rising paradigm in generative recommendation, outperforming traditional approaches built upon variational autoencoders and generative adversarial networks. Despite their effectiveness, concerns have been raised that diffusion models – widely adopted in other machine-learning domains – could potentially lead to unfair outcomes, since they are trained to recover data distributions that often encode inherent biases. Motivated by the related literature, and acknowledging the extensive discussion around bias and fairness aspects in recommendation, we propose, to the best of our knowledge, the first empirical study of fairness for DiffRec, chronologically the pioneer technique in diffusion-based recommendation. Our empirical study involves DiffRec and its variant L-DiffRec, tested against nine recommender systems on two benchmarking datasets to assess recommendation utility and fairness from both consumer and provider perspectives. Specifically, we first evaluate the utility and fairness dimensions separately and, then, within a multi-criteria setting to investigate whether, and to what extent, these approaches can achieve a trade-off between the two. While showing worrying trends in alignment with the more general machine-learning literature on diffusion models, our results also indicate promising directions to address the unfairness issue in future work.
- LBRInvestigating Carbon Footprint of Recommender Systems Beyond Training Time
by Josef Schodl (Johannes Kepler University Linz), Oleg Lesota (Johannes Kepler University Linz), Antonela Tommasel (CONICET-UNCPBA), Markus Schedl (Johannes Kepler University Linz)The environmental footprint of recommender systems has received growing attention in the research community. While recent work has examined the trade-off between model accuracy and the estimated carbon emissions during training, we argue that a comprehensive evaluation should also account for the emissions produced during inference time, especially in applications where models are deployed for extended periods with frequent inference cycles. In this study, we extend previous carbon footprint analyses from the literature by incorporating the inference phase into the carbon footprint assessment and exploring how variations in training configurations affect emissions. Our findings reveal that models with higher training emissions can, in some cases, offer lower environmental costs at inference time. Moreover, we show that minimizing the number of validation metrics computed during training can lead to significant reductions in overall carbon footprint, highlighting the importance of thoughtful experimental design in sustainable machine learning.
- LBRLearning geometry-aware recommender systems with manifold regularization
by Zaira Zainulabidova (ITMO University), Julia Borisova (ITMO University), Alexander Hvatov (ITMO University)Recent work shows that hyperbolic geometry may be a better option for recommendation systems in some cases due to the natural hierarchy present in user demands. However, the choice of geometry often determines the model architecture by fixing the type of embedding. This paper discusses the manifold regularization problem statement, which allows for preserving the original architecture and standard embeddings while imposing a non-strict geometry constraint. We demonstrate using hyperbolic geometry for neural collaborative filtering in two distinct recommendation tasks based on multilayer perceptron (MLP) networks: top-k recommendation and explicit rating prediction. For a more comprehensive architecture, we also test SASRec. All tasks are evaluated on the Amazon Reviews and MovieLens1M datasets. Experiments show that manifold regularization achieves performance comparable to hyperbolic embeddings on datasets with hierarchical structure without requiring changes to the model architecture and thus leaves initial model inference unaffected.
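A manifold regularizer of the kind this abstract describes keeps ordinary Euclidean embeddings but adds a penalty driven by hyperbolic geometry. The standard closed form of the Poincaré-ball distance, which such a penalty could be built on, is sketched below; the function name and the clamping constant are our illustrative choices, not the paper's implementation.

```python
import math

def poincare_distance(u, v, eps=1e-9):
    """Distance on the Poincare ball:
    d(u, v) = arcosh(1 + 2*|u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))."""
    diff = sum((a - b) ** 2 for a, b in zip(u, v))
    nu = sum(a * a for a in u)  # squared norm of u (must be < 1)
    nv = sum(b * b for b in v)
    x = 1.0 + 2.0 * diff / max((1.0 - nu) * (1.0 - nv), eps)
    return math.acosh(x)
```

Near the boundary of the ball, distances blow up relative to the Euclidean ones, which is what lets hierarchies embed with low distortion.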
- LBRLeveraging Geometric Insights in Hyperbolic Triplet Loss for Improved Recommendations
by Viacheslav Yusupov (HSE University), Maxim Rakhuba (HSE University), Evgeny Frolov (AIRI)Recent studies have demonstrated the potential of hyperbolic geometry for capturing complex patterns from interaction data in recommender systems. In this work, we introduce a novel hyperbolic recommendation model that uses geometrical insights to improve representation learning and increase computational stability at the same time. We reformulate the notion of hyperbolic distances to unlock additional representation capacity over conventional Euclidean space and learn more expressive user and item representations. To better capture user-item interactions, we construct a triplet loss that models ternary relations between users and their corresponding preferred and non-preferred choices through a mix of pairwise interaction terms driven by the geometry of data. Our hyperbolic approach not only outperforms existing Euclidean and hyperbolic models but also reduces popularity bias, leading to more diverse and personalized recommendations.
- LBRLift It Up Right: A Recommender System for Safer Lifting Postures
by Gaetano Dibenedetto (University of Bari Aldo Moro), Pasquale Lops (University of Bari Aldo Moro), Marco Polignano (University of Bari Aldo Moro), Helma Torkamaan (Delft University of Technology)Work-related musculoskeletal disorders, often caused by poor lifting posture and unsafe manual handling, continue to pose a significant threat to worker health and safety. This paper presents a health recommender system designed to prevent injury by assessing and correcting posture for lifting techniques. Leveraging monocular video input, our method estimates key ergonomic parameters to compute the Lifting Index based on the Revised NIOSH Lifting Equation. When the computed Lifting Index exceeds a predefined safety threshold, the system automatically generates graphical and textual recommendations to guide the worker towards safer postural strategies. This safety-aware recommender system provides interpretable and actionable feedback without requiring wearable sensors or multi-camera setups, making it suitable for deployment in real-world workplace environments. By integrating ergonomics with recommender system design, we contribute to a new class of context-aware, safety-oriented recommendation technologies tailored for occupational health.
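The Lifting Index the system computes comes from the Revised NIOSH Lifting Equation: a Recommended Weight Limit (RWL) is the 23 kg load constant scaled by posture multipliers, and LI is the actual load divided by the RWL. The sketch below uses the published metric-unit multiplier formulas for the horizontal, vertical, distance, and asymmetry factors; the frequency and coupling multipliers come from lookup tables in the NIOSH manual, so here they are passed in directly. The function signature is our illustrative choice.

```python
def lifting_index(load_kg, h_cm, v_cm, d_cm, a_deg, fm=1.0, cm=1.0):
    """Revised NIOSH Lifting Equation (simplified sketch, metric units).
    RWL = LC * HM * VM * DM * AM * FM * CM ;  LI = load / RWL."""
    LC = 23.0                               # load constant, kg
    HM = min(1.0, 25.0 / h_cm)              # horizontal multiplier
    VM = 1.0 - 0.003 * abs(v_cm - 75.0)     # vertical multiplier
    DM = min(1.0, 0.82 + 4.5 / d_cm)        # travel-distance multiplier
    AM = 1.0 - 0.0032 * a_deg               # asymmetry multiplier
    rwl = LC * HM * VM * DM * AM * fm * cm
    return load_kg / rwl
```

An LI above 1.0 is the usual threshold at which a system like the one described would trigger corrective feedback.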
- LBRMeta Off-Policy Estimation
by Olivier Jeunen (Aampe)Off-policy estimation (OPE) methods enable unbiased offline evaluation of recommender systems, directly estimating the online reward some target policy would have obtained, from offline data and with statistical guarantees. The theoretical elegance of the framework combined with practical successes have led to a surge of interest, with many competing estimators now available to practitioners and researchers. Among these, Doubly Robust methods provide a prominent strategy to combine value- and policy-based estimators. In this work, we take an alternative perspective to combine a set of OPE estimators and their associated confidence intervals into a single, more accurate estimate. Our approach leverages a correlated fixed-effects meta-analysis framework, explicitly accounting for dependencies among estimators that arise due to shared data. This yields a best linear unbiased estimate (BLUE) of the target policy’s value, along with an appropriately conservative confidence interval that reflects inter-estimator correlation. We validate our method on both simulated and real-world data, demonstrating improved statistical efficiency over existing individual estimators.
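The BLUE combination this abstract refers to has a closed form: given unbiased estimates with covariance matrix Sigma, the optimal weights are Sigma^{-1} 1 / (1^T Sigma^{-1} 1), and the combined variance is 1 / (1^T Sigma^{-1} 1). The two-estimator sketch below is an illustration of that formula, not the paper's code.

```python
def blue_combine(estimates, cov):
    """Best linear unbiased combination of two correlated unbiased estimators.
    weights = Sigma^{-1} @ ones / (ones^T @ Sigma^{-1} @ ones)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]  # 2x2 inverse
    s = [row[0] + row[1] for row in inv]              # Sigma^{-1} @ ones
    denom = s[0] + s[1]                               # ones^T Sigma^{-1} ones
    w = [si / denom for si in s]
    est = w[0] * estimates[0] + w[1] * estimates[1]
    return est, 1.0 / denom                           # estimate, its variance
```

Note how positive correlation between the estimators (e.g., from shared logged data) inflates the combined variance relative to the independent case, which is exactly the effect the paper's conservative confidence intervals account for.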
- LBRMitigating Popularity Bias in Counterfactual Explanations using Large Language Models
by Arjan Hasami (Delft University of Technology), Masoud Mansoury (Delft University of Technology)Counterfactual explanations (CFEs) offer a tangible and actionable way to explain recommendations by showing users a “what-if” scenario that demonstrates how small changes in their history would alter the system’s output. However, existing CFE methods are susceptible to bias, generating explanations that might misalign with the user’s actual preferences. In this paper, we propose a pre-processing step that leverages large language models to filter out-of-character history items before generating an explanation. In experiments on two public datasets, we focus on popularity bias and apply our approach to ACCENT, a neural CFE framework. We find that it creates counterfactuals that are more closely aligned with each user’s popularity preferences than ACCENT alone.
- LBRNormative Alignment of Recommender Systems via Internal Label Shift
by Johannes Kruse (JP/Politikens Media Group), Kasper Lindskow (JP/Politikens Media Group), Michael Riis Andersen (Technical University of Denmark), Ryotaro Shimizu (University of California San Diego), Julian McAuley (University of California San Diego), Pierre-Alexandre Mattei (Inria, Université Côte d’Azur), Jes Frellsen (Technical University of Denmark)Recommender systems optimized solely for user engagement often fail to meet broader normative objectives such as fairness, diversity, or editorial values. We introduce NAILS (Normative Alignment of recommender systems via Internal Label Shift), a simple and scalable method for aligning recommendation outputs with target distributions over item-level attributes (e.g., categories). NAILS modifies the user-conditional item distribution to induce a specified marginal distribution over attributes, leveraging existing user–item preferences without retraining the model. To achieve this, we recast the problem as a form of label shift applied internally within a hierarchical classification framework. Adopting a stakeholder-centric perspective, NAILS enables alignment with global normative goals. Empirically, we show that NAILS consistently improves attribute-level alignment with minimal impact on user engagement, providing a practical mechanism for value-driven recommendation.
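The internal label-shift idea can be illustrated as importance reweighting: scale each item's user-conditional score by the ratio of the target attribute marginal to the currently induced one, then renormalize. Everything below (names, dict-based interface) is an illustrative sketch under our own assumptions, not the NAILS implementation.

```python
def nails_style_reweight(item_scores, item_attr, target_marginal):
    """Rescale user-conditional item scores so the induced marginal over
    item attributes matches a target distribution (label-shift style)."""
    total = sum(item_scores.values())
    # marginal over attributes currently implied by the scores
    current = {}
    for item, s in item_scores.items():
        a = item_attr[item]
        current[a] = current.get(a, 0.0) + s / total
    # importance weights q(a) / p(a), then renormalize to a distribution
    reweighted = {i: s * target_marginal[item_attr[i]] / current[item_attr[i]]
                  for i, s in item_scores.items()}
    z = sum(reweighted.values())
    return {i: s / z for i, s in reweighted.items()}
```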
- LBROpening the Black Box: Interpretable Remedies for Popularity Bias in Recommender Systems
by Parviz Ahmadov (Delft University of Technology), Masoud Mansoury (Delft University of Technology)Popularity bias is a well-known challenge in recommender systems, where a small number of popular items receive disproportionate attention, while the majority of less popular items are largely overlooked. This imbalance often results in reduced recommendation quality and unfair exposure of items. Although existing mitigation techniques address this bias to some extent, they typically lack transparency in how they operate. In this paper, we propose a post-hoc method using a Sparse Autoencoder (SAE) to interpret and mitigate popularity bias in deep recommendation models. The SAE is trained to replicate a pre-trained model’s behavior while enabling neuron-level interpretability. By introducing synthetic users with clear preferences for either popular or unpopular items, we identify neurons encoding popularity signals based on their activation patterns. We then adjust the activations of the most biased neurons to steer recommendations toward fairer exposure. Experiments on two public datasets using a sequential recommendation model show that our method significantly improves fairness with minimal impact on accuracy. Moreover, it offers interpretability and fine-grained control over the fairness–accuracy trade-off.
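The neuron-level intervention described here can be sketched as a two-step procedure: rank neurons by the gap in mean activation between synthetic popularity-seeking and popularity-avoiding users, then damp the most biased ones when serving a real user. The function below is our simplified illustration of that idea, not the paper's SAE pipeline.

```python
def debias_activations(acts_pop, acts_unpop, acts_user, top_k=1, scale=0.0):
    """Damp the neurons whose mean activation differs most between synthetic
    'popular-item' and 'unpopular-item' users (illustrative sketch)."""
    n = len(acts_pop[0])
    mean = lambda rows, j: sum(r[j] for r in rows) / len(rows)
    # rank neurons by the activation gap between the two synthetic groups
    gaps = [(abs(mean(acts_pop, j) - mean(acts_unpop, j)), j) for j in range(n)]
    biased = {j for _, j in sorted(gaps, reverse=True)[:top_k]}
    # steer a real user's activations by scaling down the biased neurons
    return [a * scale if j in biased else a for j, a in enumerate(acts_user)]
```

Setting `scale` between 0 and 1 gives the fine-grained fairness–accuracy control the abstract mentions.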
- LBRPAIRSAT: Integrating Preference-Based Signals for User Satisfaction Estimation in Dialogue Systems
by Eran Fainman (University of Haifa), Adir Solomon (University of Haifa), Osnat Mokryn (University of Haifa)User satisfaction estimation in dialogue systems is a fundamental measure for assessing and improving conversational-AI quality and user experience. Current approaches rely on users’ satisfaction annotations, referred to as supervised labels. Yet these labels are scarce, costly to collect, and often domain-specific. Another form of feedback arises when a user selects one of two offered responses in a conversation, usually called a preference signal. In this work, we propose PAIRSAT, a new model for user-satisfaction estimation that integrates both satisfaction labels and preference signals. We reformulate satisfaction prediction as a bounded regression task on a continuous scale, enabling fine-grained modeling of satisfaction levels. To exploit the preference data, we incorporate a pairwise ranking loss that encourages higher predicted satisfaction for accepted conversation responses over rejected ones. PAIRSAT jointly optimizes regression on labeled data and ranking on preference pairs using a Transformer-based encoder. Experiments demonstrate that our model outperforms baselines that rely solely on supervised satisfaction labels, demonstrating the value of adding preference signals. Further, our results underscore the value of leveraging additional signals for satisfaction estimation in dialogue systems.
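The joint objective this abstract describes, bounded regression on labeled satisfaction scores plus a pairwise ranking loss on preference pairs, can be sketched in a few lines. The margin hinge form and the weighting below are our assumptions, not PAIRSAT's exact loss.

```python
def pairsat_style_loss(pred_labeled, true_labels, pred_chosen, pred_rejected,
                       lam=0.5, margin=0.1):
    """Illustrative combined loss: regression on satisfaction labels plus a
    pairwise ranking hinge on accepted-vs-rejected response pairs."""
    # bounded regression term: MSE against satisfaction scores in [0, 1]
    reg = sum((p - t) ** 2 for p, t in zip(pred_labeled, true_labels)) / len(true_labels)
    # ranking term: chosen response should out-score the rejected one by `margin`
    rank = sum(max(0.0, margin - (c - r)) for c, r in zip(pred_chosen, pred_rejected))
    rank /= len(pred_chosen)
    return reg + lam * rank
```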
- LBRParameter-Efficient Single Collaborative Branch for Recommendation
by Marta Moscati (Johannes Kepler University Linz), Shah Nawaz (Johannes Kepler University Linz), Markus Schedl (Johannes Kepler University Linz)Recommender Systems (RS) often rely on representations of users and items in a joint embedding space and on a similarity metric to compute relevance scores. In modern RS, the modules to obtain user and item representations consist of two distinct and separate neural networks (NN). In multimodal representation learning, weight sharing has been proven effective in reducing the distance between multiple modalities of the same item. Inspired by these approaches, we propose a novel RS that leverages weight sharing between the user and item NN modules used to obtain the latent representations in the shared embedding space. The proposed framework consists of a single Collaborative Branch for Recommendation (CoBraR). We evaluate CoBraR by means of quantitative experiments on e-commerce and movie recommendation. Our experiments show that by reducing the number of parameters and improving beyond-accuracy aspects without compromising accuracy, CoBraR has the potential to be applied and extended for real-world scenarios.
- LBRProbabilistic Modeling, Learnability and Uncertainty Estimation for Interaction Prediction in Movie Rating Datasets
by Jennifer Poernomo (Singapore Management University), Nicole Gabrielle Lee Tan (Singapore Management University), Rodrigo Alves (Czech Technical University in Prague), Antoine Ledent (Singapore Management University)In this paper, we examine the hypothesis that the interactions recorded in many Recommendation Systems datasets are distributed according to a low-rank distribution, i.e. a mixture of factorizable distributions. Surprisingly, we find that on several popular datasets, a simple non-negative matrix factorization method equals or outperforms more modern methods such as LightGCN, which indicates that the sampling distribution over interactions is indeed low-rank. Furthermore, we mathematically prove that low-rank distributions are learnable with a sparse number Õ((m+n)r) of observations (where m/n and r refer to the number of users/items and the non-negative rank respectively) both in terms of the total variation norm and in terms of the expected recall at k, arguably providing some of the first generalization bounds for recommender systems in the implicit feedback setting. We also provide a modified version of the NMF algorithm which provides further performance improvements compared to the standard NMF baseline on the smaller datasets considered. Finally, we propose the theoretically grounded concept of empirical expected recall as an uncertainty estimate for probabilistic models of the recommendation task, and demonstrate its success in a setting where user-wise abstentions are allowed.
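The non-negative matrix factorization baseline at the center of this study is usually fit with Lee–Seung multiplicative updates, which preserve non-negativity by construction. The pure-Python sketch below illustrates those standard updates on a tiny dense matrix; it is a generic textbook NMF, not the paper's modified algorithm.

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def nmf(V, r=1, iters=500, eps=1e-9, seed=0):
    """Multiplicative-update NMF: V (m x n) ~= W (m x r) @ H (r x n)."""
    rng = random.Random(seed)
    m, n = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(r)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(r)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H)
        WT = transpose(W)
        num, den = matmul(WT, V), matmul(matmul(WT, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)] for i in range(r)]
        # W <- W * (V H^T) / (W H H^T)
        HT = transpose(H)
        num, den = matmul(V, HT), matmul(W, matmul(H, HT))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(r)] for i in range(m)]
    return W, H
```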
- LBRRecommendation Is a Dish Better Served Warm
by Danil Gusak (AIRI), Nikita Sukhorukov (AIRI), Evgeny Frolov (AIRI)In modern recommender systems, experimental settings typically include filtering out cold users and items based on a minimum interaction threshold. However, these thresholds are often chosen arbitrarily and vary widely across studies, leading to inconsistencies that can significantly affect the comparability and reliability of evaluation results. In this paper, we systematically explore the cold-start boundary by examining the criteria used to determine whether a user or an item should be considered cold. Our experiments incrementally vary the number of interactions for different items during training, and gradually update the length of user interaction histories during inference. We investigate the thresholds across several widely used datasets, commonly represented in recent papers from top-tier conferences, and on multiple established recommender baselines. Our findings show that inconsistent selection of cold-start thresholds can either result in the unnecessary removal of valuable data or lead to the misclassification of cold instances as warm, introducing more noise into the system.
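The threshold-based filtering under study is typically a simple count filter applied before training. The sketch below shows one single-pass variant (real pipelines often iterate the two filters until stable, k-core style); the function name and defaults are our illustrative assumptions.

```python
from collections import Counter

def filter_cold(interactions, min_item_count=5, min_user_len=3):
    """Drop 'cold' items and users below interaction thresholds (single pass)."""
    item_counts = Counter(i for _, i in interactions)
    kept = [(u, i) for u, i in interactions if item_counts[i] >= min_item_count]
    user_counts = Counter(u for u, _ in kept)
    return [(u, i) for u, i in kept if user_counts[u] >= min_user_len]
```

As the paper argues, the results can be sensitive to these two constants: filtering items first changes user history lengths, so the same thresholds applied in a different order can keep different data.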
- LBRRecurrent Autoregressive Linear Model for Next-Basket Recommendation
by Tereza Zmeškalová (Czech Technical University in Prague), Antoine Ledent (Singapore Management University), Martin Spišák (Recombee), Pavel Kordík (Czech Technical University in Prague), Rodrigo Alves (Czech Technical University in Prague)Next-basket recommendation aims to predict the (sets of) items that a user is most likely to purchase during their next visit, capturing both short-term sequential patterns and long-term user preferences. However, effectively modeling these dynamics remains a challenge for traditional methods, which often struggle with interpretability and computational efficiency, particularly when dealing with intricate temporal dependencies and inter-item relationships. In this paper, we propose ReALM, a Recurrent Autoregressive Linear Model that explicitly captures temporal item-to-item dependencies across multiple time steps. By leveraging a recurrent loss function and a closed-form optimization solution, our approach offers both interpretability and scalability while maintaining competitive accuracy. Experimental results on real-world datasets demonstrate that ReALM outperforms several state-of-the-art baselines in both recommendation quality and efficiency, offering a robust and interpretable solution for modern personalization systems.
- LBRRethinking Subjective Features in Recommender Systems: Personal Views Over Aggregated Values
by Arsen Matej Golubovikj (University of Primorska), Marko Tkalčič (University of Primorska)Subjective features of content items, such as emotional resonance and aesthetic quality, have become increasingly important in recommender systems (RecSys), as the field moves beyond objective content and behavioral signals. Traditionally, such features were treated as fixed item-level properties, aggregated across users. However, emerging evidence suggests that subjective features are inherently user-dependent, shaped by individual interpretations and personal perspectives. This paper presents the first direct comparison between fixed (aggregated) and user-specific (subjective) item representations for modeling subjective features in RecSys. Using three datasets spanning movies, videos, and images, with subjective features such as eudaimonia, hedonia, emotion, and aesthetics, we evaluate the impact of the representation strategy (i.e., fixed vs. user-specific) on recommendation performance across multiple algorithms. Our findings show that user-specific representations consistently outperform aggregate ones, often with statistically significant improvements. These results underscore the importance of modeling subjectivity at the user level, offering concrete guidance for more personalized and effective recommendation systems.
- LBRRicciFlowRec: A Geometric Root Cause Recommender Using Ricci Curvature on Financial Graphs
by Zhongtian Sun (University of Kent), Anoushka Harit (University of Cambridge)We propose RicciFlowRec, a geometric recommendation framework that performs root cause attribution via Ricci curvature and flow on dynamic financial graphs. By modelling evolving interactions among stocks, macroeconomic indicators, and news, we quantify local stress using discrete Ricci curvature and trace shock propagation via Ricci flow. Curvature gradients reveal causal substructures, informing a structural risk-aware ranking function. Preliminary results on S&P 500 data with FinBERT-based sentiment show improved robustness and interpretability under synthetic perturbations. This ongoing work supports curvature-based attribution and early-stage risk-aware ranking, with portfolio optimisation and return forecasting planned. RicciFlowRec is, to our knowledge, the first recommender to apply geometric flow-based reasoning in financial decision support.
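Discrete Ricci curvature on graphs has several definitions; a commonly used simplified Forman form for unweighted graphs scores an edge as 4 minus the endpoint degrees, so edges between highly connected nodes come out strongly negative ("stressed"). The sketch below uses that simplified form as an illustration; the paper's construction on weighted dynamic financial graphs is richer.

```python
def forman_curvature(edges):
    """Simplified Forman-Ricci curvature on an unweighted simple graph:
    F(u, v) = 4 - deg(u) - deg(v). Strongly negative edges flag local stress."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return {(u, v): 4 - deg[u] - deg[v] for u, v in edges}
```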
- LBR SAGEA: Sparse Autoencoder-based Group Embeddings Aggregation for Fairness-Preserving Group Recommendations
by Vít Koštejn (Faculty of Mathematics and Physics, Charles University), Ladislav Peška (Faculty of Mathematics and Physics, Charles University), Martin Spišák (Recombee)Group recommender systems (GRS) deliver suggestions to users who plan to engage in activities together, rather than individually. To be effective, they must reflect shared group interests while maintaining fairness by accounting for the preferences of individual members. Traditional approaches address fairness through post-processing, aggregating the recommendations after they are generated for each group member. However, this strategy adds significant complexity and offers only limited impact due to its late position in the GRS pipeline. In contrast, we propose an efficient in-processing method combining (1) monosemantic sparse user representations generated via a sparse autoencoder (SAE) bridge module, and (2) fairness-preserving group profile aggregation strategies. By leveraging disentangled representations, our Sparse Autoencoder-based Group Embeddings Aggregation (SAGEA) approach enables transparent, fairness-preserving profile aggregation within the GRS process. Experiments show that SAGEA improves both recommendation accuracy and fairness over profile and results aggregation baselines, while being more efficient than post-processing techniques.
- LBR Semantic IDs for Joint Generative Search and Recommendation
by Gustavo Penha (Spotify), Edoardo D’Amico (Spotify), Marco De Nadai (Spotify), Enrico Palumbo (Spotify), Alexandre Tamborrino (Spotify), Ali Vardasbi (Spotify), Max Lefarov (Spotify), Shawn Lin (Spotify), Timothy Heath (Spotify), Francesco Fabbri (Spotify), Hugues Bouchard (Spotify)Generative models powered by Large Language Models (LLMs) are emerging as a unified solution for powering both recommendation and search tasks. A key design choice in these models is how to represent items, traditionally through unique identifiers (IDs) and more recently with Semantic IDs composed of discrete codes, obtained from embeddings. While task-specific embedding models can improve performance for individual tasks, they may not generalize well in a joint setting. In this paper, we explore how to construct Semantic IDs that perform well both in search and recommendation when using a unified model. We compare a range of strategies to construct Semantic IDs, looking into task-specific and cross-task approaches, and also whether each task should have its own semantic ID tokens in a joint search and recommendation generative model. Our results show that using a bi-encoder model fine-tuned on both search and recommendation tasks to obtain item embeddings, followed by the construction of a unified Semantic ID space, provides an effective trade-off, enabling strong performance in both tasks. We hope these findings spark follow-up work on generalisable, semantically grounded ID schemes and inform the next wave of unified generative recommender architectures.
- LBR SlateLLM: Distilling LLM Semantics into Session-Aware Slate Recommendation without Inference Overhead
by Aayush Roy (University College Dublin), Elias Tragos (University College Dublin), Aonghus Lawlor (University College Dublin), Neil Hurley (University College Dublin)Session-based slate recommendation systems curate ranked sets of items in real-time, adapting to evolving user interactions. Balancing relevance, diversity, and novelty remains challenging for reinforcement learning (RL) methods. Recent advances in large language models (LLMs) offer a new possibility to leverage their semantic reasoning capabilities to refine slate composition. In this work, we examine the impact of LLM-driven reasoning on slate generation by integrating LLMs with an RL-based slate recommender and evaluating in terms of accuracy, similarity, diversity, and novelty. We extend the RecSim framework with real-world interaction data and introduce a session-aware evaluation protocol that captures long-term engagement. Our analysis reveals that LLM reasoning enhances subcategory-level diversity while maintaining relevance, leading to increased user engagement. By visualizing category-level shifts in slate composition we uncover systematic patterns in how LLMs refine recommendation diversity. Although direct LLM use during inference may be hampered by computational demands and latency concerns, our experimental results demonstrate that integrating LLM modifications during training enables the model to internalize the nuanced characteristics of LLM reasoning without incurring inference overhead, thereby improving recommendation performance, serving time efficiency, and deployability.
- LBR t-Testing the Waters
by Olivier Jeunen (Aampe)A/B-tests are a cornerstone of experimental design on the web, with wide-ranging applications and use-cases. The statistical t-test comparing differences in means is the most commonly used method for assessing treatment effects, often justified through the Central Limit Theorem (CLT). The CLT ascertains that, as the sample size grows, the sampling distribution of the Average Treatment Effect converges to normality, making the t-test valid for sufficiently large sample sizes. When outcome measures are skewed or non-normal, quantifying what “sufficiently large” entails is not straightforward.
To ensure that confidence intervals maintain proper coverage and that p-values accurately reflect the false positive rate, it is critical to validate this normality assumption. We propose a practical method to test this, by analysing repeatedly resampled A/A-tests. When the normality assumption holds, the resulting p-value distribution should be uniform, and this property can be tested using the Kolmogorov-Smirnov test. This provides an efficient and effective way to empirically assess whether the t-test’s assumptions are met, and the A/B-test is valid. We demonstrate our methodology and highlight how it helps to identify scenarios prone to inflated Type-I errors. Our approach provides a practical framework to ensure and improve the reliability and robustness of A/B-testing practices.
- LBR The Hidden Cost of Defaults in Recommender System Evaluation
by Hannah Berling (University of Gothenburg), Robin Svahn (University of Gothenburg), Alan Said (University of Gothenburg)Hyperparameter optimization is critical for improving the performance of recommender systems, yet its implementation is often treated as a neutral or secondary concern. In this work, we shift focus from model benchmarking to auditing the behavior of RecBole, a widely used recommendation framework. We show that RecBole’s internal defaults, particularly an undocumented early-stopping policy, can prematurely terminate Random Search and Bayesian Optimization. This limits search coverage in ways that are not visible to users. Using six models and two datasets, we compare search strategies and quantify both performance variance and search path instability. Our findings reveal that hidden framework logic can introduce variability comparable to the differences between search strategies. These results highlight the importance of treating frameworks as active components of experimental design and call for more transparent, reproducibility-aware tooling in recommender systems research. We provide actionable recommendations for researchers and developers to mitigate hidden configuration behaviors and improve the transparency of hyperparameter tuning workflows.
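The A/A-resampling check proposed in “t-Testing the Waters” above can be sketched in a few lines. This is a hedged illustration, not the author’s code: for self-containedness it approximates the t-test with a large-sample z-test (normal CDF via `math.erf`) and computes the Kolmogorov-Smirnov statistic against Uniform(0, 1) by hand; the outcome pool, sample sizes, and repetition count are arbitrary.

```python
import math
import random

def two_sample_p(a, b):
    """Two-sided p-value via a large-sample z-approximation to Welch's t-test."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    z = (ma - mb) / math.sqrt(va / len(a) + vb / len(b))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def ks_vs_uniform(pvals):
    """Kolmogorov-Smirnov statistic of pvals against Uniform(0, 1)."""
    s = sorted(pvals)
    n = len(s)
    return max(max(abs((i + 1) / n - p), abs(p - i / n)) for i, p in enumerate(s))

rng = random.Random(42)
# Skewed outcome pool (revenue-like): many zeros plus a heavy tail.
pool = [0.0] * 900 + [rng.expovariate(0.1) for _ in range(100)]

# Repeated A/A tests: both "arms" are resampled from the SAME pool, so any
# non-uniformity in the resulting p-values signals a violated assumption.
pvals = []
for _ in range(200):
    a = [rng.choice(pool) for _ in range(500)]
    b = [rng.choice(pool) for _ in range(500)]
    pvals.append(two_sample_p(a, b))

d_stat = ks_vs_uniform(pvals)  # large D => p-values are not uniform
```

If `d_stat` exceeds the KS critical value for 200 samples (roughly 1.36/√200 ≈ 0.096 at the 5% level), the uniformity, and hence the normality assumption behind the test, is suspect.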
- LBR Unobserved Negative Items in Recommender Systems: Challenges and Solutions for Evaluation and Learning
by Masahiro Sato (FUJIFILM)Properly conducting offline evaluation is crucial for recommender systems. While sampling negative items has traditionally been employed for its efficiency in evaluation, recent studies have highlighted the limitations of this approach, prompting researchers to adopt a more cautious stance toward item-sampling evaluation. However, even in the absence of intentional sampling, negative items may still be missing. This issue arises because typical implicit feedback datasets contain only items that have been interacted with by at least one user in the dataset. Consequently, the included items may not encompass the entire catalog of items that serve as true candidate items during online deployment. In this paper, we investigate the impact of missing candidate items on both the evaluation and learning processes of recommender systems. Our findings demonstrate that missing candidate items lead to the overestimation of model performance and inconsistencies in identifying superior models. Moreover, their absence significantly impairs model training. To address this challenge, we propose evaluation and learning methods based on inverse probability weighting, complemented by a novel protocol for estimating the probabilities of missing items. We show that the proposed evaluation methods recover metrics that closely approximate their true values. Furthermore, the proposed learning method yields a more robust model, even when candidate items are missing from the training data.
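The inverse-probability-weighting correction described above can be illustrated with a toy rank estimate. This is a minimal sketch that hypothetically assumes each catalog item’s probability of appearing in the dataset is known; the paper’s actual estimators, and its protocol for estimating those probabilities, are more involved.

```python
import random

def ipw_rank(target_score, neg_scores, obs_probs):
    """Estimate the target item's rank against the FULL candidate set when only
    some negatives are observed: each observed negative that outscores the
    target is counted with weight 1 / P(observed)."""
    rank = 1.0
    for s, p in zip(neg_scores, obs_probs):
        if s > target_score:
            rank += 1.0 / p
    return rank

rng = random.Random(0)
# Full catalog of 1000 negatives with random scores; each item enters the
# dataset ("is observed") with a popularity-dependent probability.
full = [rng.random() for _ in range(1000)]
probs = [0.2 + 0.6 * s for s in full]  # high-scoring items are seen more often
observed = [(s, p) for s, p in zip(full, probs) if rng.random() < p]

target = 0.9
true_rank = 1 + sum(s > target for s in full)
naive_rank = 1 + sum(s > target for s, _ in observed)  # too small: looks better
ipw_est = ipw_rank(target, [s for s, _ in observed], [p for _, p in observed])
```

Every missing high-scoring negative silently improves the target’s apparent rank, so the naive estimate is optimistic; reweighting each observed negative by 1 / P(observed) recovers an unbiased estimate of the full-catalog rank in expectation.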
- LBR eSASRec: Enhancing Transformer-based Recommendations in a Modular Fashion
by Daria Tikhonovich (MTS), Nikita Zelinskiy (MTS), Aleksandr V. Petrov (Independent Researcher), Mayya Spirina (MTS), Andrei Semenov (Yandex), Andrey V. Savchenko, Sergei Kuliev (MTS)Since their introduction, Transformer-based models, such as SASRec and BERT4Rec, have become common baselines for sequential recommendations, surpassing earlier neural and non-neural methods. A number of subsequent publications have shown that the effectiveness of these models can be improved by, for example, slightly updating the architecture of the Transformer layers, using better training objectives, and employing improved loss functions. However, the additivity of these modular improvements has not been systematically benchmarked – this is the gap we aim to close in this paper. Through our experiments, we identify a very strong model that uses SASRec’s training objective, LiGR Transformer layers, and Sampled Softmax Loss. We call this combination eSASRec (Enhanced SASRec). While we primarily focus on realistic, production-like evaluation, in our preliminary study we find that common academic benchmarks show eSASRec to be 23% more effective compared to the most recent state-of-the-art models, such as ActionPiece. In our main production-like benchmark, eSASRec resides on the Pareto frontier in terms of the accuracy–coverage tradeoff (alongside the recent industrial models HSTU and FuXi-α). As the modifications compared to the original SASRec are relatively straightforward and no extra features are needed (such as timestamps in HSTU), we believe that eSASRec can be easily integrated into existing recommendation pipelines and can serve as a strong yet very simple baseline for emerging complicated algorithms. To facilitate this, we provide open-source implementations for our models and benchmarks.
List of all Demo papers accepted for RecSys 2025 (in alphabetical order).
- DEMO APS Explorer: Navigating Algorithm Performance Spaces for Informed Dataset Selection
by Tobias Vente (University of Antwerp), Michael Heep (University of Siegen), Abdullah Abbas (University of Siegen), Theodor Sperle (University of Siegen), Joeran Beel (University of Siegen), Bart Goethals (University of Antwerp)Dataset selection is crucial for offline recommender system experiments, as mismatched data (e.g., sparse interaction scenarios require datasets with low user-item density) can lead to unreliable results. Yet, 86% of ACM RecSys 2024 papers provide no justification for their dataset choices, with most relying on just four datasets: Amazon (38%), MovieLens (34%), Yelp (15%), and Gowalla (12%). While Algorithm Performance Spaces (APS) were proposed to guide dataset selection, their adoption has been limited due to the absence of an intuitive, interactive tool for APS exploration. Therefore, we introduce the APS Explorer, a web-based visualization tool for interactive APS exploration, enabling data-driven dataset selection. The APS Explorer provides three interactive features: (1) an interactive PCA plot showing dataset similarity via performance patterns, (2) a dynamic meta-feature table for dataset comparisons, and (3) a specialized visualization for pairwise algorithm performance.
- DEMO ArtAICare: An End-to-End Platform for Personalized Art Therapy
by Bereket A. Yilma (University of Luxembourg), Saravanakumar Duraisamy (University of Luxembourg), Stefan Penchev (University of Luxembourg), Tudor Pristav (University of Luxembourg), Luis A. Leiva (University of Luxembourg)We introduce a platform powered by visual art recommender systems (VA RecSys) to support art therapy for patients with Post-Intensive Care Syndrome (PICS) or experiencing psychiatric sequelae symptoms such as anxiety, depression, and Post Traumatic Stress Disorder (PTSD). The contribution is threefold: (1) integration of unimodal, multimodal, and cross-domain VA RecSys engines as plug-and-play external APIs for therapeutic art recommendations; (2) development of an end-to-end platform with desktop/mobile/tablet and immersive VR interfaces to connect therapists and patients; and (3) a therapist dashboard providing post-session analytics, including objective and subjective measures, to inform future recommendations. A pilot test with licensed art therapists and patients with PICS demonstrated that the platform enables therapist-supervised, personalized therapy, reducing preparation time by 50% and improving affective states by 70.5%.
- DEMO ArtEx: A User-Controllable Web Interface for Visual Art Recommendations
by Rully Agus Hendrawan (University of Pittsburgh), Peter Brusilovsky (University of Pittsburgh), Luis A. Leiva (University of Luxembourg), Bereket A. Yilma (University of Luxembourg)We introduce a web-based interface for visual art recommendations, empowering users to adjust popularity and diversity through intuitive sliders. Built on the SemArt dataset and leveraging multimodal BLIP features, ArtEx allows users to fine-tune recommendations across dimensions like genre, time period, and artist. This demo paper presents ArtEx’s interactive interface, showcasing its ability to enhance user engagement and satisfaction through transparent, user-driven personalization.
- DEMO Blooming Beats: An Interactive Music Recommender System Grounded in TRACE Principles and Data Humanism
by Ibrahim Al-Hazwani (University of Zurich), Daniel Lutziger (University of Zurich), Carlos Kirchdorfer (University of Zurich), Luca Huber (University of Zurich), Oliver Robin Aschwanden (University of Zurich), Jürgen Bernard (University of Zurich, Digital Society Initiative), Ludovico Boratto (University of Cagliari)Music streaming platforms reduce rich listening experiences to algorithmic black boxes, overlooking personal narratives that make music meaningful. We present Blooming Beats, an explainable recommender system that transforms Spotify listening data into visual narratives using Data Humanism principles. The system embodies TRACE principles: Transparency through visual explanations, Context-awareness by integrating personal context, and Empathy by matching listening stories rather than user profiles. A user study with 8 participants exploring a decade of listening data shows that narrative-driven visualization suggests potential for enhancing transparency and engagement.
- DEMO Flights Pricelock Fee Recommendation on Online Travel Agent Platform
by Akash Khetan (MakeMyTrip pvt ltd), Narasimha Medeme (MakeMyTrip pvt Ltd), Deepak Yadav (MakeMyTrip pvt Ltd), Anmol Porwal (MakeMyTrip pvt Ltd)In this study, we present a neural network (NN) based recommender system with a novel custom loss function, developed to recommend the fee for its pricelock product. Pricelock is a popular add-on product that allows users to lock a flight price and book it later at the same locked price, even if the price increases before booking. The core challenge in enabling this product lies in predicting the magnitude of future price changes over multiple time horizons.
We formulate this problem as a multi-task learning (MTL) setup, where price change magnitudes are modeled as ordinal categories across several time intervals, each modeled as a prediction head. Crucially, we address the ordinal nature of price change buckets by introducing a novel loss function called Learnable Soft Ordinal Regression (L-SORD).
Our demo showcases how this system improves both predictive accuracy and revenue performance, enabling more effective price recommendations in a high-stakes, real-world environment. This work highlights the potential of combining MTL architectures with custom loss functions in production-grade pricing recommender systems.
- DEMO Interactive Playlist Generation from Titles
by Eléa Vellard (EURECOM), Enzo Charolois-Pasqua (EURECOM), Youssra Rebboud (EURECOM), Pasquale Lisena (EURECOM), Raphaël Troncy (EURECOM)This demo presents an interactive playlist recommendation system that relies exclusively on playlist titles. By fine-tuning a transformer-based language model on clustered playlists, we enable real-time playlist generation for a given title, relying on the semantic meaning of known playlists’ and tracks’ titles. The playlist title provided in input is freely expressed in natural language in a user-friendly web interface. The system is lightweight, fast, and fully accessible through a simple web page.
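The title-seeded generation behind this demo (and the companion full paper) can be sketched as a similarity-plus-voting step. The 3-dimensional “title embeddings”, playlist names, and track IDs below are invented for illustration, and the clustering and transformer fine-tuning stages of the real system are omitted:

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def generate_playlist(query_emb, known, k_titles=2, n_tracks=3):
    """Pick the k playlists whose titles are most similar to the query title,
    then let their tracks vote: tracks shared by more of them rank higher."""
    ranked = sorted(known, key=lambda pl: cosine(query_emb, pl["emb"]), reverse=True)
    votes = Counter()
    for pl in ranked[:k_titles]:
        votes.update(pl["tracks"])
    return [t for t, _ in votes.most_common(n_tracks)]

# Toy title embeddings (a real system would use a fine-tuned sentence encoder).
known = [
    {"title": "rainy day chill", "emb": (0.9, 0.1, 0.0), "tracks": ["t1", "t2", "t3"]},
    {"title": "cozy evening",    "emb": (0.8, 0.2, 0.1), "tracks": ["t2", "t3", "t4"]},
    {"title": "gym power hour",  "emb": (0.0, 0.1, 0.9), "tracks": ["t5", "t6", "t7"]},
]
playlist = generate_playlist((0.85, 0.15, 0.05), known)
# Tracks shared by the two most title-similar playlists win the vote.
assert set(playlist) <= {"t1", "t2", "t3", "t4"}
```

Because the query embedding lies near the two “calm” playlists, their shared tracks (here `t2` and `t3`) collect two votes each and top the generated playlist.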
- DEMO Large Language Model-based Recommendation System Agents
by Tommaso Carraro (SonyAI), Brijraj Singh (Sony Research India), Niranjan Pedanekar (Sony Research India)A Large Language Model-based agent is an AI assistant that makes use of advanced Tool Calling and Retrieval Augmented Generation techniques to access external tools (e.g., Python code, databases). This allows the agent to consult additional sources of information that are complementary to its pre-trained knowledge. By doing so, re-training or fine-tuning of the LLM each time new knowledge becomes available can be avoided, as the assistant can access this information thanks to the available tools. In this demo, we investigate this idea in the Recommendation Systems scenario. In particular, we design an AI assistant for recommendation that can access (i) a pre-trained recommender system, (ii) a database, and (iii) a vector store. The demo shows how the assistant is able to interact with these tools to reply to complex recommendation and explanation queries that require reasoning on the tool’s results. To the best of our knowledge, this is the first attempt at designing LLM-based recommendation system agents.
- DEMO PRISM: From Individual Preferences to Group Consensus through Conversational AI-Mediated and Visual Explanations
by Ibrahim Al-Hazwani (University of Zurich), Oliver Aschwanden (University of Zurich), Oana Inel (University of Zurich), Jürgen Bernard (University of Zurich, Digital Society Initiative), Ludovico Boratto (University of Cagliari)Group accommodation booking forces travelers to coordinate externally through messaging apps and informal voting, missing opportunities for transparent preference alignment. We present PRISM, an interactive group recommender system that transforms opaque recommendation processes into transparent collaborative visual experiences. PRISM employs a two-phase interaction paradigm: individual preference elicitation through conversational AI, followed by collaborative decision-making via bivariate map preference visualization. A controlled user study with 6 pairs shows PRISM enhances transparency (+1.83 on 5-point scale), consensus building (+2.0), and reduces conformity pressure compared to traditional approaches and interfaces.
- DEMO RecViz: Intuitive Graph-based Visual Analytics for Dataset Exploration and Recommender System Evaluation
by Jackson Dam (University of Glasgow), Zixuan Yi (University of Glasgow), Iadh Ounis (University of Glasgow)We present RecViz, a novel web application designed to support qualitative analysis of recommender system performance on large datasets. RecViz offers real-time, interactive graph visualisation of recommendation data, enabling side-by-side comparisons of models through dual graph views. Leveraging GPU acceleration via CUDA and WebGL, it delivers fast, responsive force-directed layouts, even at scale. Unlike prior tools limited to small datasets, RecViz shows the potential to handle large datasets efficiently. For example, it maintains an average of 28 FPS while visualising the full MovieLens-1M dataset, with all 1 million interactions.
- DEMO Travel Together, Play Together: Gamifying a Group Recommender System for Tourism
by Patrícia Alves (ISEP, Polytechnic of Porto), Joana Neto (ISEP, Polytechnic of Porto), Jorge Lima (ISEP, Polytechnic of Porto), José Silva (ISEP, Polytechnic of Porto), Luís Conceição (ISEP, Polytechnic of Porto), Goreti Marreiros (ISEP, Polytechnic of Porto)Gamification is increasingly being used in a variety of domains, such as in education to motivate students’ learning, in healthcare contexts to help patients follow medical indications or improve healthy habits, or even in tourism to enrich the tourists’ experience. Recommender Systems (RS) are an application example, where gamification has been added to motivate and challenge tourists while visiting a destination, but only a few use gamification to motivate using the RS itself, and, to the best of our knowledge, there are no Group RS (GRS) that use gamification. Psychological aspects, such as personality, are also being studied to enhance recommendations, since they have been shown to produce better results than generic approaches, but acquiring personality without the social desirability bias associated with questionnaires or a great amount of user interaction is a challenge. In previous studies, we showed serious games can be the leverage needed to implicitly acquire the tourists’ personality and improve recommendations without the observer’s bias. In this demo, we show how we gamified a GRS for tourism prototype by using rewards, a virtual pet, and serious games.
- DEMO VisualReF: Interactive Image Search Prototype with Visual Relevance Feedback
by Bulat Khaertdinov (Maastricht University), Mirela Popa (Maastricht University), Nava Tintarev (Maastricht University)In the absence of interaction history, image recommendations often depend on content-based approaches. Prompted by user queries in natural language, such systems rank items based on the similarity between textual and visual features. However, these approaches typically rely on static queries and do not offer alternative feedback mechanisms. In this paper, we present VisualReF: an interactive image retrieval prototype that introduces visual relevance feedback through fine-grained user annotations. Built on vision-language models (VLMs) for retrieval, our system allows users to label relevant and irrelevant regions in retrieved images. These regions are captioned using a generative vision-language model to refine the query vector. Our work bridges the gap between conventional static image retrieval and interactive, user-guided search by introducing visual relevance feedback. Finally, our prototype contributes to the field of visual recommendation by empowering researchers with practical tools for: (i) collecting region-level visual relevance signals from users, (ii) supporting integration of human feedback into interactive search pipelines, and (iii) explaining how the relevance feedback model perceives user input.
List of all doctoral symposium papers accepted for RecSys 2025 (in alphabetical order).
- DS Adding Value to Low-Resource Industrial Recommender Systems
by Cornelia Kloppers (Stellenbosch University)This research proposes a modular, resource-aware framework for industrial recommender systems that enables the integration and evaluation of stakeholder values at each stage of the recommendation pipeline. Motivated by the practical constraints of data availability and computational capacity, the framework supports stage-wise optimisation and selective retraining, making it suitable for low-resource environments. Ongoing experiments on open-source and real-world datasets aim to validate the framework’s adaptability, offering a contribution to the design of value-aware and operationally viable recommender systems.
- DS Addressing Multi-stakeholder Fairness Concerns in Recommender Systems Through Social Choice
by Amanda Aird (University of Colorado Boulder)Fairness in recommender systems has been discussed at the group and individual level, with concerns for both providers and consumers. However, many current solutions to improving fairness in recommender systems can only address one fairness concern or have limited definitions of fairness. My research revolves around improving fairness in recommender systems with an approach that addresses multiple and complex fairness concerns. I use SCRUF-D (Social Choice for Recommendation Under Fairness – Dynamic), a multi-agent social choice-based architecture, for reranking recommendations to improve fairness across multiple dimensions. My completed research has evaluated trade-offs between accuracy and fairness when reranking for multiple fairness definitions on the provider side. This includes exploring how different social choice rules and agent allocation mechanisms impact this trade-off. Currently, I am focused on expanding these studies to include individual and consumer-side fairness metrics. My ongoing research aims to evaluate the trade-offs between accuracy and fairness, incorporating consumer-side fairness metrics. Planned research will address tensions between different types of fairness and include human-subject studies to demonstrate the value of SCRUF-D.
- DS Advancing User-Centric Evaluation and Design of Conversational Recommender Systems
by Michael Müller (University of Innsbruck)Conversational Recommender Systems (CRS) are rapidly evolving with advancements in large language models (LLMs), enabling richer, more adaptive user interactions. However, existing evaluation practices remain largely system-centric, underestimating nuanced factors like conversational quality, empathy, and real-world user satisfaction. This doctoral research aims to bridge that gap by advancing holistic, user-centric evaluation frameworks for CRS. The work pursues four directions: (1) identifying key drivers of user satisfaction through targeted user studies and dataset analyses; (2) systematically investigating LLMs as annotators and user simulators to support scalable CRS assessment; (3) developing scalable, standardized evaluation protocols that balance objective accuracy with subjective conversational experience; and (4) deriving actionable design guidelines by comparing strategies for preference elicitation and context integration. Ultimately, this research seeks to provide reproducible methods, and evidence-based guidance to foster the development of CRS that genuinely center the user.
- DS Are Recommender Systems Serving Children? Toward Child-Aware Design and Evaluation
by Robin Ungruh (Delft University of Technology)Recommender Systems research continuously improves recommendation strategies to meet the needs of a wide range of users and other stakeholders. However, much of this research centers on the traditional, adult user, often overlooking underrepresented demographics. One such group is children, frequent users of platforms driven by recommender systems. Children differ from adults in preferences and can be particularly vulnerable to certain content, raising questions about the harm recommender systems may pose. This PhD project advocates for child-aware recommender systems: systems that explicitly account for children as part of their user base, recognizing their distinct needs, vulnerabilities, and rights. In pursuit of this goal, we investigate how well current recommender systems serve children, auditing algorithmic strategies from two complementary perspectives: The ‘traditional’ perspective focuses on the degree to which recommendations align with children’s preferences. The perspective of ‘non-maleficence’ assesses suitability of content recommended, evaluating whether it respects children’s vulnerabilities and avoids potentially harmful material. To do so, we audit current recommender systems according to both perspectives—not only in the short term, but also in the long term through simulation studies. Beyond auditing, we explore strategies and design directions for making recommender systems more responsible. Outcomes from this work aim to inform both the academic and practitioner communities about the gaps in current systems and to lay the groundwork for more equitable, safe, and meaningful recommendations for children.
- DS Bayesian Perspectives on Offline Evaluation for Recommender Systems
by Michael Benigni (Politecnico di Milano)Offline evaluation is a fundamental component in the deployment and development of better recommender systems. In recent years, the contextual bandit framework has emerged as a valuable approach for counterfactual evaluation, leading to increasing interest in estimators based on inverse propensity scoring (IPS), direct methods (DM), and doubly robust (DR) techniques. However, nearly all existing methods rely on frequentist statistics, which limits their ability to capture model uncertainty and reflect it through evaluation outcomes. This work explores the novel research direction of Bayesian statistics for Off-Policy Evaluation in recommendation tasks, motivated by the need for reliable estimators that are more robust to distribution shift, data sparsity, and model misspecification. Three underexplored research directions are identified in this work: (i) using posterior uncertainty from Bayesian reward models to design adaptive hybrid estimators, (ii) explicitly modeling all components of the OPE problem—contexts, actions, and rewards—using a joint probabilistic framework, and (iii) quantifying epistemic uncertainty over policy value estimates via posterior inference. By leveraging the Bayesian framework, the aim is to improve the reliability, interpretability, and safety of offline evaluation protocols, offering a new lens on one of the most persistent challenges in recommender systems research. This perspective is especially relevant in data-scarce or high-stakes settings, where understanding uncertainty is essential for trustworthy decision-making.
- DS Beyond Persuasion: Adaptive Warnings and Balanced Explanations for Informed Decision-Making in Recommender Systems
by Elaheh Jafari (University of Saskatchewan)As recommender systems become deeply embedded in digital platforms, designing explanations that are ethical, effective, and user-centered is increasingly important. Traditional strategies often prioritize persuasiveness or transparency but neglect user agency and cognitive differences. This research explores alternative explanation formats, namely warnings that highlight potential drawbacks and balanced pros-and-cons summaries, to support more informed and autonomous decision-making. In the first year, we published a paper discussing ethical considerations in explanation design for recommender systems. We then conducted a systematic review of user perceptions, a study of warning messages in mobile app interfaces, and a controlled e-commerce experiment comparing baseline, warning, and pros-and-cons explanations. Results indicate that layered explanations improve decision satisfaction, reduce cognitive load, and better align with individual traits like decision style and need for cognition. Building on these findings, we propose a multi-level explanation approach that combines upfront warnings with on-demand balanced details, adaptable across domains. Future work will explore personalization strategies, real-time adaptivity, and generalizability to domains such as media, news, and job recommendations. This research aims to inform the design of transparent, fair, and trustworthy explanation interfaces in recommender systems.
- DSChallenges in Perfume Recommender Systems: Navigating Subjectivity, Context and Sensory Data
by Elena-Ruxandra Lutan (University of Craiova)Compared to other recommender systems domains, perfume recommendation proves to be highly personalized and more challenging due to highly subjective factors and the complex mixture of senses involved. Individual perfume preferences are influenced by subtle elements such as emotional associations, personal memories, and unique biochemistry, making it difficult for users to clearly express their olfactory preferences. This paper provides insight into significant challenges in perfume recommendation that I plan to address in the context of my ongoing PhD project. By exploring these areas, I aim to make a meaningful contribution to the ongoing development of perfume recommender systems.
- DSFair and Transparent Recommender Systems for Advertisements
by Dina Zilbershtein (Maastricht University)Recommender systems are central to digital platforms, powering content personalization, user engagement, and revenue generation. In advertising, they operate within a multi-stakeholder environment, bringing together viewers, advertisers, and platform providers with often competing objectives. While such systems enhance targeting precision, their opacity raises concerns around fairness, transparency, and trust. This research, conducted in collaboration with RTL Netherlands, focuses on building fair and transparent recommender systems for advertisements, with particular emphasis on Video-on-Demand (VoD) platforms. I investigate algorithmic interventions and explainability techniques aimed at aligning system behavior with stakeholders’ expectations. By addressing tensions between stakeholders’ objectives and challenges of the ad delivery process, this work contributes to the design of ethically responsible advertising systems that balance commercial goals with accountability and user trust.
- DSFull-Page Recommender: A Modular Framework for Multi-Carousel Recommendations
by Jan Kislinger (Czech Technical University)Full-page layouts with multiple carousels are widely used in video streaming platforms, yet understudied in recommender systems research. This paper introduces a structured approach to generating such pages by recommending coherent item collections and optimizing their arrangement. We break the problem into subcomponents and propose methods that balance user relevance, diversity, and coherence. We also present an evaluation framework tailored to this setting. We argue that this approach can improve recommendation quality beyond traditional ranked lists.
- DSNarrative-Driven Itinerary Recommendation: LLM Integration for Immersive Urban Walking
by Fabio Ferrero (University of Turin)Sedentary behavior, dubbed the disease of the 21st century, is a ubiquitous force driving chronic illness. Yet, traditional itinerary and Point-of-Interest (POI) Recommender Systems (RSs) lack engaging elements that motivate routine urban walking. This research proposes a novel framework combining narrative-driven storytelling with location-based RSs to promote physical activity and immersive urban exploration. This approach introduces a bidirectional alignment between POI and itinerary recommendations and LLM-generated narratives, transforming routine urban walks into dynamic journeys where contextually relevant stories unfold across city locations. Unlike sequential POI recommendations, this framework embeds location suggestions within contextually relevant narratives of various genres, simultaneously promoting health benefits and deeper city exploration. The research addresses three research questions using a method that builds a structured knowledge base by extracting entities (e.g., POIs and characters) and semantic links from narrative corpora, enabling semantic alignment between recommended physical locations and story elements. The core aspects of this work are: (i) context-aware itinerary recommendations and personalized story generation, (ii) bidirectional mapping between RSs and story generation, and (iii) system design bridging users’ needs to promote urban walking as a health activity. Evaluation employs comparative user studies measuring quality and engagement, route-narrative semantic alignment, and narrative analysis to validate the proposed integrated approach.
- DSPersonalized Image Generation for Recommendations Beyond Catalogs
by Gabriel Patron (University of Michigan)Retrieval-based recommender systems are constrained by fixed catalogs, limiting their ability to serve diverse and evolving user preferences. We propose REBECA (REcommendations BEyond CAtalogs), a new class of preference-aware generative models for recommendation that synthesizes images tailored to individual tastes rather than retrieving items. REBECA conditions a diffusion model on users’ feedback (e.g., ratings) to generate personalized image embeddings in CLIP space, which are decoded into images via a hierarchical adapter architecture that bypasses the need for image captions during training. By leveraging an expressive pre-trained image decoder and a lightweight probabilistic adapter, REBECA enables general-purpose image generation aligned with users’ visual preferences across diverse domains without expensive fine-tuning. We also introduce a new benchmark for personalized generation based on a curated version of the FLICKR-AES dataset, along with two novel personalization metrics tailored to the generative setting. Empirical results show that REBECA produces high-quality, diverse, and preference-aligned outputs, outperforming prompt-based personalization baselines on key personalization and quality metrics. By augmenting traditional retrieval with generative modeling, REBECA opens new opportunities for applications such as content design, personalization-first creative platforms, and preference-aware synthetic media.
- DSRecommender Systems for Digital Humanities and Archives: Multistakeholder Evaluation, Scholarly Information Needs, and Multimodal Similarity
by Florian Atzenhofer-Baumgartner (Graz University of Technology)Recommender systems (RecSys) in digital humanities (DH) and archives face unique challenges, including balancing competing stakeholder values, serving complex scholarly information needs, and modeling multimodal historical artifacts. This paper reports on ongoing research that tackles these issues through three interconnected strands: (1) the development of co-designed multistakeholder evaluation frameworks that move beyond simple engagement metrics to capture diverse priorities among archivists, researchers, and platform owners; (2) a systematic examination of the information behaviors of humanities scholars to inform user models adapted to exploratory, non-linear research; and (3) the creation of multimodal similarity metrics that exploit scholarly markup, material characteristics, and specialized domain knowledge. Validated through Monasterium.net—the world’s largest charter archive—this research contributes novel approaches to value-driven evaluation, scholarly user modeling, and historical document similarity. It provides methodological frameworks to bridge the computer science and DH communities, and to advance multistakeholder RecSys for complex, non-traditional domains.
RecSys 2025 (Prague)