Tuesday Poster Session: Full + Short + Doctoral Symposium + ACM TORS

Date: Tuesday September 23

Doctoral Symposium Papers

  • SPOT #1 Adding Value to Low-Resource Industrial Recommender Systems
    by Cornelia M Klopper

    This research proposes a modular, resource-aware framework for industrial recommender systems that enables the integration and evaluation of stakeholder values at each stage of the recommendation pipeline. Motivated by the practical constraints of data availability and computational capacity, the framework supports stage-wise optimisation and selective retraining, making it suitable for low-resource environments. Ongoing experiments on open-source and real-world datasets aim to validate the framework’s adaptability, offering a contribution to the design of value-aware and operationally viable recommender systems.

    Full text in ACM Digital Library

  • SPOT #2 Addressing Multi-stakeholder Fairness Concerns in Recommender Systems Through Social Choice
    by Amanda Aird

    Fairness in recommender systems has been discussed at the group and individual levels, with concerns for both providers and consumers. However, many current solutions for improving fairness in recommender systems can only address one fairness concern or rely on limited definitions of fairness. My research revolves around improving fairness in recommender systems with an approach that addresses multiple and complex fairness concerns. I use SCRUF-D (Social Choice for Recommendation Under Fairness – Dynamic), a multi-agent social choice-based architecture, for reranking recommendations to improve fairness across multiple dimensions. My completed research has evaluated trade-offs between accuracy and fairness when reranking for multiple fairness definitions on the provider side. This includes exploring how different social choice rules and agent allocation mechanisms impact this trade-off. Currently, I am focused on expanding these studies to include individual and consumer-side fairness metrics. My ongoing research aims to evaluate the trade-offs between accuracy and fairness once consumer-side fairness metrics are incorporated. Research on handling tensions between different types of fairness, together with human-subject studies to demonstrate the value of SCRUF, is planned.

    Full text in ACM Digital Library

  • SPOT #3 Advancing User-Centric Evaluation and Design of Conversational Recommender Systems
    by Michael Müller

    Conversational Recommender Systems (CRS) are rapidly evolving with advancements in large language models (LLMs), enabling richer, more adaptive user interactions. However, existing evaluation practices remain largely system-centric, overlooking nuanced factors like conversational quality, empathy, and real-world user satisfaction. This doctoral research aims to bridge that gap by advancing holistic, user-centric evaluation frameworks for CRS. The work pursues four directions: (1) identifying key drivers of user satisfaction through targeted user studies and dataset analyses; (2) systematically investigating LLMs as annotators and user simulators to support scalable CRS assessment; (3) developing scalable, standardized evaluation protocols that balance objective accuracy with subjective conversational experience; and (4) deriving actionable design guidelines by comparing strategies for preference elicitation and context integration. Ultimately, this research seeks to provide reproducible methods and evidence-based guidance to foster the development of CRS that genuinely center the user.

    Full text in ACM Digital Library

  • SPOT #4 Are Recommender Systems Serving Children? Toward Child-Aware Design and Evaluation
    by Robin Ungruh

    Recommender Systems research continuously improves recommendation strategies to meet the needs of a wide range of users and other stakeholders. However, much of this research centers on the traditional, adult user, often overlooking underrepresented demographics. One such group is children, frequent users of platforms driven by recommender systems. Children differ from adults in preferences and can be particularly vulnerable to certain content, raising questions about the harm recommender systems may pose. This PhD project advocates for child-aware recommender systems: systems that explicitly account for children as part of their user base, recognizing their distinct needs, vulnerabilities, and rights. In pursuit of this goal, we investigate how well current recommender systems serve children, auditing algorithmic strategies from two complementary perspectives: The ‘traditional’ perspective focuses on whether recommendations align with children’s preferences. The perspective of ‘non-maleficence’ assesses the suitability of recommended content, evaluating whether it respects children’s vulnerabilities to potentially harmful material. To do so, we audit current recommender systems according to both perspectives—not only in the short term, but also in the long term through simulation studies. Beyond auditing, we explore strategies and design directions for making recommender systems more responsible. Outcomes from this work should inform both academic and practitioner communities about the gaps in current systems and lay the groundwork for more equitable, safe, and meaningful recommendations for children.

    Full text in ACM Digital Library

  • SPOT #5 Bayesian Perspectives on Offline Evaluation for Recommender Systems
    by Michael Benigni

    Offline evaluation is a fundamental component in the deployment and development of better recommender systems. In recent years, the contextual bandit framework has emerged as a valuable approach for offline and counterfactual evaluation, leading to increasing interest in estimators based on inverse propensity scoring (IPS), direct methods (DM), and doubly robust (DR) techniques. However, nearly all existing methods rely on frequentist statistics, limiting their ability to capture model uncertainty and reflect it in evaluation outcomes. This work explores the novel research direction of Bayesian statistics for Off-Policy Evaluation in recommendation tasks, motivated by the need for reliable estimators that are more robust to distribution shift, data sparsity, and model misspecification. Three underexplored research directions are identified in this work: (i) using posterior uncertainty from Bayesian reward models to design adaptive hybrid estimators, (ii) explicitly modeling all components of the OPE problem (contexts, actions, and rewards) using a joint probabilistic framework, and (iii) quantifying epistemic uncertainty over policy value estimates via posterior inference. By leveraging the Bayesian framework, the aim is to improve the reliability, interpretability, and safety of offline evaluation protocols, offering a new perspective on one of the most persistent challenges in recommender systems research. This perspective is especially relevant in data-scarce or high-stakes settings, where understanding uncertainty is essential for trustworthy decision-making.

    Full text in ACM Digital Library
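
    As a point of reference for the frequentist estimators named above, the classic inverse propensity scoring (IPS) estimator can be sketched in a few lines. The toy log and target policy below are illustrative assumptions, not taken from the paper:

```python
# Minimal inverse propensity scoring (IPS) sketch for off-policy evaluation.
# All data and policies below are toy assumptions for illustration only.

def ips_estimate(logged, target_policy):
    """Estimate the value of `target_policy` from logs of another policy.

    `logged` holds (context, action, reward, logging_propensity) tuples;
    `target_policy(context, action)` returns the target policy's
    probability of taking `action` in `context`.
    """
    total = 0.0
    for context, action, reward, propensity in logged:
        weight = target_policy(context, action) / propensity  # importance weight
        total += weight * reward
    return total / len(logged)

def always_a0(context, action):
    """Deterministic target policy that always plays action 0."""
    return 1.0 if action == 0 else 0.0

# Toy log from a uniform logging policy over two actions (propensity 0.5).
logged = [
    ("u1", 0, 1.0, 0.5),
    ("u2", 1, 0.0, 0.5),
    ("u3", 0, 1.0, 0.5),
    ("u4", 1, 1.0, 0.5),
]
print(ips_estimate(logged, always_a0))  # -> 1.0
```

    A Bayesian treatment, as the abstract proposes, would additionally place a posterior over quantities this point estimate treats as fixed.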

  • SPOT #6 Beyond Persuasion: Adaptive Warnings and Balanced Explanations for Informed Decision-Making in Recommender Systems
    by Elaheh Jafari

    As recommender systems become deeply embedded in digital platforms, designing explanations that are ethical, effective, and user-centered is increasingly important. Traditional strategies often prioritize persuasiveness or transparency but neglect user agency and cognitive differences. This research explores alternative explanation formats (warnings that highlight potential drawbacks and balanced pros-and-cons summaries) to support more informed and autonomous decision-making. In the first year, we published a paper discussing ethical considerations in explanation design for recommender systems. We then conducted a systematic review of user perceptions, a study of warning messages in mobile app interfaces, and a controlled e-commerce experiment comparing baseline, warning, and pros-and-cons explanations. Results indicate that layered explanations improve decision satisfaction, reduce cognitive load, and better align with individual traits like decision style and need for cognition. Building on these findings, we propose a multi-level explanation approach that combines upfront warnings with on-demand balanced details, adaptable across domains. Future work will explore personalization strategies, real-time adaptivity, and generalizability to domains such as media, news, and job recommendations. This research aims to inform the design of transparent, fair, and trustworthy explanation interfaces in recommender systems.

    Full text in ACM Digital Library

  • SPOT #7 Challenges in Perfume Recommender Systems: Navigating Subjectivity, Context and Sensory Data
    by Elena-Ruxandra Lutan

    Compared to other recommender systems domains, perfume recommendation is highly personalized and particularly challenging due to highly subjective factors and the complex mixture of senses involved. Individual perfume preferences are influenced by subtle elements such as emotional associations, personal memories, and unique biochemistry, making it difficult for users to clearly express their olfactory preferences. This paper provides insight into significant challenges in perfume recommendation that I plan to address in the context of my ongoing PhD project. By exploring these areas, I aim to make a meaningful contribution to the ongoing development of perfume recommender systems.

    Full text in ACM Digital Library

  • SPOT #8 Fair and Transparent Recommender Systems for Advertisements
    by Dina Zilbershtein

    Recommender systems are central to digital platforms, powering content personalization, user engagement, and revenue generation. In advertising, they operate within a multi-stakeholder environment, bringing together viewers, advertisers, and platform providers with often competing objectives. While such systems enhance targeting precision, their opacity raises concerns around fairness, transparency, and trust. This research, conducted in collaboration with RTL Netherlands, focuses on building fair and transparent recommender systems for advertisements, with particular emphasis on Video-on-Demand (VoD) platforms. I investigate algorithmic interventions and explainability techniques aimed at aligning system behavior with stakeholders’ expectations. By addressing tensions between stakeholders’ objectives and challenges of the ad delivery process, this work contributes to the design of ethically responsible advertising systems that balance commercial goals with accountability and user trust.

    Full text in ACM Digital Library

  • SPOT #9 Full-Page Recommender: A Modular Framework for Multi-Carousel Recommendations
    by Jan Kislinger

    Full-page layouts with multiple carousels are widely used in video streaming platforms, yet understudied in recommender systems research. This paper introduces a structured approach to generating such pages by recommending coherent item collections and optimizing their arrangement. We break the problem into subcomponents and propose methods that balance user relevance, diversity, and coherence. We also present an evaluation framework tailored to this setting. We argue that this approach can improve recommendation quality beyond traditional ranked lists.

    Full text in ACM Digital Library

  • SPOT #10 Narrative-Driven Itinerary Recommendation: LLM Integration for Immersive Urban Walking
    by Fabio Ferrero

    Sedentary behavior, dubbed the disease of the 21st century, is a ubiquitous force driving chronic illness. Yet, traditional itinerary and Point-of-Interest (POI) Recommender Systems (RSs) lack engaging elements that motivate routine urban walking. This research proposes a novel framework combining narrative-driven storytelling with location-based RSs to promote physical activity and immersive urban exploration. This approach introduces a bidirectional alignment between POI and itinerary recommendations and LLM-generated narratives, transforming routine urban walks into dynamic journeys where contextually relevant stories unfold across city locations. Unlike sequential POI recommendations, this framework embeds location suggestions within contextually relevant narratives of various genres, simultaneously promoting health benefits and deeper city exploration. The research addresses three research questions using a method that builds a structured knowledge base by extracting entities (e.g., POIs and characters) and semantic links from narrative corpora, enabling semantic alignment between recommended physical locations and story elements. The core aspects of this work are: (i) context-aware itinerary recommendations and personalized story generation, (ii) bidirectional mapping between RSs and story generation, and (iii) system design bridging users’ needs to promote urban walking as a health activity. Evaluation employs comparative user studies measuring quality and engagement, route-narrative semantic alignment, and narrative analysis to validate the proposed integrated approach.

    Full text in ACM Digital Library

  • SPOT #11 Personalized Image Generation for Recommendations Beyond Catalogs
    by Gabriel Alfonso Patron

    Retrieval-based recommender systems are constrained by fixed catalogs, limiting their ability to serve diverse and evolving user preferences. We propose REBECA (REcommendations BEyond CAtalogs), a new class of preference-aware generative models for recommendation that synthesizes images tailored to individual tastes rather than retrieving items. REBECA conditions a diffusion model on users’ feedback (e.g., ratings) to generate personalized image embeddings in CLIP space, which are decoded into images via an adapter-on-adapter architecture that bypasses the need for image captions during training. By leveraging an expressive pre-trained image decoder and a lightweight probabilistic adapter, REBECA enables general-purpose image generation aligned with users’ visual preferences across diverse domains without expensive fine-tuning. We also introduce a new benchmark for personalized generation based on a curated version of the FLICKR-AES dataset, along with two novel personalization metrics tailored to the generative setting. Empirical results show that REBECA produces high-quality, diverse, and preference-aligned outputs, outperforming prompt-based personalization baselines on key personalization and quality metrics. By augmenting traditional retrieval with generative modeling, REBECA opens new opportunities for applications such as content design, personalization-first creative platforms, and preference-aware synthetic media.

    Full text in ACM Digital Library

  • SPOT #12 Recommender Systems for Digital Humanities and Archives: Multistakeholder Evaluation, Scholarly Information Needs, and Multimodal Similarity
    by Florian Atzenhofer-Baumgartner

    Recommender systems (RecSys) in digital humanities (DH) and archives face unique challenges, including balancing competing stakeholder values, serving complex scholarly information needs, and modeling multimodal historical artifacts. This paper reports on ongoing research that tackles these issues through three interconnected strands: (1) the development of co-designed multistakeholder evaluation frameworks that move beyond simple engagement metrics to capture diverse priorities among archivists, researchers, and platform owners; (2) a systematic examination of the information behaviors of humanities scholars to inform user models adapted to exploratory, non-linear research; and (3) the creation of multimodal similarity metrics that exploit scholarly markup, material characteristics, and specialized domain knowledge. Validated through Monasterium.net—the world’s largest charter archive—this research contributes novel approaches to value-driven evaluation, scholarly user modeling, and historical document similarity. It provides methodological frameworks to bridge the computer science and DH communities, and to advance multistakeholder RecSys for complex, non-traditional domains.

    Full text in ACM Digital Library

Full and Short Papers

  • SPOT #13 A Multistakeholder Approach to Value-Driven Co-Design of Recommender Systems Evaluation Metrics in Digital Archives
    by Florian Atzenhofer-Baumgartner, Georg Vogeler, Dominik Kowald
    Topic: Optimization, Evaluation, Robustness of Recommender Systems

    This paper presents the first multistakeholder approach for translating diverse stakeholder values into an evaluation metric setup for Recommender Systems (RecSys) in digital archives. While commercial platforms mainly rely on engagement metrics, cultural heritage domains require frameworks that balance competing priorities among archivists, platform owners, researchers, and other stakeholders. To address this challenge, we conducted high-profile focus groups (5 groups × 5 persons) with upstream, provider, system, consumer, and downstream stakeholders, identifying value priorities across critical dimensions: visibility/representation, expertise adaptation, and transparency/trust. Our analysis shows that stakeholder concerns naturally align with four sequential research funnel stages: discovery, interaction, integration, and impact. The resulting framework addresses domain-specific challenges including collection representation imbalances, non-linear research patterns, and tensions between specialized expertise and broader accessibility. We propose tailored metrics for each stage in this research journey, such as research path quality for discovery, contextual appropriateness for interaction, metadata-weighted relevance for integration, and cross-stakeholder value alignment for impact assessment. Our contributions extend beyond digital archives to the broader RecSys community, offering transferable evaluation approaches for domains where value emerges through sustained engagement rather than immediate consumption.

    Full text in ACM Digital Library

  • SPOT #14 Counterfactual Inference under Thompson Sampling
    by Olivier Jeunen
    Topic: Optimization, Evaluation, Robustness of Recommender Systems

    Recommender systems exemplify sequential decision-making under uncertainty, strategically deciding what content to serve to users, to optimise a range of potential objectives. To balance the explore-exploit trade-off successfully, Thompson sampling provides a natural and widespread paradigm to probabilistically select which action to take. Questions of causal and counterfactual inference, which underpin use-cases like off-policy evaluation, are not straightforward to answer in these contexts. Specifically, whilst most existing estimators rely on action propensities, these are not readily available under Thompson sampling procedures. In this work, we derive exact and efficiently computable expressions for action propensities under a variety of parameter and outcome distributions, enabling the use of off-policy estimators in such settings. This opens up a range of practical use-cases where counterfactual inference is crucial, including unbiased offline evaluation of recommender systems, as well as general applications of causal inference in online advertising, personalisation, and beyond.

    Full text in ACM Digital Library
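
    To see why propensities are awkward under Thompson sampling, note that an action’s propensity is the probability that its sampled reward is the largest, which generally has no closed form and is often approximated by Monte Carlo. The sketch below shows that naive Monte Carlo baseline under an assumed independent-Gaussian posterior per arm; the exact, efficiently computable expressions are the paper’s contribution and are not reproduced here:

```python
import random

def mc_propensity(post_means, post_stds, action, n_samples=20000):
    """Naive Monte Carlo estimate of the Thompson-sampling propensity of
    `action`: the probability that its sampled reward is the largest.
    Assumes independent Gaussian posteriors per arm (an illustrative choice).
    """
    wins = 0
    for _ in range(n_samples):
        draws = [random.gauss(m, s) for m, s in zip(post_means, post_stds)]
        if max(range(len(draws)), key=draws.__getitem__) == action:
            wins += 1
    return wins / n_samples

random.seed(0)
# Two arms with identical posteriors: each propensity should be near 0.5.
p = mc_propensity([0.0, 0.0], [1.0, 1.0], action=0)
print(p)
```

    Such sampled propensities would then feed importance-weighted off-policy estimators like IPS.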

  • SPOT #15 Estimating Quantum Execution Requirements for Feature Selection in Recommender Systems Using Extreme Value Theory
    by Jiayang Niu, Qihan Zou, Jie Li, Ke Deng, Mark Sanderson, Yongli Ren
    Topic: Optimization, Evaluation, Robustness of Recommender Systems

    Recent advances in quantum computing have significantly accelerated research into quantum-assisted information retrieval and recommender systems, particularly in solving feature selection problems by formulating them as Quadratic Unconstrained Binary Optimization (QUBO) problems executable on quantum hardware. However, while existing work primarily focuses on effectiveness and efficiency, it often overlooks the probabilistic and noisy nature of real-world quantum hardware. In this paper, we propose a solution based on Extreme Value Theory (EVT) to quantitatively assess the usability of quantum solutions. Specifically, given a fixed problem size, the proposed method estimates the number of executions (shots) required on a quantum computer to reliably obtain a high-quality solution, which is comparable to or better than that of classical baselines on conventional computers. Experiments conducted across multiple quantum platforms (including two simulators and two physical quantum processors) demonstrate that our method effectively estimates the number of required runs to obtain satisfactory solutions on two widely used benchmark datasets.

    Full text in ACM Digital Library
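
    For intuition about what such a shot estimate involves, a much simpler baseline than the paper’s EVT method treats each shot as an independent trial: if one shot returns a good solution with probability p, the number of shots needed for a target confidence follows from the geometric distribution. The function below is this illustrative baseline only, not the proposed estimator:

```python
import math

def shots_needed(p_good, confidence=0.99):
    """Shots required so that at least one execution returns a good
    solution with probability `confidence`, assuming each shot succeeds
    independently with probability `p_good` (geometric-distribution model).
    """
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_good))

# If 5% of shots yield a good solution, ~90 shots give 99% confidence.
print(shots_needed(0.05))  # -> 90
```

    The EVT approach instead models the distribution of the best solution quality observed across shots, which is what makes it applicable when per-shot success probabilities are unknown.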

  • SPOT #16 HiDePCC: A Novel Dual-Pronged Untargeted Attack on Federated Recommendation via Gradient Perturbation and Cluster Crafting
    by Yamini Jha, Krishna Tewari, Sukomal Pal
    Topic: Optimization, Evaluation, Robustness of Recommender Systems

    Federated recommender systems offer privacy benefits by decentralizing user data and preventing direct data sharing among clients. Although this architecture limits the effectiveness of traditional attack strategies, it remains susceptible to subtle adversarial attacks that can significantly degrade the accuracy of recommendations. To expose these vulnerabilities, we propose a novel untargeted attack (HiDePCC) that degrades overall system performance through a dual-pronged strategy combining adaptive gradient perturbation and hierarchical cluster-based embedding manipulation. We apply adaptive perturbations to item gradients during training and employ hierarchical clustering using several linkage methods to form coherent item clusters. Within these clusters, we converge item embeddings and manipulate boundary points to induce item misclassification. This causes the system to assign similar scores to clustered items and misrank them. We evaluated our attack on two benchmark datasets, MovieLens (with 0.5% and 1% malicious users) and Gowalla (1%), using Matrix Factorization as the base recommendation model and assessing the impact under various robust aggregation techniques. We also examined several permutations of configurations using hierarchical clustering, adaptive gradient perturbation and boundary point misclassification. Our results show that the complete setup outperforms existing state-of-the-art untargeted attacks, with performance drops for HR@5 ranging from 13.93% to 68.02% on MovieLens and from 40.02% to 99.76% on the Gowalla dataset. These findings reveal important vulnerabilities in federated recommendation systems.

    Full text in ACM Digital Library

  • SPOT #17 Off-Policy Evaluation of Candidate Generators in Two-Stage Recommender Systems
    by Peiyao Wang, Zhan Shi, Amina Shabbeer, Ben London
    Topic: Optimization, Evaluation, Robustness of Recommender Systems

    We study offline evaluation of two-stage recommender systems, focusing on the first stage, candidate generation. Traditionally, candidate generators have been evaluated in terms of standard information retrieval metrics, using curated or heuristically labeled data, which does not always reflect their true impact on user experience or business metrics. We instead take a holistic view, measuring their effectiveness with respect to the downstream recommendation task, using data logged from past user interactions with the system. Using the contextual bandit formalism, we frame this evaluation task as off-policy evaluation (OPE) with a new action set induced by a new candidate generator. To the best of our knowledge, ours is the first study to examine evaluation of candidate generators through the lens of OPE. We propose two importance-weighting methods to measure the impact of a new candidate generator using data collected from a downstream task. We analyze the asymptotic properties of these methods and derive expressions for their respective biases and variances. This analysis illuminates a procedure to optimize the estimators so as to reduce bias. Finally, we present empirical results that demonstrate the estimators’ efficacy on synthetic and benchmark data. We find that our proposed methods achieve lower bias with comparable or reduced variance relative to baseline approaches that do not account for the new action set.

    Full text in ACM Digital Library

  • SPOT #18 Correcting the LogQ Correction: Revisiting Sampled Softmax for Large-Scale Retrieval
    by Kirill Khrylchenko, Vladimir Baikalov, Sergei Makeev, Artem Matveev, Sergei Liamaev
    Topic: Bias, Fairness & Privacy

    Two-tower neural networks are a popular architecture for the retrieval stage in recommender systems. These models are typically trained with a softmax loss over the item catalog. However, in web-scale settings, the item catalog is often prohibitively large, making full softmax infeasible. A common solution is sampled softmax, which approximates the full softmax using a small number of sampled negatives. One practical and widely adopted approach is to use in-batch negatives, where negatives are drawn from items in the current mini-batch. However, this introduces a bias: items that appear more frequently in the batch (i.e., popular items) are penalized more heavily. To mitigate this issue, a popular industry technique known as logQ correction adjusts the logits during training by subtracting the log-probability of an item appearing in the batch. This correction is derived by analyzing the bias in the gradient and applying importance sampling, effectively twice, using the in-batch distribution as a proposal distribution. While this approach improves model quality, it does not fully eliminate the bias. In this work, we revisit the derivation of logQ correction and show that it overlooks a subtle but important detail: the positive item in the denominator is not Monte Carlo-sampled — it is always present with probability 1. We propose a refined correction formula that accounts for this. Notably, our loss introduces an interpretable sample weight that reflects the model’s uncertainty — the probability of misclassification under the current parameters. We evaluate our method on both public and proprietary datasets, demonstrating consistent improvements over the standard logQ correction.

    Full text in ACM Digital Library
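
    The standard logQ correction that the paper revisits can be sketched in a few lines: each in-batch logit is shifted by the negative log of the item’s sampling probability before the softmax. The sketch below shows this standard variant with toy batch probabilities as assumptions; the refined correction the paper derives is not reproduced here:

```python
import math

def logq_sampled_softmax_loss(scores, item_ids, sample_prob, positive_idx):
    """Sampled-softmax loss over in-batch items with the standard logQ
    correction: every logit is shifted by -log q(item), where q(item) is
    the item's probability of appearing in the batch.
    """
    corrected = [s - math.log(sample_prob[i]) for s, i in zip(scores, item_ids)]
    log_norm = math.log(sum(math.exp(c) for c in corrected))
    return -(corrected[positive_idx] - log_norm)

# Toy batch: item 7 is popular (high batch probability), item 3 is rare,
# so the correction boosts the rare negative's effective logit.
sample_prob = {7: 0.5, 3: 0.1}
loss = logq_sampled_softmax_loss(
    scores=[2.0, 1.0], item_ids=[7, 3], sample_prob=sample_prob, positive_idx=0)
print(loss)
```

    The paper’s observation is that applying this shift to the positive item as well mistreats a term that is present with probability 1, not Monte Carlo-sampled.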

  • SPOT #19 Exploring the Effect of Context-Awareness and Popularity Calibration on Popularity Bias in POI Recommendations
    by Andrea Forster, Simone Kopeinik, Denis Helic, Stefan Thalmann, Dominik Kowald
    Topic: Bias, Fairness & Privacy

    Point-of-interest (POI) recommender systems help users discover relevant locations, but their effectiveness is often compromised by popularity bias, which disadvantages less popular yet potentially meaningful places. This paper addresses this challenge by evaluating the effectiveness of context-aware models and calibrated popularity techniques as strategies for mitigating popularity bias. Using four real-world POI datasets (Brightkite, Foursquare, Gowalla, Yelp), we analyze the individual and combined effects of these approaches on recommendation accuracy and popularity bias. Our results reveal that context-aware models cannot be considered a uniform solution, as the models studied exhibit divergent impacts on accuracy and bias. In contrast, calibration techniques can effectively align recommendation popularity with user preferences, provided there is a careful balance between accuracy and bias mitigation. Notably, the combination of calibration and context-awareness yields recommendations that balance accuracy and close alignment with the users’ popularity profiles, i.e., popularity calibration.

    Full text in ACM Digital Library
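
    The idea of popularity calibration, aligning the popularity of recommendations with a user’s own popularity profile, can be illustrated with a deliberately simple miscalibration measure (the calibration techniques evaluated in the paper are more refined than this sketch):

```python
def popularity_share(items, popular):
    """Fraction of `items` that belong to the popular head set."""
    return sum(1 for item in items if item in popular) / len(items)

def miscalibration(history, recommendations, popular):
    """Absolute gap between the popularity share a user consumes and the
    popularity share they are recommended (0 = perfectly calibrated).
    """
    return abs(popularity_share(history, popular)
               - popularity_share(recommendations, popular))

popular = {"a", "b"}            # head items (toy choice)
history = ["a", "c", "d", "e"]  # the user consumes 25% popular items
recs = ["a", "b", "a", "c"]     # but is recommended 75% popular items
print(miscalibration(history, recs, popular))  # -> 0.5
```

    Calibrated popularity methods rerank so that this kind of gap shrinks while accuracy loss stays bounded.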

  • SPOT #20 On Inherited Popularity Bias in Cold-Start Item Recommendation
    by Gregor Meehan, Johan Pauwels
    Topic: Bias, Fairness & Privacy

    Collaborative filtering (CF) recommender systems struggle in the item cold-start scenario, i.e. with recommending new or unseen items. Cold-start item recommenders, designed to address this challenge, are typically trained with supervision from warm CF models, so that collaborative and content information from the available interaction data can also be leveraged for cold items. However, since they learn to replicate the behavior of CF methods, cold-start systems may also learn to imitate their predictive biases. In this paper, we examine how cold-start models can inherit popularity bias, a common cause of recommender system unfairness arising when CF models overfit to more popular items to maximize overall accuracy, leaving rarer items underrepresented. We show that cold-start recommenders not only mirror the popularity biases of warm models, but are in fact affected more severely because they cannot infer popularity from interaction data, so instead attempt to estimate it based solely on content features. Through experiments on three real-world datasets, we analyze the impact of this issue on several cold-start methods across multiple training paradigms. We then describe a simple post-processing bias mitigation method which, by using embedding magnitude as a proxy for popularity, can produce more balanced recommendations with limited harm to cold-start accuracy.

    Full text in ACM Digital Library
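
    The post-processing idea of using embedding magnitude as a popularity proxy can be sketched as follows: dividing each item embedding by its L2 norm removes the magnitude component from dot-product scores, so ranking depends on direction (content match) rather than popularity-inflated length. All vectors below are illustrative assumptions, and this is a sketch of the general idea rather than the paper’s exact method:

```python
import math

def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))

def magnitude_normalized_scores(user_vec, item_vecs):
    """Score items after dividing each item embedding by its L2 norm,
    treating magnitude as a popularity proxy, so that ranking reflects
    direction (content match) rather than popularity-inflated length.
    """
    return [sum(u * x for u, x in zip(user_vec, item)) / l2_norm(item)
            for item in item_vecs]

user = [1.0, 0.0]
popular_item = [2.0, 2.0]  # large magnitude: popularity-inflated
niche_item = [0.9, 0.1]    # small magnitude but better aligned with the user
raw = [sum(u * x for u, x in zip(user, item)) for item in (popular_item, niche_item)]
adj = magnitude_normalized_scores(user, [popular_item, niche_item])
print(raw[0] > raw[1], adj[0] < adj[1])  # -> True True
```

    Here the raw dot product favors the long popular embedding, while normalization lets the better-aligned niche item win.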

  • SPOT #21 Popularity-Bias Vulnerability: Semi-Supervised Label Inference Attack on Federated Recommender Systems
    by Kenji Shinoda, Takeyuki Sasai, Shintaro Fukushima
    Topic: Bias, Fairness & Privacy

    Organizations are increasingly applying Vertical Federated Learning (VFL) to enhance recommender systems without sharing raw data among themselves. However, partial outputs in VFL continue to introduce significant privacy risks. In this study, we propose a novel label inference attack specifically tailored for VFL-based recommender systems, leveraging two common characteristics: (1) item popularity often follows a power-law distribution, and (2) random negative sampling is commonly used for implicit feedback, a substitute for non-existing true labels. By combining partial local information from VFL with this prior knowledge, a malicious party can construct a semi-supervised learning pipeline. Experimental results on three real-world datasets demonstrate that our approach achieves higher label inference performance than existing attacks. These findings demonstrate the need for more robust privacy-preserving mechanisms in federated recommender systems.

    Full text in ACM Digital Library

  • SPOT #22 Privacy-Preserving Social Recommendation: Privacy Leakage and Countermeasure
    by Yuyue Chen, Peng Yang, Zoe Lin Jiang, Wenhao Wu, Junbin Fang, Xuan Wang, Chuanyi Liu
    Topic: Bias, Fairness & Privacy

    Social recommendation systems generally utilize two types of data, user-item interaction matrices (R) from rating platform (P0), and user-user social graphs (S) from social platform (P1). Considering user privacy that neither R nor S can be directly shared, Chen et al. introduced the Secure Social Recommendation (SeSoRec) framework with the Secret Sharing-based Matrix Multiplication (SSMM) protocol. However, we find that the leakage of intermediate information introduced by SSMM will eventually lead to the leakage of S to P0, which challenges the privacy guarantees of SeSoRec. This work firstly identifies that the claimed “innocuous” leakage in SeSoRec originates from reusing the same One-Time Pad key during two randomization phases in SSMM, with formal proof that SSMM violates semi-honest security. Secondly, this work proposes the Two-Time Pad Attack with two reconstruction algorithms to evaluate the severity of the leakage. The Two-Time Pad Attack can extract column-wise sums and row-wise differences of intermediate matrices that are closely related to R or S. The Sparse Matrix Reconstruction (SMR) algorithm can achieve 99.35%, 83.83%, and 77.14% reconstruction rates for non-zero entries in S on FilmTrust, Epinions, and Douban datasets, respectively. The Grayscale Image Reconstruction (GIR) algorithm can successfully recover MNIST image contours. Thirdly, when the number of columns/rows of the input matrix A/B in SSMM is odd (requiring zero-padding to an even dimension), this work proposes the Zero-Padding Attack which can directly expose the last column/row of A/B. Finally, this work proposes the Privacy-Preserving Matrix Multiplication (PPMM) protocol with experimental demonstration as a replacement for SSMM, which eliminates such leakage while maintaining efficiency.

    Full text in ACM Digital Library

  • SPOT #23RecPS: Privacy Risk Scoring for Recommender Systems
    by Jiajie He, Yuechun Gu, Keke Chen
    Topic: Bias, Fairness & Privacy

    Recommender systems (RecSys) have become an essential component of many web applications. The core of the system is a recommendation model trained on highly sensitive user-item interaction data. While privacy-enhancing techniques are actively studied in the research community, real-world model development still relies on minimal privacy protection, e.g., via controlled access. Users of such systems should have the right to choose not to share highly sensitive interactions. However, no existing method lets a user know which interactions are more sensitive than others. Quantifying the privacy risk of RecSys training data is therefore a critical step towards privacy-aware RecSys model development and deployment. We propose a membership-inference-attack (MIA) based privacy scoring method, RecPS, to measure privacy risks at the interaction and user levels. The RecPS interaction-level score definition is motivated by and derived from differential privacy, and is then extended to the user-level scoring method. A critical component is the interaction-level MIA method RecLiRA, which provides high-quality membership estimation. Extensive experiments on well-known benchmark datasets and RecSys models show the unique features and benefits of RecPS scoring in risk assessment and RecSys model unlearning.

    Full text in ACM Digital Library

  • SPOT #24Stairway to Fairness: Connecting Group and Individual Fairness
    by Theresia Veronika Rampisela, Maria Maistro, Tuukka Ruotsalo, Falk Scholer, Christina Lioma
    Topic: Bias, Fairness & Privacy

    Fairness in recommender systems (RSs) is commonly categorised into group fairness and individual fairness. However, there is no established scientific understanding of the relationship between the two, as prior work has used different evaluation measures or objectives for each fairness type, preventing a proper comparison of the two. As a result, it is currently unknown how increasing one type of fairness affects the other. To fill this gap, we study the relationship between group and individual fairness through a comprehensive comparison of evaluation measures that can be used for both fairness types. Our experiments with 8 RSs across 3 datasets show that recommendations that are highly fair for groups can be very unfair for individuals. Our finding is novel and useful for RS practitioners aiming to improve the fairness of their systems.
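    A toy numeric sketch of why the two notions can diverge (the exposure numbers and metric choices here are hypothetical illustrations, not the paper's measures):

```python
import numpy as np

# Exposure each item receives in a recommendation slate (hypothetical).
# Items 0-2 belong to provider group A, items 3-5 to group B.
exposure = np.array([30.0, 0.0, 0.0, 10.0, 10.0, 10.0])
groups = np.array([0, 0, 0, 1, 1, 1])

# Group-level view: total exposure per group is perfectly equal here.
group_totals = np.array([exposure[groups == g].sum() for g in (0, 1)])
assert group_totals[0] == group_totals[1]

def gini(x):
    """Gini coefficient of a non-negative vector: 0 = perfectly equal."""
    x = np.sort(x)
    n = len(x)
    cum = np.cumsum(x)
    return (n + 1 - 2 * (cum / cum[-1]).sum()) / n

# Individual-level view: within group A one item takes all the exposure,
# so the distribution over individuals is far from equal.
print(round(gini(exposure), 3))  # 0.5
```

    The slate is perfectly fair at the group level yet markedly unfair at the individual level, which is the tension the paper quantifies.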

    Full text in ACM Digital Library

  • SPOT #25Collaborative Interest Modeling in Recommender Systems
    by Yu-Ting Cheng, Yu-Yen Ho, Jyun-Yu Jiang
    Topic: Collaborative Filtering & Graph-based Recommendation

    In this paper, we introduce Collaborative Interest Modeling (COIN), a novel approach to tackle interest entanglement and sparse interest representations within multi-interest learning for recommender systems. COIN leverages collaborative signals from behaviorally similar interests to refine interest embeddings and enhance recommendation quality, unlike existing methods that primarily focus on individual user-item interactions. The approach aligns collaborative neighbors with sparse interests, employs a structured routing mechanism to distinguish multiple interests, and avoids routing collapse. Experimental results on three real-world datasets demonstrate that COIN outperforms state-of-the-art models by 4.71% to 15.13% in key recommendation metrics, such as recall, NDCG, and hit ratio.

    Full text in ACM Digital Library

  • SPOT #26Measuring Interaction-Level Unlearning Difficulty for Collaborative Filtering
    by Haocheng Dou, Tao Lian, Xin Xin
    Topic: Collaborative Filtering & Graph-based Recommendation

    The growing emphasis on data privacy and user controllability mandates that recommendation models support the removal of specified data, known as recommendation unlearning (RU). Although model retraining is often regarded as the gold standard for machine unlearning, it cannot attain complete unlearning in collaborative filtering recommendation due to the interdependencies between user-item interactions. To this end, we introduce the concept of interaction-level unlearning difficulty, which serves as a foresighted indicator of the unlearning incompleteness, or actual unlearning effectiveness, after forgetting each interaction. Through extensive experiments with retraining and model-agnostic unlearning methods, we identify two interpretable data characteristics that serve as useful unlearning difficulty indicators: the Embedding Entanglement Index (EEI) and the Subgraph Average Degree (AD). Both correlate strongly with existing membership inference metrics focused on data removal, as well as with our proposed unlearning effectiveness metrics from the recommendation perspective: Score Shift, UnlearnMRR, and UnlearnRecall. In addition, we investigate the efficacy of an unlearning enhancement technique named Extra Deletion in handling unlearning requests of different difficulty levels. The results show that more related interactions need to be additionally deleted to achieve acceptable unlearning effectiveness for difficult unlearning requests, while fewer or no extra deletions are needed for easier-to-forget requests. This study provides a novel perspective for advancing the development of more tailored RU methods.
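    One plausible reading of a degree-based difficulty indicator can be sketched as follows (a simplification; the paper's exact definition of the Subgraph Average Degree may differ, and the graph here is hypothetical):

```python
from collections import defaultdict

# Bipartite interaction graph as (user, item) edges (hypothetical).
edges = [("u1", "i1"), ("u1", "i2"), ("u2", "i1"), ("u3", "i1"), ("u3", "i3")]

degree = defaultdict(int)
for u, i in edges:
    degree[u] += 1
    degree[i] += 1

def subgraph_average_degree(edge):
    """Average degree of the interaction's local subgraph, reduced here
    to its two endpoints for simplicity."""
    u, i = edge
    return (degree[u] + degree[i]) / 2

# An interaction touching a high-degree item is more entangled with the
# rest of the graph, hence plausibly harder to unlearn completely.
print(subgraph_average_degree(("u2", "i1")), subgraph_average_degree(("u3", "i3")))
```

    The intuition is that forgetting an edge embedded in a dense neighbourhood leaves more residual signal in collaboratively trained embeddings than forgetting an isolated one.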

    Full text in ACM Digital Library

  • SPOT #27Non-parametric Graph Convolution for Re-ranking in Recommendation Systems
    by Zhongyu Ouyang, Mingxuan Ju, Soroush Vosoughi, Yanfang Ye
    Topic: Collaborative Filtering & Graph-based Recommendation

    Graph knowledge has been proven effective in enhancing item rankings in recommender systems (RecSys), particularly during the retrieval stage. However, its application in the ranking stage, where richer contextual information (e.g., user, item, and interaction features) is available, remains underexplored. A major challenge lies in the substantial computational cost associated with repeatedly retrieving neighborhood information from billions of items stored in distributed systems. This resource-intensive requirement makes it difficult to scale graph-based methods during model training, and apply them in practical RecSys. To bridge this gap, we first demonstrate that incorporating graphs in the ranking stage improves ranking qualities. Notably, while the improvement is evident, we show that the substantial computational overheads entailed by graphs are prohibitively expensive for real-world recommendations. In light of this, we propose a non-parametric strategy that utilizes graph convolution for re-ranking only during test time. Our strategy circumvents the notorious computational overheads from graph convolution during training, and utilizes structural knowledge hidden in graphs on-the-fly during testing. It can be used as a plug-and-play module and easily employed to enhance the ranking ability of various ranking layers of a real-world RecSys with significantly reduced computational overhead. Through comprehensive experiments across four benchmark datasets with varying levels of sparsity, we demonstrate that our strategy yields noticeable improvements (i.e., 8.1% on average) during testing time with little to no additional computational overheads (i.e., 0.5% on average). Anonymous code: https://anonymous.4open.science/r/RecBole_NonParamGC-EBBE.
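    The general idea of parameter-free, test-time score smoothing over a graph can be sketched as follows (a simplification with hypothetical numbers and a hypothetical item-item graph, not the authors' exact operator):

```python
import numpy as np

# Ranking-layer scores for 4 candidate items (hypothetical).
scores = np.array([0.9, 0.2, 0.1, 0.05])

# Row-normalized item-item adjacency from co-interactions (hypothetical).
A = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
])

def rerank(scores, A, alpha=0.3):
    """One non-parametric graph-convolution step: blend each item's score
    with its neighbours' scores; no parameters, no training involved."""
    return (1 - alpha) * scores + alpha * A @ scores

print(rerank(scores, A))
```

    Because the smoothing is only applied at inference, the training pipeline never pays the cost of neighbourhood retrieval, which is the efficiency argument the paper makes.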

    Full text in ACM Digital Library

  • SPOT #28Rethinking Overconfidence in VAEs: Can Label Smoothing Help?
    by Woo-Seong Yun, YeoJun Choi, Yoon-Sik Cho
    Topic: Collaborative Filtering & Graph-based Recommendation

    By leveraging the expressive power of deep generative models, Variational Autoencoder (VAE)-based recommender models have demonstrated competitive performance. However, deep neural networks (DNNs) tend to exhibit overconfidence in their predictive distributions as training progresses. This issue is further exacerbated by two inherent characteristics of collaborative filtering (CF): (1) extreme data sparsity and (2) implicit feedback. Despite its importance, there has been a lack of systematic study into this problem. To fill the gap, this paper explores the above limitations with label smoothing (LS) from both theoretical and empirical aspects. Our extensive analysis demonstrates that overconfidence leads to embedding collapse, where latent representations collapse into a narrow subspace. Furthermore, we investigate the conditions under which LS helps recommendation, and observe that the optimal LS factor decreases proportionally with data sparsity. To the best of our knowledge, this is the first study in VAE-based CF that discovers the relationship between overconfidence and embedding collapse, and highlights the necessity of explicitly addressing them.
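    Binary label smoothing itself is simple; one common two-sided variant (a generic sketch, not necessarily the paper's exact formulation):

```python
import numpy as np

def smooth_labels(y, eps):
    """Two-sided binary label smoothing: 1 -> 1 - eps, 0 -> eps.
    Softened targets penalize overconfident predictive distributions."""
    return y * (1 - eps) + (1 - y) * eps

y = np.array([1.0, 0.0, 1.0, 0.0])
print(smooth_labels(y, eps=0.1))  # [0.9 0.1 0.9 0.1]
```

    The paper's finding that the optimal smoothing factor shrinks with data sparsity suggests tuning eps per dataset rather than fixing it globally.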

    Full text in ACM Digital Library

  • SPOT #29SGCL: Unifying Self-Supervised and Supervised Learning for Graph Recommendation
    by Weizhi Zhang, Liangwei Yang, Zihe Song, Henry Peng Zou, Ke Xu, Yuanjie Zhu, Philip S. Yu
    Topic: Collaborative Filtering & Graph-based Recommendation

    Recommender systems (RecSys) are essential for online platforms, providing personalized suggestions to users within a vast sea of information. Self-supervised graph learning seeks to harness high-order collaborative filtering signals through unsupervised augmentation on the user-item bipartite graph, primarily leveraging a multi-task learning framework that includes both supervised recommendation loss and self-supervised contrastive loss. However, this separate design introduces additional graph convolution processes and creates inconsistencies in gradient directions due to disparate losses, resulting in prolonged training times and sub-optimal performance. In this study, we introduce a unified framework of Supervised Graph Contrastive Learning for recommendation (SGCL) to address these issues. SGCL uniquely combines the training of recommendation and unsupervised contrastive losses into a cohesive supervised contrastive learning loss, aligning both tasks within a single optimization direction for exceptionally fast training. Extensive experiments on three real-world datasets show that SGCL outperforms state-of-the-art methods, achieving superior accuracy and efficiency.
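    The unifying idea, collapsing the supervised and contrastive signals into one objective, can be sketched with a simplified supervised contrastive loss (a generic variant in the style of SupCon, with hypothetical embeddings; not SGCL's exact loss):

```python
import numpy as np

def supcon_loss(z, labels, tau=0.2):
    """Simplified supervised contrastive loss: every same-label pair is a
    positive, so supervision and contrast share one objective (and one
    gradient direction) instead of two competing losses."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                 # exclude self-similarity
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    positives = (labels[:, None] == labels[None, :]) & ~np.eye(len(z), dtype=bool)
    return -np.where(positives, log_prob, 0.0).sum() / positives.sum()

# Well-separated same-label embeddings yield a lower loss than mixed ones.
z = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
print(supcon_loss(z, np.array([0, 0, 1, 1])) < supcon_loss(z, np.array([0, 1, 0, 1])))
```

    A single loss also means a single graph-convolution pass per step, which is the source of the training-speed gains the abstract reports.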

    Full text in ACM Digital Library

  • SPOT #30An Off-Policy Learning Approach for Steering Sentence Generation towards Personalization
    by Haruka Kiyohara, Daniel Cao, Yuta Saito, Thorsten Joachims
    Topic: LLMs, Embeddings & Conversational Recommender Systems

    We study the problem of personalizing the output of a large language model (LLM) by training on logged bandit feedback (e.g., personalizing movie descriptions based on likes). While one may naively treat this as a standard off-policy contextual bandit problem, the large action space and the large parameter space make naive applications of off-policy learning (OPL) infeasible. We overcome this challenge by learning a prompt policy for a frozen LLM that has only a modest number of parameters. The proposed Direct Sentence Off-policy gradient (DSO) effectively propagates the gradient to the prompt policy space by leveraging the smoothness and overlap in the sentence space. Consequently, DSO substantially reduces variance while also suppressing bias. Empirical results on our newly established suite of benchmarks, called OfflinePrompts, demonstrate the effectiveness of the proposed approach in generating personalized descriptions for movie recommendations, particularly when the number of candidate prompts is large and the reward noise is high.

    Full text in ACM Digital Library

  • SPOT #31Beyond Visit Trajectories: Enhancing POI Recommendation via LLM-Augmented Text and Image Representations
    by Zehui Wang, Wolfram Höpken, Dietmar Jannach
    Topic: LLMs, Embeddings & Conversational Recommender Systems

    Recommender systems often rely on user visit trajectories, but the integration and representation of diverse side information remains a key challenge. Recent advances in large language models (LLMs) have enabled new strategies for enhancing this process. This study investigates how different types of side information support next Point-of-Interest (POI) recommendation, using a business-level dataset derived from Yelp. An LLM-based summarization pipeline is introduced to convert unstructured reviews and visual content into structured text via instruction-tuned models. These summaries, together with other business features, are each encoded into fixed-length embeddings. Based on these embeddings, four input configurations are constructed for BERT4Rec: trajectory-only, single feature categories, pairwise category combinations, and full combination. Our results show that side information consistently improves performance over the trajectory-only baseline, and their combinations exhibit useful synergies. These findings highlight the importance of modality-aware design and point toward adaptive fusion and selective use of side information. To support further research, we publicly release a multimodal POI recommendation dataset based on the Yelp Open Dataset.

    Full text in ACM Digital Library

  • SPOT #32Consistent Explainers or Unreliable Narrators? Understanding LLM-generated Group Recommendations
    by Cedric Waterschoot, Nava Tintarev, Francesco Barile
    Topic: LLMs, Embeddings & Conversational Recommender Systems

    Large Language Models (LLMs) are increasingly being implemented as joint decision-makers and explanation generators for Group Recommender Systems (GRS). In this paper, we evaluate these recommendations and explanations by comparing them to social choice-based aggregation strategies. Our results indicate that LLM-generated recommendations often resembled those produced by Additive Utilitarian (ADD) aggregation. However, the explanations typically referred to averaging ratings (resembling but not identical to ADD aggregation). Group structure, uniform or divergent, did not impact the recommendations. Furthermore, LLMs regularly claimed additional criteria such as user or item similarity, diversity, or used undefined popularity metrics or thresholds. Our findings have important implications for LLMs in the GRS pipeline as well as standard aggregation strategies. Additional criteria in explanations were dependent on the number of ratings in the group scenario, indicating potential inefficiency of standard aggregation methods at larger item set sizes. Additionally, inconsistent and ambiguous explanations undermine transparency and explainability, which are key motivations behind the use of LLMs for GRS.
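    The Additive Utilitarian (ADD) strategy that the LLM outputs resembled is straightforward to state precisely (toy ratings; a generic sketch of the aggregation, not the authors' experimental setup):

```python
# Additive Utilitarian (ADD) aggregation: score an item for the group
# by summing the members' individual ratings, then rank by that sum.
group_ratings = {
    "alice": {"m1": 5, "m2": 3, "m3": 1},
    "bob":   {"m1": 3, "m2": 4, "m3": 5},
    "carol": {"m1": 4, "m2": 4, "m3": 2},
}

def add_aggregate(ratings):
    totals = {}
    for member in ratings.values():
        for item, r in member.items():
            totals[item] = totals.get(item, 0) + r
    return sorted(totals, key=totals.get, reverse=True)

print(add_aggregate(group_ratings))  # ['m1', 'm2', 'm3']
```

    Since group size is constant, summing and averaging ratings yield the same ranking, which is consistent with the observation that the explanations referred to averaging while the outputs resembled ADD.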

    Full text in ACM Digital Library

  • SPOT #33Do We Really Need Specialization? Evaluating Generalist Text Embeddings for Zero-Shot Recommendation and Search
    by Matteo Attimonelli, Alessandro De Bellis, Claudio Pomo, Dietmar Jannach, Eugenio Di Sciascio, Tommaso Di Noia
    Topic: LLMs, Embeddings & Conversational Recommender Systems

    Pre-trained language models (PLMs) are widely used to derive semantic representations from item metadata in recommendation and search. In sequential recommendation, PLMs enhance ID-based embeddings through textual metadata, while in product search, they align item characteristics with user intent. Recent studies suggest task and domain-specific fine-tuning are needed to improve representational power. This paper challenges this assumption, showing that Generalist Text Embedding Models (GTEs), pre-trained on large-scale corpora, can guarantee strong zero-shot performance without specialized adaptation. Our experiments demonstrate that GTEs outperform traditional and fine-tuned models in both sequential recommendation and product search. We attribute this to a superior representational power, as they distribute features more evenly across the embedding space. Finally, we show that compressing embedding dimensions by focusing on the most informative directions (e.g., via PCA) effectively reduces noise and improves the performance of specialized models.
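    The embedding-compression step the authors mention (keeping the most informative directions, e.g., via PCA) can be sketched as follows; the dimensions and data here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical 768-d embeddings for 200 items, with variance
# decaying across dimensions (as in real text embeddings).
emb = rng.normal(size=(200, 768)) * np.linspace(3.0, 0.01, 768)

# PCA via SVD on centered embeddings: keep the top-k directions.
centered = emb - emb.mean(axis=0)
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
k = 64
compressed = centered @ Vt[:k].T   # 768 -> 64 dims

print(compressed.shape)  # (200, 64)
```

    Projecting onto the leading principal directions discards low-variance directions, which the paper argues mostly carry noise for specialized models.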

    Full text in ACM Digital Library

  • SPOT #34Enhancing Sequential Recommender with Large Language Models for Joint Video and Comment Recommendation
    by Bowen Zheng, Zihan Lin, Enze Liu, Chen Yang, Enyang Bai, Cheng Ling, Han Li, Wayne Xin Zhao, Ji-Rong Wen
    Topic: LLMs, Embeddings & Conversational Recommender Systems

    Nowadays, reading or writing comments on captivating videos has emerged as a critical part of the viewing experience on online video platforms. However, existing recommender systems primarily focus on users’ interaction behaviors with videos, neglecting comment content and interaction in user preference modeling. In this paper, we propose a novel recommendation approach called LSVCR that utilizes user interaction histories with both videos and comments to jointly perform personalized video and comment recommendation. Specifically, our approach comprises two key components: a sequential recommendation (SR) model and a supplemental large language model (LLM) recommender. The SR model functions as the primary recommendation backbone (retained in deployment) of our method for efficient user preference modeling. Concurrently, we employ an LLM as the supplemental recommender (discarded in deployment) to better capture underlying user preferences derived from heterogeneous interaction behaviors. In order to integrate the strengths of the SR model and the supplemental LLM recommender, we introduce a two-stage training paradigm. The first stage, personalized preference alignment, aims to align the preference representations from both components, thereby enhancing the semantics of the SR model. The second stage, recommendation-oriented fine-tuning, involves fine-tuning the alignment-enhanced SR model according to specific objectives. Extensive experiments in both video and comment recommendation tasks demonstrate the effectiveness of LSVCR. Moreover, online A/B testing on a real-world video platform verifies the practical benefits of our approach. In particular, we attain a cumulative gain of 4.13% in comment watch time.

    Full text in ACM Digital Library

  • SPOT #35Failure Prediction in Conversational Recommendation Systems
    by Maria Vlachou
    Topic: LLMs, Embeddings & Conversational Recommender Systems

    In a Conversational Image Recommendation task, users can provide natural language feedback on a recommended image item, which leads to an improved recommendation in the next turn. While typical instantiations of this task assume that the user’s target item will (eventually) be returned, this is often not the case, for example, when the item the user seeks is not in the item catalogue. Failing to return a user’s desired item can lead to frustration, as the user needs to interact with the system for an increased number of turns. To mitigate this issue, we introduce the task of Supervised Conversational Performance Prediction, inspired by Query Performance Prediction (QPP), which predicts effectiveness in response to a search engine query. In this regard, we propose predictors for conversational performance that detect conversation failures using multi-turn semantic information contained in the embedded representations of retrieved image items. Specifically, our AutoEncoder-based predictor learns a compressed representation of the top-retrieved items of the training turns and uses the classification labels to predict the evaluation turn. Our evaluation addresses two recommendation scenarios, differentiating between system failure, where the system is unable to find the target, and catalogue failure, where the target does not exist in the item catalogue. In our experiments on the Shoes and FashionIQ Dresses datasets, we measure the accuracy of predictors for both system and catalogue failures. Our results demonstrate the promise of the proposed predictors for predicting system failures (the existing evaluation scenario), while we observe a considerable decrease in predictive performance for catalogue failure prediction (when inducing a missing-item scenario) compared to system failures.

    Full text in ACM Digital Library

  • SPOT #36R⁴ec: A Reasoning, Reflection, and Refinement Framework for Recommendation Systems
    by Hao Gu, Rui Zhong, Yu Xia, Wei Yang, Chi Lu, Peng Jiang, Kun Gai
    Topic: LLMs, Embeddings & Conversational Recommender Systems

    Harnessing Large Language Models (LLMs) for recommendation systems has emerged as a prominent avenue, drawing substantial research interest. However, existing approaches primarily involve basic prompt techniques for knowledge acquisition, which resemble System-1 thinking. This makes these methods highly sensitive to errors in the reasoning path, where even a small mistake can lead to an incorrect inference. To this end, in this paper, we propose R⁴ec, a reasoning, reflection and refinement framework that evolves the recommendation system into a weak System-2 model. Specifically, we introduce two models: an actor model that engages in reasoning, and a reflection model that judges these responses and provides valuable feedback. The actor model then refines its response based on the feedback, ultimately leading to improved responses. We employ an iterative reflection and refinement process, enabling LLMs to facilitate slow and deliberate System-2-like thinking. Ultimately, the final refined knowledge is incorporated into a recommendation backbone for prediction. We conduct extensive experiments on the Amazon-Book and MovieLens-1M datasets to demonstrate the superiority of R⁴ec. We also deploy R⁴ec on a large-scale online advertising platform, showing a 2.2% increase in revenue. Furthermore, we investigate the scaling properties of the actor model and reflection model.
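    The iterative actor-reflector loop can be sketched schematically; `actor`, `reflector`, and the stopping rule below are hypothetical stand-ins for the paper's LLM components, not its actual implementation:

```python
def refine(prompt, actor, reflector, max_rounds=3):
    """Reasoning-reflection-refinement loop (schematic): the actor answers,
    the reflector critiques, and the actor revises until accepted."""
    response = actor(prompt)
    for _ in range(max_rounds):
        feedback = reflector(prompt, response)
        if feedback is None:          # reflector accepts the response
            break
        response = actor(f"{prompt}\nFeedback: {feedback}\nRevise your answer.")
    return response

# Toy stand-ins: the actor improves once it sees feedback.
def actor(p):
    return "good" if "Feedback" in p else "bad"

def reflector(p, r):
    return None if r == "good" else "too vague"

assert refine("q", actor, reflector) == "good"
```

    The loop's extra LLM calls happen offline; only the final refined knowledge feeds the recommendation backbone, which keeps serving-time cost unchanged.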

    Full text in ACM Digital Library

  • SPOT #37TreatRAG: A Framework for Personalized Treatment Recommendation
    by Chao-Chin Liu, Hao-Ren Yao, Der-Chen Chang, Ophir Frieder
    Topic: LLMs, Embeddings & Conversational Recommender Systems

    Medication recommendation is a critical function of clinical decision support systems, directly influencing patient safety and treatment efficacy. While large language models (LLMs) show promise in clinical tasks such as summarization and question answering, their ability to make accurate treatment predictions remains limited due to a lack of specialized medical knowledge and exposure to real-world patient data. We introduce TreatRAG, a retrieval-augmented generation (RAG) framework designed to enhance treatment recommendation by integrating structured electronic health record (EHR) data with pretrained LLMs. TreatRAG retrieves similar patient cases, so-called “digital twins”, using interpretable N-gram Jaccard similarity and augments the input prompt to ground LLM predictions in real clinical scenarios. We evaluate our framework on the MIMIC-IV dataset using BioGPT, BioMistral, Phi3, and Flan-T5. In all cases, TreatRAG statistically significantly improves medication prediction performance. TreatRAG-enhanced BioGPT improves its F1-score from 0.14 to 0.34, BioMistral from 0.22 to 0.54, Phi-3 from 0.09 to 0.16, and Flan-T5 from 0.23 to 0.30. Our model-agnostic framework offers a flexible, effective, and interpretable solution to advance the reliability of LLMs in clinical decision support.
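    N-gram Jaccard similarity, the retrieval measure used to find “digital twins”, is easy to make concrete (the token sequences below are hypothetical, not MIMIC-IV data):

```python
def ngrams(tokens, n=2):
    """Set of n-grams (as tuples) from a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity between two sets: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical tokenized EHR code sequences for a patient and a candidate twin.
patient = ["dx:I10", "dx:E11", "rx:metformin", "dx:I10"]
twin    = ["dx:I10", "dx:E11", "rx:metformin"]

sim = jaccard(ngrams(patient), ngrams(twin))
print(round(sim, 3))  # 0.667
```

    Because the score is a set overlap over visible code n-grams, a clinician can inspect exactly which shared subsequences drove a retrieval, which is the interpretability argument the abstract makes.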

    Full text in ACM Digital Library

  • SPOT #38USB-Rec: An Effective Framework for Improving Conversational Recommendation Capability of Large Language Model
    by Jianyu Wen, Jingyun Wang, Cilin Yan, Jiayin Cai, Xiaolong Jiang, Ying Zhang
    Topic: LLMs, Embeddings & Conversational Recommender Systems

    Recently, Large Language Models (LLMs) have been widely employed in Conversational Recommender Systems (CRSs). Unlike traditional language model approaches that focus on training, existing LLM-based approaches mainly center on leveraging the summarization and analysis capabilities of LLMs while ignoring the issue of training. Therefore, in this work, we propose an integrated training-inference framework, the User-Simulator-Based framework (USB-Rec), for improving the performance of LLMs in conversational recommendation at the model level. First, we design an LLM-based Preference Optimization (PO) dataset construction strategy for RL training, which helps LLMs understand the strategies and methods in conversational recommendation. Second, we propose a Self-Enhancement Strategy (SES) at the inference stage to further exploit the conversational recommendation potential obtained from RL training. Extensive experiments on various datasets demonstrate that our method consistently outperforms previous state-of-the-art methods.

    Full text in ACM Digital Library

  • SPOT #39Beyond the past: Leveraging Audio and Human Memory for Sequential Music Recommendation
    by Viet Anh Tran, Bruno Sguerra, Gabriel Meseguer-Brocal, Lea Briand, Manuel Moussallam
    Topic: Music Recommendation

    On music streaming services, listening sessions are often composed of a balance of familiar and new tracks. Recently, sequential recommender systems have adopted psychology-informed approaches based on human memory models, such as Adaptive Control of Thought-Rational (ACT-R), to successfully improve the prediction of the most relevant tracks for the next user session. However, one limitation of using a model based on human memory (i.e., the past) is that it struggles to recommend new tracks that users have not previously listened to. To bridge this gap, we propose a model that leverages audio information to predict in advance the ACT-R-like activation of new tracks and incorporates them into the recommendation scoring process. We demonstrate the empirical effectiveness of the proposed model using proprietary data from a global music streaming service, which we publicly release along with the model’s source code to foster future research in this field.
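    The memory model underlying such approaches, ACT-R's base-level activation, can be sketched as follows (a textbook form of the equation; the paper's exact variant may differ, and the listening times are hypothetical):

```python
import math

def base_level_activation(ages, d=0.5):
    """ACT-R base-level activation: B = ln( sum_j t_j^(-d) ),
    where t_j is the time since the j-th listen and d is a decay rate."""
    return math.log(sum(t ** (-d) for t in ages))

# A track heard often and recently is more "active" than one heard
# once, long ago (ages in days).
recent_track = base_level_activation([1, 2, 5])
old_track = base_level_activation([120])
assert recent_track > old_track
```

    The limitation the paper addresses is visible here: a never-heard track has no listening history, so its activation is undefined; predicting an activation from audio features fills that gap.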

    Full text in ACM Digital Library

  • SPOT #40Just Ask for Music (JAM): Multimodal and Personalized Natural Language Music Recommendation
    by Alessandro B. Melchiorre, Elena Epure, Shahed Masoudian, Gustavo Escobedo, Anna Hausberger, Manuel Moussallam, Markus Schedl
    Topic: Music Recommendation

    Natural language interfaces offer a compelling approach for music recommendation, enabling users to express complex preferences conversationally. While Large Language Models (LLMs) show promise in this direction, their scalability is limited by high costs and latency. Retrieval-based approaches using smaller language models mitigate these issues but often rely on single-modal item representations, overlook long-term user preferences, and require full model retraining, posing challenges for real-world deployment. In this paper, we present JAM (Just Ask for Music), a lightweight and intuitive framework for natural language music recommendation. JAM models user–query–item interactions as vector translations in a shared latent space, inspired by knowledge graph embedding methods like TransE. To capture the complexity of music and user intent, JAM aggregates multimodal item features via cross-attention and sparse mixture-of-experts. We also introduce JAMSessions, a new dataset of over 100k user–query–item triples with anonymized user/item embeddings, uniquely combining conversational queries and user long-term preferences. Our results show that JAM provides accurate recommendations, produces intuitive representations suitable for practical use cases, and can be easily integrated with existing music recommendation stacks.
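    The translation principle JAM borrows from TransE, user + query ≈ item, can be sketched with toy vectors (all values hypothetical; JAM's real embeddings are multimodal and learned):

```python
import numpy as np

# Translation principle: rank candidates by distance ||user + query - item||.
user  = np.array([0.1, 0.3, 0.0, 0.2])
query = np.array([0.4, -0.1, 0.5, 0.0])

items = {
    "jazz_track":  np.array([0.5, 0.2, 0.5, 0.2]),   # ≈ user + query
    "metal_track": np.array([-0.8, 0.9, -0.5, 0.7]),
}

def rank(user, query, items):
    target = user + query
    return sorted(items, key=lambda k: np.linalg.norm(target - items[k]))

print(rank(user, query, items))  # ['jazz_track', 'metal_track']
```

    Scoring reduces to a nearest-neighbour search around a translated point, which is what keeps retrieval lightweight compared with calling an LLM per query.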

    Full text in ACM Digital Library

  • SPOT #41Towards Personality-Aware Explanations for Music Recommendations Using Generative AI
    by Gabrielle Alves, Dietmar Jannach, Luan Soares de Souza, Marcelo Garcia Manzato
    Topic: Music Recommendation

    It is well established that the provision of explanations can positively impact the effectiveness of a recommender system. In many proposals in the literature, these explanations are personalized in that they refer to a user’s known individual preferences. Some recent works, however, indicate that personalization should also happen at a higher level, where the system, in a first step, decides in which specific way an explanation should be provided, depending, for example, on the user’s expertise. In this research, we take the first steps towards personality-aware explanations by exploring how users perceive explanations designed to match a given personality trait. For this purpose, we leverage the capabilities of modern Generative AI tools to create personality-based explanations at scale in the context of a music recommendation scenario. A linguistic analysis of the generated explanations confirms that they properly reflect the expected language patterns associated with individual personality traits. Furthermore, a user study shows that certain forms of explaining are preferred over others, for example, ones that match low-neuroticism linguistic patterns. In addition, we find that some explanation forms are more effective than others regarding persuasiveness and perceived overall quality.

    Full text in ACM Digital Library

  • SPOT #42Disentangling User and Item Sequence Patterns in Sequential Recommendation Data Sets
    by Kaiyue Liu, Yang Liu, Alan Medlar, Dorota Glowacka
    Topic: Sequence-based & Cross-Domain Recommendation

    Sequential recommenders use the ordering of user-item interactions to perform next-item prediction. Several studies have attempted to estimate how much sequential information is available in data sets used in the offline evaluation of sequential recommenders by randomly shuffling users’ interaction histories, thereby breaking the sequential dependencies between interactions. However, random shuffling fails to distinguish between sequential patterns arising from user behaviour (i.e., users consuming items based on previous interactions, such as watching a movie and its sequel) and those arising from item availability (when items enter the system and become available for user consumption, e.g., the release date of a movie or song). In this article, we analyse several data sets widely used in sequential recommendation studies using two shuffling techniques: random shuffling and constrained shuffling. While random shuffling reorders interactions arbitrarily, constrained shuffling does not allow user-item interactions to occur prior to the item’s first appearance in the data set. Our experiments show that sequential information can come exclusively from user behaviour patterns, exclusively from item availability, or from a combination of the two. These findings have implications for understanding evaluation results in sequential recommendation and highlight why some data sets may be less appropriate for offline evaluation, given how little of their sequential information comes from user behaviour.
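    The difference between the two shuffling techniques is an eligibility constraint on where each item may land. A sketch of constrained shuffling as described (data and helper names are hypothetical):

```python
import random

def constrained_shuffle(interactions, first_seen, seed=0):
    """Shuffle (item, timestamp) interactions, but never place an item
    at a timestamp earlier than its first appearance in the data set."""
    rng = random.Random(seed)
    slots = sorted(t for _, t in interactions)
    items = [i for i, _ in interactions]
    result = []
    for t in slots:                   # fill time slots in ascending order
        eligible = [i for i in items if first_seen[i] <= t]
        pick = rng.choice(eligible)   # non-empty: each slot's original item
        items.remove(pick)            # is itself eligible at its own time
        result.append((pick, t))
    return result

# When release order fully determines the sequence, shuffling changes nothing:
hist = [("a", 1), ("b", 2), ("c", 3)]
print(constrained_shuffle(hist, {"a": 1, "b": 2, "c": 3}))
```

    If a data set's sequences survive constrained shuffling almost unchanged, its apparent sequential signal comes mostly from item availability rather than user behaviour.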

    Full text in ACM Digital Library

  • SPOT #43MDSBR: Multimodal Denoising for Session-based Recommendation
    by Yutong Li, Xinyi Zhang
    Topic: Sequence-based & Cross-Domain Recommendation

    Multimodal session-based recommendation (SBR) has emerged as a promising direction for capturing user intent using visual and textual item content. However, existing methods often overlook a fundamental issue: the modality features extracted from pre-trained models (e.g., BERT, CLIP) are inherently noisy and misaligned with user-specific preferences. This noise arises from label errors, task mismatch, and over-inclusion of irrelevant content, ultimately degrading recommendation quality. In this work, we propose a diffusion-based denoising framework that explicitly refines noisy pre-trained representations without full fine-tuning. By progressively removing noise through a structured denoising process, our Multimodal Denoising Diffusion Layer enhances task-specific semantics. Furthermore, we introduce two auxiliary modules: an Interest-Guided Denoising Layer that filters modality features using session context, and a Multimodal Alignment Layer that enforces cross-modal coherence. Extensive experiments on real-world datasets demonstrate that our model significantly outperforms state-of-the-art methods while maintaining practical training efficiency.

    Full text in ACM Digital Library

  • SPOT #44 Tag-augmented Dual-target Cross-domain Recommendation
    by Mingfan Pan, Qingyang Mao, Xu An, Jianhui Ma, Gang Zhou, Mingyue Cheng, Enhong Chen
    Topic: Sequence-based & Cross-Domain Recommendation

    Cross-domain recommendation (CDR) has been proposed to alleviate the data sparsity issue in recommendation systems and has garnered substantial research interest. In recent years, dual-target CDR has been an increasingly prevalent research topic that emphasizes simultaneous enhancement in both the source and target domains. Many existing approaches rely on overlapping users as bridges between domains, yet in real-world scenarios, the number of such users is often severely limited, restricting their practical applicability. To overcome this limitation, alternative methods for cross-domain connections are needed, and item tags serve as a promising solution. However, real-world tags suffer from severe deficiencies in terms of both quantity and diversity, and existing studies have not fully exploited their potential. In this paper, we introduce Tag-augmented Dual-target Cross-domain Recommendation (TA-DTCDR), which is the first to apply LLM-distilled tag information to CDR. TA-DTCDR utilizes item tags distilled by large language models (LLMs) as an additional channel to facilitate information transfer, thereby mitigating performance decline caused by the lack of overlapping users. Furthermore, to fully leverage the natural language information carried by the distilled tags, we design a series of training tasks to align tag semantics across domains while preserving their semantic independence. The proposed method is validated on multiple tasks using public datasets, showing significant improvements over existing state-of-the-art approaches.

    Full text in ACM Digital Library

  • SPOT #45 Exploring Scaling Laws of CTR Model for Online Performance Improvement
    by Weijiang Lai, Beihong Jin, Jiongyan Zhang, Yiyuan Zheng, Jian Dong, Jia Cheng, Jun Lei, Xingxing Wang
    Topic: User Preferences and Engagement

    Click-Through Rate (CTR) models play a vital role in improving user experience and boosting business revenue in many online personalized services. However, current CTR models generally encounter bottlenecks in performance improvement. Inspired by the scaling law phenomenon of Large Language Models (LLMs), we propose a new paradigm for improving CTR predictions: first, constructing a CTR model with accuracy scalable to the model grade and data size, and then distilling the knowledge implied in this model into a lightweight model that can serve online users. To put it into practice, we construct a CTR model named SUAN (Stacked Unified Attention Network). In SUAN, we propose the unified attention block (UAB) as a behavior sequence encoder. A single UAB unifies the modeling of the sequential and non-sequential features and also measures the importance of each user behavior feature from multiple perspectives. Stacked UABs elevate the configuration to a high grade, paving the way for performance improvement. In order to benefit from the high performance of the high-grade SUAN and avoid the disadvantage of its long inference time, we modify the SUAN with sparse self-attention and parallel inference strategies to form LightSUAN, and then adopt online distillation to train the low-grade LightSUAN, taking a high-grade SUAN as a teacher. The distilled LightSUAN has superior performance but the same inference time as the LightSUAN, making it well-suited for online deployment. Experimental results show that SUAN performs exceptionally well and exhibits scaling laws spanning three orders of magnitude in model grade and data size, and the distilled LightSUAN outperforms the SUAN configured with one grade higher. More importantly, the distilled LightSUAN has been integrated into an online service, increasing the CTR by 2.81% and CPM by 1.69% while keeping the average inference time acceptable.
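    The teacher-student objective behind this kind of online distillation can be sketched generically. The function below is illustrative only; the name, the alpha weighting, and the specific soft-target cross-entropy are assumptions, not the paper's exact loss.

```python
import math

def distill_loss(y_true, p_student, p_teacher, alpha=0.5):
    """Per-example CTR distillation loss: binary cross-entropy against the
    ground-truth click label (hard target), blended with cross-entropy
    against the teacher model's soft prediction (soft target)."""
    eps = 1e-7
    ps = min(max(p_student, eps), 1 - eps)  # clamp for numerical safety
    hard = -(y_true * math.log(ps) + (1 - y_true) * math.log(1 - ps))
    soft = -(p_teacher * math.log(ps) + (1 - p_teacher) * math.log(1 - ps))
    return alpha * hard + (1 - alpha) * soft
```

With alpha = 1 the student ignores the teacher; with alpha < 1 the lightweight student is also pulled toward the high-grade teacher's predicted click probabilities during the same training run.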

    Full text in ACM Digital Library

  • SPOT #46 Large Scale E-Commerce Model for Learning and Analyzing Long-Term User Preferences
    by Yonatan Hadar, Yotam Eshel, Tal Franji, Bracha Shapira, Michelle Hwang, Guy Feigenblat
    Topic: User Preferences and Engagement

    Understanding long-term user preferences is critical for delivering consistent and personalized recommendations that go beyond short-term behavioral cues in large-scale e-commerce platforms. We present NILUS (Neural Inference for Long-Term User Signals), a content-based transformer model trained to predict user behavior over a ????-day future window using up to one year of historical interaction data. NILUS learns user embeddings end-to-end via contrastive learning, using item representations from a fine-tuned sentence encoder. We introduce a novel evaluation framework to assess the model’s ability to capture enduring user interests, and demonstrate that NILUS delivers higher accuracy than strong baselines on a large-scale offline dataset spanning millions of users and diverse product verticals. When combined with short-term signals, NILUS further improves recommendation accuracy and diversity. Finally, a large-scale online A/B test on a multinational e-commerce platform confirms statistically significant gains in user engagement.
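    Contrastive learning of user embeddings of the kind described above is commonly implemented with an InfoNCE-style objective: pull the user representation toward an item the user interacts with in the future window, and push it away from in-batch negatives. The sketch below is a generic illustration under that assumption, not the NILUS training code.

```python
import math

def info_nce(user_vec, pos_item, neg_items, temperature=0.1):
    """InfoNCE loss for one user: high similarity to the positive
    (future-window) item and low similarity to negatives give a low loss."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    scores = [dot(user_vec, pos_item)] + [dot(user_vec, n) for n in neg_items]
    scores = [s / temperature for s in scores]
    m = max(scores)  # log-sum-exp trick for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return -(scores[0] - log_z)
```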

    Full text in ACM Digital Library

  • SPOT #47 Modeling Long-term User Behaviors with Diffusion-driven Multi-interest Network for CTR Prediction
    by Weijiang Lai, Beihong Jin, Yapeng Zhang, Yiyuan Zheng, Rui Zhao, Jian Dong, Jun Lei, Xingxing Wang
    Topic: User Preferences and Engagement

    CTR (Click-Through Rate) prediction, which is crucial for recommender systems, online advertising, and similar applications, has been confirmed to benefit from modeling long-term user behaviors. Nonetheless, the vast number of behaviors and complexity of noise interference pose challenges to prediction efficiency and effectiveness. Recent solutions have evolved from single-stage models to two-stage models. However, current two-stage models often filter out significant information, resulting in an inability to capture diverse user interests and build the complete latent space of user interests. Inspired by multi-interest and generative modeling, we propose DiffuMIN (Diffusion-driven Multi-Interest Network) to model long-term user behaviors and thoroughly explore the user interest space. Specifically, we propose a target-oriented multi-interest extraction method that begins by orthogonally decomposing the target to obtain interest channels. This is followed by modeling the relationships between interest channels and user behaviors to disentangle and extract multiple user interests. We then introduce a diffusion module guided by contextual interests and interest channels, which anchor users’ personalized and target-oriented interest types, enabling the generation of augmented interests that align with the latent spaces of user interests, thereby further exploring the restricted interest space. Finally, we leverage contrastive learning to ensure that the generated augmented interests align with users’ genuine preferences. Extensive offline experiments are conducted on two public datasets and one industrial dataset, yielding results that demonstrate the superiority of DiffuMIN. Moreover, DiffuMIN increased CTR by 1.52% and CPM by 1.10% in online A/B testing.

    Full text in ACM Digital Library

  • SPOT #48 Not One News Recommender To Fit Them All: How Different Recommender Strategies Serve Various User Segments
    by Hanne Vandenbroucke, Ulysse Maes, Lien Michiels, Annelien Smets
    Topic: User Preferences and Engagement

    Many news recommender systems (NRS) adopt a one-recommender-for-all approach, overlooking that users engage with news in fundamentally different ways. In this work, we identify user segments based on various engagement metrics that go beyond clicks by employing cluster analysis on two real-world datasets: EB-NeRD and Adressa. In addition, we evaluate the performance of common recommendation strategies (popularity, collaborative filtering with EASE and ItemKNN, and a content-based model) across these user segments, which exhibit varying reading behaviors and information needs. Our findings show that different recommendation strategies are effective to varying degrees depending on the user profile. This study contributes to NRS research by providing a grounded segmentation of users derived from real-world datasets and emphasizes the importance of user-centered evaluations in advancing our understanding of how NRS designs serve audiences with varying levels of news engagement.
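    Segmentation by cluster analysis over per-user engagement features can be sketched with plain Lloyd's k-means; the feature names in the comment and the deterministic initialization are illustrative assumptions, not details from the paper.

```python
def kmeans(points, k, iters=20):
    """Lloyd's k-means over per-user engagement feature vectors
    (e.g., read time, scroll depth, visit frequency).
    Initialized with the first k points for determinism."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each user to the nearest center (squared Euclidean).
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        for j, cl in enumerate(clusters):
            if cl:  # recompute each center as the mean of its cluster
                centers[j] = [sum(xs) / len(cl) for xs in zip(*cl)]
    return centers
```

The resulting segments are then the unit of evaluation: each recommendation strategy is scored per cluster rather than over the whole population.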

    Full text in ACM Digital Library

ACM TORS

  • SPOT #49 A Bi-step Grounding Paradigm for Large Language Models in Recommendation Systems
    by Keqin Bao, Jizhi Zhang, Wenjie Wang, Yang Zhang, Zhengyi Yang, Yanchen Luo, Chong Chen, Fuli Feng, and Qi Tian

    As the focus on Large Language Models (LLMs) in the field of recommendation intensifies, the optimization of LLMs for recommendation purposes (referred to as LLM4Rec) assumes a crucial role in enhancing their recommendation performance. However, existing approaches for LLM4Rec often assess performance using restricted sets of candidates, which may not accurately reflect the models’ overall ranking capabilities. In this article, our objective is to pursue LLM4Rec models with comprehensive ranking capacity and propose a two-step grounding framework known as BIGRec (Bi-step Grounding Paradigm for Recommendation). BIGRec initially grounds LLMs to the recommendation space by fine-tuning them to generate meaningful tokens for items and subsequently identifies appropriate actual items that correspond to the generated tokens. By conducting extensive experiments on two datasets, we substantiate the superior performance, capacity for handling few-shot scenarios, and versatility across multiple domains exhibited by BIGRec. Furthermore, we observe that the marginal benefits derived from increasing the quantity of training samples are modest for BIGRec, implying that LLMs possess only a limited capability to assimilate statistical information, such as popularity and collaborative filtering, due to their robust semantic priors. These findings also underline the efficacy of integrating diverse statistical information into the LLM4Rec framework, thereby pointing towards a potential avenue for future research. Finally, we conduct analysis utilizing BIGRec to explore the characteristics of incorporating recommendations into LLMs, thereby offering prospective insights for the advancement of the field. Our code and data are available at https://github.com/SAI990323/Grounding4Rec.
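    The second grounding step, mapping freely generated tokens onto actual catalogue items, is typically a nearest-neighbour match in an embedding space. The sketch below illustrates that idea under the assumption of a distance-based match over precomputed item embeddings; function and variable names are hypothetical.

```python
def ground_to_items(generated_vec, item_embeddings, k=5):
    """Map an embedding of the LLM's generated item tokens to the k closest
    real catalogue items by squared Euclidean distance. This guarantees the
    final recommendation is always a valid item, even if the generated text
    matches no title exactly."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    ranked = sorted(item_embeddings.items(),
                    key=lambda kv: dist2(generated_vec, kv[1]))
    return [item_id for item_id, _ in ranked[:k]]
```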

    Full text in ACM Digital Library

  • SPOT #50 Evaluation Measures of Individual Item Fairness for Recommender Systems: A Critical Study
    by Theresia Veronika Rampisela, Maria Maistro, Tuukka Ruotsalo, and Christina Lioma

    Fairness is an emerging and challenging topic in recommender systems. In recent years, various ways of evaluating and therefore improving fairness have emerged. In this study, we examine existing evaluation measures of fairness in recommender systems. Specifically, we focus solely on exposure-based fairness measures of individual items that aim at quantifying the disparity in how individual items are recommended to users, separate from item relevance to users. We gather all such measures and we critically analyse their theoretical properties. We identify a series of limitations in each of them, which collectively may render the affected measures hard or impossible to interpret, to compute, or to use for comparing recommendations. We resolve these limitations by redefining or correcting the affected measures, or we argue why certain limitations cannot be resolved. We further perform a comprehensive empirical analysis of both the original and our corrected versions of these fairness measures, using real-world and synthetic datasets. Our analysis provides novel insights into the relationship between measures based on different fairness concepts, and different levels of measure sensitivity and strictness. We conclude with practical suggestions of which fairness measures should be used and when. Our code is publicly available. To our knowledge, this is the first critical comparison of individual item fairness measures in recommender systems.
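    One widely used exposure-based quantity of the kind this study examines is the Gini coefficient over per-item exposure counts; the sketch below is a generic illustration of that family of measures, not necessarily one of the specific measures the paper analyses or corrects.

```python
def exposure_gini(exposures):
    """Gini coefficient of per-item exposure across the catalogue:
    0.0 means all items receive equal exposure in recommendation lists;
    values approaching 1.0 mean exposure is concentrated on few items."""
    xs = sorted(exposures)
    n = len(xs)
    total = sum(xs)
    if total == 0:
        return 0.0  # no exposure at all; define as perfectly equal
    # Standard rank-weighted form: G = 2*sum(i*x_i)/(n*total) - (n+1)/n,
    # where i is the 1-based rank after sorting ascending.
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n
```

Note one interpretability caveat of the sort the paper raises: for a finite catalogue of n items the maximum attainable value is (n-1)/n, not 1, so raw values are not directly comparable across catalogues of different sizes.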

    Full text in ACM Digital Library

  • SPOT #51 Formalizing Multimedia Recommendation through Multimodal Deep Learning
    by Daniele Malitesta, Giandomenico Cornacchia, Claudio Pomo, Felice Antonio Merra, Tommaso Di Noia, and Eugenio Di Sciascio

    Recommender systems (RSs) provide customers with a personalized navigation experience within the vast catalogs of products and services offered on popular online platforms. Despite the substantial success of traditional RSs, recommendation remains a highly challenging task, especially in specific scenarios and domains. For example, human affinity for items described through multimedia content (e.g., images, audio, and text), such as fashion products, movies, and music, is multi-faceted and primarily driven by their diverse characteristics. Therefore, by leveraging all available signals in such scenarios, multimodality enables us to tap into richer information sources and construct more refined user/item profiles for recommendations. Despite the growing number of multimodal techniques proposed for multimedia recommendation, the existing literature lacks a shared and universal schema for modeling and solving the recommendation problem through the lens of multimodality. Given the recent advances in multimodal deep learning for other tasks and scenarios where precise theoretical and applicative procedures exist, we also consider it imperative to formalize a general multimodal schema for multimedia recommendation. In this work, we first provide a comprehensive literature review of multimodal approaches for multimedia recommendation from the last eight years. Second, we outline the theoretical foundations of a multimodal pipeline for multimedia recommendation by identifying and formally organizing recurring solutions/patterns; at the same time, we demonstrate its rationale by conceptually applying it to selected state-of-the-art approaches in multimedia recommendation. Third, we conduct a benchmarking analysis of recent algorithms for multimedia recommendation within Elliot, a rigorous framework for evaluating recommender systems, where we re-implement such multimedia recommendation approaches. Finally, we highlight the significant unresolved challenges in multimodal deep learning for multimedia recommendation and suggest possible avenues for addressing them. The primary aim of this work is to provide guidelines for designing and implementing the next generation of multimodal approaches in multimedia recommendation.

    Full text in ACM Digital Library

  • SPOT #52 Mitigating Exposure Bias in Recommender Systems — A Comparative Analysis of Discrete Choice Models
    by Thorsten Krause, Alina Deriyeva, Jan H. Beinke, Gerrit Y. Bartels, and Oliver Thomas

    When implicit feedback recommender systems expose users to items, they influence the users’ choices and, consequently, their own future recommendations. This effect is known as exposure bias, and it can cause undesired effects such as filter bubbles and echo chambers. Previous research has used multinomial logit models on synthesized data to reduce exposure bias caused by over-exposure. We hypothesized that these findings hold true for human choice data to a limited degree and that advanced discrete choice models would further reduce bias. We also investigated whether the composition of choice sets can cause exposure bias. In pursuing our research questions, we collected partially biased human choices in a controlled online user study. In two experiments, we evaluated how discrete choice–based recommender systems and baselines react to over-exposure and to over- and under-competitive choice sets. Our results confirmed that leveraging choice set information mitigates exposure bias. The multinomial logit model reduced exposure bias, comparably to the other discrete choice models. Choice set competitiveness biased the models that did not consider choice alternatives. Our findings suggest that discrete choice models are highly effective at mitigating exposure bias in recommender systems and that existing recommender systems may suffer more exposure bias than previously thought.
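    The multinomial logit model at the core of this comparison assigns each alternative in a choice set a probability proportional to the exponential of its utility. A minimal sketch:

```python
import math

def mnl_choice_probs(utilities):
    """Multinomial logit: probability of choosing each alternative in a
    choice set, given deterministic utilities. Softmax with max-shift
    for numerical stability."""
    m = max(utilities)
    exps = [math.exp(u - m) for u in utilities]
    z = sum(exps)
    return [e / z for e in exps]
```

Because the likelihood conditions on the whole choice set, a click made in the presence of strong alternatives carries more positive signal than one made against weak alternatives, which is how choice-set information helps correct for over-exposure.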

    Full text in ACM Digital Library


This event is supported by the Capital City of Prague