Wednesday Poster Session: Industry

Date: Wednesday, September 24

Industry papers

  • SPOT #1 Personalized Interest Graphs for Theme-Driven User Behavior
    by Oded Zinman, Nazmul Chowdhury, Leandro Fiaschetti, Yuri Brovman, Guy Feigenblat, Yotam Eshel

    Many eBay users turn to our platform to pursue theme-centric interests that span diverse product categories—for example, a Star Wars fan might search for related video games, toys, memorabilia, and artwork. Existing recommendation systems, typically optimized for short-term engagement, often fail to surface cross-category items aligned with these deeper interests. We present an end-to-end recommendation framework built around a user-interest graph generated by an LLM chain. The graph captures user preferences at multiple levels of granularity, enabling a balance between relevance-driven and serendipity-driven recommendations. The system has been deployed at scale, serving millions of users across billions of items. An online A/B test on the eBay homepage showed a significant improvement in engagement with previously unseen categories, alongside gains in purchases and buyer count.

    Full text in ACM Digital Library

  • SPOT #2 Pareto-Optimal Solution: Optimizing Engagement and Revenue
    by Shaghayegh Agah, Shaun Schaeffer, Maria Peifer, Neeraj Sharma, Ankit Maheshwari, Sardar Hamidian

    This paper introduces a multi-objective ranking framework deployed on a large-scale entertainment platform to jointly optimize user engagement, revenue, and content pricing. Unlike prior work, our system addresses a critical real-world challenge: extreme label imbalance across objectives, with monetization signals being over 100× sparser than engagement. To overcome this, we adopt an output aggregation strategy that supports runtime tuning of objective weights, enabling fast iteration and dynamic prioritization without retraining. We further introduce a robust offline evaluation pipeline based on Pareto analysis and distribution-aware test datasets, exposing trade-offs that would otherwise remain hidden. Beyond engagement and revenue, we incorporate a third price-based objective optimized via constrained Bayesian search over a high-dimensional simplex, demonstrating how monetization goals can be achieved without degrading user experience. Our approach is validated both offline and through online A/B tests, showing measurable revenue improvements with minimal impact on engagement. This work provides a novel, end-to-end blueprint for scalable multi-objective optimization under production constraints, where business trade-offs must be explicit, tunable, and validated.

    Full text in ACM Digital Library

  • SPOT #3 Suggest, Complement, Inspire: Story of Two-Tower Recommendations at Allegro.com
    by Aleksandra Osowska-Kurczab, Klaudia Nazarko, Mateusz Marzec, Lidia Wojciechowska, Eliška Kremeňová

    Building large-scale e-commerce recommendation systems requires addressing three key technical challenges: (1) designing a universal recommendation architecture across dozens of placements, (2) decreasing excessive maintenance costs, and (3) managing a highly dynamic product catalogue. This paper presents a unified content-based recommendation system deployed at Allegro.com, the largest e-commerce platform of European origin. The system is built on a prevalent Two Tower retrieval framework, representing products using textual and structured attributes, which enables efficient retrieval via Approximate Nearest Neighbour search. We demonstrate how the same model architecture can be adapted to serve three distinct recommendation tasks: similarity search, complementary product suggestions, and inspirational content discovery, by modifying only a handful of components in either the model or the serving logic. Extensive A/B testing over two years confirms significant gains in engagement and profit-based metrics across desktop and mobile app channels. Our results show that a flexible, scalable architecture can serve diverse user intents with minimal maintenance overhead.

    Full text in ACM Digital Library

  • SPOT #4 Item-centric Exploration for Cold Start Problem
    by Dong Wang, Junyi Jiao, Arnab Bhadury, Yaping Zhang, Mingyan Gao, Onkar Dalal

    Recommender systems face a critical challenge in the item cold-start problem, which limits content diversity and exacerbates popularity bias by struggling to recommend new items. While existing solutions often rely on auxiliary data, this paper illuminates a distinct, yet equally pressing, issue stemming from the inherent user-centricity of many recommender systems. We argue that in environments with large and rapidly expanding item inventories, the traditional focus on finding the “best item for a user” can inadvertently obscure the ideal audience for nascent content. To counter this, we introduce the concept of item-centric recommendations, shifting the paradigm to identify the optimal users for new items. Our initial realization of this vision involves an item-centric control integrated into an exploration system. This control employs a Bayesian model with Beta distributions to assess candidate items based on a predicted balance between user satisfaction and the item’s inherent quality. Empirical online evaluations reveal that this straightforward control markedly improves cold-start targeting efficacy, enhances user satisfaction with newly explored content, and significantly increases overall exploration efficiency.

    Full text in ACM Digital Library
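The Beta-distribution control described in the SPOT #4 abstract can be sketched as a Beta-Bernoulli exploration policy. The class layout, uniform prior, and Thompson-sampling scoring below are illustrative assumptions, not the authors' implementation:

```python
import random

class BetaItemControl:
    """Item-centric exploration control with per-item Beta posteriors.

    Each item's probability of satisfying the users it is shown to is
    tracked as Beta(alpha, beta), updated from satisfaction feedback.
    Candidates are scored with a Thompson-sampled draw, so uncertain
    new items still earn exposure until feedback narrows the posterior.
    """

    def __init__(self, prior_alpha=1.0, prior_beta=1.0):
        self.prior = (prior_alpha, prior_beta)
        self.params = {}  # item_id -> [alpha, beta]

    def update(self, item_id, satisfied):
        # Bayesian update: a success bumps alpha, a failure bumps beta.
        a, b = self.params.get(item_id, self.prior)
        self.params[item_id] = [a + 1, b] if satisfied else [a, b + 1]

    def sample_score(self, item_id, rng=random):
        # One Thompson draw from the item's posterior.
        a, b = self.params.get(item_id, self.prior)
        return rng.betavariate(a, b)

    def pick(self, candidates, rng=random):
        """Choose the candidate item with the highest sampled score."""
        return max(candidates, key=lambda i: self.sample_score(i, rng))
```

Because fresh items start at the prior, their sampled scores are high-variance, which is what earns cold-start content exposure alongside established items.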

  • SPOT #5 Balancing Fine-tuning and RAG: A Hybrid Strategy for Dynamic LLM Recommendation Updates
    by Changping Meng, Hongyi Ling, Jianling Wang, Yifan Liu, Shuzhou Zhang, Dapeng Hong, Mingyan Gao, Onkar Dalal, Ed Chi, Lichan Hong, Haokai Lu, Ningren Han

    Large Language Models (LLMs) empower recommendation systems through their advanced reasoning and planning capabilities. However, the dynamic nature of user interests and content poses a significant challenge: while initial fine-tuning aligns LLMs with domain knowledge and user preferences, it fails to capture such real-time changes, necessitating robust update mechanisms. This paper investigates strategies for updating LLM-powered recommenders, focusing on the trade-offs between ongoing fine-tuning and Retrieval-Augmented Generation (RAG). Using an LLM-powered user interest exploration system as a case study, we perform a comparative analysis of these methods across dimensions like cost, agility, and knowledge incorporation. We further explore hybrid approaches that combine fine-tuning and RAG to dynamically maintain recommendation relevance and performance.

    Full text in ACM Digital Library

  • SPOT #6 LLM-Powered Nuanced Video Attribute Annotation for Enhanced Recommendations
    by Boyuan Long, Yueqi Wang, Hiloni Mehta, Mick Zomnir, Omkar Pathak, Changping Meng, Ruolin Jia, Yajun Peng, Dapeng Hong, Xia Wu, Mingyan Gao, Onkar Dalal, Ningren Han

    This paper presents a case study on deploying Large Language Models (LLMs) as an advanced “annotation” mechanism to achieve nuanced content understanding (e.g., discerning content “vibe”) at scale within a large-scale industrial short-form video recommendation system. Traditional machine learning classifiers for content understanding face protracted development cycles and a lack of deep, nuanced comprehension. The “LLM-as-annotators” approach addresses these by significantly shortening development times and enabling the annotation of subtle attributes. This work details an end-to-end workflow encompassing: (1) iterative definition and robust evaluation of target attributes, refined by offline metrics and online A/B testing; (2) scalable offline bulk annotation of video corpora using LLMs with multimodal features, optimized inference, and knowledge distillation for broad application; and (3) integration of these rich annotations into the online recommendation serving system, for example, through personalized restrict retrieval. Experimental results demonstrate the efficacy of this approach, with LLMs outperforming human raters in offline annotation quality for nuanced attributes and yielding significant improvements in user participation and satisfied consumption in online A/B tests. The study provides insights into designing and scaling production-level LLM pipelines for rich content evaluation, highlighting the adaptability and benefits of LLM-generated nuanced understanding for enhancing content discovery, user satisfaction, and the overall effectiveness of modern recommendation systems.

    Full text in ACM Digital Library

  • SPOT #7 Enhancing Embedding Representation Stability in Recommendation Systems with Semantic ID
    by Carolina Zheng, Minhui Huang, Dmitrii Pedchenko, Kaushik Rangadurai, Siyu Wang, Fan Xia, Gaby Nahum, Jie Lei, Yang Yang, Tao Liu, Zutian Luo, Xiaohan Wei, Dinesh Ramasamy, Jiyan Yang, Yiping Han, Lin Yang, Hangjun Xu, Rong Jin, Shuang Yang

    The exponential growth of online content has posed significant challenges to ID-based models in industrial recommendation systems, ranging from extremely high cardinality and a dynamically growing ID space, to highly skewed engagement distributions, to prediction instability as a result of natural ID life cycles. This paper examines these challenges and introduces Semantic ID prefix-ngram, a novel token parameterization technique that significantly improves the performance of the original Semantic ID. Semantic ID prefix-ngram creates semantically meaningful collisions by hierarchically clustering items based on their content embeddings, as opposed to random assignments. Through extensive experimentation, we demonstrate that Semantic ID prefix-ngram not only addresses embedding instability but also significantly improves tail ID modeling, and mitigates representation shifts. We report our experience of integrating Semantic ID into Meta’s production Ads Ranking system, leading to notable performance gains.

    Full text in ACM Digital Library
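The prefix-ngram idea in the SPOT #7 abstract can be sketched in a few lines: given a hierarchical Semantic ID (coarse-to-fine cluster codes derived from content embeddings), every prefix of the code path becomes one token, so semantically related items collide on the coarse tokens. A hedged illustration of the tokenization only, not Meta's production parameterization:

```python
def prefix_ngram_tokens(semantic_id, sep="-"):
    """Expand a hierarchical Semantic ID into its prefix n-gram tokens.

    `semantic_id` is a tuple of cluster codes ordered coarse to fine,
    e.g. (4, 17, 3). Each prefix becomes a token, so two items that
    share the top-level cluster collide on the coarse token even when
    their fine-grained codes differ -- the "semantically meaningful
    collisions" described in the abstract.
    """
    tokens = []
    for depth in range(1, len(semantic_id) + 1):
        prefix = semantic_id[:depth]
        tokens.append(sep.join(str(code) for code in prefix))
    return tokens
```

For example, items with Semantic IDs (4, 17, 3) and (4, 2, 9) share the coarse token "4", so their embeddings for that token are trained on pooled signal from both.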

  • SPOT #8 The Future is Sparse: Embedding Compression for Scalable Retrieval in Recommender Systems
    by Petr Kasalický, Martin Spišák, Vojtěch Vančura, Daniel Bohuněk, Rodrigo Alves, Pavel Kordík

    Industry-scale recommender systems face a core challenge: representing entities with high cardinality, such as users or items, using dense embeddings that must be accessible during both training and inference. However, as embedding sizes grow, memory constraints make storage and access increasingly difficult. We describe a lightweight, learnable embedding compression technique that projects dense embeddings into a high-dimensional, sparsely activated space. Designed for retrieval tasks, our method reduces memory requirements while preserving retrieval performance, enabling scalable deployment under strict resource constraints. Our results demonstrate that leveraging sparsity is a promising approach for improving the efficiency of large-scale recommenders. We release our code at https://github.com/recombee/CompresSAE.

    Full text in ACM Digital Library
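The compression described in the SPOT #8 abstract, projecting dense embeddings into a high-dimensional, sparsely activated space, can be sketched with a projection followed by top-k activation. The random matrix below stands in for the learned encoder (CompresSAE trains it); this is an assumption-laden toy, not the released method:

```python
import numpy as np

def sparsify(dense, projection, k):
    """Project a dense embedding into a wider space, keep top-k activations.

    The sparse representation is just the indices and values of the k
    largest ReLU activations; every other coordinate is implicitly zero,
    so only k index/value pairs need to be stored per entity.
    """
    z = np.maximum(projection @ dense, 0.0)   # ReLU into the wide space
    idx = np.argsort(z)[-k:]                  # coordinates of the top-k activations
    return idx, z[idx]

rng = np.random.default_rng(0)
proj = rng.normal(size=(4096, 64))   # 64-d dense -> 4096-d sparsely activated space
emb = rng.normal(size=64)
idx, vals = sparsify(emb, proj, k=32)
```

Retrieval can then score items by intersecting active coordinates, which is where the memory and efficiency gains of sparsity come from.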

  • SPOT #9 Decoupled Entity Representation Learning for Pinterest Ads Ranking
    by Jie Liu, Yinrui Li, Jiankai Sun, Kungang Li, Han Sun, Sihan Wang, Huasen Wu, Siyuan Gao, Paulo Soares, Nan Li, Zhifang Liu, Haoyang Li, Siping Ji, Ling Leng, Prathibha Deshikachar

    In this paper, we introduce a novel framework following an upstream-downstream paradigm to construct user and item (Pin) embeddings from diverse data sources, which are essential for Pinterest to deliver personalized Pins and ads effectively. Our upstream models are trained on extensive data sources featuring varied signals, utilizing complex architectures to capture intricate relationships between users and Pins on Pinterest. To ensure scalability of the upstream models, entity embeddings are learned and regularly refreshed offline rather than computed in real time, allowing for asynchronous interaction between the upstream and downstream models. These embeddings are then integrated as input features in numerous downstream tasks, including ad retrieval and ranking models for CTR and CVR predictions. We demonstrate that our framework achieves notable performance improvements in both offline and online settings across various downstream tasks. This framework has been deployed in Pinterest’s production ad ranking systems, resulting in significant gains in online metrics.

    Full text in ACM Digital Library

  • SPOT #10 Agentic Personalisation of Cross-Channel Marketing Experiences
    by Sami Abboud, Eleanor Hanna, Olivier Jeunen, Vineesha Raheja, Schaun Wheeler

    Consumer applications provide ample opportunities to surface and communicate various forms of content to users. These range from promotional campaigns for new features or subscriptions, to evergreen nudges for engagement, to personalised recommendations, delivered across e-mails, push notifications, and in-app surfaces. The conventional approach to orchestrating such communication relies heavily on labour-intensive manual marketer work, and inhibits effective personalisation of content, timing, frequency, and copy-writing. We formulate this task under a sequential decision-making framework, where we aim to optimise a modular decision-making policy that maximises incremental engagement for any funnel event. Our approach leverages a Difference-in-Differences design for Individual Treatment Effect estimation, and Thompson sampling to balance the explore-exploit trade-off. We present results from a multi-service application, where our methodology has resulted in significant increases to a variety of goal events across several product features, and is currently deployed across 75 million users.

    Full text in ACM Digital Library

  • SPOT #11 You Say Search, I Say Recs: A Scalable Agentic Approach to Query Understanding and Exploratory Search at Spotify
    by Enrico Palumbo, Marcus Isaksson, Alexandre Tamborrino, Maria Movin, Catalin Dincu, Ali Vardasbi, Lev Nikeshkin, Oksana Gorobets, Anders Nyman, Poppy Newdick, Hugues Bouchard, Paul Bennett, Mounia Lalmas, Dani Doro, Christine Doig Cardet, Ziad Sultan

    On online content platforms, users often aim to explore the catalog and discover new, personalized content through exploratory searches—such as “new releases for me.” Traditional search systems, which prioritize lexical and semantic matching over personalized retrieval, have historically struggled to support this type of intent. In contrast, recommendation services that leverage user-item and item-item signals tend to be more effective for addressing exploratory queries. Agentic technologies offer a promising opportunity to enhance exploratory search by harnessing large language models (LLMs) to interpret complex query intents and route them to the most suitable downstream services. However, deploying such agentic systems at scale remains a significant challenge. In this paper, we present a scalable agentic approach to query understanding and exploratory search at Spotify. Our system combines a router LLM, post-training adaptation techniques, search and recommendation APIs, and specialized sub-agents to interpret user intent and deliver personalized results at scale. We outline the high-level system design and share key experimental results. By addressing the limitations of conventional search, our approach yields substantial improvements across several exploratory use cases, including discovering similar artists (+115%), broad podcast searches (+15%), and new music releases (+91%).

    Full text in ACM Digital Library

  • SPOT #12 Cold Starting a New Content Type: A Case Study with Netflix Live
    by Yunan Hu, Mark Thornburg, Mario Garcia Armas, Vito Ostuni, Anne Cocos, Kriti Kohli, Christoph Kofler, Rob Saltiel

    Industrial recommender systems often face challenges when personalizing content under an ever-changing, heterogeneous item catalog. At Netflix, for example, members can watch TV shows and movies on demand, play the latest games, or tune in to thrilling live events. The difficulty of recommending new items with limited historical interaction data is often referred to as “the cold start problem.” This problem becomes exacerbated when an entirely new type of content is introduced into a recommender system, requiring the cold-start of a new content type. The purpose of this work is to review an algorithmic approach we implemented at Netflix to efficiently cold-start live events. We validated this approach through a series of online experiments that resulted in increased live engagement (+20%) across Netflix’s global member base without negatively impacting core business metrics.

    Full text in ACM Digital Library

  • SPOT #13 Improve the Personalization of Large-Scale Ranking Systems by Integrating User Survey Feedback
    by Mengxi Lv, Drew Hogg, Thomas Grubb, Shashank Bassi, Min Li, Cayman Simpson, Senthil Rajagopalan

    Learning user interests is a crucial aspect of personalized recommendation, as it can create a more personal experience for users to drive their deep engagement, satisfaction, and loyalty. In this work, we focus on improving users’ interest relevance experience, making users truly feel “this app knows me!” and thus leading to long-term user retention. However, accurately capturing users’ interest remains a significant challenge. Traditional approaches using users’ historical engagements with interest clusters lack sensitivity and accuracy, because such heuristic rules on predefined clusters can easily fall into the ranking feedback loop and thus poorly align with users’ true interest preferences. In this paper, we built a User True Interest Survey (UTIS) model to directly train on user survey data and predict a user’s interest affinity on any given piece of content. The UTIS model is added to the main ranking system to reduce feedback bias and leads to better relevance towards users’ core interests. The UTIS model demonstrates high offline accuracy and high generalization capability in online experiments. On a commercial video platform serving billions of users, we observed significant metric wins, including tier 0 user retention and engagements, higher quality and more trustworthy content recommendations, and higher user satisfaction in surveys. Overall, this work demonstrates that improving the relevance of a ranking system by leveraging direct user survey feedback can be a promising solution to enhance the personalization of large-scale ranking systems and improve user satisfaction.

    Full text in ACM Digital Library

  • SPOT #14 Scaling Retrieval for Web-Scale Recommenders: Lessons from Inverted Indexes to Embedding Search
    by Yuchin Juan, Jianqiang Shen, Shaobo Zhang, Qianqi Shen, Caleb Johnson, Luke Simon, Liangjie Hong, Wenjing Zhang

    Web-scale search and recommendation systems depend on efficient retrieval to manage massive datasets and user traffic. This paper chronicles our evolutionary path in building the retrieval layer at LinkedIn, progressing from a CPU-based inverted index system to a GPU-accelerated embedding-based retrieval system. Initially anchored by traditional term-based retrieval, we enhanced relevance and productivity through learning-to-retrieve approaches by generating mappings among inferred attributes. As these early efforts encountered limitations in inferring and matching attributes at scale, we transitioned to embedding-based retrieval for greater flexibility and performance, but found that existing infrastructure couldn’t support large-scale production needs. This led us to develop a GPU-based retrieval system designed for high performance, flexible modeling, and multi-objective business optimization. We present the infrastructure innovations, optimizations, and key lessons learned throughout this transition, offering practical insights for building scalable, flexible retrieval systems.

    Full text in ACM Digital Library

  • SPOT #15 Identifying Offline Metrics that Predict Online Impact: A Pragmatic Strategy for Real-World Recommender Systems
    by Timo Wilm, Philipp Normann

    A critical challenge in recommender systems is to establish reliable relationships between offline and online metrics that predict real-world performance. Motivated by recent advances in Pareto front approximation, we introduce a pragmatic strategy for identifying offline metrics that align with online impact. A key advantage of this approach is its ability to simultaneously serve multiple test groups, each with distinct offline performance metrics, in an online experiment controlled by a single model. The method is model-agnostic for systems with a neural network backbone, enabling broad applicability across architectures and domains. We validate the strategy through a large-scale online experiment in the field of session-based recommender systems on the OTTO e-commerce platform. The online experiment identifies significant alignments between offline metrics and real-world click-through rate, post-click conversion rate, and units sold. Our strategy provides industry practitioners with a valuable tool for understanding offline-to-online metric relationships and making informed, data-driven decisions.

    Full text in ACM Digital Library

  • SPOT #16 Industry Insights from Comparing Deep Learning and GBDT Models for E-Commerce Learning-to-Rank
    by Yunus Lutz, Timo Wilm, Philipp Duwe

    In e-commerce recommender and search systems, tree-based models, such as LambdaMART, have set a strong baseline for Learning-to-Rank (LTR) tasks. Despite their effectiveness and widespread adoption in industry, the debate continues whether deep neural networks (DNNs) can outperform traditional tree-based models in this domain. To contribute to this discussion, we systematically benchmark DNNs against our production-grade LambdaMART model. We evaluate multiple DNN architectures and loss functions on a proprietary dataset from OTTO and validate our findings through an 8-week online A/B test. The results show that a simple DNN architecture outperforms a strong tree-based baseline in terms of total clicks and revenue, while achieving parity in total units sold.

    Full text in ACM Digital Library

  • SPOT #17 SEMORec: A Scalarized Efficient Multi-Objective Recommendation Framework
    by Sofia Maria Nikolakaki, Siyong Ma, Srivas Chennu, Humeyra Topcu Altintas

    Recommendation systems in multi-stakeholder environments often require optimizing for multiple objectives simultaneously to meet supplier and consumer demands. Serving recommendations in these settings relies on efficiently combining the objectives to address each stakeholder’s expectations, often through a scalarization function with pre-determined and fixed weights. In practice, selecting these weights becomes a problem in its own right. Recent work has developed algorithms that adapt these weights based on application-specific needs by using RL to train a model. While this solves automatic weight computation, such approaches are not efficient for frequent weight adaptation. They also do not allow for the human intervention that is oftentimes dictated by business needs. To bridge this gap, we propose a novel multi-objective recommendation framework that is highly efficient for a small number of objectives. It also enables business decision makers to easily tune the optimization by assigning different importance to multiple objectives. Through online experiments, we demonstrate the efficacy and efficiency of our framework through improvements in online business metrics.

    Full text in ACM Digital Library
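The scalarization step at the heart of the SPOT #17 abstract reduces to a weighted combination of per-objective scores, with the weights kept in serving-time configuration so decision makers can retune the trade-off without retraining. A simplified sketch of that step only (names and the simplex check are illustrative, not the SEMORec framework):

```python
def scalarize(scores, weights):
    """Combine per-objective model scores with business-tunable weights.

    `scores` maps objective name -> model score for one candidate;
    `weights` maps objective name -> importance on the simplex
    (non-negative, summing to 1).
    """
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must lie on the simplex"
    return sum(w * scores[name] for name, w in weights.items())

def rank(candidates, weights):
    """Order candidates by scalarized score, highest first."""
    return sorted(candidates, key=lambda c: scalarize(c["scores"], weights), reverse=True)
```

Shifting the weights, for instance from engagement-heavy to revenue-heavy, reorders candidates immediately, which is what makes frequent weight adaptation cheap compared to retraining an RL policy.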

  • SPOT #18 Balanced Public Service Media Recommendation Trade-offs with a Light Carbon Footprint
    by Marcel Hauck, Michael Huber, Juri Diels, David Wittenberg, Dietmar Jannach

    Public service media (PSM) providers commonly face the challenge of balancing user engagement metrics and public value. In this case study, we report on the insights obtained at ARD, Germany’s largest PSM provider, when investigating the effectiveness of different collaborative filtering techniques on their video-on-demand platform ARD Mediathek. While an offline evaluation indicated that a modern model based on a denoising auto-encoder might lead to the best prediction accuracy, A/B testing revealed that an item-based nearest-neighbor technique excelled both in terms of engagement and public value metrics. Our findings thus suggest that traditional, light-weight techniques should not be easily dismissed, given also their comparatively limited resource requirements and light carbon footprint. To enable future research on this topic, we provide a real-world dataset with usage data from our platform.

    Full text in ACM Digital Library

  • SPOT #19 In-context Learning for Addressing User Cold-start in Sequential Movie Recommenders
    by Xurong Liang, Vu Nguyen, Vuong Le, Paul Albert, Julien Monteil

    The user cold-start problem remains a fundamental challenge for sequential recommender systems, particularly in large-scale video streaming services where a substantial portion of users have limited or no historical interaction data. In this work, we address this issue by proposing a framework that leverages Large Language Models (LLMs) to enrich interaction histories using user metadata. Our approach generates a set of imaginary video items relevant to a given user’s demographic, represented through structured item key-value attributes. The generated items are inserted into users’ interaction sequences using early or late fusion strategies. We find that the generated user histories enable better initial user profiling for absolute cold users and enhanced preference modeling for nearly cold users. Experimental results on the public ML-1M dataset and an internal dataset from the Amazon MX Player streaming service demonstrate the effectiveness of our LLM-based augmentation method in mitigating cold-start challenges.

    Full text in ACM Digital Library

  • SPOT #20 Minimize Negative Experiences in Video Recommendation Systems with Multimodal Large Language Models
    by Suman Malani, Youwei Zhang, Liang Liu

    Detecting and limiting negative user experiences in recommendation systems with survey feedback modeling is difficult due to ultra-sparse, imbalanced, and noisy data. The proposed approach fine-tunes a multimodal Large Language Model (MLLM) on survey data enriched with contextual information, such as post engagement features and community data, as a teacher model to generate silver labels. A highly negative ranking model (HNRM) is then trained via knowledge distillation using both the original sparse survey labels and the generated silver labels. This approach significantly improves model generalization, decreases the calibration error rate, increases engagement while reducing negative experiences as measured by survey negative-experience rates in online A/B tests, and allows the model to scale beyond the limitations imposed by the original sparse and noisy dataset.

    Full text in ACM Digital Library

  • SPOT #21 Orthogonal Low Rank Embedding Stabilization
    by Kevin Zielnicki, Ko-Jen Hsiao

    The instability of embedding spaces across model retraining cycles presents significant challenges to downstream applications using user or item embeddings derived from recommendation systems as input features. This paper introduces a novel orthogonal low-rank transformation methodology designed to stabilize the user/item embedding space, ensuring consistent embedding dimensions across retraining sessions. Our approach leverages a combination of efficient low-rank singular value decomposition and orthogonal Procrustes transformation to map embeddings into a standardized space. This transformation is computationally efficient, lossless, and lightweight, preserving the dot product and inference quality while reducing operational burdens. Unlike existing methods that modify training objectives or embedding structures, our approach maintains the integrity of the primary model application and can be seamlessly integrated with other stabilization techniques.

    Full text in ACM Digital Library
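The alignment step in the SPOT #21 abstract, mapping a freshly retrained embedding table back into a standardized space with an orthogonal Procrustes transform, is compact enough to sketch with NumPy. This omits the low-rank SVD stage described in the abstract; shapes and names are illustrative:

```python
import numpy as np

def procrustes_align(new_emb, ref_emb):
    """Rotate a retrained embedding table onto the previous (reference) space.

    Solves the orthogonal Procrustes problem: find an orthogonal R
    minimizing ||new_emb @ R - ref_emb||_F, via the SVD of
    new_emb.T @ ref_emb. Because R is orthogonal, item-item dot products
    are preserved exactly, so the transform is lossless for inference.
    """
    u, _, vt = np.linalg.svd(new_emb.T @ ref_emb)
    return new_emb @ (u @ vt)

# Toy check: a rotated copy of the reference snaps back onto it.
rng = np.random.default_rng(0)
ref = rng.normal(size=(100, 8))                 # previous run's embeddings
q, _ = np.linalg.qr(rng.normal(size=(8, 8)))    # simulated orthogonal drift
drifted = ref @ q                               # "retrained" embeddings
aligned = procrustes_align(drifted, ref)
```

Downstream consumers keep reading embeddings from a stable coordinate system across retrains, while the primary model's ranking behavior is untouched since dot products are unchanged.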

  • SPOT #22 A Media Content Recommendation Method for Playlist Curators using LLM-Based Query Expansion
    by Yuta Hagio, Chigusa Yamamura, Hiromu Ogawa, Hisayuki Ohmata, Arisa Fujii

    Playlist curation is a key factor in media content discovery services, yet efficiently finding diverse, relevant content is challenging for curators owing to time-consuming manual query crafting. We propose a recommendation method that uses large language models (LLMs) for query expansion to assist curators. The proposed system generates multiple diverse queries from a playlist theme (title and optional description) using an LLM. The vectors derived from these expanded queries, along with the original theme vector, retrieve candidates by a vector search of a content database (using multilingual embeddings), enhancing discovery comprehensiveness and diversity. Experiments on Japanese TV programs show that the proposed method significantly improves precision (e.g., P@50 +22 points) compared to a baseline using only the theme vector. This approach enhances curator efficiency, improves playlist quality, and promotes more comprehensive content discovery.

    Full text in ACM Digital Library
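The retrieval step in the SPOT #22 abstract, scoring the catalogue with the theme vector plus every expanded-query vector, can be sketched with cosine similarity. The max-pooling over query variants and the shapes below are assumptions for illustration, not the paper's exact pipeline:

```python
import numpy as np

def expand_and_retrieve(theme_vec, expanded_vecs, content_matrix, top_k=5):
    """Retrieve candidates for a playlist theme plus its LLM-expanded queries.

    Each query vector (the original theme and every expanded query)
    scores the content catalogue by cosine similarity; per-item scores
    are max-pooled so an item strongly matched by any single expanded
    query can surface, which is what widens discovery beyond the theme
    vector alone.
    """
    queries = np.vstack([theme_vec] + list(expanded_vecs))
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = content_matrix / np.linalg.norm(content_matrix, axis=1, keepdims=True)
    scores = (q @ c.T).max(axis=0)   # best match over all query variants
    return np.argsort(-scores)[:top_k]
```

With only the theme vector, items matching an expanded facet of the theme would score low; pooling over the expanded queries lets them enter the candidate set.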

  • SPOT #23 Location Matters: Leveraging Multi-Resolution Geo-Embeddings for Housing Search
    by Ivo Silva, Guilherme Bonaldo, Pedro Nogueira

    QuintoAndar Group is Latin America’s largest housing platform, revolutionizing property rentals and sales. Headquartered in Brazil, it simplifies the housing process by eliminating paperwork and enhancing accessibility for tenants, buyers, and landlords. With thousands of houses available for each city, users struggle to find the ideal home. In this context, location plays a pivotal role, as it significantly influences property value, access to amenities, and quality of life. A great location can make even a modest home highly desirable. Therefore, incorporating location into recommendations is essential for their effectiveness. We propose a geo-aware embedding framework to address sparsity and spatial nuances in housing recommendations on digital rental platforms. Our approach integrates a hierarchical H3 [3] grid at multiple levels into a two-tower neural architecture. We compare our method with a traditional matrix factorization baseline and a single-resolution variant using interaction data from our platform. Embedding-specific evaluation reveals richer and more balanced embedding representations, while offline ranking simulations demonstrate a substantial uplift in recommendation quality.

    Full text in ACM Digital Library
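The multi-resolution location signal from the SPOT #23 abstract can be approximated without the H3 library by nesting rectangular grid cells, one token per resolution; the real system uses hierarchical hexagonal H3 indexes, so the cells and sizes below are only an illustrative stand-in:

```python
def multi_res_cells(lat, lng, cell_sizes=(100, 10, 1)):
    """Quantize a coordinate into nested grid cells, coarse to fine.

    Cell sizes are in hundredths of a degree (100 = 1 degree), so a
    listing is represented at roughly city-, neighbourhood-, and
    block-level granularity at once. A two-tower model would look up
    one embedding per (size, cell) token and combine them, letting
    sparse fine cells back off to well-trained coarse cells.
    """
    # Work in integer hundredths of a degree to avoid float floor-division pitfalls.
    clat, clng = round(lat * 100), round(lng * 100)
    return [(size, (clat // size, clng // size)) for size in cell_sizes]
```

Two listings on the same block share all three tokens, while listings merely in the same city share only the coarse one, which gives the model a graded notion of spatial proximity.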

  • SPOT #24 Leveraging Explicit Negative Feedback in Large-Scale Recommendation Systems: A Case Study
    by Madhura Raju, Manisha Sharma, Hongyu Xiong, Bingfeng Deng, Meng Na

    What users dislike can be just as important as what they engage with, yet explicit negative user feedback remains underutilized in most recommendation systems. This paper presents practical approaches for capturing such feedback through lightweight, context-aware surveys and in-feed interactions. Referencing a case study on large-scale implementations at TikTok, we demonstrate how incorporating user feedback signals, once denoised and modeled, can improve feed quality, content relevance, and long-term user engagement. Our findings highlight that even small, well-designed feedback mechanisms can meaningfully improve user experience and trust.

    Full text in ACM Digital Library

  • SPOT #25 Not All Impressions Are Created Equal: Psychology-Informed Retention Optimization for Short-Form Video Recommendation
    by Yuyan Wang, Jing Zhong, Yuxin Cui, Zhaohui Guo, Chuanqi Wei, Yanchen Wang, Zellux Wang

    Recommender systems that are optimized only for short-term engagement can lead to undesirable outcomes and hurt long-term consumer experience. In response, researchers and practitioners have proposed to incorporate retention signals into recommender systems. Existing retention models are built on item-level interactions where every impression is weighted equally. However, on short-form video platforms where content is presented sequentially and passively consumed, users are unlikely to engage equally with every video, and it is hard to establish any meaningful relationships between a short video watch and long-term retention behaviors. In this work, we propose a psychology-informed retention modeling approach grounded in the peak–end rule, which suggests that people evaluate past experiences largely based on the most intense moment (“peak”) and the final moment (“end”). Specifically, we train a retention model that predicts user return based on the peak and end moments of each session, which is then incorporated into a multi-stage recommender system. We implemented our approach on Facebook Reels, one of the world’s largest short-form video recommendation platforms. In a long-term A/B test against the production system, our model delivered significant improvements in Daily Active Users and total sessions, suggesting an improved long-term user experience.

    Full text in ACM Digital Library
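
The peak–end modeling in SPOT #25 can be illustrated with a minimal sketch; the function name, engagement scores, and their scale are assumptions for illustration, not the paper's actual features or model.

```python
# Hypothetical sketch of the peak-end idea: instead of weighting every
# impression equally, keep only the most intense moment ("peak") and the
# final moment ("end") of a session as retention-model inputs.

def peak_end_features(engagement_scores):
    """Return (peak, end) for one session's per-impression engagement."""
    if not engagement_scores:
        return (0.0, 0.0)
    return (max(engagement_scores), engagement_scores[-1])

# A session of mostly passive scrolls with one highlight: only the
# highlight (peak) and the last watch (end) drive the retention signal.
session = [0.1, 0.2, 0.9, 0.1, 0.3]
features = peak_end_features(session)  # (0.9, 0.3)
```

In the paper's setup, such session-level features would then feed a model predicting user return, which is in turn incorporated into the multi-stage recommender.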

  • SPOT #26Metadata Generation and Evaluation using LLMs – Case Study on Canonical Titles
    by Sinan Zhu, Sanja Simonovikj, Darren Edmonds, Yang Sun

    In online job search platforms, autocomplete plays a crucial role in providing immediate, structured suggestions that guide users through their query process. However, inconsistencies in job title expressions, such as ‘sr data scientist’ versus ‘data scientist senior’, or embellished forms such as ‘superstar software engineer’, can undermine the quality of autocomplete suggestions and diminish user satisfaction. Traditional normalization methods rely on manually curated vocabularies, which are labor intensive and often insufficient to capture the diverse variations in raw job titles. In this work, we present an automated and scalable framework for canonical job title generation that leverages large language models (LLMs) alongside embedding-based similarity measures to derive normalized job titles directly from raw data. Our approach systematically removes irrelevant information, enforces a consistent format, and eliminates overly generic or redundant titles by combining LLM-generated normalization with a two-stage deduplication process. Evaluated on a dataset labeled via a human/LLM mix, our method demonstrates significant improvements in normalization quality, with offline accuracy gains of 18.6% over baseline methods and online A/B tests showing over 160% enhancement in user engagement metrics.

    Full text in ACM Digital Library
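
The deduplication stage in SPOT #26 can be sketched as follows; the paper uses embedding-based similarity, for which a token-level Jaccard score stands in here, and all names and the threshold are illustrative assumptions.

```python
# Illustrative sketch of similarity-based title deduplication. Titles
# whose similarity to an already-kept title exceeds the threshold
# collapse onto the first (canonical) title seen.

def similarity(a, b):
    """Token Jaccard score, standing in for embedding cosine similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def deduplicate(titles, threshold=0.6):
    canonical = []
    for title in titles:
        if all(similarity(title, kept) < threshold for kept in canonical):
            canonical.append(title)
    return canonical

# "Data Scientist Senior" collapses onto "Senior Data Scientist".
titles = ["Senior Data Scientist", "Data Scientist Senior", "Software Engineer"]
canonical_titles = deduplicate(titles)
```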

  • SPOT #27Semantic IDs for Music Recommendation
    by M. Jeffrey Mei, Florian Henkel, Samuel E. Sandberg, Oliver Bembom, Andreas F. Ehmann

    Training recommender systems for next-item recommendation often requires unique embeddings to be learned for each item, which may take up most of the trainable parameters for a model. Shared embeddings, derived for example from content information, can reduce the number of distinct embeddings to be stored in memory. This allows for a more lightweight model; alternatively, model complexity can be increased, since fewer embeddings need to be stored in memory. We show the benefit of using shared content-based features (‘semantic IDs’) in improving recommendation accuracy and reducing model size for two music recommendation datasets, including an online A/B test.

    Full text in ACM Digital Library
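
The memory argument behind semantic IDs in SPOT #27 can be sketched in a few lines; hashing content features into a small code space here is purely illustrative and not the paper's actual quantization scheme.

```python
# Minimal sketch of the semantic-ID memory argument: rather than one
# learned embedding row per item, items sharing content-derived codes
# share rows, shrinking the embedding table.

NUM_CODES = 1000  # shared-embedding table size, vs. millions of item IDs

def semantic_id(content_features):
    """Deterministically map content features (e.g. genre tags) to a code."""
    return hash(tuple(sorted(content_features))) % NUM_CODES

# Two tracks with the same content features collapse onto one embedding
# row, so the model stores NUM_CODES embeddings instead of one per track.
a = semantic_id(["indie", "rock", "2010s"])
b = semantic_id(["rock", "indie", "2010s"])
```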

  • SPOT #28SASRec in Action: Real-World Adaptations for ZDF Streaming Service
    by Venkata Harshit Koneru, Xenija Neufeld, Sebastian Loth, Andreas Grün

    The ZDF streaming platform uses SASRec (Self-attentive Sequential Recommendation Model) for generating personalized recommendations. In the present study, we tested a novel combination of a) sampling strategies for negative items and b) augmenting the model’s input data with Repeated Padding (RepPad). We compared different model variants across three use cases in an A/B test. Depending on the use case, the modifications affected viewing volume and item popularity in different ways.

    Full text in ACM Digital Library

  • SPOT #29Cross-Batch Aggregation for Streaming Learning from Label Proportions in Industrial-Scale Recommendation Systems
    by Jonathan Valverde, Tiansheng Yao, Xiang Li, Yuan Gao, Yin Zhang, Andrew Evdokimov, Adam Kraft, Samuel Ieong, Jerry Zhang, Ed Chi, Derek Cheng, Ruoxi Wang

    Recent controls over user data have diluted user signals essential to train industrial recommendation systems, replacing traditional event-level labels with aggregated item-level labels. Fitting these noisy aggregates into the event-level paradigm used by industrial recommendation systems causes models to be biased and miscalibrated, hurting critical business metrics. Learning from Label Proportions (LLP), a framework where instance-level prediction models are trained from aggregated signals, offers a principled solution to this problem — as long as all samples from an aggregate are present within the same training batch. Unfortunately, industry-scale recommender systems impose infrastructure constraints that violate this critical assumption because (1) they are trained in a sequential streaming framework that spreads aggregates across batches, (2) aggregates often exceed the size of a single batch, and (3) label noise makes it difficult to identify the time boundaries that correspond to the aggregated label. To address these issues, we propose a novel technique called Cross Batch Aggregate (XBA) Loss to adapt LLP to the streaming setting. We design the loss to have a gradient that mimics the true aggregated loss gradient, approximating the distribution of the aggregate by using cumulative statistics across each aggregate. This enables (1) optimizing for model calibration and (2) learning a conversion model from the aggregate signals. We have deployed this technique to a Google Ads system impacted by conversion signal loss due to privacy constraints, delivering significant improvements on model calibration (48.8% reduction in online bias), advertiser value, and business metrics. Our key contribution is extending LLP to the streaming setting, providing a practical solution that bridges the gap between LLP research and industrial applications.

    Full text in ACM Digital Library

  • SPOT #30Kamae: Bridging Spark and Keras for Seamless ML Preprocessing
    by George Barrowclough, Marian Andrecki, James Shinner, Daniele Donghi

    In production recommender systems, feature preprocessing must be faithfully replicated across training and inference environments. This often requires duplicating logic between offline and online environments, increasing engineering effort and introducing risks of dataset shift. We present Kamae, an open-source Python library that bridges this gap by translating PySpark preprocessing pipelines into equivalent Keras models. Kamae provides a suite of configurable Spark transformers and estimators, each mapped to a corresponding Keras layer, enabling consistent, end-to-end preprocessing across the ML lifecycle. The framework’s utility is illustrated on real-world use cases, including the MovieLens dataset and Expedia’s learning-to-rank pipelines. The code is available at https://github.com/ExpediaGroup/kamae.

    Full text in ACM Digital Library
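
The train/serve parity problem that Kamae (SPOT #30) addresses can be illustrated with a stdlib-only sketch; this is not Kamae's API, just the contract it enforces: parameters fitted once on the offline side, and the identical transform replayed online.

```python
# Sketch of offline/online preprocessing parity (not Kamae's actual API):
# a scaler is fitted offline, its parameters exported, and the identical
# transform replayed at serving time, so both sides agree exactly.

def fit_scaler(values):
    """Offline (the Spark side in Kamae): estimate normalization parameters."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return {"mean": mean, "std": var ** 0.5 or 1.0}  # guard against std == 0

def transform(value, params):
    """Online (the Keras side in Kamae): same logic, same fitted parameters."""
    return (value - params["mean"]) / params["std"]

params = fit_scaler([10.0, 20.0, 30.0])
offline = [transform(v, params) for v in [10.0, 20.0, 30.0]]
online = transform(20.0, params)  # must match the offline row exactly
```

Hand-duplicating `transform` in two codebases is exactly the dataset-shift risk the abstract describes; generating one side from the other removes it.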

  • SPOT #31RADAR: Recall Augmentation through Deferred Asynchronous Retrieval
    by Amit Jaspal, Qian Dang, Ajantha Ramineni

    Modern large-scale recommender systems employ a multi-stage ranking funnel (Retrieval, Pre-ranking, Ranking) to balance engagement and computational constraints (latency, CPU). However, the initial retrieval stage, often relying on efficient but less precise methods like K-Nearest Neighbors (KNN), struggles to effectively surface the most engaging items from billion-scale catalogs, particularly distinguishing highly relevant and engaging candidates from merely relevant ones. We introduce Recall Augmentation through Deferred Asynchronous Retrieval (RADAR), a novel framework that leverages asynchronous, offline computation to pre-rank a significantly larger candidate set for users using the full-complexity ranking model. These top-ranked items are stored and utilized as a high-quality retrieval source during online inference, bypassing the online retrieval and pre-ranking stages for these candidates. We demonstrate through offline experiments that RADAR significantly boosts recall (2X Recall@200 vs the DNN retrieval baseline) by effectively combining a larger retrieved candidate set with a more powerful ranking model. Online A/B tests confirm a +0.8% lift in topline engagement metrics, validating RADAR as a practical and effective method to improve recommendation quality under strict online serving constraints.

    Full text in ACM Digital Library
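
The RADAR flow in SPOT #31 can be sketched end to end; the function names, store layout, and the toy "full ranker" are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of RADAR: an asynchronous offline job ranks a much larger
# candidate set per user with the full ranking model and stores the top
# items; serving reads them back as an extra retrieval source.

def full_ranker(user, item):
    return -abs(user - item)  # stand-in for the heavy ranking model

def offline_prerank(user, catalog, k):
    ranked = sorted(catalog, key=lambda i: full_ranker(user, i), reverse=True)
    return ranked[:k]

radar_store = {}  # user -> precomputed high-quality candidates

def serve(user, online_candidates):
    # Precomputed items bypass the online retrieval and pre-ranking stages.
    return radar_store.get(user, []) + online_candidates

radar_store[7] = offline_prerank(7, catalog=range(100), k=3)
results = serve(7, online_candidates=[42, 99])
```

The expensive ranking happens off the request path; online serving only pays for a key-value lookup.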

  • SPOT #32SocRipple: A Two-Stage Framework for Cold-Start Video Recommendations
    by Amit Jaspal, Kapil Dalwani, Ajantha Ramineni

    Most industry-scale recommender systems face critical cold-start challenges—new items lack interaction history, making it difficult to distribute them in a personalized manner. Standard collaborative filtering models underperform due to sparse engagement signals, while content-only approaches lack user-specific relevance. We propose SocRipple, a novel two-stage retrieval framework tailored for cold-start item distribution in social graph-based platforms. Stage 1 leverages the creator’s social connections for targeted initial exposure. Stage 2 builds on early engagement signals and stable user embeddings—learned from historical interactions—to “ripple” outwards via K-Nearest Neighbor (KNN) search. Large-scale experiments on a major video platform show that SocRipple boosts cold-start item distribution by +36% while maintaining user engagement rate on cold-start items, effectively balancing new-item exposure with personalized recommendations.

    Full text in ACM Digital Library

  • SPOT #33Scaling Image Variant Optimization Through Customer Bucketing and Response Caching: A Large-Scale Implementation at Amazon Prime Video
    by Haiyun Jin, Bobby Patel

    Multi-Armed Bandit (MAB) models are widely used in industrial recommender systems to manage the ongoing trade-off between exploration and exploitation. At scale, the computational cost of running inference for every incoming request can become prohibitively high. In this paper, we describe a practical solution deployed at Amazon Prime Video to address the cost challenges of a production MAB-based image-ranking system known as Summer. Our method combines two key strategies: (1) caching the ranking results and (2) bucketing users to distribute the inference workload across customer cohorts. We show that these strategies reduce hourly inference calls by up to 77.8% relative to an uncached, fully user-specific baseline, leading to significant operational and infrastructural savings. Despite lowering inference volume, the approach maintained user engagement and improved specific outcomes such as video streams and Amazon Video (AV) purchases. A 21-day global online experiment showed a 0.02% increase in video streams (p = 0.031) and a 0.19% increase in AV purchase units (p = 0.003), demonstrating that the technique reduces inference costs without compromising user experience or model performance. We describe the system design, experimental findings, and practical considerations for applying caching and bucketing strategies at scale.

    Full text in ACM Digital Library
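
The two cost levers in SPOT #33, bucketing and response caching, compose naturally; the sketch below shows the mechanism with a toy ranker and hypothetical names, not the Summer system's actual API.

```python
# Illustrative sketch of cohort bucketing plus response caching: users are
# hashed into a fixed number of buckets, and the (expensive) ranking is
# computed once per bucket and cached rather than once per request.

NUM_BUCKETS = 100
inference_calls = 0
cache = {}

def rank_images(bucket):
    """Stand-in for MAB model inference."""
    global inference_calls
    inference_calls += 1
    return [f"image_{(bucket + i) % 5}" for i in range(3)]

def get_ranking(user_id):
    bucket = hash(user_id) % NUM_BUCKETS
    if bucket not in cache:  # only a cache miss triggers inference
        cache[bucket] = rank_images(bucket)
    return cache[bucket]

# 1,000 requests from one cohort cost a single inference call.
for _ in range(1000):
    ranking = get_ranking("user_42")
```

In production the cache would carry a TTL so the bandit can keep exploring; that refresh policy governs the savings/staleness trade-off.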

  • SPOT #34Operational Twin–Driven AI Recommender for Strategic Service Planning
    by Vivek Singh, Sarith Mohan, Chetan Srinidhi, Santosh Pai, Ullaskrishnan Poikavila, Codruta Ene, Ankur Kapoor, Neil Biehn, Dorin Comaniciu

    Traditional service management relies heavily on manual processes due to data complexity and human involvement, limiting the impact of AI in strategic planning. We present an AI recommender system that leverages an operational twin of service operations to optimize long-term KPIs using Monte Carlo search and mixed-integer programming. Focusing on personnel allocation for large healthcare equipment, the system accounts for domain-specific constraints like specialization and continuity. We deployed the system at Siemens Healthineers to support over 300,000 pieces of equipment across the U.S. and report productivity gains from over a year of real-world use, along with key lessons for adoption at scale.

    Full text in ACM Digital Library

  • SPOT #35Simulating Discoverability for Upcoming Content in TV Entertainment Platforms
    by Adeep Hande, Kishorekumar Sundararajan, Yidnekachew Endale, Sardar Hamidian

    In entertainment platforms, search and browse are critical entry points for content discovery. Yet newly ingested titles often fail to surface at the moment of highest user interest due to a range of practical issues: lack of user-item interaction data, cold-start sparsity, or filtering strategies that deprioritize fresh content. These visibility gaps are difficult to detect before user complaints or engagement drops emerge. We present a simulation-based evaluation framework that assesses the discoverability of upcoming content that is about to be released or has just been ingested into our catalog. Our system uses large language models, grounded in item metadata and historical query patterns, to generate realistic search queries that reflect how users are likely to look for content. These synthetic queries are executed in a staging environment that mirrors production, capturing UI-level responses to compute a discoverability score for each entity. The score identifies visibility risks without modifying the search engine itself, enabling proactive editorial and QA interventions. Integrated into Comcast’s daily workflows, this framework scales to thousands of titles and supports operational search quality assurance. While built for voice and text-based entertainment search, the approach generalizes to other recommendation and retrieval systems that face similar black-box surfacing challenges.

    Full text in ACM Digital Library
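
A discoverability score like the one in SPOT #35 can be sketched as the fraction of synthetic queries that surface a title; the toy token-overlap search below is a stand-in for the staging environment, and all names are assumptions.

```python
# Hedged sketch of the discoverability score: run synthetic queries (the
# paper generates them with an LLM from metadata and query history) against
# a search function and measure how often the target title appears in the
# top-k results.

def toy_search(query, catalog, k=1):
    """Rank titles by token overlap with the query; return the top k."""
    def overlap(title):
        return len(set(query.lower().split()) & set(title.lower().split()))
    return sorted(catalog, key=overlap, reverse=True)[:k]

def discoverability(title, queries, catalog, k=1):
    hits = sum(1 for q in queries if title in toy_search(q, catalog, k))
    return hits / len(queries)

catalog = ["The Space Race", "Space Dogs", "Cooking at Home"]
queries = ["the space race", "space race documentary", "dogs in space"]
score = discoverability("The Space Race", queries, catalog)  # 2 of 3 surface it
```

A low score flags a visibility risk before release, without touching the search engine itself.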

  • SPOT #36RankGraph: Unified Heterogeneous Graph Learning for Cross-Domain Recommendation
    by Renzhi Wu, Junjie Yang, Li Chen, Hong Li, Li Yu, Hong Yan

    Cross-domain recommendation systems face the challenge of integrating fine-grained user and item relationships across various product domains. To address this, we introduce RankGraph, a scalable graph learning framework designed to serve as a core component in recommendation foundation models (FMs). By constructing and leveraging graphs composed of heterogeneous nodes and edges across multiple products, RankGraph enables the integration of complex relationships between users, posts, ads, and other entities. Our framework employs a GPU-accelerated Graph Neural Network and contrastive learning, allowing for dynamic extraction of subgraphs such as item-item and user-user graphs to support similarity-based retrieval and real-time clustering. Furthermore, RankGraph integrates graph-based pretrained representations as contextual tokens into FM sequence models, enriching them with structured relational knowledge. RankGraph has demonstrated improvements in click (+0.92%) and conversion rates (+2.82%) in online A/B tests, showcasing its effectiveness in cross-domain recommendation scenarios.

    Full text in ACM Digital Library

  • SPOT #37Contrastive Conditional Embeddings for Item-based Recommendation at E-commerce Scale
    by Akira Fukumoto, Aghiles Salah, Sarthak Shrivastava, Alexandru Tatar, Yannick Schwartz, Vincent Michel, Lee Xiong

    Item-based recommendation is crucial in e-commerce for helping users navigate the myriad of options available to them. While embedding-based methods are standard, learning high-quality item representations from sparse co-occurrence data is challenging. Deployment at scale is even harder, with a lack of well-documented real-world successes. The two main obstacles are the model size, which scales linearly with the number of items, and the co-occurrence-based training data, which is massive and sparse, leading to significant memory, storage, and compute demands. In this work, we propose a conditional factor model combining item co-occurrences and textual information to generate effective embeddings through a contrastive loss with mixed negative sampling for e-commerce recommendations. Our production model exceeds 10 billion parameters, half of which are trained daily on over 2 billion item-item co-occurrence pairs. We detail key implementation choices that allowed us to overcome the above challenges and successfully deploy the model on Rakuten Group, Inc.’s large-scale e-commerce platform in Japan. A/B tests show strong impact, with purchase rate gains of +16.38% and +4.01% across two major recommendation widgets.

    Full text in ACM Digital Library

  • SPOT #38Unified Survey Modeling to Limit Negative User Experiences in Recommendation Systems
    by Chenghui Yu, Haoze Wu, Jian Ding, Bingfeng Deng, Hongyu Xiong

    Reducing negative user experiences is crucial for the success of recommendation platforms. Exposure to inappropriate content can not only harm users’ psychological well-being but also drive them away, ultimately undermining the platform’s long-term growth. However, recommendation algorithms often prioritize positive feedback signals due to the relative scarcity of negative ones, which may lead to the oversight of valuable negative user feedback. In this paper, we propose a method that leverages in-feed surveys to collect user feedback, models this feedback, and integrates the predictions into the recommendation system. We enhance the personalized survey model based on the HoME framework. Our experiments demonstrate that the proposed method significantly outperforms the baseline model. We observed an average 0.52% AUC increase and a 1.38% LogLoss decline across all heads. After deploying the model in the TikTok app, we observed increases of 0.82% and 0.67% in survey_like_rate and Likes, and reductions of 4.08%, 2.51%, and 2.59% in survey_inappropriate_rate, Reports, and Dislikes, respectively, illustrating an improvement in overall recommendation quality and a decline in negative signals.

    Full text in ACM Digital Library

  • SPOT #39USD: A User-Intent-Driven Sampling and Dual-Debiasing Framework for Large-Scale Homepage Recommendations
    by Jiaqi Zheng, Cheng Guo, Yi Cao, Chaoqun Hou, Tong Liu, Bo Zheng

    Large-scale homepage recommendations face critical challenges from pseudo-negative samples caused by exposure bias, where non-clicks may indicate inattention rather than disinterest. Existing work lacks thorough analysis of invalid exposures and typically addresses isolated aspects (e.g., sampling strategies), overlooking the critical impact of pseudo-positive samples — such as homepage clicks merely to visit marketing portals. We propose a unified framework for large-scale homepage recommendation sampling and debiasing. Our framework consists of two key components: (1) a user intent-aware negative sampling module to filter invalid exposure samples, and (2) an intent-driven dual-debiasing module that jointly corrects exposure bias and click bias. Extensive online experiments on Taobao demonstrate the efficacy of our framework, achieving significant improvements in user click-through rates (UCTR) by 35.4% and 14.5% in two variants of the marketing block on the Taobao homepage, Baiyibutie and Taobaomiaosha.

    Full text in ACM Digital Library

  • SPOT #40User Long-Term Multi-Interest Retrieval Model for Recommendation
    by Yue Meng, Cheng Guo, Xiaohui Hu, Honghu Deng, Yi Cao, Tong Liu, Bo Zheng

    User behavior sequence modeling, which captures user interest from rich historical interactions, is pivotal for industrial recommendation systems. Despite breakthroughs in ranking-stage models capable of leveraging ultra-long behavior sequences with lengths scaling up to thousands, existing retrieval models remain constrained to sequences of hundreds of behaviors due to two main challenges. One is the strict latency budget imposed by real-time serving over a large-scale candidate pool. The other is the absence of target-aware mechanisms and cross-interaction architectures, which prevents the use of ranking-like techniques to simplify long-sequence modeling. To address these limitations, we propose a new framework named User Long-term Multi-Interest Retrieval Model (ULIM), which enables thousand-scale behavior modeling in the retrieval stage. ULIM includes two novel components: 1) Category-Aware Hierarchical Dual-Interest Learning partitions long behavior sequences into multiple category-aware sub-sequences representing multiple interests, and jointly optimizes long-term and short-term interests within each interest cluster. 2) Pointer-Enhanced Cascaded Category-to-Item Retrieval introduces a Pointer-Generator Interest Network (PGIN) for next-category prediction, followed by next-item retrieval over the top-K predicted categories. Comprehensive experiments on the Taobao dataset show that ULIM achieves substantial improvements over state-of-the-art methods, and brings lifts of 5.54% in clicks, 11.01% in orders, and 4.03% in GMV for Taobaomiaosha, a notable mini-app of Taobao.

    Full text in ACM Digital Library

  • SPOT #41Improving Visual Recommendation on E-commerce Platforms Using Vision-Language Models
    by Yuki Yada, Sho Akiyama, Ryo Watanabe, Yuta Ueno, Yusuke Shido, Andre Rusli

    On large-scale e-commerce platforms with tens of millions of active monthly users, recommending visually similar products is essential for enabling users to efficiently discover items that align with their preferences. This study presents the application of a vision-language model (VLM)—which has demonstrated strong performance in image recognition and image-text retrieval tasks—to product recommendations on Mercari, a major consumer-to-consumer marketplace used by more than 20 million monthly users in Japan. Specifically, we fine-tuned SigLIP, a VLM employing a sigmoid-based contrastive loss, using one million product image-title pairs from Mercari collected over a three-month period, and developed an image encoder for generating item embeddings used in the recommendation system. Our evaluation comprised an offline analysis of historical interaction logs and an online A/B test in a production environment. In offline analysis, the model achieved a 9.1% improvement in nDCG@5 compared with the baseline. In the online A/B test, the click-through rate improved by 50% whereas the conversion rate improved by 14% compared with the existing model. These results demonstrate the effectiveness of VLM-based encoders for e-commerce product recommendations and provide practical insights into the development of visual similarity-based recommendation systems.

    Full text in ACM Digital Library

  • SPOT #42Stream Normalization for CTR Prediction
    by Yizhou Sang, Congcong Liu, Yuying Chen, Zhiwei Fang, Xue Jiang, Changping Peng, Zhangang Lin, Ching Law, Jingping Shao

    Deep learning models often encounter significant challenges when dealing with non-i.i.d. and non-stationary data distributions, particularly in incremental learning tasks such as click-through rate (CTR) prediction in recommender systems. Traditional normalization techniques, such as Batch Normalization and Layer Normalization, struggle to maintain stability and adaptability in the face of rapidly changing data distributions. To overcome these challenges, we introduce Stream Normalization (SN), a novel normalization method designed to adjust dynamically to shifting data distributions. SN enhances model robustness and mitigates the risk of catastrophic forgetting by continuously adapting its normalization statistics. Extensive experiments demonstrate that SN achieves state-of-the-art performance on both offline datasets and real-world online A/B tests, representing a significant advancement in incremental learning for streaming data.

    Full text in ACM Digital Library
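
The exact formulation of Stream Normalization is not given in the abstract; the sketch below only illustrates the generic idea of normalizing with running statistics that track a drifting stream, using an exponential moving average in place of fixed batch statistics.

```python
# Generic sketch (not the paper's SN) of normalization with running
# statistics that follow a non-stationary stream.

class RunningNorm:
    def __init__(self, momentum=0.99, eps=1e-5):
        self.momentum, self.eps = momentum, eps
        self.mean, self.var = 0.0, 1.0

    def __call__(self, x):
        # Update statistics on every example, so the normalizer tracks
        # distribution shift instead of assuming i.i.d. batches.
        self.mean = self.momentum * self.mean + (1 - self.momentum) * x
        self.var = (self.momentum * self.var
                    + (1 - self.momentum) * (x - self.mean) ** 2)
        return (x - self.mean) / (self.var + self.eps) ** 0.5

norm = RunningNorm()
for x in [0.0] * 200 + [100.0] * 2000:  # abrupt shift from ~0 to ~100
    y = norm(x)
# After the shift, the running mean has moved close to the new level,
# keeping normalized outputs centered despite the drift.
```

Batch or Layer Normalization with frozen statistics would keep emitting large normalized values after such a shift; running statistics recover automatically.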

  • SPOT #43An Analysis of Learned Product Embeddings in an E-Commerce Context
    by Mate Hartstein, Eva Giannatou, Martin Tegner

    Recommender systems often represent products with learnable embeddings. Yet, we seldom examine the structure of the embedding space, and what implications it has for the recommendation task at hand. In contrast, embeddings in natural language processing are well-understood and offer intuitive properties through word analogies (e.g. “queen – king = woman – man”). In this work, we present a corresponding approach that reveals latent knowledge in the structure of product embeddings. We prove their relevance in evaluating several embeddings learned from different data modalities in a home-furnishing context. Our findings evince distinct embedding strengths: visual embeddings capture explicit attributes like colour and shape; textual embeddings encode abstract concepts like style and functionality; while behavioural embeddings offer versatile representations driven by user interactions. We also highlight trade-offs, and link our evaluations to practical considerations in embedding development within the e-commerce domain.

    Full text in ACM Digital Library

  • SPOT #44Closing the Online-Offline Gap: A Scalable Framework for Composed Model Evaluation
    by Mahanth Kumar Beeraka, Chen Chen, Yining Lu, Briac Marcatte, Weikun Lyu, Brooke Bian, Enriko Aryanto, Ellie Wen, Mohamed Radwan, Tianshan Cui, Wenjing Lu, Mohsen Malmir, Yang Li

    We propose iPCF (Intelligent Prediction Composition Framework), a platform for training and evaluating ranking models. Unlike traditional approaches that focus solely on a model’s standalone prediction quality, such as frequent retraining, robust feature selection, or output calibration, iPCF evaluates the model’s performance in a production-like environment where multiple models are composed together to estimate the final conversion probability (eCVR). This framework is especially critical in Meta’s Lattice-based modeling stack, where multi-task models produce several predictions used downstream in business logic. By introducing a new metric based on the simulated, recomposed final eCVR, iPCF enables more accurate offline evaluation and informed candidate selection. In production use, the framework has led to up to an 18% improvement in L1 distance correlation with final top-line results. Beyond evaluation, iPCF brings serving-awareness into the model development cycle, improving the robustness, efficiency, and impact of ranking models.

    Full text in ACM Digital Library
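
The composed-evaluation idea in SPOT #44 can be illustrated with a toy recomposition; the composition rule (a product of a click model and a post-click conversion model) and all numbers are assumptions for illustration, not Meta's actual business logic.

```python
# Illustrative sketch of composed evaluation: instead of judging a model
# on its standalone predictions, recompose the final eCVR from all models
# in the chain and score that composite against ground truth.

def recomposed_ecvr(p_click, p_conv_given_click):
    return p_click * p_conv_given_click

def l1_distance(preds, truths):
    return sum(abs(p - t) for p, t in zip(preds, truths)) / len(preds)

# A candidate conversion model evaluated in the composed setting:
p_clicks = [0.10, 0.20, 0.05]
p_convs = [0.50, 0.25, 0.40]
ecvrs = [recomposed_ecvr(c, v) for c, v in zip(p_clicks, p_convs)]
error = l1_distance(ecvrs, [0.04, 0.06, 0.01])
```

A candidate that looks strong in isolation can still degrade this composite error, which is the gap between offline and online evaluation the framework closes.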

  • SPOT #45Enhancing Online Ranking Systems via Multi-Surface Co-Training for Content Understanding
    by Gwendolyn Zhao, Yilin Zheng, Raghu Keshavan, Lukasz Heldt, Qian Sun, Fabio Soldo, Li Wei, Aniruddh Nath, Nikhil Khani, Weilong Yang, Dapo Omidiran, Rein Zhang, Mei Chen, Lichan Hong, Xinyang Yi

    Content understanding is an important part of real-world recommendation systems. This paper introduces a Multi-surface Co-training (MulCo) system, designed to enhance online ranking systems by improving content understanding. The model is trained through a task-aligned co-training approach, leveraging objectives and data from multiple video discovery surfaces and various pre-trained embeddings. It separates video content understanding into an offline model, enabling scalability and efficient resource use. Experiments demonstrate that MulCo significantly outperforms non-task-aligned pre-trained embeddings and achieves substantial gains in online user value, e.g. satisfied engagement and freshness metrics. This system presents a practical solution for improving content understanding in multi-surface, large-scale recommender systems.

    Full text in ACM Digital Library

  • SPOT #46Scaling Generative Recommendations with Context Parallelism on Hierarchical Sequential Transducers
    by Yue Dong, Han Li, Shen Li, Nikhil Patel, Xing Liu, Xiaodong Wang, Chuanhao Zhuge

    Large-scale recommendation systems are pivotal in handling tens of billions of daily user actions, relying heavily on high-cardinality and heterogeneous features for accurate predictions. In a previous study, we identified that Hierarchical Sequential Transducers (HSTU) is an effective attention architecture for modeling high-cardinality, non-stationary streaming recommendation data, providing a good scaling law in the generative recommender framework (GR). Recent studies and experiments demonstrate that attending to longer user history sequences yields significant metric improvements. However, scaling sequence length is activation-heavy, necessitating parallelism solutions to effectively shard activation memory. In transformer-based LLMs, a common practice is to adopt a context parallelism (CP) mechanism that distributes computation along the sequence-length dimension among GPUs to reduce attention activation memory usage. Unlike LLMs, ranking models usually adopt jagged input tensors to represent user feature interactions, which is also the implementation mechanism here. In this work, we introduce context parallelism with jagged tensor support for HSTU attention, laying the foundation for scaling up the sequence dimension. Our work enabled 5.3x longer user interaction sequence lengths, with a scaling factor (training throughput) of 1.55x when used together with distributed data parallelism (DDP).

    Full text in ACM Digital Library

  • SPOT #47Zero-shot Cross-domain Knowledge Distillation: A Case study on YouTube Music
    by Srivaths Ranganathan, Chieh Lo, Bernardo Cunha, Nikhil Khani, Li Wei, Aniruddh Nath, Shawn Andrews, Gergo Varady, Yanwei Song, Jochen Klingenhoefer, Tim Steele

    Knowledge Distillation (KD) has been widely used to improve the quality of latency-sensitive models serving live traffic. However, applying KD in production recommender systems with low traffic is challenging: the limited amount of data restricts the teacher model size, and the cost of training a large dedicated teacher may not be justified. Cross-domain KD offers a cost-effective alternative by leveraging a teacher from a data-rich source domain, but introduces unique technical difficulties, as the features, user interfaces, and prediction tasks can significantly differ. We present a case study of using zero-shot cross-domain KD for multi-task ranking models, transferring knowledge from a roughly 100x larger video recommendation platform (YouTube) to a music recommendation application with significantly lower traffic. We share offline and live experiment results and present findings evaluating different KD techniques in this setting across two ranking models on the music app. Our results demonstrate that zero-shot cross-domain KD is a practical and effective approach to improve the performance of ranking models on low-traffic surfaces.

    Full text in ACM Digital Library

  • SPOT #48Never Miss an Episode: How LLMs are Powering Serial Content Discovery on YouTube
    by Aditee Kumthekar, Li Wei, Andrea Bettale, Mahesh Sathiamoorthy, Zrinka Puljiz, Aditya Mahajan

    Leveraging large language models (LLMs) through prompting presents a cost-effective approach to building scalable systems without traditional model training. This paper showcases the effectiveness of using a simple few-shot LLM prompt to develop a scalable and easily maintainable system that addresses a real-world user need. A critical user journey in video recommendation is watching serial content, which requires viewing episodes in a specific sequence. The existing method on YouTube for identifying serial content relied on manual creator tagging of playlists or inflexible regular expressions. These methods proved difficult to maintain and scale, limiting the system’s ability to effectively identify and recommend serial content. This paper demonstrates that a carefully designed few-shot LLM prompt can accurately identify serial playlists at scale, improving user experience with minimal engineering. The paper details the challenges and lessons learned in developing and deploying this prompting-based system.

    Full text in ACM Digital Library

  • SPOT #49LADDER: LLM-Annotated Data for Dogfooded Evaluation of Rankings
    by Mattia Ottoborgo

    In this paper we showcase the implementation of LADDER: a method that uses a large language model to annotate thousands of consumer reviews and train a pointwise learning-to-rank algorithm. By applying LADDER, we significantly improved the relevance of the top 4 reviews presented to users, demonstrably reducing the need to access the full review collection by 5%. This outcome highlights LADDER’s ability to enhance user experience by providing sufficient information within the initial review set, thereby streamlining the decision-making process. We discuss the efficiency gains in large-scale data labeling, the positive impact on trust and relevance in review presentation without sacrificing usability, and key insights into effectively integrating domain expertise into LLM annotation for high-quality results.

    Full text in ACM Digital Library
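    The pointwise setup the abstract mentions can be sketched in a few lines: each review is scored independently against an LLM-assigned relevance label, and the top 4 are surfaced. The single-feature toy model and gradient-descent fit below are illustrative assumptions; the paper's actual model and feature set are not specified here.

```python
# Sketch of pointwise learning-to-rank on LLM-annotated relevance labels.
# The linear model and tiny gradient-descent loop are toy stand-ins.

def fit_pointwise(X, y, lr=0.1, epochs=200):
    """Fit weights minimizing squared error between w.x and the LLM label."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for x, target in zip(X, y):
            pred = sum(wi * xi for wi, xi in zip(w, x))
            err = pred - target
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

def top_k(reviews, feats, w, k=4):
    """Score each review independently (pointwise) and keep the best k."""
    scored = sorted(
        zip(reviews, feats),
        key=lambda rf: -sum(wi * xi for wi, xi in zip(w, rf[1])),
    )
    return [r for r, _ in scored[:k]]
```

    The pointwise formulation is what makes LLM annotation practical here: each review needs only its own relevance label, not pairwise or listwise comparisons.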

  • SPOT #50Generalized User Representations for Large-Scale Recommendations and Downstream Tasks
    by Ghazal Fazelnia, Sanket Gupta, Claire Keum, Mark Koh, Timothy Heath, Guillermo Carrasco Hernández, Stephen Xie, Nandini Singh, Ian Anderson, Maya Hristakeva, Petter Skiden, Mounia Lalmas

    Accurately capturing diverse user preferences at scale is a core challenge for large-scale recommender systems like Spotify’s, given the complexity and variability of user behavior. To address this, we propose a two-stage framework that combines representation learning and transfer learning to produce generalized user embeddings. In the first stage, an autoencoder compresses rich user features into a compact latent space. In the second, task-specific models consume these embeddings via transfer learning, removing the need for manual feature engineering. This approach enhances flexibility by allowing dynamic updates to input features, enabling near-real-time responsiveness to user behavior. The framework has been deployed in production at Spotify with an efficient infrastructure that allows downstream models to operate independently. Extensive online experiments in a live setting show significant improvements in metrics such as consumption share, content discovery, and search success. Additionally, our method achieves these gains while substantially reducing infrastructure costs.

    Full text in ACM Digital Library
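    The two-stage pattern described above can be sketched structurally: stage one compresses raw user features into a compact embedding (in the paper, via a trained autoencoder), and stage two lets task-specific heads consume that embedding via transfer learning. The fixed toy weights and logistic head below are assumptions for illustration only.

```python
# Structural sketch of the two-stage framework: (1) an encoder compresses
# raw user features into a shared latent embedding; (2) downstream
# task-specific models consume the embedding. Weights are toy values;
# the real system trains the autoencoder on rich user features.
import math

ENCODER_W = [  # 4 raw features -> 2-dim embedding (illustrative weights)
    [0.5, -0.2, 0.1, 0.0],
    [0.0, 0.3, -0.1, 0.4],
]

def encode(user_features):
    """Stage 1: compress raw features into the shared latent embedding."""
    return [sum(w * f for w, f in zip(row, user_features)) for row in ENCODER_W]

def task_head(embedding, task_w=(1.0, -1.0), bias=0.0):
    """Stage 2: a task-specific head (here a logistic unit) on the embedding."""
    z = bias + sum(w * e for w, e in zip(task_w, embedding))
    return 1.0 / (1.0 + math.exp(-z))
```

    The key operational benefit the abstract claims follows from this separation: downstream heads depend only on the embedding interface, so input features can change without retraining every task model.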

  • SPOT #51Streaming Trends: A Low-Latency Platform for Dynamic Video Grouping and Trending Corpora Building
    by Yang Gu, Caroline Zhou, Qiao Zhang, Scott Wang, Yongzhe Wang, Li Zhang, Nikos Parotsidis, Cj Carey, Ashkan Fard, Mingyan Gao, Yaping Zhang, Sourabh Bansod

    This paper presents Streaming Trends, a real-time system deployed on a short-form video platform that enables dynamic content grouping, tracking videos from upload to their identification as part of a trend. Addressing the latency inherent in traditional batch processing for short-form video, Streaming Trends utilizes online clustering and flexible similarity measures to associate new uploads with relevant groups in near real-time. The system combines online processing for immediate updates triggered by uploads and seed queries with offline processes for similarity modeling and cluster quality maintenance. By facilitating the rapid identification and association of trending videos, Streaming Trends significantly enhances content discovery and user value on the platform.

    Full text in ACM Digital Library
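    The online-clustering step can be sketched as follows: each new upload's embedding joins the most similar existing group if similarity clears a threshold, otherwise it seeds a new group. Cosine similarity, the running-mean centroid update, and the threshold value are assumptions; the paper's similarity measures are more flexible.

```python
# Minimal online-clustering sketch: assign each new upload embedding to
# the nearest group, or start a new one. Cosine similarity and the
# running-mean centroid are illustrative choices, not the paper's exact ones.
import math

def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb) if na and nb else 0.0

class OnlineClusters:
    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.centroids = []  # running mean embedding per group
        self.sizes = []

    def add(self, emb):
        """Assign emb to the closest group (or a new one); return group id."""
        best, best_sim = -1, self.threshold
        for i, c in enumerate(self.centroids):
            sim = cosine(emb, c)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best == -1:
            self.centroids.append(list(emb))
            self.sizes.append(1)
            return len(self.centroids) - 1
        n = self.sizes[best]
        self.centroids[best] = [
            (c * n + e) / (n + 1) for c, e in zip(self.centroids[best], emb)
        ]
        self.sizes[best] = n + 1
        return best
```

    Because assignment happens per upload rather than in a batch job, a video can be associated with an emerging trend within moments of being posted, which is the latency advantage the paper emphasizes.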

  • SPOT #52Efficient Off-Policy Evaluation of Content Blending in Station-Based Music Experiences
    by Chelsea Weaver, Arvind Balasubramanian, Juan Borgnino, Ben London

    Audio streaming services, on both voice assistants and visual apps, often field requests such as “play more like Foo Fighters.” The service then returns a sequence of tracks that is both relevant to the request and personalized to the requester. While it is natural to evaluate the policies that produce these sequences in terms of customer engagement, such metrics do not assess their performance on other key business goals. We present our work to implement a content blending strategy to increase the prevalence of specific strategically-important content in these sequences and show how it allowed us to meet the needs of our artist and record label customers while minimizing harm to playback rates. In particular, we describe our efficient extension of off-policy evaluation to evaluate how blending impacts both engagement and the number of successful new release plays. We demonstrate how we used this work to choose blend rates for new policies so as to maximize our engagement metric while preserving the new release metric baseline set by the current production policy. We also investigate the accuracy of these methods by comparing our estimates to online results.

    Full text in ACM Digital Library
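    The blend-rate selection the abstract describes can be sketched with a standard inverse-propensity-scoring (IPS) estimator: estimate engagement and new-release plays for each candidate blend rate from logged data, then pick the rate that maximizes engagement subject to the new-release baseline. The log format, the uniform mixture model of blending, and plain IPS (rather than the paper's efficient extension) are simplifying assumptions.

```python
# IPS off-policy evaluation sketch for a blended policy, followed by
# constrained blend-rate selection. Log tuples, the mixture blending
# model, and plain IPS are assumptions for illustration.

def ips_estimate(logs, target_prob):
    """logs: (action, logged_prob, engagement, new_release_play) tuples.
    target_prob(action): probability of the action under the candidate policy."""
    eng = rel = 0.0
    for action, p_log, engagement, release_play in logs:
        w = target_prob(action) / p_log  # importance weight
        eng += w * engagement
        rel += w * release_play
    n = len(logs)
    return eng / n, rel / n

def pick_blend_rate(logs, candidate_rates, base_policy_prob, release_prob,
                    baseline_release):
    """Keep the highest-engagement rate whose release metric >= baseline."""
    best_rate, best_eng = None, float("-inf")
    for rate in candidate_rates:
        # Blended policy: mix the base policy with new-release content.
        tp = lambda a, r=rate: (1 - r) * base_policy_prob(a) + r * release_prob(a)
        eng, rel = ips_estimate(logs, tp)
        if rel >= baseline_release and eng > best_eng:
            best_rate, best_eng = rate, eng
    return best_rate
```

    The appeal of this offline procedure is that candidate blend rates can be compared from existing serving logs before any of them is exposed to users in an A/B test.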

  • SPOT #53Deep Reinforcement Learning for Ranking Utility Tuning in the Ad Recommender System at Pinterest
    by Xiao Yang, Mehdi Ayed, Longyu Zhao, Fan Zhou, Yuchen Shen, Abe Engle, Jinfeng Zhuang, Ling Leng, Jiajing Xu, Charles Rosenberg, Prathibha Deshikachar

    The ranking utility function in an ad recommender system, which linearly combines predictions of various business goals, plays a central role in balancing values across the platform, advertisers, and users. Traditional manual tuning, while offering simplicity and interpretability, often yields suboptimal results due to its unprincipled tuning objectives, the vast number of parameter combinations, and its lack of personalization and adaptability to seasonality. In this work, we propose a general Deep Reinforcement Learning framework for Personalized Utility Tuning (DRL-PUT) to address the challenges of multi-objective optimization within ad recommender systems. Our key contributions include: 1) Formulating the problem as a reinforcement learning task: given the state of an ad request, we predict the optimal hyperparameters to maximize a pre-defined reward. 2) Developing an approach to directly learn an optimal policy model using online serving logs, avoiding the need to estimate a value function, which is inherently challenging due to the high variance and unbalanced distribution of immediate rewards. We evaluated DRL-PUT through an online A/B experiment in Pinterest’s ad recommender system. Compared to the baseline manual utility tuning approach, DRL-PUT improved the click-through rate by 9.7% and the long click-through rate by 7.7% on the treated segment.

    Full text in ACM Digital Library
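    The core idea can be sketched as a linear utility whose weights are not hand-tuned constants but are predicted per ad request by a policy model. The objective names, the request-state features, and the rule-based stub standing in for the learned policy are all hypothetical.

```python
# Sketch of a linear ranking utility with per-request weights supplied by
# a policy model (the DRL-PUT idea). Objective names, state features, and
# the rule-based stub policy are illustrative assumptions; the real policy
# is learned from online serving logs.

OBJECTIVES = ("p_click", "p_conversion", "p_hide")

def utility(preds, weights):
    """Linear combination of predicted business objectives for one ad."""
    return sum(weights[k] * preds[k] for k in OBJECTIVES)

def stub_policy(request_state):
    """Stand-in for the learned policy: map request state to utility weights."""
    base = {"p_click": 1.0, "p_conversion": 2.0, "p_hide": -1.0}
    if request_state.get("seasonal_sale"):
        base["p_conversion"] = 3.0  # e.g. upweight conversions during sales
    return base

def rank_ads(ads, request_state):
    """Rank candidate ads by utility under request-specific weights."""
    w = stub_policy(request_state)
    return sorted(ads, key=lambda ad: -utility(ad["preds"], w))
```

    Replacing a single global weight vector with a state-conditioned one is what gives the framework the personalization and seasonal adaptability that manual tuning lacks.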

  • SPOT #54Practical Multi-Task Learning for Rare Conversions in Ad Tech
    by Yuval Dishi, Ophir Friedler, Yonatan Karni, Natalia Silberstein, Yulia Stolin

    We present a Multi-Task Learning (MTL) approach for improving predictions of rare (e.g., <1%) conversion events in online advertising. Conversions are classified into "rare" or "frequent" types based on historical statistics. The model learns shared representations across all signals while specializing through separate task towers for each type. The approach was tested and fully deployed to production, demonstrating consistent improvements in both offline (0.69% AUC lift) and online KPI metrics (2% Cost per Action reduction).

    Full text in ACM Digital Library
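    The architecture described above can be sketched as a routing decision plus a shared trunk: a conversion type is bucketed by its historical rate, all types pass through a common representation, and each bucket gets its own tower. The threshold value, toy layer weights, and two-layer shape are assumptions for illustration.

```python
# Architectural sketch of the MTL setup: conversion types are bucketed
# into "rare" vs "frequent" by historical rate, share a common trunk, and
# get separate task towers. Threshold and weights are illustrative toys.
import math

RARE_THRESHOLD = 0.01  # e.g. <1% historical conversion rate => "rare"

def bucket(historical_rate):
    """Route a conversion type to its task tower by historical frequency."""
    return "rare" if historical_rate < RARE_THRESHOLD else "frequent"

def shared_trunk(features):
    """Shared representation learned across all conversion signals."""
    return [math.tanh(0.5 * f) for f in features]

TOWERS = {  # separate head per bucket; toy weights for illustration
    "rare": [0.8, -0.3],
    "frequent": [0.4, 0.2],
}

def predict_conversion(features, historical_rate):
    """Shared trunk followed by the bucket-specific tower, sigmoid output."""
    rep = shared_trunk(features)
    w = TOWERS[bucket(historical_rate)]
    z = sum(wi * ri for wi, ri in zip(w, rep))
    return 1.0 / (1.0 + math.exp(-z))
```

    Sharing the trunk is what lets the sparse rare-conversion signals borrow statistical strength from frequent ones, while the separate towers keep the two regimes from interfering.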

This event is supported by the Capital City of Prague