Session 8: Multimodal Moments: Leveraging Vision, Sound, and/or Text for Recommendation

Date: Thursday September 25, 11:30–12:50 (GMT+2)
Session Chair: Li Chen

  • [RES] Beyond Immediate Click: Engagement-Aware and MoE-Enhanced Transformers for Sequential Movie Recommendation
    by Haotian Jiang, Sibendu Paul, Haiyang Zhang, Caren Chen

    Modern video streaming services rely heavily on recommender systems. Although many methods exist for content personalization and recommendation, sequential recommendation models stand out for their ability to summarize user behavior over time. We propose a novel sequential recommendation framework that addresses five key issues: suboptimal negative sampling strategies, fixed user-history context lengths, single-task optimization objectives, insufficient engagement-aware learning, and short-sighted prediction horizons, ultimately improving both immediate and multi-step next-title prediction for video streaming services. Our approach captures patterns of interaction at different time scales. We also align long-term user satisfaction with instantaneous intent signals using multi-task learning with an engagement-aware personalized loss. Finally, we extend traditional next-item prediction into a next-K forecasting task using a training strategy with soft positive labels. Extensive experiments on large-scale streaming data validate the effectiveness of our approach. Our best model outperforms the baseline in NDCG@1 by up to 3.52% under realistic ranking scenarios, showing the effectiveness of our engagement-aware and MoE-enhanced designs. Results also show that soft-label multi-K training is a practical and scalable extension, and that a balanced personalized negative sampling strategy generalizes well. Our framework outperforms baselines across all ranking metrics, providing a robust solution for production-scale streaming recommendations.

    Full text in ACM Digital Library
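
    The abstract does not spell out the soft-label construction. Below is a minimal PyTorch sketch of one way to turn next-K forecasting into training against soft positive labels; the position-decay weighting and all names here are illustrative assumptions, not the paper's method.

      import torch

      def soft_next_k_targets(next_items, num_items, k=5, decay=0.8):
          """Build a soft target distribution over the catalog from the next
          k ground-truth items, down-weighting later positions.

          next_items: LongTensor [batch, k] of the item ids consumed next.
          Returns a FloatTensor [batch, num_items] summing to 1 per row.
          """
          batch = next_items.size(0)
          weights = decay ** torch.arange(k, dtype=torch.float32)   # [k]
          targets = torch.zeros(batch, num_items)
          targets.scatter_add_(1, next_items, weights.expand(batch, k).contiguous())
          return targets / targets.sum(dim=1, keepdim=True)

      def soft_label_loss(logits, targets):
          # Cross-entropy against the soft distribution instead of one hard label.
          return -(targets * torch.log_softmax(logits, dim=1)).sum(dim=1).mean()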

  • [RES] Enhancing Online Video Recommendation via a Coarse-to-fine Dynamic Uplift Modeling Framework
    by Chang Meng, Chenhao Zhai, Xueliang Wang, Shuchang Liu, Xiaoqiang Feng, Lantao Hu, Xiu Li, Han Li, Kun Gai

    The popularity of short-video applications has brought new opportunities and challenges to video recommendation. In addition to the traditional ranking-based pipeline, industrial solutions usually introduce additional distribution management components to guarantee a diverse and content-rich user experience. However, existing solutions are either non-personalized or fail to generalize to ever-changing user preferences. Inspired by the success of uplift modeling in online marketing, we apply uplift modeling to the video recommendation scenario to mitigate these problems. We face two main challenges when migrating the technique: 1) the complex-response causal relation in the distribution management problem, and 2) the modeling of long-term and real-time user preferences. To address these challenges, we map each treatment to a specific adjustment of the distribution over video types and propose a Coarse-to-fine Dynamic Uplift Modeling (CDUM) framework for real-time video recommendation scenarios. Specifically, CDUM consists of two modules: a coarse-grained module that uses offline user features to model long-term preferences, and a fine-grained module that leverages online real-time contextual features and request-level candidates to model users’ real-time interests. These two modules collaboratively and dynamically identify and target specific user groups and then apply treatments effectively. We conduct comprehensive experiments on two public offline datasets, an industrial offline dataset, and an online A/B test, demonstrating the superiority and effectiveness of CDUM. The proposed method is fully deployed on the Kuaishou platform, serving hundreds of millions of users every day. Our code and datasets are available at https://github.com/UpliftVideo/CDUM.

    Full text in ACM Digital Library
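
    For readers new to uplift modeling, the causal idea CDUM builds on can be illustrated with a plain two-model (T-learner) sketch: fit separate response models for treated and control traffic and score the difference. The coarse/fine split is only echoed in the variable names here; this is an assumption-laden illustration, not the paper's architecture.

      import numpy as np
      from sklearn.ensemble import GradientBoostingRegressor

      def fit_uplift(X_long, X_rt, treated, response):
          # X_long: offline long-term user features; X_rt: real-time context
          # features (hypothetical names). treated is a 0/1 array marking
          # whether the distribution adjustment was applied to that request.
          X = np.hstack([X_long, X_rt])
          m_t = GradientBoostingRegressor().fit(X[treated == 1], response[treated == 1])
          m_c = GradientBoostingRegressor().fit(X[treated == 0], response[treated == 0])
          return m_t, m_c

      def uplift_score(m_t, m_c, X_long, X_rt):
          # Estimated gain in response from applying the treatment to this user.
          X = np.hstack([X_long, X_rt])
          return m_t.predict(X) - m_c.predict(X)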

  • [IND] Improving Visual Recommendation on E-commerce Platforms Using Vision-Language Models
    by Yuki Yada, Sho Akiyama, Ryo Watanabe, Yuta Ueno, Yusuke Shido, Andre Rusli

    On large-scale e-commerce platforms with tens of millions of active monthly users, recommending visually similar products is essential for enabling users to efficiently discover items that align with their preferences. This study presents the application of a vision-language model (VLM), which has demonstrated strong performance in image recognition and image-text retrieval tasks, to product recommendations on Mercari, a major consumer-to-consumer marketplace used by more than 20 million monthly users in Japan. Specifically, we fine-tuned SigLIP, a VLM employing a sigmoid-based contrastive loss, on one million product image-title pairs from Mercari collected over a three-month period, and developed an image encoder for generating the item embeddings used in the recommendation system. Our evaluation comprised an offline analysis of historical interaction logs and an online A/B test in a production environment. In the offline analysis, the model achieved a 9.1% improvement in nDCG@5 over the baseline. In the online A/B test, the click-through rate improved by 50% and the conversion rate by 14% compared with the existing model. These results demonstrate the effectiveness of VLM-based encoders for e-commerce product recommendations and provide practical insights into the development of visual-similarity-based recommendation systems.

    Full text in ACM Digital Library
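
    The sigmoid-based contrastive loss that SigLIP trains (and fine-tunes) with is published (Zhai et al., 2023); a compact PyTorch rendering, with the image and text encoders omitted:

      import torch
      import torch.nn.functional as F

      def siglip_loss(img_emb, txt_emb, t, b):
          # img_emb, txt_emb: L2-normalized [batch, dim] embeddings of the
          # product images and titles; t, b: learnable temperature and bias.
          # Every image-text pair in the batch is an independent binary
          # classification: matched pairs positive, all others negative.
          logits = img_emb @ txt_emb.t() * t.exp() + b               # [batch, batch]
          labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1
          return -F.logsigmoid(labels * logits).sum() / logits.size(0)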

  • [IND] In-context Learning for Addressing User Cold-start in Sequential Movie Recommenders
    by Xurong Liang, Vu Nguyen, Vuong Le, Paul Albert, Julien Monteil

    The user cold-start problem remains a fundamental challenge for sequential recommender systems, particularly in large-scale video streaming services where a substantial portion of users have limited or no historical interaction data. In this work, we address this issue by proposing a framework that leverages Large Language Models (LLMs) to enrich interaction histories using user metadata. Our approach generates a set of imaginary video items relevant to a given user’s demographic, represented through structured item key-value attributes. The generated items are inserted into users’ interaction sequences using early or late fusion strategies. We find that the generated user histories enable better initial user profiling for absolute cold users and enhanced preference modeling for nearly cold users. Experimental results on the public ML-1M dataset and an internal dataset from an Amazon streaming service demonstrate the effectiveness of our LLM-based augmentation method in mitigating cold-start challenges.

    Full text in ACM Digital Library
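
    A minimal sketch of the augmentation loop described above; call_llm is a stand-in for any chat-completion client, and the prompt and key-value attribute schema are illustrative assumptions rather than the paper's exact ones.

      PROMPT = (
          "Suggest {n} movies a {age}-year-old {gender} viewer from {region} "
          "would likely watch. Return one per line as: title | genre | year"
      )

      def generate_imaginary_items(call_llm, user_meta, n=5):
          # Turn user metadata into imaginary items with key-value attributes.
          text = call_llm(PROMPT.format(n=n, **user_meta))
          items = []
          for line in text.strip().splitlines():
              if line.count("|") != 2:      # skip malformed LLM output
                  continue
              title, genre, year = (f.strip() for f in line.split("|"))
              items.append({"title": title, "genre": genre, "year": year})
          return items

      def fuse(history, generated, mode="early"):
          # Early fusion prepends the generated items to the (possibly empty)
          # real history; late fusion appends them after it.
          return generated + history if mode == "early" else history + generated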

  • [RES] Multi-Granularity Distribution Modeling for Video Watch Time Prediction via Exponential-Gaussian Mixture Network
    by Xu Zhao, Ruibo Ma, Jiaqi Chen, Weiqi Zhao, Ping Yang, Yao Hu

    Accurate watch-time prediction is crucial for enhancing user engagement on short-video streaming platforms, but it is challenged by complex distribution characteristics across multiple granularity levels. Through systematic analysis of real-world industrial data, we uncover two critical challenges in watch-time prediction from a distributional perspective: (1) coarse-grained skewness induced by a heavy concentration of quick-skips, and (2) fine-grained diversity arising from varied user-video interaction patterns. Consequently, we assume that watch time follows an Exponential-Gaussian Mixture (EGM) distribution, where the exponential and Gaussian components characterize the skewness and diversity, respectively. Accordingly, we propose an Exponential-Gaussian Mixture Network (EGMN) to parameterize the EGM distribution; it consists of two key modules: a hidden representation encoder and a mixture parameter generator. We conducted extensive offline experiments on public datasets and online A/B tests in the industrial short-video feed scenario of the Xiaohongshu App to validate the superiority of EGMN over existing state-of-the-art methods. Comprehensive experimental results show that EGMN exhibits excellent distribution-fitting ability from coarse to fine granularity. Our code is available on GitHub: https://github.com/BestActionNow/EGMN.

    Full text in ACM Digital Library
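
    The abstract fixes the distributional assumption: an exponential component for quick-skips plus Gaussian components for diverse interaction patterns. Below is a sketch of the corresponding negative log-likelihood that a network like EGMN could be trained to minimize; the shapes and the one-exponential-plus-K-Gaussians form are assumptions based only on the abstract.

      import torch

      def egm_nll(watch_time, pi, lam, mu, sigma, w):
          # watch_time: [batch] observed watch times (>= 0).
          # pi:  [batch]  mixing weight of the exponential (quick-skip) part.
          # lam: [batch]  exponential rate parameter.
          # mu, sigma, w: [batch, K] Gaussian means, stds, and weights
          #               (w sums to 1 per row).
          x = watch_time.unsqueeze(1)                                # [batch, 1]
          exp_pdf = lam * torch.exp(-lam * watch_time)               # [batch]
          gauss = torch.exp(-0.5 * ((x - mu) / sigma) ** 2) / (
              sigma * (2 * torch.pi) ** 0.5)                         # [batch, K]
          density = pi * exp_pdf + (1 - pi) * (w * gauss).sum(dim=1)
          return -torch.log(density + 1e-12).mean()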

  • [IND] Not All Impressions Are Created Equal: Psychology-Informed Retention Optimization for Short-Form Video Recommendation
    by Yuyan Wang, Jing Zhong, Yuxin Cui, Zhaohui Guo, Chuanqi Wei, Yanchen Wang, Zellux Wang

    Recommender systems optimized only for short-term engagement can lead to undesirable outcomes and hurt the long-term consumer experience. In response, researchers and practitioners have proposed incorporating retention signals into recommender systems. Existing retention models are built on item-level interactions in which every impression is weighted equally. However, on short-form video platforms, where content is presented sequentially and consumed passively, users are unlikely to engage equally with every video, and it is hard to establish a meaningful relationship between any single short-video watch and long-term retention behavior. In this work, we propose a psychology-informed retention modeling approach grounded in the peak–end rule, which suggests that people evaluate past experiences largely based on the most intense moment (“peak”) and the final moment (“end”). Specifically, we train a retention model that predicts user return based on the peak and end moments of each session, and incorporate it into a multi-stage recommender system. We implemented our approach on Facebook Reels, one of the world’s largest short-form video recommendation platforms. In a long-term A/B test against the production system, our model delivered significant improvements in Daily Active Users and total sessions, suggesting an improved long-term user experience.

    Full text in ACM Digital Library
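
    A minimal sketch of a peak-end retention head in the spirit described above: each session is reduced to its most intense and its final impression before predicting return probability. The feature layout and the intensity proxy used to pick the peak are assumptions.

      import torch
      import torch.nn as nn

      class PeakEndRetentionModel(nn.Module):
          def __init__(self, feat_dim, hidden=64):
              super().__init__()
              self.mlp = nn.Sequential(
                  nn.Linear(2 * feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

          def forward(self, session, intensity):
              # session:   [batch, seq, feat_dim] per-impression features
              # intensity: [batch, seq] engagement signal used to locate the peak
              idx = intensity.argmax(dim=1)                          # peak position
              peak = session[torch.arange(session.size(0)), idx]     # [batch, feat_dim]
              end = session[:, -1]                                   # final impression
              return torch.sigmoid(self.mlp(torch.cat([peak, end], dim=1)))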

  • [REPR] See the Movie, Hear the Song, Read the Book: Extending MovieLens-1M, Last.fm-2K, and DBbook with Multimodal Data
    by Giuseppe Spillo, Elio Musacchio, Cataldo Musto, Marco de Gemmis, Pasquale Lops, Giovanni Semeraro

    The last few years have seen growing interest from the RecSys community in multimodal recommendation, as shown by the numerous contributions in the literature. Our paper falls in this research line: we release a multimodal extension of three state-of-the-art datasets (MovieLens-1M, DBbook, and Last.fm-2K) in the movie, book, and music recommendation domains, respectively. Although these datasets have been widely adopted for classical recommendation tasks (e.g., collaborative filtering), their use in multimodal recommendation has been hindered by the absence of multimodal information. To fill this gap, we manually collected raw multimodal item files in several modalities (text, images, audio, and video, where available) for each dataset. Specifically, for MovieLens-1M we collected movie plots (text), movie posters (images), and movie trailers (audio and video); for Last.fm-2K we collected, for each artist, user-provided tags (text), the most popular album covers (images), and the most popular songs (audio); finally, for DBbook we collected book abstracts (text) and book covers (images). We encoded all of this information using state-of-the-art feature encoders and released the extended datasets, which include the mappings to the raw multimodal information as well as the encoded features. Finally, we conducted a benchmark analysis of various recommendation models using MMRec as the multimodal recommendation framework. Our results show that multimodal information can further enhance recommendation quality in these domains compared with collaborative filtering alone. We release the multimodal versions of these datasets to foster this research line, including links to download the raw multimodal files and the encoded item features.

    Full text in ACM Digital Library
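
    A sketch of how the released raw files could be encoded into item features with off-the-shelf encoders; the specific model choices below (MiniLM for text, CLIP for images) are illustrative and not necessarily the encoders the authors used.

      import torch
      from PIL import Image
      from sentence_transformers import SentenceTransformer
      from transformers import CLIPModel, CLIPProcessor

      text_encoder = SentenceTransformer("all-MiniLM-L6-v2")
      clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
      proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

      def encode_item(plot_text, poster_path):
          # Encode the textual side (e.g., a movie plot) and the visual side
          # (e.g., a poster) into fixed-size item feature vectors.
          text_emb = text_encoder.encode(plot_text)                  # [384]
          image = Image.open(poster_path).convert("RGB")
          inputs = proc(images=image, return_tensors="pt")
          with torch.no_grad():
              img_emb = clip.get_image_features(**inputs)[0]         # [512]
          return text_emb, img_emb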

  • [RES] VL-CLIP: Enhancing Multimodal Recommendations via Visual Grounding and LLM-Augmented CLIP Embeddings
    by Ramin Giahi, Kehui Yao, Sriram Kollipara, Kai Zhao, Vahid Mirjalili, Jianpeng Xu, Topojoy Biswas, Evren Korpeoglu, Kannan Achan

    Multimodal learning plays a critical role in e-commerce recommendation platforms today, enabling accurate recommendations and product understanding. However, existing vision-language models, such as CLIP, face key challenges in e-commerce recommendation systems: 1) weak object-level alignment, where global image embeddings fail to capture fine-grained product attributes, leading to suboptimal retrieval performance; 2) ambiguous textual representations, where product descriptions often lack contextual clarity, affecting cross-modal matching; and 3) domain mismatch, as generic vision-language models may not generalize well to e-commerce-specific data. To address these limitations, we propose VL-CLIP, a framework that enhances CLIP embeddings by integrating Visual Grounding for fine-grained visual understanding and an LLM-based agent for generating enriched text embeddings. Visual Grounding refines image representations by localizing key products, while the LLM agent disambiguates product descriptions to enhance textual features. Our approach significantly improves retrieval accuracy, multimodal retrieval effectiveness, and recommendation quality across tens of millions of items on one of the largest e-commerce platforms in the U.S., increasing click-through rate (CTR) by 18.6%, add-to-cart rate (ATC) by 15.5%, and gross merchandise value (GMV) by 4.0%. Additional experiments show that our framework outperforms vision-language models such as CLIP, FashionCLIP, and GCL in both precision and semantic alignment, demonstrating the potential of combining object-aware visual grounding with LLM-enhanced text representations for robust multimodal recommendation.

    Full text in ACM Digital Library
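
    A sketch of the two enhancements described above wrapped around a stock CLIP model; ground (returning a product bounding box) and call_llm are stand-ins for the visual-grounding model and the LLM agent, whose exact choices and prompts are detailed in the paper.

      import torch
      from PIL import Image
      from transformers import CLIPModel, CLIPProcessor

      clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
      proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

      def embed_item(image_path, title, ground, call_llm):
          image = Image.open(image_path).convert("RGB")
          box = ground(image, title)        # (left, top, right, bottom)
          crop = image.crop(box)            # object-level view of the product
          enriched = call_llm("Rewrite this product title unambiguously, "
                              "adding category and key attributes: " + title)
          inputs = proc(images=crop, text=[enriched], return_tensors="pt",
                        padding=True, truncation=True)
          with torch.no_grad():
              img_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
              txt_emb = clip.get_text_features(input_ids=inputs["input_ids"],
                                               attention_mask=inputs["attention_mask"])
          return img_emb[0], txt_emb[0]     # used jointly for retrieval downstream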
