RecSys 2024 — Session 16: Large Language Models 2 - RecSys

Session 16: Large Language Models 2

Date: Thursday October 17, 14:30 PM – 16:20 PM (GMT+2)
Room: Petruzzelli Theater
Session Chair: Yashar Deldjoo

RES 🕓15Unleashing the Retrieval Potential of Large Language Models in Conversational Recommender Systems
by Ting Yang (Hong Kong Baptist University) and Li Chen (Hong Kong Baptist University)

Conversational recommender systems (CRSs) aim to capture user preferences and provide personalized recommendations through interactive natural language interaction. The recent advent of large language models (LLMs) has revolutionized human engagement in natural conversation, driven by their extensive world knowledge and remarkable natural language understanding and generation capabilities. However, introducing LLMs into CRSs presents new technical challenges. Directly prompting LLMs for recommendation generation requires understanding a large and evolving item corpus, as well as grounding the generated recommendations in the real item space. On the other hand, generating recommendations based on external recommendation engines or directly integrating their suggestions into responses may constrain the overall performance of LLMs, since these engines generally have inferior representation abilities compared to LLMs. To address these challenges, we propose an end-to-end large-scale CRS model, named as ReFICR, a novel LLM-enhanced conversational recommender that empowers a retrievable large language model to perform conversational recommendation by following retrieval and generation instructions through lightweight tuning. By decomposing the complex CRS task into multiple subtasks, we formulate these subtasks into two types of instruction formats: retrieval and generation. The hidden states of ReFICR are utilized for generating text embeddings for retrieval, and simultaneously ReFICR is fine-tuned to handle generation subtasks. We optimize the contrastive objective to enhance text embeddings for retrieval and jointly fine-tune the large language model objective for generation. Our experimental results on public datasets demonstrate that ReFICR significantly outperforms baselines in terms of recommendation accuracy and response quality. Our code is publicly available at the link: https://github.com/yt556677/ReFICR.

Full text in ACM Digital Library
RES 🕓15The Elephant in the Room: Rethinking the Usage of Pre-trained Language Model in Sequential Recommendation
by Zekai Qu (China University of Geosciences Beijing), Ruobing Xie (Tencent Inc.), Chaojun Xiao (Tsinghua University), Zhanhui Kang (Tencent Inc.) and Xingwu Sun (Tencent Inc.)

Sequential recommendation (SR) has seen significant advancements with the help of Pre-trained Language Models (PLMs). Some PLM-based SR models directly use PLM to encode user historical behavior’s text sequences to learn user representations, while there is seldom an in-depth exploration of the capability and suitability of PLM in behavior sequence modeling. In this work, we first conduct extensive model analyses between PLMs and PLM-based SR models, discovering great underutilization and parameter redundancy of PLMs in behavior sequence modeling. Inspired by this, we explore different lightweight usages of PLMs in SR, aiming to maximally stimulate the ability of PLMs for SR while satisfying the efficiency and usability demands of practical systems. We discover that adopting behavior-tuned PLMs for item initializations of conventional ID-based SR models is the most economical framework of PLM-based SR, which would not bring in any additional inference cost but could achieve a dramatic performance boost compared with the original version. Extensive experiments on five datasets show that our simple and universal framework leads to significant improvement compared to classical SR and SOTA PLM-based SR models without additional inference costs. Our code can be found in https://github.com/777pomingzi/Rethinking-PLM-in-RS.

Full text in ACM Digital Library
RES 🕓15ReLand: Integrating Large Language Models’ Insights into Industrial Recommenders via a Controllable Reasoning Pool
by Changxin Tian (Ant Group), Binbin Hu (Ant Group), Chunjing Gan (Ant Group), Haoyu Chen (Ant Group), Zhuo Zhang (Ant Group), Li Yu (Ant Group), Ziqi Liu (Ant Group), Zhiqiang Zhang (Ant Group), Jun Zhou (Ant Group) and Jiawei Chen (Zhejiang University)

Recently, Large Language Models (LLMs) have shown significant potential in addressing the isolation issues faced by recommender systems. However, despite performance comparable to traditional recommenders, the current methods are cost-prohibitive for industrial applications. Consequently, existing LLM-based methods still need to catch up regarding effectiveness and efficiency. To tackle the above challenges, we present an LLM-enhanced recommendation framework named ReLand, which leverages Retrieval to effortlessly integrate Large language models’ insights into industrial recommenders. Specifically, ReLand employs LLMs to perform generative recommendations on sampled users (a.k.a., seed users), thereby constructing an LLM Reasoning Pool. Subsequently, we leverage retrieval to attach reliable recommendation rationales for the entire user base, ultimately effectively improving recommendation performance. Extensive offline and online experiments validate the effectiveness of ReLand. Since January 2024, ReLand has been deployed in the recommender system of Alipay, achieving statistically significant improvements of 3.19% in CTR and 1.08% in CVR.

Full text in ACM Digital Library
RES 🕓15Bayesian Optimization with LLM-Based Acquisition Functions for Natural Language Preference Elicitation
by David Austin (University of Waterloo), Anton Korikov (University of Toronto), Armin Toroghi (University of Toronto) and Scott Sanner (University of Toronto)

Designing preference elicitation (PE) methodologies that can quickly ascertain a user’s top item preferences in a cold-start setting is a key challenge for building effective and personalized conversational recommendation (ConvRec) systems. While large language models (LLMs) enable fully natural language (NL) PE dialogues, we hypothesize that monolithic LLM NL-PE approaches lack the multi-turn, decision-theoretic reasoning required to effectively balance the exploration and exploitation of user preferences towards an arbitrary item set. In contrast, traditional Bayesian optimization PE methods define theoretically optimal PE strategies, but cannot generate arbitrary NL queries or reason over content in NL item descriptions – requiring users to express preferences via ratings or comparisons of unfamiliar items. To overcome the limitations of both approaches, we formulate NL-PE in a Bayesian Optimization (BO) framework that seeks to actively elicit NL feedback to identify the best recommendation. Key challenges in generalizing BO to deal with natural language feedback include determining: (a) how to leverage LLMs to model the likelihood of NL preference feedback as a function of item utilities, and (b) how to design an acquisition function for NL BO that can elicit preferences in the infinite space of language. We demonstrate our framework in a novel NL-PE algorithm, PEBOL, which uses: 1) Natural Language Inference (NLI) between user preference utterances and NL item descriptions to maintain Bayesian preference beliefs, and 2) BO strategies such as Thompson Sampling (TS) and Upper Confidence Bound (UCB) to guide LLM query generation. We numerically evaluate our methods in controlled simulations, finding that after 10 turns of dialogue, PEBOL can achieve an MRR@10 of up to 0.27 compared to the best monolithic LLM baseline’s MRR@10 of 0.17, despite relying on earlier and smaller LLMs.

Full text in ACM Digital Library
RES 🕓15Towards Empathetic Conversational Recommender Systems
by Xiaoyu Zhang (Shandong University), Ruobing Xie (Tencent), Yougang Lyu (Shandong University; University of Amsterdam), Xin Xin (Shandong University), Pengjie Ren (Shandong University), Mingfei Liang (Tencent), Bo Zhang (Tencent), Zhanhui Kang (Tencent), Maarten de Rijke (University of Amsterdam) and Zhaochun Ren (Leiden University)

Conversational recommender systems (CRSs) are able to elicit user preferences through multi-turn dialogues. They typically incorporate external knowledge and pre-trained language models to capture the dialogue context. Most CRS approaches, trained on benchmark datasets, assume that the standard items and responses in these benchmarks are optimal. However, they overlook that users may express negative emotions with the standard items and may not feel emotionally engaged by the standard responses. This issue leads to a tendency to replicate the logic of recommenders in the dataset instead of aligning with user needs. To remedy this misalignment, we introduce empathy within a CRS. With empathy we refer to a system’s ability to capture and express emotions. We propose an empathetic conversational recommender (ECR) framework.

ECR contains two main modules: emotion-aware item recommendation and emotion-aligned response generation. Specifically, we employ user emotions to refine user preference modeling for accurate recommendations. To generate human-like emotional responses, ECR applies retrieval-augmented prompts to fine-tune a pre-trained language model aligning with emotions and mitigating hallucination. To address the challenge of insufficient supervision labels, we enlarge our empathetic data using emotion labels annotated by large language models and emotional reviews collected from external resources. We propose novel evaluation metrics to capture user satisfaction in real-world CRS scenarios. Our experiments on the ReDial dataset validate the efficacy of our framework in enhancing recommendation accuracy and improving user satisfaction.

Full text in ACM Digital Library
RES 🕓15FLIP: Fine-grained Alignment between ID-based Models and Pretrained Language Models for CTR Prediction
by Hangyu Wang (Shanghai Jiao Tong University), Jianghao Lin (Shanghai Jiao Tong University), Xiangyang Li (Huawei Noah’s Ark Lab), Bo Chen (Huawei Noah’s Ark Lab), Chenxu Zhu (Huawei Noah’s Ark Lab), Ruiming Tang (Huawei Noah’s Ark Lab), Weinan Zhang (Shanghai Jiao Tong University) and Yong Yu (Shanghai Jiao Tong University)

Click-through rate (CTR) prediction plays as a core function module in various personalized online services. The traditional ID-based models for CTR prediction take as inputs the one-hot encoded ID features of tabular modality, which capture the collaborative signals via feature interaction modeling. But the one-hot encoding discards the semantic information included in the textual features. Recently, the emergence of Pretrained Language Models (PLMs) has given rise to another paradigm, which takes as inputs the sentences of textual modality obtained by hard prompt templates and adopts PLMs to extract the semantic knowledge. However, PLMs often face challenges in capturing field-wise collaborative signals and distinguishing features with subtle textual differences. In this paper, to leverage the benefits of both paradigms and meanwhile overcome their limitations, we propose to conduct Fine-grained feature-level ALignment between ID-based Models and Pretrained Language Models (FLIP) for CTR prediction. Unlike most methods that solely rely on global views through instance-level contrastive learning, we design a novel jointly masked tabular/language modeling task to learn fine-grained alignment between tabular IDs and word tokens. Specifically, the masked data of one modality (i.e., IDs and tokens) has to be recovered with the help of the other modality, which establishes the feature-level interaction and alignment via sufficient mutual information extraction between dual modalities. Moreover, we propose to jointly finetune the ID-based model and PLM by adaptively combining the output of both models, thus achieving superior performance in downstream CTR prediction tasks. Extensive experiments on three real-world datasets demonstrate that FLIP outperforms SOTA baselines, and is highly compatible with various ID-based models and PLMs. The code is available.

Full text in ACM Digital Library
REPR 🕓10A Comparative Analysis of Text-Based Explainable Recommender Systems
by Alejandro Ariza-Casabona (University of Barcelona), Ludovico Boratto (University of Cagliari) and Maria Salamó (University of Barcelona)

One way to increase trust among users towards recommender systems is to provide the recommendation along with a textual explanation. In the literature, extraction-based, generation-based, and, more recently, hybrid solutions based on retrieval-augmented generation have been proposed to tackle the problem of text-based explainable recommendation. However, the use of different datasets, preprocessing steps, target explanations, baselines, and evaluation metrics complicates the reproducibility and state-of-the-art assessment of previous work among different model categories for successful advancements in the field. Our aim is to provide a comprehensive analysis of text-based explainable recommender systems by setting up a well-defined benchmark that accommodates generation-based, extraction-based, and hybrid approaches. Also, we enrich the existing evaluation of explainability and text quality of the explanations with a novel definition of feature hallucination. Our experiments on three real-world datasets unveil hidden behaviors and confirm several claims about model patterns. Our source code and preprocessed datasets are available at https://github.com/alarca94/text-exp-recsys24.

Full text in ACM Digital Library
REPR 🕓10Reproducibility of LLM-based Recommender Systems: the Case Study of P5 Paradigm
by Pasquale Lops (University of Bari Aldo Moro), Antonio Silletti (University of Bari Aldo Moro), Marco Polignano (University of Bari Aldo Moro), Cataldo Musto (University of Bari Aldo Moro) and Giovanni Semeraro (University of Bari Aldo Moro)

Recommender systems can significantly benefit from the availability of pre-trained large language models (LLMs), which can serve as a basic mechanism for generating recommendations based on detailed user and item data, such as text descriptions, user reviews, and metadata. On the one hand, this new generation of LLM-based recommender systems paves the way for dealing with traditional limitations, such as cold-start and data sparsity. Still, on the other hand, this poses fundamental challenges for their accountability. Reproducing experiments in the new context of LLM-based recommender systems is challenging for several reasons. New approaches are published at an unprecedented pace, which makes difficult to have a clear picture of the main protocols and good practices in the experimental evaluation. Moreover, the lack of proper frameworks for LLM-based recommendation development and evaluation makes the process of benchmarking models complex and uncertain.

In this work, we discuss the main issues encountered when trying to reproduce P5 (Pretrain, Personalized Prompt, and Prediction Paradigm), one of the first works unifying different recommendation tasks in a shared language modeling and natural language generation framework. Starting from this study, we have developed LaikaLLM, a framework for training and evaluating LLMs, specifically for the recommendation task. It has been used to perform several experiments to assess the impact of using different LLMs, different personalization strategies, and a novel set of more informative prompts on the overall performance of recommendations in a fully reproducible environment.

Full text in ACM Digital Library

Back to program

Session 16: Large Language Models 2

RecSys 2024 (Bari)

Sapphire Supporter

Diamond Supporter

Platinum Supporter

Gold Supporter

Silver Supporter

Bronze Supporter

Women in RecSys’s Event Supporter

Challenge Sponsor

Special Supporters

About this site

RecSys 2026