Session 11: Optimisation and Evaluation 1

Date: Wednesday, October 16, 17:20 – 18:45 (GMT+2)
Room: Petruzzelli Theater
Session Chair: Maurizio Ferrari Dacrema

  • RES 🕓15 End-to-End Cost-Effective Incentive Recommendation under Budget Constraint with Uplift Modeling
    by Zexu Sun (Renmin University of China), Hao Yang (Renmin University of China), Dugang Liu (Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)), Yunpeng Weng (Tencent), Xing Tang (Tencent) and Xiuqiang He (Tencent)

    In modern online platforms, incentives (e.g., discounts, bonuses) are essential factors that enhance user engagement and increase platform revenue. Over recent years, uplift modeling has been introduced as a strategic approach to assign incentives to individual customers. In many real-world applications, however, online platforms can only incentivize customers under specific budget constraints. This problem can be reformulated as the multi-choice knapsack problem (MCKP), whose objective is to select the optimal incentive for each customer so as to maximize the return on investment (ROI). Recent works in this field frequently tackle the budget allocation problem using a two-stage approach. However, this solution faces the following challenges: (1) Causal inference methods often ignore domain knowledge from online marketing, where the expected response curve of a customer should be monotonic and smooth as the incentive increases. (2) There is an optimality gap between the two stages, resulting in inferior, sub-optimal allocation performance because the uplift prediction is learned without the incentive recommendation information under the limited budget constraint. To address these challenges, we propose a novel End-to-End Cost-Effective Incentive Recommendation (E3IR) model under budget constraints. Specifically, our method consists of two modules, i.e., the uplift prediction module and the differentiable allocation module. In the uplift prediction module, we construct prediction heads that capture the incremental improvement between adjacent treatments under the marketing domain constraints (i.e., monotonicity and smoothness). In the differentiable allocation module, we incorporate integer linear programming (ILP) as a differentiable layer. Furthermore, we conduct extensive experiments on public and real product datasets, demonstrating that our E3IR improves allocation performance compared to existing two-stage approaches.

    Full text in ACM Digital Library
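    The abstract above describes two technical pieces: prediction heads that enforce a monotonic response curve across adjacent incentive levels, and a budget-constrained allocation step. The sketch below is a rough illustration of the first piece plus a greedy stand-in for the second; it is not the authors' code, and the paper uses a differentiable ILP layer rather than the greedy heuristic shown here. All names (UpliftHeads, greedy_allocate) are illustrative.

    ```python
    # Illustrative sketch only: monotone uplift heads + a greedy budget allocation.
    import heapq
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class UpliftHeads(nn.Module):
        """Predicts a base response and non-negative increments between adjacent
        incentive levels, so the response curve is monotone in the incentive."""
        def __init__(self, in_dim: int, num_levels: int, hidden: int = 64):
            super().__init__()
            self.backbone = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
            self.base = nn.Linear(hidden, 1)              # response at zero incentive
            self.delta = nn.Linear(hidden, num_levels)    # raw per-level increments

        def forward(self, x):
            h = self.backbone(x)
            base = self.base(h)
            inc = F.softplus(self.delta(h))               # increments >= 0 -> monotone curve
            curve = torch.cat([base, base + torch.cumsum(inc, dim=-1)], dim=-1)
            return curve                                  # [batch, num_levels + 1]

    def greedy_allocate(curves, costs, budget):
        """Greedy stand-in for the MCKP allocation: repeatedly buy the incentive
        upgrade with the best incremental-response / incremental-cost ratio.
        curves: NumPy array [n, num_options]; costs[k]: cumulative cost of level k."""
        n, num_options = curves.shape
        level, spent, heap = [0] * n, 0.0, []
        for i in range(n):
            ratio = (curves[i, 1] - curves[i, 0]) / (costs[1] - costs[0])
            heapq.heappush(heap, (-ratio, i))
        while heap and spent < budget:
            _, i = heapq.heappop(heap)
            nxt = level[i] + 1
            extra = costs[nxt] - costs[level[i]]
            if spent + extra > budget:
                continue
            level[i], spent = nxt, spent + extra
            if nxt + 1 < num_options:
                ratio = (curves[i, nxt + 1] - curves[i, nxt]) / (costs[nxt + 1] - costs[nxt])
                heapq.heappush(heap, (-ratio, i))
        return level                                      # chosen incentive level per customer
    ```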

  • REPR 🕓10 Reproducibility and Analysis of Scientific Dataset Recommendation Methods
    by Ornella Irrera (University of Padua), Matteo Lissandrini (University of Verona), Daniele Dell’Aglio (Aalborg University) and Gianmaria Silvello (University of Padua)

    Datasets play a central role in scholarly communication. However, scholarly graphs are often incomplete, particularly due to the lack of connections between publications and datasets. Therefore, the importance of dataset recommendation—identifying relevant datasets for a scientific paper, an author, or a textual query—is increasing. Although various methods have been proposed for this task, their reproducibility remains unexplored, making it difficult to compare them with new approaches. We reviewed current recommendation methods for scientific datasets, focusing on the most recent and competitive approaches, including an SVM-based model, a bi-encoder retriever, a method leveraging co-author and citation network embeddings, and a heterogeneous variational graph autoencoder. These approaches underwent a comprehensive analysis under consistent experimental conditions. Our reproducibility efforts show that three of the methods can be reproduced, while the heterogeneous variational graph autoencoder is challenging to reproduce due to unavailable code and test datasets. Hence, we re-implemented this method and performed a component-based analysis to examine its strengths and limitations. Furthermore, our study indicates that three out of the four considered methods produce subpar results when applied to real-world data rather than to specialized datasets with ad-hoc features.

    Full text in ACM Digital Library

  • REPR 🕓10 From Clicks to Carbon: The Environmental Toll of Recommender Systems
    by Tobias Vente (University of Siegen), Lukas Wegmeth (University of Siegen), Alan Said (University of Gothenburg) and Joeran Beel (University of Siegen)

    As global warming soars, the need to assess the environmental impact of research is becoming increasingly urgent. Despite this, few recommender systems research papers address their environmental impact. In this study, we estimate the environmental impact of recommender systems research by reproducing typical experimental pipelines. Our analysis spans 79 full papers from the 2013 and 2023 ACM RecSys conferences, comparing traditional “good old-fashioned AI” algorithms with modern deep learning algorithms. We designed and reproduced representative experimental pipelines for both years, measuring energy consumption with a hardware energy meter and converting it to CO2 equivalents. Our results show that papers using deep learning algorithms emit approximately 42 times more CO2 equivalents than papers using traditional methods. On average, a single deep learning-based paper generates 3,297 kilograms of CO2 equivalents—more than the carbon emissions of one person flying from New York City to Melbourne or the amount of CO2 one tree sequesters over 300 years.

    Full text in ACM Digital Library
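    The conversion behind figures like these is a single multiplication: measured energy in kWh times a grid emission factor in kg CO2e per kWh. A minimal sketch follows; the emission factor and the example wattage are placeholders, not values taken from the paper.

    ```python
    # Illustrative only: converting measured energy to CO2 equivalents.
    def kwh_to_co2e(kwh: float, kg_co2e_per_kwh: float = 0.4) -> float:
        """kg CO2e for a given energy consumption; the default factor is a
        placeholder, since grid emission factors vary by country and energy mix."""
        return kwh * kg_co2e_per_kwh

    # Example: a pipeline drawing 350 W for 48 hours on a hardware energy meter
    energy_kwh = 0.350 * 48                          # power (kW) x time (h) = 16.8 kWh
    print(f"{kwh_to_co2e(energy_kwh):.1f} kg CO2e")  # ~6.7 kg with the placeholder factor
    ```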

  • IND 🕓10 Why the Shooting in the Dark Method Dominates Recommender Systems Practice
    by David Rohde (Criteo)

    The introduction of A/B testing represented a great leap forward in recommender systems research. Like the randomized controlled trial for evaluating drug efficacy, A/B testing has equipped recommender systems practitioners with a protocol for measuring performance as defined by actual business metrics, with minimal assumptions. While A/B testing provides a way to measure the performance of two or more candidate systems, it offers no guide for determining which policy we should test. The focus of this industry talk is to better understand why the development of A/B testing was the last great leap forward in the development of reward-optimizing recommender systems, despite more than a decade of effort in both industry and academia. The talk will survey industry best practice and standard theories and tools, including collaborative filtering (MovieLens RecSys), contextual bandits, attribution, off-policy estimation, causal inference, and click-through rate models, and will explain why we have converged on a fundamentally heuristic, guess-and-check method. The talk will offer opinions about which of these theories are useful and which are not, and will make a concrete proposal for progress based on a non-standard use of deep learning tools.

    Full text in ACM Digital Library

  • IND 🕓10 Powerful A/B-Testing Metrics and Where to Find Them
    by Olivier Jeunen (ShareChat), Shubham Baweja (ShareChat), Neeti Pokharna (ShareChat) and Aleksei Ustimenko (ShareChat)

    Online controlled experiments, colloquially known as A/B-tests, are the bread and butter of real-world recommender system evaluation. Typically, end-users are randomly assigned some system variant, and a plethora of metrics are then tracked, collected, and aggregated throughout the experiment. A North Star metric (e.g. long-term growth or revenue) is used to assess which system variant should be deemed superior. As a result, most collected metrics are supporting in nature, and serve to either (i) provide an understanding of how the experiment impacts user experience, or (ii) allow for confident decision-making when the North Star metric moves insignificantly (i.e. a false negative or type-II error). The latter is not straightforward: suppose a treatment variant leads to fewer but longer sessions, with more views but fewer engagements; should this be considered a positive or negative outcome?

    The question then becomes: how do we assess a supporting metric’s utility when it comes to decision-making using A/B-testing? Online platforms typically run dozens of experiments at any given time. This provides a wealth of information about interventions and treatment effects that can be used to evaluate metrics’ utility for online evaluation. We propose to collect this information and leverage it to quantify type-I, type-II, and type-III errors for the metrics of interest, alongside a distribution of measurements of their statistical power (e.g. z-scores and p-values). We present results and insights from building this pipeline at scale for two large-scale short-video platforms, ShareChat and Moj, leveraging hundreds of past experiments to find online metrics with high statistical power.

    Full text in ACM Digital Library
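    As a rough illustration of the kind of pipeline described above, the sketch below scores a single supporting metric against a corpus of past experiments with a known treatment-effect direction (e.g. taken from the North Star metric): it computes a z-score and p-value per experiment and tallies empirical type-I, type-II, and type-III error rates. This is not ShareChat's implementation; the data layout, field names, and thresholds are assumptions.

    ```python
    # Illustrative sketch only: empirical error rates of one A/B-testing metric.
    from dataclasses import dataclass
    from math import sqrt, erf

    @dataclass
    class Experiment:
        mean_a: float   # control metric mean
        var_a: float
        n_a: int
        mean_b: float   # treatment metric mean
        var_b: float
        n_b: int
        truth: int      # ground-truth effect direction: -1, 0, or +1

    def z_score(e: Experiment) -> float:
        """Two-sample z-statistic for the difference in metric means."""
        se = sqrt(e.var_a / e.n_a + e.var_b / e.n_b)
        return (e.mean_b - e.mean_a) / se

    def p_value(z: float) -> float:
        """Two-sided p-value under a normal approximation."""
        return 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))

    def error_rates(experiments, alpha: float = 0.05):
        """Empirical type-I, type-II, and type-III rates for one metric."""
        t1 = t2 = t3 = n_null = n_effect = 0
        for e in experiments:
            z = z_score(e)
            significant = p_value(z) < alpha
            if e.truth == 0:
                n_null += 1
                t1 += significant                  # false positive
            else:
                n_effect += 1
                if not significant:
                    t2 += 1                        # false negative
                elif (z > 0) != (e.truth > 0):
                    t3 += 1                        # significant, but wrong sign
        return (t1 / max(n_null, 1), t2 / max(n_effect, 1), t3 / max(n_effect, 1))
    ```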

  • IND 🕓10 Bootstrapping Conditional Retrieval for User-to-Item Recommendations
    by Hongtao Lin (Pinterest), Haoyu Chen (Pinterest), Jaewon Yang (Pinterest) and Jiajing Xu (Pinterest)

    User-to-item retrieval has been an active research area in recommendation systems, and two-tower models are widely adopted due to their simplicity and serving efficiency. In this work, we focus on a variant called conditional retrieval, where we expect retrieved items to be relevant to a condition (e.g. a topic). We propose a method that uses the same training data as standard two-tower models but incorporates item-side information as conditions in the query. This allows us to bootstrap new conditional retrieval use cases and encourages feature interactions between the user and the condition. Experiments show that our method retrieves highly relevant items and outperforms standard two-tower models with filters on engagement metrics. The proposed model is deployed to power a topic-based notification feed at Pinterest and led to a +0.26% increase in weekly active users.

    Full text in ACM Digital Library
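    A minimal sketch of the general idea, assuming the condition is a topic id that can be read off the engaged item's own metadata at training time, so standard two-tower training data suffices. This is not Pinterest's implementation; module names, dimensions, and the in-batch softmax loss are assumptions.

    ```python
    # Illustrative sketch only: a two-tower retriever whose query tower also
    # consumes a condition (topic) alongside the user features.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ConditionalTwoTower(nn.Module):
        def __init__(self, user_dim: int, item_dim: int, num_topics: int, dim: int = 64):
            super().__init__()
            self.topic_emb = nn.Embedding(num_topics, dim)
            self.query_tower = nn.Sequential(
                nn.Linear(user_dim + dim, 128), nn.ReLU(), nn.Linear(128, dim))
            self.item_tower = nn.Sequential(
                nn.Linear(item_dim, 128), nn.ReLU(), nn.Linear(128, dim))

        def forward(self, user_feats, topic_ids, item_feats):
            # the condition enters the query tower, allowing user x condition interactions
            q = self.query_tower(torch.cat([user_feats, self.topic_emb(topic_ids)], dim=-1))
            v = self.item_tower(item_feats)
            return F.normalize(q, dim=-1), F.normalize(v, dim=-1)

    def in_batch_softmax_loss(q, v, temperature: float = 0.05):
        """In-batch sampled softmax: the i-th item is the positive for the i-th
        query; the other items in the batch serve as negatives."""
        logits = q @ v.t() / temperature
        labels = torch.arange(q.size(0), device=q.device)
        return F.cross_entropy(logits, labels)
    ```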

  • IND 🕓10 Self-Auxiliary Distillation for Sample Efficient Learning in Google-Scale Recommenders
    by Yin Zhang (Google DeepMind), Ruoxi Wang (Google DeepMind), Xiang Li (Google, Inc), Tiansheng Yao (Google, Inc), Andrew Evdokimov (Google, Inc), Jonathan Valverde (Google DeepMind), Yuan Gao (Google, Inc), Jerry Zhang (Google, Inc), Evan Ettinger (Google, Inc), Ed H. Chi (Google DeepMind) and Derek Zhiyuan Cheng (Google DeepMind)

    Industrial recommendation systems process billions of user feedback signals daily, and these signals are complex and noisy. Efficiently uncovering user preferences from them is crucial for high-quality recommendation. We argue that these signals are not inherently equal in their informative value and learnability, which is particularly salient in industrial applications with multi-stage processes (e.g., augmentation, retrieval, ranking). With this in mind, we propose a novel self-auxiliary distillation framework that prioritizes training on high-quality labels and improves the resolution of low-quality labels through distillation, by adding a bilateral branch-based auxiliary task. This approach enables flexible learning from diverse labels without additional computational cost, making it highly scalable and effective for Google-scale recommenders. Our framework consistently improved both offline and online key business metrics across three major Google products. Notably, self-auxiliary distillation proved highly effective in addressing the severe signal-loss challenge posed by changes such as the Apple iOS policy, and delivered significant improvements in both offline (+17% AUC) and online metrics for a Google Apps recommendation system. This highlights the opportunity to address real-world signal-loss problems with self-auxiliary distillation techniques.

    Full text in ACM Digital Library
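    A rough sketch of the general idea, under the assumption of a binary engagement label and a per-example high-quality flag: an auxiliary head fits only the trusted labels, and the main head learns from a soft target that blends noisy labels with the auxiliary head's stopped-gradient predictions. This is not Google's implementation; the head structure, weighting scheme, and names are assumptions.

    ```python
    # Illustrative sketch only: a main head plus an auxiliary head used for distillation.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SelfAuxiliaryModel(nn.Module):
        def __init__(self, in_dim: int, hidden: int = 128):
            super().__init__()
            self.backbone = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
            self.main_head = nn.Linear(hidden, 1)
            self.aux_head = nn.Linear(hidden, 1)

        def forward(self, x):
            h = self.backbone(x)
            return self.main_head(h).squeeze(-1), self.aux_head(h).squeeze(-1)

    def self_auxiliary_loss(main_logit, aux_logit, label, is_high_quality, distill_w=0.5):
        """is_high_quality: float tensor in {0, 1} marking trustworthy labels."""
        # auxiliary branch: fit high-quality labels only
        aux_loss = F.binary_cross_entropy_with_logits(aux_logit, label, weight=is_high_quality)
        # distilled soft target for low-quality examples
        soft = (1 - distill_w) * label + distill_w * torch.sigmoid(aux_logit).detach()
        target = is_high_quality * label + (1 - is_high_quality) * soft
        main_loss = F.binary_cross_entropy_with_logits(main_logit, target)
        return main_loss + aux_loss
    ```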

  • IND 🕓10 Optimizing for Participation in Recommender System
    by Yuan Shao (Google), Bibang Liu (Google), Sourabh Bansod (Google), Arnab Bhadury (Google), Mingyan Gao (Google) and Yaping Zhang (Google)

