Session 15: Off-policy Learning

Date: Thursday October 17, 12:30 – 13:15 (GMT+2)
Room: Petruzzelli Theater
Session Chair: Alan Said

  • RES (15 min) Multi-Objective Recommendation via Multivariate Policy Learning
    by Olivier Jeunen (ShareChat), Jatin Mandav (ShareChat), Ivan Potapov (ShareChat), Nakul Agarwal (ShareChat), Sourabh Vaid (ShareChat), Wenzhe Shi (ShareChat) and Aleksei Ustimenko (ShareChat)

    Real-world recommender systems often need to balance multiple objectives when deciding which recommendations to present to users. These include behavioural signals (e.g. clicks, shares, dwell time), as well as broader objectives (e.g. diversity, fairness). Scalarisation methods are commonly used to handle this balancing task, where a weighted average of per-objective reward signals determines the final score used for ranking. Naturally, exactly how these weights are computed is key to success for any online platform.

    We frame this as a decision-making task, where the scalarisation weights are actions taken to maximise an overall North Star reward (e.g. long-term user retention or growth). We extend existing policy learning methods to the continuous multivariate action domain, proposing to maximise a pessimistic lower bound on the North Star reward that the learnt policy will yield. Typical lower bounds based on normal approximations suffer from insufficient coverage, and we propose an efficient and effective policy-dependent correction for this. We provide guidance on designing stochastic data collection policies, as well as highly sensitive reward signals. Empirical observations from simulations, offline and online experiments highlight the efficacy of our deployed approach. A minimal sketch of the scalarisation and lower-bound machinery follows this entry.

    Full text in ACM Digital Library
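
    To make the setup concrete, here is a minimal, hypothetical sketch of scalarisation and of a pessimistic off-policy objective. It assumes a diagonal-Gaussian logging policy over continuous scalarisation-weight vectors and uses a plain normal-approximation lower bound on an inverse-propensity-scored (IPS) estimate of the North Star reward, i.e. exactly the kind of bound the abstract notes can under-cover; the paper's policy-dependent correction is not reproduced here. All names and data are illustrative, not the authors' implementation.

      import numpy as np

      def scalarise(objective_rewards, weights):
          # Weighted average of per-objective reward signals -> one score per item.
          return objective_rewards @ weights / weights.sum()

      def gaussian_logpdf(a, mean, std):
          # Log-density of a diagonal Gaussian over a continuous multivariate action.
          return -0.5 * (((a - mean) / std) ** 2 + np.log(2 * np.pi * std ** 2)).sum(axis=-1)

      def pessimistic_ips_lower_bound(actions, north_star, log_mu, pi_mean, pi_std, z=1.96):
          # IPS estimate of the North Star reward under pi, penalised by a
          # normal-approximation standard error: a naive pessimistic lower bound.
          log_pi = gaussian_logpdf(actions, pi_mean, pi_std)
          w = np.exp(log_pi - log_mu)                      # importance weights pi/mu
          est = (w * north_star).mean()
          se = (w * north_star).std(ddof=1) / np.sqrt(len(w))
          return est - z * se

      rng = np.random.default_rng(0)
      n, k = 10_000, 3                                     # logged rounds, objectives
      mu_mean, mu_std = np.ones(k), 0.3                    # stochastic logging policy
      actions = rng.normal(mu_mean, mu_std, size=(n, k))   # logged weight vectors
      log_mu = gaussian_logpdf(actions, mu_mean, mu_std)
      north_star = rng.binomial(1, 0.5, size=n)            # e.g. a retention signal

      item_objectives = rng.random((5, k))                 # click/share/dwell per item
      print(scalarise(item_objectives, actions[0]))        # ranking scores for 5 items
      print(pessimistic_ips_lower_bound(actions, north_star, log_mu,
                                        pi_mean=np.ones(k) * 1.1,
                                        pi_std=np.full(k, 0.3)))

    Maximising this lower bound over (pi_mean, pi_std), e.g. by gradient ascent, would give a pessimistic policy-learning objective in the spirit of the abstract.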

  • RES (15 min) Optimal Baseline Corrections for Off-Policy Contextual Bandits
    by Shashank Gupta (University of Amsterdam), Olivier Jeunen (ShareChat), Harrie Oosterhuis (Radboud University) and Maarten de Rijke (University of Amsterdam)

    The off-policy learning paradigm allows recommender systems and general ranking applications to be framed as decision-making problems, where we aim to learn decision policies that optimize an unbiased offline estimate of an online reward metric. With unbiasedness comes potentially high variance, and prevalent methods exist to reduce estimation variance. These methods typically make use of control variates, either additive (i.e., baseline corrections or doubly robust methods) or multiplicative (i.e., self-normalisation).

    Our work unifies these approaches by proposing a single framework built on their equivalence in learning scenarios. The foundation of our framework is the derivation of an equivalent baseline correction for all of the existing control variates. Consequently, our framework enables us to characterize the variance-optimal unbiased estimator and provide a closed-form solution for it. This optimal estimator brings significantly improved performance in both evaluation and learning, and minimizes data requirements. Empirical observations corroborate our theoretical findings. A sketch of these estimators follows this entry.

    Full text in ACM Digital Library
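
    Below is a minimal sketch of the control variates discussed above, with illustrative names and synthetic data: vanilla IPS, an additive baseline correction (unbiased for any baseline because the importance weights have unit mean), self-normalised IPS, and a generic variance-minimising baseline. The closed form follows from minimising Var(w·r − b·w) over b; it is a textbook derivation in the spirit of the paper's result, not necessarily its exact estimator.

      import numpy as np

      def ips(w, r):
          # Vanilla inverse-propensity-scored estimate of policy value.
          return np.mean(w * r)

      def baseline_ips(w, r, b):
          # Additive control variate: E[w] = 1 keeps this unbiased for any b.
          return np.mean(w * (r - b)) + b

      def snips(w, r):
          # Multiplicative control variate: self-normalised IPS.
          return np.sum(w * r) / np.sum(w)

      def optimal_baseline(w, r):
          # Var(w*(r - b) + b) = Var(w*r) - 2b*Cov(w*r, w) + b^2*Var(w);
          # setting the derivative in b to zero gives b* = Cov(w*r, w) / Var(w).
          return np.cov(w * r, w)[0, 1] / np.var(w, ddof=1)

      rng = np.random.default_rng(1)
      w = rng.lognormal(mean=-0.5, sigma=1.0, size=50_000)  # heavy-tailed weights, E[w] = 1
      r = rng.binomial(1, 0.3, size=w.size)                 # binary rewards

      b = optimal_baseline(w, r)
      print(ips(w, r), snips(w, r), baseline_ips(w, r, b))

    On heavy-tailed weights like these, the baseline-corrected and self-normalised estimates typically show visibly lower spread across random seeds than vanilla IPS, while all three remain centred on the same value.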

  • RES (15 min) Effective Off-Policy Evaluation and Learning in Contextual Combinatorial Bandits
    by Tatsuhiro Shimizu (Independent Researcher), Koichi Tanaka (Keio University), Ren Kishimoto (Tokyo Institute of Technology), Haruka Kiyohara (Cornell University), Masahiro Nomura (CyberAgent, Inc.) and Yuta Saito (Cornell University)

    We explore off-policy evaluation and learning (OPE/L) in contextual combinatorial bandits (CCB), where a policy selects a subset of the action space. For example, it might choose a set of furniture pieces (a bed and a drawer) from the available items (bed, drawer, chair, etc.) for interior design sales. This setting is widespread in fields such as recommender systems and healthcare, yet OPE/L for CCB remains unexplored in the relevant literature. Typical OPE/L methods such as regression and importance sampling can be applied to the CCB problem; however, they face significant challenges due to high bias or variance, exacerbated by the exponential growth in the number of available subsets.

    To address these challenges, we introduce the concept of a factored action space, which allows us to decompose each subset into binary indicators. This formulation lets us distinguish between the “main effect” derived from the main actions and the “residual effect” originating from the supplemental actions, facilitating more effective OPE. Specifically, our estimator, called OPCB, leverages an importance sampling-based approach to unbiasedly estimate the main effect, while employing a regression-based approach to deal with the residual effect with low variance. OPCB achieves substantial variance reduction compared to conventional importance sampling methods and bias reduction relative to regression methods under certain conditions, as shown in our theoretical analysis. Experiments demonstrate OPCB’s superior performance over typical methods in both OPE and OPL. A simplified sketch of this main/residual decomposition follows this entry.

    Full text in ACM Digital Library
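
    Here is a deliberately simplified sketch of the factored-action idea, assuming independent-Bernoulli logging and target policies over binary item indicators, a hand-picked set of “main” dimensions, and a least-squares reward model; the actual OPCB estimator and the conditions under which it is unbiased are in the paper. The sketch importance-weights only the main-dimension pattern, avoiding the exponentially large subset space, and lets the regression model absorb the residual effect.

      import numpy as np

      rng = np.random.default_rng(2)
      n, d = 20_000, 6                 # logged rounds, candidate items (binary indicators)
      main = np.array([0, 1])          # hypothetical "main action" dimensions

      mu_p = np.full(d, 0.5)           # logging policy: include each item w.p. 0.5
      pi_p = np.array([0.8, 0.7, 0.5, 0.5, 0.5, 0.5])  # target up-weights the main items

      a = (rng.random((n, d)) < mu_p).astype(float)    # logged subsets
      r = 0.5 * a[:, 0] + 0.3 * a[:, 1] + 0.05 * a[:, 2:].sum(1) + rng.normal(0, 0.1, n)

      def main_prob(a_main, p):
          # Probability of the observed main-dimension pattern under a factored policy.
          q = p[main]
          return np.where(a_main == 1, q, 1 - q).prod(axis=1)

      # Importance weights over the main dimensions only: far lower variance than
      # weighting the full subset, whose support grows exponentially in d.
      w_main = main_prob(a[:, main], pi_p) / main_prob(a[:, main], mu_p)

      # Regression model for the residual effect (simple least squares here).
      X = np.hstack([a, np.ones((n, 1))])
      coef, *_ = np.linalg.lstsq(X, r, rcond=None)

      # Hybrid estimate: importance sampling for the main effect, model for the rest.
      a_pi = (rng.random((n, d)) < pi_p).astype(float)   # fresh target-policy subsets
      X_pi = np.hstack([a_pi, np.ones((n, 1))])
      estimate = np.mean(w_main * (r - X @ coef)) + np.mean(X_pi @ coef)
      print(estimate)  # ground truth under pi here: 0.8*0.5 + 0.7*0.3 + 4*0.5*0.05 = 0.71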
