Session 10: Reinforcement Learning

Date: Thursday September 21, 2:00 PM – 3:20 PM (GMT+8)
Room: Hall 406D
Session Chair: Oren Sar Shalom
Parallel with: Session 9: Collaborative filtering 2

  • RESInTune: Reinforcement Learning-based Data Pipeline Optimization for Deep Recommendation Models
    by Kabir Nagrecha (University of California, San Diego), Lingyi Liu (Netflix, Inc.), Pablo Delgado (Netflix, Inc.) and Prasanna Padmanabhan (Netflix, Inc.).

    Deep learning-based recommendation models (DLRMs) have become an essential component of many modern recommender systems. Several companies are now building large compute clusters reserved only for DLRM training, driving new interest in cost- & time- saving optimizations. The systems challenges faced in this setting are unique; while typical deep learning (DL) training jobs are dominated by model execution times, the most important factor in DLRM training performance is often online data ingestion.

    In this paper, we explore the unique characteristics of this data ingestion problem and provide insights into the specific bottlenecks and challenges of the DLRM training pipeline at scale. We study real-world DLRM data processing pipelines taken from our compute cluster to both observe the performance impacts of online ingestion and to identify shortfalls in existing data pipeline optimizers. We find that current tooling either yields sub-optimal performance, frequent crashes, or else requires impractical cluster re-organization to adopt. Our studies lead us to design and build a new solution for data pipeline optimization, InTune. InTune employs a reinforcement learning (RL) agent to learn how to distribute CPU resources across a DLRM data pipeline to more effectively parallelize data-loading and improve throughput. Our experiments show that InTune can build an optimized data pipeline configuration within only a few minutes, and can easily be integrated into existing training workflows. By exploiting the responsiveness and adaptability of RL, InTune achieves significantly higher online data ingestion rates than existing optimizers, thus reducing idle times in model execution and increasing efficiency. We apply InTune to our real-world cluster, and find that it increases data ingestion throughput by as much as 2.29X versus current state-of-the-art data pipeline optimizers while also improving both CPU & GPU utilization.

    Full text in ACM Digital Library

  • RESGenerative Learning Plan Recommendation for Employees: A Performance-aware Reinforcement Learning Approach
    by Zhi Zheng (University of Science and Technology of China), Ying Sun (The Hong Kong University of Science and Technology (Guangzhou)), Xin Song (Baidu), Hengshu Zhu (BOSS Zhipin) and Hui Xiong (The Hong Kong University of Science and Technology (Guangzhou)).

    With the rapid development of enterprise Learning Management Systems (LMS), more and more companies are trying to build enterprise training and course learning platforms for promoting the career development of employees. Indeed, through course learning, many employees have the opportunity to improve their knowledge and skills. For these systems, a major issue is how to recommend learning plans, i.e., a set of courses arranged in the order they should be learned, that can help employees improve their work performance. Existing studies mainly focus on recommending courses that users are most likely to click on by capturing their learning preferences. However, the learning preference of employees may not be the right fit for their career development, and thus it may not necessarily mean their work performance can be improved accordingly. Furthermore, how to capture the mutual correlation and sequential effects between courses, and ensure the rationality of the generated results, is also a major challenge. To this end, in this paper, we propose the Generative Learning plAn recommenDation (GLAD) framework, which can generate personalized learning plans for employees to help them improve their work performance. Specifically, we first design a performance predictor and a rationality discriminator, which have the same transformer-based model architecture, but with totally different parameters and functionalities. In particular, the performance predictor is trained for predicting the work performance of employees based on their work profiles and historical learning records, while the rationality discriminator aims to evaluate the rationality of the generated results. Then, we design a learning plan generator based on the gated transformer and the cross-attention mechanism for learning plan generation. We calculate the weighted sum of the output from the performance predictor and the rationality discriminator as the reward, and we use Self-Critical Sequence Training (SCST) based policy gradient methods to train the generator following the Generative Adversarial Network (GAN) paradigm. Finally, extensive experiments on real-world data clearly validate the effectiveness of our GLAD framework compared with state-of-the-art baseline methods and reveal some interesting findings for talent management

    Full text in ACM Digital Library

  • RESCorrecting for Interference in Experiments: A Case Study at Douyin
    by Vivek Farias (MIT), Hao Li (Bytedance), Tianyi Peng (MIT), Xinyuyang Ren (Bytedance), Huawei Zhang (Bytedance) and Andrew Zheng (MIT).

    Interference is a ubiquitous problem in experiments conducted on two-sided content marketplaces, such as Douyin (China’s analog of TikTok). In many cases, creators are the natural unit of experimentation, but creators interfere with each other through competition for viewers’ limited time and attention. “Naive” estimators currently used in practice simply ignore the interference, but in doing so incur bias on the order of the treatment effect. We formalize the problem of inference in such experiments as one of policy evaluation. Off-policy estimators, while unbiased, are impractically high variance. We introduce a novel Monte-Carlo estimator, based on “Differences-in-Qs” (DQ) techniques, which achieves bias which is second-order in the treatment effect, while remaining sample-efficient to estimate. On the theoretical side, our contribution is to develop a generalized theory of Taylor expansions for policy evaluation, which extends DQ theory to all major MDP formulations. On the practical side, we implement our estimator on Douyin’s experimentation platform, and in the process develop DQ into a truly “plug-and-play” estimator for interference in real-world settings: one which provides robust, low-bias, low-variance treatment effect estimates; admits computationally cheap, asymptotically exact uncertainty quantification; and reduces MSE by 99\% compared to the best existing alternatives in our applications.

    Full text in ACM Digital Library

  • REPReproducibility of Multi-Objective Reinforcement Learning Recommendation: Interplay between Effectiveness and Beyond-Accuracy Perspectives
    by Vincenzo Paparella (Politecnico di Bari), Vito Walter Anelli (Politecnico di Bari), Ludovico Boratto (University of Cagliari) and Tommaso Di Noia (Politecnico di Bari)

    Providing effective suggestions is of predominant importance for successful Recommender Systems (RSs). Nonetheless, the need of accounting for additional multiple objectives has become prominent, from both the final users’ and the item providers’ points of view. This need has led to a new class of RSs, called Multi-Objective Recommender Systems (MORSs). These systems are designed to provide suggestions by considering multiple (conflicting) objectives simultaneously, such as diverse, novel, and fairness-aware recommendations. In this work, we reproduce a state-of-the-art study on MORSs that exploits a reinforcement learning agent to satisfy three objectives, i.e., accuracy, diversity, and novelty of recommendations. The selected study is one of the few MORSs where the source code and datasets are released to ensure the reproducibility of the proposed approach. Interestingly, we find that some challenges arise when replicating the results of the original work, due to the nature of multiple-objective problems. We also extend the evaluation of the approach to analyze the impact of improving user-centred objectives of recommendations (i.e., diversity and novelty) in terms of algorithmic bias. To this end, we take into consideration both popularity and category of the items. We discover some interesting trends in the recommendation performance according to different evaluation metrics. In addition, we see that the multi-objective reinforcement learning approach is responsible for increasing the bias disparity in the output of the recommendation algorithm for those items belonging to positively/negatively biased categories. We publicly release datasets and codes in the following GitHub repository:

    Full text in ACM Digital Library

Back to program

Diamond Supporter
Platinum Supporter
Amazon Science
Gold Supporter
Silver Supporter
Bronze Supporter
Challenge Sponsor
Special Supporters