Off-policy Evaluation and Learning for Interactive Systems


Yi Su


2020-11-02 10:00:00 ~ 2020-11-02 11:30:00


Room 1319, Software Expert Building; ZOOM ID: 635 724 68634, Password: 245863


Weinan Zhang, Associate Professor, John Hopcroft Center for Computer Science

Many real-world applications, ranging from news recommendation and online advertising to personalized healthcare, are naturally modeled by the contextual-bandit protocol, in which a learner repeatedly observes a context, takes an action, and accrues a reward. A fundamental question in such settings is: given a new version of the system (i.e., a new policy), what is its expected reward? Online A/B testing offers a generic way to answer this question through controlled randomized trials. However, such online experimentation is slow, can only be done for a small number of new policies, has high engineering cost, and can impose substantial cost on users when the new policy is of low quality. These shortcomings motivate the goal of offline A/B testing, also known as off-policy evaluation (OPE), which does not require a new online experiment for every new policy we want to evaluate, but instead reuses past data we already have. At the core of this methodology lies the design of counterfactual estimators that accurately evaluate the performance of a new policy using only logged data of past behavior. In this talk, I will present my recent work on off-policy evaluation. It includes the discovery of a general family of counterfactual estimators, followed by a new optimization-based framework for designing estimators that achieves a better bias-variance tradeoff than the doubly robust estimator in finite samples. Beyond off-policy evaluation, I will also briefly introduce the estimator selection problem in OPE. Finally, I will survey some of my recent work on off-policy learning: how we can use logged data to safely learn the best policy to deploy in the future.
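To make "counterfactual estimator" concrete, the sketch below implements two textbook OPE baselines, inverse propensity scoring (IPS) and the doubly robust (DR) estimator that the abstract uses as a point of comparison, on a small simulated problem. The reward table, both policies, and the reward models are assumptions chosen for illustration. This is not the estimator family or the optimization-based framework presented in the talk; it only shows how logged data from one policy can be reweighted to estimate the value of another.

```python
import random

# Hypothetical toy problem (an assumption for illustration): 2 contexts, 3 actions,
# Bernoulli rewards whose means are given by this table.
MEAN_REWARD = [[0.2, 0.5, 0.8],
               [0.7, 0.3, 0.1]]
N_ACTIONS = 3


def logging_policy(x):
    """Behavior policy mu(a|x) that produced the logs: uniform (assumed)."""
    return [1.0 / N_ACTIONS] * N_ACTIONS


def target_policy(x):
    """New policy pi(a|x) to be evaluated offline (assumed)."""
    return [0.1, 0.1, 0.8] if x == 0 else [0.8, 0.1, 0.1]


def collect_logs(n, rng):
    """Simulate logged (x, a, r, mu(a|x)) tuples gathered by the logging policy."""
    logs = []
    for _ in range(n):
        x = rng.randrange(2)
        probs = logging_policy(x)
        a = rng.choices(range(N_ACTIONS), weights=probs)[0]
        r = 1.0 if rng.random() < MEAN_REWARD[x][a] else 0.0
        logs.append((x, a, r, probs[a]))
    return logs


def ips_estimate(logs, pi):
    """IPS: average of logged rewards reweighted by pi(a|x) / mu(a|x)."""
    return sum(pi(x)[a] / p * r for x, a, r, p in logs) / len(logs)


def dr_estimate(logs, pi, q_hat):
    """DR: reward-model baseline plus an IPS-weighted correction of its residual."""
    total = 0.0
    for x, a, r, p in logs:
        probs = pi(x)
        baseline = sum(probs[b] * q_hat(x, b) for b in range(N_ACTIONS))
        total += baseline + probs[a] / p * (r - q_hat(x, a))
    return total / len(logs)


def true_value(pi):
    """Ground-truth value; computable here only because the problem is simulated."""
    return sum(0.5 * sum(pi(x)[a] * MEAN_REWARD[x][a] for a in range(N_ACTIONS))
               for x in range(2))


def crude_model(x, a):
    """A deliberately poor reward model (assumed): predicts 0.5 everywhere."""
    return 0.5


if __name__ == "__main__":
    rng = random.Random(0)
    logs = collect_logs(20000, rng)
    print("true:", round(true_value(target_policy), 3))
    print("IPS: ", round(ips_estimate(logs, target_policy), 3))
    print("DR:  ", round(dr_estimate(logs, target_policy, crude_model), 3))
```

With an exact propensity `p` for every logged action, both estimators are unbiased for the target policy's value even when the reward model is poor; a better reward model mainly reduces DR's variance. The bias-variance tradeoff of such estimators in finite samples is the issue the talk's framework addresses.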
Yi Su is a Ph.D. student in the Department of Statistics and Data Science at Cornell University, advised by Professor Thorsten Joachims. Her research interests lie in learning from user behavior and implicit feedback in search engines, recommender systems, and market platforms. She currently works on off-policy evaluation and learning in contextual bandits and reinforcement learning. Before joining Cornell, she received a BSc (Honors) in Mathematics from Nanyang Technological University, Singapore. She is the recipient of the Lee Kuan Yew Gold Medal (2016), the Bloomberg Data Science Fellowship (2019-2021), and EECS Rising Stars 2020.
© John Hopcroft Center for Computer Science, Shanghai Jiao Tong University

Email: jhc@sjtu.edu.cn Tel: 021-54740299