Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp01x059cb50m
Title: Policy Evaluation in Batch Reinforcement Learning
Authors: Duan, Yaqi
Advisors: Wang, Mengdi
Contributors: Operations Research and Financial Engineering Department
Keywords: distributional shift
kernel methods
linear function approximation
policy evaluation
reinforcement learning
temporal difference learning
Subjects: Operations research
Statistics
Computer science
Issue Date: 2022
Publisher: Princeton, NJ : Princeton University
Abstract: Policy evaluation is a central problem in batch reinforcement learning. It refers to the assessment of a given decision policy using logged data. Practical methods for policy evaluation typically involve some form of function approximation to alleviate the issues caused by the enormous scale of real-world systems and the length of planning horizons. This thesis is primarily concerned with the statistical analysis of policy evaluation and shows how function approximation improves the efficacy of methods. In the first part of the thesis, we consider off-policy evaluation with linear function approximation. We show the equivalence of a regression-based fitted Q-iteration method, marginalized importance sampling methods, and a model-based method that estimates a conditional mean embedding of the transition operator. Moreover, our theory reveals that the hardness of off-policy evaluation is determined by the mismatch between the data and target distributions, which is reflected by a projected chi-square divergence in the error bounds. We prove that the estimators are minimax optimal in terms of the sample size, the planning horizon, and the mismatch term. In the second part of the thesis, we study on-policy evaluation with kernel methods. In particular, we are interested in a regularized form of the kernel least-squares temporal-difference (LSTD) estimate. We use techniques from empirical process theory to derive a non-asymptotic upper bound on the error with explicit dependence on the eigenvalues of the associated kernel operator, as well as the instance-dependent variance of the Bellman residual error. In addition, we prove minimax lower bounds over sub-classes of Markov reward processes (MRPs), which show that our rate is optimal in terms of the sample size and the effective horizon. Our analysis sheds light on how to tune the complexity of the function class to strike a favorable balance between the curse of dimension and the curse of horizon in policy evaluation.
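To make the first part of the abstract concrete, here is a minimal sketch of what a regression-based fitted Q-iteration (policy evaluation) estimator with linear function approximation might look like. The data layout, the feature map phi, the target policy pi, and all parameter names below are illustrative assumptions for this sketch, not the thesis's actual method or experimental setup.

```python
# Sketch of fitted Q-evaluation with linear features (illustrative only).
import numpy as np

def fitted_q_evaluation(transitions, phi, pi, horizon, dim, reg=1e-3):
    """Estimate Q-functions of a target policy `pi` from logged transitions.

    transitions: list of (s, a, r, s_next) tuples logged by a behavior policy.
    phi:         feature map phi(s, a) -> np.ndarray of shape (dim,).
    pi:          target policy, pi(s) -> dict mapping action -> probability.
    horizon:     planning horizon H.
    """
    w = np.zeros(dim)                      # weights of Q_{h+1}; Q_H is zero
    for _ in range(horizon):               # backward recursion h = H-1, ..., 0
        X, y = [], []
        for (s, a, r, s_next) in transitions:
            # Regression target: reward plus expected next-step value under pi.
            next_v = sum(p * (phi(s_next, a_next) @ w)
                         for a_next, p in pi(s_next).items())
            X.append(phi(s, a))
            y.append(r + next_v)
        X, y = np.asarray(X), np.asarray(y)
        # Ridge-regularized least-squares fit of Q_h.
        w = np.linalg.solve(X.T @ X + reg * np.eye(dim), X.T @ y)
    return w                               # Q_0(s, a) is approximated by phi(s, a) @ w
```

The policy value would then be estimated by averaging phi(s0, a) @ w over initial states s0 with actions a drawn from pi(.|s0); the distribution mismatch term discussed in the abstract governs how well this regression generalizes from the logged data to the target policy's distribution.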
URI: http://arks.princeton.edu/ark:/88435/dsp01x059cb50m
Alternate format: The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog: catalog.princeton.edu
Type of Material: Academic dissertations (Ph.D.)
Language: en
Appears in Collections: Operations Research and Financial Engineering

Files in This Item:
File: Duan_princeton_0181D_14137.pdf
Size: 2.92 MB
Format: Adobe PDF

Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.