Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp01x059cb50m
DC Field | Value
dc.contributor.author | Duan, Yaqi
dc.contributor.other | Operations Research and Financial Engineering Department
dc.date.accessioned | 2022-06-16T20:34:38Z
dc.date.available | 2022-06-16T20:34:38Z
dc.date.created | 2022-01-01
dc.date.issued | 2022
dc.identifier.uri | http://arks.princeton.edu/ark:/88435/dsp01x059cb50m
dc.description.abstract | Policy evaluation is a central problem in batch reinforcement learning. It refers to the assessment of a given decision policy using logged data. Practical methods for policy evaluation typically involve some form of function approximation so as to alleviate issues caused by the enormous scale of real-world systems and the length of planning horizons. This thesis is primarily concerned with the statistical analysis of policy evaluation and shows how function approximation improves the efficacy of these methods. In the first part of the thesis, we consider off-policy evaluation with linear function approximation. We show the equivalence of a regression-based fitted Q-iteration method, marginalized importance sampling methods, and a model-based method that estimates a conditional mean embedding of the transition operator. Moreover, our theory reveals that the hardness of off-policy evaluation is determined by the mismatch between the data and target distributions, which is reflected by a projected chi-square divergence in the error bounds. We prove that the estimators are minimax optimal in terms of the sample size, the planning horizon, and the mismatch term. In the second part of the thesis, we study on-policy evaluation with kernel methods. In particular, we are interested in a regularized form of the kernel least-squares temporal-difference (LSTD) estimate. We use techniques from empirical process theory to derive a non-asymptotic upper bound on the error with explicit dependence on the eigenvalues of the associated kernel operator, as well as the instance-dependent variance of the Bellman residual error. In addition, we prove minimax lower bounds over sub-classes of Markov reward processes (MRPs), which show that our rate is optimal in terms of the sample size and the effective horizon. Our analysis sheds light on how to tune the complexity of the function class to strike a favorable balance between the curse of dimensionality and the curse of horizon in policy evaluation.
dc.format.mimetype | application/pdf
dc.language.iso | en
dc.publisher | Princeton, NJ : Princeton University
dc.relation.isformatof | The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog: catalog.princeton.edu
dc.subject | distributional shift
dc.subject | kernel methods
dc.subject | linear function approximation
dc.subject | policy evaluation
dc.subject | reinforcement learning
dc.subject | temporal difference learning
dc.subject.classification | Operations research
dc.subject.classification | Statistics
dc.subject.classification | Computer science
dc.title | Policy Evaluation in Batch Reinforcement Learning
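As an illustration of the kind of estimator described in the abstract, the following is a minimal sketch of a regularized least-squares temporal-difference (LSTD) estimate for policy evaluation of a Markov reward process, written with an explicit feature map rather than the kernel formulation analyzed in the thesis. All names (regularized_lstd, phi, gamma, lam) and the toy chain MRP are illustrative assumptions, not the thesis's actual construction.

```python
# Hedged sketch: generic regularized LSTD for policy evaluation of an MRP,
# using an explicit feature map as a stand-in for the kernel-induced features
# discussed in the abstract. All names and the toy MRP are illustrative.
import numpy as np

def regularized_lstd(phi_s, phi_next, rewards, gamma=0.9, lam=1e-3):
    """Estimate value-function weights theta with V(s) ~= phi(s) @ theta.

    phi_s    : (n, d) features of observed states s_i
    phi_next : (n, d) features of successor states s_i'
    rewards  : (n,)   observed rewards r_i
    gamma    : discount factor
    lam      : ridge-style regularization strength
    """
    n, d = phi_s.shape
    # LSTD normal equations: Phi^T (Phi - gamma * Phi') theta = Phi^T r,
    # with a ridge term n * lam * I added for regularization.
    A = phi_s.T @ (phi_s - gamma * phi_next) + n * lam * np.eye(d)
    b = phi_s.T @ rewards
    return np.linalg.solve(A, b)

# Toy usage on a 3-state chain MRP with one-hot features.
rng = np.random.default_rng(0)
states = rng.integers(0, 3, size=500)
next_states = (states + 1) % 3          # deterministic chain transitions
rewards = (states == 2).astype(float)   # reward 1 when leaving state 2
phi = np.eye(3)
theta = regularized_lstd(phi[states], phi[next_states], rewards)
print("estimated state values:", theta)
```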