Full metadata record

DC Field: Value (Language)
dc.contributor.advisor: Jha, Niraj K.
dc.contributor.author: Wu, Xiaorun
dc.description.abstract: Traditionally, reinforcement learning (RL) has been concerned with developing stationary policies for robotics safety and planning. One key assumption traditional RL relies on heavily is the Markov assumption: the distribution of the future state depends only on the current state. However, we may also model the RL problem as a generic sequence-modeling problem, where the goal is to produce a sequence of actions that maximizes the designated reward. The Transformer is currently at the forefront of research, and its success in other sequence-modeling tasks such as NLP offers promising potential for modeling safety tasks in RL as well. In this paper, we introduce a novel mechanism for the agent to learn a robust safety policy. Our novelty is two-fold: first, we employ a Transformer to generate rewards so that the agent has a richer learning curriculum; second, we introduce adversaries, each with an objective function that is exactly the negative of the agent's, so that the agent and each adversary play a zero-sum game. Through these processes, we hope that the agent benefits from attending to a longer history and, by playing against a random adversary, learns a policy that is more robust to random disturbances in the environment. In addition, we employ trust-region techniques, the associated clipped surrogate objective and adaptive KL penalty coefficient, as well as Lyapunov stability verification, as additional stabilization tools to accommodate more complex environments. We tested the efficacy of our design on twelve continuous control tasks. Using a bottom-up approach, we tested the environments using increasingly refined algorithm designs.
Our test results show much greater stability (a boost of more than 70%), higher reproducibility (at least 35%), relatively fast convergence (at least a 50% boost), and reduced training time. (en_US)
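The trust-region machinery the abstract mentions can be sketched compactly. The following is a minimal illustration, not the thesis's implementation: `eps`, `kl_target`, and the halve/double thresholds are assumed values following the common PPO heuristic, and the function names are hypothetical.

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.2):
    # PPO-style clipped surrogate objective: take the minimum of the
    # unclipped and clipped probability-ratio terms, so the update gets
    # no benefit from moving the ratio outside [1 - eps, 1 + eps].
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage)

def adaptive_kl_coef(beta, observed_kl, kl_target=0.01):
    # Adaptive KL penalty coefficient: shrink beta when the policy moved
    # too little, grow it when the policy moved too much (common heuristic).
    if observed_kl < kl_target / 1.5:
        return beta / 2.0
    if observed_kl > kl_target * 1.5:
        return beta * 2.0
    return beta
```

For example, with `eps=0.2`, a ratio of 1.5 and a positive advantage of 1.0 yields a surrogate of 1.2 rather than 1.5, capping the incentive to push the policy further in that direction.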
dc.title: AdvTranSafemer: Robust Policy Learning via Transformer and Adversarial Attack (en_US)
dc.type: Princeton University Senior Theses
pu.certificate: Robotics & Intelligent Systems Program (en_US)
Appears in Collections: Robotics and Intelligent Systems Program

Files in This Item:
File: WU-XIAORUN-THESIS.pdf (3.18 MB, Adobe PDF)

Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.