Title: AdvTranSafemer: Robust Policy Learning via Transformer and Adversarial Attack
Authors: Wu, Xiaorun
Advisors: Jha, Niraj K
Certificate Program: Robotics & Intelligent Systems Program
Class Year: 2022
Abstract: Traditionally, Reinforcement Learning (RL) has been concerned with developing stationary policies for robotics safety and planning. One key assumption traditional RL relies on heavily is the Markov assumption, under which the distribution of future states depends only on the current state. However, we may also model the RL problem as a generic sequence modeling problem, where the goal is to produce a sequence of actions that maximizes the designated reward. Transformers are currently at the forefront of research, and their success in other sequence modeling tasks such as natural language processing suggests promising potential for modeling safety tasks in RL as well. In this paper, we introduce a novel mechanism for the agent to learn a robust safety policy. Our contribution is twofold: first, we employ a transformer to generate rewards, giving the agent a richer learning curriculum; second, we introduce adversaries, each with an objective function that is exactly the negative of the agent's. The agent and each of the adversaries play a zero-sum game. Through these processes, we expect the agent to benefit from attending to a longer history and, by playing against random adversaries, to learn a policy that is more robust to random disturbances in the environment. In addition, we employ trust-region techniques, namely the clipped surrogate objective and an adaptive KL penalty coefficient, as well as Lyapunov stability verification, as additional stabilization tools for more complex environments. We tested the efficacy of our design on twelve continuous control tasks. Using a bottom-up approach, we tested the environments with increasingly refined algorithm designs. Our results show substantially greater stability (a boost of more than 70%), higher reproducibility (at least 35%), faster convergence (at least a 50% boost), and reduced training time.
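The trust-region machinery mentioned in the abstract (clipped surrogate objective and adaptive KL penalty coefficient) can be illustrated with a minimal sketch. This is not the thesis's implementation; it assumes the standard PPO-style formulation, an illustrative clip range of 0.2, and the common heuristic doubling/halving schedule for the KL coefficient. All function and parameter names here are illustrative.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-style clipped surrogate objective for a single sample.

    ratio: pi_new(a|s) / pi_old(a|s), the probability ratio.
    advantage: estimated advantage A(s, a).
    Returns min(r * A, clip(r, 1 - eps, 1 + eps) * A), which removes the
    incentive to move the ratio outside the trust region [1-eps, 1+eps].
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def adaptive_kl_coeff(beta, observed_kl, kl_target=0.01):
    """Adaptive KL penalty coefficient update (heuristic schedule).

    If the observed KL divergence between old and new policies overshoots
    the target, increase the penalty; if it undershoots, relax it.
    """
    if observed_kl > 1.5 * kl_target:
        return beta * 2.0
    if observed_kl < kl_target / 1.5:
        return beta / 2.0
    return beta
```

For example, with `ratio = 1.5` and a positive advantage of `1.0`, the objective is capped at `1.2` (the clipped ratio times the advantage), so the policy gains nothing from pushing the ratio further; a negative advantage with `ratio = 0.5` is similarly bounded from below.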
Type of Material: Princeton University Senior Theses
Language: en
Appears in Collections:Robotics and Intelligent Systems Program

Files in This Item:
File: WU-XIAORUN-THESIS.pdf (3.18 MB, Adobe PDF)

Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.