Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp017s75dg58p
Title: Don’t Shoot the Messenger: Temporal Grounding of Natural Language in Entities and Dynamics
Authors: Ortaoglu, Begum
Advisors: Narasimhan, Karthik
Department: Electrical and Computer Engineering
Class Year: 2022
Abstract: Grounding language in entities refers to matching entities in the world or in simulated environments to their referents in text. Grounding entities based on natural language descriptions of their dynamics in an environment, in order to drive control-policy generalization, is a complex task. Messenger is a game-like environment in which an agent moves on a 10×10 grid together with other entities that play different roles such as enemy, goal, and message. The agent's objective is to deliver the message to the goal while avoiding the enemy. Each Messenger game comes with an accompanying text manual that describes the entities, their roles, and their dynamics in natural language. Messenger has three stages, each presenting a different challenge. We focus on stage 3, where entities must be disambiguated according to their dynamics over multiple time frames. To tackle this problem we build upon EMMA (Entity Mapper with Multi-Modal Attention), a model proposed by Hanjie et al. [5] and trained end-to-end with reinforcement learning in the Messenger environment. EMMA uses an attention mechanism to combine information from the text manual with the state representation of each game frame in order to decide the agent's next action. We focus on improving the model's temporal grounding abilities, so that an entity's movement over time can be matched to the description of its dynamics, and propose a temporal EMMA (TEMMA) model to achieve this. We found that TEMMA achieves a 6% higher win rate on stage 1 training games and a 5% higher win rate on stage 3 training games than EMMA, bringing the training win rates up to 95% and 27%, respectively. Curriculum training, in which the learned weights from the previous stage are transferred to the current stage, performed best, whereas pretraining TEMMA on the EMMA model weights did not work. We also found that attention freezing, in which the embedding weights are held fixed and only the action layer's weights are updated, was necessary for training in stages 2 and 3. On test games there is no significant, consistent increase in stage 3 performance, suggesting that our changes to the model's temporal abilities do not improve generalization to unseen dynamics and entities. Stage 3 remains a significant challenge with substantial room for improvement.
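
Illustrative note (not part of the thesis record): the sketch below shows, in PyTorch, one way an EMMA/TEMMA-style fusion and the "attention freezing" described in the abstract could look. Every name in it (TemporalTextStateAttention, entity_frames, manual_tokens, freeze_all_but_action_head, the layer sizes) is an assumption for illustration only, not the author's actual implementation.

# Minimal sketch, assuming pre-encoded entity and manual-token embeddings.
import torch
import torch.nn as nn

class TemporalTextStateAttention(nn.Module):
    """Toy TEMMA-style fusion: each grid entity attends over the manual's
    token embeddings, and a short history of frames is summarized so the
    policy can ground entities by their movement over time."""

    def __init__(self, d_model=64, n_actions=5):
        super().__init__()
        self.entity_query = nn.Linear(d_model, d_model)   # entity -> attention query
        self.text_key = nn.Linear(d_model, d_model)       # manual token -> key
        self.text_value = nn.Linear(d_model, d_model)     # manual token -> value
        self.temporal = nn.GRU(d_model, d_model, batch_first=True)  # fuse frame history
        self.action_head = nn.Linear(d_model, n_actions)  # policy logits

    def forward(self, entity_frames, manual_tokens):
        # entity_frames: (batch, n_frames, n_entities, d_model)
        # manual_tokens: (batch, n_tokens, d_model)
        k = self.text_key(manual_tokens)                          # (B, T, D)
        v = self.text_value(manual_tokens)                        # (B, T, D)
        q = self.entity_query(entity_frames)                      # (B, F, E, D)
        d = q.shape[-1]
        attn = torch.softmax(
            torch.einsum("bfed,btd->bfet", q, k) / d ** 0.5, dim=-1
        )                                                         # entity-over-text weights
        grounded = torch.einsum("bfet,btd->bfed", attn, v)        # text-informed entities
        frame_repr = grounded.mean(dim=2)                         # pool entities per frame
        _, h = self.temporal(frame_repr)                          # summarize frame history
        return self.action_head(h.squeeze(0))                     # action logits

def freeze_all_but_action_head(model):
    """Sketch of 'attention freezing': hold the grounding/embedding weights
    fixed and update only the action layer when training a later stage."""
    for name, p in model.named_parameters():
        p.requires_grad = name.startswith("action_head")

The design point the sketch is meant to convey is simply that the text manual is consulted through attention at every frame, while the temporal module sees a stack of frames rather than a single one; whether this matches the thesis's exact architecture is not confirmed by the record above.
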
URI: http://arks.princeton.edu/ark:/88435/dsp017s75dg58p
Type of Material: Princeton University Senior Theses
Language: en
Appears in Collections: Electrical and Computer Engineering, 1932-2023

Files in This Item:
File: ORTAOGLU-BEGUM-THESIS.pdf
Size: 865.73 kB
Format: Adobe PDF