Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp01jd4730577
Title: Point and Ask: Incorporating Pointing Into Visual Question Answering
Authors: Mani, Arjun
Advisors: Russakovsky, Olga
Department: Computer Science
Class Year: 2021
Abstract: Visual Question Answering (VQA), the task of answering a natural language question about an image, has become a key benchmark of progress in developing AI systems that can understand and communicate about the visual world. While recent VQA datasets test the ability of AI agents to reason about complex sentences (e.g. asking the agent "What color is the cup to the left of the tray on top of the table?"), they are arguably moving farther away from human communication in the visual world, which often involves nonverbal gestures (e.g. a human might instead point and ask "What color is that cup?"). Pointing in particular is a nearly universal gesture among humans and is actually the first communicative gesture developed in infants. Understanding pointing gestures in the context of a visual dialog with humans would be a key (yet underexplored) ability of real-world AI systems. Thus in this work, we expand the VQA task by introducing a new space of visual questions that include pointing gestures. Concretely, we (1) introduce and motivate point-input questions as an extension of VQA, (2) define four novel classes of questions within this space, and (3) for each class, introduce both a benchmark dataset and a series of model designs to handle its unique challenges. Distinct from prior work, we explicitly design the benchmarks to require the point input, i.e., we ensure that the visual question cannot be answered accurately without the point of reference. Through our exploration we uncover and address several important visual recognition challenges, such as the ability to effectively incorporate a point input and reason both locally and globally about visual scenes, as well as infer human intent.
URI: http://arks.princeton.edu/ark:/88435/dsp01jd4730577
Type of Material: Princeton University Senior Theses
Language: en
Appears in Collections: Computer Science, 1987-2023

Files in This Item:
File: MANI-ARJUN-THESIS.pdf
Size: 2.05 MB
Format: Adobe PDF


Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.