Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp01r494vp330
Title: Learning to Solve Structured Vision Problems
Authors: Newell, Alejandro
Advisors: Deng, Jia
Contributors: Computer Science Department
Subjects: Computer science
Issue Date: 2022
Publisher: Princeton, NJ : Princeton University
Abstract: We want computer vision models to understand the rich world captured in images and video. This requires not just recognizing objects, but identifying their relationships and interactions. Combining contributions in both neural architecture and loss design, we expand the capacity of convolutional networks to express such interactions and solve a broad range of structured computer vision tasks. We first introduce a convolutional network architecture for dense per-pixel prediction. We show how intermediate supervision and repeated processing across feature scales lead to better network performance, referring to the architecture as a "stacked hourglass" network based on the successive steps of pooling and upsampling during inference. We benchmark on the task of human pose estimation, achieving state-of-the-art performance. Next, we introduce associative embedding, a method for supervising networks to solve detection and grouping tasks. A number of problems can be framed in this manner, including multi-person pose estimation, instance segmentation, and multi-object tracking. Usually these problems are addressed with multi-stage pipelines; instead, we train a model to simultaneously output detections and group assignments. We can then extend the use of associative embeddings to define arbitrary graphs. We demonstrate how to supervise embeddings such that a model both detects the objects in a scene and defines semantic relationships between pairs of objects. Finally, we perform an investigation of self-supervision methods. Recent self-supervised losses rely on a learning signal similar to the loss we leverage in our associative embedding work, but it is unclear how useful these losses are for general-purpose visual feature pretraining.
We investigate what factors play a role in the utility of such pretraining by evaluating self-supervised algorithms across a comprehensive array of synthetic datasets and downstream tasks. Our experiments highlight how self-supervision can be more or less useful depending on the amount of labeled data, the complexity of the data, and the target downstream task. Together, the work in this thesis shows how to build and train better models while providing insights into what steps lead to the best performance across a wide variety of computer vision tasks.
URI: http://arks.princeton.edu/ark:/88435/dsp01r494vp330
Alternate format: The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog: catalog.princeton.edu
Type of Material: Academic dissertations (Ph.D.)
Language: en
Appears in Collections: Computer Science