Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp01zw12z866w
Title: An In-Depth Evaluation of Large Language Models on Olympiad Programming
Authors: Tang, Michael
Advisors: Narasimhan, Karthik
Department: Computer Science
Class Year: 2024
Abstract: In this work, we introduce the USACO benchmark to evaluate the reasoning and coding abilities of large language models. We leverage challenging problems from olympiad-level competitive programming contests to rigorously evaluate models’ ability to problem-solve creatively while simultaneously grounding their reasoning in complex, ad hoc environments. Beyond benchmarking state-of-the-art models and prompt scaffolding techniques, we use code generation as a lens to explore several unintuitive phenomena surrounding reasoning in language models. In Chapter 4, we discuss the unreasonable effectiveness of problem-solving via mass zero-shot sampling, which outperforms sophisticated but expensive techniques like self-reflection, and investigate the tradeoffs between larger and smaller models in the fixed-budget setting. In Chapter 5, we introduce the USACO Extended Task Suite, a synthetically generated evaluation framework that goes beyond traditional problem-solving benchmarks to evaluate different aspects of code generation, such as debugging and execution output prediction, in a fine-grained way, and discuss new ideas such as dynamic evaluation that arise naturally from our results.
URI: http://arks.princeton.edu/ark:/88435/dsp01zw12z866w
Type of Material: Princeton University Senior Theses
Language: en
Appears in Collections: Computer Science, 1987-2024

Files in This Item:
File: TANG-MICHAEL-THESIS.pdf
Size: 2.54 MB
Format: Adobe PDF
Access: Request a copy

Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.