Please use this identifier to cite or link to this item:
http://arks.princeton.edu/ark:/88435/dsp01zw12z866w
Title: | An In-Depth Evaluation of Large Language Models on Olympiad Programming |
Authors: | Tang, Michael |
Advisors: | Narasimhan, Karthik |
Department: | Computer Science |
Class Year: | 2024 |
Abstract: | In this work, we introduce the USACO benchmark to evaluate the reasoning and coding abilities of large language models. We leverage challenging problems from olympiad-level competitive programming contests to rigorously evaluate models’ ability to problem-solve creatively while simultaneously grounding their reasoning in complex, ad hoc environments. Beyond benchmarking state-of-the-art models and prompt scaffolding techniques, we use code generation as a lens to explore several unintuitive phenomena surrounding reasoning in language models. In Chapter 4, we discuss the unreasonable effectiveness of problem-solving via mass zero-shot sampling, which outperforms sophisticated-but-expensive techniques like self-reflection, and investigate the tradeoffs between larger and smaller models in the fixed-budget setting. In Chapter 5, we introduce the USACO Extended Task Suite, a synthetically generated evaluation framework that goes beyond traditional problem-solving benchmarks to evaluate different aspects of code generation, such as debugging and execution output prediction, in a fine-grained way, and discuss new ideas such as dynamic evaluation that arise naturally from our results. |
URI: | http://arks.princeton.edu/ark:/88435/dsp01zw12z866w |
Type of Material: | Princeton University Senior Theses |
Language: | en |
Appears in Collections: | Computer Science, 1987-2024 |
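The abstract above mentions problem-solving via mass zero-shot sampling. As a rough, illustrative sketch (not code from the thesis), one way to implement that evaluation loop is to draw many independent zero-shot candidate programs per problem and count a problem as solved if any candidate passes the judge's tests; `generate_solution` and `passes_tests` below are hypothetical placeholders for a model call and a sandboxed test harness.

```python
def solve_by_mass_sampling(problem, k, generate_solution, passes_tests):
    """Sample k independent zero-shot candidate programs for `problem`
    and report whether any of them passes the judge's test cases.

    `generate_solution` and `passes_tests` are hypothetical stand-ins for
    an LLM call (sampled at temperature > 0) and a sandboxed test harness.
    """
    for _ in range(k):
        candidate = generate_solution(problem)   # one independent zero-shot attempt
        if passes_tests(candidate, problem):     # run against the problem's tests
            return True, candidate               # solved: stop early
    return False, None                           # none of the k attempts passed


def solve_rate(problems, k, generate_solution, passes_tests):
    """Fraction of problems solved within a budget of k samples each."""
    solved = sum(
        solve_by_mass_sampling(p, k, generate_solution, passes_tests)[0]
        for p in problems
    )
    return solved / len(problems)
```

With `k = 1` this reduces to ordinary single-attempt zero-shot evaluation; larger `k` simply spends more samples per problem.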
Files in This Item:
File | Description | Size | Format
---|---|---|---
TANG-MICHAEL-THESIS.pdf | | 2.54 MB | Adobe PDF