Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp01s7526g80x
Title: Exploring Reasoning and Interactive Benchmarking of Language Models
Authors: Prabhakar, Akshara
Advisors: Narasimhan, Karthik
Department: Computer Science
Class Year: 2024
Publisher: Princeton, NJ : Princeton University
Abstract: This thesis aims to deepen our understanding of the reasoning process in Large Language Models (LLMs) and introduces a challenging benchmark to evaluate language agents' interactive code-generation capabilities. Part 1 untangles the factors affecting the Chain-of-Thought (CoT) reasoning process in LLMs. Although CoT prompting has demonstrated significant efficacy in enhancing multi-step reasoning, it has been found to yield biased answers and unfaithful explanations, raising interpretability concerns; debates also persist over whether LLMs truly generalize or rely on heuristics. Focusing on the symbolic reasoning task of decoding shift ciphers, we develop a simple probabilistic approach to disentangle three factors: the probability of the task's expected output (probabilistic effect), what the model has implicitly learned during pre-training (memorization), and the model's tendency to take shorter reasoning paths (noisy reasoning); we show that together these factors account for the drastic variability in task accuracy. Through a series of experiments, we conclude that LLM behavior exhibits clear hallmarks of both memorization and true reasoning, suggesting that CoT resembles a probabilistic, memorization-influenced form of noisy reasoning. Part 2 transitions from LLMs to the emerging domain of language agents, where an LLM is grounded in a digital environment to aid decision-making. We introduce InterCode, an interactive RL environment for benchmarking the interactive code-generation abilities of language agents with execution-driven feedback. We convert traditional seq2seq datasets into three interactive code environments (Bash, SQL, and Python) and demonstrate InterCode's viability as a testbed by evaluating multiple state-of-the-art LLMs configured with prompting strategies such as ReAct and Plan & Solve. Our results underscore the benefits of interactive code generation and establish InterCode as a robust, scalable benchmark for advancing code understanding and generation capabilities. Furthermore, we apply this framework to cybersecurity: by developing a Capture the Flag task in the InterCode environment, we find that while language agents possess rudimentary cybersecurity knowledge, they cannot perform multi-step cybersecurity tasks out of the box. Overall, this thesis highlights the evolving landscape of LLMs, emphasizing the need for comprehensive evaluation and understanding of their capabilities across different domains.
URI: http://arks.princeton.edu/ark:/88435/dsp01s7526g80x
Type of Material: Academic dissertations (M.S.E.)
Language: en
Appears in Collections: Computer Science, 2023
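
The abstract above centers on two technical settings, illustrated here with minimal sketches that are not code from the thesis. First, the symbolic reasoning task of Part 1 is decoding shift ciphers: recovering text whose letters were each shifted forward by a fixed amount in the alphabet (shift = 13 is the familiar rot13 case).

import string

def shift_encode(text, shift):
    # Shift each lowercase letter forward by `shift` positions, wrapping around.
    letters = string.ascii_lowercase
    k = shift % 26
    table = str.maketrans(letters, letters[k:] + letters[:k])
    return text.translate(table)

def shift_decode(text, shift):
    # Decoding is encoding with the inverse shift.
    return shift_encode(text, -shift)

print(shift_encode("reasoning", 13))   # ernfbavat
print(shift_decode("ernfbavat", 13))   # reasoning

Second, Part 2 evaluates language agents in an interactive loop: the agent emits a code action, the environment executes it, and the execution output plus a reward are returned so the agent can revise its next action. The sketch below is an assumption-laden simplification of that pattern, not InterCode's actual API; the ToyBashEnv class and the hard-coded actions are hypothetical.

import subprocess

class ToyBashEnv:
    # Toy stand-in for an execution-driven environment: runs a Bash command
    # and rewards an exact match between its output and a gold answer.
    def __init__(self, gold):
        self.gold = gold

    def step(self, action):
        result = subprocess.run(action, shell=True, capture_output=True, text=True)
        observation = (result.stdout + result.stderr).strip()
        reward = 1.0 if observation == self.gold else 0.0
        return observation, reward

env = ToyBashEnv(gold="hello")
# A real agent would generate these actions from the task prompt and prior
# observations; they are hard-coded here purely for illustration.
for action in ["echo hell", "echo hello"]:
    observation, reward = env.step(action)
    print(action, "->", observation, "| reward:", reward)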

Files in This Item:
File: Prabhakar_princeton_0181G_14975.pdf
Size: 4.89 MB
Format: Adobe PDF


Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.