Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp01q811kp00q
Full metadata record

DC Field                    Value
dc.contributor.advisor      Griffiths, Thomas L
dc.contributor.author       Wynn, Andrea Hui
dc.contributor.other        Computer Science Department
dc.date.accessioned         2024-08-08T18:29:33Z
dc.date.available           2024-08-08T18:29:33Z
dc.date.created             2024-01-01
dc.date.issued              2024
dc.identifier.uri           http://arks.princeton.edu/ark:/88435/dsp01q811kp00q
dc.description.abstract     How can we build AI systems that can learn any set of individual human values both quickly and safely, avoiding causing harm or violating societal standards for acceptable behavior during the learning process? We argue that representational alignment between humans and AI agents facilitates learning human values quickly and safely, an important step towards value alignment in AI. Making AI systems learn human-like representations of the world has many known benefits, including improving generalization, robustness to domain shifts, and few-shot learning performance. We propose that this kind of representational alignment between machine learning (ML) agents and humans can also support safely learning and exploring human values. We focus on ten different aspects of human values -- including ethics, honesty, and fairness -- and train ML agents using a variety of methods in a multi-armed bandit setting, where rewards reflect human value judgments over the chosen action. We use a synthetic experiment to demonstrate that agents that have high representational alignment with the environment exhibit safer and more efficient learning behavior. We then repeat this procedure in a realistic setting, using textual action descriptions and similarity judgments collected from humans and a variety of language models, to show that representational alignment enables both safe exploration and improved generalization when grounded in a real-world context.
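The bandit setting described in the abstract can be sketched in miniature as follows. This is a toy illustration only, not the dissertation's actual implementation: the function name, the epsilon-greedy policy, and the softmax similarity smoothing are all assumptions. The idea it demonstrates is that when an agent's representation of the arms (here, embedding vectors) reflects the true structure of the rewards, value estimates generalize across similar arms, so the agent reaches high-value actions with fewer risky pulls.

```python
import numpy as np

def similarity_smoothed_bandit(embeddings, true_rewards, n_steps=500, eps=0.1, seed=0):
    """Toy multi-armed bandit where reward estimates are shared across arms
    via pairwise similarity in a (hypothetically aligned) representation."""
    rng = np.random.default_rng(seed)
    n = len(true_rewards)

    # Row-normalized similarity from embedding dot products (softmax).
    logits = embeddings @ embeddings.T
    sim = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

    estimates = np.zeros(n)   # per-arm running mean of observed rewards
    counts = np.zeros(n)
    total_reward = 0.0

    for _ in range(n_steps):
        # Smooth estimates across similar arms, then act epsilon-greedily.
        smoothed = sim @ estimates
        arm = int(rng.integers(n)) if rng.random() < eps else int(np.argmax(smoothed))

        reward = true_rewards[arm] + rng.normal(0.0, 0.1)  # noisy human value judgment
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return estimates, total_reward / n_steps
```

With well-separated embeddings and one clearly best arm, the agent concentrates its pulls on that arm and its average per-step reward approaches the optimum, while a misaligned representation would spread credit across unrelated arms and slow this convergence.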
dc.format.mimetype          application/pdf
dc.language.iso             en
dc.publisher                Princeton, NJ : Princeton University
dc.subject                  Human-Centric AI
dc.subject                  Machine Learning
dc.subject                  Representational Alignment
dc.subject                  Value Alignment
dc.subject.classification   Artificial intelligence
dc.title                    Learning Human-Like Representations to Enable Learning Human Values
dc.type                     Academic dissertations (M.S.E.)
pu.date.classyear           2024
pu.department               Computer Science
Appears in Collections: Computer Science, 2023

Files in This Item:
File                              Size      Format
Wynn_princeton_0181G_15048.pdf    1.17 MB   Adobe PDF


Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.