Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp016969z416r
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | Bousso Dieng, Adji |
dc.contributor.advisor | Troyanskaya, Olga |
dc.contributor.author | Ferragu, Constance |
dc.contributor.other | Computer Science Department |
dc.date.accessioned | 2024-08-08T18:39:06Z | -
dc.date.available | 2024-08-08T18:39:06Z | -
dc.date.created | 2024-01-01 |
dc.date.issued | 2024 |
dc.identifier.uri | http://arks.princeton.edu/ark:/88435/dsp016969z416r | -
dc.description.abstract | Protein sequence discrete diffusion models have become increasingly valuable for the design of novel proteins, due to their robust generative capabilities and effective bi-directional processing of sequence data. To improve the generation of sequences with desired functions, gradient guidance methods are often used to guide sampling with a discriminative model. However, sampling from these models presents challenges due to the vast and discrete nature of the sequence space. These models tend to prioritize denoising high-likelihood tokens, resulting in similar sequences. Furthermore, gradient guidance methods tend to collapse generation to fewer modes. Protein design pipelines tend to work with fixed-size batches of sequences. Hence, given the high cost of experimental validation, optimizing sample efficiency of these batches is essential. In this thesis, we propose Vendi Guidance, a guided diffusion sampling algorithm designed to improve the exploration efficiency of sequence space and the diversity of sampled sequence sets. Our method leverages the Vendi Score, a statistical measure of diversity, to select edit positions that will most effectively improve the diversity objective and to guide the model’s hidden representations towards diverse denoising steps. We demonstrate that Vendi Guidance can iteratively refine a seed sequence into a more diverse set of sequences, while ensuring that the quality of the sequences does not deteriorate. |
dc.format.mimetype | application/pdf |
dc.language.iso | en |
dc.publisher | Princeton, NJ : Princeton University |
dc.subject.classification | Computer science |
dc.title | Diversity-Guided Sampling for Protein Sequence Design via Iterative Refinement |
dc.type | Academic dissertations (M.S.E.) |
pu.date.classyear | 2024 |
pu.department | Computer Science |
Appears in Collections: Computer Science, 2023

Files in This Item:
File | Description | Size | Format
Ferragu_princeton_0181G_15055.pdf | | 3.33 MB | Adobe PDF


Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.