Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp01sn00b192c
Title: Reconciliation-Based Methods for Identifying the Evolutionary Origins of Tandem Duplications in Repeat Domain Families
Authors: Aluru, Chaitanya
Advisors: Singh, Mona
Contributors: Computer Science Department
Subjects: Computer science
Evolution & development
Bioinformatics
Issue Date: 2022
Publisher: Princeton, NJ : Princeton University
Abstract: Domains are the structural, functional, and evolutionary building blocks of protein sequences. Proteins can contain multiple domain instances, and duplications and losses of these domains are a key driver of protein evolution. Of particular interest are families of proteins with consecutive repeats of the same domain. These tandem repeat families are involved in a wide variety of functions, including transcriptional regulation, protein transport, muscle contraction, brain size regulation, and many others. Proteins with tandemly repeated domains form a significant portion of the proteome across the tree of life. Despite their prevalence and importance, the evolutionary histories and functional diversification of many of these protein families are largely unknown. Understanding when domains duplicate, whether individually or together as part of an array of domains, could yield deeper insights into the functions of these proteins. Several attempts have been made to understand the evolution of repeat domains within protein sequences. These approaches can largely be categorized into sequence-based and reconciliation-based methods. Sequence-based approaches attempt to identify the existence of tandem duplications, without placing them in an evolutionary context. Reconciliation-based methods, on the other hand, use gene and domain trees to simultaneously infer both tandem duplication events and the genes they occurred in. These methods, while more powerful, have not accurately captured tandem duplication events. In this work, we bridge the gap between these two methods, developing reconciliation-based methods that can accurately identify tandem domain duplication events while also placing them correctly in the evolutionary history of their gene families. We extend existing reconciliation frameworks to include flexible cost models for duplication events. Rather than fixed costs regardless of duplication size, we represent costs as arbitrary functions of duplication length. We tackle the problem of distinguishing tandem duplications from other duplication events by incorporating sequence position information from existing domains. We provide both exact solutions and fast, accurate heuristics to these problems. Finally, we apply these approaches to the largest repeat domain family in humans, the Cys2-His2 zinc fingers. In analysis of 494 Cys2-His2 zinc finger orthogroups, we find evidence of numerous tandem domain duplications throughout the placental mammals.
URI: http://arks.princeton.edu/ark:/88435/dsp01sn00b192c
Alternate format: The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog: catalog.princeton.edu
Type of Material: Academic dissertations (Ph.D.)
Language: en
Appears in Collections: Computer Science

Files in This Item:
File: Aluru_princeton_0181D_13987.pdf
Size: 2.87 MB
Format: Adobe PDF


Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.