Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp01f7623g02z
Title: Inference of DNA Community Interactions in Hi-C Contact Data
Authors: Luo, Mo
Advisors: Abbe, Emmanuel
Contributors: Cuff, Paul
Department: Electrical Engineering
Class Year: 2016
Abstract: The Poisson model community detection (CD) algorithm [1] is applied to Hi-C contact data. The Hi-C contact data are preprocessed in various ways in an attempt to identify communities at various distances relative to each other. We find the following about the location of communities within chromosomes in a subsection of human chromosome 14 in cell line GM06990: 1) we are able to detect communities that are relatively close in the 1-D strand of DNA, where a large number of interactions exist between nodes; 2) we are also able to detect communities of nodes that are separated by a larger distance; 3) the vast majority of communities, and the most dominant ones, are among nodes in 1-D proximity. Without preprocessing our data, we find k=6 communities, the same number as Cabreros et al. [3]. By eliminating low-contact interactions, the number of communities drops to 5. By eliminating interactions along the main diagonal (1-D proximity), we detect 3 communities. Additionally, we verified that similar behavior is mostly observable when applying the same techniques to mouse chromosomes 1-5. We do find, however, that changing the restriction enzyme used to create the Hi-C data can substantively affect clustering results. This could be because any variability in the main band can greatly skew the clustering. Most notably, however, removing the main diagonal band, up to a certain point, actually makes detecting more communities possible. Finally, we adapted the adjusted mutual information score to compare our clustering results and find that the clustering results on the preprocessed data seem relatively similar, even though the preprocessing techniques removed opposite nodes. The results from unpreprocessed data, however, have a low adjusted mutual information score with all other clustering results. While the results produced by CD are for the most part intuitive, some are difficult to explain and may require new theories and understandings about the 3D structure of DNA.
Extent: 54 pages
URI: http://arks.princeton.edu/ark:/88435/dsp01f7623g02z
Type of Material: Princeton University Senior Theses
Language: en_US
Appears in Collections: Electrical and Computer Engineering, 1932-2023
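
The two preprocessing steps named in the abstract (eliminating low-contact interactions and eliminating interactions along the main diagonal) can be illustrated with a minimal sketch on a toy symmetric contact matrix. The function names, the `band_width` and `min_count` parameters, and the toy values are all assumptions made for illustration; this record does not state the thresholds or the implementation used in the thesis.

```python
def remove_diagonal_band(contacts, band_width):
    """Zero every entry with |i - j| <= band_width (hypothetical helper).

    This mimics dropping interactions between bins that are close
    together along the 1-D strand.
    """
    n = len(contacts)
    return [
        [0 if abs(i - j) <= band_width else contacts[i][j] for j in range(n)]
        for i in range(n)
    ]


def drop_low_contacts(contacts, min_count):
    """Zero every entry below min_count (hypothetical helper)."""
    return [[c if c >= min_count else 0 for c in row] for row in contacts]


# Toy 4x4 symmetric contact matrix; the counts are made up for illustration.
toy = [
    [9, 5, 1, 0],
    [5, 8, 4, 2],
    [1, 4, 7, 6],
    [0, 2, 6, 9],
]

no_band = remove_diagonal_band(toy, band_width=1)
filtered = drop_low_contacts(toy, min_count=4)
```

On the toy matrix, `no_band` keeps only the long-range entries, while `filtered` keeps mostly the near-diagonal ones. This matches the abstract's remark that the two preprocessing techniques remove opposite parts of the data.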

Files in This Item:
File: Luo_Mo_seniorthesis.pdf
Size: 1.26 MB
Format: Adobe PDF


Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.