Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp016682x619b
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | Blei, David M | en_US
dc.contributor.author | Gopalan, Prem Krishna | en_US
dc.contributor.other | Computer Science Department | en_US
dc.date.accessioned | 2015-02-08T18:13:43Z | -
dc.date.available | 2015-02-08T18:13:43Z | -
dc.date.issued | 2015 | en_US
dc.identifier.uri | http://arks.princeton.edu/ark:/88435/dsp016682x619b | -
dc.description.abstract | Recent years have seen explosive growth in data, models and computation. Massive data sets and sophisticated probabilistic models are increasingly used in the fields of high-energy physics, biology, genetics and in personalization applications; however, many statistical algorithms remain inefficient, impeding scientific progress. In this thesis, we present several efficient statistical algorithms for learning from massive discrete data sets. We focus on discrete data because complex and structured activity such as chromosome folding in three dimensions, human genetic variation, social network interactions and product ratings is often encoded as simple matrices of discrete numerical observations. Our algorithms derive from a Bayesian perspective and lie in the framework of directed graphical models and mean-field variational inference. Situated in this framework, we gain computational and statistical efficiency through modeling insights and through subsampling informative data during inference. We begin with additive Poisson factorization models for recommending items to users based on user consumption or ratings. These models provide sparse latent representations of users and items, and capture the long-tailed distributions of user consumption. We use them as building blocks for article recommendation models by sharing latent spaces across readership and article text. We demonstrate that our algorithms scale to massive data sets, are easy to implement and provide competitive user recommendations. Then, we develop a Bayesian nonparametric model in which the latent representations of users and items grow to accommodate new data. In the second part of the thesis, we develop novel algorithms for discovering overlapping communities in large networks. These algorithms interleave non-uniform subsampling of the network with model estimation. Our network models capture the basic ways in which nodes connect to each other, through similarity and popularity, using mixed-membership representations and a generalized linear model formulation. Finally, we present the TeraStructure algorithm to fit Bayesian models of genetic variation in human populations on tera-sample-sized data sets (10^{12} observed genotypes, e.g., 1M individuals at 1M SNPs). On real genomic data collected from thousands of individuals, TeraStructure is faster than existing methods and recovers the latent population structure with equal accuracy. On genomic data simulated at the tera-sample-size scale, TeraStructure is highly accurate and is the only method that can complete its analysis. | en_US
dc.language.iso | en | en_US
dc.publisher | Princeton, NJ : Princeton University | en_US
dc.relation.isformatof | The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the <a href="http://catalog.princeton.edu">library's main catalog</a>. | en_US
dc.subject | latent variable models | en_US
dc.subject | machine learning | en_US
dc.subject | networks | en_US
dc.subject | recommendation systems | en_US
dc.subject | scalable data analysis | en_US
dc.subject | variational inference | en_US
dc.subject.classification | Computer science | en_US
dc.title | Scalable inference of discrete data: user behavior, networks and genetic variation | en_US
dc.type | Academic dissertations (Ph.D.) | en_US
pu.projectgrantnumber | 690-2143 | en_US
Appears in Collections: Computer Science

Files in This Item:
File | Description | Size | Format
Gopalan_princeton_0181D_11226.pdf | | 4.09 MB | Adobe PDF


Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.