High-Dimensional Similarity Search for Large Datasets

Dong, Wei

Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp01pc289j087

Title:	High-Dimensional Similarity Search for Large Datasets
Authors:	Dong, Wei
Advisors:	Li, Kai
Contributors:	Computer Science Department
Keywords:	k-nearest neighbor locality-sensitive hashing near-duplicate sketch
Subjects:	Computer science
Issue Date:	2011
Publisher:	Princeton, NJ : Princeton University
Abstract:	The volume of images and other non-text data is growing exponentially in today's digital universe. A popular way of extracting useful information from such data is to conduct content-based similarity search, e.g., to search for images with content similar to a query image. How to build data structures to support efficient similarity search on a large scale is an issue of increasing importance. The challenge is that feature-rich data are usually represented as high-dimensional feature vectors, and the curse of dimensionality dictates that as dimensionality grows, any search strategy examines an increasingly large portion of the dataset and eventually degenerates to brute-force scan. In this dissertation, we study several key issues to improve the accuracy and efficiency of high-dimensional similarity search. Specifically, we make the following contributions: First, we enhance Multi-Probe LSH, the state-of-the-art high-dimensional similarity search technique, by developing an accurate performance model that predicts search accuracy with an error rate of less than 5% using only a 1/10 sample of the whole dataset. We use this model to realize automatic performance tuning, which would otherwise be tedious, and to develop an adaptive query processing strategy to ensure the high search quality of each individual query. Second, we develop an efficient offline method for simultaneously finding the K nearest neighbors of every point in the dataset. Our method is 2 to 16 times faster than the state-of-the-art technique when evaluated with different datasets. Furthermore, we show how to use the offline computed nearest-neighbor information to double the speed of online similarity search. Third, we develop compact representations of high-dimensional feature vectors optimized for similarity search tasks. We present an asymmetric distance estimation framework to exploit the information in the uncompressed query data. We use that to further compress the indexed dataset, by 10% to 40%, without compromising search quality. Fourth, we develop a scheme to compactly represent sets of feature vectors, an increasingly popular data representation that is more accurate than single vectors, but also more expensive. Our method dramatically reduces the matching cost in both space and time. In an image classification task, our method achieves 25 times speedup over one of the best existing techniques. Finally, we demonstrate some of the proposed techniques by building a large-scale near-duplicate image search engine. Our system serves more than 50 million web images on a single commodity server and responds to queries within a few seconds.
URI:	http://arks.princeton.edu/ark:/88435/dsp01pc289j087
Alternate format:	The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog
Type of Material:	Academic dissertations (Ph.D.)
Language:	en
Appears in Collections:	Computer Science

Files in This Item:

File	Description	Size	Format
Dong_princeton_0181D_10018.pdf		1.56 MB	Adobe PDF	View/Download

Show full item record

Search

Browse