A Generic Framework For Network Traffic Analysis

Holland, Jordan

Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp01ks65hg373

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	Feamster, Nick; Mittal, Prateek
dc.contributor.author	Holland, Jordan
dc.contributor.other	Computer Science Department
dc.date.accessioned	2022-06-16T20:33:59Z	-
dc.date.available	2022-06-16T20:33:59Z	-
dc.date.created	2022-01-01
dc.date.issued	2022
dc.identifier.uri	http://arks.princeton.edu/ark:/88435/dsp01ks65hg373	-
dc.description.abstract	Researchers and practitioners rely on network traffic analysis techniques for a variety of critical network security and network management tasks. Ever-increasing traffic volumes and encryption rates have rendered traditional, signature-based solutions less effective. As such, newly developed methods almost universally leverage machine learning techniques. The development of new machine-learning-based traffic analysis techniques shares a common methodological pipeline: curate a network traffic dataset, create a system to separate and associate labels with the traffic (e.g. flows, applications), engineer features for the task, and finally train models using the engineered features. Although this methodological pipeline is shared across tasks, each instantiated pipeline is custom-built for the task at hand, requiring new traffic processing systems, features, and models. This dissertation questions the assumption that each stage in the shared methodological pipeline should be custom-built for each task, exploring whether several stages of the common pipeline can be better accomplished using generic techniques. First, we examine the process of feature engineering and model training--two of the most manual and painstaking steps for any traffic analysis task. We develop nPrint, a unified packet representation that is amenable to representation learning and model training for a variety of tasks. We then integrate nPrint with automated machine learning to produce nPrintML, a generic feature engineering and model training solution. Next, we study the data collection and data processing steps of the common traffic analysis pipeline. Unlike other disciplines, such as image recognition, no standard dataset format or "canonical" task exists, forcing researchers to develop custom dataset formats and processing systems for each task. We survey existing literature to show that this approach has led to a reproducibility crisis, finding that the lack of a standardized dataset format and the extensive usage of ambiguous terminology are primary causes. We use these findings to develop pcapML, a system that enables reproducible network traffic analysis by providing a standardized dataset format that removes ambiguity in the definitions of traffic analysis tasks. These contributions chart new directions in network traffic analysis, demonstrating that generic methods can outperform many custom-built approaches and significantly enhance the ability to develop, reproduce, and compare new methods.
dc.format.mimetype	application/pdf
dc.language.iso	en
dc.publisher	Princeton, NJ : Princeton University
dc.relation.isformatof	The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog: catalog.princeton.edu
dc.subject.classification	Computer science
dc.title	A Generic Framework For Network Traffic Analysis
dc.type	Academic dissertations (Ph.D.)
pu.date.classyear	2022
pu.department	Computer Science
Appears in Collections:	Computer Science

Files in This Item:

File	Description	Size	Format
Holland_princeton_0181D_14093.pdf		4.61 MB	Adobe PDF
