Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp013x816m72k
Title: Machine Learning Techniques for the Diagnosis of Pediatric Tuberculosis
Authors: Coston, Amanda
Advisors: Schapire, Robert
Department: Computer Science
Class Year: 2013
Abstract: The goal of this project was twofold: first, to improve the performance of machine learning algorithms for the diagnosis of pediatric tuberculosis, and second, to use machine learning algorithms to better understand the problem of diagnosis. We constructed and examined Bayes nets using a MATLAB toolbox by Kevin Murphy, and we experimented with 26 other machine learning algorithms in the Weka software package. We found that while the Bayes nets have better accuracy when we initialize parameters based on medical knowledge, creating our own structure based on medical knowledge did not increase performance; a naive Bayes net does better than our handcrafted Bayes net. Neither the Bayes nets nor any of the Weka algorithms performed at the level necessary for use in real medical settings. Calibration curves show that the predicted probabilities of the Bayes nets and Weka algorithms do not correspond to the probability of positive diagnosis. Among the Weka algorithms, we found that decision algorithms generally have better performance, with the alternating decision tree and the ensemble methods (bagging and AdaBoost) on decision stumps performing the best. Overall, false negative rates are much higher than false positive rates, which does not bode well for practical applications, since false negatives have severe consequences in practice. We found that we could lower the false negative rates and generally improve the performance of the Bayes nets by guessing the label of unknown instances, a method we call predictive labeling. Using a variety of algorithms, we also tested which features were most important to diagnosis. The structure of alternating decision trees as well as traditional decision trees contributed to our understanding. We also randomized the data for each feature to see which had the greatest effect on performance, reasoning that the feature whose randomization had the greatest effect would be the most important.
In addition, we implemented an explanation algorithm that selects, for each patient, the feature whose absence would change the probability of diagnosis the most. Using these algorithms, we found that the most important features for diagnosis were malaise and weight loss. Moving forward, we recommend obtaining larger and more comprehensive data sets that may yield better performance from the Bayes nets and other machine learning algorithms.
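The feature-randomization test described in the abstract can be illustrated with a minimal sketch. This is not the thesis's code (the work itself used MATLAB and Weka); it assumes a generic `predict` function and a tiny synthetic data set with no clinical meaning. The names `accuracy`, `permutation_importance`, and `toy_predict` are hypothetical, introduced only for illustration. Each feature column is shuffled in turn, and the average drop in accuracy is taken as that feature's importance.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows whose predicted label matches the true label."""
    correct = sum(1 for row, label in zip(X, y) if predict(row) == label)
    return correct / len(y)


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Average accuracy drop when each feature column is shuffled.

    A larger drop suggests the classifier relies more on that feature.
    """
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle column j only; all other columns stay intact.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Synthetic toy data (assumption: no relation to the thesis data set).
# The label equals feature 0, and feature 1 is irrelevant noise.
X = [[1, 0], [1, 1], [0, 0], [0, 1], [1, 0], [0, 1], [1, 1], [0, 0]]
y = [row[0] for row in X]


def toy_predict(row):
    """Hypothetical stand-in classifier that looks only at feature 0."""
    return row[0]


scores = permutation_importance(toy_predict, X, y)
print(scores)
```

In this toy setup, shuffling feature 1 cannot change any prediction, so its importance is zero, while shuffling feature 0 lowers accuracy. On real data the ranking depends on the classifier and on correlations between features, so results of this kind are suggestive rather than conclusive.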
Extent: 68 pages
URI: http://arks.princeton.edu/ark:/88435/dsp013x816m72k
Access Restrictions: Walk-in Access. This thesis can only be viewed on computer terminals at the Mudd Manuscript Library.
Type of Material: Princeton University Senior Theses
Language: en_US
Appears in Collections: Computer Science, 1987-2023
Princeton School of Public and International Affairs, 1929-2023

Files in This Item:
File: Amanda_COSTON_Jocelyn_TANG_.pdf
Size: 2.53 MB
Format: Adobe PDF


Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.