Research

Language is our primary and most convenient means of communication. Textual content, which ranges from social media discussions to product reviews to private physician notes, presents naturally occurring data that can be used to build computational models that better represent, enhance, and, ultimately, understand such data.

We are currently working on several key problems to reach the above goal:

Curriculum Learning for Natural Language Processing

Deep neural networks can effectively tackle many tasks, from recognizing and reasoning about objects in images to playing strategy games to modeling valid sequences of words in human language. However, these models can be computationally expensive to train, even with fast hardware. In addition, statistical and machine learning models suffer from spurious data (instances with potentially wrong labels), which results in biased predictions and catastrophic errors. How can we efficiently train models that are robust to the biases imposed by spurious data? The importance of error-free resources cannot be overstated, as errors can adversely affect interpretations of the data, models developed from the data, and decisions made based on the data. We are currently investigating schedulers that dynamically order training data points for more efficient and effective training and that detect spurious instances in datasets. Such schedulers can uncover the salient characteristics of the learners (networks) and their learning materials (training instances), and can improve the quality of existing resources, which is important for accurate and fair benchmarking.
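To make the idea of a training scheduler concrete, the following is a minimal sketch of one common curriculum heuristic, not the specific method under investigation. It assumes per-instance losses are available, exposes the easiest instances first under a linear pacing schedule, and flags instances whose loss stays high across epochs as possibly mislabeled. All function names, the starting fraction, and the threshold are hypothetical choices made for illustration.

```python
import math
from statistics import mean


def pacing_fraction(epoch, total_epochs, start=0.25):
    """Linear pacing (an assumed schedule): the fraction of the dataset
    made available at a given epoch, growing from `start` to 1.0."""
    if total_epochs <= 1:
        return 1.0
    progress = epoch / (total_epochs - 1)
    return min(1.0, start + (1.0 - start) * progress)


def schedule(losses, epoch, total_epochs):
    """Return the indices of the easiest instances (lowest current loss)
    that the learner should train on in this epoch."""
    k = max(1, math.ceil(pacing_fraction(epoch, total_epochs) * len(losses)))
    ranked = sorted(range(len(losses)), key=lambda i: losses[i])
    return ranked[:k]


def flag_suspects(loss_history, threshold):
    """Flag instances whose mean loss across epochs exceeds `threshold`;
    persistently hard instances are candidates for wrong labels."""
    return [i for i, hist in enumerate(loss_history) if mean(hist) > threshold]


# Usage with made-up losses for four training instances.
losses = [0.9, 0.1, 0.5, 0.3]
first_epoch = schedule(losses, epoch=0, total_epochs=3)  # easiest quarter only
last_epoch = schedule(losses, epoch=2, total_epochs=3)   # the full dataset
suspects = flag_suspects([[0.9, 0.8, 0.9], [0.5, 0.1, 0.05]], threshold=0.7)
```

Using loss as the difficulty signal is only one option; a flagged instance is a candidate for manual review, not proof of a wrong label.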

Relevant publications

Clinical Decision Support

Most medical information, from referral letters to physician notes to scientific articles, is locked in unstructured text and is not readily accessible. NLP and machine learning techniques offer a potential means to support clinicians with evidence and insight extracted from such data. We are investigating novel decision support systems to accelerate diagnosis for undiagnosed patients. Using patient data from the Undiagnosed Diseases Network (UDN), a nationwide program established by the National Institutes of Health to facilitate research on undiagnosed and rare diseases, we are investigating new disease-agnostic deep learning techniques to classify and triage patient applications and to pinpoint disease-causing gene variants through effective representation of multimodal patient data and reference materials about rare diseases.

Relevant publications

Social Media Surveillance

User-generated content (UGC) can provide low-cost, high-resolution views into population behavior. We are investigating effective online surveillance systems that can monitor population health and behavior at scale to detect health-related trends and outbreaks and to identify opportunities for decision making or intervention. The results can complement the knowledge available in national surveys and inform policy evaluation and improvement.

Relevant publications