Tools

  • TGCL. TGCL is a spaced repetition framework designed to enhance the training efficacy of GNNs and to shed light on their learning dynamics. It schedules examples for training, optimizing their timing and order based on the evolving difficulty of the training data, which it estimates through a combination of multiview graph and text complexity formalisms. TGCL can effectively tailor the curriculum to the unique learning dynamics of each model and can learn curricula that are transferable across different GNN models and datasets.
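
A minimal sketch of the spaced repetition scheduling idea, assuming review intervals double for samples the model currently handles well and reset for hard ones; the class, threshold, and loss signal below are illustrative, not TGCL's actual API:

```python
import numpy as np

class SpacedRepetitionScheduler:
    """Illustrative scheduler: well-learned samples are reviewed less
    often, while hard samples recur sooner."""

    def __init__(self, n_samples, max_delay=8):
        self.next_review = np.zeros(n_samples, dtype=int)  # epoch of next review
        self.delay = np.ones(n_samples, dtype=int)         # current review interval
        self.max_delay = max_delay

    def due(self, epoch):
        # Indices of samples scheduled for training at this epoch.
        return np.where(self.next_review <= epoch)[0]

    def update(self, epoch, indices, losses, threshold=0.5):
        # Double the interval for well-learned samples (low loss),
        # reset it for samples the model still struggles with.
        for i, loss in zip(indices, losses):
            self.delay[i] = min(self.delay[i] * 2, self.max_delay) if loss < threshold else 1
            self.next_review[i] = epoch + self.delay[i]
```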

  • Ling-CL. The linguistic curriculum learning algorithm has three features: (a) estimating the importance of linguistic indices using a data-driven approach, (b) applying a “linguistic curriculum” to enhance the model's performance from a linguistic perspective, and (c) identifying the core set of linguistic indices needed to learn a task. The tool also evaluates a model's ability to handle different linguistic indices.
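
One simple, data-driven way to estimate index importance is to correlate each linguistic index with the model's per-sample loss; the estimator below is an illustrative stand-in, not necessarily the one Ling-CL uses:

```python
import numpy as np

def index_importance(index_values, losses):
    """Score each linguistic index by the absolute Pearson correlation
    between its per-sample values and the model's per-sample loss.

    index_values: (n_samples, n_indices) linguistic index matrix
    losses:       (n_samples,) current per-sample loss
    """
    z = (index_values - index_values.mean(0)) / (index_values.std(0) + 1e-8)
    zl = (losses - losses.mean()) / (losses.std() + 1e-8)
    return np.abs((z * zl[:, None]).mean(0))  # higher => index tracks difficulty
```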

  • MCCL. Multiview Competence-based Curriculum Learning (MCCL) is a curriculum learning framework for GNNs that builds on graph complexity formalisms (as difficulty criteria) and model competence during training. Its scheduling scheme derives curricula that account for several views of sample difficulty at once, pacing the training data according to the model's current competence.
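
A sketch of competence-based pacing with multiple difficulty views, assuming the square-root competence function common in competence-based curriculum learning and max-aggregation across views; both are illustrative choices rather than MCCL's exact formulation:

```python
import numpy as np

def competence(t, T, c0=0.1):
    # Fraction of the difficulty-sorted data available at training step t of T.
    return min(1.0, np.sqrt(t * (1 - c0 ** 2) / T + c0 ** 2))

def eligible(difficulties, t, T):
    """difficulties: (n_samples, n_views) array, each view scaled to [0, 1].
    A sample is admitted once its aggregated difficulty falls within the
    model's current competence."""
    agg = difficulties.max(axis=1)  # hardest view dominates (one possible choice)
    return np.where(agg <= competence(t, T))[0]
```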

  • HuCurl. A framework for automatically finding an optimal curriculum for a given dataset and model.

  • GTNN. Graph Text Neural Network (GTNN) is a generic, trend-aware curriculum learning approach that effectively integrates textual and structural information in text graphs for relation extraction between entities, which are treated as node pairs in graphs. The model extends existing curriculum learning approaches by incorporating sample-level loss trends to better discriminate easier from harder samples and schedule them for training.
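
A sketch of the loss-trend signal, assuming the trend is taken as the least-squares slope of each sample's recent losses; the window size and function name are illustrative:

```python
import numpy as np

def loss_trend(loss_history, window=5):
    """Least-squares slope of each sample's recent losses.

    loss_history: (n_samples, n_epochs) per-sample loss per epoch.
    A clearly negative slope suggests the sample is being learned;
    a flat or rising slope marks it as currently hard.
    """
    recent = loss_history[:, -window:]
    x = np.arange(window) - (window - 1) / 2.0  # centered time steps
    return (recent * x).sum(axis=1) / (x ** 2).sum()
```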

  • AMNN. The Attentive Multiview Neural Network (AMNN) is a framework that combines different views (representations) of the same input through effective data fusion and attention strategies for ranking purposes. We developed the model to find the diseases most likely to match the clinical descriptions of patients, using data from the Undiagnosed Diseases Network.
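
A minimal sketch of attention-based view fusion in PyTorch; the module below shows the general pattern (score each view embedding, softmax over views, weighted sum) rather than AMNN's actual architecture:

```python
import torch
import torch.nn as nn

class AttentiveFusion(nn.Module):
    """Fuse several view embeddings of the same input with learned attention."""

    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # scores each view embedding

    def forward(self, views):
        # views: (batch, n_views, dim), one embedding per view of the input
        weights = torch.softmax(self.score(views).squeeze(-1), dim=-1)
        return (weights.unsqueeze(-1) * views).sum(dim=1)  # (batch, dim)
```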

  • RbF Curriculum Learner. This tool demonstrates spaced repetition for training any neural network. RbF is inspired by broad evidence in psychology showing that the human ability to retain information improves with repeated exposure and decays exponentially with the delay since last exposure. Accordingly, training instances are repeatedly presented to the network on a schedule determined by a spaced repetition algorithm, which shortens or lengthens the review interval of each instance with respect to a few indicators.
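
The scheduling intuition can be sketched as follows, assuming an exponential forgetting curve in which recall decays with delay, and more slowly for well-learned items; the parametrization and threshold are illustrative, not RbF's exact model:

```python
import math

def recall_probability(delay, strength):
    # Exponential forgetting curve: recall decays with time since the last
    # review, and more slowly for well-learned ("stronger") instances.
    return math.exp(-delay / strength)

def review_interval(strength, threshold=0.7):
    # Longest delay for which estimated recall stays above the threshold;
    # the instance is re-presented to the network after this many steps.
    return strength * math.log(1.0 / threshold)
```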

  • Leitner system. An implementation of the Leitner system for simultaneously training a given neural network and identifying spurious instances (those with wrong labels) in its input dataset.
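
A minimal sketch of the Leitner scheme, assuming the textbook setup in which queue k is reviewed every 2**k epochs; the miss-count heuristic for flagging spurious instances is illustrative, not necessarily this implementation's rule:

```python
class LeitnerQueues:
    """Textbook Leitner scheme: queue k is reviewed every 2**k epochs."""

    def __init__(self, n_samples, n_queues=5):
        self.queue = [0] * n_samples   # every sample starts in the first queue
        self.misses = [0] * n_samples  # times the network got a sample wrong
        self.n_queues = n_queues

    def due(self, epoch):
        return [i for i, q in enumerate(self.queue) if epoch % (2 ** q) == 0]

    def update(self, i, correct):
        if correct:
            self.queue[i] = min(self.queue[i] + 1, self.n_queues - 1)
        else:
            self.misses[i] += 1
            self.queue[i] = 0  # demote to the first queue on a miss

    def suspicious(self, min_misses=3):
        # Samples the network repeatedly fails to retain; label-noise candidates.
        return [i for i, m in enumerate(self.misses) if m >= min_misses]
```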

  • Twitter Crawler. A crawler for searching Twitter and collecting tweets and user networks.

Datasets

  • Cancer Type Dataset: This dataset was developed to obtain population-level statistics of cancer patients. It contains 3.8k Reddit posts, each annotated by at least three annotators for relevance to specific cancer types. If you use this dataset, please cite (Elgaar et al., 2023).

  • Alcohol Risk Dataset: This dataset was developed to obtain population-level statistics of alcohol use reports on social media. It contains more than 9k tweets, each annotated by at least three workers for report of first-person alcohol use, intensity of drinking (light vs. heavy), context of drinking (social vs. individual), and time of drinking (past, present, or future). If you use this dataset, please cite (Amiri et al., 2018).

  • Churn Dataset: This dataset contains labeled tweets about three telco brands: Verizon, AT&T, and T-Mobile. Tweets are labeled as churny or non-churny, where churny tweets indicate a high risk of canceling the brand's service. Labels were obtained through crowdsourcing, with each tweet labeled by at least three annotators; Fleiss' kappa is 0.62, indicating substantial inter-annotator agreement. In addition, 1,073 T-Mobile instances were independently labeled by one of our team members; Cohen's kappa between these labels and the aggregated judgments of the three annotators is 0.93, indicating almost perfect agreement. If you use this dataset, please cite (Amiri et al., 2015).