NLP in public health monitoring
In this work, our goal is to generate better models of virus migration and evolution (phylogeography). Funded by an R01 grant from the NIAID, as multi-PI with Matthew Scotch, MD, we advanced methods that expand the geospatial metadata in GenBank records by linking geospatial and host metadata from the publication associated with each record. Out of this work linking GenBank records and the literature, we also made available a state-of-the-art deep neural network approach to extracting geographic location mentions from the literature, which is more generally applicable to public health monitoring, and investigated whether this more refined assignment of geographic location results in improved phylogeographic models. We organized a shared task on this topic, run as Task 12 of SemEval 2019, with participation by 18 teams from around the world.
Supervised machine-learning methods were used for the extraction of diseases and genes with BANNER, the best-performing openly available gene-tagging system for years after its release.
Social Media Mining for Pharmacovigilance
Publishing, in 2010, the first paper on extracting adverse reactions from social media postings led to R01 funding from the NLM to explore the topic further, in order to assess whether, and under which circumstances, social media postings by consumers could be a valid source of signals for pharmacovigilance. A significant finding, for example, was the uncovering of behaviors underlying non-adherence to statins. With more than 35 publications and now in its second cycle of funding, the work has gained international attention, with publications in prestigious journals such as JAMA Open and JAMIA. Work on pharmacovigilance was further extended into the toxicovigilance realm, detecting prescription medication abuse and advancing language-processing methods suited to social media, showing the potential impact of this area of research on both pharmacovigilance and social media mining in general.
Mining user-generated data for pregnancy studies
Although pregnancy and reproductive health remain at the top of the list of women’s health concerns, and birth defects are the leading cause of infant mortality in the United States, methods for observing pregnancies and their outcomes remain limited. We identified a cohort of women whose pregnancies with birth defect outcomes could be observed via their publicly available tweets. We then used their timelines to conduct an observational case-control study in which we compared select risk factors among the women reporting a birth defect outcome (cases) and users for whom we did not detect a birth defect outcome, selected from the same database (controls). The study found that reports of medication exposure were statistically significantly greater among the cases than among the controls. The users we studied were posting tweets not only during pregnancy; most were also posting in the period leading up to their pregnancy and before they could have been aware they were pregnant. Thus, our social media mining approach can be used to observe risk factors in the periconceptional period and early in the first trimester. We also studied an alternative source of data, health forums, to examine the information-seeking patterns of pregnant women: what questions do they pose in a forum, and at what point during pregnancy? This could shed light on effective communication from clinicians and other health care professionals, as women could be exposed to misinformation if their information needs are not met.
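The case-control comparison described above can be illustrated with a minimal sketch. It assumes exposure is summarized as a 2x2 table (exposed or unexposed, cases or controls) and uses an odds ratio with a one-sided Fisher's exact test. This is one common way to compare exposure between groups and is not necessarily the analysis used in the study. The function names and the counts in the usage example are hypothetical and serve only as illustration.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table.

    a = exposed cases, b = unexposed cases,
    c = exposed controls, d = unexposed controls.
    """
    if b * c == 0:
        return float("inf")  # undefined ratio; reported as infinity here
    return (a * d) / (b * c)


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact p-value.

    Probability, under the hypergeometric null of no association,
    of seeing at least `a` exposed users among the cases.
    """
    n_cases = a + b
    n_exposed = a + c
    n_total = a + b + c + d
    denom = comb(n_total, n_cases)
    upper = min(n_cases, n_exposed)
    numer = 0
    for k in range(a, upper + 1):
        # k exposed users among the cases, the rest of the cases unexposed
        numer += comb(n_exposed, k) * comb(n_total - n_exposed, n_cases - k)
    return numer / denom


if __name__ == "__main__":
    # Hypothetical counts, NOT taken from the study.
    exposed_cases, unexposed_cases = 30, 170
    exposed_controls, unexposed_controls = 15, 185
    print("odds ratio:", odds_ratio(exposed_cases, unexposed_cases,
                                    exposed_controls, unexposed_controls))
    print("one-sided p:", fisher_exact_greater(exposed_cases, unexposed_cases,
                                               exposed_controls, unexposed_controls))
```

An odds ratio above 1 together with a small p-value would indicate that exposure reports are more frequent among cases than controls. A real analysis would also need to account for how cases and controls were selected.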
Automatic concept extraction and normalization from biomedical literature
Our NLP approaches to extracting knowledge from biomedical literature are top-ranked at international competitions such as BioCreative, and draw on diverse techniques. Supervised machine-learning methods were used for the extraction of diseases and genes with BANNER, a system that is, to this date, the best-performing openly available gene-tagging system. We published the first system for inter-species gene normalization using contextual clues for disambiguation, as well as a system for relationship extraction of protein-protein interactions from full-text articles. These methods can be used as part of larger-context biomedical problems. For example, relationships extracted from biomedical literature can support findings about adverse drug reactions found in clinical records or in social media, or be useful for semi-supervised methods for normalization using semantic vectors. More recently, we proposed a framework that uses natural language processing (NLP) for the automatic extraction and normalization of relevant geospatial data from the literature. We then used these locations and the estimates as observation error in the creation of phylogeographic models of zoonotic virus spread.