Computational Text Analysis Working Group (CTAWG)
Title: ML, OCR and NLP for extracting information from biological pathway figures
Presenter: Anders Riutta, Institute of Data Science and Biotechnology, Gladstone Institutes, San Francisco. Anders will present work done together with Kristina Hanspers, Martina Summer-Kutmon, and Alexander R. Pico on extracting information from biological pathway figures using ML, OCR, and NLP.
Abstract: Thousands of biological pathway diagrams are published every year as static JPGs, inaccessible to computational queries and analyses. We used computer vision ML to identify 64,643 pathway figures, OCR to extract text, and NLP to extract genes, chemicals, diseases, and other biological concepts. The human genes identified -- over 1 million instances total and 13,464 unique -- participate in a wide variety of biological processes. This collection is an order of magnitude larger than the number of genes found in the text of the same papers and includes thousands of genes missing from existing pathway databases, thus presenting new opportunities for medical discovery and research.
Related publications: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02181-2 and https://www.biorxiv.org/content/10.1101/379446v1