When & Where
Date: Fri, March 5, 2021 - 1:00 PM to 2:00 PM
Location: Join via Zoom
Description

Computational Text Analysis Working Group (CTAWG)

Title: ML, OCR and NLP for extracting information from biological pathway figures

Presenter: Anders Riutta, Institute of Data Science and Biotechnology, Gladstone Institutes, San Francisco. Anders will present work done with Kristina Hanspers, Martina Summer-Kutmon, and Alexander R. Pico on extracting information from biological pathway figures using NLP, ML, and OCR.

Abstract: Thousands of biological pathway diagrams are published every year as static JPGs, inaccessible to computational queries and analyses. We used computer vision ML to identify 64,643 pathway figures, OCR to extract text, and NLP to extract genes, chemicals, diseases, and other biological concepts. The human genes identified (over 1 million instances in total, 13,464 of them unique) participate in a wide variety of biological processes. This collection is an order of magnitude larger than the number of genes found in the text of the same papers and includes thousands of genes missing from existing pathway databases, thus presenting new opportunities for medical discovery and research.

Related publications:
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02181-2
https://www.biorxiv.org/content/10.1101/379446v1

Primary Tool: None
Details
Training Learner Level: Mixed Learning Levels