Tools for the COVID-19 Open Research Dataset Challenge (CORD-19)
In response to the COVID-19 pandemic, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a resource of over 47,000 scholarly articles, including over 36,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset is provided to the global research community to apply recent advances in natural language processing and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease. There is a growing urgency for these approaches because of the rapid acceleration in new coronavirus literature, making it difficult for the medical research community to keep up.
The dataset contains all COVID-19 and coronavirus-related research (e.g. SARS, MERS) from the following sources:
PubMed's PMC open access corpus using this query (COVID-19 and coronavirus research)
Additional COVID-19 research articles from a corpus maintained by the WHO
bioRxiv and medRxiv pre-prints using the same query as PMC (COVID-19 and coronavirus research)
For the purposes of the challenge, a list of initial key scientific questions is drawn from the research topics of the NASEM’s SCIED (National Academies of Sciences, Engineering, and Medicine’s Standing Committee on Emerging Infectious Diseases and 21st Century Health Threats) and the World Health Organization’s R&D Blueprint for COVID-19. These questions are distributed across the following 9 predefined Tasks, each of which contains the most important questions with regard to this epidemic:
What is known about transmission, incubation, and environmental stability?
What do we know about COVID-19 risk factors?
What do we know about virus genetics, origin, and evolution?
What do we know about vaccines and therapeutics?
What do we know about non-pharmaceutical interventions?
What do we know about diagnostics and surveillance?
What has been published about medical care?
What has been published about ethical and social science considerations?
What has been published about information sharing and inter-sectoral collaboration?
The Intelligent Data Exploration and Analysis Laboratory (IDEAL) of the University of the Aegean has utilized its expertise in the areas of machine learning, data mining and natural language processing to address these questions. To this end, within the first two weeks of the challenge we have created the following tools and results, which explore the coronavirus literature and automatically provide insights from it:
A continuous Dataset Exploration and Discovery Service which:
Automatically clusters the papers in the CORD-19 collection based on their content. Each cluster/topic is also labelled with the most important words appearing within its papers. The methodology is based on our variation of Doc2vec paragraph embeddings, combined with an agglomerative clustering method.
Returns a list of semantically related papers for each paper that the user is browsing. This functionality encourages the continuous and consistent exploration of the content and may lead to targeted scientific discoveries.
Incorporates both a conventional keyword-based search engine and an advanced content-based search engine that retrieves the documents most semantically similar to each user query.
A Question Answering service which:
Provides answers to free-text questions issued by the user. For each question, the tool creates a highlights section with a summary of the results, along with each matching article and each article's text matches. This functionality provides explainability to the user and, more importantly, supports the validity of each generated answer, based on the corresponding literature. The methodology is based on the CORD-19 Analysis with Sentence Embeddings Kaggle kernel developed within the challenge's proposed solutions, combined with a variety of our own approaches, which we modify frequently, as shown in the Changelog.
An automated reporting service which:
Automatically generates reports from the coronavirus literature. Users can upload their list of questions and will receive an automated report with answers to these questions. As with the question answering service, for each question the service generates a highlights section with a summary of the results, along with each matching article and each article's text matches.
We also provide reports which contain answers to the questions belonging to the 9 Tasks shown above.
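The clustering, labelling and related-paper steps of the exploration service can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the lab's implementation: it swaps the Doc2vec-based paragraph embeddings for plain TF-IDF word vectors so that it runs with the Python standard library only, and it uses average-linkage merging as one possible agglomerative strategy. The toy abstracts and all function names (`tfidf_vectors`, `agglomerative`, `label`, `related_papers`) are hypothetical.

```python
import math
from collections import Counter

# Toy abstracts standing in for CORD-19 papers (invented for illustration).
PAPERS = [
    "coronavirus transmission droplets incubation period",
    "incubation period transmission household contacts",
    "vaccine candidate antibody response trial",
    "antibody response vaccine immunization trial",
]


def tfidf_vectors(docs):
    """Turn each document into a sparse TF-IDF vector (word -> weight).

    Assumption: TF-IDF is a stand-in for learned paragraph embeddings.
    """
    tokenized = [doc.lower().split() for doc in docs]
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            word: count * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
            for word, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def agglomerative(vectors, n_clusters):
    """Merge the two most similar clusters until n_clusters remain.

    Cluster similarity is the average pairwise cosine (average linkage).
    """
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sims = [cosine(vectors[i], vectors[j])
                        for i in clusters[a] for j in clusters[b]]
                score = sum(sims) / len(sims)
                if best is None or score > best[0]:
                    best = (score, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def label(cluster, vectors, top_n=3):
    """Label a cluster with the words carrying the most total TF-IDF weight."""
    totals = Counter()
    for i in cluster:
        totals.update(vectors[i])
    return [word for word, _ in totals.most_common(top_n)]


def related_papers(index, vectors, top_n=2):
    """Return the indices of the papers most similar to the given paper."""
    others = [i for i in range(len(vectors)) if i != index]
    others.sort(key=lambda i: cosine(vectors[index], vectors[i]), reverse=True)
    return others[:top_n]


vectors = tfidf_vectors(PAPERS)
for cluster in agglomerative(vectors, n_clusters=2):
    print(cluster, label(cluster, vectors))
print("related to paper 0:", related_papers(0, vectors, top_n=1))
```

With real embeddings, only `tfidf_vectors` would need to change; the merging, labelling and nearest-neighbour lookup operate on whatever vectors they are given, provided the similarity function matches the vector representation.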
The links below lead to the results of this ongoing research. Please note that the CORD-19 dataset is continuously updated with new papers appearing in the coronavirus literature. Accordingly, the results presented in the following links will be continuously updated, and our methodology will also be revised over time. Please also bear in mind that, due to the urgency of the situation, many papers included in the collection have not been properly peer-reviewed.
The automated reporting service accepting lists of questions issued by users was added to the Interactive tools (Updated 14 April 2020 on this site).
In our lab we have developed approaches similar to the Kaggle kernel we use, as shown for example in this Jupyter notebook snapshot demo for finding "concept" embeddings and extracting concepts/labels for documents.
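The idea behind sentence-level question answering (match a free-text question against individual sentences and return each matching article with its text matches) can be illustrated with a small sketch. It is an assumption-laden stand-in rather than the Kaggle kernel or the lab's method: a bag-of-words count vector replaces a learned sentence embedding, and the mini-corpus, the `min_score` threshold and the function names (`embed`, `answer`) are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical mini-corpus: article title -> body text (invented for illustration).
ARTICLES = {
    "Paper A": "The incubation period was about five days. "
               "Fever was common among patients.",
    "Paper B": "Masks reduced transmission in households. "
               "The incubation period varied by age.",
    "Paper C": "Vaccine trials measured antibody response.",
}


def embed(text):
    """Stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def answer(question, articles, top_n=2, min_score=0.2):
    """Return the best (score, title, sentence) matches for a question.

    Every sentence of every article is scored against the question;
    sentences below min_score are dropped and the rest are ranked.
    """
    query = embed(question)
    matches = []
    for title, body in articles.items():
        # Naive sentence splitter: break after ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", body.strip()):
            score = cosine(query, embed(sentence))
            if score >= min_score:
                matches.append((score, title, sentence))
    matches.sort(key=lambda match: match[0], reverse=True)
    return matches[:top_n]


for score, title, sentence in answer("What is the incubation period?", ARTICLES):
    print(f"{title}: {sentence} (score {score:.2f})")
```

Returning the matching sentence together with its article title mirrors the explainability goal described for the service: each answer can be traced back to the text that produced it.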