Extracting Information about Scientific Workflows from the Literature
Abstract
Scientific workflows provide bioinformaticians with a means to represent and exchange their analysis pipelines and to ensure their reproducibility. Workflows are described in the literature (text) and/or stored in workflow repositories (code). A major challenge in improving workflow reuse is rebuilding the link between the documentation (text) and the workflow code.
Working from workflow descriptions found in the full text of English-language articles, we propose a method for representing and extracting information about the components of workflows. We present a corpus of 24 articles annotated with a schema made of 16 entities and 10 relations. We use this corpus to train and evaluate statistical models for extracting information about workflows. The results show that the task is feasible and are a first step towards integrating workflow information from the literature with that held in workflow repositories.