The (Re)usable Data Project

Inspired by the efforts of scientists around the world and the game-changing work of projects like the Creative Commons, the Wikimedia Foundation, and the Free Software movement, we hope to engage the larger community in an open and fruitful discussion on issues concerning the use and reuse of scientific data, including the balance between openness and making ends meet in an increasingly competitive environment.

If you would like to join our efforts to highlight the use and reuse of data in the sciences, please feel free to contact us on our tracker, create a pull request against our repository, or join our forum.

Who we are

We are not lawyers and this is not legal advice: all institutions and groups have their own perspectives and counsel. We are a group of scientists, engineers, librarians, and specialists who are concerned about the use and reuse of increasingly interconnected, derived, and reprocessed data. We want to make sure that data-driven scientific endeavors can work with one another in meaningful ways without undue legal concerns.

The (Re)usable Data Project is meant to provide a resource that looks at some of the issues around the reuse of scientific data and to open a conversation about how to deal with them.

We also want to actively work with the community in considering our criteria and in making sure that our information about scientific data resources is up-to-date and correct. If you have any questions, concerns, or see any problems, please open a ticket on our GitHub tracker.

What this is

The initial driving concern of this project is the use and reuse of biological and biomedical data. However, this is a general problem in the scientific community and needs to be addressed directly.
For each resource, using our criteria, we attempt to objectively assign zero to five stars for how well we believe a resource's data may be built upon, edited, modified, and redistributed.
Roughly speaking:

  • 5 stars ★ ★ ★ ★ ★
    The license unambiguously allows the unfettered (re)use and redistribution of the data.
  • 4 stars ★ ★ ★ ★
    The license unambiguously allows (re)use and redistribution of the data under some terms.
  • 3 stars ★ ★ ★
    The license is clearly stated, unambiguous, and of a standard type, and has clear access, but has terms that may greatly impact the (re)use and redistribution of the data.
  • 2.5 or fewer stars ★ ★ ½ - ∅
    There are likely issues in definitively finding the license, ambiguities in the license that hamper further analysis, issues with clean data access, or terms that require legal advice.
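
To make the scale above concrete, here is a minimal sketch (not our actual scoring code) that turns a grade written with star glyphs, such as `★ ★ ★ ★ ½`, into a number and maps it onto the coarse bands listed above. The function names `parse_grade` and `describe_grade` and the short band summaries are hypothetical, and placing half-star grades such as 4.5 in the band of the whole star below them is an assumption.

```python
def parse_grade(text: str) -> float:
    """Convert a grade string such as '★ ★ ★ ½' into a number of stars."""
    # Each '★' counts as 1.0 and each '½' as 0.5; an empty string or '∅'
    # contains neither glyph and therefore yields 0.0.
    return text.count("★") * 1.0 + text.count("½") * 0.5


def describe_grade(stars: float) -> str:
    """Map a star count onto the coarse bands described above.

    Assumption: half-star grades fall into the band of the whole star
    below them (for example, 4.5 is read with the 4-star band).
    """
    if not 0.0 <= stars <= 5.0:
        raise ValueError(f"grade out of range: {stars}")
    if stars >= 5.0:
        return "unfettered (re)use and redistribution"
    if stars >= 4.0:
        return "(re)use and redistribution under some terms"
    if stars >= 3.0:
        return "clear, standard license with terms that may greatly impact (re)use"
    return "likely issues with finding the license, ambiguity, access, or terms"


if __name__ == "__main__":
    for grade in ["★ ★ ★ ★ ★", "★ ★ ★ ★ ½", "★ ★ ½", "½", "∅"]:
        stars = parse_grade(grade)
        print(f"{grade!r}: {stars} stars, {describe_grade(stars)}")
```

The bands are deliberately coarse: the actual star assignment comes from working through the detailed criteria, not from a lookup like this one.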

If you see any problems with our determinations or would like to make corrections or clarifications, please open a ticket for us on our issue tracker.

Our criteria

This is a short overview of the criteria that we use when evaluating a resource's data license for use and reuse. We have attempted to balance many needs (credit, mutability, commercialization, redistribution, etc.) and focused on trying to objectively see how licenses can interact across resources.

To learn more about how we look at resource data licenses, please see our criteria and license type pages.

  • Clearly stated (A)
    A clearly stated, unambiguous, and hopefully standard license for data use is critical for any (re)use of data: if there is no license to be found, then rights are unclear and one needs to assume the default: all rights reserved.
  • Comprehensive and non-negotiated (B)
    Data that is mixed under different licenses, only partially available, or must be in some way negotiated creates barriers to the (re)use of data.
  • Accessible (C)
    Data must be accessible in a reasonable and transparent manner to be useful to the broader community.
  • Avoid restrictions on kinds of (re)use (D)
    Data should be able to be copied, built upon, edited, and modified as freely as possible.
  • Avoid restrictions on who may (re)use (E)
    Data should be available to as many people as possible for their (re)use.
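
The license issues we record for each resource are labeled with codes such as `A.2.2` or `D.1.2`, whose first letter refers to one of the five criteria above. As a small illustrative sketch, the lookup below resolves such a code to its top-level criterion. The names `CRITERIA` and `criterion_for` are hypothetical, and the assumed code shape (a letter A to E followed by optional dotted numbers) is inferred from the codes shown on this page; the sub-criteria themselves are described on the criteria pages and are not modeled here.

```python
import re

# Top-level criteria, keyed by the letter used in codes such as "A.2.2".
CRITERIA = {
    "A": "Clearly stated",
    "B": "Comprehensive and non-negotiated",
    "C": "Accessible",
    "D": "Avoid restrictions on kinds of (re)use",
    "E": "Avoid restrictions on who may (re)use",
}

# Assumed shape of a code: one letter A-E, then zero or more ".<number>" parts.
_CODE_PATTERN = re.compile(r"^([A-E])(?:\.\d+)*$")


def criterion_for(code: str) -> str:
    """Return the name of the top-level criterion that a code belongs to."""
    match = _CODE_PATTERN.match(code.strip())
    if match is None:
        raise ValueError(f"not a criteria code: {code!r}")
    return CRITERIA[match.group(1)]


if __name__ == "__main__":
    for code in ["A.2.2", "B.1", "C.2", "D.1.2", "E.1.1"]:
        print(f"{code}: {criterion_for(code)}")
```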

Our source data

You may also explore our data with simple visualizations here.

Name | Tags | Grade | Description | License Info | License Issues
Alliance of Genome Resources (AGR) | biology, MOD, functional annotation, disease-gene association, orthology, phenotype and disease models | ★ ★ ★ ★ ★ | The primary mission of the Alliance of Genome Resources (the Alliance) is to develop and maintain sustainable genome information resources that facilitate the use of diverse model organisms in understanding the genetic and genomic basis of human biology, health and disease. | permissive
ArrayExpress | biology, microarray experiments, functional genomics, high-throughput, microarray, sequencing | ★ ★ ★ ★ ½ | ArrayExpress Archive of Functional Genomics Data stores data from high-throughput functional genomics experiments, and provides these data for reuse to the research community. | permissive
Criteria A.2.2
Minimal custom permissive terms.
Bgee | biomedical, x-species, expression data | | Bgee is a database to retrieve and compare gene expression patterns in multiple animal species, produced from multiple data types (RNA-Seq, Affymetrix, in situ hybridization, and EST data). | unknown
Criteria A.1.1
While the CC0 tool is explicitly being invoked in main areas, there is some inconsistency, i.e. different ontology licensing and downloads that obviously encompass non-CC0 data.
BioCyc Database Collection (BioCyc, public) | biology, genomic resource, sequence, gene structure, pathways, reactions, functional annotation | ★ ★ ★ | BioCyc is a collection of 20,028 Pathway/Genome Databases (PGDBs) for model eukaryotes and for thousands of microbes, plus software tools for exploring them. BioCyc is an encyclopedic reference that contains curated data from 130,000 publications. | restrictive
Criteria A.2.2
Non-standard/custom license.
Criteria B.1
One term of the license is that you must "Notify SRI that you are making BIOCYC DATABASES available in this manner"; this, combined with somewhat bulky access (see comments), I believe rises to a barrier to reuse, as a manual step involving people has been added.
Criteria D.1.2
The license specifically notes that it is being licensed to you and you do not have rights beyond that; I believe that this puts second-hand reuse in question: you may remix, but a downstream party from you would have to register and comply directly with BioCyc.
BioGRID | biology, cross-species, protein-protein interaction | ★ ★ ★ ★ ★ | BioGRID is an interaction repository with data compiled through comprehensive curation efforts. Our current index is version 3.4.155 and searches 63,959 publications for 1,507,991 protein and genetic interactions, 27,785 chemical associations and 38,559 post translational modifications from major model organism species. All data are freely provided via our search index and available for download in standardized formats. | permissive
BRENDA Tissue Ontology | biology, ontology, enzyme sources | ★ ★ ★ ★ | A structured controlled vocabulary for the source of an enzyme. It comprises terms of tissues, cell lines, cell types and cell cultures from uni- and multicellular organisms. | permissive
Criteria A.2.2
While they are obviously attempting to be permissive by listing "CC-BY", this does not map onto any of the numbered CC-BY versions (e.g. 3.0 or 4.0), thereby breaking the reference and meaning that we would have to contact them for terms.
Criteria C.2
Once one knows the location of the BTO (ontology file), access is fine. However, we were unable to locate the ontology file after some searching through the main BRENDA website; most text would imply there is no free ontology file to be had.
Cancer Biomarkers database | oncology, interaction, cancer, drug, biomarker, oncology | ★ ★ ★ ★ ★ | The Cancer Biomarkers database is curated and maintained by several clinical and scientific experts in the field of precision oncology. | permissive
Catalogue of Life | biology, custom, biodiversity, distribution, biogeography, taxonomy, ontology | ★ ★ | The Catalogue of Life is the most comprehensive and authoritative global index of species currently available. It consists of a single integrated species checklist and taxonomic hierarchy. The Catalogue holds essential information on the names, relationships and distributions of over 1.6 million species. | restrictive
Criteria A.2.2
The resource uses custom terms.
Criteria B.1
Use seems to hinge on some contact with Sp2000. For example: "If you wish to use the Catalogue of Life content on a public portal or webpage you are required to notify the Species 2000 Secretariat, and to assist with a check that the correct credits are given." The check assistance especially seems to violate B.1.
Criteria C.1
The data "download" is quite complicated and should not actually be considered a download (see commentary). The API as given would likely require a custom spider to obtain the data in bulk.
Criteria D.1.2
Distribution seems to be prohibited without negotiation; example on the main ToS: "Commercial use of this compilation or any of the species datasets contained within...or dissemination on the Internet, requires written permission from Species 2000 and ITIS."
Criteria E.1.1
Non-commercial restrictions exist on the data from the ToS.
CATH Protein Structure Database | biology, protein families, protein family, superfamily, classification protein structure | ★ ★ ★ ★ ★ | CATH is a classification of protein structures downloaded from the Protein Data Bank. | permissive
ChEMBL | biology, biochemical, bioactive drug-like small molecules | ★ ★ ★ | ChEMBL is a database of bioactive drug-like small molecules; it contains 2-D structures, calculated properties (e.g. logP, Molecular Weight, Lipinski Parameters, etc.) and abstracted bioactivities (e.g. binding constants, pharmacology and ADMET data). | copyleft
Criteria D.1.2
CC SA prevents some types of reuse, such as modification and redistribution with data from different license types.
Criteria E.1.2
CC SA keeps the data from being reusable by all parties, as in D.1.2.
Clinical Interpretation of Variants in Cancer (CIViC) | biomedical, human, cancer, precision medicine, variants, variant disease associations | ★ ★ ★ ★ ★ | CIViC is an open access, open source, community-driven web resource for Clinical Interpretation of Variants in Cancer. Our goal is to enable precision medicine by providing an educational forum for dissemination of knowledge and active discussion of the clinical significance of cancer genome alterations. | permissive
ClinVar | biomedical, human, disease-gene association, variant-disease association, variant definitions | ★ ★ ★ ★ | ClinVar archives and aggregates information about relationships among variation and human health. ClinVar collects reports of variants found in patient samples, assertions made regarding their clinical significance, information about the submitter, and other supporting data. | permissive
Criteria A.2.1
Public domain declaration
Criteria B.2.1
The license page section "Molecular Data Usage" states that all data may not be covered under the public domain (see comments).
Criteria B.2.2
There does not seem to be any easy way to differentiate the "clean" data.
COGs | biology, protein function, protein family, function prediction | | Phylogenetic classification of proteins encoded in complete genomes. | copyright
Criteria A.1.2
Could not find any clear terms on any searched pages. See commentary.
Comparative Toxicogenomics Database (CTD) | biology, x-species, disease-gene association | ★ ★ ½ | CTD promotes understanding about the effects of environmental chemicals on human health by integrating data from curated scientific literature to describe chemical interactions with genes and proteins, and associations between diseases and chemicals, and diseases and genes/proteins. | restrictive
Criteria A.2.2
Custom license with interesting use restrictions.
Criteria B.1
For quality control purposes, you must provide CTD with periodic access to your publication of our data.
Criteria D.1.2
Given the four statements in the Additional Terms of Data Use, notably number 4, it looks like any downstream user would have to renegotiate with CTD.
Criteria E.1.1
Without negotiation: "It is to be used only for research and educational purposes."
dbGaP (public) | biology, human, genotype-phenotype | | The database of Genotypes and Phenotypes (dbGaP) was developed to archive and distribute the data and results from studies that have investigated the interaction of genotype and phenotype in Humans. Provides authorized access to protected and raw data (e.g., Genotype-Tissue Expression (GTEx) project). | unknown
Criteria A.1.1
Per the dbGaP data use certification, 'The terms and conditions of using dbGaP data vary by study'. All terms and conditions are to align with NIH GDS.
Criteria C.1
Cannot access all the data.
Criteria C.2
Access methods are not transparent.
DECIPHER | biology, human, gene, genotype, rare disease, phenotype, variant, submicroscopic chromosomal imbalance, rare sequence variants | | DECIPHER is used by the clinical community to share and compare phenotypic and genotypic data. The DECIPHER database contains data from 24848 patients who have given consent for broad data-sharing; DECIPHER also supports more limited sharing via consortia. | private pool
Criteria A.2.2
The resource uses an extensive custom license.
Criteria B.1
There are terms that require possible audits in the future, making future free use something that needs to be negotiated.
Criteria C.1
This could not be evaluated as no description of access was found.
Criteria C.2
This could not be evaluated as no description of access was found.
Criteria D.1.2
Given the numerous restrictions and requirements (e.g. purging, forced QA), it seems that downstream reuse would be problematic, even though there is a carve-out for "research".
Criteria E.1.2
Given the numerous restrictions and requirements (e.g. only "registered" users can have access to the data, etc.), it seems that downstream reuse would be problematic, even though there is a carve-out for "research".
dictyBase | biology, MOD, genotype-phenotype association, disease-model association, gene expression | | dictyBase is a database that provides a centralized source of Dictyostelium information for Dictyostelium researchers and Non-Dictyostelium researchers. This includes tutorials, dictyNews, techniques, the Stock Center, genomic and molecular information, colleague information, Dictyostelium literature, and much more. | copyright
Criteria A.1.2
Could not find any clear terms on any searched pages. See commentary.
DrugBank | pharmacology, drug, bioinformatics, cheminformatics, drugs, drug-protein interactions, targets, pathways | ★ ★ ★ ★ | The DrugBank database is a unique bioinformatics and cheminformatics resource that combines detailed drug data with comprehensive drug target information. | restrictive
Criteria D.1.1
The NC license allows for liberal non-commercial reuse.
Criteria E.1.1
The NC license allows for non-commercial reuse with other non-commercial interests.
DrugCentral | pharmacology, drug-target interaction, chemical structure of drugs, drug, disease | ★ ★ ★ | DrugCentral provides information on active ingredients chemical entities, pharmaceutical products, drug mode of action, indications, pharmacologic action. We monitor FDA, EMA, and PMDA for new drug approval on a regular basis to ensure currency of the resource. | copyleft
Criteria D.1.2
By using a copyleft-style license, there may be issues in mixing and redistributing this data with licenses that have incompatible terms.
Criteria E.1.2
By using a copyleft-style license, there may be some parties with issues in mixing and redistributing this data with licenses that have incompatible terms.
Dryad Digital Repository | general, any, data, literature | ★ ★ ★ ★ ★ | DataDryad.org is a curated general-purpose repository that makes the data underlying scientific publications discoverable, freely reusable, and citable. Dryad has integrated data submission for a growing list of journals; submission of data from other publications is also welcome. | permissive
ENCODE | biology, genomic resource, genomic elements | ★ ★ ★ ★ ½ | The Encyclopedia of DNA Elements (ENCODE) Consortium is an international collaboration of research groups funded by the National Human Genome Research Institute (NHGRI). The goal of ENCODE is to build a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active. | permissive
Criteria A.2.2
Easy and perfectly (re)usable, yet custom.
Fantom5 | biology, human, gene expression | ★ ★ ★ ★ ★ | We are complex multicellular organisms composed of ~400 distinct cell types. This diversity of cell types allow us to see, think, hear, fight infections etc. yet all of this is encoded in the same genome. The difference between all these cells is what parts of the genome they use – for instance, neurons use different genes than muscle cells, and therefore they work very differently. In FANTOM5, we have systematically investigated exactly what are the sets of genes used in virtually all cell types across the human body, and the genomic regions which determine where the genes are read from. We aim to use this information to build transcriptional regulatory models for every primary cell type that makes up a human. | permissive
FlyBase | biology, MOD, genotype-phenotype association | ★ ★ ½ | FlyBase is the model organism database providing integrated genetic, genomic, phenomic, and biological data for Drosophila melanogaster. | restrictive
Criteria A.2.2
Copyright statement includes 'This publication may be copied for non-commercial, scientific uses by individuals or organizations (including for-profit organizations). FlyBase is freely distributed to the scientific community on the understanding that it will not be used for commercial gain by any organization. Any commercial use of this publication, or any parts thereof, is expressly prohibited without permission in writing from the FlyBase consortium.'
Criteria B.1
Copyright statement includes 'Certain portions of FlyBase are copyrighted separately.'
Criteria B.2.1
Copyright statement includes 'Certain portions of FlyBase are copyrighted separately.'
Criteria D.1.1
As stated, the copyright may be interpreted by a non-legal professional to mean that the contents may be reused/remixed in a non-commercial context.
Criteria E.1.1
As stated, the copyright may be interpreted by a non-legal professional to mean that the contents may be reused/remixed in a non-commercial context.
Genomic Data Commons (GDC) | biology, human, cancer genome, variants, mRNA and miRNA sequence data | | The NCI's Genomic Data Commons (GDC) provides the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine. | copyright
Criteria A.1.2
After searching around GDC and cancer.gov, no centralized licensing or terms could be found.
GISAID | biology, viral, gene, clinical, epidemiology, influenza virus sequence, aves, evolution, demographic, geo-spatial, gis | | The GISAID Initiative promotes the international sharing of all influenza virus sequences, related clinical and epidemiological data associated with human viruses, and geographical as well as species-specific data associated with avian and other animal viruses, to help researchers understand how the viruses evolve, spread and potentially become pandemics. | private pool
Criteria A.2.2
The resource uses an extensive custom license.
Criteria B.1
There are terms (e.g. 2b) that would make it seem that GISAID would need to be negotiated with for any further reuse beyond.
Criteria C.1
The webapp-style interface is not suitable for scripting or batch operations; bulk access options were not apparent.
Criteria C.2
I believe it would take a bit of knowledge to probe out the correct calls and get through the authorization, assuming scripting was feasible.
Criteria D.1.2
No pass-through of data seems possible except through the license agreement; no academic exception seems to exist.
Criteria E.1.2
No pass-through of data seems possible except through the license agreement; no academic exception seems to exist.
Genome Aggregation Database (gnomAD) | biology, human, exome sequencing data, genome sequencing data, disease-specific genetic studies, population genetic studies | ★ ★ ★ | The Genome Aggregation Database (gnomAD) is a coalition of investigators seeking to aggregate and harmonize exome and genome sequencing data from a variety of large-scale sequencing projects, and to make summary data available for the wider scientific community. In its first release, which contained exclusively exome data, it was known as the Exome Aggregation Consortium (ExAC). | copyleft
Criteria D.1.2
While extremely open, the ODbL does restrict the ability to mix and redistribute with data without similar terms.
Criteria E.1.2
By using the ODbL, there may be some parties with issues in mixing and redistributing this data with licenses that have incompatible terms.
Gene Ontology (annotations) | biology, x-species, gene annotation, gene association, biological process, molecular function, cellular component | ★ ★ ★ ★ ★ | The mission of the GO Consortium is to develop an up-to-date, comprehensive, computational model of biological systems, from the molecular level to larger pathways, cellular and organism-level systems. | permissive
Gene Ontology (ontology) | biology, x-species, ontology, biological process, molecular function, cellular component | ★ ★ ★ ★ ★ | The mission of the GO Consortium is to develop an up-to-date, comprehensive, computational model of biological systems, from the molecular level to larger pathways, cellular and organism-level systems. | permissive
GTEx | biology, human, gene expression | ★ ★ ½ | The Genotype-Tissue Expression (GTEx) project aims to provide to the scientific community a resource with which to study human gene expression and regulation and its relationship to genetic variation. This project will collect and analyze multiple human tissues from donors who are also densely genotyped, to assess genetic variation within their genomes. By analyzing global RNA expression within individual tissues and treating the expression levels of genes as quantitative traits, variations in gene expression that are highly correlated with genetic variation can be identified as expression quantitative trait loci, or eQTLs. | permissive
Criteria A.2.2
Custom license based on the various datasets and NIH Genomic Data Sharing Policy.
Criteria C.1
No API or URL to access all data groupings with single action.
Criteria C.2
No API or URL and therefore no reasonable and transparent access.
Criteria D.1.1
As stated, the copyright may be interpreted by a non-legal professional to mean that the contents may be reused/remixed for research/scientific purposes.
Criteria E.1.1
As stated, the copyright may be interpreted by a non-legal professional to mean that the contents may be reused/remixed for research/scientific purposes.
Human Metabolome Database (HMDB) | biology, human, metabolomics, clinical chemistry, biomarkers | ★ ★ ★ ½ | The Human Metabolome Database (HMDB) is a freely available electronic database containing detailed information about small molecule metabolites found in the human body. | restrictive
Criteria A.2.2
Custom terms at linked license and on downloads page.
Criteria D.1.1
The custom license would appear to allow for liberal non-commercial reuse.
Criteria E.1.1
The custom license would appear to explicitly allow for non-commercial reuse with other non-commercial interests.
Human Phenotype Ontology (HPO) | biology, human, disease-phenotype association | ★ ★ ½ | A curated database of human hereditary syndromes from OMIM, Orphanet, and DECIPHER mapped to classes of the human phenotype ontology. Various meta-attributes such as frequency, references and negations are associated with each annotation. These are presently limited to rare mendelian diseases. | restrictive
Criteria A.2.2
HPO is copyrighted to protect ontologies and all changes must be made by HPO developers.
Criteria D.1.2
Restricted downstream use. May not be edited.
Criteria E.1.2
Restricted downstream use translates to agents as well.
International Mouse Phenotyping Consortium (IMPC) | biology, mouse, genotype-phenotype association | | The International Mouse Phenotyping Consortium (IMPC) is generating a knockout mouse strain for every protein coding gene by using the embryonic stem cell resource generated by the International Knockout Mouse Consortium (IKMC). Systematic broad-based phenotyping is performed by each IMPC center using standardized procedures found within the International Mouse Phenotyping Resource of Standardised Screens (IMPReSS) resource. Gene-to-phenotype associations are made by a versioned statistical analysis. | copyright
Criteria A.1.2
Could not find licensing information in a reasonable location. Determined ARR by default.
Criteria D.1.2
Restricted downstream use per ARR.
Criteria E.1.2
All downstream agents restricted per ARR.
Kyoto Encyclopedia of Genes and Genomes (KEGG), FTP | biology, genomic resource, gene-pathway association, disease-gene association, orthology | ★ ★ | KEGG is an integrated database resource consisting of the seventeen main databases including systems, genomic, chemical, and health information. | restrictive
Criteria A.2.2
For our use case (see commentary), KEGG (FTP) uses custom licensing through NPO Bioinformatics Japan (https://www.bioinformatics.jp/en/keggftp.html).
Criteria B.1
According to the organizational use agreement: "Your Product or Service must not allow Your users to obtain KEGG FTP Data, except in small quantities"; many uses would require negotiation here as continuing reuse is unclear.
Criteria D.1.2
The licensing terms are too restrictive for reasonable reuse (e.g. B.1 violation above).
Criteria E.1.2
Given our interpretation of the licensing terms, there is unlikely to be much ability to freely reuse the KEGG FTP data within any class.
KnowEnG | biology, x-species, graph, cancer, gene, ontology, platform, protein families, molecular interaction, protein interaction | ★ ½ | KnowEnG enables knowledge-guided machine learning and graph mining analysis on genomic datasets using scalable cloud computation and exploration of results with interactive visualizations. | restrictive
Criteria B.2.1
Explicit statement that license only applies to new work.
Criteria B.2.2
I found no method of obtaining just the new work data.
Criteria C.1
The resource seems to be a self-contained platform; after some search, I found no bulk methods of download.
Criteria C.2
As C.1.
Criteria D.1.2
While the NC part allows for non-commercial reuse, SA has license requirements.
Criteria E.1.2
While the NC part allows for non-commercial reuse, SA has license requirements.
Mouse Genome Informatics (MGI) | biology, MOD, genotype-phenotype association, disease-model association, gene expression | ★ ★ ★ ★ | MGI is the international database resource for the laboratory mouse, providing integrated genetic, genomic, and biological data to facilitate the study of human health and disease. | permissive
Criteria A.2.2
Custom license.
Criteria E.1.1
Distinguishes groups, allowing for research/academic use. Commercial groups can negotiate.
Monarch Initiative | biology, x-species, gene, genotype, disease, phenotype, variant, disease-phenotype associations, genotype-phenotype associations | ½ | Integrate, align, and re-distribute cross-species gene, genotype, variant, disease, and phenotype data. Provide a portal for exploration of phenotype-based similarity. Facilitate identification of animal models of human disease through phenotypic similarity. Enable quantitative comparison of cross-species phenotypes. Develop embeddable widgets for data exploration. Influence genotype and phenotype reporting standards. Improve ontologies to better curate genotype-phenotype data. | unknown
Criteria A.1.1
Text in footer: "Except where forbidden by the original sources..."; how these interact is not clear. There is no indication of which data has what sources, so reusability cannot be determined.
Criteria C.1
The scope of the API is not clear and the data downloads locations are apparently not public.
Mouse Phenome Database (MPD) | biology, MOD, genotype (strain)-phenotype association | ★ ★ ★ ★ ½ | The Mouse Phenome Database is a collaborative standardized collection of measured data on laboratory mouse strains, and includes: baseline phenotype data sets; studies of drug, diet, disease and aging effect; protocols, projects, and publications; and SNP, variation and gene expression studies. MPD collects data for classical inbred strains, other fixed-genotype strains, derived lines and populations that are openly acquirable (strain panel examples). Strains can be from JAX-Mice or from any other vendor that's a recognized breeding source. | permissive
Criteria A.2.2
Custom license, yet consistent.
MSigDB | biology, gene sets | ★ ★ ★ ★ ½ | The Molecular Signatures Database (MSigDB) is a collection of annotated gene sets for use with GSEA software. | permissive
Criteria B.2.1
The main body of work is under CC-BY 4.0, with filterable additional works under their own terms from KEGG, BioCarta, and AAAS/STKE. While the filtering step is an annoyance, it can likely be done easily given their instructions and would be a one-time process per release.
MyGene.info | biology, genomic resource, gene definition | ★ ★ | MyGene.info provides simple-to-use REST web services to query/retrieve gene annotation data. | restrictive
Criteria A.2.2
Custom license, but consistent.
Criteria B.2.1
Scope is incomplete - they claim no responsibility for data from other sources on their site.
Criteria D.1.2
No re-use allowed, just use.
Criteria E.1.2
No re-use allowed, just use.
MyVariant.info | biology, genomic resource, variants, variant annotation | ★ ★ | MyVariant.info provides simple-to-use REST web services to query/retrieve variant annotation data, aggregated from many popular data resources. | restrictive
Criteria A.2.2
Custom license, but consistent.
Criteria B.2.1
Scope is incomplete - they claim no responsibility for data from other sources on their site.
Criteria D.1.2
No re-use allowed, just use.
Criteria E.1.2
No re-use allowed, just use.
National Center for Biotechnology Information (Gene) | biology, genomic resource, gene definition, taxon definition, gene-publication association | | Gene integrates information from a wide range of species. A record may include nomenclature, Reference Sequences (RefSeqs), maps, pathways, variations, phenotypes, and links to genome-, phenotype-, and locus-specific resources worldwide. | unknown
Criteria A.1.1
The license apparently uses language to declare something similar to "public domain", but with the caveat that it may contain data that is otherwise. This is judged to be a violation as any (re)use would depend on negotiating with all upstream copyright holders, which are not presented. It is implied that their license does not cover all data, and we could not find an explicit "clean" version of the data in the downloads.
NCI Thesaurus | biomedical, ontology, cancer, ontology, disease, phenotypes, pharmacology, drugs, biomedical coding and reference | ★ ★ ★ ★ ★ | NCI Thesaurus (NCIt) provides reference terminology for many NCI and other systems. It covers vocabulary for clinical care, translational and basic research, and public information and administrative activities. | permissive
neXtProt | biology, human, protein-related data, protein functional data, protein-protein interaction, subcellular location | ★ ★ ★ ★ ★ | Developed by the SIB Swiss Institute of Bioinformatics neXtProt is a comprehensive human-centric discovery platform, offering its users a seamless integration of and navigation through protein-related data. | permissive
Online Mendelian Inheritance in Animals (OMIA) | biology, veterinary x-species, gene-disease association | ★ ★ ★ | Online Mendelian Inheritance in Animals (OMIA) is a catalogue/compendium of inherited disorders, other (single-locus) traits, and genes in 215 (non-model) animal species. | copyright
Criteria D.1.2
The license seems to be a standard ARR, with no exception for any kind of bulk (re)use.
Criteria E.1.2
The license seems to be a standard ARR, with no exception for any kind of bulk (re)use.
Mendelian Inheritance in Man (OMIM) 🔗biomedical, human, disease-phenotype association, gene-disease association, variant-disease association★ ½OMIM is a comprehensive, authoritative compendium of human genes and genetic phenotypes with full-text, referenced overviews that contains information on all known mendelian disorders and over 12,000 genes. OMIM focuses on the relationship between phenotype and genotype.restrictive 🔗
Criteria A.2.2
The license at link is custom.
Criteria B.1
Agreement section 13 requires downloads to be updated by downstream users; section 14 discusses arbitrary API key revocation.
Criteria C.2
Downloads and access are provided post-registration; as the API key is used for access control, it violates the C.2 example.
Criteria D.1.2
Sections 8, 9, and 10 of the agreement make it seem that while a non-profit researcher may access the data, there is no reuse (pass-through) possible.
Criteria E.1.2
Sections 8, 9, and 10 of the agreement make it seem that while a non-profit researcher may access the data, there is no reuse (pass-through) possible.
OncoKB 🔗biology, human, variants, cancer genes, gene expression★ ★ ★ ½OncoKB is a precision oncology knowledge base and contains information about the effects and treatment implications of specific cancer gene alterations. It is developed and maintained by the Knowledge Systems group in the Marie Josée and Henry R. Kravis Center for Molecular Oncology at Memorial Sloan Kettering Cancer Center (MSK), in partnership with Quest Diagnostics.restrictive 🔗
Criteria A.2.2
While custom, the terms are quite clear.
Criteria D.1.1
Restrictions on type of use appear in the license, even for research: "You may copy, reproduce, or create derivative works of the Content only if:...you are using the Content only to replicate OncoKB locally, whether in whole or in part; or you are aggregating the Content with other data of similar nature for the purposes of advancing cancer research."
Criteria E.1.1
Restrictions exist on who may use the data: "You may copy, reproduce, or create derivative works of the Content only if: you are a researcher or a non-profit entity; and..."
Orphanet portal for rare diseases and orphan drugs (academic access subset) 🔗biomedical, human, disease-gene association, disease-phenotype association, disease classification, clinical metadata, disease epidemiology, orphan drugs, ontology★ ★Orphanet provides reference information on rare diseases and orphan drugs to help improve the diagnosis, care and treatment of patients with rare diseases.restrictive 🔗
Criteria A.2.2
Non-standard/custom license.
Criteria B.1
Section 11.1 of the DTA we looked at discusses the term of the agreement, which would have to be renewed approximately every year.
Criteria D.1.2
The user "...may freely use the Data only for data analysis...any other use of the Data must receive the authorization of Orphanet management board..."
Criteria E.1.2
The license is "personal, non-transferable and non-communicable", so cannot be freely reused.
Orphanet portal for rare diseases and orphan drugs (open access subset) 🔗biomedical, human, disease-gene association, disease-phenotype association, disease classification, ontology★ ★ ★Orphanet provides reference information on rare diseases and orphan drugs to help improve the diagnosis, care and treatment of patients with rare diseases.restrictive 🔗
Criteria D.1.2
The CC-BY-ND license prevents derivation.
Criteria E.1.2
The CC-BY-ND license prevents derivation.
Protein ANalysis THrough Evolutionary Relationships Classification System (PANTHER) 🔗biology, genomic resource, orthology★ ★ ★The PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System was designed to classify proteins (and their genes) according to evolutionary family/subfamily, molecular function, biological process, and pathway. The PANTHER Classifications are the result of human curation as well as sophisticated bioinformatics algorithms.copyright 
Criteria D.1.2
Given the all rights reserved copyright statement, any downstream reuse would require negotiation.
Criteria E.1.2
Given the all rights reserved copyright statement, all user/agent types would need to negotiate downstream reuse.
Pfam 🔗biology, protein families, protein family alignments, HMMs★ ★ ★ ★ ★The Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs).permissive 🔗
Pharos 🔗biology, disease, targets, ligandsPharos is the user interface to the Knowledge Management Center (KMC) for the Illuminating the Druggable Genome (IDG) program...unknown 🔗
Criteria A.1.1
While any novel data is under a CC license (CC-BY-SA-4.0), the integrated resources remain under whatever license each upstream source uses, with the user expected to sort it out.
PomBase 🔗biology, MOD, genotype-phenotype association, disease-model association, gene expression★ ★ ★ ★ ★PomBase is a comprehensive database for the fission yeast Schizosaccharomyces pombe, providing structural and functional annotation, literature curation and access to large-scale data sets.permissive 🔗
Reactome 🔗biology, pathway, pathway data★ ★ ★ ★ ½Reactome is a free, open-source, curated and peer reviewed pathway database. Our goal is to provide intuitive bioinformatics tools for the visualization, interpretation and analysis of pathway knowledge to support basic research, genome analysis, modeling, systems biology and education.permissive 🔗
Criteria B.2.2
KEGG gene and pathway annotations used to construct the Reactome Functional Interaction (FI) Network are not licensed under CC-BY-4.0. There is a comment that "the recipient may not distribute this data to other users without a license from Pathway Solutions, Inc."
Rfam 🔗biology, RNA families, RNA family, multiple sequence alignments, msa★ ★ ★ ★ ★The Rfam database is a collection of RNA families, each represented by multiple sequence alignments, consensus secondary structures and covariance models (CMs).permissive 🔗
Rat Genome Database (RGD) 🔗biology, MOD, genotype-phenotype association, disease-model association, gene expression★ ★ ★ ★ ★The Rat Genome Database (RGD) was established in 1999 and is the premier site for genetic, genomic, phenotype, and disease data generated from rat research. In addition, it provides easy access to corresponding human and mouse data for cross-species comparisons.permissive 🔗
Rhea 🔗biology, biochemical, enzymes, metabolic networks, reactions★ ★ ★ ★ ★Rhea is an expert curated resource of biochemical reactions designed for the annotation of enzymes and genome-scale metabolic networks and models. Rhea uses the ChEBI (Chemical Entities of Biological Interest) ontology of small molecules to precisely describe reactions participants and their chemical structures. All reactions are balanced for mass and charge and are linked to source literature, metabolic resources and other functional vocabularies such as the enzyme classification of the NC-IUBMB.permissive 🔗
Saccharomyces Genome Database (SGD) 🔗biology, MOD, genotype-phenotype association, disease-model association, gene expression★ ★ ★ ★ ½The Saccharomyces Genome Database (SGD) provides comprehensive integrated biological information for the budding yeast Saccharomyces cerevisiae along with search and analysis tools to explore these data, enabling the discovery of functional relationships between sequence and gene products in fungi and higher organisms.permissive 🔗
Criteria B.2.1
While the downloadable data seems quite clearly CC-BY-4.0, the API footer has terms that indicate that all data may not be covered under the same license (see comments).
Criteria B.2.2
There does not seem to be any easy way in the API to differentiate the "clean" CC-BY-4.0 data from data under other licenses.
Simple Modular Architecture Research Tool (SMART) 🔗biology, protein domains, protein domain identification, protein domain annotation★ ½Identification and annotation of genetically mobile domains and the analysis of domain architectures.restrictive 🔗
Criteria A.2.2
Custom license used.
Criteria C.1
The main site seems to be a web application, with no bulk access possible.
Criteria C.2
While there is a software download where one may be able to recreate the data and/or website, the evaluator believes that the process and package around it violates access criteria.
Criteria D.1.2
The restrictive licensing terms prevent further reuse.
Criteria E.1.2
The restrictive licensing terms prevent further reuse.
STRING 🔗biology, cross-species, protein-protein interaction, protein families★ ★ ★ ★ ★STRING is a database of known and predicted protein-protein interactions. The interactions include direct (physical) and indirect (functional) associations; they stem from computational prediction, from knowledge transfer between organisms, and from interactions aggregated from other (primary) databases.permissive 🔗
SUPERFAMILY 🔗biology, annotation, structural annotation, functional annotation, HMM, protein superfamily★ ★ ★ ★SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.permissive 🔗
Criteria A.2.2
No formal license used, just a declaration that: "All annotation, models and the database dump are freely available for download to everyone."
Criteria C.1
There seems to be no central clearing location for the data; instead, one must view a hierarchy, navigate, click on an individual set, and then download. This is scriptable.
The Arabidopsis Information Resource (TAIR, public) 🔗biology, MOD, sequence, gene structure, gene expression, functional annotation★ ★ ★ ★ ★The Arabidopsis Information Resource (TAIR) maintains a database of genetic and molecular biology data for the model higher plant Arabidopsis thaliana.permissive 🔗
TIGRFAMs 🔗biology, protein, MSA, HMM, protein sequence classification★ ★ ★TIGRFAMs is a resource consisting of curated multiple sequence alignments, Hidden Markov Models (HMMs) for protein sequence classification, and associated information designed to support automated annotation of (mostly prokaryotic) proteins. copyleft 🔗
Criteria D.1.2
By using a copyleft-style license, there may be issues in mixing and redistributing this data with licenses that have incompatible terms.
Criteria E.1.2
By using a copyleft-style license, there may be some parties with issues in mixing and redistributing this data with licenses that have incompatible terms.
Unified Medical Language System (UMLS) 🔗biomedical, terminology, standards, concept, ontologyThe UMLS integrates and distributes key terminology, classification and coding standards, and associated resources to promote creation of more effective and interoperable biomedical information systems and services, including electronic health records.restrictive 🔗
Criteria A.2.2
Non-standard/custom license.
Criteria B.1
One term of the license is that you must "[...]provide NLM with a brief report on the usefulness of the UMLS Metathesaurus in general[...]" (term 5); this rises to a barrier to reuse, as a manual step involving people has been added to use.
Criteria B.2.1
Violation occurs in section 11 "Some of the Material in the UMLS Metathesaurus is from copyrighted sources".
Criteria C.1
Could not be evaluated, as the terms of the license require agreeing to reporting before proceeding, which produces a barrier to access.
Criteria C.2
Could not be evaluated, as the terms of the license require agreeing to reporting before proceeding, which produces a barrier to access.
Criteria D.1.2
Various types of restrictions are given for various subsets/sources; for example, section 12.1 forbidding translation and derivative works.
Criteria E.1.1
"UMLS licenses are issued only to individuals and not to groups or organizations"; while an individual researcher may be able to get access and use, companies (for example) would not.
UniProt 🔗biology, sequence, protein sequence, protein function★ ★ ★ ★ ★The mission of UniProt is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information. Includes: UniProtKB, UniRef, UniParc, and Proteomes.permissive 🔗
WikiPathways 🔗biology, pathway, disease, micronutrient, nanomaterial, ExDNA, renal genomics, adverse outcomes, regenerative medicine, clinical proteomic tumor analysis★ ★ ★ ★ ★ WikiPathways is a database of biological pathways maintained by and for the scientific community.permissive 🔗
WormBase 🔗biology, model organism genome sequencesWormBase is an international consortium dedicated to providing the research community with accurate, current, accessible information concerning the genetics, genomics and biology of C. elegans and related nematodes.unknown 🔗
Criteria A.1.1
A single license is not provided; rather, data users are instructed that they are responsible for identifying and complying with licensing and copyright restrictions for each piece of information in the database.
Zebrafish Information Network (ZFIN) 🔗biology, model organism database★ ★The Zebrafish Information Resource is the community database resource for the laboratory use of zebrafish which develops and supports integrated zebrafish genetic, genomic and developmental information, maintains the definitive reference data sets of zebrafish research information toward facilitation of the use of zebrafish as a model for human biology.restrictive 🔗
Criteria A.2.2
Custom license with non-academic and non-research use restrictions.
Criteria B.1
The license explicitly requires intervention for downstream reuse and redistribution.
Criteria D.1.2
The license requires written permission for redistribution.
Criteria E.1.2
The license requires written permission for redistribution, even for academic and non-commercial parties.

Contact us

All copyrightable materials on this site are © 2019 the (Re)usable Data Project under the CC-BY 4.0 license.
The (Re)usable Data Project is funded by the National Center for Advancing Translational Sciences (NCATS) OT3 TR002019 as part of the Biomedical Data Translator project and U24TR002306 as part of the CTSA Program National Center for Data to Health (CD2H).
The (Re)usable Data Project would like to acknowledge the assistance of many more people than can be listed here. Please visit the about page for the full list.