Curating a Resource

This is a glossary of fields used by the data resource descriptor file convention of the (Re)usable Data Project.

A template file is provided to assist with this process, and this documentation mirrors the description of these fields as they appear on reusabledata.org data source detail pages.

To make sure structured information is displayed properly, users may want to view this file in the original form on GitHub.

id

Field Description

Unique ID of the resource. A biodbcore or NAR identifier is preferred, but any unique internal ID is usable.

This is a short-form identifier (often an initialism) for the resource created by the curator. By convention, this is typically the same as the file prefix describing the resource. Dashes are used in lieu of spaces when needed.

Example Values

civicdb, bgee, mgi, monarch, pmkb, ncbi-gene
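The naming convention above can be checked mechanically. The following is a minimal sketch, not project code: the function name `is_conventional_id` is hypothetical, and the all-lowercase rule is an assumption inferred from the example values rather than a stated requirement.

```python
import re

# Assumed convention: lowercase letters and digits, with single dashes
# standing in for spaces (no leading, trailing, or doubled dashes).
ID_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_conventional_id(resource_id: str) -> bool:
    """Return True if the id matches the assumed dash-for-space convention."""
    return ID_PATTERN.fullmatch(resource_id) is not None
```

All of the example values listed above pass this check, while a value containing a space, such as "ncbi gene", does not.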

description

Field Description

A full description of the resource from the resource itself, if possible.

Example Value

'Integrate, align, and re-distribute cross-species gene, genotype, variant, disease, and phenotype data. Provide a portal for exploration of phenotype-based similarity. Facilitate identification of animal models of human disease through phenotypic similarity. Enable quantitative comparison of cross-species phenotypes. Develop embeddable widgets for data exploration. Influence genotype and phenotype reporting standards. Improve ontologies to better curate genotype-phenotype data.'

last-curated

Field Description

(Optional) The date, in ISO 8601 format, on which the license was last reviewed by a (Re)usable Data Project curator.

Example Value

2017-12-03
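A curator's tooling could confirm that a last-curated value is a valid ISO 8601 calendar date before committing it. This sketch uses only the Python standard library; the function name `parse_last_curated` is hypothetical.

```python
from datetime import date


def parse_last_curated(value: str) -> date:
    """Parse an ISO 8601 calendar date such as '2017-12-03'.

    Raises ValueError if the string is not a valid ISO 8601 date.
    """
    return date.fromisoformat(value)
```

A value written in another style, such as "12/03/2017", raises `ValueError` and can be flagged for correction.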

source

Field Description

A full-length, human-readable name for the resource that differentiates it from other similar resources.

Example Value

National Center for Biotechnology Information (Gene)

source-link

Field Description

URL for the resource. (Also described as Location in the reusabledata.org source details view.)

Example Value

http://www.civicdb.org

source-type

Field Description

(Optional) How the resource relates to the data it contains. This is a coarse description only, since many resources are hard to categorize.

A warehouse value is currently under discussion.

Allowed Values

Current allowable entries are: unknown, repository, source, and integrator.

  • repository: Describes a resource that, as a major component, allows direct submission of data sets under a specific theme, which may not undergo further harmonization.
  • source: Describes a resource that is the primary source of the data or knowledge it generates.
  • integrator: Describes a resource that, as a major component, collects and harmonizes content from multiple other resources.
  • unknown: Used when a resource does not qualify for one of the above types or is otherwise hard to categorize.
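Because source-type is an optional field with a closed set of values, a simple membership test is enough to catch typos. This is an illustrative sketch only: it represents a descriptor as a plain Python dict, and the names `ALLOWED_SOURCE_TYPES` and `source_type_is_valid` are hypothetical.

```python
# The four values listed above; "warehouse" is still under discussion
# and is therefore not included here.
ALLOWED_SOURCE_TYPES = {"unknown", "repository", "source", "integrator"}


def source_type_is_valid(descriptor: dict) -> bool:
    """Accept a missing source-type (the field is optional) or an allowed value."""
    value = descriptor.get("source-type")
    return value is None or value in ALLOWED_SOURCE_TYPES
```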

status

Field Description

Whether or not annotation is complete on this resource. (Also described as Curation status in the reusabledata.org source details view.)

Allowed Values

Current allowable entries are: complete, incomplete, and nonpublic.

  • complete: The draft of the resource curation has been completed and is ready for further processing.
  • incomplete: The draft of the resource curation is incomplete and needs to be finished before further processing. Will not be considered in analysis.
  • nonpublic: The resource has been determined not to be a public resource and therefore cannot be considered under the current rubric. Will not be considered in analysis of public resources.
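The status values determine which resources take part in analysis. A minimal sketch of that filter follows, assuming descriptors have been loaded as plain Python dicts; the function name `resources_for_analysis` is hypothetical and this is not the project's own pipeline code.

```python
def resources_for_analysis(descriptors):
    """Keep only descriptors whose curation status is 'complete'.

    Both 'incomplete' and 'nonpublic' resources are left out, matching
    the descriptions of those values above.
    """
    return [d for d in descriptors if d.get("status") == "complete"]
```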

data-field

Field Description

The area of research for the resource, loosely determined by the curator. (Also described as Field in the reusabledata.org source details view.)

Example Values

biomedical, biology, pharmacology

data-type

Field Description

The type of data the resource contains. (Also described as Type in the reusabledata.org source details view.)

Example Values

x-species, cross-species, ontology, MOD, genomic resource, pathway, sequence, human phenotype gene associations

data-categories

Field Description

(Optional) Free tags to describe the resource and its data. (Also described as Categories in the reusabledata.org source details view.)

Example Values

cancer, precision medicine, variants, variant disease associations, food terminology, food ontology, food associations

data-access

Field Description

(Optional) Links and metadata for accessing the resource's data, given as a structured list.

  • location: The location of the data access. Do NOT worry about getting the perfect and "real" location here; a top-level or informational location is fine.
  • type: The type of data access: "download" or "api".
  • label: (Optional) How the data access point should be differentiated from others at a resource, e.g. "expression data" or "HPO ontology". Strictly optional, since labeling can quickly become a rat's nest.

Example Values

    - type: download
      location: https://civic.genome.wustl.edu/releases
    - type: api
      location: https://griffithlab.github.io/civic-api-docs/
    - type: api
      location: http://foo.bar/api-v1
      label: old api
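The structure of a data-access list can be sanity-checked with a few lines of code. The sketch below assumes that location and type are required in each entry (only label is marked optional above) and that the list has already been parsed into Python dicts; the names `ALLOWED_ACCESS_TYPES` and `data_access_problems` are hypothetical.

```python
ALLOWED_ACCESS_TYPES = {"download", "api"}


def data_access_problems(entries):
    """Return human-readable problems found in a data-access list."""
    problems = []
    for index, entry in enumerate(entries):
        # Assumption: every entry needs a non-empty location.
        if not entry.get("location"):
            problems.append(f"entry {index}: missing location")
        # The type must be one of the two values named above.
        if entry.get("type") not in ALLOWED_ACCESS_TYPES:
            problems.append(f"entry {index}: type must be 'download' or 'api'")
    return problems
```

An empty result means every entry has a location and a recognized type; the label is deliberately not checked because it is optional.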

license

Field Description

The license that is used by the resource. We use SPDX identifiers where we can; otherwise we use one of: inconsistent, public domain, unlicensed, all rights reserved, or custom.

See scripts/source.schema.yaml for the most current full list that we use (as well as how to update).

Example Values

CC0-1.0, CC-BY-4.0, custom, all rights reserved

license-type

Field Description

The type of license that is being used. This will be used to define compatible data pools in the future; for now we use only the coarsest terms. If the type is not known or cannot be classified, "unknown" is used.

Allowed Values

Current possible values are: unknown, copyleft, permissive, copyright, restrictive, and private pool.

license-link

Field Description

(Optional) The link to the resource license. (Also described as License location in the reusabledata.org source details view.)

Example Values

http://www.omim.org/help/agreement, http://www.orphadata.org/cgi-bin/inc/legal.inc.php

license-hat-used

Field Description

(Optional) Setting this flag to true indicates that the licensing was combinatorially complicated enough (as is the case in some commercial licenses) that the curator chose to wear a single "hat" during the process. From the site text:

"While we try to cover as much of the licensing possibilities of a data resource that we can, in a few cases we may choose a particular "hat" to wear while evaluating to prevent a combinatorial explosion, which may also reduce the clarity of our curations for the community. In these cases, we may take on the role of a (1) non-commercial (2) academic (3) group that is (4) based in the US and trying to (5) create an aggregating resource (integrator), noting that other entities may have different results in the license commentary."

(Also described as Focused curation in the reusabledata.org source details view.)

Allowed Values

true, false

license-issues

Field Description

(Optional) Structured issues with the license.

For every issue discovered with a resource, there should be a corresponding item in the license-issues field that marks the /exact/ violation, along with any comments. Resources can use this field as a first step toward improvement, and curators can use it to clarify any surrounding circumstances.

Any issues or thoughts about a resource that do not slot into one of the criteria violations can go into the license-commentary field. They may cross reference.

(Also described as Issues in the reusabledata.org source details view.)

  • criteria: The criterion violated, e.g. "A.1.1" or "C.2".
  • comment: How the criterion was violated.

Example Values

    - criteria: A.1.1
      comment: looked under only rock, not there
    - criteria: E.1.1
      comment: clause that effectively uniformly prevents smurfs from accessing the data
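Since each issue must name the exact criterion violated, a quick format check can catch mistyped labels. This is a sketch under a stated assumption: the pattern (one capital letter followed by dot-separated numbers) is inferred from the examples "A.1.1", "C.2", and "E.1.1", not taken from the rubric itself, and the names `CRITERION_PATTERN` and `unrecognized_criteria` are hypothetical.

```python
import re

# Assumed label shape: a capital letter, then one or more ".<number>" parts.
CRITERION_PATTERN = re.compile(r"[A-Z](?:\.\d+)+")


def unrecognized_criteria(issues):
    """Return the criteria labels that do not match the assumed shape."""
    unrecognized = []
    for issue in issues:
        label = str(issue.get("criteria", ""))
        if CRITERION_PATTERN.fullmatch(label) is None:
            unrecognized.append(label)
    return unrecognized
```

A shape check like this does not confirm that the criterion exists in the rubric; it only flags labels that could not be valid.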

license-commentary

Field Description

(Optional) Further commentary on the license, possibly including the process of curation and things like locations of additional licenses. (Also described as Commentary in the reusabledata.org source details view.)

Example Usage

    - "one thought"
    - "another thought"

was-controversial

Field Description

(Optional) Marker noting that there was some extended internal discussion or controversy about the evaluation of the licensing terms. If this is marked as "true", the controversy, or a link to a permanent archive of the controversy, must be described in "license-commentary" in enough detail to reconstruct the issues. (Also described as Controversial in the reusabledata.org source details view.)

Allowed Values

true, false
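The dependency between was-controversial and license-commentary lends itself to an automated check. The following sketch assumes a descriptor loaded as a Python dict with license-commentary as a list of strings; the function name `controversy_is_documented` is hypothetical. It can only confirm that some commentary exists, not that the commentary is sufficient to reconstruct the issues.

```python
def controversy_is_documented(descriptor: dict) -> bool:
    """Return False only when was-controversial is true and no commentary exists."""
    if not descriptor.get("was-controversial", False):
        # Nothing to document when the flag is absent or false.
        return True
    return bool(descriptor.get("license-commentary"))
```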

provisional

Field Description

(Optional) In cases where there may not be the bandwidth for multiple (min. 2) people to review an evaluation, this piece of metadata allows desired fixes and new evaluations to start moving through the system while keeping track of them for additional scrutiny later on. The assumption is that things are not provisional unless marked.

Allowed Values

true, false

contacts

Field Description

(Optional) List of contact information for the resource: links, email addresses, or whatever else is public.

Example Value

    - https://civic.genome.wustl.edu/contact
    - foo@bar.bib

grants

Field Description

(Optional) Semi-structured list of supporting grants.

  • label: A string representation of the grant.
  • url: (Optional) The URL for grant information.

Example Value

    - label: 'Rhea development and curation activities at the SIB are supported by the Swiss Federal Government through the State Secretariat for Education, Research and Innovation (SERI), and by the SystemsX.ch, The Swiss Initiative in Systems Biology.'
      url: http://foo.bar
    - label: 'NIH Grant for Science 123'

All copyrightable materials on this site are © 2019 the (Re)usable Data Project under the CC-BY 4.0 license.
The (Re)usable Data Project is funded by the National Center for Advancing Translational Sciences (NCATS) OT3 TR002019 as part of the Biomedical Data Translator project and U24TR002306 as part of the CTSA Program National Center for Data to Health (CD2H).
The (Re)usable Data Project would like to acknowledge the assistance of many more people than can be listed here. Please visit the about page for the full list.