Board Spring 2009 Repository Policies

See also: The main Old:Repository Policies page

Discussion items related to repository policies for the Spring 2009 Dryad Management Board meeting.

Conventions for data citations
The Joint Data Archiving Policy (JDAP) is useful for enforcing deposits, but a complementary policy is needed for enforcing citations. Dryad itself has no authority to enforce citation policies. Only journals can do this, and not all journals will agree, but we hope the management board can make recommendations that all partner journals will follow.

Background

 * Howe et al. (2008)
 * "We recommend that all journals and reviewers require that a distinct section of the Methods (or a supplemental document) of all published articles includes approved gene symbols (which are inherently unstable) and model-organism database IDs (which do not change) for genes discussed; nucleotide or protein accession numbers (GenBank or UniProt ID) for isoforms of each gene or protein discussed; and descriptions of species, strains, cell types and genotypes used....This would accelerate literature curation, uphold information integrity, facilitate the proper linkage of data to other resources and support automated mining of data from papers."
 * "we would like to see authors routinely tagging all aspects of the data in their publication semantically using universally agreed tag standards."
 * Costello (2009)
 * Online data centers should publish clear, standard citations for data sets; track data-set access; and develop editorial processes to maximize data quality, data integration, accountability, visibility, and usability.
 * Authors must cite online data sources as they would print publications.
 * Sarah's research
 * GenBank norms
 * Pangaea is a repository for geoscience data, with many features similar to Dryad. A sample paper that cites datasets in Pangaea is doi:10.1016/j.palaeo.2007.03.030
 * Now is the time to determine this policy, while the culture is shifting. There will not be another chance.

Questions to answer

 * 1) Should all journals have the same policies for citations, or can each journal come up with its own policy?
 * * Recommendation: The management board should determine a policy for all partner journals to follow. This will encourage consistent citation practices across all fields that Dryad covers.
 * 2) Ideally, what should be cited? The publication only, the aggregate of all datasets as represented by Dryad's entry for the publication, or each individual dataset?
 * * Recommendation: In the long term, it will be useful to separate the citation of data from the citation of a publication, particularly for influential datasets that contribute to many publications.
 * 3) What should citations look like? (This depends greatly on the answer to the previous question.)
 * 4) DOI vs. handle

Identifier format
Dryad uses handles, which are native to DSpace, but are not widely used outside the repository world.

Most publishers use DOIs to identify articles.

Open questions

 * 1) Should Dryad switch to DOIs?

Pros:
 * DOIs are well-known by authors, making data citations seem more "real"
 * DOIs are supported by indexing/aggregation services, such as ISI

Cons:
 * DOIs are expensive ($0.40 per identifier)
 * DOIs carry additional administrative burden
 * DOIs must be individually registered, so every point that might be cited has to be known in advance (e.g., a time offset into a video file)
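
As a rough illustration of the cost item above, the sketch below multiplies the $0.40 fee by the holdings reported on this page (45 articles, 173 datasets). It assumes the fee is a one-time charge per identifier. The three granularity options and the function name doi_cost are illustrative only, not a proposal.

```python
# Figures taken from this page. Which objects would receive a DOI is an
# open question, so the three scenarios below are assumptions.
DOI_FEE_USD = 0.40   # per identifier, assumed to be a one-time charge
ARTICLES = 45
DATASETS = 173


def doi_cost(identifier_count, fee=DOI_FEE_USD):
    """Return the registration cost in USD for a number of identifiers."""
    return round(identifier_count * fee, 2)


scenarios = {
    "one DOI per article entry": ARTICLES,
    "one DOI per dataset": DATASETS,
    "one DOI per article entry and per dataset": ARTICLES + DATASETS,
}

for label, count in scenarios.items():
    print(f"{label}: {count} identifiers, ${doi_cost(count):.2f}")
```

At the current scale the fee stays under $100 in every scenario, so the cost concern is mostly about growth and about fine-grained citation points rather than today's holdings.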


 * 2) What about a hierarchical system that uses DOIs at one level and handles at another level?
 * * This was originally proposed by Rod Page.
 * * It could be confusing for authors.

Background
Some types of data can become quite large, which will increase storage costs. For example:
 * MP3 audio files are about a megabyte per minute.
 * DVD-quality video is about 2 gigabytes per hour.
 * Carol Eunmi Lee's genomics data is about 750 gigabytes per 3-year project.

Storage required for Dryad

 * currently 45 articles
 * currently 173 datasets
 * currently 200MB storage space (10MB database plus 160MB bitstreams)
 * datasets per article:
 * * The distribution is bimodal. A majority of articles contain one dataset, but there is a small group of articles that each have 20 or more datasets.
 * * mean: 3.8
 * * median: 1
 * * min: 1
 * * max: 30
 * storage per dataset:
 * * mean: 1.2MB
 * * median: 180KB
 * * min: 5KB
 * * max: 100MB (but could easily be higher, see above)


 * minimum number of replicas: 3
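
The averages above and the effect of replication can be checked with a few lines of arithmetic. This is only a sketch using the figures on this page. It uses the stated 200MB total rather than the sum of the itemized parts (170MB), and the function name replicated_storage_mb is illustrative.

```python
# Current holdings as reported on this page.
ARTICLES = 45
DATASETS = 173
TOTAL_MB = 200        # stated total for a single copy
MIN_REPLICAS = 3


def replicated_storage_mb(single_copy_mb, replicas=MIN_REPLICAS):
    """Return the storage needed to keep `replicas` full copies."""
    return single_copy_mb * replicas


mean_datasets_per_article = DATASETS / ARTICLES
mean_mb_per_dataset = TOTAL_MB / DATASETS
footprint_mb = replicated_storage_mb(TOTAL_MB)

print(f"{mean_datasets_per_article:.1f} datasets per article (mean)")
print(f"{mean_mb_per_dataset:.1f} MB per dataset (mean)")
print(f"{footprint_mb} MB across {MIN_REPLICAS} replicas")
```

Both means agree with the figures listed above (3.8 and 1.2MB), and three replicas of the current holdings need about 600MB in total.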

Cost required for each replica

 * rough cost:
 * * basic hardware cost: $1000 per terabyte (including server and storage); the hardware would need to be replaced every 3 years
 * * hosting facility/sysadmin cost: $200 per month (one server and storage)

Overall estimate (hardware and support): about $500 per month for each 10 terabytes.
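
The overall estimate can be reproduced from the two line items. The sketch below assumes the hardware cost is spread evenly over its 3-year lifetime and that a single $200 hosting fee covers the full 10 terabytes. Both are assumptions, as is the function name monthly_cost_per_replica.

```python
# Line items from this page.
HARDWARE_USD_PER_TB = 1000
HARDWARE_LIFETIME_MONTHS = 36     # replaced every 3 years
HOSTING_USD_PER_MONTH = 200       # assumed to cover one server of any size
MIN_REPLICAS = 3


def monthly_cost_per_replica(terabytes):
    """Return the monthly cost in USD of one replica of the given size."""
    # Amortize the hardware purchase evenly over its lifetime (assumption).
    hardware = terabytes * HARDWARE_USD_PER_TB / HARDWARE_LIFETIME_MONTHS
    return hardware + HOSTING_USD_PER_MONTH


one_replica = monthly_cost_per_replica(10)
all_replicas = one_replica * MIN_REPLICAS

print(f"one replica, 10 TB: ${one_replica:.2f} per month")
print(f"{MIN_REPLICAS} replicas, 10 TB each: ${all_replicas:.2f} per month")
```

Under these assumptions one 10-terabyte replica costs roughly $478 per month, consistent with the $500 figure, and three replicas cost about $1,433 per month.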

Questions to answer

 * 1) Should Dryad store any dataset for free, regardless of size/cost? Or should there be a size limit above which a storage fee is charged?

Data submission and modification authority
This section still needs to be revised to describe:
 * the different levels of authentication, and the benefits and costs of each
 * how more hands-on curation leads to higher costs


 * 1) When submitting data, should login be required?
 * 2) Should submissions be approved by a human curator, or should they go live immediately?
 * * If approval is required, the author will not get a handle until the item is approved.
 * 3) Who should have the authority to modify documents?