Preservation Policy

This page collects notes for Dryad's emerging Preservation Policy. A template letter for use in responding to questions about the permanence and stability of the repository can be found here.

Possible Policy Structure
Please see the [[Media:DCC_Preservation_policy_template.pdf|DCC Preservation Policy Template (PDF)]] and OpenDOAR policy creation tool for examples of inclusive preservation policy structures.

Preferred Formats
Non-proprietary, publicly documented formats are preferred whenever possible.


 * Text: plain text (ASCII, UTF-8), PDF/A, comma-separated (or otherwise delimited) values, Open Office formats, XML
 * Discipline-specific plain text formats, such as nexus files, matrices, alignments, sequence outputs, scripts, etc., are welcome. Standard exchange formats are encouraged (see BioSharing catalogue for examples).
 * Data submitted to Dryad should always be optimized for reanalysis and reuse. For example, for a table of values, a CSV file is preferable to a PDF of the same table. Dryad curators may contact depositors to request alternate formats when a non-optimal format is submitted.
 * Image: PDF/A, JPEG/JPEG2000, PNG, TIFF, SVG (no Java)
 * Audio: FLAC, AIFF, WAVE
 * Video: AVI, M-JPEG2000
 * Compressed/archived formats: GZIP/TAR, ZIP
 * Files should only be compressed and/or archived when necessary because of large file size or the need to gather files together in a particular directory structure in order for them to be understood.
 * Readme: Readme files are restricted to plain text or PDF formats and, if submitted in another format, will be converted into one of these by the Dryad curator.

Format Support Levels

 1) Full Support
    * Dryad will make a best effort to ensure the full functionality of these files into the future, including format transformation/conversion as needed
    * Applied to all preferred formats and widely used proprietary formats (e.g. Microsoft Office formats)
    * Incomplete list of fully supported formats:
      * Text: TXT (ASCII, UTF-8), PDF/A, RTF, CSV/TSV/TAB, Open Office formats (ODS, ODP, ODT), Microsoft Office formats (DOC, DOCX, XLS, XLSX, PPT, PPTX), HTML/XHTML, XML, Nexus formats (NEX, NEXML)
      * Image: PDF/A, JPEG/JPEG2000, PNG, TIFF, SVG (no Java)
      * Audio: FLAC, AIFF, WAV, MP3
      * Video: AVI, M-JPEG2000, MP4
 2) Limited Support
    * Dryad will make some effort to ensure the future readability and functionality of these files, but there may be some loss
    * Applied to uncommon formats for which we are less confident there will be robust conversion tools
    * Also applied to PDF/A-noncompliant PDFs
 3) As-Is Bitstream Access
    * Dryad preserves the bitstream as-is and makes no promises regarding format transformation or future readability of the file
    * Most commonly applied to software or other non-data files
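
As a rough illustration of how the three support levels above might be assigned during curation, the following Python sketch maps a file name to a level. It is only a sketch under stated assumptions: the function name `support_level`, the `pdfa_compliant` and `is_data` flags, and the exact file extensions chosen for each listed format are hypothetical and are not part of any Dryad system. PDF/A compliance cannot be read from the extension alone, so it is passed in as a flag.

```python
from pathlib import Path

# Extensions assumed (for illustration) to correspond to the "incomplete list
# of fully supported formats" above. PDF is handled separately because only
# PDF/A-compliant files receive full support.
FULL_SUPPORT_EXTENSIONS = {
    # Text
    "txt", "rtf", "csv", "tsv", "tab", "ods", "odp", "odt",
    "doc", "docx", "xls", "xlsx", "ppt", "pptx",
    "html", "xhtml", "xml", "nex", "nexml",
    # Image
    "jpg", "jpeg", "jp2", "png", "tif", "tiff", "svg",
    # Audio
    "flac", "aif", "aiff", "wav", "mp3",
    # Video
    "avi", "mj2", "mp4",
}


def support_level(filename, *, pdfa_compliant=False, is_data=True):
    """Return "full", "limited", or "as-is" for a deposited file.

    Hypothetical helper: `is_data=False` stands in for "software or other
    non-data files", which most commonly receive as-is bitstream access.
    """
    if not is_data:
        return "as-is"
    extension = Path(filename).suffix.lower().lstrip(".")
    if extension == "pdf":
        # PDF/A-noncompliant PDFs fall under limited support.
        return "full" if pdfa_compliant else "limited"
    if extension in FULL_SUPPORT_EXTENSIONS:
        return "full"
    # Uncommon formats without robust conversion tools.
    return "limited"


if __name__ == "__main__":
    for name in ("table.csv", "figure.pdf", "model.xyz"):
        print(name, support_level(name))
```

In practice a curator's judgment would override a lookup like this, since the policy applies as-is access "most commonly" rather than always to non-data files.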

Working Outline Using DCC Template
See [[Media:DCC_Preservation_policy_template.pdf|DCC Preservation Policy Template (PDF)]] for description of the template sections.

Aim

 * "A clarification of the mission to preserve."
 * To be pulled from other repository documentation.

Standards

 * "What standards, frameworks, and models for digital preservation will be used?"
 * Discussion of CLOCKSS here?
 * Would like to get input from SILS archives specialists (Cal Lee has said he's willing to help), repository managers, and digital library folks.

Content coverage

 * To be pulled from other repository documentation.
 * Data in Dryad must be associated with a publication in the biosciences. Will we definitely be expanding beyond peer-reviewed journals to include theses and dissertations, non-peer-reviewed journals, conference proceedings, etc.?
 * How to phrase Dryad's inclusion of software and other non-data content?
 * Include vague statement that Dryad may exclude content based on lack of scientific merit.

Overview of preservation strategy

 * Statement of preferred formats.
 * Non-preferred formats will be accepted. A version of the file converted into a preferred format will be provided, as well as access to the original bitstream.
 * Formats will receive varying levels of support (see next section).
 * Periodic evaluation for outmoded formats and mass transformation of files.
 * Information about physical media, CLOCKSS, DataONE, etc.?

Methods / level of preservation

 * Information about the levels of support for different formats.
 * Also need to state here how long content will be preserved. How do we want to phrase this? Is it uninformative to say things like "in perpetuity"?

Implementing the strategy (operational details)

 1) Procedures for preservation
    * When a data file is submitted in a non-preferred format, a Dryad curator will convert it into the most appropriate preferred format. Both formats will be made available, labeled as original deposited file and transformed file for preservation.
    * When a readme file is uploaded in a non-preferred format, the Dryad curator will convert it into a plain text file (if no content will be lost) or PDF (if there is formatting that is part of the content). Both formats will be made available, labeled as original deposited file and transformed file for preservation.
    * Transformation of outmoded formats will take place as needed, with an annual evaluation of formats across the repository.
 2) Security, authenticity, and integrity
    * Do we have plans beyond the checksum that we currently use?
    * Can we make this checksum publicly available in the metadata? (currently it is hidden when visiting the site, though it may be in the metadata available for harvest)
 3) Media refreshment
    * How and when will storage media be refreshed? Can we be more specific than "as needed"?
 4) Versioning
    * Rolling out this feature soon and will be able to describe.
 5) Withdrawal of items
    * Content may be temporarily hidden at the request of the author, journal, or publisher, and may be permanently withdrawn from the archive if rights infringements or private human subject information are discovered.
    * In cases of withdrawal, the metadata is also removed, and a tombstone page will be displayed (currently the page just says "restricted" but DSpace developers are working on allowing an explanation to be added to this page).
    * Correcting errors in uploaded files is accomplished through versioning, and the error-containing files will not be removed.
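
For the checksum mentioned under Security, authenticity, and integrity, a periodic fixity check could look something like the Python sketch below. This is an assumption-laden illustration, not a description of the repository's current mechanism: these notes do not say which checksum algorithm is in use, so SHA-256 is an arbitrary choice here, and the function names `file_checksum` and `verify_fixity` are hypothetical.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Compute a hex digest of the file at `path`, reading it in chunks.

    The default algorithm is an assumption; any name accepted by
    hashlib.new() can be passed instead.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path, recorded_checksum, algorithm="sha256"):
    """Return True if the file still matches the checksum recorded at deposit."""
    return file_checksum(path, algorithm) == recorded_checksum.strip().lower()
```

A check of this kind only detects that a bitstream has changed. It does not say anything about whether the format remains readable, which is what the annual format evaluation addresses.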

Sustainability plans
Seeking input from Todd, Peggy, Ryan. This section will mention DataONE and CLOCKSS, and the ability to point DOIs to new locations if needed.

Questions and Feedback

 * Changes to the preferred formats list?
 * Changes to the list of fully supported formats?
 * Comments on the Procedures for preservation above -- conversion to preferred formats during initial curation, annual evaluation of formats and mass transformation.

Resources

 * DCC Preservation Policy Template (2010) [[Media:DCC_Preservation_policy_template.pdf|PDF]]
 * OpenDOAR policy creation tool
 * Ex Libris Rosetta preservation software
 * XENA format migration tool
 * PRONOM technical registry
 * Library of Congress Sustainability of Digital Formats planning documentation
 * DataONE preservation policy