Old:Summer 2009 Curation Workflow

From The Dryad data repository wiki
STATUS: This page is no longer being maintained and is of historical interest only.

Outline and Overview

This is a very informal listing of ideas and topics that have come up while contemplating the curation process for Dryad. Sarah Carrier has been putting together this page. Links to other pages give more formal declarations.

Please see also the new cataloging guidelines. Here is a link to requirements.

Please also see the architecture review group entry on the DSpace wiki - particularly Mark Diggory's comment about curation:

  • The DSpace administration UI is very poor. It doesn't expose all the curatorial functions needed for the system. It provides no way to define local policies for the DSpace-based service or means of enforcing them. It has no serious reporting infrastructure to inform curators about the state of the archive and its contents. It provides poor tools for editing metadata about collections, items and bitstreams, nor does it have good support for withdrawing and deleting items from collections. Some provenance metadata is captured by the History system (which is undergoing significant work at the moment), but this metadata is largely unavailable to curators as a means to help them manage the archive. While this may not have major implications for the DSpace architecture, any significant development effort should include a review of the functionality and user interfaces provided to this critical part of the system.

And see the entry for BitstreamFormat Renovation: This page proposes a set of changes to improve the representation of file formats in the content model, in order to better support preservation activities.

General Policy Ideas

  1. Issue of ADDING metadata - metadata can be added if, for example, an (optional) field is not completed.
  2. Issue of EDITING metadata - how much metadata should the curator edit, if the author/depositor has created it?
    • This could be considered an "ownership" issue.
    • The curator can correct what is "obviously" wrong - what are the limits of a curator's ability to edit, if any?
    • Feedback from management board: author/depositor-provided metadata may be edited by the curator.
  3. If a file is in a proprietary format, it should be converted to a "least common denominator" format, preferably an open one.
  4. SPECIES NAMES: idea - include both the COMMON NAME and the SCIENTIFIC NAME (from a controlled vocabulary). Justification: from the user's perspective - they could potentially search by either. DECISION: right now, we will accept either, as long as they are correct. QUESTION TO RESOLVE: if the species name is a subject keyword, do we repeat it in the taxonomic keywords?
  5. AUTHOR NAMES: full first name, middle initial if available, and last name.
  6. CITATION STYLE: Until imported from ISI or some source (this is still questionable), we will choose a standard from one of the partner journals, and go with that.
  7. GEOGRAPHIC NAMES: to be determined...most likely the same as species names - we will accept anything, even if it is informal, as long as it is correct.
    • An example from NOAA Paleoclimatology Program: "Atlantic, Pacific, and Indian Oceans"
  8. Recommendation for DATASET TITLES: there has to be some uniformity - if this can be done as automatically as possible, so much the better for the depositor and the curator.
    • RECOMMENDATION: First author's last name, et al. (if necessary), Journal Name (YEAR) 1, 2, 3, (auto number increment).
      • EXAMPLE: Gibson, et al., Genetics (2009) 2
    • The file name - if there is something usable here, it should go into DESCRIPTION.
    • See also: the National Virtual Observatory - their datasets use the following convention: "Asteroids II Machine-Readable Data Base (Binzel+ 1987)" (what appears to be a journal article title, the first author's name with a + if there are more authors, and the year). Something similar would be useful for Dryad.
      • Another example: "The planet-metallicity correlation (Fischer+, 2005)" for journal article "The planet-metallicity correlation.", Fischer D.A., Valenti J., <Astrophys. J., 622, 1102-1117 (2005)>
    • After a meeting regarding citation in July 2009, it has been decided that we will recommend using meaningful data file naming conventions in the "Best Practices," but we will take whatever the depositor would like to enter for a data file title.
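The title pattern recommended in item 8 could be generated automatically. The following is a minimal sketch only, assuming the author last names, journal name, year, and sequence number are already available from the publication metadata; the function name `dataset_title` and its parameters are hypothetical and not part of Dryad.

```python
def dataset_title(author_last_names, journal, year, sequence):
    """Build a title such as 'Gibson, et al., Genetics (2009) 2'.

    author_last_names: list of last names in publication order (assumed input).
    sequence: the auto-incremented number of the data file within the publication.
    """
    if not author_last_names:
        raise ValueError("at least one author is required")
    first = author_last_names[0]
    # 'et al.' is added only when there is more than one author.
    authors = f"{first}, et al." if len(author_last_names) > 1 else first
    return f"{authors}, {journal} ({year}) {sequence}"


print(dataset_title(["Gibson", "Smith"], "Genetics", 2009, 2))
```

Since the July 2009 decision is to accept whatever title the depositor enters, a generated title like this would only serve as a suggested default.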

Current Submission Workflow

[Figure: CurrentWorkflow.jpg (diagram of the current submission workflow)]

There are two current submission workflows that the author/depositor would undertake: one, where the publication metadata is automatically imported, and two, where the publication metadata has to be manually entered before upload of data.

Required fields to describe the publication: title, authors, journal name. Required fields for the dataset: title (authors, keywords, etc. inherited from publication).

Places where curator enhances metadata creation:

  1. ADD optional metadata
  2. Confirm accuracy and correct author-created metadata, metadata from journal
  3. SUPPLEMENT author-created metadata, i.e., add more keywords, etc.
  4. Use tools, controlled vocabularies to enhance metadata

Curator Tasks and Responsibilities

Tasks mentioned in the grant proposal

  • To maintain both data integrity and metadata quality, data curators will validate and, if necessary, edit, submissions to the repository. Data will be automatically and manually curated to ensure validity of the digital assets and thus their reusability.
    • Curation will be assisted by custom software for metadata quality assessment, as well as existing software such as JHOVE and Xena for format validation and migration. We will study and incorporate methods for automatic measurement of metadata quality by drawing on and extending work of the AMeGA and Infomine projects. Dryad will provide an empirical measure of metadata quality for each metadata record using a variety of metrics (for instance, the match rate between a document and a controlled vocabulary). The rating will help the curator determine which metadata records require review and whether the original depositor needs to be contacted.
    • Participate in the metadata generation and quality evaluation studies.
  • A retrieval interface will be developed that uses the available metadata more fully, and also uses both existing and newly developed and relevant vocabularies to augment queries. Automatic extraction and controlled vocabulary term matching processes will be used for assigning metadata values, which can then be verified or augmented by the user or the data curator.
  • Datasets of special educational value will receive extra curatorial attention and be presented for student use through a dedicated education section of the repository... The curators will also target a limited number of special data packages (i.e. those that are frequently downloaded by users, or those that are particularly suitable for educational purposes) for a higher level of curatorial attention.
    • Data curators will select a limited number of datasets (1-2 per year) to receive extra curatorial attention, based on popularity or thematic area. Preference will be given to datasets likely to have strong resonance with students (on topics such as the evolution of antibiotic resistance or viral pathogenicity, domestication of companion animals, human origins, origin of life, etc). Curators will work with authors, and with the NESCent Education and Outreach Group, to provide detailed metadata, more extensive background and related material, and a set of suggested exercises appropriate for each dataset. Resources will be targeted at the Advanced Placement, college, and graduate levels.
  • Further tasks: communicating with authors/journals when problems arise, helping to verify the usability of metadata, overseeing data format migration, serving as a help desk for depositors, and presenting tutorials on the use of the repository at the annual meetings of the consortium societies.
    • Dryad tutorials, designed for active investigators in the field, will be prepared by the data curators with the assistance of other project personnel, and presented at the scientific conferences deemed most appropriate by the MB (2-3 conferences/yr). The aim of the tutorials will be to explain the role of NESCent relative to the journals and specialized databases, to demonstrate the deposition and retrieval interface, and to assist authors in increasing the extent and quality of the metadata provided by raising their awareness of metadata in general.
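The grant proposal mentions the match rate against a controlled vocabulary as one possible metadata quality metric. The sketch below is one simple reading of that idea, assuming the record's keywords and the vocabulary are plain lists of strings and that matching is exact after trimming and lower-casing; the function name and the matching rule are assumptions, not the project's actual metric.

```python
def vocabulary_match_rate(keywords, vocabulary):
    """Return the fraction of keywords found in the controlled vocabulary.

    Matching is case-insensitive and ignores surrounding whitespace
    (an assumption made for this sketch). Returns 0.0 when the record
    has no usable keywords.
    """
    normalized_vocab = {term.strip().lower() for term in vocabulary}
    cleaned = [k.strip().lower() for k in keywords if k.strip()]
    if not cleaned:
        return 0.0
    matched = sum(1 for k in cleaned if k in normalized_vocab)
    return matched / len(cleaned)


rate = vocabulary_match_rate(
    ["Homo sapiens", "mouse", " Zea mays "],
    ["homo sapiens", "zea mays"],
)
print(f"{rate:.2f}")
```

A low rate could be used to flag a record for curator review, in line with the rating described above.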

Tasks and things mentioned at Management Board Meeting

  1. Re: legacy publications/datasets -> who has the authority to submit data, associate with publications? This has to be an author/corr. author - and has to be verified by a curator.
    • There will possibly be a pin system for editing metadata -> the curator will verify if it is an author, coauthor, or if they have a pin.
  2. The datasets that are used the most will be more highly curated; also, problems in the metadata will be quantified so that they are noticed and flagged for the curator.
  3. Question that was not answered: Should submissions be approved by a human curator, or should they go live immediately? If approval is required, handles will not be assigned until the item is approved. One option is to make curator approval contingent on confirmation of the publication (the existence of a DOI in the case of new ones).
  4. Overall, there seems to be general agreement that the curator can heavily edit author-supplied metadata if required.

Other ideas

  1. Idea for DOIs for publication, to ensure consistency: check for "http://dx.doi.org/" - many times this will not be included by the depositor.
  2. Name authority - this needs to be done ASAP
    • Example: Rebecca M. Riley versus Rebecca Riley-Berger
    • Some general comments on authority control follow (--Janeg@ils.unc.edu 00:32, 19 May 2009 (EDT))
      • issues w/personal names include 1. form of name [the above example w/Rebecca Riley is good]; 2. fullness of name [Rebecca M. Riley; R. M. Riley]; and 3. entry element/order of components [Riley, Rebecca (inverted); Rebecca Riley (direct)]. The issue with Riley and Riley-Berger is also CHANGE in name - here, married name versus maiden name.
      • authority control is for a host of named entities: people, organizations, geographic jurisdictions, and structures (Biosphere 2 Center @ Columbia University is a structure).
        • Example for an organization: authorized form [MARC 110]: National Center for Ecological Analysis and Synthesis; see from reference [MARC 410]: NCEAS; see the LC authority control record @: http://authorities.loc.gov. See LCNAF RECORD for NCEAS
      • there is authority control for titles, but this would be pretty difficult to maintain in Dryad for most data objects, although it's likely that some key datasets will warrant some form of authority control.
  3. Issues with variability in TITLES
    • Article titles - do all nouns need to be capitalized, or does this variability matter?
    • Data set titles - the default is the name of the file, but should the curator try to at least remove the file extension? (Filenames are rarely informative - we should consider using a unique pub string + datafile1, 2, 3, etc --Tjvision 09:39, 19 May 2009 (EDT))
  4. What to do about articles WITHOUT keywords? -> go to ISI to see keywords assigned by this service, to PubMed to get MeSH terms. Also, this is why a background in biology is important. (The issue w/articles should be resolved by search/retrieval options --Janeg@ils.unc.edu 00:32, 19 May 2009 (EDT)) Jane - could you elaborate on this? --Sarah
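The DOI consistency check from item 1 above could look roughly like the following. This is a sketch under stated assumptions: the stored form is taken to be the full "http://dx.doi.org/" URL, a leading "doi:" label is also accepted as depositor input, and the function name `normalize_doi` is hypothetical. The only validation is that the identifier begins with "10.", as all DOIs do.

```python
DOI_RESOLVER = "http://dx.doi.org/"


def normalize_doi(value):
    """Return the DOI as a full resolver URL, adding the prefix if it is missing."""
    doi = value.strip()
    lowered = doi.lower()
    # Strip a resolver prefix or a 'doi:' label if the depositor included one.
    for prefix in (DOI_RESOLVER, "doi:"):
        if lowered.startswith(prefix):
            doi = doi[len(prefix):].strip()
            break
    if not doi.startswith("10."):
        raise ValueError(f"not a DOI: {value!r}")
    return DOI_RESOLVER + doi


# The DOI below is a made-up placeholder, not a real identifier.
print(normalize_doi("10.1234/example"))
```

Values that fail the check would be left for the curator to correct by hand.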

What the curator WILL NOT do

  • Curators will not be expected to validate the biological correctness of the data itself, or to determine the completeness of each data package.

Aids to Assist the Curator

Tools

  • Identifying data (for example, where it is located, what formats it is in)
  • Describing data (for example, automated metadata creation)
  • Manipulating data (for example, data management, data storage, repositories)
  • Preserving data (for example, migration)
  • Data registration
  • Documentation of commonly used terms and concepts
  • Rights management and access control


Detect and verify file formats

See the DSpace wiki entry for data file formats, particularly the section on preservation.

  • JHOVE: JHOVE provides functions to perform format-specific identification, validation, and characterization of digital objects. JHOVE is implemented as a Java application, written to conform to J2SE 1.4, using the Sun SDK 1.4.1. JHOVE can be invoked with two interfaces:
    1. A command-line interface
    2. A Swing-based GUI interface
    • From DCC: Level of Expertise required for use of this tool: High. Technical knowledge of applications, APIs (Application Programme Interfaces) to implement the tool. Knowledge of OAIS model, in particular the concept of representation information is of benefit.
    • DSpace wiki entry for JHOVE integration - important, TechMDExtractor
  • PRONOM: PRONOM is an on-line information system about data file formats and their supporting software products. Originally developed to support the accession and long-term preservation of electronic records held by the National Archives, PRONOM has been made available as a resource for anyone requiring access to this type of information.
    • DROID (Digital Record Object Identification) is a software tool developed by The National Archives to perform automated batch identification of file formats. DROID is a platform-independent Java tool, which is freely available to download under an open source license.
    • GDFR/UDFR: The Global Digital Format Registry (GDFR) provides sustainable distributed services to store, discover, and deliver representation information about digital formats. An agreement was forged to bring together the two registry efforts under a new name - the Unified Digital Formats Registry (UDFR). The registry would support the requirements and use cases of the larger community compiled for GDFR and would be seeded with PRONOM's software and formats database. Change proposed April 2009.
  • AONS (Automated Obsolescence Notification System) notifies repository managers about formats within digital resources in their repositories and alerts them to potential problems relevant to obsolescence and long term usage.

Preservation tools

  • LOCKSS: Lots Of Copies Keep Stuff Safe. This is a freely available preservation service that works on the principle that by persistently caching multiple copies of web serials over multiple sites, the chances of a particular object being preserved are greatly increased. It is used by libraries to preserve their content over the long term. The software is cheap and easy to use, and any institution can get involved.
  • Metadata Extraction Tool developed by the National Library of New Zealand. It is designed to:
    1. automatically extract preservation-related metadata from digital files
    2. output that metadata in a standard format (XML) for use in preservation activities
    • The Metadata Extraction Tool includes a number of 'adapters' that extract metadata from specific file types. Extractors are currently provided for:
      1. Images: BMP, GIF, JPEG and TIFF.
      2. Office documents: MS Word (version 2, 6), Word Perfect, Open Office (version 1), MS Works, MS Excel, MS PowerPoint, and PDF.
      3. Audio and Video: WAV and MP3.
      4. Markup languages: HTML and XML.
    • If a file type is unknown the tool applies a generic adapter, which extracts data that the host system 'knows' about any given file (such as size, filename, and date created).
  • List of tools for preservation metadata from PREMIS site. Highlights:
  • SOAPI (Service-Oriented Architecture for Preservation and Ingest of digital objects) is developing an architecture and toolkit for partial automation of preservation and ingest workflows in digital repositories - a JISC project
  • REMAP is developing a model and a tool to embed records management and preservation within the repository workflow - another JISC project
File migration and conversion
  • PADI migration document
    • Move from proprietary formats, ex. Word into Plain Text - "lowest common denominator format"
    • Preservation - as formats become obsolete, conversion takes place if possible
  • Some examples:
  • CRiB is a Service Oriented Architecture (SOA) designed to assist cultural heritage institutions in the implementation of migration-based preservation interventions. The CRiB system works by assessing the quality of distinct conversion applications or services to produce recommendations of optimal migration strategies. The recommendations produced by the system take into account the specific preservation requirements of each client institution.
  • IDEA: from the Digital Curation Blog: OpenOffice as a migration tool -> formats currently supported by OpenOffice:
    • Microsoft Word 6.0/95/97/2000/XP (.doc and .dot)
    • Microsoft Word 2003 XML (.xml)
    • Microsoft WinWord 5 (.doc)
    • StarWriter formats (.sdw, .sgl, and .vor)
    • AportisDoc (Palm) (.pdb)
    • Pocket Word (.psw)
    • WordPerfect Document (.wpd)
    • WPS 2000/Office 1.0 (.wps)
    • DocBook (.xml)
    • Ichitaro 8/9/10/11 (.jtd and .jtt)
    • Hangul WP 97 (.hwp)
    • .rtf, .txt, and .csv
Normalizing tools
  • Xena: "Xena can convert any data object into an ASCII representation containing XML metadata, via Base64 encoding. This is known as 'binary normalisation' and is fully reversible when there is a need to re-create an original data object. Xena can also convert data objects into openly specified file formats, such as XML or PNG, in a process known as 'normalisation.' These normalised files may be accessed via the Xena viewer, or exported for use with other applications."
  • kopal Library for Retrieval and Ingest (koLibRI) represents a library of Java tools that have been developed for the interaction with the DIAS system of IBM within the kopal project. It has been designed with the intention to be re-usable as a whole or in parts within other contexts, too.
Emulators
  • Dioscuri is an x86 computer hardware emulator written in Java. It is designed by the digital preservation community to ensure documents and programs from the past can still be accessed in the future.

Tools to build in house

  • A metadata extractor that looks for key phrases in the article PDF (e.g., "locations of the specimens") which may indicate datasets underlying the article.
    • The extractor can be configured to search for arbitrary phrases.
    • The extractor can be run by a curator pressing a button on the publication page. The results page contains a list of matching phrases in one column, with a list of dataset titles in the second column.
  • Basic implementation:
    • convert PDF to text
    • search within the text for phrases matching the list of target phrases
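The second step of the basic implementation above can be sketched as follows. The sketch assumes the PDF has already been converted to plain text by some external tool (that step is not shown); the function name `find_phrases` and the 40-character context window are assumptions for illustration.

```python
import re


def find_phrases(text, phrases, context=40):
    """Return (phrase, snippet) pairs for every occurrence of a target phrase.

    Whitespace in the text is collapsed first, so that phrases broken
    across lines in the converted PDF text are still found. Matching is
    case-insensitive.
    """
    normalized = re.sub(r"\s+", " ", text)
    hits = []
    for phrase in phrases:
        pattern = re.compile(re.escape(phrase), re.IGNORECASE)
        for match in pattern.finditer(normalized):
            start = max(0, match.start() - context)
            end = min(len(normalized), match.end() + context)
            hits.append((phrase, normalized[start:end]))
    return hits


article_text = "The locations of the\nspecimens are listed in Table S1."
for phrase, snippet in find_phrases(article_text, ["locations of the specimens"]):
    print(phrase, "->", snippet)
```

The list of target phrases is passed in as an argument, which keeps the extractor configurable for arbitrary phrases as described above. The snippets could populate the first column of the results page.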

Other - funding for tool development

  • CASPAR, Cultural, Artistic and Scientific knowledge for Preservation, Access and Retrieval - is an Integrated Project co-financed by the European Union within the Sixth Framework Programme (Priority IST-2005-2.5.10, "Access to and preservation of cultural and scientific resources")
  • Planets, Preservation and Long-term Access through Networked Services, is a four-year project co-funded by the European Union under the Sixth Framework Programme to address core digital preservation challenges. The primary goal for Planets is to build practical services and tools to help ensure long-term access to our digital cultural and scientific assets. Planets started on 1st June 2006.
  • NDIIPP (the National Digital Information Infrastructure and Preservation Program)
  • Joint Information Systems Committee (JISC)

Controlled Vocabularies and Terminological Resources

Current Interface/Functionality Issues Noted and Suggestions, General Questions

  1. When uploading data objects, there is the possibility that:
    1. Duplicates are inadvertently added, particularly when uploading many at once.
    2. Errors are made with the metadata, or the wrong file is uploaded, and the depositor wants to correct them before finalizing the submission.
    • Before submission, the author/depositor should have the option to change and fix this information - just as there is an option to edit publication metadata.
  2. ISSUE: when you try to upload a READ ME for a data object, the title is reverted back to the default (the name of the file). If you have entered a unique title, this is then lost.
  3. If the submission is finalized and the author/depositor wants to add more datasets to the publication, they should be able to go back in and do this -> an authorization issue.
  4. ISSUE: when adding keywords, etc. to the publication metadata page, the corresponding author keeps reverting back to the first author listed.
  5. QUESTION: what is expected when the depositor is asked for a further description of the publication?
  6. ISSUE: with copying and pasting, there are a lot of issues with special characters. This needs to be dealt with. (It can be the curator's task to fix these).
  7. Suggestion: The second link under "This item appears in the following..." to "Show full item record" seems either unnecessary or it's not in the right place.
  8. ISSUE: there is a problem with the display of search results - for example, when searching full text for "Gibson," the same record is repeated in the results list over and over.
  9. QUESTION: don't we need a field for page numbers for the article??

Ideas for Curator Interface

  1. Feature: upon login, a list of records created since last login. "These data objects need approval before publication."
  2. Feature: a more streamlined batch edit view? --Tjvision 09:44, 19 May 2009 (EDT)
    • Batch Processes and Queries for the Curator
      • Finding duplicate data objects - right now in the system it is too difficult to find these. A helpful batch process would be to withdraw a number at once, to also have the option of withdrawing all the data objects associated with one publication, etc.
      • Add one READ ME file for a number of data objects - this could also be an option for the author/depositor. Also, other ways to do batch editing of data objects. Right now, you have a list of data objects with links, and you have to look at each individual data object in order to edit it.
      • Data set titles -> in a batch, add author names or part of the article title to a set of data objects, or whatever is decided to be appropriate for data object titles.
      • We need to be able to mimic the inheritance that takes place during deposition - if we make changes to the publication metadata, we should have the option to apply it to all associated datasets. The curator can be asked, "Do you want to apply these changes to associated datasets?", yes or no.
  3. Curators should be able to view lists of items that need additional attention:
    • articles that have no associated datasets
    • articles that do not have full bibliographic metadata (volume, number, DOI)
  4. Feature: integration with curator tools - ability to run JHOVE, for example, from the interface.
  5. List of high profile, high use datasets, updated continually -> these will require higher curation focus
  6. Feature: I would like to see a feature like the one in ContentDM where, as new metadata is entered, it becomes part of a controlled vocabulary - the curator can have the option to add or delete items from this CV, but this would be a very good interim solution until HIVE comes into play, and would help with building a name authority for authors. For example, a depositor adds "Ryan Scherle" as an author, and the curator sees that this name is in the queue to be added to the author CV. The curator approves this. The question then is how it is used/displayed to the user - when they are typing, for example, the name can appear as a suggestion, as with other CV terms.
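For the batch process "Finding duplicate data objects" listed under item 2, one possible approach is to group uploaded files by a checksum of their content. This is only a sketch: it assumes the file contents are available as bytes keyed by data object name, that duplicates are byte-for-byte identical, and the function name `find_duplicates` is hypothetical.

```python
import hashlib
from collections import defaultdict


def find_duplicates(data_objects):
    """Group data objects whose content is byte-for-byte identical.

    data_objects: mapping of data object name -> file content as bytes.
    Returns a list of name groups, each containing two or more duplicates.
    """
    by_digest = defaultdict(list)
    for name, content in data_objects.items():
        by_digest[hashlib.sha256(content).hexdigest()].append(name)
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


uploads = {"a.csv": b"1,2", "b.csv": b"1,2", "c.csv": b"3"}
print(find_duplicates(uploads))
```

Each returned group could then be offered to the curator for withdrawal in a single batch step.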

See the Draft mockups for curator interface.

Other

  1. Time with current interface, with very few submissions completed via the web interface:
    • Time to curate 1 submission: 1 hour
    • 4 datasets: 1 hour
    • 39 datasets: 4-5 hours (at least - it takes a lot of time to look at each data object record, having to click on each individual link).
      • Time to verify the accuracy of the metadata, that the person is who they say they are, finding the DOI, etc.; correction of the metadata
  2. Note: workflow software - metadata generated by these tools - can it be used?
  • RECOMMENDATION FOR DATA CITATIONS
    • Example from NOAA Paleoclimatology Program: lists both a "suggested data citation" and the "original reference" (which is the citation for the article) - it states that the original reference "should be" used when citing data.
  • RECOMMENDATION FOR VERSIONING
    • METADATA VERSIONING: The recommendation is to always keep the original version that is submitted by the depositor - keep all the metadata, etc. - this can always be reverted back to and/or used as a reference. Further changes made to the metadata by the curator will not be tracked. Only the most "up to date" version will be displayed to the users, with the original version available via the curator interface.
    • DATASET VERSIONING: When changes/corrections are made to the actual contents of a dataset, and a version has already been published in Dryad, the NEW version should be considered a new, unique entity and therefore assigned a new unique identifier. The following Dublin Core elements should be used to relate the various versions of the dataset: dc.relation.replaces, dc.relation.isreplacedby, dc.relation.isversionof. The actual linking of the datasets via these elements will most likely be done manually by, or at least heavily supervised by, the curator.
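The dataset versioning recommendation can be illustrated with a small sketch of how the three Dublin Core elements might link two records. It assumes each record is a plain dictionary with its identifier stored under "dc.identifier"; that key, the function name `link_versions`, and the choice of which record carries which element are assumptions, since the recommendation above does not spell them out.

```python
def link_versions(old_record, new_record):
    """Relate a newly deposited dataset version to the one it supersedes.

    Both records are modified in place. The new record points back to the
    old one (replaces, isversionof); the old record points forward to the
    new one (isreplacedby).
    """
    old_id = old_record["dc.identifier"]
    new_id = new_record["dc.identifier"]
    if old_id == new_id:
        raise ValueError("a new version must have a new unique identifier")
    new_record.setdefault("dc.relation.replaces", []).append(old_id)
    new_record.setdefault("dc.relation.isversionof", []).append(old_id)
    old_record.setdefault("dc.relation.isreplacedby", []).append(new_id)


# The identifiers below are placeholders, not real Dryad identifiers.
old = {"dc.identifier": "dataset-1"}
new = {"dc.identifier": "dataset-2"}
link_versions(old, new)
print(old)
print(new)
```

In practice the curator would confirm each link before it is saved, in keeping with the supervision described above.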

Resources for the curator