SwordPackage

From Dryad wiki
Status: Outdated, never implemented. This page is kept to inform development of SWORD Submission.

This page describes the use and format of the SWORD package for submitting Dryad data packages to TreeBASE. The SWORD protocol is defined at http://purl.org/net/sword/. It is based on the Atom Publishing Protocol, which is defined in RFC 5023 (http://www.ietf.org/rfc/rfc5023.txt). In this case, Dryad will be acting as the client and TreeBASE as the server.

Package Overview

I've outlined two different methods of sending a SWORD package to TreeBASE: a METS approach and an OAI-ORE approach. Both use the same communication framework of the SWORD protocol, but they differ in the format of the metadata and in the way the files are transferred. A further variation on the ORE approach would be to package the resource map and the files into a single archive and send that to TreeBASE, similar to the way it's modelled with METS. In that case, the URIs for the aggregated resources become relative URI references in the "file" scheme. With either approach, as noted below, the permanent location of the actual data file is missing from the metadata model, so a machine cannot locate the file from the metadata alone.

The POST Header

The SWORD protocol defines some extra headers which should be present in the HTTP request when creating a new resource. The following table specifies the recommended headers for Dryad's implementation:

Header            Value
Content-MD5       checksum of SWORD package
User-Agent        DryadRepository/<version number>
X-On-Behalf-Of    submitter's name
X-Packaging       http://www.loc.gov/METS/ or http://www.openarchives.org/ore/terms/
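
As a rough illustration of the table above, the following Python sketch assembles the recommended headers for one deposit. The function name, the example submitter and the version string are hypothetical. The sketch also assumes the checksum is sent as a hex-encoded MD5 digest; RFC 1864 defines Content-MD5 as a base64-encoded digest, so the encoding the TreeBASE server expects would need to be confirmed.

```python
import hashlib

# Packaging identifiers taken from the header table.
METS_PACKAGING = "http://www.loc.gov/METS/"
ORE_PACKAGING = "http://www.openarchives.org/ore/terms/"


def build_sword_headers(package: bytes, submitter: str, version: str,
                        packaging: str = METS_PACKAGING) -> dict:
    """Return the recommended SWORD headers for one deposit (hypothetical helper)."""
    return {
        # Assumption: hex-encoded MD5 of the request body.
        "Content-MD5": hashlib.md5(package).hexdigest(),
        "User-Agent": f"DryadRepository/{version}",
        "X-On-Behalf-Of": submitter,
        "X-Packaging": packaging,
    }


# Example values are placeholders, not real Dryad data.
headers = build_sword_headers(b"example package bytes", "Jane Submitter", "0.1")
for name, value in headers.items():
    print(f"{name}: {value}")
```

For the ORE approach the same helper could be called with packaging=ORE_PACKAGING and the serialized resource map as the body.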

The METS Approach

The SWORD package will consist of a tarball containing the publication, the associated data files and a METS document describing the package. It is recommended, for simplicity's sake, that all the files appear in the top level of the archive. If structure is needed within the archive, this must be represented in the structMap element of the METS file. The METS document should be called mets.xml and, if the archive has structure, should appear at the top level. The SWORD server will use this METS file to process the internals of the SWORD package.
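
A minimal sketch of building such a package in Python follows. It covers only the simple case recommended above, where every file sits at the top level next to mets.xml. The function name and the sample file names are hypothetical, and gzip compression is an assumption, since this page only says "tarball".

```python
import io
import tarfile


def build_package(mets_xml: bytes, files: dict) -> bytes:
    """Build an in-memory tarball with mets.xml and all files at the top level."""
    buffer = io.BytesIO()
    # Assumption: gzip-compressed tar; the page does not name a compression.
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        members = {"mets.xml": mets_xml, **files}
        for name, content in members.items():
            if "/" in name:
                # Nested paths would also need a matching structMap entry.
                raise ValueError(f"{name!r} is not a top-level file name")
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            archive.addfile(info, io.BytesIO(content))
    return buffer.getvalue()


# Placeholder contents stand in for the real publication and data file.
package = build_package(b"<mets/>", {"article.pdf": b"%PDF-", "matrix.nex": b"#NEXUS"})
with tarfile.open(fileobj=io.BytesIO(package), mode="r:gz") as archive:
    names = archive.getnames()
print(names)
```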

METS File

The general structure of the METS file will look like the following:

  1. METS header
  2. dmdSec for each file in the SWORD package
  3. fileSec
    1. a fileGrp for each file in the SWORD package
  4. structMap
    1. a div and fptr for each file in the SWORD package

The following is the skeleton of an example METS record:

<mets xmlns="http://www.loc.gov/METS/"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:dc="http://purl.org/dc/terms/"
xmlns:dwc="http://rs.tdwg.org/dwc/terms/"
xmlns:ddi="http://www.icpsr.umich.edu/DDI/"
xmlns:premis="http://www.loc.gov/standards/premis/v1/">
<metsHdr CREATEDATE="[datetime]" LASTMODDATE="[datetime]" />
<dmdSec ID="dmdSec1">
<mdWrap MDTYPE="OTHER">
<xmlData>
<dc:type />
<dc:creator />
<dc:contributor />
<dc:title />
<dc:issued />
<dc:abstract />
<dc:subject />
<dc:publisher />
<dc:bibliographicCitation />
<dc:hasPart />
<dc:isPartOfSeries />
<dc:identifier />
<dwc:scientificName />
<dc:spatial />
<dc:temporal />
</xmlData>
</mdWrap>
</dmdSec>
<dmdSec ID="dmdSec2">
<mdWrap MDTYPE="OTHER">
<xmlData>
<dc:type />
<dc:creator />
<dc:contributor />
<dc:title />
<dc:identifier />
<premis:fixity />
<dc:isPartOf />
<ddi:depositr />
<ddi:contact />
<dc:rights />
<dc:description />
<dc:subject />
<dc:spatial />
<dc:temporal />
<dc:format />
<dc:extent />
<dc:issued />
<dc:available />
<dc:modified />
<dwc:scientificName />
</xmlData>
</mdWrap>
</dmdSec>
<fileSec>
<fileGrp USE="publication">
<file ID="pub1">
<FLocat xlink:href="[DOI of article]" LOCTYPE="DOI" />
</file>
</fileGrp>
<fileGrp USE="datafile">
<file ID="[filename of data file #1]">
<FLocat LOCTYPE="OTHER" xlink:href="[filename of data file #1]" xlink:type="simple" />
</file>
</fileGrp>
</fileSec>
<structMap>
<div DMDID="dmdSec1">
<fptr FILEID="pub1" />
</div>
<div DMDID="dmdSec2">
<fptr FILEID="[filename of data file #1]" />
</div>
</structMap>
</mets>
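
Because the structMap ties each file to its descriptive metadata by ID, a receiving server would probably want to check those cross-references before processing the package. The sketch below is one way to do that with Python's standard library. The function name is hypothetical, and the sample document is a stripped-down stand-in with made-up IDs, not a complete or schema-valid METS record.

```python
import xml.etree.ElementTree as ET

METS_NS = {"m": "http://www.loc.gov/METS/"}


def check_struct_map(mets_xml: str) -> list:
    """Return a list of problems found in the structMap cross-references."""
    root = ET.fromstring(mets_xml)
    dmd_ids = {d.get("ID") for d in root.findall("m:dmdSec", METS_NS)}
    file_ids = {f.get("ID") for f in root.findall("m:fileSec/m:fileGrp/m:file", METS_NS)}
    problems = []
    for div in root.findall("m:structMap/m:div", METS_NS):
        # Every div must point at a dmdSec that exists in the document.
        if div.get("DMDID") not in dmd_ids:
            problems.append(f"div references unknown dmdSec {div.get('DMDID')!r}")
        # Every fptr must point at a file declared in the fileSec.
        for fptr in div.findall("m:fptr", METS_NS):
            if fptr.get("FILEID") not in file_ids:
                problems.append(f"fptr references unknown file {fptr.get('FILEID')!r}")
    return problems


# Hypothetical minimal sample; IDs such as "data1" are illustrative only.
SAMPLE = """<mets xmlns="http://www.loc.gov/METS/">
  <dmdSec ID="dmdSec1"/>
  <dmdSec ID="dmdSec2"/>
  <fileSec>
    <fileGrp USE="publication"><file ID="pub1"/></fileGrp>
    <fileGrp USE="datafile"><file ID="data1"/></fileGrp>
  </fileSec>
  <structMap>
    <div DMDID="dmdSec1"><fptr FILEID="pub1"/></div>
    <div DMDID="dmdSec2"><fptr FILEID="data1"/></div>
  </structMap>
</mets>"""

print(check_struct_map(SAMPLE))
```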

The OAI-ORE Approach

Using OAI-ORE, the package is more of a logical collection of files. The only information being directly pushed to TreeBASE is the representation of the ORE resource map. The data files themselves will be fetched during the submission process by TreeBASE. As such, the general outline of the entire transaction is as follows:

  1. Dryad initiates the transaction by POSTing the resource map to TreeBASE
  2. TreeBASE processes the resource map and initiates a separate GET request from Dryad for each data file in the aggregation
  3. TreeBASE closes the transaction with a success or failure response
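
Step 2 depends on TreeBASE being able to list the aggregated resources in the resource map it received. The following Python sketch shows one way that could look. It is only an illustration: the function name is hypothetical, the aggregation URI in the sample is a placeholder, and the code recognises only the rdf:resource attribute form of RDF/XML. A real implementation would more likely use a full RDF parser.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ORE = "http://www.openarchives.org/ore/terms/"


def aggregated_uris(resource_map_xml: str) -> list:
    """Return the URIs listed as ore:aggregates, in document order."""
    root = ET.fromstring(resource_map_xml)
    return [
        element.get(f"{{{RDF}}}resource")
        for element in root.iter(f"{{{ORE}}}aggregates")
    ]


# Cut-down resource map; the aggregation URI is a placeholder.
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:ore="{ORE}">
  <rdf:Description rdf:about="http://example.org/aggregation">
    <ore:aggregates rdf:resource="http://hdl.handle.net/10255/dryad.121" />
    <ore:aggregates rdf:resource="http://hdl.handle.net/10255/dryad.122" />
  </rdf:Description>
</rdf:RDF>"""

# The server would issue one GET per URI; here we only list them.
for uri in aggregated_uris(SAMPLE):
    print("would GET", uri)
```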

The RDF/XML Resource Map

Each data package will have a resource map describing the aggregation of files. This resource map can be serialized in a number of different ways, but what follows is an example of an RDF/XML serialization of a resource map describing a sample Dryad record.

<?xml version="1.0" encoding="UTF-8" ?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:ore="http://www.openarchives.org/ore/terms/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:dcterms="http://purl.org/dc/terms/"
xmlns:foaf="http://xmlns.com/foaf/0.1/"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns:dwc="http://rs.tdwg.org/dwc/terms/">

<rdf:Description rdf:about="[URI of resource map]">
<ore:describes rdf:resource="[URI of aggregation]" />
<dcterms:creator rdf:parseType="Resource">
<foaf:name>Dryad Repository</foaf:name>
<foaf:page rdf:resource="http://datadryad.org" />
</dcterms:creator>
<dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">
[last modified time of resource map]
</dcterms:modified>
</rdf:Description>
<rdf:Description rdf:about="[URI of aggregation]">
<ore:aggregates rdf:resource="http://hdl.handle.net/10255/dryad.119" />
<ore:aggregates rdf:resource="http://hdl.handle.net/10255/dryad.121" />
<ore:aggregates rdf:resource="http://hdl.handle.net/10255/dryad.122" />
</rdf:Description>
<rdf:Description rdf:about="http://hdl.handle.net/10255/dryad.119">
<dc:type>Article</dc:type>
<dc:creator>Richard H. Ree</dc:creator>
<dc:creator>Michael J. Donoghue</dc:creator>
<dc:title>Step Matrices and the Interpretation of Homoplasy</dc:title>
<dcterms:issued>1998-12-30</dcterms:issued>
<dcterms:abstract>Assumptions about the costs of character change,
coded in the form of a step matrix...
</dcterms:abstract>
<dc:subject>parsimony</dc:subject>
<dc:subject>phylogenetic inference</dc:subject>
<dc:subject>homoplasy</dc:subject>
<dc:subject>ancestral states</dc:subject>
<dc:subject>character evolution</dc:subject>
<dc:publisher>Taylor &amp; Francis</dc:publisher>
<dcterms:bibliographicCitation>Ree, Richard H. and Donoghue, Michael J. (1998)
'Step Matrices and the Interpretation of Homoplasy', Systematic Biology, 47:4, 582 - 588
</dcterms:bibliographicCitation>
<dcterms:hasPart rdf:resource="http://hdl.handle.net/10255/dryad.121" />
<dcterms:hasPart rdf:resource="http://hdl.handle.net/10255/dryad.122" />
<dcterms:isPartOfSeries>Systematic Biology</dcterms:isPartOfSeries>
<dcterms:isPartOfSeries>47:4, 582 - 588</dcterms:isPartOfSeries>
<dc:identifier rdf:resource="doi:10.1080/106351598260590" />
<dc:identifier rdf:resource="http://hdl.handle.net/10255/dryad.119" />
</rdf:Description>
<rdf:Description rdf:about="http://hdl.handle.net/10255/dryad.121">
<dc:type>Dataset</dc:type>
<dc:creator>Richard H. Ree</dc:creator>
<dc:creator>Michael J. Donoghue</dc:creator>
<dc:title>Ree and Donoghue C Source Code</dc:title>
<dc:identifier rdf:resource="http://hdl.handle.net/10255/dryad.121" />
<dcterms:isPartOf rdf:resource="doi:10.1080/106351598260590" />
<dc:subject>parsimony</dc:subject>
<dc:subject>phylogenetic inference</dc:subject>
<dc:subject>homoplasy</dc:subject>
<dc:subject>ancestral states</dc:subject>
<dc:subject>character evolution</dc:subject>
<dcterms:format>Text file</dcterms:format>
<dcterms:extent>37.64Kb</dcterms:extent>
<dcterms:issued>1998-12-30</dcterms:issued>
<dcterms:available>2008-04-08T19:41:03Z</dcterms:available>
</rdf:Description>
<rdf:Description rdf:about="http://hdl.handle.net/10255/dryad.122">
<dc:type>Dataset</dc:type>
<dc:creator>Richard H. Ree</dc:creator>
<dc:creator>Michael J. Donoghue</dc:creator>
<dc:title>Source Code Readme File</dc:title>
<dc:identifier rdf:resource="http://hdl.handle.net/10255/dryad.122" />
<dcterms:isPartOf rdf:resource="doi:10.1080/106351598260590" />
<dc:subject>parsimony</dc:subject>
<dc:subject>phylogenetic inference</dc:subject>
<dc:subject>homoplasy</dc:subject>
<dc:subject>ancestral states</dc:subject>
<dc:subject>character evolution</dc:subject>
<dcterms:format>Text file</dcterms:format>
<dcterms:extent>5.697Kb</dcterms:extent>
<dcterms:issued>1998-12-30</dcterms:issued>
<dcterms:available>2008-04-08T19:51:42Z</dcterms:available>
</rdf:Description>
</rdf:RDF>

Notes About ORE

In accordance with the OAI-ORE standard, an aggregation must not have a representation. If a resolvable URI is used for an aggregation, we will need to use content negotiation to resolve it to the URI of one of the resource map serializations.

The Content-MD5 header should represent the checksum of the resource map serialization, not the contents of the data files. However, a checksum should also be used to verify the integrity of the data files when they are transferred. Two possible solutions are adding a checksum element to the application profile for the data files, or sending the Content-MD5 header in response to all GET requests made to Dryad.
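
Whichever of the two solutions is chosen, the receiving side ends up comparing a declared checksum against the bytes it actually fetched. A minimal sketch of that comparison is shown below, reading the file in chunks so that large data files need not be held in memory. The function names are hypothetical, the sample bytes are made up, and hex encoding of the digest is an assumption.

```python
import hashlib
import io


def md5_of_stream(stream, chunk_size: int = 65536) -> str:
    """Compute the hex MD5 digest of a binary stream, reading it in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def transfer_is_intact(stream, declared_md5_hex: str) -> bool:
    """Compare the digest of the fetched bytes with the declared checksum."""
    return md5_of_stream(stream) == declared_md5_hex.strip().lower()


# Illustrative bytes standing in for a fetched data file.
data = b"#NEXUS\nbegin data;\nend;\n"
declared = hashlib.md5(data).hexdigest()
print(transfer_is_intact(io.BytesIO(data), declared))
print(transfer_is_intact(io.BytesIO(data + b"x"), declared))
```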

One thing that jumps out at me is that the handle used to identify the data file doesn't actually point to the data file; it points to a page about the data file which, in turn, points to the data file. For a human, there's no problem understanding the relationship between the data file handle and the data file, but for a machine this relationship has to be made explicit. It seems to me that, as the system is currently set up, the handle points to an abstract entity which contains a data file. Maybe this isn't really an issue? There are certainly ways of dealing with this technically. What I've done here is to keep the Dryad handle in the dc:identifier element and use the actual data file location as the URI of the item being aggregated. I think, at the very least, we should consider adding another dc:identifier element with the URI for the actual file as the value.

This is a general issue, not specific to ORE. We will need to address it for all handshaking purposes. --Ryan Scherle 17:47, 2 December 2009 (EST)