TreeBASE OAI Provider

Status: Completed by Yale in 2010.

TreeBASE will create an OAI-PMH provider, similar to the Metacat OAI Provider. This will be the official method for TreeBASE content to move into Dryad.

Workflow
When a user submits content to TreeBASE, the content will be harvested by Dryad and appear in Dryad searches. If the user decides to submit related (or the same) content to Dryad, the Dryad submission system will allow the user to give a TreeBASE ID, and the records will automatically be linked. Dryad will provide curators with reports that indicate possible duplicates.

Note that content originally deposited in Dryad follows a different workflow, as described in the TreeBASE Submission Integration.

Provider Details
A high-level [[Media:TreeBASE_OAI.doc|architectural design]] for the provider was created by Youjun Guo.

There will be three OAI collections: Studies, Trees, and Matrices.

The OAI provider will output:
 * Dublin Core
 * Nexus
 * NeXML
 * (possibly) OAI-ORE, for Studies only

Dryad will import the NeXML data, or the OAI-ORE data if available.
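The format choice described above could be made at harvest time by inspecting the provider's ListMetadataFormats response. The following is a minimal sketch, not the Dryad implementation: the metadataPrefix values `ore` and `nexml` are assumptions (the actual prefixes are whatever the provider advertises), and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Assumed metadataPrefix values, most preferred first. The real prefixes
# must be read from the provider's ListMetadataFormats response.
PREFERRED_PREFIXES = ("ore", "nexml")


def available_prefixes(xml_text):
    """Return the metadataPrefix values listed in a ListMetadataFormats response."""
    root = ET.fromstring(xml_text)
    path = ".//oai:metadataFormat/oai:metadataPrefix"
    return [e.text.strip() for e in root.iterfind(path, OAI_NS) if e.text]


def choose_import_prefix(xml_text, preferred=PREFERRED_PREFIXES):
    """Pick the first preferred prefix the provider offers, or None."""
    offered = set(available_prefixes(xml_text))
    for prefix in preferred:
        if prefix in offered:
            return prefix
    return None
```

Returning `None` when neither format is offered leaves the fallback decision (for example, harvesting Dublin Core only) to the caller.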

Data currently available from TreeBASE
TreeBASE web services reply in RDFa (RSS 1.0) and NeXML. Citation metadata will be expressed in the RSS using Dublin Core and related vocabularies. Taxon and other metadata will (in the future) be in the NeXML serialization, with the metadata largely expressed in RDFa.

Implementation

 * http://www.treebase.org/treebase-web/top/oai?verb=Identify
 * http://www.treebase.org/treebase-web/top/oai?verb=ListSets
 * http://www.treebase.org/treebase-web/top/oai?verb=ListMetadataFormats
 * http://www.treebase.org/treebase-web/top/oai?verb=ListRecords&metadataPrefix=oai_dc&until=1996-11-04T00:00:00Z
 * http://www.treebase.org/treebase-web/top/oai?verb=GetRecord&metadataPrefix=oai_dc&identifier=TB:s1234
 * http://www.treebase.org/treebase-web/top/oai?verb=ListIdentifiers&metadataPrefix=oai_dc&until=1996-11-04T00:00:00Z
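The requests listed above all follow the same pattern: a base URL plus a `verb` and its arguments. As a rough sketch of how a harvester might build such URLs and read the identifiers out of a response, assuming only the Python standard library (the helper names `build_request` and `parse_identifiers` are hypothetical and not part of TreeBASE or Dryad):

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

BASE_URL = "http://www.treebase.org/treebase-web/top/oai"
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def build_request(verb, **params):
    """Return an OAI-PMH request URL for the given verb and arguments."""
    query = {"verb": verb}
    for key, value in params.items():
        if value is not None:
            # 'from' is a Python keyword, so callers pass 'from_' instead.
            query[key.rstrip("_")] = value
    return BASE_URL + "?" + urlencode(query)


def parse_identifiers(xml_text):
    """Return (identifiers, resumption_token) from a ListIdentifiers or
    ListRecords response. The token is None when the list is complete."""
    root = ET.fromstring(xml_text)
    ids = [e.text for e in root.iterfind(".//oai:header/oai:identifier", OAI_NS)]
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=OAI_NS)
    return ids, (token.strip() or None)
```

For example, `build_request("GetRecord", metadataPrefix="oai_dc", identifier="TB:s1234")` produces the GetRecord URL above, except that `urlencode` percent-encodes the colon as `%3A`, which is an equivalent form of the same request. When a resumption token comes back, the next request carries only `verb` and `resumptionToken`.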

Open Questions

 * 1) After Dryad harvests the content, should TB content show up as regular Dryad content? (i.e., not off to the side as "related content", but directly in the regular results)
   * Ryan needs to come up with some mockups and get stakeholder feedback (TB, journals, etc.)
 * 2) Should information about "in progress" and "ready" items be included? This may be useful information for the TreeBASE Submission Integration, but it probably is not useful for the general-purpose OAI-PMH interface.

Resolved Questions

 * 1) Does TB currently have DC available?
   * Yes, as part of the RSS feed, but it needs to be reviewed and expanded.
 * 2) How frequently should Dryad harvest content?
   * Daily. TB generally receives about 1 item per day, but this may ramp up with TB2 and Dryad integration.
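A daily harvest maps naturally onto OAI-PMH selective harvesting, where `from` and `until` bound the datestamps of the records returned. The sketch below shows one way to build such a request. It assumes the provider accepts seconds-granularity UTC timestamps (as the `until` values in the implementation URLs suggest) and that the harvester stores the time of its last successful run. The function name is hypothetical.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

BASE_URL = "http://www.treebase.org/treebase-web/top/oai"
STAMP = "%Y-%m-%dT%H:%M:%SZ"  # OAI-PMH seconds granularity, always UTC


def daily_harvest_url(last_harvest, now, metadata_prefix="oai_dc"):
    """Build a ListRecords URL covering records changed since the last run.

    Both arguments must be timezone-aware datetimes. They are converted
    to UTC before formatting.
    """
    query = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "from": last_harvest.astimezone(timezone.utc).strftime(STAMP),
        "until": now.astimezone(timezone.utc).strftime(STAMP),
    }
    return BASE_URL + "?" + urlencode(query)
```

Because both bounds are inclusive in OAI-PMH, reusing the previous `until` as the next `from` can return a record twice, so the harvester should treat re-delivered identifiers as updates rather than new items.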

Tasks
While designing the interaction of the provider with the core TreeBASE code, it may be useful to reference the [[Media:Metacat-OAI-PMH-Project-Plan.pdf|final report on the OAI provider for MetaCat]].


 * Ryan: provide sample Dryad records with phylo content, to help TB develop DC mapping
 * TB: create DC content mapping
 * TB: Complete/create OAI-PMH provider functionality.
   * follow the OAI Best Practices
   * adapt OCLC's OAIcat
   * test with the OAI Repository Explorer
 * Ryan: Implement harvest of TB & Metacat metadata
 * Ryan: Submission system enhancements to enable above workflow.

Relevant Text from the Grant Proposal

 * "full compliance with OAI-PMH"
 * "we will also populate Dryad with the published data presently in TreeBASE (currently ~1600 records). To provide meaningful metadata, we will extract from the legacy records such things as the original bibliographic citation and DOI, sequence accession numbers, geographic coordinates and specimen identifiers. For those articles that are also available electronically, this will result in an initial test set for automated metadata extraction techniques (SA1.2). We will also explore the possibility of similarly pre-populating Dryad with complete publication data packages from existing journal data archives."
 * "Since TreeBASE stores its metadata in a relational data model and not as XML documents, a cross-walk from a native XML format to DC is not needed; instead, the DC elements will be drawn from the metadata attributes stored in the TreeBASE relational database."
 * "Dryad will regularly harvest metadata records from TreeBASE"
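
The grant text on drawing DC elements directly from relational attributes can be illustrated with a small sketch. The column names (`study_id`, `title`, `authors`, `pub_year`, `doi`) and their mapping to DC elements are purely hypothetical, since the real TreeBASE schema and the agreed DC mapping are not given here. The namespaces are the standard `oai_dc` and Dublin Core ones.

```python
import xml.etree.ElementTree as ET

OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("oai_dc", OAI_DC)
ET.register_namespace("dc", DC)

# Hypothetical database column -> Dublin Core element mapping.
COLUMN_TO_DC = {
    "study_id": "identifier",
    "title": "title",
    "authors": "creator",  # a list yields one dc:creator per author
    "pub_year": "date",
    "doi": "relation",
}


def row_to_oai_dc(row):
    """Serialize one database row (a dict) as an oai_dc:dc XML string."""
    root = ET.Element(f"{{{OAI_DC}}}dc")
    for column, element in COLUMN_TO_DC.items():
        value = row.get(column)
        if value is None:
            continue  # omit elements with no data rather than emit empties
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            ET.SubElement(root, f"{{{DC}}}{element}").text = str(item)
    return ET.tostring(root, encoding="unicode")
```

Keeping the mapping in a single table makes it straightforward to review and expand as the DC content mapping task progresses.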