From Dryad wiki
Revision as of 10:01, 27 September 2007 by Ryan Scherle (talk | contribs) (DOI (Digital Object Identifier))

We must determine the best possible identifier scheme for data in the repository. Once we implement a scheme, it will be nearly impossible to change. The DSpace Lessons Learned page tells us that "The persistent identifier for content is the single best selling point for DSpace when talking with faculty."

Possible Identifier Schemes


Handle

Handles are native to DSpace and supported by Fedora.


  • Must register each identifier with a central service.

DOI (Digital Object Identifier)

DOIs are one particular implementation of handles, used widely in the publishing industry. One difference is that the DOI system mandates metadata that must be associated with every DOI, while "plain" handles (and most other ID systems) leave metadata up to the user.

An old post from Rod Page suggests using DOIs for manuscript-level information and handles for more granular information. This would hold down costs, but could be messy.

Generic DOIs may be registered, and parameters may be added to access sub-parts of the DOI object.


  • Must register each identifier with a central service.
  • Must pay to register each identifier.

LSID (Life Science Identifier)

LSID is a URN identifier scheme.
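As a rough illustration of what "URN identifier scheme" means here, an LSID is a colon-separated string of the form urn:lsid:authority:namespace:object, with an optional trailing revision. The sketch below splits such a string into its parts. The function name and the example identifier are made up for illustration, and the check is deliberately loose rather than a full validation against the LSID specification.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text: str) -> LSID:
    """Split an LSID URN into its colon-separated components."""
    parts = text.split(":")
    # Expect "urn", "lsid", authority, namespace, object, and an optional revision.
    if len(parts) not in (5, 6) or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    revision = parts[5] if len(parts) == 6 else None
    return LSID(authority, namespace, object_id, revision)


# Hypothetical identifier, for illustration only.
example = parse_lsid("urn:lsid:example.org:specimens:12345:2")
```

Note that the revision component addresses one specific version of the bytes, which is the behavior questioned under Open Questions.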


  • There are no known sites that use LSIDs as their primary identifier, though a few sites (available from the LSID homepage) can resolve LSIDs into their own identifier schemes.
  • LSIDs seem to have fallen out of favor as people have realized that URLs can be identifiers, and tools for other identifier schemes have improved.
  • It is unclear whether a central resolution authority really exists.
  • The community is much smaller than for other identifier schemes.
  • The W3C has written a document that shows non-http schemes aren't any better than http-based schemes.


UNF (Universal Numeric Fingerprint)

UNF is a content-based identifier for data objects, somewhat like a fingerprint.
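To make the "fingerprint" idea concrete, here is a minimal sketch of a content-based identifier. It is not the UNF algorithm itself. It only shows the general approach of normalizing values and then hashing them, so that the identifier depends on the data rather than on the file format or location. The function name, the choice of SHA-256, the 7-significant-digit rounding, and the truncation length are all assumptions made for this example.

```python
import base64
import hashlib


def content_fingerprint(values, digits=7):
    """Return a short fingerprint derived only from the data values."""
    normalized = []
    for value in values:
        if isinstance(value, float):
            # Round to a fixed number of significant digits so that
            # insignificant floating-point noise does not change the result.
            normalized.append(f"{value:.{digits - 1}e}")
        else:
            normalized.append(str(value))
    payload = "\n".join(normalized).encode("utf-8")
    digest = hashlib.sha256(payload).digest()
    # Truncate and base64-encode to keep the identifier compact.
    return base64.b64encode(digest[:16]).decode("ascii")
```

Two copies of a dataset that differ only below the rounding threshold get the same fingerprint, while any meaningful change to a value produces a different one. An identifier like this verifies content but does not say where the object lives, so it would complement a resolvable identifier rather than replace it.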


Custom

Define our own identifier system, and add DOIs/Handles as appropriate.

  • "Handle-like" identifiers are handed out for free by DSpace, so why not use them?

Identifier schemes in use

  • CiteSeer: custom
  • ChemXSeer: doi, custom
  • GenBank: accession
  • PubMed: custom
  • GBIF: custom
  • KNB: custom
  • OceanPortal: accession
  • Morphbank: accession
  • MorphoBank: accession
  • National Climatic Data Center: custom
  • Paleobiology Database: accession
  • TreeBASE: custom
  • World Data Center: doi
  • PDB: custom, doi
  • ACM: doi
  • D-Lib Magazine: doi


  • We need to look at other repositories (GenBank, TreeBASE, GBIF, etc.) to see what types of identifiers they are using.
  • There is a section on identifiers in the grant proposal....
  • We do not want to fall into the same trap as arXiv. They were forced to change their identifier system because the old scheme could no longer accommodate the number of items being added to the repository. However, the new scheme is still inflexible, and is guaranteed to become invalid eventually, because it includes a 2-character year code.
  • Mike Giarlo says the choice of identifier scheme doesn't really matter; the commitment to persistence is key.
  • Ryan's previous thoughts on semantic vs. non-semantic identifiers.
  • Peter Buneman's thoughts on making identifiers citable.
  • CrossRef suggests creating DOIs of the form DOI-institution-code/Handle-institution-code/Handle-specific-part. These will trivially convert to a form that DSpace can use. But it's unclear why we couldn't just leave out the Handle-institution-code and declare the specific parts for both systems to be the same.
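The CrossRef suggestion in the last note can be sketched as a pair of string conversions. This is only an illustration of why the conversion is "trivial": the function names are invented here, and the prefix values in the usage note are made up rather than real registrations.

```python
def doi_to_handle(doi: str) -> str:
    """Drop the DOI institution code, leaving 'handle-prefix/specific-part'."""
    doi_prefix, sep, handle = doi.partition("/")
    # The remainder must itself look like a handle: prefix, slash, specific part.
    if not sep or "/" not in handle:
        raise ValueError(f"expected doi-prefix/handle-prefix/specific-part, got {doi!r}")
    return handle


def handle_to_doi(handle: str, doi_prefix: str) -> str:
    """Prepend the DOI institution code to an existing DSpace handle."""
    if "/" not in handle:
        raise ValueError(f"expected handle-prefix/specific-part, got {handle!r}")
    return f"{doi_prefix}/{handle}"
```

With a hypothetical DOI prefix of 10.9999 and a hypothetical handle prefix of 1234.5, doi_to_handle("10.9999/1234.5/678") gives "1234.5/678", and handle_to_doi("1234.5/678", "10.9999") gives the DOI back. Leaving out the Handle-institution-code, as the note proposes, would shorten the DOI but would require carrying the handle prefix as configuration instead of reading it out of the identifier.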

Open Questions

  1. Is it possible to create sub-parts of a DOI? For example: http://dx.doi.org/1234/abc1234/subpart1. This would allow us to limit the number of DOIs registered, but provide access at any granularity we wish.
  2. Is it possible to get the same "institution identifier" in both the DOI and CNRI handle systems?
  3. Do we want to assign identifiers to particular bitstreams, like LSID does? This seems ripe for disaster. While we want software to be able to work with a data object in a consistent manner (we don't want to suddenly change the format out from under them), we also don't want to preserve data formats that are definitely dead (in 50 years, we won't have tools to parse an Excel 2003 file, or the current form of a NEXUS file). DSpace by default assigns handles to the item level only (which is abstract), and treats individual bitstreams as manifestations of the item, with identifiers tied to the hostname. Bitstreams which need to be cited are typically placed in their own item.
  4. Should we use the "default handle" as the primary identifier, or assign our own? The default handles are tightly tied into DSpace. We can assign our own identifier and attempt to hide the default handle, the same way the IU Fedora repository hides PIDs. But is there any reason to not use these handles? We might as well use them until a problem comes up.
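For question 1, the repository side of sub-part addressing could look something like the sketch below. It does not answer whether the DOI resolver will pass the extra path through. It only shows how we might split an incoming path into the registered identifier and a locally resolved sub-part, assuming our registered identifiers always consist of exactly two path segments. The function name is hypothetical.

```python
from urllib.parse import urlparse


def split_subpart(url: str):
    """Return (registered_identifier, subpart) for an identifier URL."""
    segments = urlparse(url).path.strip("/").split("/")
    if len(segments) < 2 or not all(segments[:2]):
        raise ValueError(f"no prefix/suffix pair in {url!r}")
    # Assumption: the first two segments are the registered identifier.
    registered = "/".join(segments[:2])
    # Anything after that is a sub-part that only our repository understands.
    subpart = "/".join(segments[2:]) or None
    return registered, subpart
```

Applied to the example URL from question 1, this yields the registered identifier 1234/abc1234 and the sub-part subpart1. A URL with no extra path yields None for the sub-part.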