Difference between revisions of "Old:Identifiers"

From Dryad wiki

Revision as of 15:10, 13 June 2009

We must determine the best possible identifier scheme for data in the repository. Once we implement a scheme, it will be nearly impossible to change. The DSpace Lessons Learned page tells us that "The persistent identifier for content is the single best selling point for DSpace when talking with faculty."

Possible Identifier Schemes



Handle

Advantages:

  • Handles are native to DSpace and supported by Fedora.
  • Handles have the same basic architecture as DOIs, but are much less expensive to implement. The total cost is about $100 per year, regardless of the number of identifiers.

Disadvantages:

  • Must register the local resolver with a central service.

DOI (Digital Object Identifier)

DOIs are one particular implementation of handles, used widely in the publishing industry. One difference is that the DOI system mandates metadata that must be associated with every DOI, while "plain" handles (and most other ID systems) leave metadata up to the user.


Advantages:

  • Publishers and scientists are already familiar with them.
  • Supported by ISI (though ISI may not be deliberately supporting DOIs for data objects).

Disadvantages:

  • Must register the local resolver with a central service.
  • Must pay to register each identifier. The cost is around 60 cents per identifier. This is trivial for a one-identifier-per-article model, but can be quite expensive for more granular schemes, such as the one-identifier-per-observation model that would be ideal for fully machine-readable data.
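
The cost trade-off described above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming the two figures quoted in the text (about $100 per year for handles, about 60 cents per DOI) and assuming the DOI fee is a one-time charge per identifier; the function names are hypothetical and chosen only for illustration.

```python
# Figures quoted in the text; both are approximate.
HANDLE_FLAT_FEE_CENTS = 100 * 100   # about $100 per year, any number of identifiers
DOI_FEE_PER_ID_CENTS = 60           # about 60 cents per registered identifier


def doi_registration_cost(num_identifiers: int) -> float:
    """Cost in dollars of registering `num_identifiers` DOIs (assumed one-time fee)."""
    return num_identifiers * DOI_FEE_PER_ID_CENTS / 100


def handle_cost(years: int) -> float:
    """Cost in dollars of maintaining a handle prefix for `years` years."""
    return years * HANDLE_FLAT_FEE_CENTS / 100


def break_even_identifiers(years: int) -> int:
    """Smallest number of DOIs whose total fee exceeds the handle fee over `years`."""
    # Integer arithmetic in cents avoids floating-point rounding issues.
    return (years * HANDLE_FLAT_FEE_CENTS) // DOI_FEE_PER_ID_CENTS + 1


if __name__ == "__main__":
    for n in (1_000, 100_000, 1_000_000):
        print(f"{n:>9} DOIs: ${doi_registration_cost(n):,.2f}")
    print("Handles for one year:", f"${handle_cost(1):,.2f}")
    print("Break-even after", break_even_identifiers(1), "identifiers")
```

Under these assumptions, a per-article model with a few thousand identifiers stays inexpensive, while a per-observation model with millions of identifiers does not.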

DOI + Handle (hybrid)

An old post from Rod Page suggests using DOIs for manuscript-level information and handles for more granular information. This would hold down costs, but could be messy.


Advantages:

  • Allows DOIs to be used at the level of a whole article citation, which is the level where DOIs provide the most benefit.
  • Since the DOI and handle technologies are nearly identical, only a single service needs to be maintained for resolving identifiers.

Disadvantages:

  • May be confusing for authors, especially if a single publication cites datasets at both the DOI and handle levels.

DOI parameters

A single generic DOI may be registered for an object, and parameters may then be added to access sub-parts of that object.


Advantages:

  • This allows us to limit the number of DOIs registered, while providing access at any granularity we wish.

Disadvantages:

  • The resultant URLs are long and complex.
  • The parameter passing mechanism may change over time, making this approach poorly suited for persistent identifiers.
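
To illustrate what a parameterized DOI might look like, here is a minimal sketch. The resolver base URL, the example DOI, and the parameter names (`table`, `row`) are all assumptions made for illustration; the text does not specify how parameters would actually be passed or whether a resolver would forward them.

```python
from urllib.parse import quote, urlencode

# Assumed resolver base URL; not specified in the text.
RESOLVER = "https://doi.org/"


def parameterized_doi_url(doi: str, **params: str) -> str:
    """Build a URL for one registered DOI plus parameters naming a sub-part.

    The parameter names are hypothetical: they stand in for whatever
    convention the repository would define for addressing sub-parts.
    """
    base = RESOLVER + quote(doi, safe="/")
    if not params:
        return base
    # Sort parameters so the same sub-part always yields the same URL.
    return base + "?" + urlencode(sorted(params.items()))


if __name__ == "__main__":
    # One DOI registered for the whole data package (hypothetical DOI).
    print(parameterized_doi_url("10.5555/example.dataset"))
    # The same DOI addressing a single observation.
    print(parameterized_doi_url("10.5555/example.dataset", table="2", row="17"))
```

Even in this small example the second URL is noticeably longer than the first, and it remains valid only as long as the parameter convention itself does not change.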

LSID (Life Science Identifier)

LSID is a URN identifier scheme.


Disadvantages:

  • There are no known sites that use LSIDs as their primary identifier, though a few sites (available from the LSID homepage) can resolve LSIDs into their own identifier schemes.
  • LSIDs seem to have fallen out of favor as people have realized that URLs can be identifiers, and as tools for other identifier schemes have improved.
  • It is unclear whether a central resolution authority really exists.
  • The community is much smaller than for other identifier schemes.
  • The W3C has published a document arguing that non-HTTP schemes aren't any better than HTTP-based schemes.


UNF (Universal Numerical Fingerprint)

UNF is a content-based identifier for data objects, somewhat like a fingerprint.
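
The general idea of a content-based identifier can be sketched as follows. This is not the actual UNF algorithm; it is a simplified illustration that assumes numeric data, normalizes each value to a fixed number of significant digits, and hashes the result. The function name and the choice of SHA-256 are assumptions made for the example.

```python
import base64
import hashlib


def content_fingerprint(values, digits: int = 7) -> str:
    """Return a short fingerprint derived only from the data values.

    Simplified illustration of a content-based identifier, NOT the real
    UNF specification. Values are normalized to `digits` significant
    digits so that storage format (int vs. float, trailing zeros) does
    not affect the result.
    """
    normalized = [format(float(v), f".{digits - 1}e") for v in values]
    payload = "\n".join(normalized).encode("utf-8")
    digest = hashlib.sha256(payload).digest()
    # Truncate to 128 bits and encode as text to keep the identifier short.
    return base64.b64encode(digest[:16]).decode("ascii")


if __name__ == "__main__":
    a = content_fingerprint([1, 2.5, 3])
    b = content_fingerprint([1.0, 2.50, 3.0])   # same content, different notation
    c = content_fingerprint([1, 2.5, 4])        # different content
    print(a == b, a == c)
```

Because the identifier is derived from the content, two copies of the same data produce the same fingerprint, and any change to the data produces a different one.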

Identifier schemes in use

Almost all publications use DOI, but data repositories vary more:

  • CiteSeer: custom
  • ChemXSeer: doi, custom
  • GenBank: accession
  • PubMed: custom
  • GBIF: custom
  • KNB: custom
  • OceanPortal: accession
  • Morphbank: accession
  • MorphoBank: accession
  • National Climatic Data Center: custom
  • Paleobiology Database: accession
  • TreeBASE: custom
  • World Data Center: doi
  • PDB: custom, doi
  • Pangaea: doi


Notes/Decisions

  • We will initially focus on handles, and later add DOIs as an extension of the handle scheme.
  • There is a section on identifiers in the grant proposal....
  • We do not want to fall into the same trap as arXiv. They were forced to change their identifier system because the old scheme could no longer accommodate the number of items being added to the repository. However, the new scheme is still inflexible, and is guaranteed to become invalid eventually, because it includes a 2-character year code.
  • Mike Giarlo says the choice of identifier scheme doesn't really matter; the commitment to persistence is key.
  • Ryan's previous thoughts on semantic vs. non-semantic identifiers.
  • CrossRef suggests creating DOIs of the form DOI-institution-code/Handle-institution-code/Handle-specific-part. These will trivially convert to a form that DSpace can use.
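
The conversion mentioned in the CrossRef item above can be sketched in a few lines. This is a minimal illustration, assuming the three-part form given in the text (DOI institution code, handle institution code, handle-specific part, separated by slashes); the function names and the example codes are hypothetical.

```python
def doi_to_handle(doi: str) -> str:
    """Strip the DOI institution code, leaving a handle of the form prefix/suffix."""
    parts = doi.split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected doi-code/handle-code/local-part, got {doi!r}")
    _doi_code, handle_code, local_part = parts
    return f"{handle_code}/{local_part}"


def handle_to_doi(handle: str, doi_code: str) -> str:
    """Prepend the DOI institution code to an existing handle."""
    handle_code, sep, local_part = handle.partition("/")
    if not sep or not handle_code or not local_part:
        raise ValueError(f"expected handle-code/local-part, got {handle!r}")
    return f"{doi_code}/{handle}"


if __name__ == "__main__":
    # Hypothetical codes: DOI code 10.5555, handle code 12345, local part 678.
    print(doi_to_handle("10.5555/12345/678"))
    print(handle_to_doi("12345/678", "10.5555"))
```

Since the handle is embedded unchanged inside the DOI, moving between the two forms is pure string manipulation and needs no lookup table.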

See Also

  • Citing Data -- lists scholarly articles that discuss methods for data citations