Difference between revisions of "Old:Identifiers"

From Dryad wiki
(43 intermediate revisions by 2 users not shown)
Line 1: Line 1:
We must determine the best possible identifier scheme for data in the repository. Once we implement a scheme, it will be nearly impossible to change. The [http://wiki.dspace.org/index.php/LessonsLearned DSpace Lessons Learned page] tells us that "The persistent identifier for content is the single best selling point for DSpace when talking with faculty."
+
    STATUS: This page is outdated and of historical use only.
  
 +
'''Dryad has decided to use DOIs for all objects in the repository. For details, see [[DOI Usage]].'''
  
== Possible Identifier Schemes ==
+
== Types of Identifier Schemes ==
  
 
=== Handle ===
 
  
Handles are native to DSpace and supported by Fedora.
+
Pro:
 +
* Handles are native to DSpace.
 +
* Handles have the same basic architecture as DOIs, but are much less expensive to implement. The total cost is about $100 per year, regardless of the number of identifiers.
  
 
Con:
 
* Must register each identifier with a central service.
+
* Must register the local resolver with a central service.
  
=== DOI (Digital Object Identifier) ===  
+
=== DOI (Digital Object Identifier) ===
  
 
DOIs are one particular implementation of handles, used widely in the publishing industry. One difference is that the DOI system mandates metadata that must be associated with every  DOI, while "plain" handles (and most other ID systems) leave metadata up to the user.
 
 +
 +
Pro:
 +
* Publishers and scientists are already familiar with them
 +
* Supported by ISI (though they may not consciously be supporting DOIs for data objects)
 +
* CrossRef provides searching and citation linking services for DOIs in the CrossRef system (it is unclear whether a DOI issued by another agency can be listed in the CrossRef system).
 +
 +
Con:
 +
* Must register the local resolver with a central service.
 +
* Must pay to register each identifier. TIB charges 60 cents per identifier, though this cost will come down through the [http://www.datacite.org/ DataCite] collaboration. CrossRef charges 6 cents per data file identifier and $1 for each database identifier (though it is unclear what constitutes a database). These costs are trivial for a one-identifier-per-article model, but can be quite expensive for more granular schemes, such as the one-identifier-per-observation model that would be ideal for fully machine-readable data.
 +
 +
==== DOI + Handle (hybrid) ====
 +
  
 
An old [http://lists.tdwg.org/pipermail/tdwg-guid/2007-March/000574.html post from Rod Page] suggests using DOIs for manuscript-level information and handles for more granular information. This would hold down costs, but could be messy.
 
 +
 +
Pro:
 +
* Allows DOIs to be used at the level of a whole article (data package), which is the level where DOI provides the most benefits.
 +
* Since the DOI and handle technologies are nearly identical, only a single service needs to be maintained for resolving identifiers.
 +
 +
Con:
 +
* May be confusing for authors, especially if a single publication cites data at both the DOI and handle levels.
 +
 +
==== DOI parameters ====
 +
 +
Generic DOIs may be registered, and [http://www.crossref.org/help/Content/07_advanced%20concepts/Passing_parameters.htm parameters may be added] to access sub-parts of the DOI object.
 +
 +
Pro:
 +
* This allows us to limit the number of DOIs registered, while providing access at any granularity we wish.
  
 
Con:
 
* Must register each identifier with a central service.
+
* The resultant URLs are long and complex
* Must pay to register each identifier
+
* The parameter passing mechanism may change over time, making this approach poorly suited for persistent identifiers.
  
 
=== LSID (Life Science Identifier) ===
 
  
[http://lsids.sourceforge.net/ LSID] is a URN identifier scheme.  
+
[http://lsids.sourceforge.net/ LSID] is a URN identifier scheme.
  
 
* [http://wiki.gbif.org/guidwiki/wikka.php?wakka=HomePage TDWG working groups]
 
Line 29: Line 58:
 
* BioMoby uses LSIDs for internal processing.
 
 
* [http://wiki.tdwg.org/twiki/bin/view/TAG/LsidVocs LSID vocabularies]
 
 +
 +
Pro:
 +
* (there are no obvious pros of LSIDs, when compared to DOIs and Handles)
  
 
Con:
 
Line 35: Line 67:
 
* Unclear whether a central resolution authority really exists
 
 
* Community is much smaller than for other identifier schemes
 
* The W3C has written a document that shows [http://www.w3.org/2001/tag/doc/URNsAndRegistries-50 non-http schemes aren't any better than http-based schemes].
+
* The W3C has written a document that claims [http://www.w3.org/2001/tag/doc/URNsAndRegistries-50 non-http schemes aren't any better than http-based schemes].
  
 
=== UNF ===
 
Line 41: Line 73:
 
UNF is a content-based identifier for data objects, somewhat like a fingerprint.
 
  
** [http://thedata.org/help/start/citation description at the Dataverse Network]
+
* [http://thedata.org/help/start/citation description at the Dataverse Network]
  
=== Custom ===
+
Pro:
 +
* The UNF can help to detect errors in transcribing identifiers.
  
Define our own identifier system, and add DOIs/Handles as appropriate.
+
Con:
 
+
* The UNF is not easy for humans to interpret.
* "Handle-like" identifiers are handed out for free by DSpace, so why not use them?
+
* The UNF code only works with file formats used by statistical software.
  
 
== Identifier schemes in use ==
 
  
 +
Almost all repositories of publications use DOI, but data repositories vary more:
 
* CiteSeer: custom
 
 
* ChemXSeer: doi, custom
 
Line 65: Line 99:
 
* World Data Center: doi
 
 
* PDB: custom, doi
 
* ACM: doi
+
* Pangaea: doi
* Dlib magazine: doi
 
  
== Notes ==
+
== Notes/Decisions ==
  
* We need to look at other repositories (genbank, treebase, gbif, etc.) to see what types of identifiers they are using.
+
* We do not want to fall into the same trap as [[arXiv]]. They were forced to [http://arxiv.org/help/arxiv_identifier change their identifier system], because the number of items added to the repository could no longer be accommodated by the old scheme. However, the new scheme is still inflexible, and is guaranteed to be invalid eventually, because they include a 2-character year code.
* There is a section on identifiers in the grant proposal....
 
 
 
* We do not want to fall into the same trap as [[arXiv]]. They were forced to [http://arxiv.org/help/arxiv_identifier change their identifier system], because the number of items added to the repository could no longer be accommodated by old scheme. However, the new scheme is still inflexible, and is guaranteed to be invalid eventaully, because they include a 2-character year code.
 
 
* [http://www.lackoftalent.org/michael/blog/2007/06/05/identifier-persistence-fundamentals/ Mike Giarlo says] the choice of identifier scheme doesn't really matter; the commitment to persistence is key.
 
 
* Ryan's previous thoughts on [http://wiki.dlib.indiana.edu/confluence/display/INF/Identifiers semantic vs. non-semantic identifiers].
 
* Peter Buneman's [http://homepages.inf.ed.ac.uk/opb/papers/ssdbm2006.pdf thoughts on making identifiers citable].
+
* [http://www.crossref.org/02publishers/how_to_faq.html CrossRef suggests] creating DOIs of the form DOI-institution-code/Handle-institution-code/Handle-specific-part. These will trivially convert to a form that DSpace can use.
 
 
 
 
== Open Questions ==
 
 
 
# Is it possible to create sub-parts of a DOI? For example: http://dx.doi.org/1234/abc1234/subpart1. This would allow us to limit the number of DOIs registered, but provide access at any granularity we wish.
 
# Is it possible to get the same "institution identifier" in both the DOI and CNRI handle systems?
 
# Do we want to assign identifiers to particular bitstreams, like LSID does? This seems ripe for disaster. While we want software to be able to work with a data object in a consistent manner (we don't want to suddenly change the format out from under them), we also don't want to preserve data formats that are definitely dead (in 50 years, we won't have tools to parse an Excel 2003 file, or the current form of a NEXUS file). DSpace by default assigns handles to the item level only (which is abstract), and treats individual bitstreams as manifestations of the item, with identifiers tied to the hostname. Bitstreams which need to be cited are typically placed in their own item.
 
# Should we use the "default handle" as the primary identifier, or assign our own? The default handles are tightly tied into DSpace. We can assign our own identifier and attempt to hide the default handle, the same way the IU Fedora repository hides PIDs. But is there any reason to not use these handles? We might as well use them until a problem comes up.
 
  
 +
== See Also ==
  
 +
* [[Citing Data]] -- lists scholarly articles that discuss methods for data citations
  
 
[[Category:Software]]
 
 +
[[Category:Metadata]]
 +
[[Category:Historical Pages]]

Latest revision as of 10:51, 27 April 2012

    STATUS: This page is outdated and of historical use only.

Dryad has decided to use DOIs for all objects in the repository. For details, see DOI Usage.

Types of Identifier Schemes

Handle

Pro:

  • Handles are native to DSpace.
  • Handles have the same basic architecture as DOIs, but are much less expensive to implement. The total cost is about $100 per year, regardless of the number of identifiers.

Con:

  • Must register the local resolver with a central service.

DOI (Digital Object Identifier)

DOIs are one particular implementation of handles, used widely in the publishing industry. One difference is that the DOI system mandates metadata that must be associated with every DOI, while "plain" handles (and most other ID systems) leave metadata up to the user.

Pro:

  • Publishers and scientists are already familiar with them
  • Supported by ISI (though they may not consciously be supporting DOIs for data objects)
  • CrossRef provides searching and citation linking services for DOIs in the CrossRef system (it is unclear whether a DOI issued by another agency can be listed in the CrossRef system).

Con:

  • Must register the local resolver with a central service.
  • Must pay to register each identifier. TIB charges 60 cents per identifier, though this cost will come down through the DataCite collaboration. CrossRef charges 6 cents per data file identifier and $1 for each database identifier (though it is unclear what constitutes a database). These costs are trivial for a one-identifier-per-article model, but can be quite expensive for more granular schemes, such as the one-identifier-per-observation model that would be ideal for fully machine-readable data.
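
As a rough illustration of how granularity drives cost, the sketch below applies the per-identifier prices quoted above to two registration models. The article and observation counts are hypothetical, chosen only to make the arithmetic concrete. Prices are held in whole cents to avoid floating-point rounding.

```python
# Prices quoted on this page, in US cents per identifier.
TIB_CENTS_PER_DOI = 60
CROSSREF_CENTS_PER_DATA_FILE = 6


def registration_cost_cents(num_identifiers, cents_per_identifier):
    """Total cost, in cents, of registering the given number of identifiers."""
    return num_identifiers * cents_per_identifier


def as_dollars(cents):
    """Format a cent amount as a dollar string."""
    return f"${cents / 100:,.2f}"


# Hypothetical repository size (assumption, not a Dryad figure).
articles = 1_000
observations_per_article = 5_000
observations = articles * observations_per_article

per_article_tib = registration_cost_cents(articles, TIB_CENTS_PER_DOI)
per_observation_tib = registration_cost_cents(observations, TIB_CENTS_PER_DOI)
per_observation_crossref = registration_cost_cents(
    observations, CROSSREF_CENTS_PER_DATA_FILE
)

print("one DOI per article (TIB):         ", as_dollars(per_article_tib))
print("one DOI per observation (TIB):     ", as_dollars(per_observation_tib))
print("one DOI per observation (CrossRef):", as_dollars(per_observation_crossref))
```

With these assumed volumes, the per-article model costs $600.00, while the per-observation model costs $3,000,000.00 at the TIB price and $300,000.00 at the CrossRef data-file price.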

DOI + Handle (hybrid)

An old post from Rod Page suggests using DOIs for manuscript-level information and handles for more granular information. This would hold down costs, but could be messy.

Pro:

  • Allows DOIs to be used at the level of a whole article (data package), which is the level where DOI provides the most benefits.
  • Since the DOI and handle technologies are nearly identical, only a single service needs to be maintained for resolving identifiers.

Con:

  • May be confusing for authors, especially if a single publication cites data at both the DOI and handle levels.

DOI parameters

Generic DOIs may be registered, and parameters may be added to access sub-parts of the DOI object.

Pro:

  • This allows us to limit the number of DOIs registered, while providing access at any granularity we wish.

Con:

  • The resultant URLs are long and complex
  • The parameter passing mechanism may change over time, making this approach poorly suited for persistent identifiers.
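
To show why the resulting URLs grow long, here is a minimal sketch that appends query parameters to a DOI resolver URL. The DOI `10.1234/abc1234` and the parameter names `part` and `row` are hypothetical placeholders. The actual parameter conventions are whatever the registration agency and the target site agree on, and (as noted above) they may change over time.

```python
from urllib.parse import quote, urlencode

# Resolver base URL as used elsewhere on this page.
RESOLVER = "http://dx.doi.org/"


def parameterized_doi_url(doi, **params):
    """Build a resolver URL for `doi`, with optional sub-part parameters."""
    url = RESOLVER + quote(doi, safe="/")
    if params:
        # Sort the parameters so the same request always yields the same URL.
        url += "?" + urlencode(sorted(params.items()))
    return url


# Hypothetical DOI and parameter names, for illustration only.
print(parameterized_doi_url("10.1234/abc1234"))
print(parameterized_doi_url("10.1234/abc1234", part="table2", row="17"))
```

Only the generic DOI is registered in this scheme. Everything after the `?` is interpreted by whatever service the DOI resolves to, which is why its long-term stability is not guaranteed by the DOI system itself.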

LSID (Life Science Identifier)

LSID is a URN identifier scheme.

Pro:

  • There are no obvious advantages of LSIDs over DOIs and Handles.

Con:

  • There are no known sites that use LSIDs as their primary identifier, though a few sites (available from the LSID homepage) can resolve LSIDs into their identifier scheme.
  • LSIDs seem to have fallen out of favor as people have realized that URLs can be identifiers, and tools for other identifier schemes have improved.
  • Unclear whether a central resolution authority really exists
  • Community is much smaller than for other identifier schemes
  • The W3C has written a document that claims non-http schemes aren't any better than http-based schemes.

UNF

UNF is a content-based identifier for data objects, somewhat like a fingerprint.

Pro:

  • The UNF can help to detect errors in transcribing identifiers.

Con:

  • The UNF is not easy for humans to interpret.
  • The UNF code only works with file formats used by statistical software.
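
The general idea of a content-based fingerprint can be sketched as follows. This is not the UNF algorithm, and its output is not a valid UNF. It is only an illustration, under the assumption that the data is a single column of numbers, of how normalizing values before hashing makes the identifier depend on the content rather than on how the file happens to be written.

```python
import hashlib


def normalize(value, digits=7):
    """Render a number canonically, rounded to `digits` significant digits."""
    return format(float(value), f".{digits - 1}e")


def fingerprint(values, digits=7):
    """Hash the canonical form of a column of numbers (illustrative only)."""
    canonical = "\n".join(normalize(v, digits) for v in values)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# The same observations written two different ways give the same fingerprint.
same = fingerprint(["1.50", "2", "3e0"]) == fingerprint([1.5, 2.0, 3])
# Changing a single value gives a different fingerprint.
changed = fingerprint([1.5, 2.0, 3]) != fingerprint([1.5, 2.0, 3.1])
print(same, changed)
```

The hexadecimal digest this produces also illustrates the first drawback listed above: a fingerprint of this kind carries no meaning a human can read.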

Identifier schemes in use

Almost all repositories of publications use DOI, but data repositories vary more:

  • CiteSeer: custom
  • ChemXSeer: doi, custom
  • GenBank: accession
  • PubMed: custom
  • GBIF: custom
  • KNB: custom
  • OceanPortal: accession
  • Morphbank: accession
  • MorphoBank: accession
  • National Climatic Data Center: custom
  • Paleobiology Database: accession
  • TreeBASE: custom
  • World Data Center: doi
  • PDB: custom, doi
  • Pangaea: doi

Notes/Decisions

  • We do not want to fall into the same trap as arXiv. They were forced to change their identifier system, because the number of items added to the repository could no longer be accommodated by the old scheme. However, the new scheme is still inflexible, and is guaranteed to be invalid eventually, because they include a 2-character year code.
  • Mike Giarlo says the choice of identifier scheme doesn't really matter; the commitment to persistence is key.
  • Ryan's previous thoughts on semantic vs. non-semantic identifiers.
  • CrossRef suggests creating DOIs of the form DOI-institution-code/Handle-institution-code/Handle-specific-part. These will trivially convert to a form that DSpace can use.
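
The CrossRef-suggested form in the last note can be sketched as a pair of string conversions. The prefixes and suffix used here are made-up placeholders, not real Dryad, CrossRef, or CNRI assignments, and the function names are hypothetical. How DSpace would actually be configured to accept such identifiers is not covered.

```python
def doi_to_handle(doi):
    """Drop the DOI institution code, leaving 'handle-prefix/handle-suffix'.

    Raises ValueError if the DOI does not have the three-part form
    DOI-institution-code/Handle-institution-code/Handle-specific-part.
    """
    doi_prefix, handle_prefix, handle_suffix = doi.split("/", 2)
    return f"{handle_prefix}/{handle_suffix}"


def handle_to_doi(handle, doi_prefix):
    """Prepend the DOI institution code to an existing handle."""
    return f"{doi_prefix}/{handle}"


# Placeholder values, for illustration only.
doi = "10.9999/1234.5/678"
handle = doi_to_handle(doi)
print(handle)
print(handle_to_doi(handle, "10.9999"))
```

Because the handle is embedded unchanged in the DOI, the conversion is lossless in both directions as long as the DOI institution code is known.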

See Also

  • Citing Data -- lists scholarly articles that discuss methods for data citations