External Metadata Use

From Dryad wiki

Nico Carver is working to analyze how Dryad metadata is being used by external systems.

Tasks

  1. Review metadata use by other systems and document their use:
    • DataCite -- datacite metadata search, DataCite Metadata Store
    • Google Scholar -- in search results, and also My Publications (example).  Also see whether we are correctly using the DSpace metadata for Google Scholar
    • Google
    • ROpenSci
    • Total Impact
    • RSS/Atom readers -- feeds should include everything needed to create a citation
    • Mendeley
    • All other systems that are linked in the Cite/Share sections of Dryad pages.
      • Cite: BibTeX, RIS
      • Share: Delicious, Connotea, Digg, Reddit, Tweet, Facebook
    • DataONE
    • Thomson Reuters
    • Sciencegate.ch
    • Other data aggregators, search engines, etc.
  2. Review whether the DataCite metadata search is fully Linked Data compliant.
  3. Test DataCite/CrossRef metadata transfer, paying attention to compliance for Linked Data purposes.
  4. Review/update Data Access page.
    • Add information for Atom feeds, more details for ROpenSci
  5. Provide recommendations for metadata curation to make metadata as useful as possible to others.
  6. Provide recommendations for API changes.

Methods

I am investigating each service by searching for Dryad metadata using an experimental set and a control set of Dryad IDs (see below). I search in this order: title, first author, last author. I investigate all search and browse methods for each service and record the metadata elements that display for each ID searched. I will use the data collected in this manner to assess metadata consistency and completeness in my report. I will then look further into 'fixing' incomplete or inconsistent metadata. In many cases, I will be running up against the capabilities of the system being used, but in some cases Dryad will be able to update its output formats in order to improve metadata quality.

Experimental Set:

- Single author [1] [2]

- Many authors: the vast majority of items in the repository fit this, but here's one with an exceptionally long author list: [3]

- Old [4] [5] [6]. The bulk of the other early content is from the Systematic Biology archives and was uploaded by Dryad staff on behalf of the authors, for example: [7] [8] [9]

- New [10]

- Incomplete metadata [11] [12] [13]

- Embargoed. Until article publication (this will, of course, change when the articles are published, which could be fairly soon): [14] [15]. One year (end date not yet set): [16]. One year (end date has been set): [17]. Longer/custom embargo length: [18]

- No abstract [19] [20] [21]

Control Set:

These items have complete metadata. This includes article citation, article DOI, article publication date, embargo end dates when appropriate, etc. [22] [23] [24] [25] [26]


Findings

Percentage Complete

This is a calculation of the average minimal completeness for each service. Minimal completeness is defined as the presence of the metadata needed to make a citation.

Service          Minimal completeness
DataCite         90%
GoogleScholar    46%
ROpenSci         83%
Totalimpact      100%
RSS              17%
Atom             20%
RIS/bibtex       100%
Mendeley         67%
Sciencegate.ch   67%


Totalimpact and the Cite features (RIS/BibTeX) are the only services that scored 100%. DataCite will improve to 100% when Dryad moves to the advanced EZID metadata model (which will be soon).
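
As a rough illustration of how such a score can be computed, here is a minimal sketch in Python. The list of citation fields and the record layout are assumptions made for this example; this page does not enumerate the exact fields that were counted, so the sketch will not reproduce the percentages above.

```python
# Fields assumed (for this sketch only) to be the minimum needed for a citation.
CITATION_FIELDS = ("first_author", "other_authors", "title", "doi", "year", "publisher")


def minimal_completeness(record):
    """Fraction of the citation fields that have a non-empty value in one record."""
    present = sum(1 for field in CITATION_FIELDS if record.get(field))
    return present / len(CITATION_FIELDS)


def average_completeness(records):
    """Mean minimal completeness over all records observed on one service."""
    if not records:
        return 0.0
    return sum(minimal_completeness(r) for r in records) / len(records)


# Hypothetical observations for one service: what was displayed for two test IDs.
observed = [
    {"first_author": "Author A", "other_authors": "Author B", "title": "Data from: example one",
     "doi": "doi:example/1", "year": "2011", "publisher": "Dryad Digital Repository"},
    {"first_author": "Author C", "title": "Data from: example two", "year": "2012"},
]

print(f"{average_completeness(observed):.0%}")
```

Averaging per record rather than per field keeps the score comparable between services that were tested with different numbers of IDs.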

Consistency

This is a calculation of the percentage of searches for which each metadata attribute returned the correct value.

Service         1st Author  Multiple Authors  Title  DOI   Year/date  Dryad Digital Repository  Abstract/Description  Keywords
DataCite        100%        0%                100%   100%  100%       100%                      0%                    0%
GoogleScholar   100%        33%*              100%   0%    100%       0%                        20%*                  0%
ROpenSci        100%        100%              100%   100%  100%       0%                        100%                  100%
Totalimpact     100%        100%              100%   100%  100%       0%                        0%                    0%
RSS             0%          0%                100%   0%    0%         0%                        55%                   0%
Atom            33%         0%                100%   0%    0%         0%                        55%                   0%
RIS/bibtex      100%        100%              100%   100%  100%       100%                      0%                    0%
Mendeley        100%        100%              100%   100%  0%         0%                        100%                  100%
Sciencegate.ch  100%        100%              100%   0%    100%       0%                        100%                  100%

* Based on search terms

As you can see, most of these services are fairly consistent: a given attribute is either always returned correctly or never returned, hence all the 100% and 0% values. The least consistent are the RSS/Atom feeds and Google Scholar.
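
The consistency percentages can be thought of as a per-attribute hit rate. The sketch below shows one way to compute it in Python; the data layout (dictionaries keyed by a test ID) and the sample values are assumptions for illustration, not the spreadsheet actually used for this analysis.

```python
def attribute_consistency(expected, observed):
    """Percentage of test items for which each attribute was returned correctly.

    expected: {item_id: {attribute: correct_value}}
    observed: {item_id: {attribute: value_shown_by_the_service}}
    """
    attributes = sorted({attr for record in expected.values() for attr in record})
    result = {}
    for attr in attributes:
        # Only items that actually have a correct value for this attribute count.
        items = [item for item, record in expected.items() if attr in record]
        correct = sum(
            1 for item in items
            if observed.get(item, {}).get(attr) == expected[item][attr]
        )
        result[attr] = round(100 * correct / len(items))
    return result


# Hypothetical example: two test items, one of which lost its year on the service.
expected = {
    "item-1": {"title": "Data from: example one", "year": "2011"},
    "item-2": {"title": "Data from: example two", "year": "2012"},
}
observed = {
    "item-1": {"title": "Data from: example one"},
    "item-2": {"title": "Data from: example two", "year": "2012"},
}

print(attribute_consistency(expected, observed))
```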

DataCite

Appearance:

-The search results display: title, DOI hyperlink, first author, the searched-for terms highlighted, and faceted refinement features to the side.

-The full page results display: DOI, DOI hyperlink (to Dryad), full citation (first author only), export buttons for BibTeX and RIS (that don’t work correctly), and then a list of metadata formats for download as plaintext: x-datacite+xml, x-datacite+text, rdf+xml, turtle, x-bibtex, x-research-info-systems, html

Potential Problems:

-Missing metadata: data file DOIs, journal article DOI, multiple authors/creators. (These should be fixed with the release of 1.11.)

-The buttons for BibTeX and RIS are reversed, i.e. the BibTeX button downloads the RIS file and vice versa. (This is a problem with DataCite's service. I issued a request for this to be fixed on 2/17. The response on 2/19 acknowledged that they were aware of the issue and pointed me to their GitHub page for issue tracking.)

-Using Advanced Search, when filtering by Datacentre, the CDL has many options (e.g. "CDL.USGS - USGS") but all Dryad data is categorized under CDL.CDL.  I think we could change this to "CDL.DRYAD – Dryad Digital Repository" to make clear where the data is coming from.

Recommendations:

-Contact the appropriate person at the CDL to see if we can have the proper suffix when filtering by datacentre through the advanced search of DataCite

-All other problems should be fixed with the new release, but I will check back to make sure

Google Scholar

Appearance:

-The search results display: the full title of the page (linking to the landing page) and a by-line listing the authors (truncated with an ellipsis after a certain number of characters), the year, and datadryad.org.

-The "my publications" section displays title and author(s) in the list form. If you click for more details, you get: Title, Author(s), Publication Date, Abstract, Citation Count, versions.

Potential Problems:

-Publisher information is not displayed because it is not in the metatags when Google crawls the site.  This means that in many views on Google Scholar, there is no indication that the data is coming from Dryad.

-Google Scholar used to crawl the development site. While this was happening, at least one scientist put his dataset from the dev site on his profile. It is still there, and if someone clicks on it, the landing page is full of dead links.

-As of 3/18/12, nothing from Dryad is indexed in Google Scholar.

Recommendations:

-If possible, in the DC.terms to Google Scholar mapping file, include the value "Dryad Digital Repository" for the attribute "citation_journal_title".
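
To make the intended result concrete, here is a minimal sketch of the meta tags a landing page would carry after such a change. It is written in Python purely for illustration and is not how DSpace implements the mapping; the input dictionary keys ("title", "creators") are hypothetical names, while "citation_title", "citation_author", and "citation_journal_title" are the tag names Google Scholar reads.

```python
from html import escape

# Fixed value proposed in the recommendation above.
PUBLISHER = "Dryad Digital Repository"


def highwire_meta_tags(dc):
    """Build citation_* meta tags from a simplified record (hypothetical keys)."""
    tags = [("citation_title", dc["title"])]
    tags += [("citation_author", author) for author in dc.get("creators", [])]
    # Always identify the repository, regardless of what the record says.
    tags.append(("citation_journal_title", PUBLISHER))
    return "\n".join(
        f'<meta name="{name}" content="{escape(value, quote=True)}" />'
        for name, value in tags
    )


# Hypothetical data package record.
print(highwire_meta_tags({
    "title": "Data from: an example study",
    "creators": ["Author A", "Author B"],
}))
```

Emitting one "citation_author" tag per creator, in order, is what allows a crawler to show the full author list rather than only the first author.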


ROpenSci

Appearance:

This is a rich command-line tool.  It is possible to extract any metadata available for any data package on the site that is available through OAI-PMH.

Potential Problems:

When querying for total data packages and total data files, the numbers returned are substantially higher than those shown on the datadryad.org homepage. Ryan guesses this is because the count from R includes datasets that have been deleted, and the R client is failing to account for records that have been tagged as deleted.
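
If Ryan's guess is right, the difference should show up in the OAI-PMH responses themselves, because the protocol marks removed records with status="deleted" on the record header. The following Python sketch counts live and deleted headers in a response. The sample XML is invented for illustration (it is not real Dryad output), and the sketch does not handle resumption tokens or make any network request.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Invented sample of a ListIdentifiers response; identifiers are placeholders.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>oai:example.org:1</identifier>
      <datestamp>2012-01-01T00:00:00Z</datestamp>
    </header>
    <header status="deleted">
      <identifier>oai:example.org:2</identifier>
      <datestamp>2012-01-02T00:00:00Z</datestamp>
    </header>
    <header>
      <identifier>oai:example.org:3</identifier>
      <datestamp>2012-01-03T00:00:00Z</datestamp>
    </header>
  </ListIdentifiers>
</OAI-PMH>"""


def count_headers(xml_text):
    """Count record headers, separating those flagged as deleted."""
    root = ET.fromstring(xml_text)
    headers = root.findall(".//oai:header", OAI_NS)
    deleted = sum(1 for h in headers if h.get("status") == "deleted")
    return {"total": len(headers), "deleted": deleted, "live": len(headers) - deleted}


print(count_headers(SAMPLE_RESPONSE))
```

A client that reports "total" where the homepage reports "live" would produce exactly the kind of overcount described above.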

Recommendations:

We should communicate the above problem to the developers of ROpenSci.

Total Impact

Appearance:

The landing page has all metadata needed for a citation except 'year'. It also has counts for 'views' and 'downloads'. The TSV file has fuller metadata, including year and "mentions".

Potential Problems:

-Total Impact reports the 'data type' quite prominently, and Dryad datasets are very inconsistent: about half come up as "article" and half as "dataset". I have not found a pattern.

-It would be nice if year was displayed on the landing page. Also the author search is very finicky at this point.

Recommendations:

Talk to the developers of Total Impact about these minor quibbles.

RSS/Atom

Appearance:

Minimally, a reader will display the title. Some readers also show the abstract, and the combination of Google Reader and the Atom stream for Data Packages only also displays one author (sometimes the first author, sometimes the last).

Potential Problems:

Wildly inconsistent, and the metadata displayed could be better. Todd suggested at the last meeting that these should display everything needed for a citation.

Recommendations:

I will follow up here after the new release, since things will change.

Mendeley

Appearance:

Really depends on your access mechanism and whether one is using the Web App or the Desktop App.  I've had the most luck getting full metadata by downloading a RIS file from a landing page and dropping the file into the desktop app.  

Potential Problems:

-Mendeley's metadata inclusion model is the same as Google's, but DSpace has only written the mapping of metatags from DC to HighWire Press for when Google's robots scrape the site. So even though the model is the same, Mendeley is currently only able to work with our DC tags, which leaves too much ambiguity for the robots. For instance, there are three different dates. The mapping file for Google Scholar maps the three DC dates to one HighWire Press tag, "citation_publication_date". Mendeley is missing this, so it simply does not display a date (in the Web app).

Recommendations:

-Mendeley needs a mapping file like the one for Google Scholar. It is possible they could use the same mapping file, and we just have to tell robots.txt to look at it when the Mendeley robots are scraping the site. I don't know enough about DSpace to say for sure. However, I am fairly confident that since it is the same setup as Google Scholar, this shouldn't be a huge undertaking.

-The same recommendation about "citation_journal_title" = "Dryad Digital Repository" applies here. Also make sure the various DC.date fields map to "citation_publication_date".
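
The date ambiguity can be illustrated with a small Python sketch that collapses several DC date fields into the single value a crawler would read as "citation_publication_date". The three field names and their priority order are assumptions for this example; this page does not say which three dates are involved or how the existing Google Scholar mapping orders them.

```python
# Assumed DC date fields, in the order they are tried (hypothetical priority).
DATE_PRIORITY = ("dc.date.issued", "dc.date.available", "dc.date.accessioned")


def citation_publication_date(dc):
    """Return one date for citation_publication_date, or None if no date is set."""
    for field in DATE_PRIORITY:
        value = dc.get(field)
        if value:
            # Keep only the YYYY-MM-DD part of an ISO 8601 timestamp.
            return value[:10]
    return None


# Hypothetical record with two of the three dates filled in.
record = {
    "dc.date.accessioned": "2011-06-01T12:00:00Z",
    "dc.date.issued": "2011-05-20",
}
print(citation_publication_date(record))
```

Whatever order is chosen, the point is that the choice is made once on the Dryad side instead of being left to each external robot.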

Sciencegate.ch

Appearance:

-Search results display: date of appearance, title, creators, and description (truncated)

-full page displays: title, creators, date of appearance, submission date, description, subject(s), type, "repository URL", mobile tag (recursive)

Potential Problems:

-The "type" field is inconsistent. I've seen "article", "dataset", and "dataset; untilArticle"
-The "creator" field does not list authors in the correct order, nor is it alphabetical.
-Having two date fields seems unnecessary
-The mobile tag is recursive. What is the point of it?
-Repository URL displays as: http://www.datadryad.org/oai/request . When you click it, it brings you to a search on Sciencegate for all Dryad material. This is not what I expected from what looked like a link to the repository.

Recommendations:

They use OAI-PMH, so make sure they are aware of the new API when it is finished.  I haven't finished my analysis here because the site is down so often.

Cite: BibTeX

I encountered the problem of a comma that was consistently missing from the bib files. I reported this to Ryan (2/14/12).

Cite: RIS


Fine.

Share Functions

The Delicious link is not working. All others display a linked title, except Tweet, which suggests a full tweet that can be modified.

Recommendations:

Once Mendeley is working to our satisfaction, add an "import to Mendeley" button. Also investigate services that consolidate share links, like addthis.com or sharethis.com.

Summary of Recommendations

  • Currently the DC metadata for data packages is often an amalgam of metadata about the "package" and metadata about the article. This is confusing outside of the context of Dryad. The two fields that cause the most problems on external systems are DC.type and DC.publisher. This is especially true in reference managers like Mendeley, where currently about half of data packages are classified as type=Generic and half as type=Article. My recommendation is to decide if Dryad should get out of the business of providing article-specific metadata.
    • More specifically, I would like to see Dryad change the various metadata access mechanisms so that they all report "Dryad Digital Repository" as the publisher of all Data Packages/Data Files in the repository. This would be redundant metadata on the site itself, so it might be better to devise a way to make it machine-readable only. (For Mendeley and Google Scholar, "DC.publisher" should map to "citation_journal_title".)
    • Similarly, the DC.type value should be consistent for all data packages on external systems. On external systems, the lack of a value is better than "Article", but there may be something that works even better.  It would be nice to differentiate between the data files and the data package through this field.
  • In most cases, potential problems with the metadata on external systems can only be solved by the developers of those systems. I tried reaching out when I could and received a quick, positive response from DataCite and ROpenSci. I've received no response from Sciencegate.ch or CiteULike. So my recommendation is to keep channels of communication open with services that we know are using Dryad metadata, and make sure to inform them once the new API is finished.