User:Chrisftaylor/Project documents/Dryad search

=Background=
Several organizations offer data set (metadata) search; for example, DataONE's ONEMercury, ANDS' Research Data Australia, the UK Data Service's Discover and DataCite's Metadata Search (beta). The table below lists some of their main features and compares them with Google Search and Dryad Search.

Table One. The main features of a series of search services.

Each of the above features appears more than once, indicating some degree of utility. ‘Relevance’ (for which all but Google use Solr/Lucene) and ‘Quality’ (of annotation, that being the only available estimator) are fuzzy, context-dependent concepts, but they offer the only practical handles for large result sets (inevitable as repositories grow in size and number); this is the second issue listed under Requirements.

An issue largely unaddressed by the search engines (other than Google) considered above is the invisibility of data sets with missing or erroneous metadata (for example, DataCite lists more than half of the data sets as having no value for ‘resourceType’). Where classifiers are used, the number of unclassified data sets should always be shown.
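As a minimal sketch of the point about unclassified data sets, the following counts records per classifier value while keeping an explicit bucket for records with no value. The record layout and the `(unclassified)` label are assumptions made for illustration; only the field name ‘resourceType’ comes from the text above.

```python
from collections import Counter

# Hypothetical label for records that lack a value for the classifier.
UNCLASSIFIED = "(unclassified)"

def facet_counts(records, field):
    """Count records per facet value, with an explicit bucket for
    records whose value is missing or blank."""
    counts = Counter()
    for record in records:
        value = (record.get(field) or "").strip()
        counts[value or UNCLASSIFIED] += 1
    return counts

# Made-up sample records, for illustration only.
records = [
    {"title": "A", "resourceType": "Dataset"},
    {"title": "B", "resourceType": ""},
    {"title": "C"},
]
counts = facet_counts(records, "resourceType")
```

Displaying `counts[UNCLASSIFIED]` next to the other facet values would make the otherwise invisible records countable.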

=Requirements=
ISSUE: The tab-based display of results is limiting and many more data sources are available.
 * Access to more sources, initially through DataCite and DataONE.
 * Specification of either 'Dryad only' or 'all available sources' for search.
 * Clear indication of the host database for data sets.

ISSUE: Quality is difficult to judge; metadata completeness and publication status could help.
 * Clear display and filtering based on status of any linked publication(s).
 * Sorting on ‘metadata completeness’ (the proportion of fixed fields that have content).
 * Sorting (where supported) on number of accesses to a record (such attention, suitably scaled, may indicate quality).


 * Problems with records in DataCite

==Increasing the number of metadata sources==

 * DataONE comprises diverse resources, each handled by a tailored import function in ONEMercury. The metadata collated by (ONE)Mercury can be accessed or downloaded using OAI-PMH.
 * Other Mercury-based metadata catalogs (listed at the project website) could be added at a later date.
 * Example using ORNL's Mercury instance.
 * Query:
 * Filters:
 * Indexing:
 * DataCite offers both OAI-PMH access to, and RDF download of, its catalogue of research datasets, each of which meets DataCite's standard for metadata completeness.
 * Other repository catalogs, namely Databib and re3data, will come under DataCite's aegis by next year (announcement by DataCite).
 * Example using DataCite's OAI-PMH Data Provider (beta).
 * Query:
 * Filters:
 * Indexing:
 * NULL
 * Query:
 * Filters:
 * Indexing:
 * The Dataverse Network, which implements the Data Documentation Initiative (DDI) metadata format and has a mature API, offers a third route for expansion.
 * Example using Harvard's Dataverse instance
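Since both Mercury and DataCite expose their metadata through OAI-PMH, a harvest could be sketched roughly as below. The verb `ListRecords`, the `metadataPrefix` and `resumptionToken` arguments, and the XML namespace come from the OAI-PMH 2.0 protocol; the base URL `https://example.org/oai` and the sample response are placeholders, not real endpoints or data from any of the providers named above.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token:
        # resumptionToken is an exclusive argument: it replaces metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)

def parse_list_records(xml_text):
    """Return (record identifiers, resumption token or None) from a response."""
    root = ET.fromstring(xml_text)
    path = ".//oai:record/oai:header/oai:identifier"
    identifiers = [el.text for el in root.findall(path, OAI_NS)]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text if token_el is not None and token_el.text else None
    return identifiers, token

# Placeholder response, trimmed to the elements the parser reads.
sample = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
    <resumptionToken>page2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

first_url = list_records_url("https://example.org/oai")
identifiers, token = parse_list_records(sample)
next_url = list_records_url("https://example.org/oai", resumption_token=token)
```

A full harvester would fetch `first_url`, parse the response, and keep requesting `next_url` until no resumption token is returned; the fetching step is omitted here so that the sketch runs without network access.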

==Supporting additional search facets (metadata completeness & accesses to record)==

 * Completeness is not currently represented in catalogs such as DataCite's; DataCite and ANDS assess it only against a minimum requirement for inclusion ('exists' = pass).
 * Attention could be measured by views or downloads; 'accesses' covers both. Few repositories support this, but there is an important issue here with respect to research impact.
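The two proposed facets could be computed along the following lines. This is a sketch under stated assumptions: the list of fixed fields, the `views` and `downloads` keys, and the sample records are all hypothetical, and no scaling of access counts is attempted.

```python
# Hypothetical set of fixed fields; a real schema would supply its own list.
FIXED_FIELDS = ("title", "creator", "date", "description")

def completeness(record, fields=FIXED_FIELDS):
    """Proportion of fixed fields that have non-blank content (0.0 to 1.0)."""
    filled = sum(1 for name in fields if str(record.get(name) or "").strip())
    return filled / len(fields)

def accesses(record):
    """Views plus downloads, treating a missing count as zero."""
    return (record.get("views") or 0) + (record.get("downloads") or 0)

def sort_records(records, key_func):
    """Sort so that the highest-scoring records come first."""
    return sorted(records, key=key_func, reverse=True)

# Made-up sample records, for illustration only.
records = [
    {"id": "r1", "title": "Survey", "creator": "", "views": 10},
    {"id": "r2", "title": "Counts", "creator": "Lee", "date": "2013",
     "description": "Plots", "downloads": 3},
    {"id": "r3", "title": "Traits", "creator": "Kim", "views": 40, "downloads": 5},
]
by_completeness = [r["id"] for r in sort_records(records, completeness)]
by_accesses = [r["id"] for r in sort_records(records, accesses)]
```

Because completeness is derived from the harvested metadata alone, it can be computed for every source, whereas access counts depend on each repository exposing them.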

==Dryad landing page==
The suggested new form of the sidebar search box for the Dryad landing page is shown in Figure One.
 * Radio buttons allow users to choose whether to search only for datasets held by Dryad or to search across all available resources.
 * Clicking '[?]' should pop up a list of all the databases whose holdings' metadata Dryad can search.


 * The link to the 'Advanced' search interface has been truncated (to make space for the radio buttons).

==Left-hand pane ('Search for data')==

 * The Dryad/All radio buttons have again been added below the text box.
 * 'Advanced Search' has its full length restored (space is not at a premium in this case).
 * Clicking 'Advanced Search' reveals the same components as before (the facet search widget and the list of refinements applied), but the link and the widget/list now have a block of color around them as an aid to the eye.
 * A permanent horizontal rule divides the search and sorting/results areas.
 * The pagination and sorting controls remain much as before, but should offer sorting by (metadata) completeness and downloads/views/accesses [t.b.d.] when available.
 * Data sets in the results list that are linked to publications have their authors and date bolded to attract the eye.
 * Small icons have been added at the start of each entry in the results list to indicate the source repository (and to add some color)
 * For commonly-used repositories an icon should be created (as shown for Dryad and FigShare) and stored.
 * For repositories lacking a dedicated icon, one is generated by placing the first letter of the first meaningful word in the resource's official title against a flat background, using a contrasting color pair chosen at random; the generated icon should then be stored. PANGAEA is shown as an example in the right-hand pane.
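The fallback icon rule could be sketched as follows. The stop-word list that decides which words are 'meaningful' and the brightness heuristic used to pick a contrasting foreground are assumptions, not part of the design above; the result is a plain description of the icon that a renderer could then draw and store.

```python
import random

# Assumed list of words that do not count as 'meaningful'.
STOP_WORDS = {"the", "a", "an", "of", "for", "and"}

def icon_letter(title):
    """First letter of the first meaningful word in the title, uppercased."""
    for word in title.split():
        cleaned = word.strip("'\"()[],.:;").lower()
        if cleaned and cleaned not in STOP_WORDS:
            return cleaned[0].upper()
    return "?"  # fallback when no meaningful word is found

def contrasting_pair(rng):
    """Random background color plus black or white text, whichever contrasts."""
    r, g, b = (rng.randrange(256) for _ in range(3))
    # Simple perceived-brightness heuristic (an assumption, not a standard check).
    brightness = 0.299 * r + 0.587 * g + 0.114 * b
    foreground = "#000000" if brightness > 128 else "#ffffff"
    return "#{:02x}{:02x}{:02x}".format(r, g, b), foreground

def make_icon(title, rng=None):
    """Describe a generated icon: letter, background and foreground colors."""
    if rng is None:
        rng = random.Random()
    background, foreground = contrasting_pair(rng)
    return {"letter": icon_letter(title), "background": background,
            "foreground": foreground}

icon = make_icon("PANGAEA", random.Random(0))
```

Storing the returned description (or the rendered image) once per repository keeps the icon stable across sessions, as the requirement to store it implies.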

==Right-hand pane ('Refine Search')==

 * The tabs used previously have been subsumed into the sidebar filter under 'Source Database', allowing an arbitrary number of sources to be represented.
 * Allows the user to filter by source.
 * Makes the database hosting the dataset explicit.
 * Serves as a key for the icons used in the results list.
 * Also added to the current list of sidebar filters, 'Publication Status' allows the user to display only published (or unpublished) data sets (i.e., those bolded, or not, in the results list).
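The 'Source Database' and 'Publication Status' filters could behave roughly as in the sketch below. The result layout (`source` and `publication` keys) and the sample entries are hypothetical; the sketch only illustrates how an arbitrary number of sources can be handled by one filter instead of one tab per source.

```python
from collections import Counter

def refine(results, sources=None, published=None):
    """Keep results matching the chosen sources and publication status.

    sources:   a set of source names, or None for all available sources.
    published: True or False to filter on linked publications, or None
               to ignore publication status.
    """
    kept = []
    for result in results:
        if sources is not None and result["source"] not in sources:
            continue
        if published is not None and bool(result.get("publication")) != published:
            continue
        kept.append(result)
    return kept

def source_counts(results):
    """Number of results per source, e.g. for display beside each filter."""
    return Counter(result["source"] for result in results)

# Made-up sample results, for illustration only.
results = [
    {"title": "Finch beaks", "source": "Dryad", "publication": "placeholder citation"},
    {"title": "Ice cores", "source": "PANGAEA", "publication": None},
    {"title": "Leaf traits", "source": "Dryad", "publication": None},
]
dryad_only = [r["title"] for r in refine(results, sources={"Dryad"})]
published_only = [r["title"] for r in refine(results, published=True)]
```

Passing `sources={"Dryad"}` corresponds to the 'Dryad only' radio button, while `sources=None` corresponds to searching all available sources.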