User:Chrisftaylor/Project reports/Dryad search

=Background=
Dryad Search currently covers three sources (Dryad, KNB and TreeBase) and offers the ability to sort and sub-search the result sets from each source individually, against what metadata are available. Current issues with Dryad Search include:
 * 1) The tab-based display of results is limiting (many other data sources are available)
 * 2) Standard sorting and sub-searching are of little help when result sets run to thousands

The next iteration of Dryad Search should meet the following criteria:
 * 1) The interface and underlying service should not limit the number of data sources
 * 2) Related data should be suggested for the members of a search result set
 * 3) The interface should be as simple and intuitive as possible
 * 4) Metadata quality should be addressed and relayed

=The research data search landscape=
Several organizations offer data set (metadata) search; for example, DataONE's ONEMercury, ANDS' Research Data Australia, the UK Data Service's Discover and DataCite's Metadata Search (beta). The table below lists some of their main features and compares them with Google Search and Dryad Search.

Table One. The main features of a series of search services.

Each of the above features (with the exception of Dryad’s free-text filter for specific fields in the result set) appears more than once, indicating some degree of utility. ‘Relevance’ (for which all but Google use Solr/Lucene) and ‘Quality’ (of annotation, that being the only available estimator) are fuzzy, context-dependent concepts, but offer the only practical handles for large result sets (inevitable as repositories grow in size and number); this is highlighted as issue #2 above.

An issue largely unaddressed by the search engines (other than Google) considered above is the invisibility of data sets with missing or erroneous metadata (for example, DataCite lists more than half of the data sets as having no value for ‘resourceType’). Where classifiers are used, the number of unclassified data sets should always be shown.
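
One way to keep such records visible is to give them a bucket of their own when facet counts are computed. The following is a minimal sketch only: the record layout and the `(unclassified)` label are assumptions made for illustration, not features of any of the services discussed.

```python
from collections import Counter

UNCLASSIFIED = "(unclassified)"  # hypothetical label for records lacking a value


def facet_counts(records, field):
    """Count records per facet value, keeping missing or blank values visible."""
    counts = Counter()
    for record in records:
        value = record.get(field)
        # Treat absent, None, and whitespace-only values as unclassified.
        if value is None or not str(value).strip():
            counts[UNCLASSIFIED] += 1
        else:
            counts[str(value).strip()] += 1
    return counts


# Illustrative records; 'resourceType' is the DataCite field mentioned above.
sample = [
    {"title": "A", "resourceType": "Dataset"},
    {"title": "B", "resourceType": ""},
    {"title": "C"},
    {"title": "D", "resourceType": "Image"},
]
counts = facet_counts(sample, "resourceType")
```

Because every record lands in exactly one bucket, the counts always sum to the size of the result set, so nothing drops out of view when a classifier is applied.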

=Requirements=
 * Access to more sources, initially through DataCite and DataONE.
 * Specification of either 'Dryad only' or 'all available sources' for search.
 * Clear indications of source database and status of any linked publication for data sets.
 * Sorting on ‘metadata completeness’ (the proportion of fixed fields that have content).
 * Sorting (where supported) on number of accesses; such attention may indicate quality.
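
As a rough illustration of the 'metadata completeness' measure defined above (the proportion of fixed fields that have content), the sketch below scores and ranks records. The field list is hypothetical; a real list would come from whichever metadata schema is in use.

```python
# Hypothetical fixed fields, chosen only for illustration.
FIXED_FIELDS = ("title", "creator", "publisher", "publicationYear", "resourceType")


def has_content(value):
    """True if a field value is non-empty (non-blank string, non-empty collection)."""
    if value is None:
        return False
    if isinstance(value, str):
        return bool(value.strip())
    if isinstance(value, (list, tuple, set, dict)):
        return len(value) > 0
    return True


def completeness(record, fields=FIXED_FIELDS):
    """Proportion of the fixed fields that have content, from 0.0 to 1.0."""
    if not fields:
        return 0.0
    filled = sum(1 for name in fields if has_content(record.get(name)))
    return filled / len(fields)


records = [
    {"title": "Sparse", "creator": [], "publisher": " "},
    {"title": "Full", "creator": ["X"], "publisher": "Dryad",
     "publicationYear": "2013", "resourceType": "Dataset"},
]
# Most complete records first.
ranked = sorted(records, key=completeness, reverse=True)
```

Blank strings and empty lists are counted as missing here; whether placeholder values such as 'unknown' should also count as missing is a policy decision the sketch leaves open.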

Increasing the number of metadata sources

 * DataONE comprises diverse resources, each handled by a tailored import function in ONEMercury. The metadata collated by (ONE)Mercury can be accessed or downloaded using OAI-PMH.
 * Other Mercury-based metadata catalogs (listed at the project website) could be added at a later date.
 * Example using ORNL's Mercury instance.
 * DataCite offers both OAI-PMH access to, and RDF download of, their catalogue of research datasets, each of which meets their standard for metadata completeness.
 * Other repository catalogs will come under DataCite's aegis by next year (i.e., Databib and re3data; announcement by DataCite)
 * Example using DataCite's OAI-PMH Data Provider (beta).
 * The Dataverse Network, which implements the Data Documentation Initiative's (DDI) metadata format and which offers a mature API, offers a third route for expansion.
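
Since both Mercury and DataCite expose their metadata over OAI-PMH, a single harvesting routine could in principle serve both. The sketch below builds a `ListRecords` request and parses Dublin Core titles from a response. The base URL `https://example.org/oai` and the sample XML are placeholders, not real endpoints or real records; only the verb, argument names and namespaces are taken from the OAI-PMH 2.0 protocol.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    if resumption_token:
        # resumptionToken is an exclusive argument: only the verb accompanies it.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (records, resumption_token) from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.findall(".//oai:record", NS):
        identifier = rec.findtext("oai:header/oai:identifier", default="", namespaces=NS)
        titles = [t.text for t in rec.findall(".//dc:title", NS) if t.text]
        records.append({"identifier": identifier, "titles": titles})
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS).strip()
    return records, token or None


# Made-up response in the shape the protocol defines, used instead of a live call.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example data set</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>abc123</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

records, token = parse_list_records(SAMPLE)
next_url = list_records_url("https://example.org/oai", resumption_token=token)
```

A harvester would repeat the request with each returned token until none is supplied, which is how the protocol pages through large catalogues.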

Supporting additional search facets (metadata completeness & accesses to record)

 * Completeness is not currently assessed other than against a minimum requirement (by DataCite and ANDS), and the result is not shared other than as a pass/fail.
 * Attention could be measured by views or downloads; 'accesses' covers both. Few research data repositories (RDRs) support this, but it bears on the wider question of research impact.
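
Because only some repositories report views or downloads, any sort on 'accesses' has to cope with records that carry no count at all. The sketch below shows one possible treatment, in which uncounted records are placed after counted ones instead of being treated as zero. The `views` and `downloads` field names are assumptions.

```python
def accesses(record):
    """Views plus downloads, or None when the repository reports neither."""
    views = record.get("views")
    downloads = record.get("downloads")
    if views is None and downloads is None:
        return None
    return (views or 0) + (downloads or 0)


def sort_by_accesses(records):
    """Most-accessed first; records without any count keep their order at the end."""
    def key(record):
        total = accesses(record)
        # Tuples sort element-wise: counted records (False) precede uncounted (True).
        return (total is None, -(total or 0))
    return sorted(records, key=key)


results = [
    {"id": "a", "views": 10, "downloads": 5},
    {"id": "b"},
    {"id": "c", "downloads": 40},
    {"id": "d", "views": 0},
]
ordered = [r["id"] for r in sort_by_accesses(results)]
```

Keeping 'no data' distinct from 'zero accesses' avoids penalizing data sets simply because their host repository does not publish usage figures.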

Dryad landing page
The suggested new form of the sidebar search box for the Dryad landing page is shown in Figure One.
 * Radio buttons allow users to choose whether to search only for datasets held by Dryad or to search across all available resources.
 * Clicking '[?]' should pop up a list of all the databases whose holdings' metadata Dryad can search.


 * The link to the 'Advanced' search interface has been truncated (to make space for the radio buttons).

Main search page
The suggested changes to the Dryad main search page are shown in Figure Two.

Left-hand pane ('Search for data')

 * The Dryad/All radio buttons have again been added below the text box.
 * 'Advanced Search' has its full length restored (space not at a premium in this case).
 * Clicking 'Advanced Search' reveals the same components as before (the facet search widget, and the list of refinements applied)
 * but the link and the widget/list have a block of color around them as an aid to the eye
 * and a permanent horizontal rule remains to divide the search box from the sorting/results area below.
 * The pagination and sorting controls remain much as before, but should offer sorting by (metadata) completeness and downloads/views/accesses [t.b.d.] when available.
 * Data sets linked to publications shown in the results list have their authors and date bolded to attract the eye.
 * Small icons have been added at the start of each entry in the results list to indicate the source repository (and to add some color)
 * For commonly-used repositories an icon should be created (as shown for Dryad and FigShare) and stored.
 * For repositories lacking a dedicated icon, one is generated: the first letter of the first meaningful word in the resource's official title is placed against a flat background, using a contrasting color pair chosen at random, and the resulting icon is then stored for that repository.
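
The fallback icon rule above could be sketched roughly as follows. The stop-word list, the way the contrasting pair is derived (a dark shade of a random hue paired with a light shade of its complement) and the in-memory cache standing in for storage are all assumptions made for illustration.

```python
import colorsys
import random

# Assumed list of words not considered 'meaningful' in a title.
STOP_WORDS = {"a", "an", "the", "of", "for", "and"}


def icon_letter(title):
    """First letter of the first meaningful word in a repository title."""
    for word in title.split():
        cleaned = word.strip("()[]'\",.:;")
        if cleaned and cleaned.lower() not in STOP_WORDS and cleaned[0].isalnum():
            return cleaned[0].upper()
    return "?"


def to_hex(rgb):
    """Convert an (r, g, b) tuple of floats in [0, 1] to '#rrggbb'."""
    return "#" + "".join(f"{round(c * 255):02x}" for c in rgb)


def random_contrasting_pair(rng=random):
    """Random hue as a dark background, with a light complement as foreground."""
    hue = rng.random()
    background = colorsys.hls_to_rgb(hue, 0.35, 0.6)
    foreground = colorsys.hls_to_rgb((hue + 0.5) % 1.0, 0.9, 0.6)
    return to_hex(background), to_hex(foreground)


_icon_cache = {}  # stands in for whatever persistent store is eventually used


def icon_for(title):
    """Return the stored icon description for a repository, creating it once."""
    if title not in _icon_cache:
        background, foreground = random_contrasting_pair()
        _icon_cache[title] = {
            "letter": icon_letter(title),
            "background": background,
            "foreground": foreground,
        }
    return _icon_cache[title]


icon = icon_for("The Dataverse Network")
```

Storing the result matters because the colors are random: without it, a repository's icon would change between page loads and could no longer act as a stable visual key.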

Right-hand pane ('Refine Search')

 * The tabs used previously have been subsumed into the sidebar filter under 'Source Database', allowing an arbitrary number of sources to be represented.
 * Allows the user to filter by source.
 * Makes the database hosting the dataset explicit.
 * Serves as a key for the icons used in the results list.
 * Also added to the current list of sidebar filters, 'Publication Status' allows the user to display only published (or unpublished) data sets (i.e., those bolded, or not, in the results list).
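
The two new sidebar filters amount to a simple refinement of the result list, sketched below. The `source` and `publication` keys are hypothetical names for whatever the underlying index records.

```python
def refine(records, sources=None, published=None):
    """Filter results by source database and/or publication status.

    sources:   collection of source names to keep, or None for all sources.
    published: True keeps data sets with a linked publication, False keeps
               those without one, None keeps both.
    """
    kept = []
    for record in records:
        if sources is not None and record.get("source") not in sources:
            continue
        if published is not None and bool(record.get("publication")) != published:
            continue
        kept.append(record)
    return kept


results = [
    {"id": 1, "source": "Dryad", "publication": "doi:10.0000/example"},
    {"id": 2, "source": "Dryad"},
    {"id": 3, "source": "FigShare", "publication": None},
    {"id": 4, "source": "KNB", "publication": "doi:10.0000/other"},
]

dryad_only = refine(results, sources={"Dryad"})
published_everywhere = refine(results, published=True)
```

Under these assumptions the 'Dryad only' radio button is just the special case `sources={"Dryad"}`, so one mechanism could serve both the search box and the sidebar.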