User:Chrisftaylor/Project reports/Dryad search OLD

=Background=
Dryad Search currently covers three sources (Dryad, KNB and TreeBase) and offers the ability to sort and sub-search the result sets from each source individually, against what metadata are available. Current issues with Dryad Search include:
 * 1) The tab-based display of results is limiting (many other data sources are available)
 * 2) Standard sorting and sub-searching is of little help when results sets run to thousands
The next iteration of Dryad Search should meet the following criteria:
 * 1) The interface and underlying service should not limit the number of data sources
 * 2) Related data should be suggested for the members of a search result set
 * 3) The interface should be as simple and intuitive as possible
 * 4) Metadata quality should be addressed and relayed

=The research data search landscape=
Several organizations offer data set (metadata) search; for example, DataONE, Research Data Australia, the UK Data Service and DataCite. The table below lists some of their main features and compares them with Google Search and Dryad Search.

Table One. The main features of a series of search services. In summary, each of the above features (with the exception of Dryad’s free-text filter for specific fields in the results set) appears more than once, indicating some degree of utility. ‘Relevance’ (for which all but Google use Solr/Lucene) and ‘Quality’ (of annotation -- that being the only available estimator) are fuzzy, context-dependent concepts, but offer the only practical handles for large results sets (inevitable as repositories grow in size and number); this is highlighted as issue #2 above.

An issue largely left unaddressed by the search engines considered is the invisibility of data sets with missing or erroneous metadata (for example, DataCite lists more than half of the data sets as having no value for ‘resourceType’). Where classifiers are used, the number of unclassified data sets should always be shown.

=Dryad Search: specifications=

Requirements (Phase One)

 * Access to more of the data sources available through DataCite, which collects data from re3data and Databib (an arrangement recently made more explicit), and through DataONE.
 * Sorting on ‘metadata completeness’, a rough proxy for quality showing the proportion of fixed fields that have content (schemata allowing user-defined fields are not directly amenable).
 * Sorting (where supported) on number of downloads (another rough proxy for quality).
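The ‘metadata completeness’ proxy described above can be sketched as follows. This is a minimal illustration, not the implementation: the field names are hypothetical stand-ins for whatever fixed fields a given schema defines.

```python
# Hypothetical sketch: score 'metadata completeness' as the proportion of
# fixed schema fields that carry content. The field list is illustrative.
FIXED_FIELDS = ["title", "creator", "subject", "description",
                "date", "resourceType", "rights", "spatialCoverage"]

def completeness(record):
    """Fraction of fixed fields that are present and non-empty."""
    filled = sum(1 for f in FIXED_FIELDS if str(record.get(f, "")).strip())
    return filled / len(FIXED_FIELDS)

def sort_by_completeness(records):
    """Order a results set best-annotated first."""
    return sorted(records, key=completeness, reverse=True)
```

As noted above, schemata allowing user-defined fields are not directly amenable to this scoring, since the denominator is no longer fixed.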

Back-end processes
The back-end architecture for Phase One is little different from what is already in place, bar that additional statistics are to be collected (metadata completeness and number of downloads).

User interface
The user interface also requires few initial changes. The tabs have been subsumed into the sidebar filter, allowing an arbitrary number of sources to be represented, and additional sorting options are offered based on metadata completeness and total downloads.

Requirements (Phase Two)

 * Categorical browsing of all indexed records by common metadata fields.
 * Storage of query history (i.e., URLs) in a user’s Dryad profile.

Back-end processes
The only major back-end change required for Phase Two is the addition of functionality to user profiles; namely to allow users to store their query history. As previously suggested, that history could simply consist of a dated list of the URLs for queries run while logged in (with the added benefit that if hyperlinked they will allow the user to re-run the query without complication). The proposed browsing of all indexed source data by category should not require the collection of additional metadata, or any significant processing. The browse facility is simply a re-implementation of the existing filters, but applied to a notional query that returns all data sets.
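The dated-list-of-URLs design above is simple enough to sketch directly. This is an illustrative outline only; the class and method names are assumptions, not the Dryad profile API.

```python
# Hypothetical sketch: a user's query history stored in their profile as a
# dated list of query URLs, each hyperlinkable so the query can be re-run.
from datetime import date

class QueryHistory:
    def __init__(self):
        self.entries = []  # list of (ISO date string, query URL)

    def record(self, url, when=None):
        """Append one query URL, dated (defaults to today)."""
        self.entries.append(((when or date.today()).isoformat(), url))

    def as_links(self):
        # Render each entry as a wiki-style external link labelled by date,
        # so clicking it re-runs the stored query.
        return ["[{0} {1}]".format(url, d) for d, url in self.entries]
```

Because only the URL is stored, re-running a query requires no extra processing: the link simply replays the original request.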

User interface
The image below shows a simple implementation of a search history. Next to the date are the terms used in the query: user terms in black, and system-supplied filter terms hyperlinked to the corresponding term in the browser also proposed for this phase. The direct links should simply launch a new frame/window and pass the stored URL into it. Note that this system does not return a fixed results set, as the underlying data will change; it does, however, offer users the opportunity to construct and re-run queries carefully tuned to their needs.

Also required is a simple browser -- a reimplementation of the filter(/browse) facet sidebar already implemented in Dryad Search, but rendered as a tree-style metadata category browser and presented in its initial state (for example, as when a user clicks straight through to 'Advanced Search'; i.e., initially returning all available data sets from all sources). It is critical that for each category the number of data sets with no metadata in that category is shown (i.e., the last item should always be ‘Unclassified’).
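The category counts for the browser, including the mandatory ‘Unclassified’ bucket, can be sketched as below. This is an assumption-laden illustration: record and field names are hypothetical, and the real counts would come from the search index rather than an in-memory pass.

```python
# Hypothetical sketch of the category browser's counts: group all data sets
# by one metadata field, and always report an 'Unclassified' bucket last
# for records carrying no value in that field.
from collections import Counter

def facet_counts(records, field):
    """Return [(value, count), ...] by descending count, 'Unclassified' last."""
    counts = Counter()
    unclassified = 0
    for r in records:
        value = str(r.get(field, "")).strip()
        if value:
            counts[value] += 1
        else:
            unclassified += 1  # missing or empty metadata
    ordered = sorted(counts.items(), key=lambda kv: -kv[1])
    ordered.append(("Unclassified", unclassified))
    return ordered
```

Surfacing the ‘Unclassified’ count directly addresses the invisibility problem noted in the landscape section above.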

Requirements (Phase Three)

 * Wholesale processing of all data sets’ metadata through HIVE or an equivalent to obtain controlled vocabulary tags (noting that typographical errors will remain an issue).
 * Construction/updating of a graph whose nodes are the set of all terms assigned in the HIVE-based analysis; each node listing all data sets assigned that tag.
 * Querying of the graph by data set(s) to find further, related data sets (i.e., those sharing one or more nodes with the query data set, ranked by number of shared nodes).
 * Finding related data by selecting HIVE-assigned terms from the results set in a Wordle.
 * Scheduling of pre-constructed queries (i.e., query URLs), with email alerting.
 * Support for user collections (i.e., stored, viewable, exportable results sets).
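The scheduled-query requirement above amounts to periodically replaying a stored query URL and alerting on anything new. A minimal sketch, assuming a `fetch` function (injected here as a stand-in for the real search call) and result records carrying an `id`:

```python
# Hypothetical sketch of scheduled-query alerting: re-run a stored query URL
# and surface only results not seen on the previous run. The scheduler and
# the email dispatch are out of scope; this shows just the diffing step.
def check_scheduled_query(url, fetch, seen_ids):
    """Return (new_results, updated_seen_ids) for one scheduled run."""
    results = fetch(url)  # e.g. a list of {'id': ..., 'title': ...} records
    new = [r for r in results if r["id"] not in seen_ids]
    return new, seen_ids | {r["id"] for r in results}
```

Only when `new_results` is non-empty would an alert email be generated, keeping the alerts low-noise.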

Back-end processes
Phase Three requires significant additions to the back-end functionality supporting Dryad Search. Arrow #3 in the diagram for the Phase One back-end implementation now has an additional component: as metadata on data sets are collected during (overnight) indexing, all metadata from each data set are sent in parallel to a HIVE-based service that assigns all the tags that it can (based on a fixed set of vocabularies in specific versions). These tags are then used to build a graph of all HIVE-assigned terms, each term linking to a list of all the data sets to which it has been assigned. That graph will then permit searches for related data sets, where a selected data set’s HIVE-assigned terms (ideally listed unobtrusively below the main search result, perhaps made individually clickable) are matched against all other data sets; the ‘hits’ being ordered by the number of tags they share with the query data set and returned to the user.
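The term graph and the shared-tag ranking described above can be sketched as follows. This is a minimal in-memory illustration under assumed data shapes, not the production design (which would live in the index or a graph store).

```python
# Hypothetical sketch of the HIVE-tag graph: each term node lists the data
# sets carrying that tag; related data sets are ranked by shared-tag count.
from collections import defaultdict, Counter

def build_term_graph(tags_by_dataset):
    """{dataset_id: set_of_terms} -> {term: set_of_dataset_ids}."""
    graph = defaultdict(set)
    for ds, terms in tags_by_dataset.items():
        for term in terms:
            graph[term].add(ds)
    return graph

def related(dataset_id, tags_by_dataset, graph):
    """Other data sets sharing terms with dataset_id, most-shared first."""
    shared = Counter()
    for term in tags_by_dataset.get(dataset_id, ()):
        for other in graph[term]:
            if other != dataset_id:
                shared[other] += 1
    return shared.most_common()
```

Ranking by raw shared-tag count is deliberately crude, in line with the recall-over-precision stance taken in the summary below; weighting by term rarity would be a natural refinement.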

User interface
The image below shows an example of a Wordle. In the scenario considered, these terms would be those assigned by HIVE to all the data sets in the query results set. Selecting terms from the Wordle (then clicking ‘Go’) would allow users to formulate tag-based queries they may not otherwise have conceived.

http://biology2.sciencecommunity.wikispaces.net/file/view/climate%20change%20wordle.JPG/354121198/800x349/climate%20change%20wordle.JPG
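The data behind such a Wordle, and the query built from a user's selection, can be sketched briefly. Both the record shape and the query syntax here are assumptions for illustration only.

```python
# Hypothetical sketch: tally HIVE-assigned terms across a results set to
# size the words in a word cloud, then turn a user's selected terms into
# a tag-based query string.
from collections import Counter

def term_frequencies(results):
    """results: iterable of records each carrying a 'tags' list."""
    return Counter(t for r in results for t in r.get("tags", []))

def tag_query(selected_terms):
    # Illustrative query string; the real search syntax is an assumption.
    return " AND ".join('tag:"{0}"'.format(t) for t in sorted(selected_terms))
```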

Summary
The above three-phase implementation plan ensures that Dryad Search matches the capabilities of equivalent services and introduces desirable features not found elsewhere, such as:
 * ‘Metadata completeness’: a rough measure, but without a more controlled system (e.g., specified formats and vocabularies) it is difficult to obtain a more accurate metadata-based estimate of quality.
 * ‘Relatedness’, which in this form is also a rough measure; but given the low quality of most metadata, recall must be preferred over precision (with tools for the user to refine the set).
 * User query management and results storage: not only a helpful, efficient feature for users and a way to raise visibility, but also a way to begin to profile our users and their behavior.