Multi-Site Statistics

Status: Atmire has completed their analysis and is writing the final report. Preliminary results below.

Work Package Description

Dryad pages need to accurately reflect numbers of views/downloads, even though the views/downloads may be performed on several different nodes. Determine the possible methods for implementing this, and analyze their tradeoffs.

Functional requirements:

  • Statistics displayed on data package/file pages are aggregated over all accesses, regardless of which Dryad node served them. (These do not need to be updated in real time; nightly updates are acceptable.)
  • We must know which server served each bitstream.
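As a rough illustration of these two requirements, the following sketch aggregates access events from several nodes in a nightly batch while still recording which node served each access. The event tuple layout, the node names, and the function name are assumptions made for illustration only; they are not part of DSpace or of Atmire's analysis.

```python
from collections import Counter, defaultdict


def aggregate_events(events):
    """Aggregate access events collected from all nodes.

    `events` is an iterable of (node, item_id, kind) tuples, where `kind`
    is "view" or "download". This layout is hypothetical.
    Returns (totals, by_node):
      totals[item_id][kind]          -> count across all nodes
      by_node[(item_id, node)][kind] -> count served by one node
    """
    totals = defaultdict(Counter)
    by_node = defaultdict(Counter)
    for node, item_id, kind in events:
        totals[item_id][kind] += 1           # the number shown on the page
        by_node[(item_id, node)][kind] += 1  # which server served the access
    return totals, by_node


# Example nightly batch with two hypothetical nodes.
events = [
    ("node-a", "pkg-1", "view"),
    ("node-b", "pkg-1", "view"),
    ("node-b", "pkg-1", "download"),
    ("node-a", "pkg-2", "download"),
]
totals, by_node = aggregate_events(events)
```

The page for pkg-1 would display the combined totals, while the per-node counts remain available for reporting on which server handled each bitstream.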

Options

Centralized Statistics Collection and Querying: In this scenario, a central Solr index is established that indexes all events from all DSpace nodes into one common Solr instance. This common instance resides on a separate server dedicated to hosting the Solr index. DSpace instances both update and query the common index.
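A minimal sketch of what each node might send to the central index, assuming Solr's JSON update format (a JSON array of documents posted to the core's update handler). The field names, in particular `served_by`, are assumptions for illustration; the actual statistics schema may differ, and the sketch only builds the request body without sending it.

```python
import json
from datetime import datetime, timezone


def build_event_doc(node, object_id, object_type, when=None):
    """Build one usage-event document tagged with the node that served it.

    `when` should be a timezone-aware datetime; it defaults to now (UTC).
    All field names here are hypothetical.
    """
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "id": object_id,
        "type": object_type,
        "time": when.strftime("%Y-%m-%dT%H:%M:%SZ"),  # Solr date format
        "served_by": node,  # records which server served the bitstream
    }


def build_update_body(docs):
    """Serialize documents as a JSON array for a Solr update request."""
    return json.dumps(list(docs))


doc = build_event_doc(
    "node-a", "1234", 0,
    when=datetime(2012, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
)
body = build_update_body([doc])
```

Tagging each event at the source keeps the second functional requirement satisfied even though all events end up in a single index.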

Master / Slave Topology: This scenario is similar to the one above, but treats the central Solr index as a master index for updates. The master is replicated to slave instances residing on each node, which are then used for statistical queries and analysis.

Sharding Topology: In this scenario, Solr shards are placed across two or more servers, providing horizontal partitioning. In a two-node topology, each node would act as a shard for the other: queries would include both nodes, but each node would hold only its own statistics events. The advantage of sharding is that it avoids the complexity of replicating event data across nodes, at the expense of not having the data replicated locally for querying, which may introduce latency.
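To make the two-node case concrete, the sketch below builds a query URL that uses Solr's `shards` request parameter so that a single query fans out to both nodes. The host names, the core name `statistics`, and the `id` field are assumptions; the function only constructs the URL and does not contact any server.

```python
from urllib.parse import urlencode


def sharded_stats_url(query_host, shard_hosts, item_id, core="statistics"):
    """Build a Solr select URL that queries every shard for one item.

    `query_host` receives the request and merges the results; each entry
    in `shard_hosts` is a "host:port" string. All names are hypothetical.
    """
    params = {
        "q": f"id:{item_id}",
        "rows": 0,     # only the total hit count is needed
        "wt": "json",
        # Comma-separated list of shard addresses, without the scheme.
        "shards": ",".join(f"{h}/solr/{core}" for h in shard_hosts),
    }
    return f"http://{query_host}/solr/{core}/select?{urlencode(params)}"


url = sharded_stats_url(
    "node-a.example.org:8983",
    ["node-a.example.org:8983", "node-b.example.org:8983"],
    "1234",
)
```

Because every page view that displays statistics triggers a cross-node query in this topology, the latency concern mentioned above applies to each such request.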

Hybrid Approach: The above approaches can be combined to varying degrees to build a highly optimized distributed search solution. By combining replication with division of the search index into shards, duplicate sets of statistical data can be maintained on all nodes, while sharding means that only the latest updates to the Solr index need to be transferred. The cost of maintaining the hybrid replication and sharding solution is, of course, higher than for the other approaches.

Preliminary Results

Atmire has analyzed the performance of the candidate approaches for distributing statistics and determined that the best approach is to host a Solr instance independent of the two Dryad servers and have each server aggregate its statistics there. This recommendation depends on the conclusion that the two instances will hold identical replicas of the items and that statistics will be aggregated across the two nodes into one dataset.