Digital Repository of Information and Data for Evolution (DRIADE)
A joint project of NESCent and the MRC
This wiki is used by the NESCent-MRC working group charged with establishing a repository for heterogeneous digital datasets in the field of evolutionary biology.
The focus will be on published datasets, with tight linkages to major evolutionary biology journals and domain-specific community databases.
Some guiding principles:
- Minimize the technical expertise and time required for data deposition and metadata generation
- Provide tools and incentives to researchers for quality metadata generation and dataset reuse
- Be sensitive to the intellectual property rights of researchers
- Ensure a self-sustaining economic model with a plan for long-term data stewardship
- Engage related efforts in allied fields (e.g., ecology, paleontology, genetics) and in the information science community
This wiki contains planning documents of various sorts, as well as the products of our own research into related efforts in other fields.
Apart from a few specialized databases, there has historically been little cyberinfrastructure for the preservation, discovery, sharing, or synthesis of most published data in evolutionary biology. Existing systems for the storage and retrieval of heterogeneous scientific data either place a high burden of metadata generation on the individual researcher or do not capture sufficient metadata to enable resource discovery and reuse. Data providers may also be burdened by a requirement to submit different subsets of a data package to one or more specialized databases. Finally, the field has no infrastructure for fine-grained, communally shared data access privileges that would allow different rights for individuals, collaborative groups, and the general public across multiple repositories and over the digital resource lifecycle.
We propose a Digital Repository of Information and Data for Evolution (DRIADE) that will be the primary home for published data in the field of evolutionary biology. Building on existing technologies and following the OAIS functional model, we will develop a number of software modules supporting digital resource lifecycle management, from data ingestion through curation to discovery and reuse. Computer-aided metadata generation and augmentation will assist the data provider in capturing metadata of sufficient richness and quality to enable advanced data discovery, reuse, and integration. Specialized modules will allow data submission to be coordinated with the manuscript review and publication process of participating journals, as well as with the submission process of external specialized databases, including NCBI (for biomolecular sequences), TreeBASE (for character matrices and phylogenetic trees), and Morphbank (for images). This will provide one-stop data submission for the user. A data curator will oversee data and metadata quality control, supported by a separate data curation software module that employs automatic techniques to evaluate metadata quality. An identity, authority, and data security module will be developed to implement fine-grained data access privileges based on global user identities. Resource discovery, sharing, and interoperability with external repositories will be enabled by implementing the OAI-PMH metadata harvesting standard, supplemented by custom web services. These services will be exposed to collaborating journals, specialized data repositories, third-party content aggregators, and the DRIADE web portal itself.
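As an illustration of what OAI-PMH harvesting involves for an external aggregator, the following is a minimal Python sketch, not a description of DRIADE's implementation. It builds a standard `ListRecords` request URL and parses Dublin Core titles out of a response. The endpoint URL, the record identifier, and the sample response are hypothetical placeholders; only the verb, the `oai_dc` metadata prefix, and the XML namespaces come from the OAI-PMH specification.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical endpoint: the proposal does not specify a URL.
BASE_URL = "https://repository.example.org/oai"

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, set_spec=None):
    """Build a ListRecords request URL for selective harvesting."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # harvest only records changed since this date
    if set_spec:
        params["set"] = set_spec  # restrict the harvest to one set
    return base_url + "?" + urlencode(params)


def parse_records(response_xml):
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(response_xml)
    records = []
    for rec in root.iter(f"{{{OAI_NS}}}record"):
        identifier = rec.findtext(f"{{{OAI_NS}}}header/{{{OAI_NS}}}identifier")
        title = rec.findtext(f".//{{{DC_NS}}}title")
        records.append((identifier, title))
    return records


# Hand-written sample response (illustrative only, not real repository data).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repository.example.org:dataset-1</identifier>
        <datestamp>2007-01-15</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example character matrix</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url(BASE_URL, from_date="2007-01-01"))
    print(parse_records(SAMPLE_RESPONSE))
```

A real harvester would also need to fetch the URL over HTTP and follow the `resumptionToken` element that OAI-PMH uses to page through large result sets; both are omitted here to keep the sketch self-contained.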
Extensive evaluation and user testing will be carried out throughout the design and implementation process. This will include conducting metadata generation studies and analyzing the quality of the resulting metadata; developing data use cases; and conducting information retrieval experiments and usability studies to evaluate the effectiveness and performance of the system. A working group of stakeholders will develop a management structure to ensure the long-term maintenance and financial sustainability of the repository.
The proposed work pioneers the application of digital data sharing to a 'small science' discipline. We anticipate that DRIADE will have a broad impact by making the data underlying hundreds of studies published annually in evolutionary biology available for discovery and repurposing, and by stanching the ongoing loss of data that could drive future evolutionary discoveries, with concomitant benefits to medicine, agriculture, conservation, and basic science.