Old:December 2006 Workshop Plans
December "Stakeholder" Meeting
- To inform the society and journal reps of our plans to date and get feedback on those plans
- To discuss how to gather requirements.
- Date: Dec 5, 2006
- Time: core meeting 9am-11am + extended discussions through lunch
- Place: NESCent
- invite letter - please comment! --Tjvision 16:50, 22 November 2006 (EST)
- Editors in attendance will be asked to communicate with their publisher about constraints / expectations regarding open access to data, to consider how we can develop use-cases (i.e., behaviors that we can use to convert into requirements), and to compile concerns/frequent issues that arise in the publication of supplemental data.
- Todd's slides
- Hilmar's slides
- Jane's slides --Janeg@ils.unc.edu 01:53, 8 December 2006 (EST) and handouts (last changed: --Hlapp 23:13, 6 December 2006 (EST))
- Metadata standards (by Ruth)
- Whiteboard table with prioritization votes
- NESCent-MRC DRIADE team, firstname.lastname@example.org
- Ahrash Bissell (Duke/OpenContext), email@example.com
- Harold Heatwole (editor, Integrative & Comparative Biology), firstname.lastname@example.org
- Mohammed Noor (NESCent working group leader on meta-analysis), email@example.com
- Bob Peet (former editor, Ecology), firstname.lastname@example.org
- Mark Rausher (editor, Evolution), email@example.com
- Michael Whitlock (editor, American Naturalist)
- Kathleen Smith (NESCent), firstname.lastname@example.org
- Marcy Uyenoyama (incoming editor, Molecular Biology & Evolution), email@example.com
- Don Waller (SSE), firstname.lastname@example.org
- 9:00 Goals of this meeting (Todd Vision)
- 9:15 Roundtable introductions
- 9:30 Requirements and open questions (Hilmar Lapp)
- 9:45 Issues regarding metadata (Jane Greenberg)
- 10:00 Roundtable discussion
- Expectations and desires of the journals, publishers and scientific societies
- What are the priorities?
- Ideas for the requirements gathering phase
- Suggestions for attendees at the spring stakeholders meeting
- 11:00-on (for those remaining)
- Further discussion of project plans and the two major upcoming meetings
Minutes were taken by Ruth and Jed and are posted as a separate page.
Jane's summary DRAFT: 2/20/07
Aims and Participants
NESCent sponsored a Stakeholders' Workshop in early December to present and discuss goals and objectives underlying a digital data repository for published data in the field of evolutionary biology. The goals of the workshop included (1) eliciting stakeholders' feedback and opinions about a data repository, (2) verifying support for a data repository initiative, and (3) identifying challenges and priorities. Stakeholders attending included representatives (primarily editors) from the major journals and societies in evolutionary biology. Among the journals represented were Integrative & Comparative Biology, Evolution, American Naturalist, and Molecular Biology & Evolution. Participating journal representatives are all scientists in their own right, and served a dual purpose, representing (1) the publisher community's commitment to recording, preserving, and disseminating scientific research, and (2) the researcher community's demand for open science, including greater access, sharing, and reuse of scientific data. Don Waller, President of the Society for the Study of Evolution (SSE), represented the society stakeholder community, and also provided insight into the publisher community's objectives, drawing from his extensive experience as an editor. Also attending the workshop were NESCent staff members Kathleen Smith, Director; Todd Vision, Director of Informatics; and Hilmar Lapp, Assistant Director of Informatics; UNC Metadata Research Center (MRC) staff member Jane Greenberg, Director; MRC graduate students; and Ahrash Bissell, representing the OpenContext repository.
Workshop Introduction
The Workshop began with participant introductions and three brief presentations:
- Todd Vision introduced NESCent and explained the purpose and basic aims of a data repository for evolutionary biology. The DRIADE (Digital Repository of Information and Data for Evolution) partnership was introduced as an initiative leading this effort. The Knowledge Network for Biocomplexity (KNB) data repository (http://knb.ecoinformatics.org/index.jsp) for ecological data and the OpenContext repository (http://www.opencontext.org/index.php) for archeological documentation were displayed and briefly discussed as contextual examples. The latter was discussed by Ahrash Bissell.
- Hilmar Lapp presented four key virtues of a data repository (data sharing, discovery, preservation, and synthesis). He emphasized the need to involve users and define both user and system requirements. Hilmar also identified a series of criteria that could guide repository development (e.g., system must be unambiguous, concise, complete, etc.).
- Jane Greenberg highlighted the significant role of metadata in a digital data repository, and presented a series of questions to guide the roundtable discussion.
The roundtable discussion was open and unstructured in order to elicit stakeholders' feedback and opinions. The discussion provided clear evidence of support for a digital data repository for evolutionary biology; highlighted key challenges; and considered priorities for moving forward. Discussion topics addressing these three areas (support, challenges, and priorities) are summarized below.
1. Support for a digital data repository
Stakeholders unanimously supported the idea of a digital data repository for evolutionary biology and the efforts underlying the DRIADE initiative. A data repository was viewed as a necessary vehicle for advancing science in the field. Participants framed the discussion by acknowledging that digital technology, including networked communication, is changing the way science is conducted. Journals are increasingly providing means for storing supplementary data--a provision that differs from when investigators were told to minimize data for publications. New incentives are emerging in the areas of open science and data sharing, as demonstrated by GenBank. In addition to these general themes, there was overall agreement on the following topics supporting the DRIADE initiative:
- A "central" data repository will support and foster new science. The repository should be supported by an "Evolution World" website.
- Key benefits of a data repository include data preservation, discovery, reuse, sharing, and potentially synthesis. Interdisciplinary questions can be addressed using the heterogeneous datasets made available via a data repository.
- Journals in the field have moral authority to provide access to data so that results can be replicated.
- Journals and professional societies provide the mechanism for generating researcher interest in contributing to a data repository and fostering (and advocating) cultural change.
- A data repository will help to verify the authenticity of data and aid in policing.
- A data repository can help researchers deposit their data not only in DRIADE, but also in GenBank, TreeBase, etc.
- The data repository design needs to be simple, with low barriers, to avoid problems inherent in KNB.
- The repository must support the wide range of scientists (funded and not funded) conducting evolutionary biology research. The repository should host data and other supporting documentation from scientists no longer living.
- The repository could be viewed as a safe personal place to store data for a researcher's current use and active collaborations.
2. Challenges
The discussion highlighted a series of challenges (listed here) that need to be addressed in order to build an effective and robust repository. In several cases, potential solutions were also suggested.
- Scope: Questions about the scope of a repository primarily focused on what, who, and when.
- What: What data objects will be deposited and represented? Will the repository be restricted to data supporting the published research, or data beyond the published data? What about source code for a tool that supports data analysis? Will the algorithm be shared? (Bioinformatics requires the code to be accessible.) One participant advocated for including doctoral theses.
- Who: Who will deposit data objects and create the representations? Opinions ranged from requiring "everyone publishing in the major journals" to submit, to making submission voluntary. (It was agreed that the repository should support the wide diversity of scientists conducting evolutionary biology research.)
- When: At what stage during the publication life-cycle should the data be contributed to the repository (e.g., when the manuscript is submitted for review, so that reviewers have access to the data, or within 6 months of publication)? Data deposition policies will need to integrate with individual journal archiving policies, unless journal policies are revised.
- Rights (data rights and copyright): What are authors' rights in terms of datasets? Journal, researcher, and funding agency rights can conflict. Do journals have the rights to published data collected with public funds? The repository could operate under a Creative Commons license, similar to OpenContext. What are the obligations of the researcher who uses another researcher's data set? Several participants suggested that authors should be able to decide whether to have their names included in research publications that used their original data. One participant stated that "if there is no rights management, this [DRIADE] will fall flat on its face."
- Representation (metadata):
- Standardization: How much data representation standardization will be required? Will the system support free text searching and annotation? It was agreed that a combination of both standardization and flexible representation will likely be implemented. One participant indicated that he didn't believe that there will ever be a standard that everyone will adhere to. It was suggested that a simple scheme like the Dublin Core can work, at a very general level.
- Metadata generation: Who will generate the metadata? How will metadata be created for unstructured data in published papers? KNB was discussed again as a model for what to do, and for where improvements could be made. DRIADE's relationship with published data will support automatic metadata creation of various data elements. When should metadata be created? A brief discussion noted that metadata will need to be generated during different phases of the data object's life-cycle (submission, re-use/modification, and so forth).
- Granularity: How much data needs to be described for effective use? Some data may not be part of the published paper, but will be valuable to others in re-using a data set, and authors may not realize this because they are intimate with their topic. Granularity will depend on the function being supported (e.g., basic resource discovery or synthesis).
- Additional representation challenges: How should data from experiments that resulted in negative results be represented?
- Security: Data security emerged as a pressing issue. How will the system protect against data piracy and unethical data farmers? One participant warned of potential security attacks ("God's design") that could penetrate the repository's security. A centralized repository may also address various security concerns because provenance is easier to verify and there is built-in policing.
- Quality control: How will quality control be maintained? Will there be a "data curator" who reviews the metadata quality? One participant observed that "messy data can be cleaned up, but messy data may not be used."
- Cultural change: How can stakeholders trigger cultural change and foster researcher interest? Suggestions included journal editorials introducing the concept of a shared data repository, as well as breakout groups and informational sessions at designated professional conferences.
- Human nature challenges: One participant suggested that mandated deposition may incite passive-aggressive behavior. As noted under support above, incentives need to be built in for providing good quality metadata and the process needs to be simple.
- Incentivizing (still notes to add)
- Sustainability (still notes to add)
3. Priorities and next steps
The last segment of the roundtable discussion focused on priorities and next steps. There was an obvious sense of immediacy and eagerness to move forward with a simple plan to begin to preserve data.
- Preservation has to be a priority. Not just for published works, but also for data that might be lost.
- During an earlier part of the discussion, Michael W. presented a model relating data issues to Maslow's hierarchy of life needs (model: preservation, access, and synthesis). He expressed that the most important goal at this time was "preservation." Without preservation we cannot consider "access" and "synthesis".
- Participants agreed on the need to provide access to a repository of some sort as soon as possible (the big truck scenario was presented).
- Several participants advocated setting a date on which mandatory data archiving would begin for authors publishing in the journals (i.e., those represented in the meeting). Michael W. proposed July 1, 2007.
- Don suggested a phased approach for deposition. There was some discussion of metadata first, then data later.
- Requirements need to be gathered. Todd informed participants of our next two workshops, and participants agreed these were good vehicles for gathering more requirements.
- Research efforts that can inform the development were identified. Among methods proposed were Delphi studies, focus group meetings, surveys, use case studies, and a metadata generation experiment.
As a final activity, participants voted on priorities. These are summarized in a consensus table.
(jg still wants to summarize here).