Old:December 2006 Workshop Plans

From Dryad wiki
Revision as of 06:04, 20 February 2007 by Jane (talk | contribs) (Summary)

December "Stakeholder" Meeting


  • To inform the society and journal reps of our plans to date and get feedback on those plans
  • To discuss how to gather requirements.


  • Date: Dec 5, 2006
  • Time: core meeting 9am-11am + extended discussions through lunch
  • Place: NESCent



  • NESCent-MRC DRIADE team, wg-digitaldata@nescent.org
  • Ahrash Bissell (Duke/OpenContext), ahrash.bissell@duke.edu
  • Harold Heatwole (editor, Integrative & Comparative Biology), harold_heatwole@ncsu.edu
  • Mohammed Noor (NESCent working group leader on meta-analysis), noor@duke.edu
  • Bob Peet (former editor, Ecology), peet@unc.edu
  • Mark Rausher (editor, Evolution), mrausher@duke.edu
  • Michael Whitlock (editor, American Naturalist)
  • Kathleen Smith (NESCent), kksmith@duke.edu
  • Marcy Uyenoyama (incoming editor, Molecular Biology & Evolution), marcy@duke.edu
  • Don Waller (SSE), dmwaller@wisc.edu


  • 9:00 Goals of this meeting (Todd Vision)
  • 9:15 Roundtable introductions
  • 9:30 Requirements and open questions (Hilmar Lapp)
  • 9:45 Issues regarding metadata (Jane Greenberg)
  • 10:00 Roundtable discussion
    • Expectations and desires of the journals, publishers and scientific societies
    • What are the priorities?
    • Ideas for the requirements gathering phase
    • Suggestions for attendees at the spring stakeholders meeting
  • 11:00-on (for those remaining)
    • Further discussion of project plans and the two major upcoming meetings


Minutes were taken by Ruth and Jed and are posted as a separate page.


Jane's summary DRAFT: 2/20/07

Aims and Participants

NESCent sponsored a Stakeholders' Workshop in early December to present and discuss the goals and objectives underlying a digital data repository for published data in the field of evolutionary biology. The goals of the workshop included (1) eliciting stakeholders' feedback and opinions about a data repository, (2) verifying support for a data repository initiative, and (3) identifying challenges and priorities. Stakeholders attending included representatives (primarily editors) from the major journals and societies in evolutionary biology. The journals represented included Integrative & Comparative Biology, Evolution, American Naturalist, and Molecular Biology & Evolution. The participating journal representatives are all scientists in their own right, and served a dual purpose, representing (1) the publisher community's commitment to recording, preserving, and disseminating scientific research, and (2) the researcher community's demand for open science, including greater access, sharing, and reuse of scientific data. Don Waller, President of the Society for the Study of Evolution (SSE), represented the society stakeholder community, and also provided insight into the publisher community's objectives, drawing from his extensive experience as an editor. Also attending the workshop were NESCent staff members Kathleen Smith, Director; Todd Vision, Director of Informatics; and Hilmar Lapp, Assistant Director of Informatics; UNC Metadata Research Center (MRC) staff member Jane Greenberg, Director; MRC graduate students; and Ahrash Bissell, representing the OpenContext repository.

Workshop Introduction

The workshop began with participant introductions and three brief presentations:

- Todd Vision introduced NESCent and explained the purpose and basic aims of a data repository for evolutionary biology. The DRIADE (Digital Repository of Information and Data for Evolution) partnership was introduced as an initiative leading this effort. The Knowledge Network for Biocomplexity (KNB) data repository (http://knb.ecoinformatics.org/index.jsp) for ecological data and the OpenContext repository (http://www.opencontext.org/index.php) for archeological documentation were displayed and briefly discussed as contextual examples. The latter was discussed by Ahrash Bissell.

- Hilmar Lapp presented four key virtues of a data repository (data sharing, discovery, preservation, and synthesis). He emphasized the need to involve users and to define both user and system requirements. Hilmar also identified a series of criteria that could guide repository development (e.g., the system must be unambiguous, concise, and complete).

- Jane Greenberg highlighted the significant role of metadata in a digital data repository, and presented a series of questions to guide the roundtable discussion.

Roundtable Discussion

The roundtable discussion was open and unstructured in order to elicit stakeholders' feedback and opinions. The discussion provided clear evidence of support for a digital data repository for evolutionary biology, highlighted key challenges, and considered priorities for moving forward. Discussion topics addressing these three areas (support, challenges, and priorities) are summarized below.

1. Support for a digital data repository

Stakeholders unanimously supported the idea of a digital data repository for evolutionary biology and the efforts underlying the DRIADE initiative. A data repository was viewed as a necessary vehicle for advancing science in the field. Participants framed the discussion by acknowledging that digital technology, including networked communication, is changing the way science is conducted. Journals are increasingly providing means for storing supplementary data, a change from the time when investigators were told to minimize the data included in publications. New incentives are emerging in the areas of open science and data sharing, as demonstrated by GenBank. In addition to these general themes, there was overall agreement on the following topics supporting the DRIADE initiative:

- A "central" data repository will support and foster new science. The repository should be supported by an "Evolution World" website.

- Key benefits of a data repository include data preservation, discovery, reuse, sharing, and potentially synthesis. Interdisciplinary questions can be addressed using the heterogeneous datasets made available via a data repository.

- Journals in the field have the moral authority to provide access to data so that results can be replicated.

- Journals and professional societies provide the mechanism for generating researcher interest in contributing to a data repository and for fostering (and advocating) cultural change.

- A data repository will help to verify the authenticity of data and aid in policing.

- A data repository can help researchers deposit their data not only in DRIADE, but also in GenBank, TreeBase, etc.

- The data repository design needs to be simple, with low barriers, to avoid problems inherent in KNB.

- The repository must support the wide range of scientists (funded and not funded) conducting evolutionary biology research. The repository should also host data and other supporting documentation from scientists who are no longer living.

- The repository could be viewed as a safe personal place to store data for a researcher's current use and active collaborations.

2. Challenges

The discussion highlighted a series of challenges (listed here) that need to be addressed in order to build an effective and robust repository. In several cases, potential solutions were also suggested.

- Scope: Questions about the scope of a repository primarily focused on what, who, and when.

  • What: What data objects will be deposited and represented? Will the repository be restricted to data supporting the published research, or will it include data beyond the published data? What about source code for a tool that supports data analysis? Will the algorithm be shared? (Bioinformatics requires the code to be accessible.) One participant advocated for including doctoral theses.
  • Who: Who will deposit data objects and create the representations? Opinions ranged from requiring "everyone publishing in the major journals" to submit, to making submission voluntary. (It was agreed that the repository should support the wide diversity of scientists conducting evolutionary biology research.)
  • When: At what stage during the publication life-cycle should the data be contributed to the repository (e.g., when the manuscript is submitted for review, so that reviewers have access to the data, or within 6 months of publication)? Data deposition policies will need to integrate with individual journal archiving policies, unless journal policies are revised.

- Rights (data rights and copyright): What are authors' rights in terms of datasets? Journal, researcher, and funding agency rights can conflict. Do journals have the rights to published data collected with public funds? The repository could operate under a Creative Commons license, similar to OpenContext. What are the obligations of a researcher who uses another researcher's data set? Several participants suggested that authors should be able to decide whether to have their names included in research publications that used their original data. One participant stated that "if there is no rights management, this [DRIADE] will fall flat on its face."

- Representation (metadata):

  • Standardization: How much data representation standardization will be required? Will the system support free text searching and annotation? It was agreed that a combination of standardization and flexible representation will likely be implemented. One participant indicated that he didn't believe that there will ever be a standard that everyone will adhere to. It was suggested that a simple scheme like the Dublin Core can work, at a very general level.
  • Metadata generation: Who will generate the metadata? How will metadata be created for unstructured data in published papers? KNB was discussed again as a model for what to do, and for where improvements could be made. DRIADE's relationship with published data will support automatic metadata creation of various data elements. When should metadata be created? A brief discussion noted that metadata will need to be generated during different phases of the data object's life-cycle (submission, re-use/modification, and so forth).
  • Granularity: How much data needs to be described for effective use? Some data may not be part of the published paper, but will be valuable to others in re-using a data set, and authors may not realize this because they are intimate with their topic. Granularity will depend on the function being supported (e.g., basic resource discovery or synthesis).
  • Additional representation challenges: How should data from experiments that resulted in negative results be represented?
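
As a rough illustration of the "simple scheme" idea raised under Standardization, the sketch below serializes a handful of Dublin Core elements for one data set. It is only a sketch under stated assumptions: the wrapper element name `record`, the helper `build_dc_record`, and all field values are hypothetical and were not discussed at the meeting. Only the Dublin Core element names and namespace come from the published standard.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_dc_record(fields):
    """Serialize a dict of Dublin Core element names -> list of values.

    The <record> wrapper is a hypothetical container chosen for this
    sketch; it is not part of Dublin Core itself.
    """
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("record")
    for name, values in fields.items():
        for value in values:
            elem = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            elem.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical example values, for illustration only.
example = {
    "title": ["Hypothetical data set supporting a published article"],
    "creator": ["Example, A."],
    "date": ["2007-01-01"],
    "type": ["Dataset"],
    "subject": ["evolutionary biology"],
}

record_xml = build_dc_record(example)
print(record_xml)
```

A record this general supports basic resource discovery. The finer granularity needed for synthesis would presumably require a richer, discipline-specific scheme layered on top.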

- Security: Data security emerged as a pressing issue. How will the system protect against data piracy and unethical data farmers? One participant warned of potential security attacks ("God's design") that could penetrate the repository's security. A centralized repository may also address various security concerns because provenance is easier to verify and there is built-in policing.

- Quality control: How will quality control be maintained? Will there be a "data curator" who reviews the metadata quality? One participant observed that "messy data can be cleaned up, but messy data may not be used."

- Cultural change: How can stakeholders trigger cultural change and foster researcher interest? Suggestions included journal editorials introducing the concept of a shared data repository, as well as breakout groups and informational sessions at designated professional conferences.

- Human nature challenges: One participant suggested that mandated deposition may incite passive-aggressive behavior. As noted under support above, incentives need to be built in for providing good quality metadata, and the process needs to be simple.

- Incentivizing (still notes to add)

- Sustainability (still notes to add)

3. Priorities and next steps

The last segment of the roundtable discussion focused on priorities and next steps. There was an obvious sense of immediacy and eagerness to move forward with a simple plan to begin to preserve data.

- Preservation has to be a priority, not just for published works, but also for data that might be lost.

- During an earlier part of the discussion, Michael W. presented a model relating data issues to Maslow's hierarchy of needs (model: preservation, access, and synthesis). He expressed that the most important goal at this time was "preservation." Without preservation we cannot consider "access" and "synthesis".

- Participants agreed on the need to provide access to a repository of some sort as soon as possible (the big truck scenario was presented).

- Several participants advocated setting a date on which mandatory data archiving would begin for authors publishing in the journals (i.e., those represented in the meeting). Michael W. proposed July 1, 2007.

- Don suggested a phased approach for deposition. There was some discussion of metadata first, then data later.

- Requirements need to be gathered. Todd informed participants of our next two workshops, and participants agreed these were good vehicles for gathering more requirements.

- Research efforts that can inform the development were identified. Among the methods proposed were Delphi studies, focus group meetings, surveys, use case studies, and a metadata generation experiment.

As a final activity, participants voted on priorities. These are summarized in a consensus table.

(jg still wants to summarize here).

March "Consultant" Meeting

The workshop will take place in the week of March 5-10, 2007.

Invite letter - Draft, please comment! --Tjvision 21:49, 17 December 2006 (EST)

Invitees, including roles, and potential alternates

  • Ahrash Bissell, OpenContext, how raw should the data be? (alternate: Eric/Sandy Kanza)
  • Margret Branchofsky, Dspace, data federation (alternate: McKenzie Smith)
  • Joe Bush, Taxonomy Strategies, digital lifecycle management
  • Adam Goldstein, Darwin Digital Library, what metadata is required?
  • John Graybeal, Marine Metadata Initiative, metadata generation by scientists
  • Jane Greenberg, Dublin Core, metadata requirements
  • Chris Greer, NSF, sustainable funding
  • Kevin Gamiel, RENCI, data federation & grid storage
  • Margaret Hedstrom, U Michigan, trust level & digital lifecycle management
  • Bryan Heidorn, UIUC, use of the grid for storage
  • Dianne Hillman, Cornell, metadata generation by scientists
  • Matt Jones, SEEK, how raw should the data be, user interface requirements & metadata generation
  • Paul Jones, iBiblio, sustainable funding
  • Liz Liddy, Center for Natural Language Processing, School of Information Studies, Syracuse University, metadata generation
  • Josh Madin, SEEK, what metadata is required?
  • Michael Nelson, OAI-PMH, enabling 3rd party harvesting (alternate: Carl Lagoze)
  • Mohammed Noor, NESCent WG, integration of submission with journals, how raw should the data be? (alternates: Maria Servedio, Emilia Martins)
  • Sandy Payette, Fedora, interface w/ journals (alternate: Carl Lagoze?)
  • Bob Peet, Ecology Society, integration of submission with journals
  • Dav Robertson, NIEHS, repository trust level
  • Val Tannen, Penn, data integration (not sure how important his presence would be)
  • Herbert Van de Sompel, LANL, enabling 3rd party harvesting
  • Mary Vardigan, DDI/ICPSR, administration & sustainable funding, what metadata is required?
  • John Willbanks, GBIF & Science Commons, intellectual property, and data federation (alternates: Stan Blum, Don Hobern)
  • someone from NASA, incentivizing data sharing

Not currently on invitee list, but probably should be

  • someone from CIESIN?
  • Bruce Bauer (World Data Center for Paleoclimatology)
  • Tom Hammond (Conservation Commons)
  • Emilia Martins (EthBase)
  • David Schloen (OCHRE)
  • Micah Altman (Virtual Data Center) Micah_Altman@harvard.edu

Potential floaters

  • C. Lynch

Metadata Class, Mock Workshop


The Four Virtues that we strive toward are the Sharing, Reuse, Preservation, and Synthesis of published evolutionary data. Decisions have to be made on how to promote these Virtues, and to what degree.

Questions for the participants:

  • Breakout 1
    • Raw data in repositories or processed data only? Spreadsheet data? (Bissell, M. Jones, others needed)
    • How can depositors be incentivized? (someone from NASA, others needed)
  • Breakout 2
    • How would the system be administered and sustainably funded? (Greer, P. Jones, Vardigan)
    • What intellectual property policies need to be put into place? (Willbanks)
  • Breakout 3
    • What is the role for data federation technology (central vs distributed repository)? (Branchofsky, Gamiel, Willbanks)
    • What is the role for bona fide data integration technology? (Heidorn, Gamiel, Tannen)
    • What is the role of distributed/grid storage? (Gamiel, Heidorn)
  • Breakout 4
    • What level of trust is necessary for the repository, e.g., persistence of data, protection of data from tampering, quality of metadata? (Hedstrom, Robertson)
    • What metadata is required and how to generate it (Dublin Core, DDI-lite, EML, standards imposed by specialized repositories)? (Bissell, Goldstein, Greenberg, Madin, Vardigan)
  • Breakout 5
    • Do we need to plan for metadata lifecycle management, and to what extent? (Bush, Hedstrom)
    • Should the system be capable of metadata generation, and if so to what extent, with how much human review? (Greenberg, Hillman, Liddy)
  • Breakout 6
    • How to synchronize ingestion with journal publication and 3rd-party database deposition? (Noor, Peet, others needed)
    • How to enable harvesting of data by 3rd-parties (e.g. OAI-PMH)? (Nelson, Van de Sompel)
    • What should be the functionality of the interface to the centralized registry? (Bissell, M. Jones)

Alternative Structure

The alternative draft structure proposed below is based on the original break-out groups and questions above, the knowledge accumulated since the original plan was written, and the discussion between Todd, Jane, and Hilmar on Feb 7, and individual thoughts.

This alternative structure is grouped around 5 themes. Each theme is meant to represent a major challenge that the phase II repository will face, inevitably due to technological or scientific advance, or because we must overcome the challenge to deliver on our mission, or both. These major challenges are either non-technical (e.g., cultural, financial, legal, etc), or if they are technical, they present an unsolved problem and we have little or no core competency in the required area(s).

The invitees would include DRIADE stakeholders, i.e., there would be no 3rd workshop, at least not as originally planned.

The overarching theme of the workshop is "Digital data preservation, sharing, and discovery: Challenges for Small Science Communities in the Digital Era." All themes and questions below are posed in the specific context of small science communities.

The questions associated with each theme would be dealt with by a break-out group. Each break-out group has a moderator, who ensures that the discussion does not stray from the theme, functions as a scribe, and writes a draft summary report of the break-out group's discussion and recommendations. We should consider taping all break-out groups. The break-out groups at the II Workshop were also taped.

  1. Theme: Sustainability
    • Questions:
      1. What models exist for long-term financial sustainability of scientific data repositories? How successful are they? How applicable are they to a small science community? What can we learn from past sustainability break-downs (e.g., Swissprot, PDB)?
      2. What models exist for maximizing compliance among scientists? How can depositors be incentivized? How can use of the data in the repository be maximized among scientists?
      3. What is the relative importance of ease and speed of deposition, discovery, and retrieval to each other with respect to maximizing usability and community acceptance? How big is the risk of "failure due to success"?
    • Desired outcome: A summary of the challenges, and a recommendation of approaches, to ensuring long-term financial sustainability, compliance of prospective depositors, and community embrace.
  2. Theme: Intellectual property and provenance requirements
    • Questions:
      1. Which data licensing schemes are needed to meet the needs of the researchers in a small science community? What (if anything) is missing among the range of licenses defined by CC? Do we need MTA-style licenses under which researchers can safely share data in the repository that is otherwise not publicly accessible?
      2. In which way does the publisher's copyright policy need to be taken into account? Do we preemptively need to mandate that depositors request author addendums, such as the SC addendum? Can we be held liable by publishers?
      3. At which detail and depth does provenance need to be recorded and provided? Are data items "co-owned" by all co-authors, only the corresponding or the lead or the senior author, or does each data item need a separate owner assignment? If the data is from a synthetic study, is the data depositable, and if so, what depth of citation should be required: for example, citing the source of each data point, citing sources for each data item, citing sources for the entire study, or not citing any original sources in the deposition?
      4. To what extent does exposure of metadata records need to be coupled to the availability of the data, or need to be under the control of the data owner?
    • Desired outcome: A summary of the challenges, and a recommendation of approaches, to ensuring that the repository is free from copyright liabilities, that the requirements of researchers in a small science community to retain control over their data are met, and that data are appropriately attributed.
  3. Theme: Distribution and replication: possibilities and liabilities
    • Questions:
      1. To what extent is using federation reconcilable with the core priority of data preservation? If we delegate storage of certain data types to specialized repositories, to what extent will these repositories, and we ourselves, become liable to the same sustainability demands?
      2. What models exist for protecting the integrity of data through replicating repositories and data, such as LOCKSS? How could this be applied to a digital data repository in a small science discipline?
      3. To what extent do distributed storage technologies (such as SRB, grid storage) need to be part of the initial architecture?
    • Desired outcome: ...
  4. Theme: Lifecycle management requirements
    • Questions:
      1. How important is it to allow scientists to change (edit, add to, remove from) their data once deposited? What information about previous data revisions does the repository need to retain, and for how long? Does on-line access to previous revisions need to be continuously provided?
      2. If a data format becomes unsupported, whose responsibility should it be to up-convert the data? Does up-converted data constitute a new revision, and who should be responsible for validating the up-conversion result?
      3. How important is it to allow depositors to change the metadata of their data once deposited? Should metadata be versioned the same as the data, tied together or separate? What changes in the metadata might constitute a change of the data (e.g., changing the unit of a data column)?
      4. To what extent should augmented metadata and data format require the depositor's approval? Should metadata or data changes by curators be treated the same or different than changes by the data owner? Do curators' changes and identities need to be recorded and attributed?
    • Desired outcome: ...
  5. Theme: Semantic Web and Web 2.0
    • Questions:
      1. How can semantic harvesting from DRIADE and content aggregation by 3rd-parties be maximized? How can discovery through web and semantic search engines be maximized?
      2. What kinds of service-oriented interfaces are useful, and which aren't? Will an OAI-PMH gateway suffice? Should data depositions, or data changes, be broadcast through feeds?
      3. How important, or dangerous, is allowing semantic tagging by users?
      4. What is the potential for a social networking feature within the repository, such as networks of "collaborators"?
    • Desired outcome: ...
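
To make the OAI-PMH gateway question under the Semantic Web and Web 2.0 theme more concrete, here is a minimal sketch of how a 3rd-party harvester might build a `ListRecords` request. The base URL, the set name, and the helper `list_records_url` are hypothetical. The `verb`, `metadataPrefix`, `from`, and `set` arguments are taken from the OAI-PMH 2.0 protocol, and `oai_dc` is the metadata format that every OAI-PMH repository is required to support.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL for selective harvesting.

    from_date is a UTC datestamp such as "2007-03-01"; set_spec names an
    optional repository-defined set. Both are optional in the protocol.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Hypothetical gateway address; no such endpoint is defined in these plans.
BASE_URL = "https://repository.example.org/oai"

url = list_records_url(BASE_URL, from_date="2007-03-01")
print(url)
```

Harvesting by datestamp lets an aggregator pick up only new or changed depositions, which overlaps with the question of whether changes should also be broadcast through feeds.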

Provisional agenda

Day 1:

  • Introductions and presentation of objectives
  • Refine, as a group, tasks for the breakout sessions.
  • Three concurrent breakout sessions over lunch and into early afternoon, with short chalktalks relevant to each topic followed by focused discussion on the breakout tasks.
  • Break
  • Late afternoon breakout group summaries

Day 2:

  • Three concurrent morning breakout sessions, again with short chalktalks relevant to each topic followed by focused discussion on the breakout tasks.
  • Lunch
  • 1-2 hour large group discussion
  • Writing of recommendations

May "Planning" Meeting


  •  ? NCBI interface w/ specialized dbs
  • Bill Piel, Treebase, interface w/ specialized dbs (possibly 2nd meeting instead)
  • Greg Riccardo, Morphbank, interface w/ specialized dbs (possibly 2nd meeting instead)