Old:December 2006 Workshop Plans

From Dryad wiki
Revision as of 06:03, 20 February 2007 by Jane (talk | contribs) (Minutes)

December "Stakeholder" Meeting


  • To inform the society and journal reps of our plans to date and get feedback on those plans
  • To discuss how to gather requirements.


  • Date: Dec 5, 2006
  • Time: core meeting 9am-11am + extended discussions through lunch
  • Place: NESCent



  • NESCent-MRC DRIADE team, wg-digitaldata@nescent.org
  • Ahrash Bissell (Duke/OpenContext), ahrash.bissell@duke.edu
  • Harold Heatwole (editor, Integrative & Comparative Biology), harold_heatwole@ncsu.edu
  • Mohammed Noor (NESCent working group leader on meta-analysis), noor@duke.edu
  • Bob Peet (former editor, Ecology), peet@unc.edu
  • Mark Rausher (editor, Evolution), mrausher@duke.edu
  • Michael Whitlock (editor, American Naturalist)
  • Kathleen Smith (NESCent), kksmith@duke.edu
  • Marcy Uyenoyama (incoming editor, Molecular Biology & Evolution), marcy@duke.edu
  • Don Waller (SSE), dmwaller@wisc.edu


  • 9:00 Goals of this meeting (Todd Vision)
  • 9:15 Roundtable introductions
  • 9:30 Requirements and open questions (Hilmar Lapp)
  • 9:45 Issues regarding metadata (Jane Greenberg)
  • 10:00 Roundtable discussion
    • Expectations and desires of the journals, publishers and scientific societies
    • What are the priorities?
    • Ideas for the requirements gathering phase
    • Suggestions for attendees at the spring stakeholders meeting
  • 11:00-on (for those remaining)
    • Further discussion of project plans and the two major upcoming meetings


Minutes were taken by Ruth and Jed and are posted as a separate page.


Jane's summary draft: 2/20/07


March "Consultant" Meeting

The workshop will take place in the week of March 5-10, 2007.

Invite letter - Draft, please comment! --Tjvision 21:49, 17 December 2006 (EST)

Invitees, including roles, and potential alternates

  • Ahrash Bissell, OpenContext, how raw should the data be? (alternate: Eric/Sandy Kanza)
  • Margret Branchofsky, Dspace, data federation (alternate: McKenzie Smith)
  • Joe Bush, Taxonomy Strategies, digital lifecycle management
  • Adam Goldstein, Darwin Digital Library, what metadata is required?
  • John Graybeal, Marine Metadata Initiative, metadata generation by scientists
  • Jane Greenberg, Dublin Core, metadata requirements
  • Chris Greer, NSF, sustainable funding
  • Kevin Gamiel, RENCI, data federation & grid storage
  • Margaret Hedstrom, U Michigan, trust level & digital lifecycle management
  • Bryan Heidorn, UIUC, use of the grid for storage
  • Dianne Hillman, Cornell, metadata generation by scientists
  • Matt Jones, SEEK, how raw should the data be, user interface requirements & metadata generation
  • Paul Jones, iBiblio, sustainable funding
  • Liz Liddy, Center for Natural Language Processing, School of Information Studies, Syracuse University, metadata generation
  • Josh Madin, SEEK, what metadata is required
  • Michael Nelson, OAI-PMH, enabling 3rd party harvesting (alternate: Carl Logoze)
  • Mohammed Noor, NESCent WG, integration of submission with journals, how raw should the data be? (alternates: Maria Servedio, Emilia Martins)
  • Sandy Payette, Fedora, interface w/ journals (alternate: Carl Logoze?)
  • Bob Peet, Ecology Society, integration of submission with journals
  • Dav Robertson, NIEHS, repository trust level
  • Val Tannen, Penn, data integration (not sure how important his presence would be)
  • Herbert Van de Sompel, LANL, enabling 3rd party harvesting
  • Mary Vardigan, DDI/ICPSR, administration & sustainable funding, what metadata is required
  • John Willbanks, GBIF & Science Commons, intellectual property, and data federation (alternates: Stan Blum, Don Hobern)
  • someone from NASA, incentivizing data sharing

Not currently on invitee list, but probably should be

  • someone from CIESIN?
  • Bruce Bauer (World Data Center for Paleoclimatology)
  • Tom Hammond (Conservation Commons)
  • Emilia Martins (EthBase)
  • David Schloen (OCHRE)
  • Micah Altman (Virtual Data Center) Micah_Altman@harvard.edu

Potential floaters

  • C. Lynch

Metadata Class, Mock Workshop


The Four Virtues that we strive toward are the Sharing, Reuse, Preservation, and Synthesis of published evolutionary data. Decisions have to be made on how to promote these Virtues, and to what degree.

Questions for the participants:

  • Breakout 1
    • Raw data in repositories or processed data only? Spreadsheet data? (Bissell, M. Jones, others needed)
    • How can depositors be incentivized? (someone from NASA, others needed)
  • Breakout 2
    • How would the system be administered and sustainably funded? (Greer, P. Jones, Vardigan)
    • What intellectual property policies need to be put into place? (Willbanks)
  • Breakout 3
    • What is the role for data federation technology (central vs distributed repository)? (Branchofsky, Gamiel, Willbanks)
    • What is the role for bona fide data integration technology? (Heidorn, Gamiel, Tannen)
    • What is the role of distributed/grid storage? (Gamiel, Heidorn)
  • Breakout 4
    • What level of trust is necessary for the repository, e.g., persistence of data, protection of data from tampering, quality of metadata? (Hedstrom, Robertson)
    • What metadata is required and how to generate it (Dublin Core, DDI-lite, EML, standards imposed by specialized repositories)? (Bissell, Goldstein, Greenberg, Madin, Vardigan)
  • Breakout 5
    • Do we need to plan for metadata lifecycle management, and to what extent? (Bush, Hedstrom)
    • Should the system be capable of metadata generation, and if so to what extent, with how much human review? (Greenberg, Hillman, Liddy)
  • Breakout 6
    • How to synchronize ingestion with journal publication and 3rd-party database deposition? (Noor, Peet, others needed)
    • How to enable harvesting of data by 3rd-parties (e.g. OAI-PMH)? (Nelson, Van de Sompel)
    • What should be the functionality of the interface to the centralized registry? (Bissell, M. Jones)

Alternative Structure

The alternative draft structure proposed below is based on the original break-out groups and questions above, the knowledge accumulated since the original plan was written, the discussion among Todd, Jane, and Hilmar on Feb 7, and individual thoughts.

This alternative structure is grouped around 5 themes. Each theme is meant to represent a major challenge that the phase II repository will face, inevitably due to technological or scientific advance, or because we must overcome the challenge to deliver on our mission, or both. These major challenges are either non-technical (e.g., cultural, financial, legal, etc), or if they are technical, they present an unsolved problem and we have little or no core competency in the required area(s).

The invitees would include DRIADE stakeholders, i.e., there would be no 3rd workshop, at least not as originally planned.

The overarching theme of the workshop is "Digital data preservation, sharing, and discovery: Challenges for Small Science Communities in the Digital Era." All themes and questions below are posed in the specific context of small science communities.

The questions associated with each theme would be dealt with by a break-out group. Each break-out group has a moderator, who ensures that the discussion does not stray from the theme, functions as a scribe, and writes a draft summary report of the break-out group's discussion and recommendations. We should consider taping all break-out groups. The break-out groups at the II Workshop were also taped.

  1. Theme: Sustainability
    • Questions:
      1. What models exist for long-term financial sustainability of scientific data repositories? How successful are they? How applicable are they to a small science community? What can we learn from past sustainability break-downs (e.g., Swissprot, PDB)?
      2. What models exist for maximizing compliance among scientists? How can depositors be incentivized? How can use of the data in the repository be maximized among scientists?
      3. What is the relative importance of ease and speed of deposition, discovery, and retrieval to each other with respect to maximizing usability and community acceptance? How big is the risk of "failure due to success"?
    • Desired outcome: A summary of the challenges, and a recommendation of approaches, to ensuring long-term financial sustainability, compliance of prospective depositors, and community embrace.
  2. Theme: Intellectual property and provenance requirements
    • Questions:
      1. Which data licensing schemes are needed to meet the needs of the researchers in a small science community? What (if anything) is missing among the range of licenses defined by CC? Do we need MTA-style licenses under which researchers can safely share data in the repository that is otherwise not publicly accessible?
      2. In which way does the publisher's copyright policy need to be taken into account? Do we preemptively need to mandate that depositors request author addendums, such as the SC addendum? Can we be held liable by publishers?
      3. At what level of detail and depth does provenance need to be recorded and provided? Are data items "co-owned" by all co-authors, only by the corresponding, lead, or senior author, or does each data item need a separate owner assignment? If the data is from a synthetic study, is the data depositable, and if so, what depth of citation should be required: citing the source of each data point, citing sources for each data item, citing sources for the entire study, or not citing any original sources in the deposition?
      4. To what extent does exposure of metadata records need to be coupled to the availability of the data, or need to be under the control of the data owner?
    • Desired outcome: A summary of the challenges, and a recommendation of approaches, to ensuring that the repository is free from copyright liabilities, that the requirements of researchers in a small science community to retain control over their data are met, and that data are appropriately attributed.
  3. Theme: Distribution and replication: possibilities and liabilities
    • Questions:
      1. To what extent is using federation reconcilable with the core priority of data preservation? If we delegate storage of certain data types to specialized repositories, to what extent will these repositories, and we ourselves, become liable to the same sustainability demands?
      2. What models exist for protecting the integrity of data through replicating repositories and data, such as LOCKSS? How could this be applied to a digital data repository in a small science discipline?
      3. To what extent do distributed storage technologies (such as SRB, grid storage) need to be part of the initial architecture?
    • Desired outcome: ...
  4. Theme: Lifecycle management requirements
    • Questions:
      1. How important is it to allow scientists to change (edit, add to, remove from) their data once deposited? What information about previous data revisions does the repository need to retain, and for how long? Does on-line access to previous revisions need to be continuously provided?
      2. If a data format becomes unsupported, whose responsibility should it be to up-convert the data? Does up-converted data constitute a new revision, and who should be responsible for validating the up-conversion result?
      3. How important is it to allow depositors to change the metadata of their data once deposited? Should metadata be versioned the same as the data, tied together or separate? What changes in the metadata might constitute a change of the data (e.g., changing the unit of a data column)?
      4. To what extent should augmented metadata and data format require the depositor's approval? Should metadata or data changes by curators be treated the same or different than changes by the data owner? Do curators' changes and identities need to be recorded and attributed?
    • Desired outcome: ...
  5. Theme: Semantic Web and Web 2.0
    • Questions:
      1. How can semantic harvesting from DRIADE and content aggregation by 3rd-parties be maximized? How can discovery through web and semantic search engines be maximized?
      2. What kinds of service-oriented interfaces are useful, and which aren't? Will an OAI-PMH gateway suffice? Should data depositions, or data changes, be broadcast through feeds?
      3. How important, or dangerous, is allowing semantic tagging by users?
      4. What is the potential for a social networking feature within the repository, such as networks of "collaborators"?
    • Desired outcome: ...

Provisional agenda

Day 1:

  • Introductions and presentation of objectives
  • Refine, as a group, tasks for the breakout sessions.
  • Three concurrent breakout sessions over lunch and into early afternoon, with short chalktalks relevant to each topic followed by focused discussion on the breakout tasks.
  • Break
  • Late afternoon breakout group summaries

Day 2:

  • Three concurrent morning breakout sessions, again with short chalktalks relevant to each topic followed by focused discussion on the breakout tasks.
  • Lunch
  • 1-2 hour large group discussion
  • Writing of recommendations

May "Planning" Meeting


  •  ? NCBI interface w/ specialized dbs
  • Bill Piel, Treebase, interface w/ specialized dbs (possibly 2nd meeting instead)
  • Greg Riccardo, Morphbank, interface w/ specialized dbs (possibly 2nd meeting instead)