Workshop May 2007 Ideas

From Dryad wiki

May "Consultant" Meeting

Digital data preservation, sharing, and discovery: Challenges for Small Science Communities in the Digital Era

Introduction

This workshop aims to address major challenges for the Digital Repository of Information and Data in Evolution (DRIADE). DRIADE seeks to Uniqueness of small science.

DRIADE will link to selected big science initiatives (e.g., GenBank, ??), and has the potential to connect to additional larger science initiatives on a greater scale. [Phase I functionalities have been identified in a draft specification and are being implemented at this time. A synopsis of these functionalities is found at: http://driade?]

Our May workshop will consider the long-term existence of the repository in connection with the identification of higher level [more sophisticated] system functionalities. The aim is to help us move toward the development of a successful and more robust repository with a higher level of automated services. Implementing second level functionalities requires us to address a series of: 1. Organizational and behavioral challenges (e.g., cultural, financial, legal, etc.), and 2. Technical challenges (e.g., unsolved problems, or challenges requiring technical expertise beyond DRIADE’s team). [?? → We are calling upon stakeholders to help us identify second level functionalities in this workshop]

The workshop is structured to address questions related to several themes. The plan is to, a…b…c… The questions associated with each theme would be dealt with by a break-out group. Each break-out group has a moderator, who ensures that the discussion does not stray from the theme, functions as a scribe, and writes a draft summary report of the break-out group's discussion and recommendations. We should consider taping all break-out groups. The break-out groups at the II Workshop were also taped.

Think of these as questions for the moderators. A reduced set of questions will be provided to all attendees.

Theme 1: Adoption and sustainability

  1. What models exist for long-term financial sustainability of scientific data repositories? How successful are they? How applicable are they to a small science community? What can we learn from past sustainability break-downs (e.g., Swissprot, PDB)?
  2. What models exist for maximizing compliance among scientists? How can depositors be incentivized? How can use of the data in the repository be maximized among scientists?
  3. What is the relative importance of ease and speed of deposition, discovery, and retrieval to each other with respect to maximizing usability and community acceptance? How big is the risk of "failure due to success"?
  • Desired outcome: A summary of the challenges, and a recommendation of approaches, to ensuring long-term financial sustainability, compliance of prospective depositors, and community embrace.
  • Questions for attendees: How to achieve long-term financial sustainability? How to promote adoption by scientists?

Theme 2: Intellectual property and provenance requirements

  1. Which data licensing schemes are needed [appropriate for] to meet the needs of the researchers in a small science community? What (if anything) is missing among the range of licenses defined by CC? Do we need MTA-style licenses under which researchers can safely share data in the repository that is otherwise not publicly accessible?
  2. In which way does the publisher's copyright policy need to be taken into account? Do we preemptively need to mandate that depositors request author addendums, such as the SC addendum? Can we be held liable by publishers?
  3. At which detail and depth does provenance need to be recorded and provided? Are data items "co-owned" by all co-authors, only the corresponding or the lead or the senior author, or does each data item need a separate owner assignment? If the data is from a synthetic study, is the data depositable, and if so, what depth of citation should be required: citing the source of each data point, citing sources for each data item, citing sources for the entire study, or not citing any original sources in the deposition?
  4. To what extent does exposure of metadata records need to be coupled to the availability of the data, or need to be under the control of the data owner?
  • Desired outcome: A summary of the challenges, and a recommendation of approaches, to ensuring that the repository is free from copyright liabilities, that the requirements of researchers in a small science community to retain control over their data are met, and that data are appropriately attributed.
  • Questions for attendees: How do we balance intellectual property concerns with the goals of data sharing?

Theme 3: Distribution and replication: possibilities and liabilities

  1. To what extent is using federation reconcilable with the core priority of data preservation? If we delegate storage of certain data types to specialized repositories, to what extent will these repositories, and we ourselves, become liable to the same sustainability demands?
  2. What models exist for protecting the integrity of data through replicating repositories and data, such as LOCKSS? How could this be applied to a digital data repository in a small science discipline?
  3. To what extent do distributed storage technologies (such as SRB, grid storage) need to be part of the initial architecture?
  4. Is it reasonable to relegate responsibility for data preservation to specialized repositories?
  • Desired outcome: ...
  • Questions for attendees: What are the possibilities and liabilities of distributed data management and relationships with specialized repositories?
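
As context for question 2, the basic idea behind replication-based integrity checks can be sketched in a few lines. This is a simplified illustration only, not the LOCKSS protocol itself: each site reports a checksum of its copy, and a copy that disagrees with the majority is flagged for repair. The site names, file contents, and the function names `sha256_hex` and `audit_replicas` are hypothetical.

```python
import hashlib
from collections import Counter


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a byte string as a hex string."""
    return hashlib.sha256(data).hexdigest()


def audit_replicas(replicas: dict[str, bytes]) -> tuple[str, list[str]]:
    """Compare the copies of one data file held at several sites.

    Returns the digest shared by the largest number of sites, plus the
    sorted names of the sites whose copy disagrees with that digest.
    Ties between digests are not resolved here (a simplification).
    """
    digests = {site: sha256_hex(blob) for site, blob in replicas.items()}
    consensus, _count = Counter(digests.values()).most_common(1)[0]
    damaged = sorted(site for site, d in digests.items() if d != consensus)
    return consensus, damaged


# Hypothetical sites and file contents, for illustration only.
copies = {
    "site_a": b"taxon,length_mm\nA,12.5\n",
    "site_b": b"taxon,length_mm\nA,12.5\n",
    "site_c": b"taxon,length_mm\nA,12.6\n",  # silently altered copy
}
consensus, damaged = audit_replicas(copies)
print(damaged)  # ['site_c']
```

A scheme like this only detects damage if a majority of copies stays intact, so the number of independent replicas is itself a sustainability question for a small science repository.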

Theme 4: Lifecycle management requirements

  1. What capacity, and policies, should there be on data and metadata modifications (edits, additions, removal)?
  2. How should the repository deal with data formats that are no longer supported?
  3. How important is it to allow depositors to change the metadata of their data once deposited? Should metadata be versioned the same as the data, tied together or separate? What changes in the metadata might constitute a change of the data (e.g., changing the unit of a data column)?
  4. To what extent should augmented metadata and data format require the depositor's approval? Should metadata or data changes by curators be treated the same or different than changes by the data owner? Do curators' changes and identities need to be recorded and attributed?
  • Desired outcome:
  • Questions for attendees: What capacity, and policies, should there be on data and metadata modifications (edits, additions, removal)? How should the repository deal with data formats that are no longer supported?
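
To give moderators something concrete for questions 3 and 4, here is a minimal sketch of one possible policy: metadata revisions are append-only, and every change records who made it and in which role. This is an assumption for discussion, not a DRIADE design decision; the class names `Revision` and `MetadataRecord`, the field names, and the example values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Revision:
    number: int       # 1-based revision counter
    metadata: dict    # full metadata snapshot at this revision
    changed_by: str   # identity of the person making the change
    role: str         # "depositor" or "curator"
    note: str         # free-text reason for the change


@dataclass
class MetadataRecord:
    revisions: list = field(default_factory=list)

    def revise(self, changes: dict, changed_by: str, role: str,
               note: str = "") -> Revision:
        """Append a new revision; earlier revisions are never modified."""
        current = dict(self.revisions[-1].metadata) if self.revisions else {}
        current.update(changes)
        rev = Revision(len(self.revisions) + 1, current, changed_by, role, note)
        self.revisions.append(rev)
        return rev

    def latest(self) -> Revision:
        return self.revisions[-1]


# Hypothetical example: a curator corrects the unit of a data column.
record = MetadataRecord()
record.revise({"title": "Beak measurements", "length_unit": "cm"},
              "depositor_1", "depositor", "initial deposit")
record.revise({"length_unit": "mm"}, "curator_1", "curator", "unit corrected")

latest = record.latest()
print(latest.number, latest.metadata["length_unit"], latest.role)  # 2 mm curator
print(record.revisions[0].metadata["length_unit"])                 # cm
```

The unit change in the example is exactly the kind of metadata edit that might also count as a change of the data, which is why the sketch keeps the earlier snapshot retrievable.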

Theme 5: Looking toward "Phase III"

  1. How can semantic harvesting from DRIADE and content aggregation by 3rd-parties be maximized? How can discovery through web and semantic search engines be maximized?
  2. What kinds of service-oriented interfaces are useful, and which aren't? Will an OAI-PMH gateway suffice? Should data depositions, or data changes, be broadcast through feeds?
  3. How important, or dangerous, is allowing semantic tagging by users?
  4. What is the potential for a social networking feature within the repository, such as networks of "collaborators"?
  • Desired outcome: ...
  • Questions for attendees: Are there emergent technologies (e.g. semantic web) that would be valuable to include in future plans?
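
For question 2, it may help the break-out group to see what a 3rd-party harvester would actually do against an OAI-PMH gateway. The sketch below builds a `ListRecords` request for unqualified Dublin Core (`oai_dc`) and pulls identifiers and titles out of a response. The base URL, the record identifier, and the sample response are made up for illustration, and the response is abbreviated (a real one also carries `responseDate` and `request` elements); no network access is needed to run it.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces defined by OAI-PMH 2.0 and Dublin Core.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc") -> str:
    """Build the URL for an OAI-PMH ListRecords request."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def extract_titles(response_xml: str) -> list[tuple[str, str]]:
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(response_xml)
    results = []
    for record in root.iter(f"{OAI_NS}record"):
        identifier = record.findtext(f"{OAI_NS}header/{OAI_NS}identifier")
        title = record.findtext(f".//{DC_NS}title")
        results.append((identifier, title))
    return results


# Abbreviated, hypothetical response for illustration only.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repository.example.org:dataset-1</identifier>
        <datestamp>2007-05-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example data set</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("http://repository.example.org/oai"))
print(extract_titles(SAMPLE_RESPONSE))
```

Note that the harvester only ever sees metadata records; whether that is enough for data discovery and reuse is part of what this question asks.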

Provisional agenda

Each breakout to be taped. Moderators report but are not responsible for chief writing of overall report - student/consultant tasked with that.

Chalktalks at the beginning of breakouts to provide context

8 moderators

Day 1

  • 9:00 Presentation of objectives - us
  • 9:20 Stakeholder presentation - Mike's position paper
  • 9:30 3 minute madness for each participant, who they are, why they are here
  • 10:30 break
  • 10:50 Four concurrent breakouts
  • 12:15 Lunch
  • 1:30 Four concurrent breakouts - continuation of groups from morning
  • 3:30 Break
  • 3:50 10 minute group summaries with 10-20 minute discussion
    • identify issues to address the second day

Day 2

  • 9:00 Four concurrent breakouts with different groups - let people change their assignments
  • 10:30 break
  • 10:50 Return to same four concurrent breakouts
  • 12:15 Lunch
  • 1:30 10 minute group summaries with discussions
  • 2:30 Forward-looking theme
  • dismiss general attendees
  • 3:30 Break
  • 3:50 Stakeholders' debriefing

Follow up with stakeholders after sharing report in ~3 weeks

Older plans

Invitees, including roles, and potential alternates

  • Ahrash Bissell, OpenContext, how raw should the data be? (alternate: Eric/Sandy Kanza)
  • Margret Branchofsky, Dspace, data federation (alternate: McKenzie Smith)
  • Joe Bush, Taxonomy Strategies, digital lifecycle management
  • Adam Goldstein, Darwin Digital Library, what metadata is required?
  • John Graybeal, Marine Metadata Initiative, metadata generation by scientists
  • Jane Greenberg, Dublin Core, metadata requirements
  • Chris Greer, NSF, sustainable funding
  • Kevin Gamiel, RENCI, data federation & grid storage
  • Margaret Hedstrom, U Michigan, trust level & digital lifecycle management
  • Bryan Heidorn, UIUC, use of the grid for storage
  • Dianne Hillman, Cornell, metadata generation by scientists
  • Matt Jones, SEEK, how raw should the data be, user interface rqmnts & metadata generation
  • Paul Jones, iBiblio, sustainable funding
  • Liz Liddy, Center for Natural Language Processing, School of Information Studies, Syracuse University, metadata generation
  • Josh Madin, SEEK, what metadata is rqd
  • Michael Nelson, OAI-PMH, enabling 3rd party harvesting (alternate: Carl Lagoze)
  • Mohammed Noor, NESCent WG, integration of submission with journals, how raw should the data be? (alternates: Maria Servedio, Emilia Martins)
  • Sandy Payette, Fedora, interface w/ journals (alternate: Carl Lagoze?)
  • Bob Peet, Ecology Society, integration of submission with journals
  • Dav Robertson, NIEHS, repository trust level
  • Val Tannen, Penn, data integration (not sure how important his presence would be)
  • Herbert Van de Sompel, LANL, enabling 3rd party harvesting
  • Mary Vardigan, DDI/ICPSR, administration & sustainable funding, what metadata is rqd
  • John Willbanks, GBIF & Science Commons, intellectual property, and data federation (alternates: Stan Blum, Don Hobern)
  • someone from NASA, incentivizing data sharing

Not currently on invitee list, but probably should be

  • someone from CIESIN?
  • Bruce Bauer (World Data Center for Paleoclimatology)
  • Tom Hammond (Conservation Commons)
  • Emilia Martins (EthBase)
  • David Schloen (OCHRE)
  • Micah Altman (Virtual Data Center) Micah_Altman@harvard.edu

Potential floaters

  • C. Lynch

Metadata Class, Mock Workshop

Goals

The Four Virtues that we strive toward are the Sharing, Reuse, Preservation, and Synthesis of published evolutionary data. Decisions have to be made on how to promote these Virtues, and to what degree.

Questions for the participants:

  • Breakout 1
    • Raw data in repositories or processed data only? Spreadsheet data? (Bissell, M. Jones, others needed)
    • How can depositors be incentivized? (someone from NASA, others needed)
  • Breakout 2
    • How would the system be administered and sustainably funded? (Greer, P. Jones, Vardigan)
    • What intellectual property policies need to be put into place? (Willbanks)
  • Breakout 3
    • What is the role for data federation technology (central vs distributed repository)? (Branchofsky, Gamiel, Willbanks)
    • What is the role for bona fide data integration technology? (Heidorn, Gamiel, Tannen)
    • What is the role of distributed/grid storage? (Gamiel, Heidorn)
  • Breakout 4
    • What level of trust is necessary for the repository, e.g., persistence of data, protection of data from tampering, quality of metadata? (Hedstrom, Robertson)
    • What metadata is required and how to generate it (Dublin Core, DDI-lite, EML, standards imposed by specialized repositories)? (Bissell, Goldstein, Greenberg, Madin, Vardigan)
  • Breakout 5
    • Do we need to plan for metadata lifecycle management, and to what extent? (Bush, Hedstrom)
    • Should the system be capable of metadata generation, and if so to what extent, with how much human review? (Greenberg, Hillman, Liddy)
  • Breakout 6
    • How to synchronize ingestion with journal publication and 3rd-party database deposition? (Noor, Peet, others needed)
    • How to enable harvesting of data by 3rd-parties (e.g. OAI-PMH)? (Nelson, Van de Sompel)
    • What should be the functionality of the interface to the centralized registry? (Bissell, M. Jones)

Alternative Structure

The alternative draft structure proposed below is based on the original break-out groups and questions above, the knowledge accumulated since the original plan was written, the discussion between Todd, Jane, and Hilmar on Feb 7, and individual thoughts.

This alternative structure is grouped around 5 themes. Each theme is meant to represent a major challenge that the phase II repository will face, inevitably due to technological or scientific advance, or because we must overcome the challenge to deliver on our mission, or both. These major challenges are either non-technical (e.g., cultural, financial, legal, etc), or if they are technical, they present an unsolved problem and we have little or no core competency in the required area(s).

The invitees would include DRIADE stakeholders, i.e., there would be no 3rd workshop, at least not as originally planned.

The overarching theme of the workshop is "Digital data preservation, sharing, and discovery: Challenges for Small Science Communities in the Digital Era." All themes and questions below are posed in the specific context of small science communities.

The questions associated with each theme would be dealt with by a break-out group. Each break-out group has a moderator, who ensures that the discussion does not stray from the theme, functions as a scribe, and writes a draft summary report of the break-out group's discussion and recommendations. We should consider taping all break-out groups. The break-out groups at the II Workshop were also taped.

  1. Theme: Sustainability
    • Questions:
      1. What models exist for long-term financial sustainability of scientific data repositories? How successful are they? How applicable are they to a small science community? What can we learn from past sustainability break-downs (e.g., Swissprot, PDB)?
      2. What models exist for maximizing compliance among scientists? How can depositors be incentivized? How can use of the data in the repository be maximized among scientists?
      3. What is the relative importance of ease and speed of deposition, discovery, and retrieval to each other with respect to maximizing usability and community acceptance? How big is the risk of "failure due to success"?
    • Desired outcome: A summary of the challenges, and a recommendation of approaches, to ensuring long-term financial sustainability, compliance of prospective depositors, and community embrace.
  2. Theme: Intellectual property and provenance requirements
    • Questions:
      1. Which data licensing schemes are needed to meet the needs of the researchers in a small science community? What (if anything) is missing among the range of licenses defined by CC? Do we need MTA-style licenses under which researchers can safely share data in the repository that is otherwise not publicly accessible?
      2. In which way does the publisher's copyright policy need to be taken into account? Do we preemptively need to mandate that depositors request author addendums, such as the SC addendum? Can we be held liable by publishers?
      3. At which detail and depth does provenance need to be recorded and provided? Are data items "co-owned" by all co-authors, only the corresponding or the lead or the senior author, or does each data item need a separate owner assignment? If the data is from a synthetic study, is the data depositable, and if so, what depth of citation should be required: citing the source of each data point, citing sources for each data item, citing sources for the entire study, or not citing any original sources in the deposition?
      4. To what extent does exposure of metadata records need to be coupled to the availability of the data, or need to be under the control of the data owner?
    • Desired outcome: A summary of the challenges, and a recommendation of approaches, to ensuring that the repository is free from copyright liabilities, that the requirements of researchers in a small science community to retain control over their data are met, and that data are appropriately attributed.
  3. Theme: Distribution and replication: possibilities and liabilities
    • Questions:
      1. To what extent is using federation reconcilable with the core priority of data preservation? If we delegate storage of certain data types to specialized repositories, to what extent will these repositories, and we ourselves, become liable to the same sustainability demands?
      2. What models exist for protecting the integrity of data through replicating repositories and data, such as LOCKSS? How could this be applied to a digital data repository in a small science discipline?
      3. To what extent do distributed storage technologies (such as SRB, grid storage) need to be part of the initial architecture?
    • Desired outcome: ...
  4. Theme: Lifecycle management requirements
    • Questions:
      1. How important is it to allow scientists to change (edit, add to, remove from) their data once deposited? What information about previous data revisions does the repository need to retain, and for how long? Does on-line access to previous revisions need to be continuously provided?
      2. If a data format becomes unsupported, whose responsibility should it be to up-convert the data? Does up-converted data constitute a new revision, and who should be responsible for validating the up-conversion result?
      3. How important is it to allow depositors to change the metadata of their data once deposited? Should metadata be versioned the same as the data, tied together or separate? What changes in the metadata might constitute a change of the data (e.g., changing the unit of a data column)?
      4. To what extent should augmented metadata and data format require the depositor's approval? Should metadata or data changes by curators be treated the same or different than changes by the data owner? Do curators' changes and identities need to be recorded and attributed?
    • Desired outcome: ...
  5. Theme: Semantic Web and Web 2.0
    • Questions:
      1. How can semantic harvesting from DRIADE and content aggregation by 3rd-parties be maximized? How can discovery through web and semantic search engines be maximized?
      2. What kinds of service-oriented interfaces are useful, and which aren't? Will an OAI-PMH gateway suffice? Should data depositions, or data changes, be broadcast through feeds?
      3. How important, or dangerous, is allowing semantic tagging by users?
      4. What is the potential for a social networking feature within the repository, such as networks of "collaborators"?
    • Desired outcome: ...

Provisional agenda

Day 1:

  • Introductions and presentation of objectives
  • Refine, as a group, tasks for the breakout sessions.
  • Three concurrent breakout sessions over lunch and into early afternoon, with short chalktalks relevant to each topic followed by focused discussion on the breakout tasks.
  • Break
  • Late afternoon breakout group summaries

Day 2:

  • Three concurrent morning breakout sessions, again with short chalktalks relevant to each topic followed by focused discussion on the breakout tasks.
  • Lunch
  • 1-2 hour large group discussion
  • Writing of recommendations

May "Planning" Meeting

Invitees

  •  ? NCBI interface w/ specialized dbs
  • Bill Piel, Treebase, interface w/ specialized dbs (possibly 2nd meeting instead)
  • Greg Riccardo, Morphbank, interface w/ specialized dbs (possibly 2nd meeting instead)