Old:December 2006 Workshop Minutes

From Dryad wiki

Status: This page is outdated and of historical use only.

Minutes of the Dec 5, 2006 Stakeholder Workshop with journal editors and society representatives

Attending: Ahrash Bissell (Duke/OpenContext), Harold Heatwole (editor, Integrative & Comparative Biology), Mark Rausher (editor, Evolution), Michael Whitlock (editor, American Naturalist), Kathleen Smith (NESCent), Marcy Uyenoyama (incoming editor, Molecular Biology & Evolution), Don Waller (SSE), Todd, Hilmar, Jane, Ruth, Jed

Agenda:

  1. Goals of this meeting (Todd Vision)
  2. Brief description of OpenContext (Ahrash Bissell)
  3. Roundtable introductions
  4. Requirements and open questions (Hilmar Lapp)
  5. Issues regarding metadata (Jane Greenberg)
  6. Roundtable discussion:
    • Expectations and desires of the journals, publishers and scientific societies - What are the priorities?
    • Ideas for the requirements gathering phase
    • Suggestions for attendees at the spring stakeholders meeting
    • Further discussion of project plans and the two major upcoming meetings

Minutes

Presentations

Todd's presentation: provided an introduction to NESCent, the DRIADE project, and the meeting. He said that the Repository might help researchers deposit their data so that data are acceptable for GenBank, TreeBase, etc. He also indicated that the journals provide a good mechanism that could make it easy for authors to deposit data. There is an interest in data that may be perceived as useless to some, but may be useful to others. As a grand vision, there will be repurposing of data. There was brief discussion on the inclusion of data from experiments that resulted in negative results.

  • Q: How soon can researchers deposit data in the repository, in relation to eventual publication?

Ahrash described OpenContext, a repository for archaeologists, founded by Sarah & Eric Kansa at Harvard. He said that minimal metadata was required, in an effort to get the data in. Interdisciplinary questions can be addressed (using the heterogeneous datasets). Users enter via single sign-on access. The repository operates under a Creative Commons license.

Michael suggested that there might be an agreed date to have mandatory data archiving begin for authors publishing in the journals (i.e., those represented in the meeting) - for example, July 1, 2007.

Don suggested that there perhaps be a phased approach for deposits - metadata first, then data later.

Harold said he thinks that some data, not strictly part of the published paper, would be valuable to others - and that authors may not realize this.

  • Comment: data deposition is a huge burden.
  • Q: What is a wiki?
  • Q: What is a folksonomy?
  • Q: Is there research on tagging vs. controlled vocabulary? Jane responded that there is currently growing interest in this research area. Systems currently have keywords created both ways and this changes over time. The central part of the discussion is the design.

Hilmar's presentation: re: requirements - how much they are needed, how they can be derived from use-cases and user-stories, etc. He also showed a matrix of the Repository/System Challenges vs. "Virtues". Hilmar emphasized the importance of gathering the appropriate requirements.

  • Q: When you say gathering requirements, what do you mean? Hilmar responded, clarifying that it means specifying requirements or obtaining them from users - there is no formal checklist or set of guidelines, but requirements might be gleaned from asking users what they want. There should be lots of input.

Todd stated that if this is done well, the implementation will be easier.

  • Mark asked that "stakeholders" be defined, and there was some discussion about why authors were NOT included. Authors certainly have a stake (in fact, they will re-use the data), but they are not as readily accessible, unless there were some way for them to be represented...
    • We have identified the journals as proxies for authors. Authors generate data, so getting their input is a good idea. We also need to look at people's behavior. What are the issues for authors when interacting with supplemental data?

Jane's presentation: re: metadata in general, and cited some examples, standards, and the metadata issues underlying the development of DRIADE.

  • Michael asked if the software was open-source. It was explained that metadata is not itself software, but that most metadata schemas and many tools for metadata implementation are open-source.

Jane described metadata standards as a continuum. Once the data objects are defined, what metadata schemes might work in this scenario? Are there ways to couple and package different schemas, so that there is no reinvention of the wheel? There was a question regarding the detail level of metadata and its effect on retrieval speed when accessing the saved data. Jane also mentioned the preservation life cycle.

Todd demonstrated Morpho (the KNB ecology metadata/analysis tool that works with (and requires) EML).

  • Q: Does the system allow keywords to be entered that are NOT on the list (EML)?
    • A: YES. A radio-button is clicked for "in the list" or "not in the list".
  • Q: What if your data is not in standardized format?
    • A (Todd): That is an issue. Is it possible to define formats and then allow exceptions?
  • Don attended a workshop in July 2006 where it was determined that something like 12 out of 60 or 80 are using the (Morpho?) system.
  • Jane agreed. NIEHS had a similar problem - after many agreed they were interested.
  • (Marcy?) said the fact that data comes in too many different forms is a stumbling block.

Roundtable discussion

Michael presented a model relating data issues to Maslow's hierarchy of life needs. Most important is preservation, then ability to access, then the ability to synthesize. He said that the key is to save the data - that way it won't be lost. He doesn't believe that there will ever be a standard that everyone will adhere to. He stated that there needs to be a culture change. The main problem is that the data, after a certain point in time, doesn't exist anymore. The repository should not be fancy, and there should be no assumptions about form. Anyone should be able to archive data so that it won't be lost. There should be a generic metadata format for archiving. We need to have a place to put the data before it is lost.

There was some discussion about journal policy on archiving.

  • Integrative and Comparative Biology is published by Oxford University Press (OUP). OUP has a place for supplemental data. Submission is not required and is up to the author. The system is partially open access. Authors can pay to have it there. OUP is dedicated to all open access - this would apply to datasets - and they are moving in that direction.
  • American Naturalist policy - from the University of Chicago Press. They are keen on archiving and will support the project.
  • Evolution Journal (need name here). They do not have much policy. Published by Blackwell, now Wiley. The thought is that generally, data archiving is a good idea.
  • Q: What are authors' rights in terms of datasets? There is some concern about where the data should reside. Not sure that the journals should house the data... but then who does? Concept of Central repository. Concept of Evolution World website. Another group administers rules.

Comment - if journals do it, smaller journals and specialized data may not be represented. Blackwell does not have online supplemental data. Journals and NSF have different views/policies on what is published.

M - Journals have moral authority for what is published. As for author rights - you must provide replicable results.

M - This is probably a good standard, but some authors generate data with their own funding. It IS their dataset.

Don Waller said saving it first has to be a priority. Not just for published works, but also for data that might be a casualty in a big truck scenario. Interest in repositories that can provide access as soon as possible. Much concern over security. Make sure that "god's design" does not penetrate security of data repository. His concern was a denial-of-service attack. Rights management was the key: the original author should be able to decide whether to have his name included when the data was re-used. If there is no rights management, this will fall flat on its face. There should be gradual buy-in. There should be a statement in the metadata standard on the provision of recognition.

This raises the question of meta-analyses. Bumpus's sparrows were mentioned. Discussion on public funds vs. public domain. Education should foster a culture for sharing. Mark asked: if data is developed by non-public funds, should it still be fully open-access?

Harold asked about PhD theses. They often don't get published. Nor does the underlying data.

Michael said he believes in a simple caveat: that authors' data in a published paper should be (made available and?) presented in such a manner, that the experiments and results could be recreated.

It was stated that everyone should be creating a low level record. There was discussion on the culture, the value of data, preservation, access and data documentation.

OpenContext has a Creative Commons license. The metadata is public as well as the data. There is nothing preventing the author from saying how the data may be used/attributed.

Comment - there is great author diversity. Why is unstructured data different? GenBank is very structured, perhaps there is more compliance.

Don mentioned a fear of piracy - that there are data farmers. Attribution is a big issue. There could be a lack of shared knowledge of where the data came from. If all of the data is in a centralized repository, provenance is easier, and there is a built-in policing model.

Marcy Uyenoyama stated that she sees a massive problem with quality control. If there is mandated deposition, there may be passive-aggressive behavior. There is a need to see the source code on the tool that did the analysis. Will the algorithm be shared? Bioinformatics requires the code to be accessible. Quality control will have something to do with how the data is used. Messy data can be cleaned up, but messy data may not be used. May need further description. Over time quality can improve. PhD data frequently is never seen - for a variety of reasons. PhD materials could be required. With quality control for published journals, editors/readers catch mistakes, we don't want mistakes exposed. If mistakes can be found, we have an incentive for providing good data.

Don stated that passive-aggressive behaviors might be affected by ease of data entry - there may be a hesitation to deposit data if the process is cumbersome. Concern about data from older/retiring faculty/investigators. There are many reasons to want to share - safe personal place to store for own reuse, ability to share with collaborators.

Concept of science advancing through the interpretation of data. Initially, investigators were told to minimize data - just use/provide summary data...things are different now.

GenBank = everyone submits, but there is no place for paper attribution. It is crucial to have a reference for a paper. Add the paper information to the dataset. Permanent link to URL. Data referencing section - could be painless.

Journals will not be happy if the data is cited, but not the paper. Paper data must have the paper cited. What is the appropriate citation?

Eventually, could cite both.

Jane stated that data quality is very important. A stated that the standard and quality change over time. Jane stated that the data is gathered for a lifetime. One place for sustainability concepts/ideas is ibiblio.org. It is the people's public library. Carolina Minds has been set up for retiring professors/those no longer living.

Comment made that whatever standard/policy is adopted, higher level metadata will make the dataset more useful. Incentives may help in having a more complex metadata hierarchy.

Comment made that incentives are built into the way science is done. Many are looking for collaborations. In looking at GenBank as a model - what would make more than the EB field interested in the data repository?

Hilmar provided a brief history of GenBank's development. Initially, journals said they would not publish all of the sequence data that was being created. This was a research result. Also, NIH mandated delivery to GenBank. Now there is much more supplemental data than the sequence data.

Comment made that the actual data points should be archived.

Comment made that the copyright lives with the author. Concept of toll-free phone number link. Voluntary as to what author wants to contribute.

Harold stated that secondary material used to be in an appendix as part of the additional materials. Simply giving all data to journals would create copyright issue (XXXXCan someone clarify for me? rm)

Comment made that journals should require the author to provide data that could be used to recreate the analysis. We are talking about data tables, tables for coordinates of figures, photos, in an easy to store place.

Don raised a question regarding video clips. Is the data the video clip or the data surrogate? Different levels of extraction exist. We want to capture the extracted data that goes into publication.

For published data, will not need to include other types of media. Extraction is interpretation.

When creating a policy/standard, we need to state it in a way that seems clear, but leaves latitude.

Hilmar explained the MIAME standard - an example from genomics. The community decided that the standard meant the ability to recreate the result from the given data.

Question - Where is "there" and who will be paying?

Comment - subscribers? NSF stance is that Academic libraries will fill in holes.

Comment - Journals can't charge to have data deposited.

Jane mentioned OCLC and their potential interest in the question of where this might live. Somewhere global and ubiquitous.

  • We need to sit down with editors and authors to define guidelines and limits.

Simplicity is key.

  • We need to provide some perspective on why we are doing it.

Data submission required for reviewers?

Make the requirement general, detailed screens are police. Reviewers are middle ground. Allow authors to submit, do not require. Once data is public, it is public. Provide accession number from database (like GenBank). Is data to be immediately accessible?

Who pays? How much will this cost? The main cost is for administration. NSF has said it will NOT fund long term preservation. Will there be public funding? How many curators/administrators does this take?

NCBI has 30-50 curators.

Comment - we need to archive previous versions of the same datasets to prohibit malfeasance.

There are many areas of inquiry as the field evolves.

Geospatial data - there are many ways to represent it. With the system, we can track the evolution of standards and trends through recordkeeping.

Needs to be a base standard.

How do we research requirements? Brainstorm, suggestions for further meetings - who should participate?

Don - Should we start with editorials in journals? Follow up with poll? See how willing authors are?

Comment - If editors come to an agreement, use a limited standard. All data need to be in X format, open access, and for X years not used for other activities.

Todd - Should be separate from GenBank/TreeBase, etc.

Get authors/users used to the idea, then build later.

Jane asked: is data in more than one place? Either/or?

Ah - Will there be a portal? How can the portal lead to other related "data places"?

Comment - not needed now.

Ah - no, but good to think forward.

Mark - editorial boards are all functional scientists. Journals get together. Best version of joint policy, then go to editorial board. How encompassing should this be? Synthetic EB journals should have a baseline policy.

Don - should there be an editorial poll?

Mike - no, should be top down.

Todd - are there folks we may want to include? European perspectives, Japanese editors, not necessary to include past editors.

Need a meta-analysis working group. Folks outside the field, but those who have overlapping standards.

Michael - Propose a model for most of the field. Give a model to pick at.

Don - Have something planned. Mike - here are X fields suggested.

Jane - concept of focus group. Delphi method? Meeting in July? Potential users.

Hilmar asked/stated....does this come from the top down? Perspective and requirements to come from journals. Users help to fine tune from there.

AH - Everyone should be aware that the material can go out there. Many younger folks look for an acceptable place? People don't contribute.

Comment - optional for a few years? Comment - what is critical is time. Comment - what is needed is concept of being told to do it. Hilmar - should be same for code.

Hilmar presented the following list as suggestions for "must have" requirements:

Preservation

  • Some way to save materials ASAP - cost effective
  • Data quality
  • Security and curation
  • Use minimal set of metadata (as a start)
  • commercial representation
  • Level of data
    • data summaries
    • data points
  • Data lifecycle
  • Self-sustaining financially
  • Security

IP, LICENSE, ACCESS

  • Restricted access period
  • Depends on funding
  • Data published, or not?
  • Licensing scheme
  • Citation policy for data
  • Author control

SCOPE OF DATA PROJECT

  • Published data
  • Field data
  • Any format, standard, any type
  • Methods, source code

Question - can we get supplemental data up and running? Comments varied by journal.

Question from Jed - Is there an interest in the quality of the metadata if authors may not want to provide? Answers - yes...

Discussion of description. Discovery and preservation ONLY supported. Jane mentions data curators.

Top priorities

Don-

  1. Originator intellectual property - author control
  2. Complete repository quickly
  3. Security
  4. Incentives for more rigorous metadata provision

Mark -

  1. Author control, restricted access/licensing
  2. Security issues - who can contribute
  3. Financial stability
  4. Celerity of creation

Marcy -

  1. Data Quality - burden on author if standard format required?
  2. Financial sustainability
  3. Author control
  4. Source code
  5. Explicit policy on deposit

Michael -

  1. ASAP
  2. Minimal metadata
  3. - emphasize first two votes

There was a call for other suggestions/ideas. Comment on legacy datasets for those scholars retiring. Testamentary datasets - contact Karen Strier (U of Wisconsin Anthropology) about interest in collecting primate databases.

Obtaining feedback

  • Series of editorials
  • Utilize editorial boards
  • Journals create baseline policy and baseline metadata requirements and pass it down

Additional people to invite/involve

  • European perspective
  • Meta-analysis working group
  • Japanese perspective

Summary

Data Preservation

  • Save materials ASAP
    • cost effective
    • open door policy
  • Data quality
    • curation (data and metadata)
    • data security
  • Use minimal set of metadata (as a start)
    • canonical representation
  • Level of data
    • data summaries and data points
    • rule is to be able to recreate the results
  • Data lifecycle
    • versioning
    • audit trail?
  • Self-sustaining financial model
  • Security from attacks (e.g., D.o.S)
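
As an illustration only, the preservation items above (a minimal metadata set, versioning, and an audit trail) might be sketched roughly as follows. The `Deposit` class and all field names are hypothetical and were not discussed at the workshop; this is a sketch under those assumptions, not a design for the repository.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Deposit:
    """Hypothetical deposit: minimal metadata plus an append-only version history."""
    title: str
    author: str
    article_citation: str  # the paper the data belongs to
    versions: list = field(default_factory=list)  # audit trail, oldest first

    def add_version(self, payload: bytes, note: str) -> int:
        """Record a new version; earlier entries are never overwritten."""
        self.versions.append({
            "number": len(self.versions) + 1,
            "sha256": hashlib.sha256(payload).hexdigest(),  # fingerprint of the data
            "deposited": datetime.now(timezone.utc).isoformat(),
            "note": note,
        })
        return len(self.versions)


# Example values are made up for illustration.
deposit = Deposit("Example dataset", "A. Author", "Example citation (hypothetical)")
deposit.add_version(b"site,count\nA,3\n", "initial deposit")
deposit.add_version(b"site,count\nA,4\n", "corrected count for site A")
```

Keeping every version with its checksum is one way to address the comment that previous versions of a dataset should be archived so that later changes remain visible.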

IP, access control & license

  • Dependency on funding source and whether data is published yet or not
  • Authors retain control
    • licensing scheme for access
    • restricted access period
    • citation policy for data
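
One way to picture the author-control items above is a small access policy attached to each dataset. The `AccessPolicy` class, its field names, and the example dates and license label below are assumptions made for illustration, not something the workshop agreed on.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class AccessPolicy:
    """Hypothetical per-dataset policy chosen by the author."""
    license_name: str              # licensing scheme for access
    embargo_until: Optional[date]  # restricted access period; None means none
    citation: str                  # how reuse should be credited

    def is_public(self, today: date) -> bool:
        """Data is public once any restricted access period has ended."""
        return self.embargo_until is None or today >= self.embargo_until


# Example values are made up for illustration.
policy = AccessPolicy("CC BY (example)", date(2008, 7, 1), "Cite the dataset and the paper")
print(policy.is_public(date(2007, 7, 1)))  # False: still within the restricted period
print(policy.is_public(date(2008, 7, 1)))  # True: the restricted period has ended
```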

Scope of data for the repository

  • Published data only
    • field data valuable too though
  • Any format, any standard, any type
    • combined with easy ability to promote data and metadata to higher levels of structure and standards compliance
  • Methods
    • source code for custom software or algorithms