Difference between revisions of "Old:December 2006 Workshop Minutes"

From Dryad wiki
''Status: This page is outdated and of historical use only.''

<big>'''Minutes of the Dec 5, 2006 Stakeholder Workshop with journal editors and society representatives'''</big>
 
#[[media: 5dec05.tjv.ppt|Goals of this meeting (Todd Vision)]]
#Brief description of OpenContext (Ahrash Bissell)
#Roundtable introductions
#[[Media:DRIADE Journal Reps workshop HL.ppt|Requirements and open questions (Hilmar Lapp)]]
#[[Media:NESCent_publishers_2.ppt|Issues regarding metadata (Jane Greenberg)]]
#Roundtable discussion:
#*Expectations and desires of the journals, publishers and scientific societies - What are the priorities?
#*Ideas for the requirements gathering phase
#*Suggestions for attendees at the spring stakeholders meeting
#*Further discussion of project plans and the two major upcoming meetings

==Minutes==
 
===Presentations===

[[media: 5dec05.tjv.ppt|'''Todd's presentation''']]: Todd provided an introduction to NESCent, the DRIADE project, and the meeting. He said that the Repository might help researchers deposit their data so that the data are acceptable for GenBank, TreeBase, etc. He also indicated that the journals provide a good mechanism that could make it easy for authors to deposit data. There is an interest in data that may be perceived as useless by some but may be useful to others. As a grand vision, there will be repurposing of data. There was a brief discussion on the inclusion of data from experiments that produced negative results.

* Q: How soon can researchers deposit data in the repository, in relation to eventual publication?
 
* Q: Is there research on tagging vs. controlled vocabulary? Jane responded that there is currently growing interest in this research area. Systems currently have keywords created both ways and this changes over time. The central part of the discussion is the design.

[[Media:DRIADE Journal Reps workshop HL.ppt|'''Hilmar's presentation''']]: covered requirements - how much they are needed, how they can be derived from use cases and user stories, etc. He also showed a matrix of the Repository/System Challenges vs. "Virtues". Hilmar emphasized the importance of gathering the appropriate requirements.

* Q: When you say gathering requirements, what do you mean? Hilmar responded that it means specifying requirements or obtaining them from users - there is no formal checklist or set of guidelines, but requirements might be gleaned from asking users what they want. There should be lots of input.

Todd stated that if this is done well, the implementation will be easier.

* Mark asked that "stakeholders" be defined, and there was some discussion about why authors were not included. Authors certainly have a stake (in fact, they will re-use the data), but they are not as readily accessible, unless there were some way for them to be represented...
** We have identified the journals as proxies for authors. Authors generate data, so getting their input is a good idea. We also need to look at people's behavior. What are the issues for authors when interacting with supplemental data?

[[Media:NESCent_publishers_2.ppt|'''Jane's presentation''']]: covered metadata in general, citing some examples, standards, and the metadata issues underlying the development of DRIADE.

* Michael asked if the software was open-source. It was explained that metadata is not itself software, but that most metadata schemas and many tools for metadata implementation are open-source.

Jane described metadata standards as a continuum. Once the data objects are defined, what metadata schemes might work in this scenario? Are there ways to couple and package different schemas so that there is no reinvention of the wheel? There was a question regarding the level of metadata detail and its effect on retrieval speed when accessing the saved data. Jane also mentioned the preservation life cycle.
 
===Roundtable discussion===

'''Michael''' presented a model relating data issues to Maslow's hierarchy of life needs. Most important is preservation, then the ability to access, then the ability to synthesize. He said that '''the key is to save the data - that way it won't be lost'''. He doesn't believe that there will ever be a standard that everyone will adhere to. He stated that there needs to be a culture change. The main problem is that the data, after a certain point in time, doesn't exist anymore. The repository should not be fancy, and there should be no assumptions about form. '''Anyone should be able to archive data so that it won't be lost.''' There should be a generic metadata format for archiving. We need to have a place to put the data before it is lost.

There was some discussion about journal policy on archiving.

* Integrative and Comparative Biology is published by Oxford University Press (OUP). OUP has a place for supplemental data. Submission is not required and it is up to the author. The system is partially open access. Authors can pay to have it there. OUP is dedicated to all open access - this would apply to datasets - they are moving in that direction.
* American Naturalist policy - from the University of Chicago Press. They are keen on archiving and will support the project.
* Evolution Journal (need name here). They do not have much policy. Published by Blackwell, now Wiley. The thought is that generally, data archiving is a good idea.
* Q: What are authors' rights in terms of datasets? There is some concern about where the data should reside. Not sure that the journals should house the data... but then who does? Concept of a central repository. Concept of an Evolution World website. Another group administers rules.

Comment - if journals do it, smaller journals and specialized data may not be represented.
 
Journals and NSF have different views/policies on what is published.

M - Journals have moral authority for what is published. As for author rights - you must provide replicable results.

M - This is probably a good standard, but some authors generate data with their own funding. It IS their dataset.
 
Michael said he believes in a simple caveat: that authors' data in a published paper should be (made available and?) presented in such a manner that the experiments and results could be recreated.

It was stated that everyone should be creating a low-level record.
There was discussion on the culture, the value of data, preservation, access and data documentation.

'''Open context''' - has a Creative Commons license. The metadata is public as well as the data. There is nothing preventing the author from saying how the data may be used/attributed.

Comment - there is great author diversity. Why is unstructured data different? GenBank is very structured; perhaps there is more compliance.

Don mentioned a fear of piracy - that there are data farmers. Attribution is a big issue. There could be a lack of shared knowledge of where the data came from. If all of the data is in a centralized repository, provenance is easier, and there is a built-in policing model.
 
'''Marcy Uyenoyama''' stated that she sees a massive problem with quality control. If there is mandated deposition, there may be passive-aggressive behavior. There is a need to see the source code of the tool that did the analysis. Will the algorithm be shared? Bioinformatics requires the code to be accessible. Quality control will have something to do with how the data is used. Messy data can be cleaned up, but messy data may not be used. May need further description. Over time quality can improve. PhD data frequently is never seen - for a variety of reasons. PhD materials could be required. With quality control for published journals, editors/readers catch mistakes; we don't want mistakes exposed. If mistakes can be found, we have an incentive for providing good data.

Don stated that passive-aggressive behaviors might be affected by ease of data entry - there may be a hesitation to deposit data if the process is cumbersome.
Concern about data from older/retiring faculty/investigators. There are many reasons to want to share - a safe personal place to store data for one's own reuse, the ability to share with collaborators.

Concept of science advancing through the interpretation of data. Initially, investigators were told to minimize data - just use/provide summary data... things are different now.

GenBank: everyone submits, but there is no place for paper attribution. It is crucial to have a reference for a paper. Add the paper information to the dataset. Permanent link to URL. Data referencing section - could be painless.
 
Eventually, could cite both.

Jane stated that data quality is very important.
A stated that the standard and quality change over time.
Jane stated that the data is gathered for a lifetime. One place for sustainability concepts/ideas is ibiblio.org. It is the people's public library. Carolina Minds has been set up for retiring professors/those no longer living.
 
Don raised a question regarding video clips. Is the data the video clip or the data surrogate? Different levels of extraction exist. We want to capture the extracted data that goes into publication.

For published data, there will be no need to include other types of media. Extraction is interpretation.

When creating a policy/standard, we need to state it in a way that seems clear but leaves latitude.

Hilmar explained the MIAME standard - an example from genomics. The community decided that the standard meant the ability to recreate the result from the given data.

Question? Where is "there" and who will be paying?

Comment - subscribers? NSF stance is that academic libraries will fill in holes.

Comment - Journals can't charge to have data deposited.

Jane mentioned OCLC and their potential interest in the question of where this might live. Somewhere global and ubiquitous.
* We need to sit down with editors and authors to define guidelines and limits.
Simplicity is key.
* We need to provide some perspective on why we are doing it.
Data submission required for reviewers?

Make the requirement general; detailed screens are police. Reviewers are middle ground.
Allow authors to submit, do not require.
Once data is public, it is public.
 
How much will this cost? The main cost is for administration. NSF has said it will NOT fund long-term preservation. Will there be public funding? How many curators/administrators does this take?

NCBI has 30-50 curators.

Comment - we need to archive previous versions of the same datasets to prohibit malfeasance.

There are many areas of inquiry as the field evolves.

Geospatial data - there are many ways to represent it. With the system, we can track the evolution of standards and trends through recordkeeping.

There needs to be a base standard.

How do we research requirements?
Brainstorm, suggestions for further meetings - who should participate?

Don - Should we start with editorials in journals? Follow up with a poll? See how willing authors are?

Comment - If editors come to an agreement, use a limited standard. All data need to be in X format, open access, and for X years not used for other activities.

Todd - Should be separate from GenBank/TreeBase, etc.

Get authors/users used to the idea, then build later.

Jane asked: is data in more than one place? Either/or?
 
Ah - no, but good to think forward.

Mark - editorial boards are all functional scientists. Journals get together. Best version of joint policy, then go to editorial board. How encompassing should this be? Synthetic EB journals should have a baseline policy.

Don - should there be an editorial poll?
 
Todd - are there folks we may want to include? European perspectives, Japanese editors; not necessary to include past editors.

Need a meta-analysis working group.
Folks outside the field, but those who have overlapping standards.

Michael - Propose a model for most of the field. Give a model to pick at.

Don - Have something planned.
 
AH - Everyone should be aware that the material can go out there. Many younger folks look for an acceptable place? People don't contribute.

Comment - optional for a few years?
Comment - what is critical is time.
Comment - what is needed is the concept of being told to do it.
 
# Incentives for more rigorous metadata provision

Mark -
# Author control, restricted access/licensing
# Security issues - who can contribute
 
# Celerity of creation

Marcy -
# Data Quality - burden on author if standard format required?
# Financial sustainability
 
# Explicit policy on deposit

Michael -
# ASAP
# Minimal metadata
 
* Data quality
** curation (data and metadata)
** data security
* Use minimal set of metadata (as a start)
** canonical representation
 
** field data valuable too though
* Any format, any standard, any type
** combined with easy ability to promote data and metadata to higher levels of structure and standards compliance
* Methods
** source code for custom software or algorithms

[[Category:Workshops]]
[[Category:Historical Pages]]

Latest revision as of 10:43, 9 October 2012

Status: This page is outdated and of historical use only.

Minutes of the Dec 5, 2006 Stakeholder Workshop with journal editors and society representatives

Attending: Ahrash Bissell (Duke/OpenContext), Harold Heatwole (editor, Integrative & Comparative Biology), Mark Rausher (editor, Evolution), Michael Whitlock (editor, American Naturalist), Kathleen Smith (NESCent), Marcy Uyenoyama (incoming editor, Molecular Biology & Evolution), Don Waller (SSE), Todd, Hilmar, Jane, Ruth, Jed

Agenda:

  1. Goals of this meeting (Todd Vision)
  2. Brief description of OpenContext (Ahrash Bissell)
  3. Roundtable introductions
  4. Requirements and open questions (Hilmar Lapp)
  5. Issues regarding metadata (Jane Greenberg)
  6. Roundtable discussion:
    • Expectations and desires of the journals, publishers and scientific societies - What are the priorities?
    • Ideas for the requirements gathering phase
    • Suggestions for attendees at the spring stakeholders meeting
    • Further discussion of project plans and the two major upcoming meetings

Minutes

Presentations

Todd's presentation: Todd provided an introduction to NESCent, the DRIADE project, and the meeting. He said that the Repository might help researchers deposit their data so that the data are acceptable for GenBank, TreeBase, etc. He also indicated that the journals provide a good mechanism that could make it easy for authors to deposit data. There is an interest in data that may be perceived as useless by some but may be useful to others. As a grand vision, there will be repurposing of data. There was a brief discussion on the inclusion of data from experiments that produced negative results.

  • Q: How soon can researchers deposit data in the repository, in relation to eventual publication?

Ahrash described OpenContext, a repository for archaeologists, founded by Sarah & Eric Kansa at Harvard. He said that minimal metadata was required, in an effort to get the data in. Interdisciplinary questions can be addressed (using the heterogeneous datasets). Users enter via single sign-on access. The repository operates under a Creative Commons license.

Michael suggested that there might be an agreed date for mandatory data archiving to begin for authors publishing in the journals (i.e., those represented in the meeting) - for example, July 1, 2007.

Don suggested that there perhaps be a phased approach for deposits - metadata first, then data later.

Harold said he thinks that some data, not strictly part of the published paper, would be valuable to others - and that authors may not realize this.

  • Comment: data deposition is a huge burden.
  • Q: What is a wiki?
  • Q: What is a folksonomy?
  • Q: Is there research on tagging vs. controlled vocabulary? Jane responded that there is currently growing interest in this research area. Systems currently have keywords created both ways and this changes over time. The central part of the discussion is the design.

Hilmar's presentation: covered requirements - how much they are needed, how they can be derived from use cases and user stories, etc. He also showed a matrix of the Repository/System Challenges vs. "Virtues". Hilmar emphasized the importance of gathering the appropriate requirements.

  • Q: When you say gathering requirements, what do you mean? Hilmar responded that it means specifying requirements or obtaining them from users - there is no formal checklist or set of guidelines, but requirements might be gleaned from asking users what they want. There should be lots of input.

Todd stated that if this is done well, the implementation will be easier.

  • Mark asked that "stakeholders" be defined, and there was some discussion about why authors were not included. Authors certainly have a stake (in fact, they will re-use the data), but they are not as readily accessible, unless there were some way for them to be represented...
    • We have identified the journals as proxies for authors. Authors generate data, so getting their input is a good idea. We also need to look at people's behavior. What are the issues for authors when interacting with supplemental data?

Jane's presentation: covered metadata in general, citing some examples, standards, and the metadata issues underlying the development of DRIADE.

  • Michael asked if the software was open-source. It was explained that metadata is not itself software, but that most metadata schemas and many tools for metadata implementation are open-source.

Jane described metadata standards as a continuum. Once the data objects are defined, what metadata schemes might work in this scenario? Are there ways to couple and package different schemas so that there is no reinvention of the wheel? There was a question regarding the level of metadata detail and its effect on retrieval speed when accessing the saved data. Jane also mentioned the preservation life cycle.

Todd demo'ed Morpho (the KNB ecology metadata/analysis tool that works with (and requires) EML)

  • Q: Does the system allow keywords to be entered that are NOT on the list (EML)?
    • A: YES. A radio-button is clicked for "in the list" or "not in the list".
  • Q: What if your data is not in standardized format?
    • A (Todd): That is an issue. Is it possible to define formats and then allow exceptions?
  • Don attended a workshop in July 2006 where it was determined that something like 12 out of 60 or 80 are using the (Morpho?) system.
  • Jane agreed. NIEHS had a similar problem - after many had agreed they were interested.
  • (Marcy?) said the fact that data comes in too many different forms is a stumbling block.

Roundtable discussion

Michael presented a model relating data issues to Maslow's hierarchy of life needs. Most important is preservation, then the ability to access, then the ability to synthesize. He said that the key is to save the data - that way it won't be lost. He doesn't believe that there will ever be a standard that everyone will adhere to. He stated that there needs to be a culture change. The main problem is that the data, after a certain point in time, doesn't exist anymore. The repository should not be fancy, and there should be no assumptions about form. Anyone should be able to archive data so that it won't be lost. There should be a generic metadata format for archiving. We need to have a place to put the data before it is lost.

There was some discussion about journal policy on archiving.

  • Integrative and Comparative Biology is published by Oxford University Press (OUP). OUP has a place for supplemental data. Submission is not required and it is up to the author. The system is partially open access. Authors can pay to have it there. OUP is dedicated to full open access, which would apply to datasets as well; they are moving in that direction.
  • American Naturalist policy - from the University of Chicago Press. They are keen on archiving and will support the project.
  • Evolution Journal (need name here). They do not have much policy. Published by Blackwell, now Wiley. The thought is that generally, data archiving is a good idea.
  • Q: What are authors' rights in terms of datasets? There is some concern about where the data should reside. Not sure that the journals should house the data... but then who does? Concept of central repository. Concept of Evolution World website. Another group administers rules.

Comment - if journals do it, smaller journals and specialized data may not be represented. Blackwell does not have online supplemental data. Journals and NSF have different views/policies on what is published.

M - Journals have moral authority for what is published. As for author rights - you must provide replicable results.

M - This is probably a good standard, but some authors generate data with their own funding. It IS their dataset.

Don Waller said saving the data first has to be a priority. Not just for published works, but also for data that might be a casualty in a big truck scenario. There is interest in repositories that can provide access as soon as possible. There is much concern over security. Make sure that "god's design" does not penetrate the security of the data repository. His concern was a denial-of-service attack. Rights management was the key: the original author should be able to decide whether to have their name included when the data are re-used. If there is no rights management, this will fall flat on its face. There should be gradual buy-in. There should be a statement in the metadata standard on the provision of recognition.

This raises the question of meta-analyses. Bumpus's sparrows were mentioned. Discussion of public funds vs. public domain. Education should foster a culture of sharing. Mark asked whether data developed with non-public funds should still be fully open access.

Harold asked about PhD theses. They often don't get published. Nor does the underlying data.

Michael said he believes in a simple caveat: that authors' data in a published paper should be (made available and?) presented in such a manner that the experiments and results could be recreated.

It was stated that everyone should be creating a low level record. There was discussion on the culture, the value of data, preservation, access and data documentation.

Open context - has a creative commons license. The metadata is public as well as the data. There is nothing preventing the author from saying how the data may be used/attributed.

Comment - there is great author diversity. Why is unstructured data different? GenBank is very structured, perhaps there is more compliance.

Don mentioned a fear of piracy - that there are data farmers. Attribution is a big issue. There could be a lack of shared knowledge of where the data came from. If all of the data is in a centralized repository, provenance is easier, and there is a built-in policing model.

Marcy Uyenoyama stated that she sees a massive problem with quality control. If there is mandated deposition, there may be passive-aggressive behavior. There is a need to see the source code of the tool that did the analysis. Will the algorithm be shared? Bioinformatics requires the code to be accessible. Quality control will have something to do with how the data is used. Messy data can be cleaned up, but messy data may not be used. May need further description. Over time quality can improve. PhD data frequently is never seen - for a variety of reasons. PhD materials could be required. With quality control for published journals, editors/readers catch mistakes; we don't want mistakes exposed. If mistakes can be found, we have an incentive for providing good data.

Don stated that passive-aggressive behaviors might be affected by ease of data entry - there may be a hesitation to deposit data if the process is cumbersome. Concern about data from older/retiring faculty/investigators. There are many reasons to want to share - a safe personal place to store data for one's own reuse, the ability to share with collaborators.

Concept of science advancing through the interpretation of data. Initially, investigators were told to minimize data - just use/provide summary data...things are different now.

GenBank: everyone submits, but there is no place for paper attribution. It is crucial to have a reference for a paper. Add the paper information to the dataset. Permanent link to URL. Data referencing section - could be painless.

Journals will not be happy if the data is cited, but not the paper. Paper data must have the paper cited. What is the appropriate citation?

Eventually, could cite both.

Jane stated that data quality is very important. A stated that the standard and quality change over time. Jane stated that the data is gathered for a lifetime. One place for sustainability concepts/ideas is ibiblio.org. It is the people's public library. Carolina Minds has been set up for retiring professors/those no longer living.

Comment made that whatever standard/policy is adopted, higher level metadata will make the dataset more useful. Incentives may help in having a more complex metadata hierarchy.

Comment made that incentives are built into the way science is done. Many are looking for collaborations. In looking at GenBank as a model - what would make more than the EB field interested in the data repository?

Hilmar provided a brief history of GenBank's development. Initially, journals said they would not publish all of the sequence data that was being created. This was a research result. Also, NIH mandated delivery to GenBank. Now there is much more supplemental data than the sequence data.

Comment made that the actual data points should be archived.

Comment made that the copyright lives with the author. Concept of toll-free phone number link. Voluntary as to what author wants to contribute.

Harold stated that secondary material used to be in an appendix as part of the additional materials. Simply giving all data to journals would create a copyright issue (XXXXCan someone clarify for me? rm)

Comment made that journals should require the author to provide data that could be used to recreate the analysis. We are talking about data tables, tables of coordinates for figures, photos, in an easy-to-store place.

Don raised a question regarding video clips. Is the data the video clip or the data surrogate? Different levels of extraction exist. We want to capture the extracted data that goes into publication.

For published data, will not need to include other types of media. Extraction is interpretation.

When creating a policy/standard, we need to state it in a way that is clear but leaves latitude.

Hilmar explained the MIAME standard - an example from genomics. The community decided that the standard meant the ability to recreate the result from the given data.

Question? Where is "there" and who will be paying?

Comment - subscribers? NSF stance is that Academic libraries will fill in holes.

Comment - Journals can't charge to have data deposited.

Jane mentioned OCLC and their potential interest in the question of where this might live. Somewhere global and ubiquitous.

  • We need to sit down with editors and authors to define guidelines and limits.

Simplicity is key.

  • We need to provide some perspective on why we are doing it.

Data submission required for reviewers?

Make the requirement general; detailed screens amount to policing. Reviewers are a middle ground. Allow authors to submit, but do not require it. Once data is public, it is public. Provide an accession number from the database (like GenBank). Is data to be immediately accessible?

Who pays? How much will this cost? The main cost is for administration. NSF has said it will NOT fund long term preservation. Will there be public funding? How many curators/administrators does this take?

NCBI has 30-50 curators.

Comment - we need to archive previous versions of the same datasets to prohibit malfeasance.

There are many areas of inquiry as the field evolves.

Geospatial data - there are many ways to represent it. With the system, we can track the evolution of standards and trends through recordkeeping.

Needs to be a base standard.

How do we research requirements? Brainstorm, suggestions for further meetings - who should participate?

Don - Should we start with editorials in journals? Follow up with poll? See how willing authors are?

Comment - If editors come to an agreement, use a limited standard. All data need to be in X format, open access, and for X years not used for other activities.

Todd - Should be separate from GenBank/TreeBase, etc.

Get authors/users used to the idea, then build later.

Jane asked whether data would be in more than one place. Either/or?

Ah - Will there be a portal? How can the portal lead to other related "data places"?

Comment - not needed now.

Ah - no, but good to think forward.

Mark - editorial boards are all functional scientists. Journals get together. Best version of joint policy, then go to editorial board. How encompassing should this be? Synthetic EB journals should have a baseline policy.

Don - should there be an editorial poll?

Mike - no, it should be top down.

Todd - are there folks we may want to include? European perspectives, Japanese editors; not necessary to include past editors.

Need a meta-analysis working group. Folks outside the field, but those who have overlapping standards.

Michael - Propose a model for most of the field. Give a model to pick at.

Don - Have something planned. Mike - here are X fields suggested.

Jane - concept of focus group. Delphi method? Meeting in July? Potential users.

Hilmar asked/stated....does this come from the top down? Perspective and requirements to come from journals. Users help to fine tune from there.

AH - Everyone should be aware that the material can go out there. Many younger folks look for an acceptable place? People don't contribute.

Comment - optional for a few years? Comment - what is critical is time. Comment - what is needed is concept of being told to do it. Hilmar - should be same for code.

Hilmar presented the following list as suggestions for "must have" requirements:

PRESERVATION

  • Some way to save materials ASAP - cost effective
  • Data quality
  • Security and curation
  • Use minimal set of metadata (as a start)
    • canonical representation
  • Level of data
    • data summaries
    • data points
  • Data lifecycle
  • Self-sustaining financially
  • Security

IP, LICENSE, ACCESS

  • Restricted access period
  • Depends on funding
  • Data published, or not?
  • Licensing scheme
  • Citation policy for data
  • Author control

SCOPE OF DATA PROJECT

  • Published data
  • Field data
  • Any format, standard, any type
  • Methods, source code

Question - can we get supplemental data up and running? Comments varied by journal.

Question from Jed - Is there an interest in the quality of the metadata if authors may not want to provide it? Answers - yes...

Discussion of description. Discovery and preservation ONLY supported. Jane mentions data curators.

Top priorities

Don -

  1. Originator intellectual property - author control
  2. Complete repository quickly
  3. Security
  4. Incentives for more rigorous metadata provision

Mark -

  1. Author control, restricted access/licensing
  2. Security issues - who can contribute
  3. Financial stability
  4. Celerity of creation

Marcy -

  1. Data Quality - burden on author if standard format required?
  2. Financial sustainability
  3. Author control
  4. Source code
  5. Explicit policy on deposit

Michael -

  1. ASAP
  2. Minimal metadata
  3. - emphasize first two votes

There was a call for other suggestions/ideas. Comment on legacy datasets for those scholars retiring. Testamentary datasets - contact Karen Strier (U of Wisconsin Anthropology) about interest in collecting primate databases.

Obtaining feedback

  • Series of editorials
  • Utilize editorial boards
  • Journals create baseline policy and baseline metadata requirements and pass it down

Additional people to invite/involve

  • European perspective
  • Meta-analysis working group
  • Japanese perspective

Summary

Data Preservation

  • Save materials ASAP
    • cost effective
    • open door policy
  • Data quality
    • curation (data and metadata)
    • data security
  • Use minimal set of metadata (as a start)
    • canonical representation
  • Level of data
    • data summaries and data points
    • rule is to be able to recreate the results
  • Data lifecycle
    • versioning
    • audit trail?
  • Self-sustaining financial model
  • Security from attacks (e.g., D.o.S)

IP, access control & license

  • Dependency on funding source and whether data is published yet or not
  • Authors retain control
    • licensing scheme for access
    • restricted access period
    • citation policy for data

Scope of data for the repository

  • Published data only
    • field data valuable too though
  • Any format, any standard, any type
    • combined with easy ability to promote data and metadata to higher levels of structure and standards compliance
  • Methods
    • source code for custom software or algorithms
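
As an illustration only, several of the summary points above (minimal metadata, versioning with an audit trail, a restricted access period, and a citation that carries both the paper and the data) can be combined in one small record. The sketch below is a hypothetical example, not something specified at the workshop: the class names, field names, accession format, and citation wording are all assumptions made for the illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional
import hashlib


@dataclass(frozen=True)
class Version:
    """One immutable deposited version of a dataset (hypothetical)."""
    number: int
    deposited: date
    checksum: str  # SHA-256 of the deposited bytes, kept for the audit trail


@dataclass
class DepositRecord:
    """Hypothetical minimal metadata for one deposited dataset."""
    accession: str        # repository-assigned identifier (made-up format)
    title: str
    authors: List[str]
    paper_citation: str   # the article the data belong to
    license: str          # access/licensing terms chosen by the author
    embargo_until: Optional[date] = None  # restricted access period, if any
    versions: List[Version] = field(default_factory=list)

    def deposit(self, payload: bytes, when: date) -> Version:
        """Append a new version; earlier versions are never overwritten."""
        version = Version(
            number=len(self.versions) + 1,
            deposited=when,
            checksum=hashlib.sha256(payload).hexdigest(),
        )
        self.versions.append(version)
        return version

    def is_public(self, today: date) -> bool:
        """Data are public once any restricted access period has passed."""
        return self.embargo_until is None or today >= self.embargo_until

    def citation(self) -> str:
        """Cite the paper and the dataset together (assumed wording)."""
        latest = len(self.versions)
        return f"{self.paper_citation} Data: {self.accession}, version {latest}."
```

Keeping every version with its checksum is one simple way to address the concern about archiving previous versions of the same dataset, and putting the paper citation in the record means the data cannot be cited without the paper.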