Workshop May 2007 day1 summaries

Theme 1: Adoption and sustainability (Ahrash Bissell)

 * What models exist for long-term sustainability? Rather than talk about successes, we focused on examples of repositories that have failed, which do exist.
 * One failure is when a resource is built and people do not come
 * Another failure includes efforts that overload the front-end with consensus building
 * Another failure is a silo project that has no interoperability and never expands its user base
 * Another failure is lack of scalability when it is overly successful


 * Models for maximizing buy-in. There is general agreement about the value of sharing among scientists.  The challenge is getting participation.


 * Too much effort on the front end, raising the bar too high, makes it unlikely that scientists will participate.


 * Some possible carrots were discussed, although none seemed magical


 * We also discussed sticks, particularly the requirement of data submission in order to publish in a journal.


 * There is a tradeoff between getting data in with minimal metadata requirements and having rich metadata that would facilitate easy reuse.


 * Need sustained funding, ideally from an endowment plus some governmental support for operating expenses. Additional funding should be sought for applications that enhance the data and put other functionalities in place.

Interesting additional points

 * A memorable quote: "Every three moves of an archive is equivalent to one fire"
 * Educational uses of the repository could help spur the necessary cultural evolution in which data sharing is expected and understood, the way Genbank submission is taken for granted today.
 * Uses of DRIADE will themselves evolve, and we should be humble in anticipating exactly what it will look like in a few years

Q&As

 * Jane: Were there known models that would be a success?
 * Ahrash: Genbank was discussed, but the context of its inception is different.
 * Jane: So is what we are doing really new?
 * Brad H.: IMLS funded digital library projects tend to go off the map after funding period ends.
 * Gail: ICPSR is presented as a common model for success.
 * Stu: It is necessary to institutionalize these things.  Europe has a tradition of this, but the US does not.  Do not anticipate that the gov't will step up to this responsibility.
 * Derek: Community-source model in which self-interested institutional consortia are tapped
 * Brad: very similar to the open access arguments. There is pay-for-access for non-subscribers
 * Stu: pay-for-deposit is surprisingly successful and the cost is not necessarily prohibitive (Todd: esp. if it is rolled into page charges)
 * Ahrash: a business plan w/ diversified funding sources
 * Mike: From the NSF-CISE repositories workshop in Phoenix last month: people are tired of hearing about sustainability and see this as survival of the fittest for useful resources.  They should worry instead about really good services.

Theme 2: Intellectual property and provenance (Brad Hemminger)
Overview: For the most part, we felt the handling of datasets is very much like how articles are currently handled, except that datasets should be stored in repositories separate from the journal publishers, and that the datasets should be publicly available.

0. What types of data are researchers interested in
 * sharing: data supporting conclusions, possibly many types
 * protecting: don't want to share data on which they are doing continuing analyses, would prefer to have an embargo period. Might not want to share sophisticated analysis programs that have been developed by their laboratory.

''1. Which data licensing schemes are needed for researchers' needs in the small science community? What is needed beyond CC licenses? Is there a need for Material Transfer Agreement style license?''
 * Creative Commons: what licenses to use? Attribution at least. Less clear about other restrictions. The question of defining rights for science is under active discussion in Science Commons (http://sciencecommons.org/), and this group is encouraged to participate in that discussion.

2. In which way does the publisher's copyright policy need to be taken into account?
 * The datasets must be stored in publicly accessible archives; however, if the authors and the publisher agree to place a copy in the publisher-maintained archive, this is acceptable (any agreement obviously could not be exclusive given the deposit in a public archive).

''3. What liability concerns (from publishers) need to be considered? Not sure what the possible concerns are from the journal publishers' perspective.'' What is different from current paper publications? Maybe this is targeted towards the archive "publisher"? Assuming the datasets are deposited in public archives, it may be a question for the public archives (Genbank, institutional repository). One possible concern would be if an author requested an embargo period and the publisher didn't correctly implement this, and the material was released too early.

At what level of depth does provenance need to be recorded and provided?

Are data items
 * "co-owned" by all co-authors
 * owned by lead author only
 * each data item has a separate author

All authors sign off at submission, and one author is the contact (controlling author). While it would be easiest from the publisher's perspective to have a single author in charge of data items, it may become desirable to allow individual control. For instance, many journals now have authors indicate what their contributions to the article were, and these statements of contribution could (should) extend to the datasets. One author may have contributed one dataset, and another author may have contributed a second dataset. Also, if the data is reused from a prior article publication, then presumably the rights would be controlled by the submitter of that dataset. Thus, the authors can spell out the individual responsibilities; otherwise all control resides with the contact author.

(b) What about synthetic studies?
 * do you deposit data?: maybe in a situation where an author's work had not been published but they allowed us to incorporate and publish the data as part of our paper (like a work in progress, or personal communication from another author).
 * what level of citation should be provided? Handle just like papers are handled now, formal citation.
 * or not cite at all?: for example, summaries of a symposium where one paper covers 12-15 talks (which may not have had papers).

5. (a) To what extent does exposure of metadata records need to be coupled to the availability of the data? In general, we thought the metadata should be available regardless of the availability of the data, but that information providing a way to contact or to request access to the data should be provided.

(b) Should the metadata be under the control of the data owner? ''This depends on what is meant by "metadata". Most of the information taken from the original data should not be changed (titles of fields, meanings of data fields); however, things like the addition of keywords describing the dataset might be OK.''

Challenges

 * Need to understand (research)
 * what types of data people want to share,
 * what reasons people have for wanting to protect the data.
 * For instance, do we include MATLAB code? SAS programs? What data we are considering may dramatically change the answers to the questions we have addressed.
 * Participation is suggested in defining Science Commons licensing, as Creative Commons licensing seems to be the best solution, but needs to be adapted to science datasets (which is what Science Commons is doing).
 * what provisions for special cases should be allowed, for instance should embargos be supported?

Suggested Approaches

 * submit datasets to public archives (institutional and society/professional)
 * use a Creative Commons attribution license (or a similar Science Commons license) to keep the materials freely available.
 * datasets should be cited and used in essentially the same way that articles are currently.

Additional factors
Funding agencies' requirements for sharing: NSF, NIH, Burroughs Wellcome

Resources

 * http://www.bitlaw.com/copyright/database.html
 * http://www.law.duke.edu/cspd/science.html
 * http://hcil.cs.umd.edu/trs/2005-06/2005-06.pdf Parr's paper
 * Proposal: Data archiving for ecology and evolution journals

Discussion

 * Todd: Is it worth setting up guidelines that are not legally enforceable?
 * Paul: Liability clauses are important. If universities have significant overhead dedicated to your lab, they can assert rights.
 * Brad A: American Chemistry Association wants the copyright on data and are trying to block it. Societies do not all wear white hats. Why would Blackwell want to assert rights to data that they aren't even collecting?
 * Paul: And could they even if they wanted to?
 * Brad H: Elsevier is likely to be favorable to an initiative that is community driven. The best way to get something like this to happen is to get everybody on board.
 * Paul: Creative Commons licenses. Science Commons hasn't proposed anything usable yet.  We should include the Duke Center for the Study of the Public Domain (Jerry Reichman).  Law on databases is quite different from other works.
 * Harold: when publishers were changed a year ago, the attitude of the publisher toward profitability was an issue in choosing Oxford
 * Derek: As publishers change, the playing field changes.
 * Ahrash: We should be careful about setting precedent for licensing data; it's more about guiding behavior than about protecting yourself in a legal sense
 * Paul: The legal standing of "facts" may well change with new laws and court findings.
 * Brad H: Getting societies and journals to band together and direct publisher behavior is a great approach; keep doing it!

Topics

 * Federation is orthogonal to preservation
 * Question of trust of remote resources
 * Preservation requires making lots of copies
 * Recommendation: be more like CiteSeer

Key points

 * Preservation comes down to how much you trust resources and how persistent the resource is expected to be
 * There is no reason not to make lots of copies

CiteSeer
Citation-based preprint index in CS, Math, & related fields. Authors do nothing - it's all automatic. A key decision was to store a local copy and not depend on remote copies. This automatically, and inadvertently, provided a preservation functionality.

Data integrity

 * Real danger of overengineering the solution (e.g. public key registry, XML signatures)
 * Simplest achievable goal: maintain checksums of data
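
As a rough illustration of the "maintain checksums" goal, here is a minimal Python sketch. The function names (`sha256_of`, `build_manifest`, `verify`) and the choice of SHA-256 are assumptions made for this example; the group did not discuss a specific algorithm or layout.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map every file under root (by relative POSIX path) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or whose contents changed."""
    problems = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(rel_path)
    return problems
```

The manifest would be stored alongside (or apart from) the deposit and re-checked periodically; any path returned by `verify` signals a copy that should be restored from another replica.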

Distributed storage technology

 * If done correctly, it should be invisible to the user
 * Implementation details change over time
 * DRIADE should focus on a URI/DOI type interface
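
One way to picture the identifier-type interface is a resolver that maps a stable identifier to whichever copies currently exist, so storage can change without users noticing. The sketch below is only illustrative: the `Resolver` class, its methods, and the identifier and location strings are all hypothetical, not part of any DRIADE design.

```python
from dataclasses import dataclass, field


@dataclass
class Resolver:
    """Maps a stable identifier to the locations currently holding a copy."""

    _locations: dict[str, list[str]] = field(default_factory=dict)

    def register(self, identifier: str, location: str) -> None:
        """Record that a copy of the identified dataset lives at location."""
        copies = self._locations.setdefault(identifier, [])
        if location not in copies:
            copies.append(location)

    def retire(self, location_prefix: str) -> None:
        """Forget every copy held on a storage system being decommissioned."""
        for identifier in list(self._locations):
            self._locations[identifier] = [
                copy
                for copy in self._locations[identifier]
                if not copy.startswith(location_prefix)
            ]

    def resolve(self, identifier: str) -> str:
        """Return one current location; callers never see storage details."""
        copies = self._locations.get(identifier)
        if not copies:
            raise KeyError(f"no known copy for {identifier}")
        return copies[0]
```

Because users only ever hold the identifier, moving data from one storage system to another amounts to registering the new copy and retiring the old one.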

Key issues to consider

 * Observation: successful systems set the barrier to participation to zero.
 * automatically filled with "good enough" context/metadata
 * Empty boxes are intimidating, as evidenced by a lot of empty institutional repositories

Recommendations: Existing communication methods

 * What is available to pre-ingest? forums/wikis/blogs/email lists/etc.
 * Do these communication channels (forums, etc) already exist in the community?
 * If yes, archive and ingest their artifacts (eg uploaded files)
 * If no, create them (and then mine their contents)
 * Tools you use to discuss your research should be the same ones you use for other aspects of life
 * First order of business should be creating identifiers (DOIs if possible)
 * promote the datasets to 1st class web resources
 * Two modes
 * users upload data to DRIADE
 * DRIADE creates an ID that acts as a surrogate for the content in the remote specialized repository.
 * Provide community-specific linking services via OpenURL
 * Do not try to build a better search engine and compete w/ Google
 * Create a new discovery service
 * The idea is that DRIADE could offer a community-specific resolver that provides services appropriate for the evolution community - rather than setting up a new portal
 * Google Scholar is already OpenURL-aware
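
To make the OpenURL suggestion concrete, here is a small Python sketch that builds an OpenURL 1.0 (Z39.88-2004) key/value link for a journal article, which a community resolver could then use to look up associated datasets. The resolver address `RESOLVER_BASE`, the function name, and the sample DOI are hypothetical; only the key names follow the OpenURL journal format.

```python
from urllib.parse import urlencode

# Hypothetical address of a community-specific resolver.
RESOLVER_BASE = "https://resolver.example.org/openurl"


def article_openurl(doi: str, article_title: str, journal_title: str) -> str:
    """Build an OpenURL 1.0 key/value link describing a journal article."""
    params = [
        ("url_ver", "Z39.88-2004"),
        ("ctx_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
        ("rft_id", f"info:doi/{doi}"),
        ("rft.atitle", article_title),
        ("rft.jtitle", journal_title),
    ]
    return f"{RESOLVER_BASE}?{urlencode(params)}"


if __name__ == "__main__":
    # The DOI below is a made-up placeholder, not a real article.
    print(article_openurl("10.9999/example.2007.001", "An example study", "Example Journal"))
```

The point of the design is that the link carries only a description of the article; what services the resolver offers in response (datasets, related archives) can evolve without changing the links.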

Discussion
 * Mike: Hold on to everything, but use page rank for importance.
 * Brad A: Page rank won't find the 50 yr old dataset that I really want to use. The diversity of data is so enormous that it will be difficult to capture it automatically. Peer review matters.
 * Oya: The success of arXiv is the hidden quality control mechanism - it doesn't place a burden on the depositor.
 * Todd: The full text of the article provides the most valuable context for the data.
 * Joel: The 'simple' metadata should come from the submission process. The journal article is the gold standard.
 * Brad A: Metadata flagging fraud or unreliability would be useful, but page rankings will not be useful because average usage is so low that Poisson errors would predominate. Another idea is to keep the article open to peer review for 6 months after publication, and the data could be a useful part of that.
 * Derek: Genbank
 * Bradley A: Something about peer review is missing in the idea that all datasets are equal. If datasets are not well enough curated, they might as well have disappeared.  The idea that someone crawls through and finds stuff is unsatisfying.

Priorities

 * Preserve the data (urgent!)
 * Think of it as a conversation between humans supporting the need to recreate the experiment and do independent analysis
 * Unambiguous linkage between data and publications
 * Make it accessible (very skeptical of DOIs!)
 * Formulation of public policies pertaining to curation of data
 * Formalization of a vocabulary to describe relations among datasets, change policies, etc.
 * Automation
 * Support for automated data exchange and use
 * Schemas to capture machine-processable data

Data and applications: where does the complexity belong?

 * simple data with complex applications?
 * complex data with simple applications?
 * data and applications with a complex schema in between
 * it is not evident what the best choice is!

How can data standards be motivated and managed?

 * Stakeholders
 * Researchers
 * Publishers (Todd: does this mean societies and journals?)
 * Data curators
 * Funding agencies

What are motivations for adopting and enforcing standards?

 * (I didn't get this list down)

Curation issues
Author, machine, 3rd party, user (2.0)
 * Classes of metadata
 * What sort of metadata is important independent of the publications that reference it?

Who can edit datasets?

 * Assertion: Datasets should never change, but be versioned
 * Conventions concerning post-publication status needed
 * Who can edit metadata?
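
The "never change, but be versioned" assertion can be sketched as an append-only record, where a correction adds a new version and earlier versions stay retrievable. The class and field names below are assumptions for illustration only; no particular design was agreed on.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetVersion:
    """One immutable snapshot of a dataset."""

    number: int
    checksum: str
    content: bytes
    note: str


class VersionedDataset:
    """Append-only history: versions are added, never edited or removed."""

    def __init__(self, identifier: str, content: bytes, note: str = "initial deposit"):
        self.identifier = identifier
        self._versions: list[DatasetVersion] = []
        self.add_version(content, note)

    def add_version(self, content: bytes, note: str) -> DatasetVersion:
        """Record a correction or update as a new version."""
        version = DatasetVersion(
            number=len(self._versions) + 1,
            checksum=hashlib.sha256(content).hexdigest(),
            content=content,
            note=note,
        )
        self._versions.append(version)
        return version

    def get(self, number: Optional[int] = None) -> DatasetVersion:
        """Return a specific version, or the latest when number is omitted."""
        if number is None:
            return self._versions[-1]
        if not 1 <= number <= len(self._versions):
            raise KeyError(f"{self.identifier} has no version {number}")
        return self._versions[number - 1]
```

A publication that cited version 1 can still retrieve exactly those bytes after a later correction, which is one possible convention for post-publication status.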

Assertion

 * Metadata is a curatable object independent of the data - it is for the repository to manage, not the depositors

Some closing thoughts

 * What are the natural institutional homes for repositories?
 * How does death fit into the lifecycle?

Discussion

 * DOIs are intended for salable scholarly content, so policy and system in which they exist may not be appropriate for this case
 * The first order of the day is getting the stuff, then worrying about the engineering.  Promote a bottom-up growth of data standards.  There will be clear examples of where to start - prioritize where to put effort on standards.

Big Issues that emerged today

 * Who houses stuff?
 * What identifiers to use?
 * Useful to have a list of existing repositories to pick through and evaluate
 * What are the processes that can be ported from existing archives in their institutional context?