TreeBASE Curation

From Dryad wiki

Status: Some bibliographic cleanup was performed by Rosie Kilgore. Other work was performed by Yale. Dryad is not planning to perform further work on this workpackage, except for notifying TreeBASE of any inconsistencies we find.

All TreeBASE records need to be curated. The work at NESCent will be prioritized in this order:

  • records associated with articles from the "old" Systematic Biology archives, 1995-2003 (which are already in Dryad)
  • records associated with the remainder of Systematic Biology articles, 2004-present (which will soon be in Dryad)
  • other TreeBASE content, in order of ascending accession numbers


Curation Process

At minimum, every item in TreeBASE needs to have a curated citation and GenBank IDs. If time allows, we will move on to taxon names, and after that to lat/long and other metadata.

As you are working with each item:

  • Note your progress in the Google spreadsheet status, along with your initials and the date
  • If you run into a problem, send your question to

Curating the citation:

  1. Retrieve a Legacy ID from the spreadsheet
  2. In TB2, search for the item by Legacy ID. If it is not found there, search TB1
  3. Fill in missing information in EndNote (Journal, volume, number, pages, keywords, full citation, DOI, abstract)
    1. If the item has been published, make sure the Label field does not say "in press"
  4. Download a copy of the PDF, and place it in the webdav.

Adding Dryad/Genbank IDs and other metadata:

  1. Search Dryad for the study name/authors. If this item exists in Dryad, create a link from the Dryad object to the TreeBASE object (dc.relation to the ID)
  2. For each matrix in TreeBASE, view the matrix details (magnifying glass) and Export Row Segment Template. This will create a tab-separated template file.
  3. Save the template in the TreeBASE svn, in a directory named after the Legacy ID.
  4. For the remainder of the steps, fill in the appropriate cells in the template. Beware of these rules:
    • Don't modify the first column, because TreeBASE will use the contents to match the entries back to the existing entry.
    • Always save the template in tab-delimited form!
    • Sometimes a matrix row was created from multiple samples. In this case, duplicate the row in the spreadsheet. Modify the start index and end index, then fill in the rest of the row as appropriate.
  5. Metadata to find:
    • genbank IDs (most important)
    • full taxon names (store in Sample Taxon Label)
    • lat/longs
    • specimen museum/collection/ID records
    • locality info
    • "proper" name of the gene
    • fungal cultures: the reference numbers often exist in TB (starting with C?); look them up in the Japanese and Dutch fungal databases and store them in Other Accession Num
  6. Review these locations to identify metadata:
    • appended to taxon name (in column A of the template)
    • the nexus file in TreeBASE
    • article pdf
    • supplemental material at the journal website
    • search GenBank by study name/author (both "standard" and PopSet)
    • other material in Dryad
    • search GenBank by blasting the sequence from the nexus file
  7. Finally, see whether branch lengths are available in the original nexus file (e.g., in Dryad). Note this in the Google spreadsheet.
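
The template rules in step 4 (leave the first column untouched, keep the file tab-delimited, duplicate a row when it was built from multiple samples) can be sketched in Python. This is only an illustration under stated assumptions: the column names and positions used below are hypothetical, since the actual layout of the exported Row Segment Template is not described here.

```python
import csv
import io


def read_tsv(text):
    """Parse tab-delimited text into a list of rows (lists of strings)."""
    return list(csv.reader(io.StringIO(text), delimiter="\t"))


def write_tsv(rows):
    """Serialize rows back to tab-delimited text, never comma-separated."""
    buf = io.StringIO()
    csv.writer(buf, delimiter="\t", lineterminator="\n").writerows(rows)
    return buf.getvalue()


def split_row_segment(rows, row_id, segments, id_col=0, start_col=1, end_col=2):
    """Duplicate the row whose first column equals row_id, once per segment.

    segments is a list of (start_index, end_index) pairs. The first column
    is copied unchanged so TreeBASE can still match each row to its entry.
    The column positions are assumptions for this sketch.
    """
    out = []
    for row in rows:
        if row[id_col] != row_id:
            out.append(list(row))
            continue
        for start, end in segments:
            copy = list(row)  # first column stays exactly as exported
            copy[start_col] = str(start)
            copy[end_col] = str(end)
            out.append(copy)
    return out


# Hypothetical template with one matrix row built from two samples.
template = (
    "Row Label\tStart Index\tEnd Index\tGenBank Accession\n"
    "Homo_sapiens\t1\t1200\t\n"
)
rows = split_row_segment(read_tsv(template), "Homo_sapiens", [(1, 600), (601, 1200)])
print(write_tsv(rows))
```

After splitting, each duplicated row can be filled in separately (for example, one GenBank accession per segment) before the file is saved back to the svn directory.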


  • The Google spreadsheet (or later, the EndNote file) will be backed up to the TreeBASE svn weekly (or more frequently, depending on the volume of changes). The spreadsheet will be exported as a tab-separated file.

Plans from Feb 2010

The following plan of action was hatched at a meeting in Feb 2010 among Bill, Ryan, Hilmar, Kevin and Todd.

  1. Randomly identify 50 studies to manually determine the fraction of studies we can expect to have i) georeferences, ii) genbank accessions or identifiers, iii) specimen IDs or information encoded somehow in the taxon (OTU) labels
    • Goal: Estimate the effort required to extract and curate these metadata from the TB1 content.
  2. Write and run script that extracts GenBank ID/Acc or fungal IDs from taxon labels that have them encoded.
    • Output: Table of candidate GenBank accession numbers or fungal IDs for OTUs.
  3. Write and run script that, for those records for which step 2 fails, extracts the sequence, BLASTs it against GenBank, and returns the perfect matches.
    • Output: Table of candidate GenBank accession numbers for OTUs.
  4. Write and run script that, for those records for which step 3 fails, extracts the citation and matches it against citations in the POPSET section of GenBank, or directly against GenBank, and determines the sequence matching an OTU by specimen or voucher ID, or other sequence metadata information (e.g., sequence or locus name).
    • Output: Table of candidate GenBank accession numbers for OTUs.
  5. Write and run script that extracts the geo-reference or lat/long geo-coordinates, specimen or voucher ID, and other Dublin Core metadata (elevation, habitat, etc) from the GenBank records of identified sequence accession numbers. For those GenBank records that do not have this information, use the supplementary (online or not) information of the paper to extract this information manually where it exists.
    • Input: List of OTU labels and corresponding GenBank accession numbers.
    • Output: A table of geo-coordinates or geo-references, specimen/voucher ID, other metadata for each GenBank accession and OTU from the input (automatic and manual step), and a list of accession numbers for which this information could not be extracted (automatic step only).
  6. Prioritization: Journals with SOM (Syst Biol. and other Dryad partners) as well as trees with codes embedded in the OTU labels receive priority for all manual steps.
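
As a rough illustration of the label-extraction script in step 2, the sketch below scans taxon (OTU) labels for strings shaped like GenBank nucleotide accessions. It assumes the common accession shapes (one letter plus five digits, or two letters plus six or eight digits); the function name and sample labels are made up for the example, and every hit is only a candidate that still needs to be checked against GenBank.

```python
import re

# Candidate nucleotide accession shapes (an assumption of this sketch):
# 1 letter + 5 digits, 2 letters + 6 digits, or 2 letters + 8 digits.
# Lookarounds are used instead of \b because labels often join parts
# with underscores, which \b treats as word characters.
ACCESSION = re.compile(
    r"(?<![A-Za-z0-9])"
    r"([A-Z]\d{5}|[A-Z]{2}\d{6}|[A-Z]{2}\d{8})"
    r"(?![A-Za-z0-9])"
)


def candidate_accessions(labels):
    """Return a table of (label, candidate accession) pairs."""
    table = []
    for label in labels:
        for accession in ACCESSION.findall(label):
            table.append((label, accession))
    return table


# Hypothetical labels, not taken from TreeBASE.
labels = [
    "Homo_sapiens_AF123456",
    "Amanita muscaria U12345",
    "Quercus_alba",
    "Rana_pipiens_AY123456_AY654321",
]
for label, accession in candidate_accessions(labels):
    print(label, accession, sep="\t")
```

Strings with three leading letters, such as some culture collection numbers, do not match these shapes and would need a separate pattern.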

Storing Curated Content

  • All curated content will be stored in the TreeBASE subversion repository.
  • Metadata updates will be stored as tab-delimited files (exported from a spreadsheet)
  • Article PDFs will be stored privately, in a NESCent WebDAV space.
  • Items to track in a google spreadsheet
      • can the citation be found?
      • appendix found?
      • do taxon entries have trailing GenBank accession numbers?
        • if so, extract accession # from nexus file
      • is fungal culture number available?

Open Issues

  • Rosie cannot export matrix row templates, which makes it impossible to curate matrices. (cf. TreeBASE bugs 2854613, 2870903)
  • For publications whose DOIs are not properly linked (i.e., entering the DOI into the address bar returns an error message), both the DOI and the URL for the publication have been entered until the DOI problem is resolved. At the end of citation curation, we will follow up with publishers if the DOIs still do not resolve.
  • Some articles have two DOIs. It is not clear how to resolve this, so for now only one of them is used.
  • Journals and their start date for using DOIs:
    • Molecular Biotechnology - 2003
    • Systematic Botany - 2000
    • Mycologia - 2005
    • Molecular Biology and Evolution - 2003
    • Journal of Arachnology - none
    • American Journal of Botany - Nov. 2008
    • Herpetologica - 2002
    • Marine Biotechnology (formerly Molecular Marine Biology & Biotechnology) - 2000
    • Proceedings of the National Academy of Sciences - Feb. 27, 2001
    • Annales Zoologici - 2004
    • Paleobiology - 2000
    • Iheringia. Série Zoologia - 2001
    • Journal of Experimental Biology - 2003
    • Applied and Environmental Microbiology - Nov. 2006
    • The Bryologist - 2000
    • Science - 1997
    • Harvard Papers in Botany - 2005
    • Entomological Science - 2003
    • Rhodora - 2005
    • The Auk - 2000
    • Journal of Clinical Microbiology - 2006
    • Annals of the Missouri Botanical Garden - 2006
    • Microbiological Research - 2001
    • Plant Biology - 1999
    • Evolution - 2000
    • American Entomologist - none
    • Systematic Botany Monographs - none
    • Journal of Parasitology - 2000
    • Sydowia - none
    • Mycotaxon - none
    • American Museum Novitates - 2000
    • The Biological Bulletin - none
  • Bonner Zoologische Monographien does not have a website
  • Bulletin of the American Museum of Natural History and American Museum Novitates do not have DOIs, but they have handles
  • Fieldiana, published by the Field Museum of Natural History, has a website but neither DOIs nor handles
  • The current site for Harvard Papers in Botany only has links to the table of contents, so articles before 2005 have URLs that link directly to a PDF. The same approach has been used for other journals that did not have a website until a certain year (such as Rhodora and Iheringia) or that do not link to abstracts before a certain year.
  • Sydowia did not have a website until 2005
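
The manual check described above for DOIs that do not resolve (entering the DOI in the address bar and getting an error) could be partly automated. The following is a minimal sketch using only the Python standard library. It assumes DOIs are resolved through https://doi.org/, and the function names are hypothetical. A False result is only a hint: some publisher sites reject automated requests even when the DOI is valid.

```python
import urllib.error
import urllib.request
from urllib.parse import quote


def doi_to_url(doi):
    """Build a resolver URL from a bare or prefixed DOI string."""
    doi = doi.strip()
    for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
    return "https://doi.org/" + quote(doi, safe="/")


def doi_resolves(doi, opener=urllib.request.urlopen, timeout=10):
    """Return True if the resolver URL answers without an HTTP error.

    opener is injectable so the check can be exercised without network
    access; by default it is urllib.request.urlopen.
    """
    request = urllib.request.Request(doi_to_url(doi), method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return response.status < 400
    except urllib.error.URLError:  # also covers HTTPError (4xx/5xx)
        return False


# Building the URL needs no network access.
print(doi_to_url("doi:10.1000/182"))
```

Citations for which `doi_resolves` returns False could then be listed for follow-up with the publisher, keeping both the DOI and the URL in the record in the meantime.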

Metadata Enhancement Progress

Citation Improvement

  • By the end of the fall semester, Melanie Plaza had finished updating citation information for articles prior to 2010
  • When the spring semester resumes, we will systematically examine which 2010 articles have since been published, and update them accordingly

Manual Metadata Enhancement

  • Melanie performed a study to examine what percentage of studies have metadata available in the PDF. The results of a sample of 90 studies are as follows:


Only 2% of studies have lat/long data in the PDF, which is unfortunate given that this is fairly valuable information. By contrast, in about half of the studies the PDF contains GenBank accession numbers, which makes these somewhat low-hanging fruit.

  • Melanie and Junli have received training in basic approaches to extracting metadata from PDFs, and are generally tasked with finding and acquiring PDFs and entering the metadata. (However, with Melanie away in Australia, this effort will probably resume in earnest in the spring semester, when we will also seek to expand the team working on this.)

Automated Metadata Enhancement

  • We have performed a study to find taxon labels that likely have GenBank accession numbers embedded in them. This has produced a list of 23,000 rows of data matrices that appear to have these numbers readily available. The data are available here.
  • The next step is to match these accession numbers against GenBank data. For each accession number, extract:
    • Taxon Name: does it match with our mapping in TreeBASE?
      • No: examine these by eye and group them into either false positives or true accession numbers
      • Yes: assume that it is a true accession number
    • For each true accession number:
      • Extract relevant features and annotations from GenBank
      • Build a set of rowSegment records
    • Filter out rowSegment records if they already exist in TreeBASE
    • Upload remaining rowSegment records
      • Ideally, create an admin-controlled upload feature that uses the existing Hibernate application stack
      • Alternatively, upload records using Perl scripts, and then finish by resetting the Hibernate next rowSegment_id value
  • For DNA matrix rows that still lack GenBank accession numbers, we will try a BLAST approach
    • Download all sequence strings and format as FASTA
    • Perform seq-to-seq BLAST runs against GenBank, extracting match parameters
    • Filter out all but perfect matches, extract features and annotations, build rowSegment record set
    • Upload rowSegment recordset (as above)
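
The FASTA formatting and perfect-match filtering steps of the BLAST approach can be sketched as follows. The sketch assumes BLAST was run with the BLAST+ tabular output (`-outfmt 6`) in its default column order (qseqid, sseqid, pident, length, mismatch, gapopen, ...), and it treats a hit as perfect only when it is 100% identical, has no gaps, and spans the whole query. The function names and sample data are hypothetical.

```python
def to_fasta(sequences, width=60):
    """Format {row_id: aligned sequence} as FASTA text.

    Gap ("-") and missing ("?") characters from the matrix are removed,
    since they are alignment artifacts rather than sequence residues.
    """
    lines = []
    for row_id, seq in sequences.items():
        residues = seq.replace("-", "").replace("?", "")
        lines.append(">" + row_id)
        for i in range(0, len(residues), width):
            lines.append(residues[i:i + width])
    return "\n".join(lines) + "\n"


def perfect_matches(blast_tsv, query_lengths):
    """Return {query id: [subject ids]} for hits that match perfectly.

    blast_tsv is the text of a tabular BLAST report; query_lengths maps
    each query id to the length of the ungapped query sequence.
    """
    hits = {}
    for line in blast_tsv.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        qseqid, sseqid = fields[0], fields[1]
        pident = float(fields[2])
        length, mismatch, gapopen = int(fields[3]), int(fields[4]), int(fields[5])
        covers_query = length == query_lengths.get(qseqid)
        if pident == 100.0 and mismatch == 0 and gapopen == 0 and covers_query:
            hits.setdefault(qseqid, []).append(sseqid)
    return hits


# Hypothetical matrix row and BLAST report line.
print(to_fasta({"row1": "ACGT--ACGT??"}))
report = "row1\tAF123456.1\t100.000\t8\t0\t0\t1\t8\t101\t108\t1e-05\t16.4\n"
print(perfect_matches(report, {"row1": 8}))
```

A query with several perfect matches keeps all of them, so choosing among the candidates (for example, by taxon name) would remain a separate step before building the rowSegment record set.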