Old:Vocabulary project

From Dryad wiki
STATUS: This page is no longer maintained and is of historical interest only.


Also see the HIVE wiki

Project Documents

  1. Keywords Coding

Encoding Key

The Encoding Key below can be used to explore the Excel spreadsheet.

  • 1: perfect match, preferred vocabulary term
  • 2: match, non-preferred vocabulary term
  • 3: partial match, preferred vocabulary term
  • 4: partial match, non-preferred vocabulary term
  • 5: no match
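
As a rough illustration of how the key might be applied when exploring the coded keywords, the sketch below tallies how often each code occurs. The function name `summarize_codes` and the sample codes are hypothetical and are not taken from the project spreadsheet; only the five labels come from the key above.

```python
from collections import Counter

# Labels copied from the Encoding Key above.
ENCODING_KEY = {
    1: "perfect match, preferred vocabulary term",
    2: "match, non-preferred vocabulary term",
    3: "partial match, preferred vocabulary term",
    4: "partial match, non-preferred vocabulary term",
    5: "no match",
}


def summarize_codes(codes):
    """Count how often each code occurs; reject values outside the key."""
    unknown = sorted({c for c in codes if c not in ENCODING_KEY})
    if unknown:
        raise ValueError(f"codes not in the Encoding Key: {unknown}")
    counts = Counter(codes)
    # Report every code, including those that never occur.
    return {code: counts.get(code, 0) for code in ENCODING_KEY}


# Hypothetical codes for one vocabulary column (not real project data).
sample_codes = [1, 1, 3, 5, 2, 5, 5]
for code, n in summarize_codes(sample_codes).items():
    print(f"{code} ({ENCODING_KEY[code]}): {n}")
```

A high share of code 5 for a candidate vocabulary would suggest it covers the keywords poorly.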

Possible Vocabularies

1. concept -- Gene Ontology (search on GO term) http://www.geneontology.org/ and KNB (trimmed of the taxonomy part, which overlaps our taxon name facet) http://knb.ecoinformatics.org/index.jsp

2. method -- MeSH browser http://www.nlm.nih.gov/mesh/MBrowser.html

3. place name -- This one's giving me trouble: Todd identified the Alexandria Digital Library Gazetteer, but its website seems to be broken: http://www.alexandria.ucsb.edu/

Or, there's geomancer, but this requires a download and installation -- it doesn't appear to be a searchable database. Anyone have any other ideas? http://www.museum.tulane.edu/geolocate/

4. taxon name -- ITIS http://www.itis.gov/ and uBio http://www.ubio.org/

5. person name -- can't find anything that's non-commercial (I assume we want non-commercial?)

6. institution name -- ?

7. anatomical aspect -- BioPortal would be the best candidate I think: http://www.bioontology.org/ncbo/faces/pages/search.xhtml

Otherwise, we would need to combine anatomical ontologies from different organisms:

  • zebrafish: http://zfin.org/cgi-bin/webdriver?MIval=aa-anatdict.apg&mode=search
  • FlyBase: http://flybase.bio.indiana.edu/static_pages/termlink/termlink.html
  • plant ontologies: http://www.plantontology.org/amigo/go.cgi

8. field or discipline -- KNB http://knb.ecoinformatics.org/index.jsp and MeSH browser http://www.nlm.nih.gov/mesh/MBrowser.html

9. habitat type -- KNB http://knb.ecoinformatics.org/index.jsp

10. time period -- International Commission on Stratigraphy -- http://www.stratigraphy.org/ This does not have an actual searchable database, but does have this list, which I think would work (?): http://www.stratigraphy.org/geowhen/geolist.html

11. gene -- The Gene Ontology site has a vocabulary of gene names (multi-organism) -- search on gene: http://www.geneontology.org/

Otherwise, this will probably require combining vocabularies from model organism databases:

  • OMIM: http://www.ncbi.nlm.nih.gov/sites/entrez?db=OMIM
  • TAIR: http://www.arabidopsis.org/ (search "gene")
  • FlyBase: http://flybase.bio.indiana.edu/
  • zebrafish: http://zfin.org/cgi-bin/webdriver?MIval=aa-newmrkrselect.apg

Possible Tools

Calais

Calais looks to be a commercial project, but could still be useful.

They have a simple interface available for testing.

Ryan tried some sample texts, and while it didn't identify everything, it did very well. It recognizes terms from several different vocabularies, including industry ("submission process"), technology ("search engine" and "API"), place names ("Australia"), personal names ("Roy Tennant"), and organizations ("NIH" and "Congress").

LexGrid (proposal from Google Summer of Code)

Rationale 
Phyloinformatics data are increasingly being typed and semantically "tagged" using ontologies. For example, see the PhyloDB model of BioSQL, and the EvoInformatics General Ontology group. As more documents and datasets are submitted to digital archives, standard "ontology services" for lookup of vocabularies, terms, and hierarchical relationships, as well as structured navigation, become increasingly important. Ontology services may be used by disciplines ranging from bioinformatics to digital library and information sciences. Although ontologies provide better organization and discovery than simple tagging systems, it is currently too difficult for individual users to apply ontology terms to their digital objects. A flexible system for applying ontologies will improve scientific data (and document) repositories as well as general-purpose repositories that currently rely on tagging systems.
Approach 
The LexGrid framework and BioPortal interface will form the core of this project. The student will create an interface that allows users to submit a text document (e.g., as extracted from the digital object) to LexGrid, browse the relevant ontologies for term matches and the sections they are in, and select terms to be applied as annotation to the digital object.
Challenges 
LexGrid is a relatively large project, and it will take some time to become accustomed to the codebase. A system must be developed for browsing multiple ontologies on a single web page, allowing users to select between similar terms with minimal effort.
Involved toolkits or projects 
The LexGrid framework, using the LexBIG and BioPortal source packages, will form the core of the service. Some features of the Ontology Lookup Service may be useful for creating the browsing interface.
Mentors 
Ryan Scherle, Daniel Rubin
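
The workflow described under Approach (submit text, see which ontology terms match and where, then pick terms as annotations) could be sketched roughly as below. This is only an illustration under stated assumptions: it does not call LexGrid, LexBIG, or BioPortal, the in-memory `VOCABULARIES` dictionary and its terms are hypothetical stand-ins for real ontology lookups, and simple whole-word matching stands in for whatever matching the real service would perform.

```python
import re

# Hypothetical toy vocabularies; a real service would query an ontology
# server rather than an in-memory dictionary.
VOCABULARIES = {
    "anatomy": ["pectoral fin", "leaf", "wing"],
    "habitat": ["coral reef", "grassland"],
}


def find_term_matches(text, vocabularies):
    """Return (vocabulary, term, start offset) for each whole-word hit."""
    matches = []
    for vocab_name, terms in vocabularies.items():
        for term in terms:
            # Word boundaries avoid matching "leaf" inside "leaflet".
            pattern = r"\b" + re.escape(term) + r"\b"
            for m in re.finditer(pattern, text, flags=re.IGNORECASE):
                matches.append((vocab_name, term, m.start()))
    # Sort by position so hits can be shown in document order.
    return sorted(matches, key=lambda hit: hit[2])


text = "Pectoral fin shape differs between coral reef and grassland stream fishes."
for vocab_name, term, offset in find_term_matches(text, VOCABULARIES):
    print(f"{offset:3d}  {vocab_name:8s}  {term}")
```

The character offset gives a browsing interface enough information to show each candidate term next to the part of the document where it occurs, so the user can accept or reject it.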

Other notes

The methods section of the PLoS ONE article "Inflated Impact Factors? The True Impact of Evolutionary Papers in Non-Evolutionary Journals" by Erik Postma describes how the author identified evolutionary biology papers using a keyword list that he derived himself from a body of (true positive) papers.

http://www.plosone.org/article/fetchArticle.action?articleURI=info:doi/10.1371/journal.pone.0000999

(from Brian O'Meara) Is there any sort of controlled vocabulary for any of the entries? Taxonomic names, place names, etc.? This may make the most sense in temporal coverage: Cretaceous? 20-60 MYA? June 4, 2007 to July 5, 2008? There are a limited number of options, but it will be hard for other people to use this metadata unless it's standardized. Someone looking up info on datasets on the Eocene would have to include the Eocene, any names of smaller periods within it, and some sort of text search on years (finding studies that say "54 MY to 40 MY", "about 45 million years ago", "from 53,000,000 to 50,000,000 years ago", etc.). This is much easier if dates are internally converted to some representation of years, months, days, perhaps even shorter intervals (since some studies might cover extremely short durations).

You could get into similar issues for places -- people interested in Eastern North America would be interested in data sets tagged just with "North Carolina" or "Durham", but they're not going to be able to search for every place name within Eastern North America on Dryad. You could georeference the boundaries, or, less helpful but still pretty useful, the center of each entry (in addition to storing the entered lat-lon, if provided) and then allow searches over a set of georeferenced points rather than over a large amount of full text.
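
A minimal sketch of the two ideas in this note, assuming time coverage is stored as a numeric range in years before present and each place as a single georeferenced point. All names here (`YearRange`, `BoundingBox`, the dataset labels) are hypothetical. The Eocene boundaries and the Durham coordinates are rounded approximations for illustration, and the "Eastern North America" box is an arbitrary rectangle, not an agreed definition.

```python
from dataclasses import dataclass

MY = 1_000_000  # one million years


@dataclass(frozen=True)
class YearRange:
    """Closed interval in years before present (older >= younger)."""
    older: float
    younger: float

    def overlaps(self, other):
        return self.younger <= other.older and other.younger <= self.older


@dataclass(frozen=True)
class BoundingBox:
    """Latitude/longitude rectangle; ignores antimeridian crossing."""
    south: float
    west: float
    north: float
    east: float

    def contains(self, lat, lon):
        return self.south <= lat <= self.north and self.west <= lon <= self.east


# Approximate boundaries; real values should come from a stratigraphic chart.
NAMED_PERIODS = {"Eocene": YearRange(56 * MY, 34 * MY)}

# Time coverage as it might be stored after normalizing free-text entries.
DATASET_TIMES = {
    "A": YearRange(54 * MY, 40 * MY),        # "54 MY to 40 MY"
    "B": YearRange(53_000_000, 50_000_000),  # "from 53,000,000 to 50,000,000 years ago"
    "C": YearRange(20 * MY, 10 * MY),        # hypothetical, outside the Eocene
}

# One georeferenced center point per dataset (approximate or hypothetical).
DATASET_POINTS = {
    "Durham": (35.99, -78.90),
    "West coast site": (47.6, -122.3),
}

# Arbitrary illustrative rectangle for "Eastern North America".
EASTERN_NORTH_AMERICA = BoundingBox(south=25.0, west=-90.0, north=50.0, east=-60.0)


def datasets_in_period(period_name):
    target = NAMED_PERIODS[period_name]
    return sorted(k for k, span in DATASET_TIMES.items() if span.overlaps(target))


def datasets_in_box(box):
    return sorted(k for k, (lat, lon) in DATASET_POINTS.items() if box.contains(lat, lon))


print(datasets_in_period("Eocene"))
print(datasets_in_box(EASTERN_NORTH_AMERICA))
```

With numeric ranges and points stored alongside the original text, a search for "Eocene" or "Eastern North America" becomes an interval or rectangle test rather than a full-text search over every possible phrasing.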