Semantics in Support of Biodiversity Knowledge Discovery: An Introduction to the Biological Collections Ontology and Related Ontologies
The study of biodiversity spans many disciplines and includes data pertaining to species distributions and abundances, genetic sequences, trait measurements, and ecological niches, complemented by information on collection and measurement protocols. A review of the current landscape of metadata standards and ontologies in biodiversity science suggests that existing standards such as the Darwin Core terminology are inadequate for describing biodiversity data in a semantically meaningful and computationally useful way. Existing ontologies, such as the Gene Ontology and others in the Open Biological and Biomedical Ontologies (OBO) Foundry library, provide a semantic structure but lack many of the necessary terms to describe biodiversity data in all its dimensions. In this paper, we describe the motivation for and ongoing development of a new Biological Collections Ontology, the Environment Ontology, and the Population and Community Ontology. These ontologies share the aim of improving data aggregation and integration across the biodiversity domain and can be used to describe physical samples and sampling processes (for example, collection, extraction, and preservation techniques), as well as biodiversity observations that involve no physical sampling. Together they encompass studies of: 1) individual organisms, including voucher specimens from ecological studies and museum specimens, 2) bulk or environmental samples (e.g., gut contents, soil, water) that include DNA, other molecules, and potentially many organisms, especially microbes, and 3) survey-based ecological observations. We discuss how these ontologies can be applied to biodiversity use cases that span genetic, organismal, and ecosystem levels of organization. We argue that if adopted as a standard and rigorously applied and enriched by the biodiversity community, these ontologies would significantly reduce barriers to data discovery, integration, and exchange among biodiversity resources and researchers.
Combine unsupervised learning and heuristic rules to annotate organism morphological descriptions
Hong Cui, Sriramu Singaram, Alyssa Janning|Proceedings of the American Society for Information Science and Technology|2011
Biodiversity literature is a comprehensive compilation of information on living organisms and fossils. Rich factual information on the characteristics of organisms is presented in narrative form, which limits its repurposing and reuse. Transforming narrative information into atomic forms has been of special concern to informatics researchers and biological researchers alike. Previous research shows similar results but lacks a detailed, scientific evaluation that would help illuminate the problem and eventually lead to a higher-performance approach. Because morphological descriptions are written in a sublanguage, general-purpose natural language processing (NLP) tools are thought to be ineffective in this application. A heuristic-based approach has been suggested in the literature. In this paper, we report our experiments with such an approach, where a set of simple, intuitive heuristic rules, informed by the results of an unsupervised learning algorithm, is used to segment taxonomic descriptions and identify the organs along with their associated character/value pairs (e.g., color=white, shape=ovoid). This model system allows us to investigate the character annotation problem further, study the characteristics of morphological descriptions, identify the areas where the system fails, and suggest ways to address those failures. One such suggestion is to make use of general-purpose syntactic parsers in a controlled manner.
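As a rough illustration of what segmenting a description and attaching character/value pairs to organs can look like, here is a minimal Python sketch. It is not the authors' system: the three heuristics and the tiny `ORGANS` and `CHARACTER_OF` lexicons are hypothetical stand-ins for terms that the paper's unsupervised learning step would supply.

```python
import re

# Hypothetical mini-lexicons, hand-written for illustration only. The paper's
# system learns such terms with an unsupervised algorithm, which this sketch
# does not reproduce.
ORGANS = {"leaves", "petals", "fruits", "stems"}
CHARACTER_OF = {
    "white": "color", "green": "color", "red": "color",
    "ovoid": "shape", "ovate": "shape", "linear": "shape",
}

def annotate(description):
    """Return a list of (organ, [(character, value), ...]) tuples."""
    annotations = []
    # Heuristic 1 (assumed): clauses are separated by semicolons or periods.
    for clause in re.split(r"[;.]", description.lower()):
        words = re.findall(r"[a-z]+", clause)
        # Heuristic 2 (assumed): the first known organ term is the clause subject.
        organ = next((w for w in words if w in ORGANS), None)
        if organ is None:
            continue
        # Heuristic 3 (assumed): every known state word describes that organ.
        pairs = [(CHARACTER_OF[w], w) for w in words if w in CHARACTER_OF]
        annotations.append((organ, pairs))
    return annotations

result = annotate("Petals white, ovate; fruits red, ovoid.")
```

Rules this simple break down on clauses that mention several organs or use unknown vocabulary, which is the kind of failure the abstract proposes to study.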
Evaluating the botanical coverage of PATO using an unsupervised learning algorithm
Alyssa Janning, Hong Cui|Proceedings of the 2012 iConference|2012
In this paper, we explore issues in adopting PATO as a standard phenotypic quality ontology for the biological community. Using CharaParser's unsupervised learning algorithm and the Stanford Parser, we extract morphological descriptions from Flora of North America to be matched to terms in PATO. Using the resulting data, we examine PATO's coverage of botanically interesting terms in order to find gaps and to determine accuracy. To maintain PATO's neutrality, we recommend that term definitions be reevaluated and propose that complementary ontologies be enhanced to close any outstanding gaps in terminology.
Linking data across sites in the Genomic Observatories network's Ocean Sampling Day.
(A) Ocean Sampling Day involves the simultaneous sampling of the world's oceans on a single day, as represented by the red stars on the map of the earth. Multiple ocean water sampling processes take place at each location. Those water samples are filtered to produce samples of organismal communities that are submitted to the bioarchive at the Smithsonian Institution. A subsample of the filtered material is analyzed to produce a metagenomic sequence, which may be stored in the Genomes Online Database (GOLD, http://www.genomesonline.org/cgi-bin/GOLD/index.cgi). To be useful in comparative studies, data from each process at each location must be accessible and interpretable. (B) A graphical representation of how part of the workflow shown in A (from ocean water sampling to filtering to metagenomic sequencing) can be annotated with terms from multiple, coordinated ontologies and queried via an ontology-based data store. Ontology classes are shown as ovals and instances are shown as rectangles, with instances color-coded to match their parent classes. This figure shows how a metagenomic sequence and the taxa associated with it can be linked back to the original Ocean Sampling Day collecting event through a chain of inputs and outputs.
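The chain of inputs and outputs described in this caption can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the `Process` record, the `trace_back` helper, and all identifiers such as `water_sample_1` are hypothetical, and a real ontology-based data store would hold these links as ontology-annotated instances rather than an in-memory list.

```python
from typing import List, NamedTuple, Optional

class Process(NamedTuple):
    """One workflow step: a named process that turns an input into an output."""
    name: str
    input_id: Optional[str]   # None for the first step in this toy chain
    output_id: str

# Hypothetical instances mirroring the workflow in the caption; these are
# placeholders, not real Ocean Sampling Day records.
PROCESSES = [
    Process("ocean water sampling", None, "water_sample_1"),
    Process("filtering", "water_sample_1", "filtered_sample_1"),
    Process("metagenomic sequencing", "filtered_sample_1", "metagenomic_sequence_1"),
]

def trace_back(entity_id: str) -> List[str]:
    """Follow output -> input links from an entity back to the first process."""
    by_output = {p.output_id: p for p in PROCESSES}
    steps = []
    while entity_id is not None and entity_id in by_output:
        process = by_output[entity_id]
        steps.append(process.name)
        entity_id = process.input_id
    return steps

steps = trace_back("metagenomic_sequence_1")
```

Starting from the sequence, `trace_back` walks through sequencing and filtering to the sampling step, which is the kind of query the caption says an ontology-based data store should support.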
Semantics in Support of Biodiversity: An Introduction to the Biological Collections Ontology and Related Ontologies
Ramona Walls, John Deck, Robert Guralnik et al.|PhilPapers (PhilPapers Foundation)|2014