About

The "Euler project" is funded by NSF-IIS, Award #1118088 (in the III: Small category):

A Logic-Based, Provenance-Aware System for Merging Scientific Data under Context and Classification Constraints
(a.k.a. "Euler")

Abstract. There is a rich research literature on information integration (e.g., on data fusion, data integration, and data exchange, including schema matching, mapping, and composition), knowledge representation, ontologies, and semantic web technologies. However, there has been little prior work on the related problem of merging annotated datasets that already have largely compatible schemas, but where data values of some fields can come from (or link to) different concept hierarchies (taxonomies). Combining datasets into a single, consistent representation is a prerequisite for addressing many important scientific questions (e.g., those that rely on data expressed at broad spatial, temporal, and taxonomic scales). In practice, scientists combine multiple datasets manually, a time-intensive and error-prone process. In many application domains (e.g., biodiversity, ecology, systematics), data are often annotated with concepts from different but interrelated taxonomies. For instance, scientists who wish to combine datasets that record the presence or absence of species at given locations often face datasets that draw species names from different taxonomies. In such cases, merging datasets requires aligning the different taxonomies. However, even for aligned taxonomies (i.e., where formal articulation constraints are given), many different dataset merges are possible, including inconsistent or incomplete ones. These in turn can yield different or even contradictory outcomes in subsequent interpretations and downstream data analysis. The primary goals of this project are to develop new techniques at the interface of data integration, knowledge representation, and reasoning, and to empower scientists by giving them new tools for merging and "logically debugging" taxonomies and annotated datasets.
The proposed Euler toolkit will include a formal framework with a broad range of constraints and data types; novel provenance-based techniques to detect, explain, and repair inconsistencies in taxonomy alignments; and new techniques to reduce uncertainty in alignments.