As with many large organizations, the Government's data is partitioned in many different ways and is collected at different times by different people. The resulting massive data heterogeneity means that government staff cannot effectively locate, share, or compare data across sources, let alone achieve computational data interoperability. A case in point is the California Air Resources Board (CARB), which faces the challenge of integrating the emissions inventory databases of California's 35 air quality management districts into a single state inventory. This inventory must be submitted annually to the US EPA, which, in turn, must perform quality assurance tests on the state inventories and integrate them into a national emissions inventory used to track the effects of national air quality policies. The premise of our research is that the manual labor required for database wrapping and integration can be significantly reduced by automatically learning mappings from the data. In this research, we applied statistical algorithms to discover correspondences across comparable datasets. We have seen particular success with an information-theoretic model, called SIfT (Significance Information for Translation), that performs data-driven column alignments. We have applied SIfT to map the Santa Barbara County Air Pollution Control District's 2001 emissions inventory database to the California Air Resources Board statewide inventory database. A fully customizable interface to the SIfT toolkit is available at http://sift.isi.edu/, allowing users to create new alignments, navigate the information-theoretic model, and inspect alignment decisions. On a broader scale, this work makes strides toward addressing a central problem in data management: integrating legacy data.
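To make the idea of a data-driven, information-theoretic column alignment concrete, the following is a minimal sketch and not the SIfT implementation, whose details the abstract does not give. It assumes each column can be summarized by the tokens in its values, weights each token by its pointwise mutual information with the column, and pairs each source column with the target column whose weighted vector is most similar by cosine. The function names (`pmi_vectors`, `cosine`, `align_columns`) and the toy tables are hypothetical and exist only for illustration.

```python
import math
from collections import Counter


def pmi_vectors(table):
    """Map each column to {token: PMI(column, token)}, keeping positive PMI only.

    `table` is a dict of column name -> list of cell values (hypothetical format).
    """
    # Count tokens per column, using lowercase whitespace tokenization.
    col_tok = {
        col: Counter(tok for v in vals for tok in str(v).lower().split())
        for col, vals in table.items()
    }
    total = sum(sum(cnt.values()) for cnt in col_tok.values())
    tok_total = Counter()
    for cnt in col_tok.values():
        tok_total.update(cnt)

    vectors = {}
    for col, cnt in col_tok.items():
        col_total = sum(cnt.values())
        vec = {}
        for tok, n in cnt.items():
            # PMI = log( P(col, tok) / (P(col) * P(tok)) )
            pmi = math.log((n * total) / (col_total * tok_total[tok]))
            if pmi > 0:
                vec[tok] = pmi
        vectors[col] = vec
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def align_columns(source, target):
    """Pair each source column with its most similar target column."""
    sv, tv = pmi_vectors(source), pmi_vectors(target)
    return {s: max(tv, key=lambda t: cosine(sv[s], tv[t])) for s in sv}


# Toy data (invented for illustration; not taken from any emissions inventory).
source = {
    "fac_name": ["acme refinery", "delta power"],
    "pollutant": ["nox", "co", "nox"],
}
target = {
    "POL": ["co", "nox", "sox"],
    "FACILITY": ["acme refinery", "gamma cement"],
}
alignment = align_columns(source, target)
print(alignment)
```

Because the scores depend only on the values and not on the column names, a sketch like this pairs `fac_name` with `FACILITY` and `pollutant` with `POL` even though the names share no text. A real alignment tool would also need to handle numeric columns, ties, and columns with no counterpart.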