Stop word and related problems in web interface integration

Authors:
Eduard Dragut;Fang Fang;Prasad Sistla;Clement Yu;Weiyi Meng
Affiliations:
University of Illinois at Chicago;University of Illinois at Chicago;University of Illinois at Chicago;University of Illinois at Chicago;SUNY at Binghamton
Venue:
Proceedings of the VLDB Endowment
Year:
2009

Abstract

The goal of recent research projects on integrating Web databases has been to enable uniform access to the large amount of data behind query interfaces. The tasks addressed include source discovery, query interface extraction, and schema matching. A number of other tasks are commonly ignored or assumed to be solved a priori, either manually or by some oracle. These tasks include (1) finding the set of stop words and (2) handling occurrences of "semantic enrichment words" within labels. Both subproblems directly affect how the synonymy and hyponymy relationships between labels are determined. In (1), a word like "from" is a stop word in general, but it is a content word in domains such as Airline and Real Estate. We formulate the stop word problem, prove its complexity, and provide an approximation algorithm. In (2), we study the impact of words like AND and OR on establishing semantic relationships between labels (e.g., "departure date and time" is a hypernym of "departure date"). In addition, we develop a theoretical framework to differentiate synonymy relationships from hyponymy relationships among labels involving multiple words. We scrutinize its strengths and limitations both analytically and experimentally. Our experiments use real data from the Web: over 2300 labels from 220 user interfaces in 9 distinct domains.
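The two subproblems can be illustrated with a small sketch. It is not the paper's algorithm: the stop word lists, the per-domain exceptions, and the helper names (`content_words`, `components`, `is_hypernym`) are all hypothetical, and the rule for expanding AND/OR is a naive assumption that the words after the connective replace the same number of trailing words before it.

```python
import re

# Assumed, illustrative word lists; they are not taken from the paper.
GENERAL_STOP_WORDS = {"a", "an", "the", "of", "in", "from", "to"}
DOMAIN_CONTENT_WORDS = {
    "airline": {"from", "to"},
    "real estate": {"from", "to"},
}
CONNECTIVES = {"and", "or"}


def tokenize(label):
    """Lowercase a label and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", label.lower())


def content_words(label, domain):
    """Drop stop words, except those that carry content in this domain."""
    stop = GENERAL_STOP_WORDS - DOMAIN_CONTENT_WORDS.get(domain, set())
    return [t for t in tokenize(label) if t not in stop]


def components(label):
    """Split a label at its first AND/OR into a set of word sets.

    Naive assumption: the words after the connective replace the same
    number of trailing words before it, so "departure date and time"
    becomes {departure, date} and {departure, time}.
    """
    tokens = tokenize(label)
    idx = next((i for i, t in enumerate(tokens) if t in CONNECTIVES), None)
    if idx is None:
        return {frozenset(tokens)}
    left, right = tokens[:idx], tokens[idx + 1:]
    if 0 < len(right) < len(left):
        right = left[:len(left) - len(right)] + right
    return {frozenset(left), frozenset(right)}


def is_hypernym(general, specific):
    """True if `specific` covers a proper subset of `general`'s components."""
    parts = components(general)
    sub = components(specific)
    return len(parts) > 1 and sub <= parts and sub != parts


print(content_words("Leaving from", "airline"))
print(content_words("Price from", "books"))
print(is_hypernym("departure date and time", "departure date"))
```

Under these assumptions, "from" survives in the airline domain but is dropped in a domain with no exception listed, and "departure date and time" is reported as a hypernym of "departure date" but not the other way around.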