Automatic hierarchical classification of structured deep web databases

Authors:
Weifeng Su;Jiying Wang;Frederick Lochovsky
Affiliations:
Hong Kong University of Science & Technology, Hong Kong;City University, Hong Kong;Hong Kong University of Science & Technology, Hong Kong
Venue:
WISE'06 Proceedings of the 7th International Conference on Web Information Systems Engineering
Year:
2006



Abstract

We present a method that automatically classifies structured deep Web databases according to a predefined topic hierarchy. We assume that every node of the topic hierarchy contains some manually classified databases, which serve as training databases. Each training database is probed with queries constructed from the node titles of the topic hierarchy, and the result counts reported by the database are used to represent its content. A new database can therefore be probed with the same set of queries and assigned to the node whose training databases are most similar to it. Specifically, a support vector machine classifier is trained at each internal node of the topic hierarchy on these training databases, and a new database is classified top-down, level by level. A feature extension method is proposed to create discriminant features. Experiments on real structured Web databases collected from the Internet show that this classification method is quite accurate.
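
The probe-and-classify pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The support vector machine at each internal node is replaced here by a nearest-centroid classifier using cosine similarity, chosen only to keep the example dependency-free, and the feature extension step is omitted because the abstract does not describe it. The hierarchy, the probe queries, the result counts, and every function name (`probe_vector`, `collect`, `classify`) are hypothetical.

```python
import math

# Hypothetical toy hierarchy: each internal node maps to its child nodes.
HIERARCHY = {
    "Root": ["Books", "Autos"],
    "Books": ["Fiction", "Science"],
}

# Assumption: one probe query per node title.
PROBES = ["books", "autos", "fiction", "science"]

# Made-up result counts for manually classified training databases,
# one count per probe query, in the order given by PROBES.
TRAINING = {
    "Fiction": [[800, 2, 700, 30], [500, 1, 450, 10]],
    "Science": [[600, 3, 20, 550]],
    "Autos": [[4, 900, 1, 2], [10, 1200, 0, 5]],
}


def probe_vector(count_fn):
    """Build a feature vector by asking a database for each probe's result count.

    count_fn is a stand-in for submitting a query to a real search interface.
    """
    return [count_fn(query) for query in PROBES]


def normalize(counts):
    """Scale counts to unit length so that database size does not dominate."""
    norm = math.sqrt(sum(c * c for c in counts))
    return [c / norm for c in counts] if norm else list(counts)


def cosine(a, b):
    """Cosine similarity between two count vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def collect(node, training):
    """Gather the training vectors of a node and of all its descendants."""
    vectors = list(training.get(node, []))
    for child in HIERARCHY.get(node, []):
        vectors.extend(collect(child, training))
    return vectors


def centroid(vectors):
    """Average the unit-length versions of the given vectors."""
    unit = [normalize(v) for v in vectors]
    return [sum(column) / len(unit) for column in zip(*unit)]


def classify(counts, training, node="Root"):
    """Descend from the root, choosing the most similar child at each level."""
    while node in HIERARCHY:
        node = max(
            HIERARCHY[node],
            key=lambda child: cosine(counts, centroid(collect(child, training))),
        )
    return node


if __name__ == "__main__":
    new_database = [1200, 6, 1000, 40]  # made-up counts for an unseen database
    print(classify(new_database, TRAINING))
```

Because a decision is made only among the children of the current node, each level of the descent needs just one small classifier. A real system would replace the centroid comparison with a trained SVM at each internal node, as the abstract states.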