Clustering web documents with tables for information extraction

Authors:
Kostyantyn Shchekotykhin;Dietmar Jannach;Gerhard Friedrich
Affiliations:
University of Klagenfurt, Klagenfurt, Austria;University of Klagenfurt, Klagenfurt, Austria;University of Klagenfurt, Klagenfurt, Austria
Venue:
Proceedings of the 4th international conference on Knowledge capture
Year:
2007

Abstract

A common approach to extracting high-quality knowledge from Web sources is to exploit the redundancy of published information. A Web mining system therefore has to not only search for relevant Web pages but also determine whether two pages describe the same entity, so that it can extract as much knowledge about that entity as possible. Statistical clustering techniques have been shown to be generally suitable for this task, because they group documents that are likely to contain similar information. However, when data is given in tabular form, which is a typical way of describing items in online shops, existing document clustering algorithms show limited performance: documents containing tabular descriptions typically share a large common set of tokens even though they describe different entities. In this paper we therefore propose a new document clustering approach that exploits hyperlinks and document metadata to extract candidates for entity names. The candidate names are then used to cluster the documents and are further improved, and the improved names are finally used to determine whether two documents describe the same entity. A detailed evaluation of the approach in two popular example domains showed its high accuracy in terms of precision and recall (F-measure of 0.9).
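
The general idea of clustering by candidate entity names can be illustrated with a minimal Python sketch. It is not the algorithm from the paper, whose details the abstract does not give. It assumes that a candidate name consists of the tokens shared by a page title (document metadata) and the anchor texts of hyperlinks pointing to the page, that names are compared by Jaccard similarity, and that clusters are formed greedily. The function names, the threshold of 0.5, and the example pages are all hypothetical.

```python
import re


def tokens(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def candidate_name(title, anchor_texts):
    """Assumed heuristic: keep title tokens that also occur in an anchor text.

    Falls back to all title tokens when nothing is shared.
    """
    title_tokens = tokens(title)
    anchor_tokens = set()
    for anchor in anchor_texts:
        anchor_tokens.update(tokens(anchor))
    shared = [t for t in title_tokens if t in anchor_tokens]
    return frozenset(shared or title_tokens)


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_by_name(docs, threshold=0.5):
    """Greedy single-pass clustering on candidate names.

    docs maps a document id to a (title, anchor_texts) pair. A document
    joins the first cluster whose name is similar enough; the cluster name
    is then narrowed to the tokens both names share, a simple stand-in for
    the name improvement step mentioned in the abstract.
    """
    clusters = []  # each entry is [name_tokens, [doc_ids]]
    for doc_id, (title, anchors) in docs.items():
        name = candidate_name(title, anchors)
        for entry in clusters:
            if jaccard(entry[0], name) >= threshold:
                entry[1].append(doc_id)
                entry[0] = entry[0] & name
                break
        else:
            clusters.append([name, [doc_id]])
    return [ids for _, ids in clusters]


# Hypothetical shop pages: two describe the same camera, one a different one.
example_docs = {
    "shopA/1": ("Canon PowerShot A640 - ShopA", ["Canon PowerShot A640"]),
    "shopB/7": ("Buy Canon PowerShot A640 digital camera", ["PowerShot A640 Canon"]),
    "shopA/2": ("Nikon Coolpix P5000 - ShopA", ["Nikon Coolpix P5000"]),
}
print(cluster_by_name(example_docs))
```

In this sketch the tokens that shop pages have in common, such as the shop name or generic words like "digital camera", drop out of the candidate name because they do not appear in the anchor texts, so pages are grouped by the entity they describe rather than by their shared table vocabulary.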