Clustering web documents with tables for information extraction

  • Authors: Kostyantyn Shchekotykhin; Dietmar Jannach; Gerhard Friedrich
  • Affiliations: University of Klagenfurt, Klagenfurt, Austria; University of Klagenfurt, Klagenfurt, Austria; University of Klagenfurt, Klagenfurt, Austria
  • Venue: Proceedings of the 4th international conference on Knowledge capture
  • Year: 2007

Abstract

A common approach to extracting high-quality knowledge from Web sources is to exploit the redundancy of published information. A Web mining system therefore has to not only search for relevant Web pages but also determine whether two pages describe the same entity, so that it can extract as much knowledge as possible about that entity. Statistical clustering techniques have been shown to be generally suitable for this task, since they group documents that are expected to contain similar information. However, when data is given in tabular form, which is a typical way of describing items in online shops, existing document clustering algorithms show limited performance: documents containing tabular descriptions typically share a large common set of tokens even though they describe different entities. In this paper we therefore propose a new document clustering approach that exploits hyperlinks and document metadata to extract candidates for entity names. These candidate names are then used to cluster the documents and are refined further in the process; the refined names are finally used to determine whether two documents describe the same entity. A detailed evaluation of our approach in two popular example domains showed high accuracy in terms of precision and recall (F-measure 0.9).
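
To make the idea of clustering by candidate entity names more concrete, the following is a minimal sketch and not the authors' algorithm. It assumes each document is represented by a page title (standing in for document metadata) and the anchor texts of links pointing to it, takes the tokens of both as the candidate name, and groups documents whose token sets have a Jaccard similarity above a threshold. The document structure, the function names, the sample pages, and the threshold of 0.5 are all illustrative assumptions.

```python
import re
from itertools import combinations


def candidate_tokens(doc):
    """Collect lowercase tokens from a page title and incoming anchor texts.

    Assumption: title and anchors stand in for the metadata and hyperlinks
    mentioned in the abstract.
    """
    text = " ".join([doc["title"], *doc["anchors"]])
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_names(docs, threshold=0.5):
    """Single-link grouping of documents whose candidate names overlap."""
    names = {doc_id: candidate_tokens(doc) for doc_id, doc in docs.items()}
    parent = {doc_id: doc_id for doc_id in docs}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(docs, 2):
        if jaccard(names[a], names[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for doc_id in docs:
        clusters.setdefault(find(doc_id), set()).add(doc_id)
    return sorted(clusters.values(), key=lambda c: sorted(c))


# Hypothetical sample pages from two shops.
docs = {
    "shop_a/p1": {"title": "Canon EOS 400D", "anchors": ["Canon EOS 400D camera"]},
    "shop_b/p7": {"title": "Canon EOS 400D digital camera", "anchors": ["EOS 400D"]},
    "shop_a/p2": {"title": "Nikon D80", "anchors": ["Nikon D80 camera"]},
}
clusters = cluster_by_names(docs)
```

In this sketch the two Canon pages end up in one cluster and the Nikon page in another, because only the name tokens are compared; the table contents, which would share many tokens across all three pages, are ignored. The iterative refinement of names described in the abstract is not modeled here.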