TitleFinder: extracting the headline of news web pages based on cosine similarity and overlap scoring similarity

  • Authors: Hadi Mohammadzadeh; Thomas Gottron; Franz Schweiggert; Gerhard Heyer

  • Affiliations: University of Ulm, Ulm, Germany; Universität Koblenz-Landau, Koblenz, Germany; University of Ulm, Ulm, Germany; Universität Leipzig, Leipzig, Germany

  • Venue: Proceedings of the twelfth international workshop on Web information and data management
  • Year: 2012

Abstract

Automatically extracting the headline of online web articles has many applications in web mining and information retrieval. In this paper, we present TitleFinder, a content-based, domain- and language-independent approach for the unsupervised extraction of the headline of web articles. TitleFinder first uses a heuristic to select a candidate headline. In a second step, the content of each text fragment in the HTML file is compared to the candidate headline. We implemented four similarity measures for this comparison: two variations of the cosine similarity based on tf and tf-idf weighting schemata, an overlap scoring similarity, and an aggregated metric combining the scores of the previous three similarities. On a test set of 11,218 news web pages from 15 different domains, our method achieves high performance in terms of both effectiveness and efficiency and outperforms approaches that operate on structural and visual features.
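
The comparison step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the tokenizer, the idf formula, how the overlap score is normalized, or how the three scores are aggregated. The sketch assumes lowercase word tokenization, a smoothed idf computed over the page's own fragments, an overlap score defined as the share of candidate terms found in the fragment, and an unweighted mean as the aggregated metric. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase word tokens; the paper's tokenizer is not given.
    return re.findall(r"\w+", text.lower())


def cosine(a, b):
    # Cosine similarity between two sparse term-weight mappings.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def idf_table(token_lists):
    # Assumption: smoothed idf over the fragments of a single page.
    n = len(token_lists)
    df = Counter(t for tokens in token_lists for t in set(tokens))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf_vector(tokens, idf):
    return {t: c * idf.get(t, 0.0) for t, c in Counter(tokens).items()}


def overlap(candidate_tokens, fragment_tokens):
    # Assumption: fraction of distinct candidate terms present in the fragment.
    cand = set(candidate_tokens)
    return len(cand & set(fragment_tokens)) / len(cand) if cand else 0.0


def score_fragments(candidate, fragments):
    # Returns (fragment, tf cosine, tf-idf cosine, overlap, aggregate) tuples.
    cand = tokenize(candidate)
    frags = [tokenize(f) for f in fragments]
    idf = idf_table(frags + [cand])
    rows = []
    for text, tokens in zip(fragments, frags):
        s_tf = cosine(Counter(cand), Counter(tokens))
        s_tfidf = cosine(tfidf_vector(cand, idf), tfidf_vector(tokens, idf))
        s_ov = overlap(cand, tokens)
        # Assumption: the aggregated metric is an unweighted mean.
        rows.append((text, s_tf, s_tfidf, s_ov, (s_tf + s_tfidf + s_ov) / 3))
    return rows


def best_fragment(candidate, fragments):
    # Pick the fragment with the highest aggregated score.
    return max(score_fragments(candidate, fragments), key=lambda r: r[-1])[0]
```

For example, with a candidate such as the page's `<title>` text "Flood hits city centre - Example News" and the fragments "Home", "Flood hits city centre", and "Subscribe to our newsletter", `best_fragment` selects "Flood hits city centre", since the other two fragments share no terms with the candidate.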