TitleFinder: extracting the headline of news web pages based on cosine similarity and overlap scoring similarity

Authors:
Hadi Mohammadzadeh;Thomas Gottron;Franz Schweiggert;Gerhard Heyer
Affiliations:
University of Ulm, Ulm, Germany;Universität Koblenz-Landau, Koblenz, Germany;University of Ulm, Ulm, Germany;Universität Leipzig, Leipzig, Germany
Venue:
Proceedings of the twelfth international workshop on Web information and data management
Year:
2012

Abstract

Automatically extracting the headline of online web articles has many applications in web mining and information retrieval. In this paper, we present TitleFinder, a content-based, domain- and language-independent approach for unsupervised extraction of the headline of web articles. TitleFinder first uses a heuristic to select a candidate headline. In a second step, the content of each text fragment in the HTML file is compared to the candidate headline. We implemented four types of similarity for this comparison: two variations of the cosine similarity based on tf and tf-idf weighting schemata, an overlap scoring similarity, and an aggregated metric combining the scores of the previous three similarities. On a test set of 11,218 news web pages from 15 different domains, our method achieves high effectiveness and efficiency and outperforms approaches that operate on structural and visual features.
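
The four similarity measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify tokenization, the idf formula, the exact overlap definition, or how the three scores are aggregated. The sketch therefore assumes whitespace tokenization, a smoothed idf, overlap defined as the fraction of candidate terms found in the fragment, and an unweighted mean as the aggregate. All function names (`tokenize`, `smoothed_idf`, `score_fragments`, and so on) are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; the abstract names none)."""
    return text.lower().split()


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def smoothed_idf(token_lists):
    """Smoothed idf over all token lists (assumed formula, not from the paper)."""
    n = len(token_lists)
    df = Counter(t for tokens in token_lists for t in set(tokens))
    return {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}


def tfidf(tokens, idf):
    """tf-idf weights: raw term frequency times idf."""
    return {t: c * idf.get(t, 0.0) for t, c in Counter(tokens).items()}


def overlap(candidate, fragment):
    """Fraction of distinct candidate terms that also occur in the fragment
    (one plausible reading of 'overlap scoring'; an assumption)."""
    cand = set(candidate)
    return len(cand & set(fragment)) / len(cand) if cand else 0.0


def score_fragments(candidate_text, fragment_texts):
    """Score every text fragment against the candidate headline."""
    cand = tokenize(candidate_text)
    frags = [tokenize(f) for f in fragment_texts]
    idf = smoothed_idf(frags + [cand])
    rows = []
    for tokens in frags:
        s_tf = cosine(Counter(cand), Counter(tokens))
        s_tfidf = cosine(tfidf(cand, idf), tfidf(tokens, idf))
        s_overlap = overlap(cand, tokens)
        rows.append({
            "tf": s_tf,
            "tfidf": s_tfidf,
            "overlap": s_overlap,
            # Unweighted mean of the three scores (assumed aggregation).
            "aggregate": (s_tf + s_tfidf + s_overlap) / 3.0,
        })
    return rows


if __name__ == "__main__":
    fragments = ["Home News Sport", "Storm hits the coast overnight", "Copyright 2012"]
    for text, row in zip(fragments, score_fragments("Storm hits coast", fragments)):
        print(text, row)
```

In this toy input, the second fragment shares all three candidate terms and therefore receives the highest score under every measure, while the navigation and copyright fragments score zero. Selecting the fragment with the highest score as the headline is a natural final step, though the abstract does not state the selection rule explicitly.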