Automatically extracting the headline of online web articles has many applications in web mining and information retrieval. In this paper, we present TitleFinder, a content-based, domain- and language-independent approach for the unsupervised extraction of headlines from web articles. TitleFinder first uses a heuristic to select a candidate headline. In a second step, it compares the content of each text fragment in the HTML file with this candidate. We implemented four similarity measures for this comparison: two variants of cosine similarity, based on tf and tf-idf weighting schemes, an overlap scoring measure, and an aggregated metric that combines the scores of the other three. On a test set of 11,218 news web pages from 15 different domains, our method achieves high effectiveness and efficiency and outperforms approaches that rely on structural and visual features.
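The comparison step can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: the tokenizer, the smoothed idf computed over the page's own fragments, the normalization of the overlap score by the candidate's length, the unweighted mean used as the aggregated metric, and all function names (`tokenize`, `best_fragment`, and so on). It is not the TitleFinder implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase runs of word characters.
    return re.findall(r"\w+", text.lower())


def cosine(a, b):
    # Cosine similarity between two sparse term-weight mappings.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def idf_table(token_lists):
    # Assumed smoothed idf, treating each token list as one "document".
    n = len(token_lists)
    df = Counter(t for tokens in token_lists for t in set(tokens))
    return {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}


def tfidf_vector(tokens, idf):
    return {t: count * idf.get(t, 1.0) for t, count in Counter(tokens).items()}


def overlap(candidate_tokens, fragment_tokens):
    # Assumed overlap score: share of candidate terms found in the fragment.
    candidate_terms = set(candidate_tokens)
    if not candidate_terms:
        return 0.0
    return len(candidate_terms & set(fragment_tokens)) / len(candidate_terms)


def best_fragment(candidate, fragments):
    # Score every fragment against the candidate headline, return the best.
    cand = tokenize(candidate)
    frag_tokens = [tokenize(f) for f in fragments]
    idf = idf_table(frag_tokens + [cand])
    scored = []
    for text, tokens in zip(fragments, frag_tokens):
        s_tf = cosine(Counter(cand), Counter(tokens))
        s_tfidf = cosine(tfidf_vector(cand, idf), tfidf_vector(tokens, idf))
        s_overlap = overlap(cand, tokens)
        # Assumed aggregation: unweighted mean of the three scores.
        scored.append((text, (s_tf + s_tfidf + s_overlap) / 3.0))
    return max(scored, key=lambda pair: pair[1])


candidate = "Storm hits coast - Example News"  # e.g. taken from a <title> tag
fragments = [
    "Home",
    "Storm hits coast",
    "Subscribe to Example News today",
    "A storm hit the coast on Monday.",
]
headline, score = best_fragment(candidate, fragments)
print(headline, round(score, 3))
```

In this toy page the fragment that repeats the candidate's content words ranks first, while navigation and subscription text scores lower because it shares few or no terms with the candidate.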