Web page title extraction and its application

Authors:
Yewei Xue;Yunhua Hu;Guomao Xin;Ruihua Song;Shuming Shi;Yunbo Cao;Chin-Yew Lin;Hang Li
Affiliations:
Department of Computer Science, Xi'an Jiaotong University, No. 28, Xianning West Road, Xi'an, Shanxi 710049, China;Department of Computer Science, Xi'an Jiaotong University, No. 28, Xianning West Road, Xi'an, Shanxi 710049, China;Microsoft Research Asia, Beijing, China;Microsoft Research Asia, Beijing, China;Microsoft Research Asia, Beijing, China;Microsoft Research Asia, Beijing, China;Microsoft Research Asia, Beijing, China;Microsoft Research Asia, Beijing, China
Venue:
Information Processing and Management: an International Journal
Year:
2007

Abstract

This paper is concerned with the automatic extraction of titles from the bodies of HTML documents (web pages). Authors are supposed to define the title of an HTML document correctly in its title field; in reality, however, title fields are often bogus. It is therefore useful to extract titles automatically from the document body. We take a supervised machine learning approach to the problem. We first propose a specification of HTML titles, that is, a definition of what counts as the title of an HTML document. We then employ two learning methods to perform the task: one uses features extracted from the DOM (Document Object Model) tree, and the other uses vision-based features. We also combine the two methods to further improve extraction accuracy. Our title extraction methods significantly outperform the baseline of taking the lines in the largest font size as the title (22.6-37.4% improvement in F1 score). As an application, we consider web page retrieval and propose a new method for HTML document retrieval that uses extracted titles, evaluated on the TREC Web Track data. Experimental results indicate that using both extracted titles and title fields is almost always better than using title fields alone; extracted titles are particularly helpful in the task of named page finding (25.1-30.3% improvement).
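
To make the baseline mentioned in the abstract concrete, the following is a minimal sketch of "take the text in the largest font size as the title". It is not the authors' implementation. It assumes that font size can be approximated from heading tags and the legacy `<font size>` attribute alone, with no CSS and no rendering. The names `TAG_SIZE`, `DEFAULT_SIZE`, `FontSizeCollector`, and `largest_font_title` are hypothetical and introduced only for illustration.

```python
from html.parser import HTMLParser

# Assumed mapping from heading tags to the legacy 1-7 font-size scale.
# A real system would read sizes from the rendered page instead.
TAG_SIZE = {"h1": 6, "h2": 5, "h3": 4, "h4": 3, "h5": 2, "h6": 1}
DEFAULT_SIZE = 3                       # assumed size of ordinary body text
SKIP = {"title", "script", "style"}    # the title field itself is not body text


class FontSizeCollector(HTMLParser):
    """Collects (size, text) pairs for the text found in a page body."""

    def __init__(self):
        super().__init__()
        self.stack = []       # (tag, size) for currently open sized tags
        self.skip_depth = 0   # > 0 while inside a skipped element
        self.units = []       # collected (size, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
        elif tag in TAG_SIZE:
            self.stack.append((tag, TAG_SIZE[tag]))
        elif tag == "font":
            # Only absolute sizes such as size="5" are understood here;
            # relative sizes such as "+2" fall back to the default.
            size = dict(attrs).get("size") or ""
            self.stack.append((tag, int(size) if size.isdigit() else DEFAULT_SIZE))

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.stack and self.stack[-1][0] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if text and not self.skip_depth:
            size = self.stack[-1][1] if self.stack else DEFAULT_SIZE
            self.units.append((size, text))


def largest_font_title(html):
    """Return all body text set in the largest font size, joined by spaces."""
    parser = FontSizeCollector()
    parser.feed(html)
    parser.close()
    if not parser.units:
        return ""
    top = max(size for size, _ in parser.units)
    return " ".join(text for size, text in parser.units if size == top)


page = """<html><head><title>Untitled Document</title></head>
<body><h1>Annual Report 2006</h1><p>Some body text.</p>
<font size="2">Footer</font></body></html>"""
print(largest_font_title(page))  # prints "Annual Report 2006"
```

In this example the title field ("Untitled Document") is uninformative, while the heading in the body carries the real title. When no text is larger than the default size, the sketch returns all body text, which illustrates why a font-size rule alone makes a weak baseline.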