Title extraction from bodies of HTML documents and its application to web page retrieval
Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
This paper is concerned with the automatic extraction of titles from the bodies of HTML documents (web pages). Titles of HTML documents should be correctly defined in the title fields by the authors; in reality, however, they are often bogus. It is therefore advantageous to extract titles from HTML documents automatically. In this paper, we take a supervised machine learning approach to the problem. We first propose a specification of HTML titles, that is, a 'definition' of what counts as the title of an HTML document. Next, we employ two learning methods to perform the task. In one method, we use features extracted from the DOM (Document Object Model) tree; in the other, we use vision-based features. We also combine the two methods to further improve extraction accuracy. Our title extraction methods significantly outperform the baseline method of taking the lines in the largest font size as the title (improvements of 22.6-37.4% in F1 score). As an application, we consider web page retrieval, using the TREC Web Track data for evaluation. We propose a new method for HTML document retrieval that uses the extracted titles. Experimental results indicate that using both extracted titles and title fields is almost always better than using title fields alone; extracted titles are particularly helpful in the task of named page finding (improvements of 25.1-30.3%).
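To make the baseline and the evaluation measure concrete, the following is a minimal Python sketch of the largest-font heuristic and of an F1 computation. It is an illustration under stated assumptions, not the paper's implementation: the `TextUnit` structure, the function names, and the exact-match criterion used for scoring are all hypothetical, and the abstract does not say how font sizes are obtained or how an extracted title is judged correct.

```python
from dataclasses import dataclass


@dataclass
class TextUnit:
    """A line of page text with its font size (hypothetical structure)."""
    text: str
    font_size: float


def largest_font_title(units):
    """Baseline sketch: join, in document order, the lines in the largest font."""
    candidates = [u for u in units if u.text.strip()]
    if not candidates:
        return ""
    max_size = max(u.font_size for u in candidates)
    return " ".join(u.text.strip() for u in candidates if u.font_size == max_size)


def f1_score(predicted, gold):
    """F1 over (doc_id, title) pairs; exact match is an assumption made here."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    page = [
        TextUnit("Home | About", 10),
        TextUnit("Annual Report", 24),
        TextUnit("2004", 24),
        TextUnit("Body text of the page.", 12),
    ]
    print(largest_font_title(page))
    print(f1_score([("d1", "Annual Report 2004")], [("d1", "Annual Report 2004")]))
```

A learned extractor of the kind described above would replace `largest_font_title` with a classifier over DOM-based or vision-based features, while the scoring function could stay the same.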