Title extraction from bodies of HTML documents and its application to web page retrieval

Authors:
Yunhua Hu;Guomao Xin;Ruihua Song;Guoping Hu;Shuming Shi;Yunbo Cao;Hang Li
Affiliations:
Xi'an Jiaotong University, Xi'an, China;Peking University, Beijing, China;Microsoft Research Asia, Beijing, China;University of Science and Technology of China, Hefei, China;University of Science and Technology of China, Hefei, China;University of Science and Technology of China, Hefei, China;University of Science and Technology of China, Hefei, China
Venue:
Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Year:
2005

Abstract

This paper addresses the automatic extraction of titles from the bodies of HTML documents. Titles of HTML documents should be correctly defined in the title fields; in practice, however, HTML titles are often bogus. It is therefore desirable to extract titles automatically from the document bodies, a problem that does not seem to have been investigated previously. We take a supervised machine learning approach to the problem. We propose a specification for HTML titles and use format information such as font size, position, and font weight as features in title extraction. Our method significantly outperforms the baseline of taking the lines in the largest font size as the title (20.9%-32.6% improvement in F1 score). As an application, we consider web page retrieval, using the TREC Web Track data for evaluation. We propose a new method for HTML document retrieval using extracted titles. Experimental results indicate that using both extracted titles and title fields is almost always better than using title fields alone; extracted titles are particularly helpful in the task of named page finding (23.1%-29.0% improvement).
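
To make the baseline mentioned in the abstract concrete, the following is a minimal sketch of a "largest font size" title heuristic. It is not the authors' implementation: the `Line` record, its field names, and the assumption that the body has already been split into lines with a numeric font size are all hypothetical choices made for illustration. The supervised method described in the abstract would instead feed such format features (font size, position, font weight) to a learned model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    """One line of an HTML body with illustrative (assumed) format features."""
    text: str
    font_size: int      # assumed numeric size, e.g. in points
    position: int       # 0-based order of the line within the body
    bold: bool = False  # stands in for font weight; unused by the baseline


def largest_font_baseline(lines: List[Line]) -> str:
    """Return the text of all non-empty lines set in the largest font size.

    Lines that tie for the largest size are joined in document order.
    """
    candidates = [ln for ln in lines if ln.text.strip()]
    if not candidates:
        return ""
    max_size = max(ln.font_size for ln in candidates)
    picked = sorted(
        (ln for ln in candidates if ln.font_size == max_size),
        key=lambda ln: ln.position,
    )
    return " ".join(ln.text.strip() for ln in picked)


# Hypothetical page: a navigation bar, a heading, and body text.
page = [
    Line("Home | About", font_size=10, position=0),
    Line("Annual Report 2004", font_size=24, position=1, bold=True),
    Line("This report summarizes the year.", font_size=12, position=2),
]
print(largest_font_baseline(page))  # prints "Annual Report 2004"
```

The heuristic ignores position and font weight entirely, which is one plausible reason a model that combines several format features could outperform it.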