This paper is concerned with the automatic extraction of titles from the bodies of HTML documents. Titles of HTML documents should be correctly defined in the title fields; in reality, however, HTML titles are often bogus, so it is desirable to extract titles automatically from the document bodies. This issue does not seem to have been investigated previously. We take a supervised machine learning approach to the problem. We propose a specification for HTML titles and use format information such as font size, position, and font weight as features in title extraction. Our method significantly outperforms the baseline method of using the lines in the largest font size as the title (20.9%-32.6% improvement in F1 score). As an application, we consider web page retrieval, using the TREC Web Track data for evaluation. We propose a new method for HTML document retrieval that uses extracted titles. Experimental results indicate that using both extracted titles and title fields is almost always better than using title fields alone; extracted titles are particularly helpful in the task of named page finding (23.1%-29.0% improvement).
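To make the baseline and the format features mentioned above more concrete, the following is a minimal Python sketch. It is not the paper's implementation: the `TextLine` record, the `baseline_title` and `format_features` functions, and the particular way each feature is normalized are all assumptions introduced for illustration. It assumes each line of the page body has already been rendered into text plus a font size, a bold flag, and a line position.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextLine:
    """One rendered line of an HTML body (hypothetical representation)."""
    text: str
    font_size: float  # rendered font size in points, assumed precomputed
    bold: bool        # stand-in for font weight
    position: int     # 0-based index of the line within the page body


def baseline_title(lines: List[TextLine]) -> str:
    """Largest-font baseline: join all non-empty lines set in the largest font size."""
    candidates = [line for line in lines if line.text.strip()]
    if not candidates:
        return ""
    largest = max(line.font_size for line in candidates)
    return " ".join(
        line.text.strip() for line in candidates if line.font_size == largest
    )


def format_features(line: TextLine, lines: List[TextLine]) -> List[float]:
    """Illustrative feature vector for a supervised title classifier.

    The three features mirror the kinds of format information named in the
    abstract (font size, font weight, position); the exact encoding is assumed.
    """
    largest = max(l.font_size for l in lines)
    return [
        line.font_size / largest,                # font size relative to the page maximum
        1.0 if line.bold else 0.0,               # font weight as a binary flag
        line.position / max(len(lines) - 1, 1),  # relative position in the body
    ]


page = [
    TextLine("Home | About | Contact", 10.0, False, 0),
    TextLine("Annual Report 2004", 24.0, True, 1),
    TextLine("This report summarizes the year.", 12.0, False, 2),
]

print(baseline_title(page))
print(format_features(page[1], page))
```

In a supervised setting, vectors like the one returned by `format_features` would be labeled as title or non-title and passed to a classifier of one's choice; the baseline needs no training and serves only as the point of comparison.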