Pattern-based extraction of addresses from web page content

Authors:
Saeid Asadi;Guowei Yang;Xiaofang Zhou;Yuan Shi;Boxuan Zhai;Wendy Wen-Rong Jiang
Affiliations:
School of Information Technology & Electrical Engineering, The University of Queensland, St. Lucia, Brisbane, Australia;School of Information Technology & Electrical Engineering, The University of Queensland, St. Lucia, Brisbane, Australia;School of Information Technology & Electrical Engineering, The University of Queensland, St. Lucia, Brisbane, Australia;School of Information Technology & Electrical Engineering, The University of Queensland, St. Lucia, Brisbane, Australia;School of Information Technology & Electrical Engineering, The University of Queensland, St. Lucia, Brisbane, Australia;School of Information Technology & Electrical Engineering, The University of Queensland, St. Lucia, Brisbane, Australia
Venue:
APWeb'08 Proceedings of the 10th Asia-Pacific web conference on Progress in WWW research and development
Year:
2008

Abstract

Extraction of addresses and location names from Web pages is a challenging task for search engines. Traditional information extraction and natural language processing models remain unsuccessful in the context of the Web because of the uncontrolled, heterogeneous nature of Web resources as well as the effects of HTML and other markup tags. We describe a new pattern-based approach for extraction of addresses from Web pages. Both HTML and vision-based segmentations are used to increase the quality of address extraction. The proposed system uses several address patterns and a small table of geographic knowledge to locate addresses and then itemize them into smaller components. The experiments show that this model can extract and itemize different addresses effectively without large gazetteers or human supervision.
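
As an illustration only, the idea of combining address patterns with a small table of geographic knowledge might be sketched as follows. This is not the authors' system: the regular expression, the STATES and STREET_TYPES tables, and the function name extract_addresses are all assumptions made for the example, and a plain tag-stripping step stands in for the HTML and vision-based segmentation the abstract mentions.

```python
import re

# Hypothetical "small table of geographic knowledge": Australian state and
# territory abbreviations. The table used in the paper is not reproduced here.
STATES = ("NSW", "QLD", "VIC", "TAS", "ACT", "SA", "WA", "NT")

# Hypothetical street-type keywords that anchor the address pattern.
STREET_TYPES = ("Street", "St", "Road", "Rd", "Avenue", "Ave", "Drive", "Dr")

# One assumed address pattern: number, street name, street type, suburb,
# state, four-digit postcode. Named groups itemize the address components.
ADDRESS_RE = re.compile(
    r"\b(?P<number>\d+[A-Za-z]?)\s+"
    r"(?P<street>[A-Z][A-Za-z]*(?:\s+[A-Z][A-Za-z]*)*?)\s+"
    r"(?P<street_type>" + "|".join(STREET_TYPES) + r")\b[\s,.]+"
    r"(?P<suburb>[A-Z][A-Za-z]*(?:\s+[A-Z][A-Za-z]*)*?)[\s,]+"
    r"(?P<state>" + "|".join(STATES) + r")\b[\s,]+"
    r"(?P<postcode>\d{4})\b"
)

TAG_RE = re.compile(r"<[^>]+>")


def extract_addresses(html):
    """Return a list of dicts, one per address found in the HTML string."""
    # Crude stand-in for page segmentation: replace tags with spaces so that
    # markup splitting an address across elements does not break the match.
    text = TAG_RE.sub(" ", html)
    text = re.sub(r"\s+", " ", text)
    return [m.groupdict() for m in ADDRESS_RE.finditer(text)]


page = "<p>Visit us at <b>12 Sir Fred Schonell Drive</b>,<br/>St Lucia QLD 4072</p>"
for address in extract_addresses(page):
    print(address)
```

The state table is what lets the pattern decide where the suburb ends without a large gazetteer; a real system would likely need more patterns and a more careful treatment of page structure than this sketch provides.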