Extracting unstructured data from template generated web documents

Authors:
Ling Ma; Nazli Goharian; Abdur Chowdhury; Misun Chung
Affiliations:
Illinois Institute of Technology; Illinois Institute of Technology; America Online Inc.; Illinois Institute of Technology
Venue:
CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
Year:
2003

Abstract

We propose a novel approach that identifies web page templates and extracts the unstructured data. Extracting only the body of the page and eliminating the template increases retrieval precision for queries that generate irrelevant results. We believe that reducing the number of irrelevant results encourages users to return to a given site to search. Our experimental results on several different web sites and on the whole cnnfn collection demonstrate the feasibility of our approach.