Automatic information extraction from semi-structured Web pages by pattern discovery

Authors:
Chia-Hui Chang; Chun-Nan Hsu; Shao-Cheng Lui
Affiliations:
Department of Computer Science and Information Engineering, National Central University, Chungli, Tauyuan 320, Taiwan; Institute of Information Science, Academia Sinica, Nankang, Taipei 115, Taiwan; ChungHwa Telecommunication Laboratories, Yangmei, Tauyuan 326, Taiwan
Venue:
Decision Support Systems - Web retrieval and mining
Year:
2003

Abstract

The World Wide Web is now undeniably the richest and densest source of information, yet its structure makes it difficult to use that information systematically. This paper proposes a pattern discovery approach to the rapid generation of information extractors that extract structured data from semi-structured Web documents. Previous work in wrapper induction learns extraction rules from user-labeled training examples, but such labeling can be expensive in practical applications. In this paper, we introduce IEPAD (an acronym for Information Extraction based on PAttern Discovery), a system that discovers extraction patterns from Web pages without user-labeled examples. IEPAD applies several pattern discovery techniques, including PAT trees, multiple string alignment, and pattern matching algorithms. Extractors generated by IEPAD generalize to unseen pages from the same Web data source. We empirically evaluate IEPAD on an information extraction task over 14 real Web data sources. Experimental results show that, with extraction rules discovered from a single page, IEPAD achieves a 96% average retrieval rate, and with fewer than five example pages, it achieves a 100% retrieval rate for 10 of the 14 sample Web data sources.
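
The idea of discovering extraction patterns without labeled examples can be illustrated with a minimal Python sketch. It is not the IEPAD implementation: it assumes a page can be encoded as a sequence of tag tokens, it swaps in naive n-gram counting where the paper uses PAT trees, and it omits multiple string alignment entirely. The function names (`tokenize`, `find_record_pattern`, `extract`) and the sample page are hypothetical.

```python
import re
from collections import defaultdict

# Split markup into tags and text runs (a simplification of real HTML parsing).
PIECE_RE = re.compile(r"<[^>]+>|[^<]+")

def tokenize(html):
    """Encode a page as (symbol, text) pairs: tags keep only their name,
    text runs collapse to the symbol TEXT and keep their content."""
    tokens = []
    for piece in PIECE_RE.findall(html):
        if piece.startswith("<"):
            m = re.match(r"<\s*(/?)\s*(\w+)", piece)
            if m:  # comments and doctype declarations are skipped
                tokens.append(("<" + m.group(1) + m.group(2).upper() + ">", None))
        elif piece.strip():
            tokens.append(("TEXT", piece.strip()))
    return tokens

def find_record_pattern(symbols, min_count=3, min_len=2):
    """Naive stand-in for PAT-tree search (an assumption of this sketch):
    return the longest n-gram seen at least min_count times, with its
    start positions. Overlapping occurrences are counted as well."""
    n = len(symbols)
    for length in range(n // min_count, min_len - 1, -1):
        starts = defaultdict(list)
        for i in range(n - length + 1):
            starts[tuple(symbols[i:i + length])].append(i)
        hits = [(p, s) for p, s in starts.items() if len(s) >= min_count]
        if hits:
            # Tie-break: most occurrences first, then earliest first position.
            return max(hits, key=lambda h: (len(h[1]), -h[1][0]))
    return None

def extract(tokens, pattern, starts):
    """Collect the text fields inside each occurrence of the pattern."""
    return [[text for _, text in tokens[s:s + len(pattern)] if text is not None]
            for s in starts]

# Hypothetical sample page with three similarly formatted records.
page = "<ul><li><b>A</b>1</li><li><b>B</b>2</li><li><b>C</b>3</li></ul>"
tokens = tokenize(page)
pattern, starts = find_record_pattern([sym for sym, _ in tokens])
records = extract(tokens, pattern, starts)
print(pattern)
print(records)  # [['A', '1'], ['B', '2'], ['C', '3']]
```

On this sample the discovered pattern is the six-token record template `<LI> <B> TEXT </B> TEXT </LI>`. Exact n-gram matching breaks as soon as records differ in structure (for example, a missing field), which is the kind of variation the abstract's multiple string alignment step is meant to absorb.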