A methodical approach to extracting interesting objects from dynamic web pages

Authors:
Ling Liu;David Buttler;James Caverlee;Calton Pu;Jianjun Zhang
Affiliations:
College of Computing, Georgia Institute of Technology, USA;College of Computing, Georgia Institute of Technology, USA;College of Computing, Georgia Institute of Technology, USA;College of Computing, Georgia Institute of Technology, USA;College of Computing, Georgia Institute of Technology, USA
Venue:
International Journal of Web and Grid Services
Year:
2005

Abstract

This paper presents a fully automated object extraction system for web documents. Our methodology consists of a layered framework and a set of algorithms. A distinct feature of our approach is the full automation of both the extraction of data object regions from dynamic web pages and the identification of the correct object-boundary separators. We implemented the methodology in the XWRAPElite object extraction system and evaluated the system using more than 3200 pages from 75 diverse websites. Our experiments show three important and interesting results. First, our algorithms for identifying the minimal object-rich subtree achieve a 96% success rate over all the web pages we tested. Second, our algorithms for discovering and extracting object separator tags reach a success rate of 95%. Most significantly, the overall system achieves a precision between 96% and 100% (it returns only correct objects) and excellent recall (between 95% and 96%, with very few significant objects left out). The minimal subtree extraction algorithms and the object-boundary identification algorithms are fast, taking about 87 milliseconds per page with an average page size of 30 KB.
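The abstract names two steps, locating a minimal object-rich subtree and then finding the tag that separates objects inside it, but does not describe the algorithms themselves. The sketch below only illustrates the general idea and is not the XWRAPElite method. It assumes one simple stand-in heuristic: the object-rich subtree is the element whose children repeat the same tag most often, and that repeated tag is taken as the separator. The names `Node`, `TreeBuilder`, `find_object_rich_subtree`, and `PAGE` are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    """A minimal tag-tree node (hypothetical helper, not from the paper)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a simple tag tree from an HTML string."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def walk(node):
    """Yield the node and all of its descendants in preorder."""
    yield node
    for child in node.children:
        yield from walk(child)


def full_text(node):
    """Collect the whitespace-normalized text of a subtree."""
    return " ".join(" ".join(n.text for n in walk(node)).split())


def find_object_rich_subtree(root):
    """Assumed heuristic: pick the node whose children repeat one tag most often.

    Returns (node, separator_tag). Ties keep the first node found in preorder.
    """
    best, best_count, best_tag = root, 0, None
    for node in walk(root):
        counts = Counter(child.tag for child in node.children)
        if not counts:
            continue
        tag, count = counts.most_common(1)[0]
        if count > best_count:
            best, best_count, best_tag = node, count, tag
    return best, best_tag


PAGE = """
<html><body>
<div id="nav"><a href="/">Home</a><a href="/help">Help</a></div>
<table>
<tr><td>Book A</td><td>$10</td></tr>
<tr><td>Book B</td><td>$12</td></tr>
<tr><td>Book C</td><td>$15</td></tr>
</table>
</body></html>
"""

builder = TreeBuilder()
builder.feed(PAGE)
subtree, separator = find_object_rich_subtree(builder.root)
# Each child carrying the separator tag is treated as one extracted object.
objects = [full_text(c) for c in subtree.children if c.tag == separator]
print(subtree.tag, separator, objects)
```

On this toy page the navigation bar repeats `a` twice while the table repeats `tr` three times, so the table is selected and each row becomes one object. A fan-out count alone would be easily fooled on real pages by long menus or advertisement lists, which is presumably why the abstract describes a set of algorithms rather than a single rule.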