NET – a system for extracting web data from flat and nested data records

Authors:
Bing Liu;Yanhong Zhai
Affiliations:
Department of Computer Science, University of Illinois at Chicago, Chicago, IL;Department of Computer Science, University of Illinois at Chicago, Chicago, IL
Venue:
WISE'05 Proceedings of the 6th international conference on Web Information Systems Engineering
Year:
2005

Citing 13
Cited 23

Identifying syntactic differences between two programs

Software—Practice & Experience
A hierarchical approach to wrapper induction

Proceedings of the third annual conference on Autonomous Agents
Wrapper induction: efficiency and expressiveness

Artificial Intelligence - Special issue on Intelligent internet systems
IEPAD: information extraction based on pattern discovery

Proceedings of the 10th international conference on World Wide Web
RoadRunner: Towards Automatic Data Extraction from Large Web Sites

Proceedings of the 27th International Conference on Very Large Data Bases
Data extraction and label assignment for web databases

WWW '03 Proceedings of the 12th international conference on World Wide Web
Table extraction using conditional random fields

Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval
Extracting structured data from Web pages

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Mining data records in Web pages

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Automatic web news extraction using tree edit distance

Proceedings of the 13th international conference on World Wide Web
Using the structure of Web sites for automatic segmentation of tables

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Fully automatic wrapper generation for search engines

WWW '05 Proceedings of the 14th international conference on World Wide Web
Web data extraction based on partial tree alignment

WWW '05 Proceedings of the 14th international conference on World Wide Web

A Survey of Web Information Extraction Systems

IEEE Transactions on Knowledge and Data Engineering
Extraction of flat and nested data records from web pages

AusDM '06 Proceedings of the fifth Australasian conference on Data mining and analystics - Volume 61
Mining templates from search result records of search engines

Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Recognition of Data Records in Semi-structured Web-Pages Using Ontology and Χ2 Statistical Distribution

ADMA '08 Proceedings of the 4th international conference on Advanced Data Mining and Applications
WRAPPER INFERENCE FOR AMBIGUOUS WEB PAGES

Applied Artificial Intelligence
Extracting data records from the web using tag path clustering

Proceedings of the 18th international conference on World wide web
Juicer: Scalable Extraction for Thread Meta-information of Web Forum

PAISI '09 Proceedings of the Pacific Asia Workshop on Intelligence and Security Informatics
Information extraction for search engines using fast heuristic techniques

Data & Knowledge Engineering
Extracting Structured Data from Web Pages with Maximum Entropy Segmental Markov Model

WISE '09 Proceedings of the 10th International Conference on Web Information Systems Engineering
Web news extraction based on path pattern mining

FSKD'09 Proceedings of the 6th international conference on Fuzzy systems and knowledge discovery - Volume 7
Automatic extraction of web data records containing user-generated content

CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
A novel method for bilingual web page acquisition from search engine web records

COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics: Posters
Online social network profile data extraction for vulnerability analysis

International Journal of Internet Technology and Secured Transactions
An indent shape based approach for web lists mining

WISM'11 Proceedings of the 2011 international conference on Web information systems and mining - Volume Part II
Extracting data records from query result pages based on visual features

BNCOD'11 Proceedings of the 28th British national conference on Advances in databases
Hybrid method for automated news content extraction from the web

WISE'06 Proceedings of the 7th international conference on Web Information Systems
Answering math queries with search engines

Proceedings of the 21st international conference companion on World Wide Web
TEX: An efficient and effective unsupervised Web information extractor

Knowledge-Based Systems
Towards Comparative Mining of Web Document Objects with NFA: WebOMiner System

International Journal of Data Warehousing and Mining
Web news extraction via path ratios

Proceedings of the 22nd ACM international conference on Conference on information & knowledge management
Robust detection of semi-structured web records using a DOM structure-knowledge-driven model

ACM Transactions on the Web (TWEB)
Semantic extraction of geographic data from web tables for big data integration

Proceedings of the 7th Workshop on Geographic Information Retrieval
Scalable and noise tolerant web knowledge extraction for search task simplification

Decision Support Systems

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper studies automatic extraction of structured data from Web pages. Each of such pages may contain several groups of structured data records. Existing automatic methods still have several limitations. In this paper, we propose a more effective method for the task. Given a page, our method first builds a tag tree based on visual information. It then performs a post-order traversal of the tree and matches subtrees in the process using a tree edit distance method and visual cues. After the process ends, data records are found and data items in them are aligned and extracted. The method can extract data from both flat and nested data records. Experimental evaluation shows that the method performs the extraction task accurately.