Web data extraction based on partial tree alignment

Authors:
Yanhong Zhai; Bing Liu
Affiliations:
University of Illinois at Chicago, Chicago, IL; University of Illinois at Chicago, Chicago, IL
Venue:
WWW '05 Proceedings of the 14th international conference on World Wide Web
Year:
2005

Abstract

This paper studies the problem of extracting data from a Web page that contains several structured data records. The objective is to segment these data records, extract data items (fields) from them, and put the data in a database table. Several researchers have studied this problem, but existing methods still have serious limitations. The first class of methods is based on machine learning, which requires human labeling of many examples from each Web site from which data is to be extracted. This process is time-consuming given the large number of sites and pages on the Web. The second class of algorithms is based on automatic pattern discovery. These methods are either inaccurate or make many assumptions. This paper proposes a new method that performs the task automatically. It consists of two steps: (1) identifying individual data records in a page, and (2) aligning and extracting data items from the identified data records. For step 1, we propose a method based on visual information to segment data records, which is more accurate than existing methods. For step 2, we propose a novel partial alignment technique based on tree matching. Partial alignment means that we align only those data fields in a pair of data records that can be aligned (or matched) with certainty, and make no commitment on the rest of the data fields. This approach enables very accurate alignment of multiple data records. Experimental results on a large number of Web pages from diverse domains show that the proposed two-step technique segments data records and aligns and extracts their data very accurately.
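
The idea of partial alignment in step 2 can be illustrated with a small sketch. It is not the paper's algorithm: the abstract does not give the matching procedure, so this example assumes each data record is flattened to an ordered list of field labels and uses a longest-common-subsequence match in place of tree matching. The names `lcs_pairs` and `partial_align` and the field labels are hypothetical. An unmatched field from the new record is merged into the seed only when its position is unambiguous; otherwise it is deferred and no commitment is made.

```python
from typing import List, Tuple


def lcs_pairs(seed: List[str], other: List[str]) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) of one longest common subsequence.

    Assumption: stands in for tree matching on flattened records.
    """
    n, m = len(seed), len(other)
    # dp[i][j] = LCS length of seed[i:] and other[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if seed[i] == other[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    pairs = []
    i = j = 0
    while i < n and j < m:
        if seed[i] == other[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def partial_align(seed: List[str], other: List[str]) -> Tuple[List[str], List[str]]:
    """Merge fields of `other` into `seed` only where the position is certain.

    Returns (merged_seed, deferred_fields).
    """
    pairs = lcs_pairs(seed, other)
    # Virtual anchors before the first and after the last element.
    anchors = [(-1, -1)] + pairs + [(len(seed), len(other))]
    merged: List[str] = []
    deferred: List[str] = []
    for (s_prev, o_prev), (s_next, o_next) in zip(anchors, anchors[1:]):
        gap_seed = seed[s_prev + 1:s_next]     # seed fields between two anchors
        gap_other = other[o_prev + 1:o_next]   # unmatched new fields in the same gap
        if gap_other and not gap_seed:
            # The anchors are adjacent in the seed, so the slot is unique.
            merged.extend(gap_other)
        else:
            # Ambiguous (or nothing to insert): keep the seed, defer the rest.
            merged.extend(gap_seed)
            deferred.extend(gap_other)
        if s_next < len(seed):
            merged.append(seed[s_next])
    return merged, deferred


if __name__ == "__main__":
    print(partial_align(["a", "b", "e"], ["b", "c", "d", "e"]))
    print(partial_align(["a", "b", "e"], ["a", "x", "e"]))
```

In the first call, `c` and `d` fall between `b` and `e`, which are adjacent in the seed, so they are inserted. In the second call, `x` could go before or after `b`, so it is left unaligned rather than guessed.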