Simultaneous record detection and attribute labeling in web data extraction

Authors:
Jun Zhu;Zaiqing Nie;Ji-Rong Wen;Bo Zhang;Wei-Ying Ma
Affiliations:
Tsinghua University, Beijing, China;Microsoft Research Asia, Beijing, China;Microsoft Research Asia, Beijing, China;Tsinghua University, Beijing, China;Microsoft Research Asia, Beijing, China
Venue:
Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2006

Abstract

Recent work has shown the feasibility and promise of template-independent Web data extraction. However, existing approaches use decoupled strategies: they perform data record detection and attribute labeling in two separate phases. In this paper, we show that extracting data records and attributes separately is highly ineffective, and we propose a probabilistic model that performs the two tasks simultaneously. In our approach, record detection benefits from the semantics made available by attribute labeling and, at the same time, the accuracy of attribute labeling improves when data records are labeled collectively. The proposed model, called Hierarchical Conditional Random Fields, can efficiently integrate all useful features by learning their importance, and it can also incorporate hierarchical interactions, which are very important for Web data extraction. We empirically compare the proposed model with existing decoupled approaches on product information extraction, and the results show significant improvements in both record detection and attribute labeling.
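
The contrast between decoupled and simultaneous labeling can be illustrated with a toy sketch. This is not the paper's Hierarchical Conditional Random Fields model: the two-level tree, the label sets, the scores in `NODE_SCORES`, and the weights in `PAIR_WEIGHTS` are all hypothetical values chosen for illustration, and inference is done by brute-force enumeration rather than by any algorithm from the paper. It only shows how a parent-child interaction term can change a child label that looks ambiguous in isolation.

```python
from itertools import product

# Hypothetical label sets: one parent node (a candidate record block)
# with child elements that may carry attributes.
PARENT_LABELS = ("record", "other")
CHILD_LABELS = ("name", "price", "noise")

# Hypothetical local evidence (log-scores) for two child elements.
# The first child is ambiguous and slightly favors "noise" on its own.
NODE_SCORES = [
    {"name": 1.0, "price": 0.0, "noise": 1.2},
    {"name": 0.0, "price": 2.0, "noise": 0.0},
]

# Hypothetical parent-child interaction weights; missing pairs score 0.
PAIR_WEIGHTS = {
    ("record", "name"): 1.0,
    ("record", "price"): 1.0,
    ("other", "noise"): 1.0,
}


def joint_score(parent, children, node_scores, pair_weights):
    """Unnormalized log-score of one joint labeling of the small tree."""
    score = 0.0
    for local, child in zip(node_scores, children):
        score += local[child]                            # local evidence
        score += pair_weights.get((parent, child), 0.0)  # hierarchical term
    return score


def decoupled_labels(node_scores):
    """Label each child independently, ignoring the parent entirely."""
    return tuple(max(CHILD_LABELS, key=local.get) for local in node_scores)


def joint_labels(node_scores, pair_weights):
    """Exhaustively search parent and child labels together."""
    candidates = (
        (parent, children)
        for parent in PARENT_LABELS
        for children in product(CHILD_LABELS, repeat=len(node_scores))
    )
    return max(
        candidates,
        key=lambda c: joint_score(c[0], c[1], node_scores, pair_weights),
    )
```

With these made-up numbers, `decoupled_labels(NODE_SCORES)` labels the first element as noise, while `joint_labels(NODE_SCORES, PAIR_WEIGHTS)` labels the parent as a record and the first element as a name, because the interaction term lets the strong price evidence on the sibling support the record hypothesis.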