Automatic extraction of web data records containing user-generated content

Authors:
Xinying Song;Jing Liu;Yunbo Cao;Chin-Yew Lin;Hsiao-Wuen Hon
Affiliations:
Harbin Institute of Technology, Harbin, China;Harbin Institute of Technology, Harbin, China;Microsoft Research Asia, Beijing, China;Microsoft Research Asia, Beijing, China;Microsoft Research Asia, Beijing, China
Venue:
CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Year:
2010

Citing 25
Cited 4

Identifying syntactic differences between two programs

Software—Practice & Experience
Record-boundary discovery in Web documents

SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
The Tree-to-Tree Correction Problem

Journal of the ACM (JACM)
IEPAD: information extraction based on pattern discovery

Proceedings of the 10th international conference on World Wide Web
A brief survey of web data extraction tools

ACM SIGMOD Record
RoadRunner: Towards Automatic Data Extraction from Large Web Sites

Proceedings of the 27th International Conference on Very Large Data Bases
Data extraction and label assignment for web databases

WWW '03 Proceedings of the 12th international conference on World Wide Web
Extracting structured data from Web pages

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
A Fully Automated Object Extraction System for the World Wide Web

ICDCS '01 Proceedings of the The 21st International Conference on Distributed Computing Systems
Mining data records in Web pages

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Automatic web news extraction using tree edit distance

Proceedings of the 13th international conference on World Wide Web
Mining and summarizing customer reviews

Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Tree-Structured Template Generation for Web Pages

WI '04 Proceedings of the 2004 IEEE/WIC/ACM International Conference on Web Intelligence
Fully automatic wrapper generation for search engines

WWW '05 Proceedings of the 14th international conference on World Wide Web
Web data extraction based on partial tree alignment

WWW '05 Proceedings of the 14th international conference on World Wide Web
ViPER: augmenting automatic information extraction with visual perceptions

Proceedings of the 14th ACM international conference on Information and knowledge management
Simultaneous record detection and attribute labeling in web data extraction

Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
A Survey of Web Information Extraction Systems

IEEE Transactions on Knowledge and Data Engineering
Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data (Data-Centric Systems and Applications)

Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data (Data-Centric Systems and Applications)
Joint optimization of wrapper generation and template detection

Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Incorporating site-level knowledge to extract structured data from web forums

Proceedings of the 18th international conference on World wide web
Extracting data records from the web using tag path clustering

Proceedings of the 18th international conference on World wide web
Efficient record-level wrapper induction

Proceedings of the 18th ACM conference on Information and knowledge management
A structural support vector method for extracting contexts and answers of questions from online forums

EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2 - Volume 2
NET – a system for extracting web data from flat and nested data records

WISE'05 Proceedings of the 6th international conference on Web Information Systems Engineering

FoCUS: learning to crawl web forums

Proceedings of the 21st international conference companion on World Wide Web
Web-based closed-domain data extraction on online advertisements

Information Systems
An unsupervised method for author extraction from web pages containing user-generated content

Proceedings of the 21st ACM international conference on Information and knowledge management
Robust detection of semi-structured web records using a DOM structure-knowledge-driven model

ACM Transactions on the Web (TWEB)

Quantified Score

Hi-index	0.00

Visualization

Abstract

In this paper, we are concerned with the problem of automatically extracting web data records that contain user-generated content (UGC). In previous work, web data records are usually assumed to be well-formed with a limited amount of UGC, and thus can be extracted by testing repetitive structure similarity. However, when a web data record includes a large portion of free-format UGC, the similarity test between records may fail, which in turn results in lower performance. In our work, we find that certain domain constraints (e.g., post-date) can be used to design better similarity measures capable of circumventing the influence of UGC. In addition, we also use anchor points provided by the domain constraints to improve the extraction process, which ends in an algorithm called MiBAT (Mining data records Based on Anchor Trees). We conduct extensive experiments on a dataset consisting of forum thread pages which are collected from 307 sites that cover 219 different forum software packages. Our approach achieves a precision of 98.9% and a recall of 97.3% with respect to post record extraction. On page level, it perfectly handles 91.7% of pages without extracting any wrong posts or missing any golden posts. We also apply our approach to comment extraction and achieve good results as well.