Web-scale information extraction with Vertex

Authors:
Pankaj Gulhane;Amit Madaan;Rupesh Mehta;Jeyashankher Ramamirtham;Rajeev Rastogi;Sandeep Satpal;Srinivasan H. Sengamedu;Ashwin Tengli;Charu Tiwari
Affiliations:
Yahoo! Labs, Bangalore, India;Yahoo! Labs, Bangalore, India;Microsoft IDC, Hyderabad, India;Yahoo! Labs, Bangalore, India;Yahoo! Labs, Bangalore, India;Microsoft IDC, Hyderabad, India;Yahoo! Labs, Bangalore, India;Microsoft IDC, Bangalore, India;Yahoo! Labs, Bangalore, India
Venue:
ICDE '11 Proceedings of the 2011 IEEE 27th International Conference on Data Engineering
Year:
2011

Cited 10

Unsupervised user-generated content extraction by dependency relationships
WISE'11 Proceedings of the 12th international conference on Web information system engineering

Extracting data records from web using suffix tree
Proceedings of the ACM SIGKDD Workshop on Mining Data Semantics

TEX: An efficient and effective unsupervised Web information extractor
Knowledge-Based Systems

DEQA: deep web extraction for question answering
ISWC'12 Proceedings of the 11th international conference on The Semantic Web - Volume Part II

Towards web-scale structured web data extraction
Proceedings of the sixth ACM international conference on Web search and data mining

Unsupervised wrapper induction using linked data
Proceedings of the seventh international conference on Knowledge capture

Web news extraction via path ratios
Proceedings of the 22nd ACM international conference on Conference on information & knowledge management

Intelligent and adaptive crawling of web applications for web archiving
ICWE'13 Proceedings of the 13th international conference on Web Engineering

Extraction and integration of partially overlapping web sources
Proceedings of the VLDB Endowment

Scalable and noise tolerant web knowledge extraction for search task simplification
Decision Support Systems

Abstract

Vertex is a wrapper induction system developed at Yahoo! for extracting structured records from template-based Web pages. To operate at Web scale, Vertex employs a host of novel algorithms for, among other tasks, (1) grouping similar structured pages in a Web site, (2) picking the appropriate sample pages for wrapper inference, (3) learning XPath-based extraction rules that are robust to variations in site structure, (4) detecting site changes by monitoring sample pages, and (5) optimizing editorial costs by reusing rules. The system is deployed in production and currently extracts more than 250 million records from more than 200 Web sites. To the best of our knowledge, Vertex is the first system to do high-precision information extraction at Web scale.
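
As a rough illustration of what applying XPath-based extraction rules to a template-based page can look like, here is a minimal sketch using only the Python standard library. It is not Vertex's implementation: the sample page, the RULES mapping, and the extract_record helper are all hypothetical names invented for this example, and the rules are written by hand rather than learned. The sketch assumes the page is well-formed markup, since the standard-library parser does not handle arbitrary HTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical template-generated page, kept well-formed so the
# standard-library XML parser accepts it.
PAGE = """
<html><body>
  <div id="product">
    <div class="ad">Sponsored</div>
    <h1 class="title">Acme Kettle</h1>
    <span class="price">$24.99</span>
  </div>
</body></html>
"""

# Hypothetical hand-written rules: attribute name -> XPath expression.
# They anchor on class attributes instead of positional paths, so an
# extra sibling element (such as the ad block) does not break them.
RULES = {
    "title": ".//h1[@class='title']",
    "price": ".//span[@class='price']",
}


def extract_record(page_source, rules):
    """Apply each XPath rule to the page and return one record as a dict."""
    root = ET.fromstring(page_source.strip())
    record = {}
    for attribute, xpath in rules.items():
        node = root.find(xpath)
        # A rule that matches nothing yields None for that attribute.
        has_text = node is not None and node.text
        record[attribute] = node.text.strip() if has_text else None
    return record


record = extract_record(PAGE, RULES)
print(record)
```

In this sketch a rule that matches nothing produces None for that attribute. A monitoring step could treat such misses on known sample pages as a hint that the site's template has changed, although how Vertex actually detects site changes is not described in this abstract.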