Automatically learning gazetteers from the deep web

Authors:
Tim Furche;Giovanni Grasso;Giorgio Orsi;Christian Schallhart;Cheng Wang
Affiliations:
University of Oxford, Oxford, United Kingdom;University of Oxford, Oxford, United Kingdom;University of Oxford, Oxford, United Kingdom;University of Oxford, Oxford, United Kingdom;University of Oxford, Oxford, United Kingdom
Venue:
Proceedings of the 21st international conference companion on World Wide Web
Year:
2012

Abstract

Wrapper induction faces a dilemma: to reach web scale, it requires automatically generated examples, but to produce accurate results, these examples must have the quality of human annotations. We resolve this conflict with AMBER, a system for fully automated data extraction from result pages. In contrast to previous approaches, AMBER employs domain-specific gazetteers to discern basic domain attributes on a page, and leverages repeated occurrences of similar attributes to group related attributes into records rather than relying on the noisy structure of the DOM. With this approach, AMBER identifies records and their attributes with almost perfect accuracy (98%) on a large sample of websites. To make such an approach feasible at scale, AMBER automatically learns domain gazetteers from a small seed set. In this demonstration, we show how AMBER uses the repeated structure of records on deep web result pages to learn such gazetteers. This is only possible with a highly accurate extraction system. Depending on its parametrization, this learning process runs either fully automatically or with human interaction. We show how AMBER bootstraps a gazetteer for UK locations in 4 iterations: from a small seed sample, we achieve 94.4% accuracy in recognizing UK locations in the fourth iteration.
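
The bootstrapping idea in the abstract can be illustrated with a toy sketch. This is not AMBER's actual algorithm, which the abstract does not specify. It assumes records have already been extracted as dictionaries from a slot name to a text value, and that a slot is treated as a location slot when enough of its values on one page are already in the gazetteer. The function name `bootstrap_gazetteer`, the `min_support` threshold, and the input layout are all hypothetical.

```python
from collections import defaultdict


def bootstrap_gazetteer(pages, seed, iterations=4, min_support=0.5):
    """Grow a gazetteer from a seed using repeated record structure.

    pages: list of result pages; each page is a list of records,
           each record a dict mapping a slot name to its text value.
    seed:  initial set of known terms.
    (Hypothetical interface, for illustration only.)
    """
    gazetteer = {term.lower() for term in seed}
    for _ in range(iterations):
        learned = set()
        for records in pages:
            # Collect the values seen in each slot across this page's records.
            values_by_slot = defaultdict(list)
            for record in records:
                for slot, text in record.items():
                    values_by_slot[slot].append(text.lower())
            for values in values_by_slot.values():
                known = sum(value in gazetteer for value in values)
                # If enough values in a slot are already known, assume the
                # remaining values in that slot belong to the same domain.
                if known / len(values) >= min_support:
                    learned.update(v for v in values if v not in gazetteer)
        if not learned:
            break  # nothing new was learned, so further passes change nothing
        gazetteer |= learned
    return gazetteer
```

New terms are merged only at the end of each pass, so a term learned from one page can help label another page in the next iteration. A fully automatic run corresponds to accepting every learned term; a human-in-the-loop variant would review `learned` before the merge.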