Answering table augmentation queries from unstructured lists on the web

Authors:
Rahul Gupta;Sunita Sarawagi
Affiliations:
IIT Bombay, India;IIT Bombay, India
Venue:
Proceedings of the VLDB Endowment
Year:
2009

Citing 19
Cited 29

Algorithms on strings, trees, and sequences: computer science and computational biology

Algorithms on strings, trees, and sequences: computer science and computational biology
Learning dictionaries for information extraction by multi-level bootstrapping

AAAI '99/IAAI '99 Proceedings of the sixteenth national conference on Artificial intelligence and the eleventh Innovative applications of artificial intelligence conference innovative applications of artificial intelligence
IEPAD: information extraction based on pattern discovery

Proceedings of the 10th international conference on World Wide Web
A flexible learning system for wrapping tables and lists in HTML documents

Proceedings of the 11th international conference on World Wide Web
Interactive deduplication using active learning

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Robust and efficient fuzzy match for online data cleaning

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Mining reference tables for automatic text segmentation

Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
A hierarchical graphical model for record linkage

UAI '04 Proceedings of the 20th conference on Uncertainty in artificial intelligence
Web data extraction based on partial tree alignment

WWW '05 Proceedings of the 14th international conference on World Wide Web
Efficient Batch Top-k Search for Dictionary-based Entity Recognition

ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Adaptive Name Matching in Information Integration

IEEE Intelligent Systems
Creating probabilistic databases from information extraction models

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Entity Resolution with Markov Logic

ICDM '06 Proceedings of the Sixth International Conference on Data Mining
Extracting lists of data records from semi-structured web pages

Data & Knowledge Engineering
Language-Independent Set Expansion of Named Entities Using the Web

ICDM '07 Proceedings of the 2007 Seventh IEEE International Conference on Data Mining
Information Extraction

Foundations and Trends in Databases
Weakly-supervised acquisition of labeled class instances using graph random walks

EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Creating relational data from unstructured and ungrammatical data sources

Journal of Artificial Intelligence Research
Harvesting relational tables from lists on the web

Proceedings of the VLDB Endowment

Answering web questions using structured data: dream or reality?

Proceedings of the VLDB Endowment
Annotating and searching web tables using entities, types and relationships

Proceedings of the VLDB Endowment
Joint training for open-domain extraction on the web: exploiting overlap when supervision is limited

Proceedings of the fourth ACM international conference on Web search and data mining
Materializing multi-relational databases from the web using taxonomic queries

Proceedings of the fourth ACM international conference on Web search and data mining
Collective extraction from heterogeneous web lists

Proceedings of the fourth ACM international conference on Web search and data mining
Collective Inference for Extraction MRFs Coupled with Symmetric Clique Potentials

The Journal of Machine Learning Research
Highly efficient algorithms for structural clustering of large websites

Proceedings of the 20th international conference on World wide web
Unexpected results in automatic list extraction on the web

ACM SIGKDD Explorations Newsletter
Harvesting relational tables from lists on the web

The VLDB Journal — The International Journal on Very Large Data Bases
Recovering semantics of tables on the web

Proceedings of the VLDB Endowment
Web information extraction using markov logic networks

Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Extracting general lists from web documents: a hybrid approach

IEA/AIE'11 Proceedings of the 24th international conference on Industrial engineering and other applications of applied intelligent systems conference on Modern approaches in applied intelligence - Volume Part I
Semi-supervised multi-task learning of structured prediction models for web information extraction

Proceedings of the 20th ACM international conference on Information and knowledge management
WebSets: extracting sets of entities from the web using unsupervised information extraction

Proceedings of the fifth ACM international conference on Web search and data mining
An analysis of structured data on the web

Proceedings of the VLDB Endowment
InfoGather: entity augmentation and attribute discovery by holistic matching with web tables

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Automatic web-scale information extraction

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Finding related tables

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Answering table queries on the web using column keywords

Proceedings of the VLDB Endowment
LIEGE:: link entities in web lists with knowledge base

Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining
TEX: An efficient and effective unsupervised Web information extractor

Knowledge-Based Systems
WebPut: efficient web-based data imputation

WISE'12 Proceedings of the 13th international conference on Web Information Systems Engineering
Wikipedia entity expansion and attribute extraction from the web using semi-supervised learning

Proceedings of the sixth ACM international conference on Web search and data mining
Exploring structure and content on the web: extraction and integration of the semi-structured web

Proceedings of the sixth ACM international conference on Web search and data mining
Data-based research at IIT Bombay

ACM SIGMOD Record
Cost effective ontology population with data from lists in OCRed historical documents

Proceedings of the 2nd International Workshop on Historical Document Imaging and Processing
Methods for exploring and mining tables on Wikipedia

Proceedings of the ACM SIGKDD Workshop on Interactive Data Exploration and Analytics
The parallel path framework for entity discovery on the web

ACM Transactions on the Web (TWEB)
Scalable column concept determination for web tables using large knowledge bases

Proceedings of the VLDB Endowment

Quantified Score

Hi-index	0.00

Visualization

Abstract

We present the design of a system for assembling a table from a few example rows by harnessing the huge corpus of information-rich but unstructured lists on the web. We developed a totally unsupervised end to end approach which given the sample query rows --- (a) retrieves HTML lists relevant to the query from a pre-indexed crawl of web lists, (b) segments the list records and maps the segments to the query schema using a statistical model, (c) consolidates the results from multiple lists into a unified merged table, (d) and presents to the user the consolidated records ranked by their estimated membership in the target relation. The key challenges in this task include construction of new rows from very few examples, and an abundance of noisy and irrelevant lists that swamp the consolidation and ranking of rows. We propose modifications to statistical record segmentation models, and present novel consolidation and ranking techniques that can process input tables of arbitrary schema without requiring any human supervision. Experiments with Wikipedia target tables and 16 million unstructured lists show that even with just three sample rows, our system is very effective at recreating Wikipedia tables, with a mean runtime of around 20s.