We present the design of a system for assembling a table from a few example rows by harnessing the huge corpus of information-rich but unstructured lists on the web. We developed a fully unsupervised, end-to-end approach that, given the sample query rows, (a) retrieves HTML lists relevant to the query from a pre-indexed crawl of web lists, (b) segments the list records and maps the segments to the query schema using a statistical model, (c) consolidates the results from multiple lists into a unified merged table, and (d) presents the consolidated records to the user, ranked by their estimated membership in the target relation. The key challenges in this task are constructing new rows from very few examples and coping with an abundance of noisy and irrelevant lists that swamp the consolidation and ranking of rows. We propose modifications to statistical record segmentation models, and present novel consolidation and ranking techniques that can process input tables of arbitrary schema without requiring any human supervision. Experiments with Wikipedia target tables and 16 million unstructured lists show that even with just three sample rows, our system is very effective at recreating Wikipedia tables, with a mean runtime of around 20 s.
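To make steps (c) and (d) more concrete, the following is a minimal sketch of consolidation and ranking, not the technique proposed in the paper. It assumes the rows have already been segmented and mapped to the query schema, and that each source list carries a relevance score. The names `normalize` and `consolidate`, the exact-match merging after normalization, and the sum-of-list-scores ranking are all hypothetical stand-ins chosen for illustration.

```python
from collections import defaultdict


def normalize(cell):
    """Lowercase and collapse whitespace so near-identical cells compare equal."""
    return " ".join(cell.lower().split())


def consolidate(segmented_lists):
    """Merge rows from several segmented lists and rank them by support.

    segmented_lists: iterable of (list_score, rows) pairs, where each row is
    a tuple of cell strings already mapped to the query schema.
    Returns a list of (row, score) pairs, highest score first.

    Assumption: the score of a row is the sum of the scores of the lists
    that contain it. This is a simple proxy for "estimated membership in
    the target relation", not the paper's ranking model.
    """
    support = defaultdict(float)
    surface = {}  # first-seen surface form for each normalized row
    for list_score, rows in segmented_lists:
        seen = set()  # count each row at most once per source list
        for row in rows:
            key = tuple(normalize(c) for c in row)
            if key in seen:
                continue
            seen.add(key)
            support[key] += list_score
            surface.setdefault(key, row)
    # Sort by descending score; break ties by the normalized row for determinism.
    ranked = sorted(support.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(surface[key], score) for key, score in ranked]


if __name__ == "__main__":
    lists = [
        (1.0, [("Paris", "France"), ("Rome", "Italy")]),
        (0.5, [("paris", "France"), ("Oslo", "Norway")]),
        (0.25, [("Click here", "Login")]),  # a noisy, irrelevant list
    ]
    for row, score in consolidate(lists):
        print(row, score)
```

In this sketch, a row found in several relevant lists rises to the top, while rows that appear only in a low-scoring list sink to the bottom. That is one simple way to keep noisy lists from swamping the merged table.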