The World-Wide Web consists of a huge number of unstructured documents, but it also contains structured data in the form of HTML tables. We extracted 14.1 billion HTML tables from Google's general-purpose web crawl and used statistical classification techniques to find the estimated 154 million that contain high-quality relational data. Because each relational table has its own "schema" of labeled and typed columns, each such table can be considered a small structured database. The resulting corpus of databases is larger than any other corpus we are aware of, by at least five orders of magnitude. We describe the WEBTABLES system, which we use to explore two fundamental questions about this collection of databases. First, what are effective techniques for searching for structured data at search-engine scales? Second, what additional power can be derived by analyzing such a huge corpus? To address the first question, we develop new techniques for keyword search over a corpus of tables and show that they can achieve substantially higher relevance than solutions based on a traditional search engine. To address the second, we introduce a new object derived from the database corpus: the attribute correlation statistics database (AcsDB), which records corpus-wide statistics on co-occurrences of schema elements. In addition to improving search relevance, the AcsDB makes possible several novel applications: schema auto-complete, which helps a database designer choose schema elements; attribute synonym finding, which automatically computes attribute synonym pairs for schema matching; and join-graph traversal, which allows a user to navigate between extracted schemas using automatically generated join links.
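To make the idea of corpus-wide co-occurrence statistics concrete, here is a minimal sketch of how attribute counts could drive a simple schema auto-complete. It is an illustration under stated assumptions, not the WEBTABLES implementation: the toy schemas, the function names (`build_acsdb`, `cooccurrence`, `suggest`), and the scoring rule (an estimate of the conditional probability of a candidate attribute given one attribute already chosen) are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_acsdb(schemas):
    """Count how often each attribute and each attribute pair occurs across schemas.

    Hypothetical stand-in for an attribute co-occurrence store; each schema is
    a list of attribute names.
    """
    single = Counter()
    pair = Counter()
    for schema in schemas:
        attrs = sorted(set(schema))  # de-duplicate; sort so pair keys are canonical
        single.update(attrs)
        pair.update(combinations(attrs, 2))
    return single, pair


def cooccurrence(pair, a, b):
    """Number of schemas that contain both a and b (order-independent lookup)."""
    return pair[tuple(sorted((a, b)))]


def suggest(single, pair, given, k=3):
    """Rank candidates by count(candidate, given) / count(given).

    This ratio is an assumed scoring rule: an estimate of
    p(candidate | given) from the toy corpus.
    """
    if single[given] == 0:
        return []
    scores = {
        cand: cooccurrence(pair, cand, given) / single[given]
        for cand in single
        if cand != given
    }
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(attr, score) for attr, score in ranked[:k] if score > 0]


# Toy corpus of extracted schemas (illustrative data only).
schemas = [
    ["make", "model", "year", "price"],
    ["make", "model", "color"],
    ["name", "address", "phone"],
    ["make", "model", "year"],
]
single, pair = build_acsdb(schemas)
print(suggest(single, pair, "make"))
```

With this toy corpus, a designer who has typed `make` would be offered `model` first, since the two attributes always appear together. The same counts could in principle feed the other applications named above, although how the actual system computes synonyms or join links is not stated in this abstract.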