A relational model of data for large shared data banks
Communications of the ACM
A machine learning based approach for table detection on the web
Proceedings of the 11th international conference on World Wide Web
Use of the SAND spatial browser for digital government applications
Communications of the ACM
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
VISUAL '99 Proceedings of the Third International Conference on Visual Information and Information Systems
Table extraction using conditional random fields
Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval
Mining tables from large scale HTML texts
COLING '00 Proceedings of the 18th conference on Computational linguistics - Volume 1
A survey of table recognition: Models, observations, transformations, and inferences
International Journal on Document Analysis and Recognition
Shallow parsing with conditional random fields
NAACL '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1
Towards domain-independent information extraction from web tables
Proceedings of the 16th international conference on World Wide Web
TableSeer: automatic table metadata extraction and searching in digital libraries
Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
WebTables: exploring the power of tables on the web
Proceedings of the VLDB Endowment
Automated ontology instantiation from tabular web sources-The AllRight system
Web Semantics: Science, Services and Agents on the World Wide Web
Spatio-textual spreadsheets: geotagging via spatial coherence
Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems
Data integration for the relational web
Proceedings of the VLDB Endowment
Annotating and searching web tables using entities, types and relationships
Proceedings of the VLDB Endowment
Recovering semantics of tables on the web
Proceedings of the VLDB Endowment
InfoGather: entity augmentation and attribute discovery by holistic matching with web tables
SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Answering table queries on the web using column keywords
Proceedings of the VLDB Endowment
Structured toponym resolution using combined hierarchical place categories
Proceedings of the 7th Workshop on Geographic Information Retrieval
Hi-index | 0.00 |
Tabular data is an abundant source of information on the Web, but remains mostly isolated from the latter's interconnections since tables lack links and computer-accessible descriptions of their structure. In other words, the schemas of these tables -- attribute names, values, data types, etc. -- are not explicitly stored as table metadata. Consequently, the structure that these tables contain is not accessible to the crawlers that power search engines and thus not accessible to user search queries. We address this lack of structure with a new method for leveraging the principles of table construction in order to extract table schemas. Discovering the schema by which a table is constructed is achieved by harnessing the similarities and differences of nearby table rows through the use of a novel set of features and a feature processing scheme. The schemas of these data tables are determined using a classification technique based on conditional random fields in combination with a novel feature encoding method called logarithmic binning, which is specifically designed for the data table extraction task. Our method provides considerable improvement over the well-known WebTables schema extraction method. In contrast with previous work that focuses on extracting individual relations, our method excels at correctly interpreting full tables, thereby being capable of handling general tables such as those found in spreadsheets, instead of being restricted to HTML tables as is the case with the WebTables method. We also extract additional schema characteristics, such as row groupings, which are important for supporting information retrieval tasks on tabular data.