In this tutorial, we view the World Wide Web as a type of massive, decentralized database. At present, this "Web database" is presented in a manner largely devoid of any consistent meaning or schema. That is not to say that Web data lacks an underlying organization; in fact, most Web content is generated from an underlying schema-bound or otherwise structured database. Information extraction is generally concerned with reconciling unstructured or semi-structured Web content with the neatly structured database paradigm. With this Web database in hand, researchers and practitioners have recently begun developing mechanisms that return structured results in response to an unstructured query. These new developments are a product of (1) record, list, and table extraction from large numbers of semi-structured Web pages, (2) integration of these disparate extraction results into a consistent form, and (3) analysis of the newly extracted and integrated Web data. Among the many fruits of this line of work is the use of semi-structured Web data to enhance the search capabilities of a schema-bound database. Conversely, structured database records have been used to augment the Web page collections typically used by Web search engines. We will cover several key technologies and principles explored so far in the areas of Web information extraction, search, and exploration.