Flint: Google-basing the Web

Authors:
Lorenzo Blanco;Valter Crescenzi;Paolo Merialdo;Paolo Papotti
Affiliations:
Roma Tre University - Italy;Roma Tre University - Italy;Roma Tre University - Italy;Roma Tre University - Italy
Venue:
EDBT '08 Proceedings of the 11th international conference on Extending database technology: Advances in database technology
Year:
2008

Citing 15
Cited 7

Focused crawling: a new approach to topic-specific Web resource discovery

WWW '99 Proceedings of the eighth international conference on World Wide Web
Snowball: extracting relations from large plain-text collections

DL '00 Proceedings of the fifth ACM conference on Digital libraries
Modern Information Retrieval

Modern Information Retrieval
RoadRunner: Towards Automatic Data Extraction from Large Web Sites

Proceedings of the 27th International Conference on Very Large Data Bases
Data-rich Section Extraction from HTML pages

WISE '02 Proceedings of the 3rd International Conference on Web Information Systems Engineering
Extracting Patterns and Relations from the World Wide Web

WebDB '98 Selected papers from the International Workshop on The World Wide Web and Databases
Extracting structured data from Web pages

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Using the structure of Web sites for automatic segmentation of tables

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Automatic information extraction from large websites

Journal of the ACM (JACM)
Unsupervised named-entity extraction from the web: an experimental study

Artificial Intelligence
Record linkage: similarity measures and algorithms

Proceedings of the 2006 ACM SIGMOD international conference on Management of data
A Survey of Web Information Extraction Systems

IEEE Transactions on Knowledge and Data Engineering
Introduction to Information Retrieval

Introduction to Information Retrieval
Wrapper maintenance: a machine learning approach

Journal of Artificial Intelligence Research
Open information extraction from the web

IJCAI'07 Proceedings of the 20th international joint conference on Artifical intelligence

Supporting the automatic construction of entity aware search engines

Proceedings of the 10th ACM workshop on Web information and data management
Probabilistic models to reconcile complex data from inaccurate data sources

CAiSE'10 Proceedings of the 22nd international conference on Advanced information systems engineering
Unexpected results in automatic list extraction on the web

ACM SIGKDD Explorations Newsletter
Chapter 6: web data extraction for service creation

Search Computing
An analysis of structured data on the web

Proceedings of the VLDB Endowment
Exploring structure and content on the web: extraction and integration of the semi-structured web

Proceedings of the sixth ACM international conference on Web search and data mining
The parallel path framework for entity discovery on the web

ACM Transactions on the Web (TWEB)

Quantified Score

Hi-index	0.00

Visualization

Abstract

Several Web sites deliver a large number of pages, each publishing data about one instance of some real world entity, such as an athlete, a stock quote, a book. Even though it is easy for a human reader to recognize these instances, current search engines are unaware of them. Technologies for the Semantic Web aim at achieving this goal; however, so far they have been of little help in this respect, as semantic publishing is very limited. We have developed a system, called Flint, for automatically searching, collecting and indexing Web pages that publish data representing an instance of a certain conceptual entity. Flint takes as input a small set of labeled sample pages: it automatically infers a description of the underlying conceptual entity and then searches the Web for other pages containing data representing the same entity. Flint automatically extracts data from the collected pages and stores them into a semi-structured self-describing database, such as Google Base. Also, the collected pages can be used to populate a custom search engine; to this end we rely on the facilities provided by Google Co-op.