Many Web sites deliver a large number of pages, each publishing data about one instance of some real-world entity, such as an athlete, a stock quote, or a book. Although a human reader can easily recognize these instances, current search engines are unaware of them. Semantic Web technologies aim to make such instances machine-recognizable; so far, however, they have been of little help in this respect, because semantic publishing remains very limited. We have developed a system, called Flint, that automatically searches for, collects, and indexes Web pages that publish data representing an instance of a given conceptual entity. Flint takes as input a small set of labeled sample pages; it automatically infers a description of the underlying conceptual entity and then searches the Web for other pages containing data that represent the same entity. Flint automatically extracts data from the collected pages and stores them in a semi-structured, self-describing database, such as Google Base. The collected pages can also be used to populate a custom search engine; to this end, we rely on the facilities provided by Google Co-op.
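The workflow described above (infer an entity description from a few sample pages, then keep only candidate pages that fit it) can be illustrated with a deliberately simple sketch. This is not Flint's actual algorithm, which the abstract does not specify. It assumes the description is just the set of terms shared by all sample pages, and that a candidate page matches when it contains most of those terms. The function names (`infer_description`, `matches_entity`), the thresholds, and the sample data are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Return the set of lowercase alphabetic terms in a page's text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def infer_description(sample_pages, min_share=1.0):
    """Hypothetical entity description: terms found in at least
    `min_share` of the sample pages (by default, in all of them)."""
    counts = Counter()
    for page in sample_pages:
        counts.update(tokenize(page))
    threshold = min_share * len(sample_pages)
    return {term for term, n in counts.items() if n >= threshold}


def matches_entity(page, description, min_overlap=0.6):
    """Assumed matching rule: the page contains at least `min_overlap`
    of the description terms."""
    if not description:
        return False
    overlap = len(tokenize(page) & description) / len(description)
    return overlap >= min_overlap


# Illustrative sample pages for an "athlete" entity (made-up data).
samples = [
    "Name: Ann Lee Team: Hawks Height: 180 cm Position: guard",
    "Name: Bo Chen Team: Lions Height: 195 cm Position: center",
]
description = infer_description(samples)

# Candidate pages, as a search step might return them (made-up data).
candidates = {
    "athlete": "Name: Cy Diaz Team: Owls Height: 188 cm Position: forward",
    "recipe": "Ingredients: flour, sugar. Bake for 20 minutes.",
}
collected = [key for key, page in candidates.items()
             if matches_entity(page, description)]
```

Here the inferred description consists of the recurring labels (name, team, height, cm, position), so only the athlete-like candidate is collected. A real system would need far more robust signals than shared vocabulary, such as page structure and the extracted values themselves.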