Today the major web search engines answer queries by showing ten result snippets, which users must inspect to identify relevant results. In this paper we investigate how to extract structured information from the web in order to answer queries directly by showing the content being searched for. We treat users' search trails (i.e., post-search browsing behaviors) as implicit labels on the relevance between web content and user queries. Based on these labels, we use an information extraction approach to build wrappers and extract structured information. An important observation is that many web sites contain pages for named entities of certain categories (e.g., AOL Music contains a page for each musician), and these pages share the same format. This makes it possible to build wrappers from a small number of implicit labels and use them to extract structured information from many web pages for different named entities. We propose STRUCLICK, a fully automated system for extracting structured information for queries containing named entities of certain categories. It can identify important web sites from web search logs, build wrappers from users' search trails, filter out bad wrappers built from random user clicks, and combine structured information from different web sites for each query. Compared with existing approaches to information extraction, STRUCLICK can assign semantics to extracted data without any human labeling or supervision. We perform comprehensive experiments, which show that STRUCLICK achieves high accuracy and good scalability.
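The idea of building a wrapper from a few implicit labels and reusing it across same-format pages can be illustrated with a minimal sketch. This is not STRUCLICK's actual algorithm, which the abstract does not specify. It assumes pages are already parsed into a hypothetical mapping from tag path to text, that a search trail reduces to a (page, clicked text) pair, and that a wrapper is simply the tag path that most often holds clicked content. The names `induce_wrapper`, `apply_wrapper`, and `min_support` are illustrative only.

```python
from collections import Counter


def induce_wrapper(pages, trails, min_support=2):
    """Pick the tag path that most often holds the content users clicked.

    pages:  {url: {tag_path: text}}  hypothetical pre-parsed pages of one site
    trails: [(url, clicked_text)]    implicit labels from post-search clicks
    Returns the winning tag path, or None if no path reaches min_support.
    """
    votes = Counter()
    for url, clicked_text in trails:
        for tag_path, text in pages.get(url, {}).items():
            if text == clicked_text:
                votes[tag_path] += 1
    if not votes:
        return None
    tag_path, support = votes.most_common(1)[0]
    # Assumption: a wrapper backed by too few clicks likely comes from
    # random clicks, so it is discarded.
    return tag_path if support >= min_support else None


def apply_wrapper(pages, tag_path):
    """Extract the text at tag_path from every page that contains it."""
    return {url: fields[tag_path]
            for url, fields in pages.items() if tag_path in fields}


# Toy data: three musician pages sharing one template (made-up values).
pages = {
    "site/artist/a": {"html/body/h1": "Artist A",
                      "html/body/ul/li[1]": "Song A1",
                      "html/body/div/a": "Ad"},
    "site/artist/b": {"html/body/h1": "Artist B",
                      "html/body/ul/li[1]": "Song B1",
                      "html/body/div/a": "Ad"},
    "site/artist/c": {"html/body/h1": "Artist C",
                      "html/body/ul/li[1]": "Song C1"},
}
trails = [
    ("site/artist/a", "Song A1"),
    ("site/artist/b", "Song B1"),
    ("site/artist/a", "Ad"),  # a stray click
]

wrapper = induce_wrapper(pages, trails)
extracted = apply_wrapper(pages, wrapper)
```

In this toy setting, two clicks on two pages are enough to select the song-list path, and the same wrapper then extracts content from the third page, which received no clicks at all.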