In this paper, we introduce the concept of a QA-Pagelet to refer to the content region in a dynamic page that contains query matches. We present THOR, a scalable and efficient mining system for discovering and extracting QA-Pagelets from the Deep Web. A unique feature of THOR is its two-phase extraction framework. In the first phase, pages from a deep web site are grouped into distinct clusters of structurally-similar pages. In the second phase, pages from each page cluster are examined through a subtree filtering algorithm that exploits the structural and content similarity at subtree level to identify the QA-Pagelets.
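To make the first phase concrete, the following is a minimal sketch of grouping pages by structural similarity. It is not THOR's actual algorithm: the abstract does not specify the page representation, the similarity measure, or the clustering method. The tag-count signature, the cosine similarity, the greedy single-pass clustering, the 0.9 threshold, and the names `tag_signature`, `cosine`, and `cluster_pages` are all assumptions made for illustration.

```python
from collections import Counter
from html.parser import HTMLParser
import math


class TagCounter(HTMLParser):
    """Counts opening tags; a crude stand-in for a page's structure (assumption)."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1


def tag_signature(html):
    """Return a Counter mapping tag name -> number of occurrences in the page."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.tags


def cosine(a, b):
    """Cosine similarity between two tag-count signatures."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_pages(pages, threshold=0.9):
    """Greedy single-pass clustering (illustrative only).

    Each page joins the first cluster whose representative (its first
    member) is at least `threshold` similar; otherwise it starts a new one.
    Returns a list of clusters, each a list of page indices.
    """
    clusters = []  # list of (representative signature, member indices)
    for index, html in enumerate(pages):
        signature = tag_signature(html)
        for representative, members in clusters:
            if cosine(signature, representative) >= threshold:
                members.append(index)
                break
        else:
            clusters.append((signature, [index]))
    return [members for _, members in clusters]


if __name__ == "__main__":
    sample_pages = [
        "<html><body><table><tr><td>a</td></tr><tr><td>b</td></tr></table></body></html>",
        "<html><body><table><tr><td>c</td></tr><tr><td>d</td></tr><tr><td>e</td></tr></table></body></html>",
        "<html><body><p>No matches found.</p></body></html>",
    ]
    print(cluster_pages(sample_pages))
```

In this toy input, the two table-based result pages land in one cluster and the "no matches" page in another. A second phase would then compare subtrees across the pages within each cluster; that step is not sketched here because the abstract gives no detail beyond its use of structural and content similarity.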