Term-weighting approaches in automatic text retrieval
Information Processing and Management: an International Journal
C4.5: programs for machine learning
C4.5: programs for machine learning
Searching distributed collections with inference networks
SIGIR '95 Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval
Content routing: a scalable architecture for network-based information discovery
Content routing: a scalable architecture for network-based information discovery
STARTS: Stanford proposal for Internet meta-searching
SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data
The SMART and SIRE experimental retrieval systems
Readings in information retrieval
Statistical methods for speech recognition
Statistical methods for speech recognition
Effective retrieval with distributed collections
Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
Methods for information server selection
ACM Transactions on Information Systems (TOIS)
Automatic discovery of language models for text databases
SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Comparing the performance of database selection algorithms
Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Cluster-based language models for distributed retrieval
Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Scalable collection summarization and selection
Proceedings of the fourth ACM conference on Digital libraries
A decision-theoretic approach to database selection in networked IR
ACM Transactions on Information Systems (TOIS)
GlOSS: text-source discovery over the Internet
ACM Transactions on Database Systems (TODS)
Server selection on the World Wide Web
DL '00 Proceedings of the fifth ACM conference on Digital libraries
The impact of database selection on distributed searching
SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
Query routing for Web search engines: architectures and experiments
Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications netowrking
Collection selection and results merging with topically organized U.S. patents and TREC data
Proceedings of the ninth international conference on Information and knowledge management
Probe, count, and classify: categorizing hidden web databases
SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
SDLIP + STARTS = SDARTS a protocol and toolkit for metasearching
Proceedings of the 1st ACM/IEEE-CS joint conference on Digital libraries
Query-based sampling of text databases
ACM Transactions on Information Systems (TOIS)
Introduction to Modern Information Retrieval
Introduction to Modern Information Retrieval
Text Categorization with Suport Vector Machines: Learning with Many Relevant Features
ECML '98 Proceedings of the 10th European Conference on Machine Learning
Determining Text Databases to Search in the Internet
VLDB '98 Proceedings of the 24rd International Conference on Very Large Data Bases
Server Ranking for Distributed Text Retrieval Systems on the Internet
Proceedings of the Fifth International Conference on Database Systems for Advanced Applications (DASFAA)
Learning trees and rules with set-valued features
AAAI'96 Proceedings of the thirteenth national conference on Artificial intelligence - Volume 1
QProber: A system for automatic classification of hidden-Web databases
ACM Transactions on Information Systems (TOIS)
Web application security assessment by fault injection and behavior monitoring
WWW '03 Proceedings of the 12th international conference on World Wide Web
Relevant document distribution estimation method for resource selection
Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval
Information sharing across private databases
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Comparing the performance of collection selection algorithms
ACM Transactions on Information Systems (TOIS)
A semisupervised learning method to merge search engine results
ACM Transactions on Information Systems (TOIS)
Learning query languages of Web interfaces
Proceedings of the 2004 ACM symposium on Applied computing
A Probabilistic Approach to Metasearching with Adaptive Probing
ICDE '04 Proceedings of the 20th International Conference on Data Engineering
A Frequency-based Approach for Mining Coverage Statistics in Data Integration
ICDE '04 Proceedings of the 20th International Conference on Data Engineering
Probe, Cluster, and Discover: Focused Extraction of QA-Pagelets from the Deep Web
ICDE '04 Proceedings of the 20th International Conference on Data Engineering
When one sample is not enough: improving text database selection using shrinkage
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Web Searching and Information Retrieval
Computing in Science and Engineering
Organizing structured web sources by query schemas: a clustering approach
Proceedings of the thirteenth ACM international conference on Information and knowledge management
Unified utility maximization framework for resource selection
Proceedings of the thirteenth ACM international conference on Information and knowledge management
Guiding queries to information sources with InfoBeacons
Proceedings of the 5th ACM/IFIP/USENIX international conference on Middleware
Modeling and Managing Content Changes in Text Databases
ICDE '05 Proceedings of the 21st International Conference on Data Engineering
Downloading textual hidden web content through keyword queries
Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries
On search in peer-to-peer file sharing systems
Proceedings of the 2005 ACM symposium on Applied computing
A testing framework for Web application security assessment
Computer Networks: The International Journal of Computer and Telecommunications Networking - Web security
Information source selection for resource constrained environments
ACM SIGMOD Record
Two-stage statistical language models for text database selection
Information Retrieval
Estimating required recall for successful knowledge acquisition from the web
Proceedings of the 15th international conference on World Wide Web
Automatic structured query transformation over distributed digital libraries
Proceedings of the 2006 ACM symposium on Applied computing
Capturing collection size for distributed non-cooperative retrieval
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Identifying redundant search engines in a very large scale metasearch engine context
WIDM '06 Proceedings of the 8th annual ACM international workshop on Web information and data management
"What is a good digital library?" - A quality model for digital libraries
Information Processing and Management: an International Journal
Effective keyword-based selection of relational databases
Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Modeling and managing changes in text databases
ACM Transactions on Database Systems (TODS)
Distributed text retrieval from overlapping collections
ADC '07 Proceedings of the eighteenth conference on Australasian database - Volume 63
Using query logs to establish vocabularies in distributed information retrieval
Information Processing and Management: an International Journal
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
DeepBot: a focused crawler for accessing hidden web content
Proceedings of the 3rd international workshop on Data enginering issues in E-commerce and services: In conjunction with ACM Conference on Electronic Commerce (EC '07)
Towards a query optimizer for text-centric tasks
ACM Transactions on Database Systems (TODS)
Routing Queries through a Peer-to-Peer InfoBeacons Network Using Information Retrieval Techniques
IEEE Transactions on Parallel and Distributed Systems
Adaptive-sampling algorithms for answering aggregation queries on Web sites
Data & Knowledge Engineering
Classification-aware hidden-web text database selection
ACM Transactions on Information Systems (TOIS)
Discovering gis sources on the web using summaries
Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries
Efficient Top-k Data Sources Ranking for Query on Deep Web
WISE '08 Proceedings of the 9th international conference on Web Information Systems Engineering
Proceedings of the VLDB Endowment
Automatic wrapper induction from hidden-web sources with domain knowledge
Proceedings of the 10th ACM workshop on Web information and data management
A comparison of techniques for estimating IDF values to generate lexical signatures for the web
Proceedings of the 10th ACM workshop on Web information and data management
Facilitating discovery on the private web using dataset digests
Proceedings of the 10th International Conference on Information Integration and Web-based Applications & Services
Web-scale extraction of structured data
ACM SIGMOD Record
Privacy preservation of aggregates in hidden databases: why and how?
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Classification-based resource selection
Proceedings of the 18th ACM conference on Information and knowledge management
A testing framework for Web application security assessment
Computer Networks: The International Journal of Computer and Telecommunications Networking - Web security
Kosmix: high-performance topic exploration using the deep web
Proceedings of the VLDB Endowment
Turbo-charging hidden database samplers with overflowing queries and skew reduction
Proceedings of the 13th International Conference on Extending Database Technology
Supporting keyword queries on structured databases with limited search interfaces
DASFAA'08 Proceedings of the 13th international conference on Database systems for advanced applications
Crawling the content hidden behind web forms
ICCSA'07 Proceedings of the 2007 international conference on Computational science and Its applications - Volume Part II
Unbiased estimation of size and other aggregates over hidden web databases
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Facilitating discovery on the private web using dataset digests
International Journal of Metadata, Semantics and Ontologies
Communications of the ACM
Just-in-time analytics on large file systems
FAST'11 Proceedings of the 9th USENIX conference on File and stroage technologies
Deep Web adaptive crawling based on minimum executable pattern
Journal of Intelligent Information Systems
Foundations and Trends in Information Retrieval
Facet discovery for structured web search: a query-log mining approach
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Attribute domain discovery for hidden web databases
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Mining a search engine's corpus: efficient yet unbiased sampling and aggregate estimation
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Crawling web pages with support for client-side dynamism
WAIM '06 Proceedings of the 7th international conference on Advances in Web-Age Information Management
Using information retrieval techniques to route queries in an infobeacons network
DBISP2P'04 Proceedings of the Second international conference on Databases, Information Systems, and Peer-to-Peer Computing
Index-Based keyword search in mediator systems
EDBT'04 Proceedings of the 2004 international conference on Current Trends in Database Technology
Efficient deep web crawling using reinforcement learning
PAKDD'10 Proceedings of the 14th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining - Volume Part I
Optimal algorithms for crawling a hidden database in the web
Proceedings of the VLDB Endowment
Shard ranking and cutoff estimation for topically partitioned collections
Proceedings of the 21st ACM international conference on Information and knowledge management
Federated search in the wild: the combined power of over a hundred search engines
Proceedings of the 21st ACM international conference on Information and knowledge management
Topic-Sensitive hidden-web crawling
WISE'12 Proceedings of the 13th international conference on Web Information Systems Engineering
Information Systems
Discovering interesting information with advances in web technology
ACM SIGKDD Explorations Newsletter
Rank discovery from web databases
Proceedings of the VLDB Endowment
Hi-index | 0.02 |
Many valuable text databases on the web have non-crawlable contents that are "hidden" behind search interfaces. Metasearchers are helpful tools for searching over many such databases at once through a unified query interface. A critical task for a metasearcher to process a query efficiently and effectively is the selection of the most promising databases for the query, a task that typically relies on statistical summaries of the database contents. Unfortunately, web-accessible text databases do not generally export content summaries. In this paper, we present an algorithm to derive content summaries from "uncooperative" databases by using "focused query probes," which adaptively zoom in on and extract documents that are representative of the topic coverage of the databases. Our content summaries are the first to include absolute document frequency estimates for the database words. We also present a novel database selection algorithm that exploits both the extracted content summaries and a hierarchical classification of the databases, automatically derived during probing, to compensate for potentially incomplete content summaries. Finally, we evaluate our techniques thoroughly using a variety of databases, including 50 real web-accessible text databases. Our experiments indicate that our new content-summary construction technique is efficient and produces more accurate summaries than those from previously proposed strategies. Also, our hierarchical database selection algorithm exhibits significantly higher precision than its flat counterparts.