Focusing on novelty: a crawling strategy to build diverse language models

Authors:
Luciano Barbosa;Srinivas Bangalore
Affiliations:
AT&T Labs -- Research, Florham Park, NJ, USA;AT&T Labs -- Research, Florham Park, NJ, USA
Venue:
Proceedings of the 20th ACM international conference on Information and knowledge management
Year:
2011

Abstract

Word prediction performed by language models plays an important role in many tasks, such as word sense disambiguation, speech recognition, handwriting recognition, query spelling, and query segmentation. Recent research has exploited the textual content of the Web to create language models. In this paper, we propose a new focused crawling strategy that collects Web pages with an emphasis on novelty in order to create diverse language models. In each crawling cycle, the crawler tries to fill the gaps in the current language model, built from previous cycles, by avoiding pages whose vocabulary is already well represented in the model. It relies on an information-theoretic measure to identify these gaps and then learns link patterns that lead to pages in these regions in order to guide its visitation policy. To handle constantly evolving domains, a key feature of our crawler is its ability to adjust its focus as the crawl progresses. We evaluate our approach in two scenarios in which it can be useful. First, we demonstrate that it produces more effective language models than those created by a baseline crawler in the context of a speech recognition task on broadcast news. In fact, in some cases our crawler obtained results similar to the baseline's by crawling only 12.5% of the pages collected by the latter. Second, since avoiding well-represented content in the news domain might lead to novel, i.e., up-to-date, pages, we show that our diversity-based crawler can also help steer the crawl toward the most recent news content. The results show that our approach obtained on average 50% more up-to-date pages than the baseline crawler.
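
The abstract does not say which information-theoretic measure the crawler uses, so the following is only an illustrative sketch of the general idea: score a page by its cross-entropy under a language model built from previously crawled text, so that pages with poorly represented vocabulary score higher. The unigram model, the add-one smoothing, the function name `cross_entropy`, and the `vocab_size` value are all assumptions made for this example and are not taken from the paper.

```python
import math
from collections import Counter


def cross_entropy(model_counts, page_tokens, vocab_size):
    """Average negative log2-probability of a page's tokens under an
    add-one-smoothed unigram model (an assumed stand-in for the paper's
    unspecified information-theoretic measure)."""
    if not page_tokens:
        return 0.0
    total = sum(model_counts.values())
    log_prob = 0.0
    for tok in page_tokens:
        # Counter returns 0 for unseen tokens, so smoothing covers them.
        p = (model_counts[tok] + 1) / (total + vocab_size)
        log_prob += math.log2(p)
    return -log_prob / len(page_tokens)


# Hypothetical text collected in earlier crawling cycles.
crawled = "the senate passed the budget bill the senate vote".split()
model = Counter(crawled)

# Two hypothetical candidate pages.
familiar = "the senate passed the bill".split()
novel = "volcano eruption grounds flights".split()

VOCAB_SIZE = 1000  # assumed vocabulary size for smoothing
familiar_score = cross_entropy(model, familiar, VOCAB_SIZE)
novel_score = cross_entropy(model, novel, VOCAB_SIZE)

# The page with unseen vocabulary gets the higher score.
print(novel_score > familiar_score)
```

Under these assumptions, a crawler could rank candidate pages by this score and prefer the higher-scoring ones. Learning which link patterns lead to such pages, as the abstract describes, is a separate step that this sketch does not cover.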