Combining text and link analysis for focused crawling: An application for vertical search engines

Authors:
G. Almpanidis;C. Kotropoulos;I. Pitas
Affiliations:
Department of Informatics, Aristotle University of Thessaloniki, Box 451, Thessaloniki GR-54124, Greece;Department of Informatics, Aristotle University of Thessaloniki, Box 451, Thessaloniki GR-54124, Greece;Department of Informatics, Aristotle University of Thessaloniki, Box 451, Thessaloniki GR-54124, Greece
Venue:
Information Systems
Year:
2007

Abstract

The number of vertical search engines and portals has increased rapidly in recent years, making the importance of topic-driven (focused) crawlers self-evident. In this paper, we develop a latent semantic indexing classifier that combines link analysis with text content to retrieve and index domain-specific web documents. Our implementation presents a different approach to focused crawling and aims to overcome the limitations imposed by the need to provide initial training data, while maintaining a high recall/precision ratio. We compare its efficiency with that of other well-known web information retrieval techniques.
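
To make the idea of combining text and link evidence in a latent semantic space more concrete, the following is a minimal sketch and not the authors' classifier, whose details the abstract does not give. It assumes that each outlink is encoded as a pseudo-term (`link:<url>`) alongside the page's words, that the latent basis comes from a truncated SVD of the resulting feature-by-document matrix (computed here with a small Jacobi eigen-solver on the Gram matrix), and that candidate pages are scored by cosine similarity to a topic description. All function names, the toy pages, and the `hub.example` URLs are hypothetical.

```python
import math
from collections import Counter

def features(text, outlinks):
    """Bag of features: words plus one pseudo-term per outlink (assumed encoding)."""
    bag = Counter(text.lower().split())
    bag.update("link:" + url for url in outlinks)
    return bag

def jacobi_eigh(S, sweeps=50, tol=1e-18):
    """Eigen-decomposition of a small symmetric matrix by cyclic Jacobi rotations."""
    n = len(S)
    A = [row[:] for row in S]
    V = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        if sum(A[i][j] ** 2 for i in range(n) for j in range(n) if i != j) < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(A[p][q]) < 1e-15:
                    continue
                theta = 0.5 * math.atan2(2 * A[p][q], A[q][q] - A[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for M in (A, V):  # rotate columns p and q of A and of V
                    for r in M:
                        r[p], r[q] = c * r[p] - s * r[q], s * r[p] + c * r[q]
                # rotate rows p and q of A to complete the similarity transform
                A[p], A[q] = ([c * x - s * y for x, y in zip(A[p], A[q])],
                              [s * x + c * y for x, y in zip(A[p], A[q])])
    return [A[i][i] for i in range(n)], V  # eigenvalues, eigenvectors as columns

def build_lsi(pages, k):
    """Return the k leading left singular vectors of the feature-by-document matrix."""
    bags = [features(text, links) for text, links in pages]
    vocab = sorted(set().union(*bags))
    n = len(bags)
    # Gram matrix G = A^T A, where column j of A is the feature bag of page j
    G = [[float(sum(bi[f] * bj[f] for f in bi)) for bj in bags] for bi in bags]
    vals, V = jacobi_eigh(G)
    top = sorted(range(n), key=lambda i: -vals[i])[:k]
    basis = []
    for i in top:
        sigma = math.sqrt(vals[i])  # singular value of A
        basis.append({f: sum(bags[j][f] * V[j][i] for j in range(n)) / sigma
                      for f in vocab})
    return basis

def project(bag, basis):
    """Coordinates of a feature bag in the latent space; unseen features are ignored."""
    return [sum(w * u.get(f, 0.0) for f, w in bag.items()) for u in basis]

def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb) if na and nb else 0.0

# Toy collection of already-crawled pages: (text, outlinks)
crawled = [
    ("crawler search index", ["hub.example/ir"]),
    ("crawler web search", ["hub.example/ir"]),
    ("recipe pasta sauce", ["hub.example/food"]),
    ("pasta recipe tomato", ["hub.example/food"]),
]
basis = build_lsi(crawled, k=2)
topic = project(features("crawler search", []), basis)

# Two candidate pages that share no words with the topic description
score_ir = cosine(topic, project(features("spider tutorial", ["hub.example/ir"]), basis))
score_food = cosine(topic, project(features("sauce tips", ["hub.example/food"]), basis))
print(score_ir, score_food)
```

In this toy setting neither candidate shares a word with the topic description, so a plain term-matching score would not separate them. The latent space does, because the link pseudo-term co-occurs with on-topic words in the crawled pages. A focused crawler could use such a score to order its frontier, although how the paper actually prioritizes URLs is not stated in the abstract.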