Clustering web pages about persons and organizations

Authors:
Shiren Ye;Tat-Seng Chua;Jeremy R. Kei
Affiliations:
School of Computing, National University of Singapore, Singapore;School of Computing, National University of Singapore, Singapore;School of Computing, National University of Singapore, Singapore
Venue:
Web Intelligence and Agent Systems
Year:
2005

Abstract

One of the most frequent Web surfing tasks is searching for persons and organizations by name. Such names are often common and non-unique, so a single name may map to several distinct target entities. This paper describes a new methodology for clustering the web pages returned by a search engine so that pages belonging to different entities fall into different groups. The algorithm uses named entities together with link-based and structure-based information as features, and applies a decision-tree model to partition the document set into direct and indirect pages. It then chooses suitably distinctive direct pages as seeds and clusters the document set around them. The algorithm has been found to be effective for web-based information retrieval applications.
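
The pipeline outlined in the abstract (separate direct from indirect pages, pick distinctive direct pages as seeds, cluster around the seeds) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the `Page` fields, the hand-written `is_direct` rule (standing in for the learned decision tree), the Jaccard similarity, the `max_sim` threshold, and the sample URLs are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    url: str
    entities: frozenset      # named entities found on the page (assumed feature)
    name_in_title: bool      # structure-based cue (assumed feature)
    inlinks_with_name: int   # link-based cue (assumed feature)


def is_direct(page: Page) -> bool:
    """Hand-written stand-in for a learned decision tree (assumption)."""
    if page.name_in_title:
        return True
    return page.inlinks_with_name >= 2


def jaccard(a: frozenset, b: frozenset) -> float:
    """Overlap of two entity sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def choose_seeds(direct_pages, max_sim=0.3):
    """Greedily keep direct pages that are dissimilar to every seed so far."""
    seeds = []
    for page in direct_pages:
        if all(jaccard(page.entities, s.entities) <= max_sim for s in seeds):
            seeds.append(page)
    return seeds


def cluster(pages, max_sim=0.3):
    """Assign every page to the seed whose entity set it overlaps most."""
    direct = [p for p in pages if is_direct(p)]
    seeds = choose_seeds(direct, max_sim)
    clusters = {s.url: [] for s in seeds}
    if not seeds:
        return clusters
    for page in pages:
        best = max(seeds, key=lambda s: jaccard(page.entities, s.entities))
        clusters[best.url].append(page.url)
    return clusters


# Made-up search results for one ambiguous name (illustrative data only).
pages = [
    Page("a.example/home", frozenset({"NUS", "Singapore", "computing"}), True, 3),
    Page("b.example/home", frozenset({"Acme", "Boston", "finance"}), True, 0),
    Page("c.example/news", frozenset({"NUS", "Singapore"}), False, 0),
    Page("d.example/blog", frozenset({"Acme", "Boston", "finance", "bank"}), False, 1),
]
result = cluster(pages)
for seed_url, members in result.items():
    print(seed_url, members)
```

In this toy run the two home pages are treated as direct pages and become seeds because they share no entities, while the news and blog pages are attached to whichever seed they overlap with most. A real system would learn the direct/indirect decision from labelled data and would likely need a fallback for pages that match no seed.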