Using Web structure and summarisation techniques for Web content mining

Authors:
Chen Lihui; Chue Wai Lian
Affiliations:
School of Electrical and Electronic Engineering, Division of Information Engineering, Nanyang Technological University, South Spine, Block S1, Nanyang Avenue, 639798 Republic of Singapore; School of Electrical and Electronic Engineering, Division of Information Engineering, Nanyang Technological University, South Spine, Block S1, Nanyang Avenue, 639798 Republic of Singapore
Venue:
Information Processing and Management: an International Journal
Year:
2005

Abstract

The size and dynamic nature of the Internet can make relevant information difficult to find. Most users express their information needs as short queries to search engines and then have to sift manually through the results in the order of the relevance ranking set by the search engine, which makes relevance judgement time-consuming. In this paper, we describe a novel representation technique that uses Web structure together with summarisation techniques to better represent the knowledge in actual Web documents. We call the proposed technique the Semantic Virtual Document (SVD). We discuss how the SVD can be used with a suitable clustering algorithm to achieve automatic content-based categorization of similar Web documents. The auto-categorization facility, together with a "tree-like" Graphical User Interface (GUI) for post-retrieval document browsing, enhances the relevance judgement process for Internet users. Furthermore, we introduce a cluster-biased automatic query expansion technique that can be used to overcome the ambiguity of the short queries users typically give. We outline our experimental design for evaluating the effectiveness of the proposed SVD as a representation and present a prototype called iSEARCH (Intelligent SEarch And Review of Cluster Hierarchy) for Web content mining. Our results confirm, quantify and extend previous research using Web structure and summarisation techniques, introducing novel techniques for knowledge representation to enhance Web content mining.