Document mining based on semantic understanding of text

Authors:
Khaled Shaban;Otman Basir;Mohamed Kamel
Affiliations:
Electrical and Computer Engineering, University of Waterloo, Waterloo, Canada (all authors)
Venue:
CIARP'06 Proceedings of the 11th Iberoamerican conference on Progress in Pattern Recognition, Image Analysis and Applications
Year:
2006

Abstract

This paper presents a new paradigm for mining documents by exploiting the semantic information in their text. A formal semantic representation of linguistic inputs is introduced and used to build a semantic representation of documents. The representation is constructed by accumulating the outputs of syntactic and semantic analysis. A new distance measure is developed to determine the similarity between the contents of documents. The measure is based on inexact matching of attributed trees. It involves computing all distinct similarity common sub-trees, and it can be computed efficiently. The proposed representation, together with the proposed similarity measure, is expected to enable more effective document mining. The proposed techniques were implemented as components of a mining system. A case study of semantic document clustering is presented to demonstrate the operation and efficacy of the framework. Experimental results are reported and analyzed.