Lexicon-based Document Representation

Authors:
Gloria Virginia;Hung Son Nguyen
Affiliations:
Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Poland. virginia@icm.edu.pl;Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Poland. son@mimuw.edu.pl
Venue:
Fundamenta Informaticae - Cognitive Informatics and Computational Intelligence: Theory and Applications
Year:
2013

Abstract

Interpreting user queries is a major challenge for an information retrieval system (IRS), particularly because a typical query consists of very few terms. The tolerance rough set model (TRSM), an extension of rough set theory, has demonstrated its ability to enrich document representation in terms of semantic relatedness. However, system efficiency is at stake because the weight vector created by TRSM (the TRSM-representation) is much less sparse. We mapped the terms occurring in the TRSM-representation to terms in a lexicon, so that the final representation of a document was a weight vector consisting only of terms that occur in the lexicon (the LEX-representation). The LEX-representation can be viewed as a compact form of the TRSM-representation in a lower-dimensional space, and it eliminates all informal terms that previously occurred in the TRSM vector. Given these properties, we may expect a more efficient system. We used recall and precision, the measures commonly used in information retrieval, to evaluate the effectiveness of the LEX-representation. Our examination showed that the effectiveness of the LEX-representation is comparable with that of the TRSM-representation, while its efficiency should be better. We conclude that lexicon-based document representation is a further alternative for representing a document while taking semantics into account. In future work, we intend to combine the LEX-representation with linguistic computation, such as tagging and feature selection, in order to retrieve more relevant terms with high weight. As for the TRSM method, enhancing the quality of the tolerance classes is crucial, because the method relies entirely on them. We plan to use additional resources, such as Wikipedia Indonesia, to generate better tolerance classes.
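
The mapping from a TRSM-weighted vector to a lexicon-restricted vector can be illustrated with a minimal sketch. It assumes a document is stored as a dictionary of term weights and that the mapping is a plain membership filter; the abstract does not specify the actual procedure, which may also involve stemming or reweighting. The function name, the example terms, and the weights are all hypothetical.

```python
from typing import Dict, Set


def to_lex_representation(
    trsm_vector: Dict[str, float], lexicon: Set[str]
) -> Dict[str, float]:
    """Keep only the weighted terms that also occur in the lexicon.

    Assumption: the mapping is a simple filter on term membership.
    """
    return {
        term: weight
        for term, weight in trsm_vector.items()
        if term in lexicon
    }


# Hypothetical TRSM-weighted document; terms and weights are made up.
trsm_vector = {"rumah": 0.42, "sakit": 0.31, "gitu": 0.05, "dokter": 0.22}

# Hypothetical lexicon; "gitu" (an informal term) is not in it.
lexicon = {"rumah", "sakit", "dokter", "obat"}

lex_vector = to_lex_representation(trsm_vector, lexicon)
print(lex_vector)
```

In this sketch the resulting vector has fewer non-zero entries than the input and contains no term outside the lexicon, which is the property the abstract relies on when it expects better efficiency.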