Cross-lingual document representation and semantic similarity measure: a fuzzy set and rough set based approach

Authors:
Hsun-Hui Huang;Yau-Hwang Kuo
Affiliations:
Intelligent System, Media Processing Laboratory, Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan City, Taiwan and Department of Management Inform ...;Intelligent System, Media Processing Laboratory, Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan City, Taiwan
Venue:
IEEE Transactions on Fuzzy Systems
Year:
2010

Abstract

As cross-lingual information retrieval attracts increasing attention, tools that measure the semantic similarity between documents written in different languages are becoming desirable. This paper investigates two aspects of cross-lingual semantic document similarity measures: document representation and the formulation of the similarity measures. Fuzzy set and rough set theories are applied to capture the inherently fuzzy relationships among concepts expressed in natural languages. Our approach first develops a language-independent, sense-level document representation based on the fuzzy set model to reduce the barrier between languages. It then explores a fuzzy-rough hybrid approach to obtain a more robust macrosense-level document representation by partitioning the integrated sense association network of the document collection into macrosenses. Tversky's notion of similarity and the F1 measure from information retrieval are then adopted to formulate, respectively, two document similarity measures using fuzzy set operations on the two proposed document representations. The effectiveness of the approach is demonstrated by its success rate in identifying the English translations of the corresponding Chinese documents in a collection of Chinese-English parallel documents. Moreover, the approach can be readily extended to documents in other languages. We believe that the proposed representations, together with the similarity measures, will enable more effective text mining.
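
The abstract names the ingredients of the similarity measures (Tversky's notion of similarity, the F1 measure, and fuzzy set operations) but does not give the formulas. The Python sketch below illustrates one way such measures might be written; it is not the authors' formulation. It assumes that a document is a fuzzy set mapping hypothetical sense identifiers to membership degrees in [0, 1], that intersection is the minimum, that difference is the bounded difference max(a - b, 0), and that cardinality is the sum of memberships (sigma-count). All function names and the sample data are made up for illustration.

```python
from typing import Callable, Dict, List

# Assumption: a document is a fuzzy set over language-independent sense ids.
FuzzySet = Dict[str, float]  # sense id -> membership degree in [0, 1]


def sigma_count(a: FuzzySet) -> float:
    """Fuzzy cardinality: the sum of all membership degrees."""
    return sum(a.values())


def fuzzy_intersection(a: FuzzySet, b: FuzzySet) -> FuzzySet:
    """Standard (min) intersection; senses absent from a set have degree 0."""
    return {k: min(a[k], b[k]) for k in a.keys() & b.keys()}


def fuzzy_difference(a: FuzzySet, b: FuzzySet) -> FuzzySet:
    """Bounded difference max(a - b, 0); one of several possible choices."""
    return {k: max(v - b.get(k, 0.0), 0.0) for k, v in a.items()}


def tversky_similarity(a: FuzzySet, b: FuzzySet,
                       alpha: float = 0.5, beta: float = 0.5) -> float:
    """Tversky's ratio model, with sigma-count as the salience function."""
    common = sigma_count(fuzzy_intersection(a, b))
    only_a = sigma_count(fuzzy_difference(a, b))
    only_b = sigma_count(fuzzy_difference(b, a))
    denom = common + alpha * only_a + beta * only_b
    return common / denom if denom > 0 else 0.0


def f1_similarity(a: FuzzySet, b: FuzzySet) -> float:
    """Harmonic mean of fuzzy 'precision' |A∩B|/|B| and 'recall' |A∩B|/|A|."""
    common = sigma_count(fuzzy_intersection(a, b))
    total = sigma_count(a) + sigma_count(b)
    return 2.0 * common / total if total > 0 else 0.0


def mate_retrieval_success_rate(
    queries: List[FuzzySet],
    candidates: List[FuzzySet],
    sim: Callable[[FuzzySet, FuzzySet], float],
) -> float:
    """Fraction of queries whose most similar candidate is their translation.

    Assumes queries[i] and candidates[i] are a parallel (translated) pair.
    """
    hits = 0
    for i, q in enumerate(queries):
        best = max(range(len(candidates)), key=lambda j: sim(q, candidates[j]))
        hits += int(best == i)
    return hits / len(queries) if queries else 0.0


# Hypothetical sense-level representations of two parallel document pairs.
doc_en = {"s1": 0.9, "s2": 0.4, "s3": 0.1}
doc_zh = {"s1": 0.8, "s2": 0.5, "s4": 0.3}
other_en = {"s5": 0.7, "s1": 0.1}
other_zh = {"s5": 1.0}

print(tversky_similarity(doc_zh, doc_en))
print(f1_similarity(doc_zh, doc_en))
print(mate_retrieval_success_rate([doc_zh, other_zh], [doc_en, other_en],
                                  f1_similarity))
```

Under these particular assumptions the two measures coincide when alpha = beta = 0.5, because the min intersection and the bounded difference of a set add up to its sigma-count. Other choices of fuzzy operators or weights would separate them.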