Editorial: Managing and mining multilingual documents: Introduction to the special topic issue of Information Processing and Management

  • Authors:
  • Christopher C. Yang;Chih-Ping Wei;Lee-Feng Chien

  • Venue:
  • Information Processing and Management: an International Journal
  • Year:
  • 2011

Abstract

Due to the popularity of the World Wide Web and advances in Internet search engines, information in many different languages is now accessible online. For example, one can easily access news stories in real time in over 30 languages on the Web. The number of non-English documents on the Web is growing faster than it was ten years ago because of the significant growth of the Internet user population in developing countries. Although it is convenient to obtain multilingual information, these online documents are usually organized or managed in each language separately. Internet search engines provide hierarchical directories of documents for each language at independent portals, even when these portals are provided by the same organization. We seldom find a hierarchical directory that provides multilingual classification. The lack of coordination among documents in different languages makes it inefficient for multilingual users to identify useful resources in multiple languages. Users must search in each language separately and concatenate the results. Such a search process is redundant and time-consuming. Furthermore, in global business environments, mining knowledge from text in a single language may not provide sufficient support to knowledge workers. We often need to integrate multilingual text before applying text mining techniques, or to integrate the knowledge discovered from text in different languages, in order to obtain global knowledge. For example, opinion mining from the Web requires multilingual text mining because user opinions are available in different languages from all over the world. Hence, there is an urgent need for advanced techniques for managing and mining multilingual documents.

Substantial research efforts have been made toward facilitating cross-lingual information retrieval in the last decade (Yang & Wei, 2009).
Such research endeavors mainly focus on how to cross the language boundary (Lam et al., 2005; Li and Yang, 2005; Yang and Li, 2004; Olsson, Oard, and Hajic, 2005; Yang, Wei, and Li, 2008). However, relatively little effort has been devoted to coordinating multilingual resources in a unified manner (Wei, Yang, and Lin, 2008). This is an important research area to explore so that we can fully utilize multilingual resources for better knowledge management.

In this special issue, we have selected three papers covering the related topics: multilingual Web directory generation, multilingual novelty mining, and multilingual information retrieval.

Yang et al. developed an approach to generate a multilingual Web directory. A self-organizing map was first constructed independently on each of multiple sets of Web pages, one set per language. Monolingual hierarchies were then generated. A hierarchy alignment method was applied to discover the associations between nodes from different hierarchies, and a multilingual Web directory was constructed. Experiments showed promising results.

Zhang et al. developed sentence categorization and novelty mining in multiple languages, including Malay, Chinese, and English. The Rocchio algorithm was adopted for sentence categorization. The novelty of each sentence was computed by measuring its cosine similarities with the sentences extracted in sentence categorization. The experimental results showed that sentence-level novelty mining had similar performance in Malay, English, and Chinese. They also showed that categorization improved multilingual novelty mining significantly.

Tsai et al. developed a learning-based ranking algorithm, FRank, to construct a merging model for multilingual information retrieval. Cross-lingual information retrieval was first performed on separate collections, one for each language. The lists of monolingual results were then merged to produce a multilingual result list.
Sixty-two features were extracted from the query, document, and translation levels. The FRank ranking algorithm used these features to construct a merging model. The experimental results showed a significant improvement in merging quality.
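
As an illustration of the cosine-similarity novelty measure summarized above for the work of Zhang et al., the following is a minimal sketch and not the authors' implementation. It assumes that a sentence's novelty is one minus its highest cosine similarity to any earlier sentence, that sentences are represented by raw term counts, and that tokens are separated by whitespace (which would not hold for unsegmented Chinese text). The function names `term_vector`, `cosine`, and `novelty_scores` are hypothetical.

```python
import math
from collections import Counter


def term_vector(sentence):
    """Bag-of-words term counts; whitespace tokenization is an assumption."""
    return Counter(sentence.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def novelty_scores(sentences):
    """Score each sentence against all sentences seen before it.

    Assumed scoring rule: novelty = 1 - max cosine similarity to history.
    The first sentence has no history, so its novelty is 1.0.
    """
    history = []
    scores = []
    for sentence in sentences:
        vec = term_vector(sentence)
        max_sim = max((cosine(vec, h) for h in history), default=0.0)
        scores.append(1.0 - max_sim)
        history.append(vec)
    return scores


if __name__ == "__main__":
    stream = [
        "the market fell sharply",
        "the market fell sharply",
        "rain is expected tomorrow",
    ]
    for text, score in zip(stream, novelty_scores(stream)):
        print(f"{score:.2f}  {text}")
```

In this sketch a repeated sentence receives a novelty near zero, while a sentence sharing no terms with its predecessors receives a novelty of one. A threshold on the score, chosen per application, would decide which sentences are reported as novel.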