Integrating Cross-Language Hierarchies and Its Application to Retrieving Relevant Documents

  • Authors: Fumiyo Fukumoto; Yoshimi Suzuki
  • Affiliations: University of Yamanashi; University of Yamanashi
  • Venue: ACM Transactions on Asian Language Information Processing (TALIP)
  • Year: 2008

Abstract

Internet directories such as Yahoo! are an approach to improving the efficacy and efficiency of Information Retrieval (IR) on the Web, as pages (documents) are organized into hierarchical categories, and similar pages are grouped together. Most search engines on the Web find documents that are assigned to a single classification hierarchy. Categories in the hierarchy are carefully defined by human experts and documents are well organized. However, a single hierarchy in one language is often insufficient to find all relevant material, as each hierarchy tends to have some bias in both defining the hierarchical structure and classifying documents. Moreover, documents written in a language other than the user's native language often include large amounts of information related to the user's request. In this article, we propose a method of integrating cross-language (CL) category hierarchies, that is, the Reuters 96 hierarchy and the Japanese UDC code hierarchy, by estimating category similarities. The method does not simply merge two different hierarchies into one large hierarchy but instead extracts sets of similar categories, where the elements of each set are relevant to one another. It consists of three steps. First, we classify documents from one hierarchy into categories of the other hierarchy using a cross-language text classification (CLTC) technique, and extract category pairs from the two hierarchies. Next, we apply χ² statistics to these pairs to obtain similar category pairs, and finally we apply the generating function of the Apriori algorithm (Apriori-Gen) to the category pairs, and find sets of similar categories. Moreover, we examined whether integrating hierarchies helps to support retrieval of documents with similar contents. The retrieval results showed a 42.7% improvement over the baseline nonhierarchy model, and a 21.6% improvement over a single hierarchy.
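
The second and third steps of the method can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact formulation, so the 2x2 contingency-table form of the χ² statistic, the function names `chi_square` and `apriori_gen`, the category labels `r1`, `u1`, `u2`, and the use of 3.841 as a cutoff (the standard critical value for one degree of freedom at p = 0.05) are all assumptions made for illustration.

```python
from itertools import combinations


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 contingency table (assumed formulation).

    n11: documents assigned to both category A and category B
    n10: documents assigned to A but not B
    n01: documents assigned to B but not A
    n00: documents assigned to neither
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0  # a degenerate table carries no evidence of association
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def apriori_gen(similar_sets):
    """Generate size-(k+1) candidate sets from size-k sets (join, then prune)."""
    prev = {frozenset(s) for s in similar_sets}
    if not prev:
        return set()
    k = len(next(iter(prev)))
    candidates = set()
    for a, b in combinations(prev, 2):
        union = a | b
        if len(union) != k + 1:
            continue  # join step: the two sets must differ in exactly one item
        # Prune step: keep the candidate only if every size-k subset is known.
        if all(frozenset(sub) in prev for sub in combinations(union, k)):
            candidates.add(union)
    return candidates


# Hypothetical labels: r1 from one hierarchy, u1 and u2 from the other.
CRITICAL_VALUE = 3.841  # assumed cutoff: 1 degree of freedom, p = 0.05
tables = {
    ("r1", "u1"): (10, 0, 0, 10),
    ("r1", "u2"): (9, 1, 1, 9),
    ("u1", "u2"): (9, 1, 1, 9),
}
similar_pairs = [p for p, t in tables.items() if chi_square(*t) > CRITICAL_VALUE]
similar_sets = apriori_gen(similar_pairs)
```

In this toy input all three pairs pass the assumed cutoff, so `apriori_gen` combines them into a single three-category set. Note that the χ² statistic is large for negative as well as positive association, so a fuller version would presumably also check the direction of the association before treating a pair as similar.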