Integrating Cross-Language Hierarchies and Its Application to Retrieving Relevant Documents

Authors:
Fumiyo Fukumoto;Yoshimi Suzuki
Affiliations:
University of Yamanashi;University of Yamanashi
Venue:
ACM Transactions on Asian Language Information Processing (TALIP)
Year:
2008

Abstract

Internet directories such as Yahoo! are an approach to improving the efficacy and efficiency of Information Retrieval (IR) on the Web, as pages (documents) are organized into hierarchical categories, and similar pages are grouped together. Most of the search engines on the Web find documents that are assigned to a single classification hierarchy. Categories in the hierarchy are carefully defined by human experts and documents are well organized. However, a single hierarchy in one language is often insufficient to find all relevant material, as each hierarchy tends to have some bias in both defining hierarchical structure and classifying documents. Moreover, documents written in a language other than the user's native language often include large amounts of information related to the user's request. In this article, we propose a method of integrating cross-language (CL) category hierarchies, that is, the Reuters 96 hierarchy and the Japanese UDC code hierarchy, by estimating category similarities. The method does not simply merge two different hierarchies into one large hierarchy but instead extracts sets of similar categories, where the elements of each set are relevant to one another. It consists of three steps. First, we classify documents from one hierarchy into categories of the other hierarchy using a cross-language text classification (CLTC) technique, and extract category pairs from the two hierarchies. Next, we apply χ² statistics to these pairs to obtain similar category pairs, and finally we apply the generating function of the Apriori algorithm (Apriori-Gen) to the category pairs, and find sets of similar categories. Moreover, we examined whether integrating hierarchies helps to support retrieval of documents with similar contents. The retrieval results showed a 42.7% improvement over the baseline nonhierarchy model, and a 21.6% improvement over a single hierarchy.
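
The second and third steps of the method can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the CLTC step has already produced a table of counts, where each entry records how many documents of a source-hierarchy category were classified into a target-hierarchy category. The function names, the category labels, and the threshold of 3.84 (the χ² critical value at p = 0.05 with one degree of freedom) are assumptions made for illustration; the abstract does not state the threshold the article uses, nor exactly how Apriori-Gen is applied to the pairs.

```python
from itertools import combinations


def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for a 2x2 contingency table.

    a: documents in source category s and assigned to target category t
    b: documents in s but not assigned to t
    c: documents not in s but assigned to t
    d: documents in neither
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def similar_pairs(counts, threshold=3.84):
    """Keep category pairs that are positively associated.

    counts maps (source_category, target_category) to a document count.
    The threshold is an assumed value, not one taken from the article.
    """
    total = sum(counts.values())
    row, col = {}, {}
    for (s, t), k in counts.items():
        row[s] = row.get(s, 0) + k
        col[t] = col.get(t, 0) + k
    pairs = set()
    for (s, t), a in counts.items():
        b = row[s] - a
        c = col[t] - a
        d = total - a - b - c
        # a*d > b*c restricts the result to positive associations.
        if a * d > b * c and chi_square_2x2(a, b, c, d) >= threshold:
            pairs.add(frozenset((s, t)))
    return pairs


def apriori_gen(sets_k):
    """Candidate generation in the style of Apriori-Gen.

    Joins k-element sets into (k+1)-element candidates, then prunes any
    candidate that has a k-element subset missing from sets_k.
    """
    sets_k = set(sets_k)
    candidates = set()
    for x, y in combinations(sets_k, 2):
        union = x | y
        if len(union) != len(x) + 1:
            continue
        subsets = combinations(union, len(x))
        if all(frozenset(sub) in sets_k for sub in subsets):
            candidates.add(union)
    return candidates
```

For example, given the similar pairs {A, B}, {A, C}, {B, C}, and {A, D}, `apriori_gen` yields the single three-element candidate {A, B, C}, because {A, B, D} and {A, C, D} each contain a pair that is not in the input.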