Text categorization of multilingual web pages in specific domain

Authors:
Jicheng Liu; Chunyan Liang
Affiliations:
North China Electric Power University, Beijing, China; North China Electric Power University, Beijing, China
Venue:
PAKDD'08 Proceedings of the 12th Pacific-Asia conference on Advances in knowledge discovery and data mining
Year:
2008

Abstract

Compared with traditional text categorization, automated categorization of domain-specific web pages poses new research challenges because the pages are noisy and diverse and the category structure is fine-grained and complex. For multilingual web pages, terms in different languages must also be extracted accurately. Using a dataset of mixed Chinese-English chemistry web pages, this paper proposes a new dictionary-based multilingual text categorization approach that aims to classify the pages into a hierarchical topic structure more accurately. An automatic encoding detection and integration method lets the approach recognize and unify the web page encodings correctly, which makes feature extraction more precise for multilingual pages. The approach also uses a chemistry dictionary to emphasize the domain concepts in the web pages. Experimental results show that the proposed approach performs better than the traditional categorization method when classifying multilingual web pages in a specific domain.
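
The abstract does not say how encodings are detected or how dictionary terms are emphasized, so the following is only a minimal sketch of the two ideas under stated assumptions. The candidate encoding list, the tiny chemistry dictionary, the boost factor, and the function names `decode_page` and `extract_features` are all hypothetical and are not taken from the paper.

```python
import re
from collections import Counter

# Assumption: pages are either UTF-8 or a GB-family encoding. Strict UTF-8 is
# tried first because it rejects most byte sequences produced by other encodings.
CANDIDATE_ENCODINGS = ("utf-8", "gb18030")

# Hypothetical stand-in for a chemistry dictionary (Chinese and English terms).
CHEM_DICTIONARY = {"benzene", "ethanol", "苯", "乙醇"}
BOOST = 2.0  # hypothetical weight multiplier for dictionary terms


def decode_page(raw: bytes) -> str:
    """Decode page bytes into one common representation (a Unicode string)."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep what can be decoded and mark the rest as U+FFFD.
    return raw.decode("utf-8", errors="replace")


def extract_features(text, dictionary=CHEM_DICTIONARY, boost=BOOST):
    """Return term weights, with dictionary terms weighted more heavily."""
    lowered = text.lower()
    # English terms: plain alphabetic tokens, weighted by raw frequency.
    counts = Counter(re.findall(r"[a-z]+", lowered))
    weights = {term: float(n) for term, n in counts.items()}
    for term in dictionary:
        # ASCII terms are matched as whole tokens; Chinese terms are matched
        # as substrings because the text has no word boundaries.
        hits = counts[term] if term.isascii() else lowered.count(term)
        if hits:
            weights[term] = boost * hits
    return weights


if __name__ == "__main__":
    raw = "苯 (benzene) 溶于乙醇; benzene is a solvent".encode("gbk")
    print(extract_features(decode_page(raw)))
```

Decoding every page to Unicode before term extraction is what lets a single dictionary lookup serve both languages. A real system would likely replace the trial-decoding loop with a statistical detector and the substring match with a proper Chinese word segmenter.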