A Hierarchical n-Grams Extraction Approach for Classification Problem

Authors:
Faouzi Mhamdi;Ricco Rakotomalala;Mourad Elloumi
Affiliations:
UTIC, Unité de recherche en Technologies de l'Information et de la Communication, École Supérieure des Sciences et Techniques de Tunis, Tunisie;Laboratoire ERIC, Université Lyon 2, France;UTIC, Unité de recherche en Technologies de l'Information et de la Communication, École Supérieure des Sciences et Techniques de Tunis, Tunisie
Venue:
Advanced Internet Based Systems and Applications
Year:
2009

Abstract

We are interested in classifying proteins based on their primary structures. The goal is to automatically assign protein sequences to their families. This task requires extracting a set of descriptors that are then presented to supervised learning algorithms. Many types of descriptors are used in the literature; the most popular is the n-gram, a sequence of n consecutive characters. The standard n-gram approach consists of first fixing the parameter n, extracting the corresponding n-gram descriptors, and working with this value throughout the data mining process. In this paper, we propose a hierarchical approach to n-gram construction. The goal is to obtain descriptors of varying length that better characterize the protein families. This approach aims to reflect the domain knowledge of biologists: the patterns that characterize a protein family most often vary in length. Our idea is to transpose the principle of frequent itemset extraction, mainly used in association rule mining, to n-gram extraction in the context of protein classification. Experiments show that the new approach is consistent with biological reality and achieves the same accuracy as the standard approach.
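
To make the contrast between fixed-length and hierarchical extraction concrete, here is a minimal Python sketch. It is not the authors' implementation: the function names (`ngrams`, `hierarchical_ngrams`), the parameters (`min_support`, `max_n`), and the definition of support as the fraction of sequences containing an n-gram are all assumptions made for illustration. The level-wise pruning borrows the Apriori idea from frequent itemset mining: an (n+1)-gram is only counted if both its prefix and its suffix n-grams were frequent at the previous level.

```python
from collections import Counter


def ngrams(sequence, n):
    """Standard fixed-n extraction: every contiguous substring of length n."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def hierarchical_ngrams(sequences, min_support, max_n):
    """Level-wise extraction of frequent n-grams of varying length.

    Assumption (not from the paper): the support of an n-gram is the
    fraction of sequences that contain it at least once.
    Returns a dict mapping each length n to its set of frequent n-grams.
    """
    min_count = min_support * len(sequences)
    frequent = {}
    previous = None  # frequent (n-1)-grams from the previous level
    for n in range(1, max_n + 1):
        counts = Counter()
        for seq in sequences:
            # Apriori-style pruning: count an n-gram only if its prefix
            # and suffix of length n-1 were both frequent.
            candidates = {
                g for g in ngrams(seq, n)
                if previous is None
                or (g[:-1] in previous and g[1:] in previous)
            }
            counts.update(candidates)
        current = {g for g, c in counts.items() if c >= min_count}
        if not current:
            break  # no frequent n-gram left, so no longer one can exist
        frequent[n] = current
        previous = current
    return frequent


# Toy sequences over the amino-acid alphabet (made up for illustration).
toy_sequences = ["MKVLA", "MKVTA", "GKVLA"]
descriptors = hierarchical_ngrams(toy_sequences, min_support=0.6, max_n=5)
```

On this toy input the longest surviving descriptor is the 4-gram `KVLA`, which occurs in two of the three sequences. In a classification setting, the union of the sets in `descriptors` would serve as the feature vocabulary in place of a single fixed-n vocabulary.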