Data Mining on DNA Sequences of Hepatitis B Virus

Authors:
Kwong-Sak Leung; Kin-Hong Lee; Jin-Feng Wang; Eddie YT Ng; Henry LY Chan; Stephen KW Tsui; Tony SK Mok; Pete Chi-Hang Tse; Joseph JY Sung
Affiliations:
The Chinese University of Hong Kong, Hong Kong;The Chinese University of Hong Kong, Hong Kong;The Chinese University of Hong Kong, Hong Kong;The Chinese University of Hong Kong, Hong Kong;The Chinese University of Hong Kong, Hong Kong;The Chinese University of Hong Kong, Hong Kong;The Chinese University of Hong Kong, Hong Kong;The Chinese University of Hong Kong, Hong Kong;The Chinese University of Hong Kong, Hong Kong
Venue:
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Year:
2011

Abstract

Extracting meaningful information from large experimental data sets is a key element of bioinformatics research. One challenge is to identify genomic markers in Hepatitis B Virus (HBV) that are associated with the development of hepatocellular carcinoma (HCC, liver cancer) by comparing the complete HBV genomic sequences of patients with HCC against those of patients without HCC. This study introduces a data mining framework comprising molecular evolution analysis, clustering, feature selection, classifier learning, and classification. Our research group collected HBV DNA sequences of genotype B or C from over 200 patients specifically for this project. In the molecular evolution analysis and clustering, three subgroups were identified in genotype C, and a clustering method was developed to separate them. In the feature selection process, potential markers are selected by Information Gain for subsequent classifier learning. Meaningful rules are then learned by our algorithm, called Rule Learning, which is based on an evolutionary algorithm. A new classification method based on the Nonlinear Integral has also been developed. Its good performance comes from the use of a fuzzy measure and the associated nonlinear integral: the nonadditivity of the fuzzy measure reflects the importance of the feature attributes as well as their interactions. Both classifiers give explicit information on the importance of individual mutated sites and of their interactions for the classification (in our case, potential causes of liver cancer). A thorough comparison of these two methods with existing methods is presented. Important mutation markers (sites) have been found for genotype B and for each of the genotype C subgroups C1, C2, and C3. Both classification methods were validated by classifying previously unseen examples. The results show more than 70 percent accuracy and 80 percent sensitivity on most data sets, figures considered high for an initial scanning method for liver cancer diagnosis.
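
The feature selection step described in the abstract ranks candidate sites by Information Gain. The following is a minimal sketch of that idea, not the authors' implementation: it assumes the sequences are already aligned to equal length, treats each alignment column as one categorical feature, and uses made-up toy sequences and labels. The function names (entropy, information_gain, rank_sites) are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(site_values, labels):
    """Entropy reduction obtained by splitting the labels on one site's value."""
    total = len(labels)
    groups = {}
    for value, label in zip(site_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy that remains after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_sites(sequences, labels):
    """Return (site_index, gain) pairs sorted by decreasing information gain."""
    n_sites = len(sequences[0])
    gains = [
        (i, information_gain([seq[i] for seq in sequences], labels))
        for i in range(n_sites)
    ]
    return sorted(gains, key=lambda pair: pair[1], reverse=True)


# Toy data (invented for illustration): four aligned sequences and their classes.
sequences = ["ACGT", "ACGA", "TCGT", "TCGA"]
labels = ["HCC", "HCC", "control", "control"]

ranking = rank_sites(sequences, labels)
for site, gain in ranking:
    print(f"site {site}: gain = {gain:.3f}")
```

In this toy input only the first column separates the two classes, so it ranks first; constant columns and columns unrelated to the class receive zero gain. A real pipeline would additionally need to handle alignment gaps and ambiguous bases, which this sketch ignores.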