Robust Feature Selection Approach for Patient Classification using Gene Expression Data



Md. Shahjaman1,2*, Nishith Kumar1,3, Md. Shakil Ahmed1, Anjuman Ara Begum1, S. M. Shahinul Islam4, Md. Nurul Haque Mollah1



1Bioinformatics Lab, Department of Statistics, University of Rajshahi-6205, Bangladesh;

2Department of Statistics, Begum Rokeya University, Rangpur-5400, Bangladesh;

3Department of Statistics, Bangabandhu Sheikh Mujibur Rahman Science and Technology University, Gopalganj, Bangladesh;

4Institute of Biological Science (IBSc), University of Rajshahi, Rajshahi-6205, Bangladesh;








Received August 26, 2017; Revised September 11, 2017; Accepted September 12, 2017; Published October 31, 2017



Patient classification through feature selection (FS) based on gene expression data (GED) has become popular in the research community. The t-test is a well-known statistical FS method in GED analysis. However, it produces more false positives and lower accuracy with small sample sizes or in the presence of outliers. To overcome the shortcomings of the t-test with small sample sizes, significance analysis of microarrays (SAM) has been applied to GED, but it is highly sensitive to outliers. Recently, robust SAM based on minimum β-divergence estimators has overcome these problems of the classical t-test and SAM, and it has been successfully applied to the identification of differentially expressed (DE) genes. However, it has not yet been applied to classification. Therefore, in this paper, we employ robust SAM as a feature selection approach together with classifiers for patient classification. We compare the performance of robust SAM with that of the classical t-test and SAM, combined with four popular classifiers (LDA, KNN, SVM and naive Bayes), using both simulated and real gene expression datasets. The results from the simulation and real data analyses show that the four classifiers perform better with robust SAM than with the classical t-test or SAM. From a real colon cancer dataset, we identified 21 additional DE genes using robust SAM that were not identified by the classical t-test or SAM. To reveal the biological functions and pathways of these 21 genes, we performed KEGG pathway enrichment analysis and found that these genes are involved in several important cancer-related pathways.
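
The outlier sensitivity of t-test-based feature selection can be illustrated with a minimal sketch. It is not the robust SAM procedure of this paper; it only computes a Welch two-sample t-statistic per gene and ranks genes by its absolute value. The function names (`welch_t`, `rank_genes`) and the toy expression values are hypothetical and were made up for illustration.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch two-sample t-statistic for one gene (unequal variances)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def rank_genes(group_a, group_b, top_k):
    """Return indices of the top_k genes ranked by |t| (largest first)."""
    scores = [abs(welch_t(a, b)) for a, b in zip(group_a, group_b)]
    order = sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)
    return order[:top_k]


# Toy data (illustrative only): rows are genes, columns are patients.
cancer = [
    [5.1, 4.9, 5.0, 5.2, 4.8],  # gene 0: clearly shifted between groups
    [2.0, 2.4, 1.7, 2.2, 1.9],  # gene 1: no real difference
]
normal = [
    [3.0, 3.1, 2.9, 3.2, 2.8],
    [2.1, 1.8, 2.3, 2.0, 1.9],
]

selected = rank_genes(cancer, normal, top_k=1)
clean_t = welch_t(cancer[0], normal[0])

# Replace one cancer measurement of gene 0 with an outlying value.
contaminated = [5.1, 4.9, 5.0, 5.2, 25.0]
outlier_t = welch_t(contaminated, normal[0])

print(selected, round(clean_t, 2), round(outlier_t, 2))
```

In this toy example a single outlying measurement inflates the sample variance so much that the t-statistic of gene 0 falls from about 20 to about 1.5, even though the difference in group means becomes larger. This is the kind of behaviour that motivates a robust estimator of the mean and variance in the test statistic.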



Feature selection, classification, robust SAM, β-divergence estimators.



Shahjaman et al. Bioinformation 13(10): 327-332 (2017)


Edited by

P Kangueane






Biomedical Informatics



This is an Open Access article which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. This is distributed under the terms of the Creative Commons Attribution License.