

Title

Biomarker Identification from RNA-Seq Data using a Robust Statistical Approach

 

Authors

Zobaer Akond1, 2, 4,*, Munirul Alam2, Md. Nurul Haque Mollah3

 

Affiliation

1Agricultural Statistics and Information & Communication Technology (ASICT) Division, Bangladesh Agricultural Research Institute (BARI), Joydebpur, Gazipur-1701, Bangladesh;

2Institute of Environmental Science, University of Rajshahi-6205, Bangladesh;

3Emerging Infections, Infectious Diseases Division, International Centre for Diarrheal Disease Research, Bangladesh (icddr,b);

4Bioinformatics Lab, Department of Statistics, University of Rajshahi, Rajshahi-6205, Bangladesh;

 

Email

akond25@yahoo.com;

 

Article Type

Hypothesis

 

Date

Received March 5, 2018; Revised April 2, 2018; Accepted April 5, 2018; Published April 30, 2018

 

Abstract

Identifying biomarkers as differentially expressed genes (DEGs) from RNA-sequencing data is an important task in characterizing transcriptomic data, made possible by advances in next-generation sequencing (NGS) technology. A number of statistical techniques, such as edgeR, SAMSeq, and voom-limma, identify DEGs from high-dimensional RNA-seq count data across different groups or conditions. However, these methods produce many false positives and low accuracy in the presence of outliers. We describe a robust t-statistic method to overcome these drawbacks and evaluate it on both simulated and real RNA-seq datasets. At 20% outliers, the reported model performance is 61.2% sensitivity, 35.2% specificity, 21.6% misclassification error rate (MER), 6.9% false discovery rate (FDR), 74.5% area under the ROC curve (AUC), 78.4% accuracy (ACC), 93.1% positive predictive value (PPV), and 35.2% negative predictive value (NPV). Using the robust t-test, we identified 409 DE genes with p-values < 0.05 in a real dataset comparing the HIV viremic and aviremic states. Of these, 28 genes are up-regulated and 381 are down-regulated, as estimated by the log2 fold change (FC) approach at a threshold value of 1.5. The up-regulated genes form three clusters, and 11 genes are found to be highly associated with HIV-1/AIDS. Protein-protein interaction (PPI) analysis of the up-regulated genes using the STRING database found 21 genes with strong associations among themselves. Thus, we demonstrate the identification of potential biomarkers from an RNA-seq dataset using a robust t-statistic model.
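For readers unfamiliar with the performance measures listed in the abstract, the following minimal sketch shows how they are conventionally computed from a confusion matrix of true and called DEGs. The function name and the gene counts are hypothetical and chosen only for illustration; they are not taken from the paper, and AUC is omitted because it requires per-gene scores rather than counts. Under these standard definitions ACC = 1 - MER and PPV = 1 - FDR, which agrees with the reported pairs (78.4% and 21.6%; 93.1% and 6.9%).

```python
def classification_metrics(tp, fp, tn, fn):
    """Conventional confusion-matrix summaries for a DEG caller.

    tp: true DEGs called DE        fp: non-DEGs called DE
    tn: non-DEGs called non-DE     fn: true DEGs called non-DE
    """
    total = tp + fp + tn + fn
    return {
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "MER": (fp + fn) / total,        # misclassification error rate
        "FDR": fp / (tp + fp),           # false discovery rate
        "ACC": (tp + tn) / total,        # accuracy
        "PPV": tp / (tp + fp),           # positive predictive value
        "NPV": tn / (tn + fn),           # negative predictive value
    }


# Hypothetical counts for 1000 genes (illustration only, not from the paper).
metrics = classification_metrics(tp=60, fp=5, tn=895, fn=40)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```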

 

Keywords

RNA-seq data, differentially expressed genes, robust t-statistic, gene-disease network, protein-protein interaction.

 

Citation

Akond et al. Bioinformation 14(4): 153-163 (2018)

 

Edited by

P Kangueane

 

ISSN

0973-2063

 

Publisher

Biomedical Informatics

 

License

This is an Open Access article which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. This is distributed under the terms of the Creative Commons Attribution License.