Prediction of normalized signal strength on DNA sequencing micro arrays by n-grams within a neural network model

Title	Prediction of normalized signal strength on DNA sequencing micro arrays by n-grams within a neural network model
Authors	Charles Chilaka^1,5, Steven Carr^2,3,*, Nabil Shalaby^3,4, Wolfgang Banzhaf^3,6
Affiliation	¹Program in Scientific Computing; ²Department of Biology; ³Department of Computer Science; ⁴Department of Mathematics and Statistics, Memorial University of Newfoundland and St. John’s, Newfoundland, Canada A1C 5S7; ⁵Department of Mathematics, FUT, Owerri, Nigeria; ⁶Present address: Department of Computer Science and Engineering, Michigan State University, East Lansing MI 48824
Email	Steve Carr - Phone: +1 (709) 764 4776 office; E-mail: scarr@mun.ca; *Corresponding author
Article Type	Research Article
Date	Received November 25, 2018; Accepted December 1, 2018; Published May 30, 2019
Abstract	We have shown previously that a feed-forward, back-propagation neural network model based on composite n-grams can predict normalized signal strengths of a microarray-based DNA sequencing experiment. The microarray comprises a 4xN set of 25-base single-stranded DNA molecules ("oligos"), one specific for each of the four possible bases (A, C, G, or T, for Adenine, Cytosine, Guanine, and Thymine, respectively) at each of N positions in the experimental DNA. Strength of binding between reference oligos and experimental DNA varies according to base complementarity, and the strongest signal in any quartet should "call the base" at that position. Variation in base composition of and (or) order within oligos can affect accuracy and (or) confidence of base calls. To evaluate the effect of order, we present oligos as n-gram neural input vectors of degree 3 and measure their performance. Microarray signal intensity data were divided into training, validation, and testing sets. Regression values obtained were >99.80% overall, with very low mean square errors that transform to high best validation performance values. Pattern recognition results showed high percentage confusion matrix values along the diagonal, and receiver operating characteristic curves were clustered in the upper left corner, both indices of good predictive performance. Higher-order n-grams are expected to produce even better predictions.
Keywords	Neural networks, n-grams, Performance, Regression values, Confusion matrix, Receiver Operating Characteristic curves.
Citation	Chilaka et al. Bioinformation 15(6): 388-393 (2019)
Edited by	P Kangueane
ISSN	0973-2063
Publisher	Biomedical Informatics
License	This is an Open Access article which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. This is distributed under the terms of the Creative Commons Attribution License.
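
The n-gram encoding and quartet base call described in the abstract can be illustrated with a minimal Python sketch. It assumes a simple count-based encoding of overlapping 3-grams over the alphabet A, C, G, T; the abstract does not specify how the "composite" n-grams are built, so this is an illustration rather than the authors' method. The function names `ngram_vector` and `call_base`, and the example signal values, are hypothetical.

```python
from itertools import product

ALPHABET = "ACGT"


def ngram_vector(oligo, n=3):
    """Count overlapping n-grams in an oligo.

    Returns a list of length 4**n (64 for n=3), ordered by the
    lexicographic order of n-grams over ALPHABET.
    Assumption: a plain count encoding, not necessarily the paper's.
    """
    index = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=n))}
    counts = [0] * len(index)
    for start in range(len(oligo) - n + 1):
        # A character outside ALPHABET raises KeyError here.
        counts[index[oligo[start:start + n]]] += 1
    return counts


def call_base(quartet_signals):
    """Return the base whose oligo gives the strongest signal in a quartet."""
    return max(quartet_signals, key=quartet_signals.get)


# A hypothetical 25-base oligo and a hypothetical quartet of signals.
oligo = "ACGTACGTACGTACGTACGTACGTA"
vector = ngram_vector(oligo)
called = call_base({"A": 0.10, "C": 0.90, "G": 0.20, "T": 0.05})
```

A 25-base oligo contains 23 overlapping 3-grams, so the 64 counts sum to 23. A vector of this kind could serve as the input layer of a feed-forward network that predicts normalized signal strength.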