
Title

A cascaded approach to normalising gene mentions in biomedical literature

 

Authors

Hui Yang¹, Goran Nenadic¹,*, John A. Keane¹

Affiliation

¹School of Computer Science, University of Manchester, Manchester, UK

 

Email

G.Nenadic@manchester.ac.uk; * Corresponding author

 

Article Type

Hypothesis

 

Date

revised September 30, 2007; accepted October 21, 2007; published online December 30, 2007

 

Abstract

Linking gene and protein names mentioned in the literature to unique identifiers in referent genomic databases is an essential step in accessing and integrating knowledge in the biomedical domain. However, it remains a challenging task because of lexical and terminological variation and the ambiguity of gene name mentions in documents. We present a generic and effective rule-based approach to linking gene mentions in the literature to referent genomic databases, in which both the gene synonyms in the databases and the gene mentions in text are first pre-processed. The mapping method employs a cascaded approach that combines exact, exact-like and token-based approximate matching, using flexible representations of a gene synonym dictionary and of gene mentions generated during the pre-processing phase. We also consider multi-gene name mentions and permutation of components in gene names. A systematic evaluation of the suggested methods identified the steps that are beneficial for improving either precision or recall in gene name identification. Experiments on the BioCreAtIvE2 data sets (identification of human gene names) showed that our methods achieved highly encouraging results, with an F-measure of up to 81.20%.
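
The cascade described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the normalisation rule, the Jaccard token overlap, the `min_overlap` threshold, the function names and the identifiers `GENE_A` and `GENE_B` are all assumptions made for illustration, and multi-gene name mentions are not handled. It shows only the general idea of trying exact matching first, then an exact-like match on a normalised form, then an order-insensitive token comparison.

```python
import re

def normalise(name):
    """Exact-like form (assumed rule): lowercase, keep only letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

def tokens(name):
    """Unordered set of lowercase alphabetic and numeric tokens."""
    return frozenset(re.findall(r"[a-z]+|[0-9]+", name.lower()))

def build_indexes(synonyms):
    """Pre-process a {gene_id: [synonym, ...]} dictionary into three lookups."""
    exact, exact_like, token_sets = {}, {}, []
    for gene_id, names in synonyms.items():
        for name in names:
            exact.setdefault(name, set()).add(gene_id)
            exact_like.setdefault(normalise(name), set()).add(gene_id)
            token_sets.append((tokens(name), gene_id))
    return exact, exact_like, token_sets

def link_mention(mention, indexes, min_overlap=0.75):
    """Try each matching stage in turn and stop at the first that succeeds."""
    exact, exact_like, token_sets = indexes
    # Stage 1: exact string match against the synonym dictionary.
    if mention in exact:
        return "exact", exact[mention]
    # Stage 2: exact-like match on the normalised form.
    key = normalise(mention)
    if key in exact_like:
        return "exact-like", exact_like[key]
    # Stage 3: token-based approximate match (Jaccard overlap, an assumption).
    # Token sets ignore order, so permuted name components still match.
    mention_tokens = tokens(mention)
    hits = set()
    for synonym_tokens, gene_id in token_sets:
        union = mention_tokens | synonym_tokens
        if union and len(mention_tokens & synonym_tokens) / len(union) >= min_overlap:
            hits.add(gene_id)
    return ("approximate", hits) if hits else ("none", set())

# Toy dictionary with hypothetical identifiers (not real database IDs).
toy_synonyms = {
    "GENE_A": ["interleukin 2 receptor alpha", "IL2RA"],
    "GENE_B": ["tumor necrosis factor", "TNF"],
}
idx = build_indexes(toy_synonyms)

for mention in ["IL2RA", "IL-2RA", "receptor alpha, interleukin 2", "p53"]:
    print(mention, "->", link_mention(mention, idx))
```

Ordering the stages from strictest to loosest means a mention is resolved by the most reliable evidence available, and the looser stages are consulted only when the stricter ones find nothing.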

 

Keywords

gene name normalisation; gene name mapping; lexical variability; text mining

Citation

Yang et al., Bioinformation 2(5): 197-206 (2007)

 

Edited by

A.T. Heiny, T. W. Tan & S. Ranganathan

 

ISSN

0973-2063

 

Publisher

Biomedical Informatics

License

This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.