Associating genes with gene ontology codes using a maximum entropy analysis of biomedical literature

Soumya Raychaudhuri, Jeffrey T. Chang, Patrick D. Sutphin, Russ B. Altman

Research output: Contribution to journal › Article

136 Citations (Scopus)

Abstract

Functional characterizations of thousands of gene products from many species are described in the published literature. These discussions are extremely valuable for characterizing the functions not only of these gene products, but also of their homologs in other organisms. The Gene Ontology (GO) is an effort to create a controlled terminology for labeling gene functions in a more precise, reliable, computer-readable manner. Currently, the best annotations of gene function with the GO are performed by highly trained biologists who read the literature and select appropriate codes. In this study, we explored the possibility that statistical natural language processing techniques can be used to assign GO codes. We compared three document classification methods (maximum entropy modeling, naïve Bayes classification, and nearest-neighbor classification) on the problem of associating a set of GO codes (for biological process) with literature abstracts and thus with the genes associated with those abstracts. We showed that maximum entropy modeling outperforms the other methods and achieves an accuracy of 72% when ascertaining the function discussed within an abstract. The maximum entropy method provides confidence measures that correlate well with performance. We conclude that statistical methods may be used to assign GO codes and may be useful for the difficult task of reassignment as terminology standards evolve over time.
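
As a rough illustration of the approach described in the abstract, the sketch below trains a small maximum entropy (multinomial logistic regression) classifier on bag-of-words features and reports a per-label probability that can serve as a confidence measure. It is not the authors' implementation: the training texts, the label names, the function names, the binary word features, and the plain gradient-ascent training loop with L2 regularization are all assumptions chosen to keep the example short.

```python
import math
from collections import defaultdict

# Made-up toy "abstracts"; labels are illustrative process names, not real GO codes.
TRAIN = [
    ("kinase phosphorylation signal cascade receptor", "signal transduction"),
    ("receptor ligand binding activates signal pathway", "signal transduction"),
    ("glucose enzyme pathway catalysis metabolic flux", "metabolism"),
    ("enzyme catalysis of lipid metabolic breakdown", "metabolism"),
    ("cyclin checkpoint mitosis division spindle", "cell cycle"),
    ("mitosis spindle checkpoint arrest during division", "cell cycle"),
]


def features(text):
    """Binary bag-of-words features: the set of lowercased tokens."""
    return set(text.lower().split())


def predict_proba(w, labels, feats):
    """Return P(label | document) under the log-linear (maximum entropy) model."""
    scores = {lab: sum(w.get((lab, f), 0.0) for f in feats) for lab in labels}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {lab: math.exp(s - top) for lab, s in scores.items()}
    z = sum(exps.values())
    return {lab: e / z for lab, e in exps.items()}


def train_maxent(data, epochs=200, lr=0.5, l2=0.01):
    """Fit weights by batch gradient ascent on the regularized log-likelihood."""
    labels = sorted({y for _, y in data})
    w = defaultdict(float)  # one weight per (label, word) pair
    for _ in range(epochs):
        grad = defaultdict(float)
        for text, y in data:
            feats = features(text)
            p = predict_proba(w, labels, feats)
            for lab in labels:
                # Observed minus expected feature count for this label.
                err = (1.0 if lab == y else 0.0) - p[lab]
                for f in feats:
                    grad[(lab, f)] += err
        for key, g in grad.items():
            w[key] += lr * (g / len(data) - l2 * w[key])
    return w, labels


if __name__ == "__main__":
    weights, label_set = train_maxent(TRAIN)
    probs = predict_proba(weights, label_set, features("spindle checkpoint mitosis"))
    best = max(probs, key=probs.get)
    print(best, round(probs[best], 3))
```

The probability assigned to the top label plays the role of the confidence measure mentioned in the abstract: predictions could be accepted only above a chosen threshold and otherwise left for a human curator.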

Original language: English (US)
Pages (from-to): 203-214
Number of pages: 12
Journal: Genome Research
Volume: 12
Issue number: 1
DOI: 10.1101/gr.199701
State: Published - 2002

Fingerprint

Gene Ontology
Entropy
Genes
Terminology
Natural Language Processing
Molecular Sequence Annotation
Biological Phenomena

ASJC Scopus subject areas

  • Genetics

Cite this

Associating genes with gene ontology codes using a maximum entropy analysis of biomedical literature. / Raychaudhuri, Soumya; Chang, Jeffrey T.; Sutphin, Patrick D.; Altman, Russ B.

In: Genome Research, Vol. 12, No. 1, 2002, p. 203-214.

@article{5b153c22b8e74c0a998ed88d623db667,
title = "Associating genes with gene ontology codes using a maximum entropy analysis of biomedical literature",
abstract = "Functional characterizations of thousands of gene products from many species are described in the published literature. These discussions are extremely valuable for characterizing the functions not only of these gene products, but also of their homologs in other organisms. The Gene Ontology (GO) is an effort to create a controlled terminology for labeling gene functions in a more precise, reliable, computer-readable manner. Currently, the best annotations of gene function with the GO are performed by highly trained biologists who read the literature and select appropriate codes. In this study, we explored the possibility that statistical natural language processing techniques can be used to assign GO codes. We compared three document classification methods (maximum entropy modeling, na{\"i}ve Bayes classification, and nearest-neighbor classification) to the problem of associating a set of GO codes (for biological process) to literature abstracts and thus to the genes associated with the abstracts. We showed that maximum entropy modeling outperforms the other methods and achieves an accuracy of 72{\%} when ascertaining the function discussed within an abstract. The maximum entropy method provides confidence measures that correlate well with performance. We conclude that statistical methods may be used to assign GO codes and may be useful for the difficult task of reassignment as terminology standards evolve over time.",
author = "Raychaudhuri, Soumya and Chang, {Jeffrey T.} and Sutphin, {Patrick D.} and Altman, {Russ B.}",
year = "2002",
doi = "10.1101/gr.199701",
language = "English (US)",
volume = "12",
pages = "203--214",
journal = "Genome Research",
issn = "1088-9051",
publisher = "Cold Spring Harbor Laboratory Press",
number = "1",

}

TY - JOUR

T1 - Associating genes with gene ontology codes using a maximum entropy analysis of biomedical literature

AU - Raychaudhuri, Soumya

AU - Chang, Jeffrey T.

AU - Sutphin, Patrick D.

AU - Altman, Russ B.

PY - 2002

Y1 - 2002

AB - Functional characterizations of thousands of gene products from many species are described in the published literature. These discussions are extremely valuable for characterizing the functions not only of these gene products, but also of their homologs in other organisms. The Gene Ontology (GO) is an effort to create a controlled terminology for labeling gene functions in a more precise, reliable, computer-readable manner. Currently, the best annotations of gene function with the GO are performed by highly trained biologists who read the literature and select appropriate codes. In this study, we explored the possibility that statistical natural language processing techniques can be used to assign GO codes. We compared three document classification methods (maximum entropy modeling, naïve Bayes classification, and nearest-neighbor classification) to the problem of associating a set of GO codes (for biological process) to literature abstracts and thus to the genes associated with the abstracts. We showed that maximum entropy modeling outperforms the other methods and achieves an accuracy of 72% when ascertaining the function discussed within an abstract. The maximum entropy method provides confidence measures that correlate well with performance. We conclude that statistical methods may be used to assign GO codes and may be useful for the difficult task of reassignment as terminology standards evolve over time.

UR - http://www.scopus.com/inward/record.url?scp=0036144742&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0036144742&partnerID=8YFLogxK

U2 - 10.1101/gr.199701

DO - 10.1101/gr.199701

M3 - Article

C2 - 11779846

AN - SCOPUS:0036144742

VL - 12

SP - 203

EP - 214

JO - Genome Research

JF - Genome Research

SN - 1088-9051

IS - 1

ER -