Consensus clustering of gene expression data and its application to gene function prediction

Guanghua Xiao, Wei Pan

Research output: Contribution to journal (Article)

4 Citations (Scopus)

Abstract

Predicting functions of genes is an important issue in biology. Clustering gene expression profiles has been widely used for gene function prediction, but most clustering methods are unstable and sensitive to input parameters such as starting values and number of clusters. In this article, we develop a novel consensus clustering method to address the instability issue and thus improve the performance of clustering methods. The biological function of an unannotated gene is predicted based on the most enriched functional category in its consensus cluster. The MIPS gene annotations are used to evaluate the predictive performance. It is shown that the consensus clustering-based classification method has a significantly better predictive performance than a previously used clustering-based classification method while performing as well as support vector machines (SVMs). In addition to the obvious applicability of consensus clustering to unsupervised learning, the method's advantages in supervised learning include its being a multiclass classifier that can be trained much faster than SVMs, its generality to include any of the many existing clustering algorithms, and its flexibility to be integrated with other predictive models built with other types of data, suggesting its potential for further improved performance. As a concrete example, we consider its combined use with protein-protein interaction data for gene function prediction. It is shown that the combined analysis has a significantly higher predictive accuracy and a much broader functional coverage than using either data source alone.
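The abstract does not spell out the algorithm, so the following is only a minimal sketch of one common way to build a consensus clustering, not the authors' method. It assumes k-means as the base clusterer, a co-association matrix (the fraction of runs in which two genes share a cluster), consensus clusters taken as connected components at a chosen threshold, and a hypergeometric tail probability as the enrichment score. All function names, the threshold, and the toy data are hypothetical.

```python
import random
from collections import Counter
from math import comb


def kmeans(points, k, rng, iters=20):
    """Basic k-means from random starting centers; returns one label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def co_association(points, ks, runs, seed=0):
    """Fraction of runs (over several k and random starts) in which each pair co-clusters."""
    rng = random.Random(seed)
    n = len(points)
    counts = [[0] * n for _ in range(n)]
    total = 0
    for k in ks:
        for _ in range(runs):
            labels = kmeans(points, k, rng)
            total += 1
            for i in range(n):
                for j in range(n):
                    if labels[i] == labels[j]:
                        counts[i][j] += 1
    return [[c / total for c in row] for row in counts]


def consensus_clusters(coassoc, threshold=0.5):
    """Connected components of the graph linking pairs with co-association >= threshold."""
    n = len(coassoc)
    cluster_of = [-1] * n
    next_id = 0
    for start in range(n):
        if cluster_of[start] != -1:
            continue
        cluster_of[start] = next_id
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if cluster_of[j] == -1 and coassoc[i][j] >= threshold:
                    cluster_of[j] = next_id
                    stack.append(j)
        next_id += 1
    return cluster_of


def enrichment_p(hits, cluster_size, category_size, population):
    """Hypergeometric upper tail: P(at least `hits` category members in the cluster)."""
    upper = min(category_size, cluster_size)
    tail = sum(
        comb(category_size, x) * comb(population - category_size, cluster_size - x)
        for x in range(hits, upper + 1)
    )
    return tail / comb(population, cluster_size)


def predict(gene, cluster_of, annotations):
    """Most enriched category among annotated genes sharing the (unannotated) gene's cluster."""
    members = [g for g, c in enumerate(cluster_of) if c == cluster_of[gene] and g in annotations]
    if not members:
        return None
    in_cluster = Counter(annotations[g] for g in members)
    overall = Counter(annotations.values())
    return min(
        in_cluster,
        key=lambda cat: enrichment_p(in_cluster[cat], len(members), overall[cat], len(annotations)),
    )


# Toy expression profiles (made up): two tight, well-separated groups of four genes.
profiles = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.2), (0.3, 0.3),
            (10.0, 10.0), (10.1, 10.1), (10.2, 10.2), (10.3, 10.3)]
# Genes 3 and 7 are left unannotated; the category labels are illustrative only.
annotations = {0: "metabolism", 1: "metabolism", 2: "metabolism",
               4: "transport", 5: "transport", 6: "transport"}

coassoc = co_association(profiles, ks=(2, 3), runs=10)
cluster_of = consensus_clusters(coassoc)
predicted = {g: predict(g, cluster_of, annotations) for g in (3, 7)}
print(cluster_of, predicted)
```

Pooling runs over several values of k and many random starts is what makes the result less dependent on any single choice of starting values or number of clusters. Because the base clusterer only has to return labels, any other clustering algorithm could be substituted for `kmeans` in this sketch.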

Original language: English (US)
Pages (from-to): 733-751
Number of pages: 19
Journal: Journal of Computational and Graphical Statistics
Volume: 16
Issue number: 3
DOI: 10.1198/106186007X237838
State: Published - Sep 2007

Fingerprint

Gene Expression Data
Gene expression
Genes
Clustering
Gene
Clustering Methods
Prediction
Support vector machines
Proteins
Gene Expression Profile
Unsupervised learning
Supervised learning
Predictive Model
Protein-protein Interaction
Multi-class
Number of Clusters
Clustering algorithms

Keywords

  • Classification
  • Cross-validation
  • Gene annotation
  • Integrative analysis
  • Microarray
  • Protein-protein interaction

ASJC Scopus subject areas

  • Mathematics(all)
  • Statistics and Probability
  • Computational Mathematics

Cite this

Consensus clustering of gene expression data and its application to gene function prediction. / Xiao, Guanghua; Pan, Wei.

In: Journal of Computational and Graphical Statistics, Vol. 16, No. 3, 09.2007, p. 733-751.

@article{61c3e691f7b148168c9ceb65f38bef99,
title = "Consensus clustering of gene expression data and its application to gene function prediction",
keywords = "Classification, Cross-validation, Gene annotation, Integrative analysis, Microarray, Protein-protein interaction",
author = "Guanghua Xiao and Wei Pan",
year = "2007",
month = "9",
doi = "10.1198/106186007X237838",
language = "English (US)",
volume = "16",
pages = "733--751",
journal = "Journal of Computational and Graphical Statistics",
issn = "1061-8600",
publisher = "American Statistical Association",
number = "3",

}

TY - JOUR

T1 - Consensus clustering of gene expression data and its application to gene function prediction

AU - Xiao, Guanghua

AU - Pan, Wei

PY - 2007/9

Y1 - 2007/9

KW - Classification

KW - Cross-validation

KW - Gene annotation

KW - Integrative analysis

KW - Microarray

KW - Protein-protein interaction

UR - http://www.scopus.com/inward/record.url?scp=35348969985&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=35348969985&partnerID=8YFLogxK

U2 - 10.1198/106186007X237838

DO - 10.1198/106186007X237838

M3 - Article

AN - SCOPUS:35348969985

VL - 16

SP - 733

EP - 751

JO - Journal of Computational and Graphical Statistics

JF - Journal of Computational and Graphical Statistics

SN - 1061-8600

IS - 3

ER -