Sparse distance-based learning for simultaneous multiclass classification and feature selection of metagenomic data

Zhenqiu Liu, William Hsiao, Brandi L. Cantarel, Elliott Franco Drábek, Claire Fraser-Liggett

Research output: Contribution to journal › Article

30 Citations (Scopus)

Abstract

Motivation: Direct sequencing of microbes in human ecosystems (the human microbiome) has complemented single genome cultivation and sequencing to understand and explore the impact of commensal microbes on human health. As sequencing technologies improve and costs decline, the sophistication of data has outgrown available computational methods. While several existing machine learning methods have been adapted for analyzing microbiome data recently, there is not yet an efficient and dedicated algorithm available for multiclass classification of human microbiota.

Results: By combining instance-based and model-based learning, we propose a novel sparse distance-based learning method for simultaneous class prediction and feature (variable or taxa, which is used interchangeably) selection from multiple treatment populations on the basis of 16S rRNA sequence count data. Our proposed method simultaneously minimizes the intraclass distance and maximizes the interclass distance with many fewer estimated parameters than other methods. It is very efficient for problems with small sample sizes and unbalanced classes, which are common in metagenomic studies. We implemented this method in a MATLAB toolbox called MetaDistance. We also propose several approaches for data normalization and variance stabilization transformation in MetaDistance. We validate this method on several real and simulated 16S rRNA datasets to show that it outperforms existing methods for classifying metagenomic data. This article is the first to address simultaneous multifeature selection and class prediction with metagenomic count data.
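The general idea of a feature-weighted distance classifier on count data can be illustrated with a small Python sketch. This is not the MetaDistance algorithm: the abstract does not give the objective function or the optimization procedure, so the weights `w` below are set by hand rather than learned, the square-root transform is only one assumed choice of variance stabilization, and all function names and the toy counts are hypothetical.

```python
import numpy as np

def normalize_counts(counts):
    """Convert raw counts (samples x taxa) to relative abundances and take
    the square root (an assumed variance-stabilizing transform)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # guard against empty samples
    return np.sqrt(counts / totals)

def class_centroids(X, y):
    """Return the sorted class labels and the mean profile of each class."""
    y = np.asarray(y)
    labels = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in labels])
    return labels, centroids

def predict(X, labels, centroids, w):
    """Assign each sample to the class whose centroid is nearest under a
    weighted squared Euclidean distance. A zero weight drops that taxon."""
    w = np.asarray(w, dtype=float)
    diff = X[:, None, :] - centroids[None, :, :]   # samples x classes x taxa
    dist = (w * diff ** 2).sum(axis=2)             # samples x classes
    return labels[dist.argmin(axis=1)]

# Toy data (made up): 6 samples, 4 taxa, 3 classes; the last taxon is noise.
counts = [
    [90,  5,  5, 10],
    [80, 10, 10, 50],
    [ 5, 90,  5, 30],
    [10, 85,  5,  5],
    [ 5,  5, 90, 20],
    [10, 10, 80, 60],
]
y = np.array([0, 0, 1, 1, 2, 2])

X = normalize_counts(counts)
labels, centroids = class_centroids(X, y)

# Hand-set sparse weights (an assumption): the noisy fourth taxon is excluded.
w = np.array([1.0, 1.0, 1.0, 0.0])
predicted = predict(X, labels, centroids, w)
print(predicted)
```

In a sparse method of the kind the abstract describes, the weight vector would be estimated from the data so that within-class distances shrink and between-class distances grow, and taxa whose weights are driven to zero are the ones deselected.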

Original language: English (US)
Article number: btr547
Pages (from-to): 3242-3249
Number of pages: 8
Journal: Bioinformatics
Volume: 27
Issue number: 23
DOI: 10.1093/bioinformatics/btr547
State: Published - Dec 1 2011

Fingerprint

Distance Education
Metagenomics
Multi-class Classification
Feature Selection
Feature extraction
Computational methods
Ecosystems
MATLAB
Learning systems
Microbiota
Sequencing
Stabilization
Genes
Health
Count Data
Costs
Prediction
Small Sample Size
Learning
Ecosystem

ASJC Scopus subject areas

  • Statistics and Probability
  • Medicine (all)
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Computational Theory and Mathematics
  • Computational Mathematics

Cite this

Sparse distance-based learning for simultaneous multiclass classification and feature selection of metagenomic data. / Liu, Zhenqiu; Hsiao, William; Cantarel, Brandi L.; Drábek, Elliott Franco; Fraser-Liggett, Claire.

In: Bioinformatics, Vol. 27, No. 23, btr547, 01.12.2011, p. 3242-3249.

Liu, Zhenqiu ; Hsiao, William ; Cantarel, Brandi L. ; Drábek, Elliott Franco ; Fraser-Liggett, Claire. / Sparse distance-based learning for simultaneous multiclass classification and feature selection of metagenomic data. In: Bioinformatics. 2011 ; Vol. 27, No. 23. pp. 3242-3249.
@article{1fb23cb61f6649068497418f78889250,
title = "Sparse distance-based learning for simultaneous multiclass classification and feature selection of metagenomic data",
abstract = "Motivation: Direct sequencing of microbes in human ecosystems (the human microbiome) has complemented single genome cultivation and sequencing to understand and explore the impact of commensal microbes on human health. As sequencing technologies improve and costs decline, the sophistication of data has outgrown available computational methods. While several existing machine learning methods have been adapted for analyzing microbiome data recently, there is not yet an efficient and dedicated algorithm available for multiclass classification of human microbiota. Results: By combining instance-based and model-based learning, we propose a novel sparse distance-based learning method for simultaneous class prediction and feature (variable or taxa, which is used interchangeably) selection from multiple treatment populations on the basis of 16S rRNA sequence count data. Our proposed method simultaneously minimizes the intraclass distance and maximizes the interclass distance with many fewer estimated parameters than other methods. It is very efficient for problems with small sample sizes and unbalanced classes, which are common in metagenomic studies. We implemented this method in a MATLAB toolbox called MetaDistance. We also propose several approaches for data normalization and variance stabilization transformation in MetaDistance. We validate this method on several real and simulated 16S rRNA datasets to show that it outperforms existing methods for classifying metagenomic data. This article is the first to address simultaneous multifeature selection and class prediction with metagenomic count data.",
author = "Zhenqiu Liu and William Hsiao and Cantarel, {Brandi L.} and Dr{\'a}bek, {Elliott Franco} and Claire Fraser-Liggett",
year = "2011",
month = "12",
day = "1",
doi = "10.1093/bioinformatics/btr547",
language = "English (US)",
volume = "27",
pages = "3242--3249",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "23"
}

TY - JOUR
T1 - Sparse distance-based learning for simultaneous multiclass classification and feature selection of metagenomic data
AU - Liu, Zhenqiu
AU - Hsiao, William
AU - Cantarel, Brandi L.
AU - Drábek, Elliott Franco
AU - Fraser-Liggett, Claire
PY - 2011/12/1
Y1 - 2011/12/1
AB - Motivation: Direct sequencing of microbes in human ecosystems (the human microbiome) has complemented single genome cultivation and sequencing to understand and explore the impact of commensal microbes on human health. As sequencing technologies improve and costs decline, the sophistication of data has outgrown available computational methods. While several existing machine learning methods have been adapted for analyzing microbiome data recently, there is not yet an efficient and dedicated algorithm available for multiclass classification of human microbiota. Results: By combining instance-based and model-based learning, we propose a novel sparse distance-based learning method for simultaneous class prediction and feature (variable or taxa, which is used interchangeably) selection from multiple treatment populations on the basis of 16S rRNA sequence count data. Our proposed method simultaneously minimizes the intraclass distance and maximizes the interclass distance with many fewer estimated parameters than other methods. It is very efficient for problems with small sample sizes and unbalanced classes, which are common in metagenomic studies. We implemented this method in a MATLAB toolbox called MetaDistance. We also propose several approaches for data normalization and variance stabilization transformation in MetaDistance. We validate this method on several real and simulated 16S rRNA datasets to show that it outperforms existing methods for classifying metagenomic data. This article is the first to address simultaneous multifeature selection and class prediction with metagenomic count data.

UR - http://www.scopus.com/inward/record.url?scp=82255194270&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=82255194270&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/btr547
DO - 10.1093/bioinformatics/btr547
M3 - Article
C2 - 21984758
AN - SCOPUS:82255194270
VL - 27
SP - 3242
EP - 3249
JO - Bioinformatics
JF - Bioinformatics
SN - 1367-4803
IS - 23
M1 - btr547
ER -