A comparative study of multiple instance learning methods for cancer detection using T-cell receptor sequences

Danyi Xiong; Ze Zhang; Tao Wang; Xinlei Wang

doi:10.1016/j.csbj.2021.05.038

Research output: Contribution to journal › Review article › peer-review

15 Scopus citations

Abstract

As a branch of machine learning, multiple instance learning (MIL) learns from a collection of labeled bags, each containing a set of instances. The learning process is weakly supervised due to ambiguous instance labels. Since its emergence, MIL has been applied to solve various problems including content-based image retrieval, object tracking/detection, and computer-aided diagnosis. In biomedical research, the use of MIL has been focused on medical image analysis and molecule activity prediction. We review and apply 16 methods to investigate the applicability of MIL to a novel biomedical application, cancer detection using T-cell receptor (TCR) sequences. This important application can be a viable approach for large-scale cancer screening, as TCRs can be easily profiled from a subject's peripheral blood. We consider two feasible data-generating mechanisms, and for the purpose of performance evaluation, we simulate data under each mechanism, where we vary potentially important factors to mimic realistic situations. We also apply the methods to sequencing data of ten cancer types from The Cancer Genome Atlas, as an early proof of concept for distinguishing tumor patients from healthy individuals via TCR sequencing of peripheral blood. We find that, provided an appropriate MIL method is used, satisfactory performance with Area Under the Receiver Operating Characteristic Curve above 80% can be achieved for five of the ten cancers. Based on our numerical results, we make suggestions about selection of a proper method and avoidance of any method with poor performance. We further point out directions of future research as well as identify a pressing need for new MIL methodologies for improved performance (for some cancer types) and more explainable outcomes.
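To make the bag/instance setup in the abstract concrete, here is a minimal Python sketch of the standard MIL assumption (a bag is positive if it contains at least one positive instance), scored by max pooling and evaluated with AUC. It is an illustration only and is not one of the 16 methods the paper reviews. The one-dimensional Gaussian features, the shift of 3.0, the bag size, and the names `make_bag`, `bag_score`, and `auc` are all assumptions made for this example. The witness rate is used here in its usual sense, the fraction of positive instances in a positive bag.

```python
import random

def make_bag(positive, n_instances=50, witness_rate=0.1, rng=random):
    """Simulate one bag of 1-D instance features (hypothetical toy model).

    Background instances are drawn from N(0, 1). A positive bag additionally
    contains a fraction `witness_rate` of instances drawn from N(3, 1).
    """
    n_witness = max(1, round(witness_rate * n_instances)) if positive else 0
    bag = [rng.gauss(3.0, 1.0) for _ in range(n_witness)]
    bag += [rng.gauss(0.0, 1.0) for _ in range(n_instances - n_witness)]
    return bag

def bag_score(bag):
    # Max pooling: the bag is as suspicious as its most suspicious instance,
    # which matches the "at least one positive instance" assumption.
    return max(bag)

def auc(scores, labels):
    # AUC as the probability that a random positive bag outscores a random
    # negative bag, counting ties as one half (Mann-Whitney formulation).
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

rng = random.Random(0)                      # fixed seed for repeatability
labels = [i % 2 for i in range(200)]        # 100 negative, 100 positive bags
bags = [make_bag(y == 1, rng=rng) for y in labels]
scores = [bag_score(b) for b in bags]
print(f"bag-level AUC: {auc(scores, labels):.3f}")
```

Only bag labels are used for evaluation; instance labels never appear, which is what makes the setting weakly supervised. Lowering `witness_rate` or the shift in `make_bag` makes positive bags harder to separate from negative ones in this toy model.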

Original language	English (US)
Pages (from-to)	3255-3268
Number of pages	14
Journal	Computational and Structural Biotechnology Journal
Volume	19
DOIs	https://doi.org/10.1016/j.csbj.2021.05.038
State	Published - Jan 2021

Keywords

Binary classification
Primary instance
T-cell receptor
Weakly supervised learning
Witness rate

ASJC Scopus subject areas

Biotechnology
Biophysics
Structural Biology
Biochemistry
Genetics
Computer Science Applications

Cite this

@article{778fef113c3c46fcbe79395acf0a979d,

title = "A comparative study of multiple instance learning methods for cancer detection using T-cell receptor sequences",

abstract = "As a branch of machine learning, multiple instance learning (MIL) learns from a collection of labeled bags, each containing a set of instances. The learning process is weakly supervised due to ambiguous instance labels. Since its emergence, MIL has been applied to solve various problems including content-based image retrieval, object tracking/detection, and computer-aided diagnosis. In biomedical research, the use of MIL has been focused on medical image analysis and molecule activity prediction. We review and apply 16 methods to investigate the applicability of MIL to a novel biomedical application, cancer detection using T-cell receptor (TCR) sequences. This important application can be a viable approach for large-scale cancer screening, as TCRs can be easily profiled from a subject's peripheral blood. We consider two feasible data-generating mechanisms, and for the purpose of performance evaluation, we simulate data under each mechanism, where we vary potentially important factors to mimic realistic situations. We also apply the methods to sequencing data of ten cancer types from The Cancer Genome Atlas, as an early proof of concept for distinguishing tumor patients from healthy individuals via TCR sequencing of peripheral blood. We find that given an appropriate MIL method is used, satisfactory performance with Area Under the Receiver Operating Characteristic Curve above 80% can be achieved for five in the ten cancers. Based on our numerical results, we make suggestions about selection of a proper method and avoidance of any method with poor performance. We further point out directions of future research as well as identify a pressing need of new MIL methodologies for improved performance (for some cancer types) and more explainable outcomes.",

keywords = "Binary classification, Primary instance, T-cell receptor, Weakly supervised learning, Witness rate",

author = "Danyi Xiong and Ze Zhang and Tao Wang and Xinlei Wang",

note = "Funding Information: This work was supported by NIH grants R01CA258584 (PIs: T. Wang and X. Wang), R15GM131390 (PI: X. Wang), and P30CA142543 (PI: T. Wang), and Cancer Prevention and Research Institute of Texas (CPRIT) grant RP190208 (PI: T. Wang). Publisher Copyright: {\textcopyright} 2021 The Author(s)",

year = "2021",

month = jan,

doi = "10.1016/j.csbj.2021.05.038",

language = "English (US)",

volume = "19",

pages = "3255--3268",

journal = "Computational and Structural Biotechnology Journal",

issn = "2001-0370",

publisher = "Research Network of Computational and Structural Biotechnology",

}

TY - JOUR

T1 - A comparative study of multiple instance learning methods for cancer detection using T-cell receptor sequences

AU - Xiong, Danyi

AU - Zhang, Ze

AU - Wang, Tao

AU - Wang, Xinlei

N1 - Funding Information: This work was supported by NIH grants R01CA258584 (PIs: T. Wang and X. Wang), R15GM131390 (PI: X. Wang), and P30CA142543 (PI: T. Wang), and Cancer Prevention and Research Institute of Texas (CPRIT) grant RP190208 (PI: T. Wang). Publisher Copyright: © 2021 The Author(s)

PY - 2021/1

Y1 - 2021/1

N2 - As a branch of machine learning, multiple instance learning (MIL) learns from a collection of labeled bags, each containing a set of instances. The learning process is weakly supervised due to ambiguous instance labels. Since its emergence, MIL has been applied to solve various problems including content-based image retrieval, object tracking/detection, and computer-aided diagnosis. In biomedical research, the use of MIL has been focused on medical image analysis and molecule activity prediction. We review and apply 16 methods to investigate the applicability of MIL to a novel biomedical application, cancer detection using T-cell receptor (TCR) sequences. This important application can be a viable approach for large-scale cancer screening, as TCRs can be easily profiled from a subject's peripheral blood. We consider two feasible data-generating mechanisms, and for the purpose of performance evaluation, we simulate data under each mechanism, where we vary potentially important factors to mimic realistic situations. We also apply the methods to sequencing data of ten cancer types from The Cancer Genome Atlas, as an early proof of concept for distinguishing tumor patients from healthy individuals via TCR sequencing of peripheral blood. We find that given an appropriate MIL method is used, satisfactory performance with Area Under the Receiver Operating Characteristic Curve above 80% can be achieved for five in the ten cancers. Based on our numerical results, we make suggestions about selection of a proper method and avoidance of any method with poor performance. We further point out directions of future research as well as identify a pressing need of new MIL methodologies for improved performance (for some cancer types) and more explainable outcomes.

AB - As a branch of machine learning, multiple instance learning (MIL) learns from a collection of labeled bags, each containing a set of instances. The learning process is weakly supervised due to ambiguous instance labels. Since its emergence, MIL has been applied to solve various problems including content-based image retrieval, object tracking/detection, and computer-aided diagnosis. In biomedical research, the use of MIL has been focused on medical image analysis and molecule activity prediction. We review and apply 16 methods to investigate the applicability of MIL to a novel biomedical application, cancer detection using T-cell receptor (TCR) sequences. This important application can be a viable approach for large-scale cancer screening, as TCRs can be easily profiled from a subject's peripheral blood. We consider two feasible data-generating mechanisms, and for the purpose of performance evaluation, we simulate data under each mechanism, where we vary potentially important factors to mimic realistic situations. We also apply the methods to sequencing data of ten cancer types from The Cancer Genome Atlas, as an early proof of concept for distinguishing tumor patients from healthy individuals via TCR sequencing of peripheral blood. We find that given an appropriate MIL method is used, satisfactory performance with Area Under the Receiver Operating Characteristic Curve above 80% can be achieved for five in the ten cancers. Based on our numerical results, we make suggestions about selection of a proper method and avoidance of any method with poor performance. We further point out directions of future research as well as identify a pressing need of new MIL methodologies for improved performance (for some cancer types) and more explainable outcomes.

KW - Binary classification

KW - Primary instance

KW - T-cell receptor

KW - Weakly supervised learning

KW - Witness rate

UR - http://www.scopus.com/inward/record.url?scp=85108822382&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85108822382&partnerID=8YFLogxK

U2 - 10.1016/j.csbj.2021.05.038

DO - 10.1016/j.csbj.2021.05.038

M3 - Review article

C2 - 34141144

AN - SCOPUS:85108822382

SN - 2001-0370

VL - 19

SP - 3255

EP - 3268

JO - Computational and Structural Biotechnology Journal

JF - Computational and Structural Biotechnology Journal

ER -
