Discrimination between Distant Homologs and Structural Analogs: Lessons from Manually Constructed, Reliable Data Sets

Hua Cheng, Bong Hyun Kim, Nick V. Grishin

Research output: Contribution to journal (Article)

17 Citations (Scopus)

Abstract

A natural way to study protein sequence, structure, and function is to put them in the context of evolution. Homologs inherit similarities from their common ancestor, while analogs converge to similar structures due to a limited number of energetically favorable ways to pack secondary structural elements. Using novel strategies, we previously assembled two reliable databases of homologs and analogs. In this study, we compare these two data sets and develop a support vector machine (SVM)-based classifier to discriminate between homologs and analogs. The classifier uses a number of well-known similarity scores. We observe that although both structure scores and sequence scores contribute to SVM performance, profile sequence scores computed based on structural alignments are the best discriminators between remote homologs and structural analogs. We apply our classifier to a representative set from the expert-constructed database, Structural Classification of Proteins (SCOP). The SVM classifier recovers 76% of the remote homologs defined as domains in the same SCOP superfamily but from different families. More importantly, we also detect and discuss interesting homologous relationships between SCOP domains from different superfamilies, folds, and even classes.

Original language: English (US)
Pages (from-to): 1265-1278
Number of pages: 14
Journal: Journal of Molecular Biology
Volume: 377
Issue number: 4
DOI: 10.1016/j.jmb.2007.12.076
State: Published - Apr 4 2008

Keywords

  • analogy
  • discrimination
  • homology
  • protein structures
  • support vector machines

ASJC Scopus subject areas

  • Virology

Cite this

Discrimination between Distant Homologs and Structural Analogs: Lessons from Manually Constructed, Reliable Data Sets. / Cheng, Hua; Kim, Bong Hyun; Grishin, Nick V.

In: Journal of Molecular Biology, Vol. 377, No. 4, 04.04.2008, p. 1265-1278.

@article{f991b6c5259c4d3e8d9c6e0a997aab17,
title = "Discrimination between Distant Homologs and Structural Analogs: Lessons from Manually Constructed, Reliable Data Sets",
abstract = "A natural way to study protein sequence, structure, and function is to put them in the context of evolution. Homologs inherit similarities from their common ancestor, while analogs converge to similar structures due to a limited number of energetically favorable ways to pack secondary structural elements. Using novel strategies, we previously assembled two reliable databases of homologs and analogs. In this study, we compare these two data sets and develop a support vector machine (SVM)-based classifier to discriminate between homologs and analogs. The classifier uses a number of well-known similarity scores. We observe that although both structure scores and sequence scores contribute to SVM performance, profile sequence scores computed based on structural alignments are the best discriminators between remote homologs and structural analogs. We apply our classifier to a representative set from the expert-constructed database, Structural Classification of Proteins (SCOP). The SVM classifier recovers 76{\%} of the remote homologs defined as domains in the same SCOP superfamily but from different families. More importantly, we also detect and discuss interesting homologous relationships between SCOP domains from different superfamilies, folds, and even classes.",
keywords = "analogy, discrimination, homology, protein structures, support vector machines",
author = "Cheng, Hua and Kim, {Bong Hyun} and Grishin, {Nick V.}",
year = "2008",
month = "4",
day = "4",
doi = "10.1016/j.jmb.2007.12.076",
language = "English (US)",
volume = "377",
pages = "1265--1278",
journal = "Journal of Molecular Biology",
issn = "0022-2836",
publisher = "Academic Press Inc.",
number = "4",
}

TY - JOUR
T1 - Discrimination between Distant Homologs and Structural Analogs
T2 - Lessons from Manually Constructed, Reliable Data Sets
AU - Cheng, Hua
AU - Kim, Bong Hyun
AU - Grishin, Nick V.
PY - 2008/4/4
Y1 - 2008/4/4
AB - A natural way to study protein sequence, structure, and function is to put them in the context of evolution. Homologs inherit similarities from their common ancestor, while analogs converge to similar structures due to a limited number of energetically favorable ways to pack secondary structural elements. Using novel strategies, we previously assembled two reliable databases of homologs and analogs. In this study, we compare these two data sets and develop a support vector machine (SVM)-based classifier to discriminate between homologs and analogs. The classifier uses a number of well-known similarity scores. We observe that although both structure scores and sequence scores contribute to SVM performance, profile sequence scores computed based on structural alignments are the best discriminators between remote homologs and structural analogs. We apply our classifier to a representative set from the expert-constructed database, Structural Classification of Proteins (SCOP). The SVM classifier recovers 76% of the remote homologs defined as domains in the same SCOP superfamily but from different families. More importantly, we also detect and discuss interesting homologous relationships between SCOP domains from different superfamilies, folds, and even classes.

KW - analogy
KW - discrimination
KW - homology
KW - protein structures
KW - support vector machines
UR - http://www.scopus.com/inward/record.url?scp=40849133659&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=40849133659&partnerID=8YFLogxK
U2 - 10.1016/j.jmb.2007.12.076
DO - 10.1016/j.jmb.2007.12.076
M3 - Article
C2 - 18313074
AN - SCOPUS:40849133659
VL - 377
SP - 1265
EP - 1278
JO - Journal of Molecular Biology
JF - Journal of Molecular Biology
SN - 0022-2836
IS - 4
ER -