Abstract
A natural way to study protein sequence, structure, and function is to put them in the context of evolution. Homologs inherit similarities from their common ancestor, while analogs converge to similar structures due to a limited number of energetically favorable ways to pack secondary structural elements. Using novel strategies, we previously assembled two reliable databases of homologs and analogs. In this study, we compare these two data sets and develop a support vector machine (SVM)-based classifier to discriminate between homologs and analogs. The classifier uses a number of well-known similarity scores. We observe that although both structure scores and sequence scores contribute to SVM performance, profile sequence scores computed based on structural alignments are the best discriminators between remote homologs and structural analogs. We apply our classifier to a representative set from the expert-constructed database, Structural Classification of Proteins (SCOP). The SVM classifier recovers 76% of the remote homologs defined as domains in the same SCOP superfamily but from different families. More importantly, we also detect and discuss interesting homologous relationships between SCOP domains from different superfamilies, folds, and even classes.
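As a rough illustration of the approach the abstract describes, the sketch below trains a linear soft-margin SVM on two similarity scores per domain pair and labels each pair as homolog (+1) or analog (-1). Everything in it is an assumption made for illustration: the feature names `structure_score` and `profile_score`, the toy values, the linear kernel, and the hyperparameters. The paper's actual scores, kernel, and training procedure are not stated in this abstract.

```python
# Illustrative sketch only: a linear soft-margin SVM trained by full-batch
# subgradient descent on the hinge loss. All feature names and numbers are
# made-up assumptions, not data or settings from the paper.


def decision_value(w, b, x):
    """Signed score w.x + b; positive means 'homolog' in this toy setup."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=2000):
    """Minimise lam/2 * ||w||^2 + mean(max(0, 1 - y * (w.x + b)))."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the L2 penalty
        grad_b = 0.0
        for xi, yi in zip(X, y):
            if yi * decision_value(w, b, xi) < 1.0:  # margin violated
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 (homolog) or -1 (analog)."""
    return 1 if decision_value(w, b, x) >= 0.0 else -1


# Hypothetical domain pairs: (structure_score, profile_score), scaled to [0, 1].
# Structure scores overlap between the classes; profile scores do not.
X = [
    (0.70, 0.80), (0.65, 0.75), (0.60, 0.90), (0.75, 0.70),  # toy homologs
    (0.70, 0.20), (0.65, 0.30), (0.60, 0.10), (0.75, 0.25),  # toy analogs
]
y = [1, 1, 1, 1, -1, -1, -1, -1]

w, b = train_linear_svm(X, y)
train_accuracy = sum(predict(w, b, xi) == yi for xi, yi in zip(X, y)) / len(X)
print("weights:", w, "bias:", b)
print("training accuracy:", train_accuracy)
```

The toy data are built so that only the hypothetical profile score separates the two classes, which echoes the abstract's observation in spirit only; it does not reproduce any result from the study.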
Original language | English (US) |
---|---|
Pages (from-to) | 1265-1278 |
Number of pages | 14 |
Journal | Journal of Molecular Biology |
Volume | 377 |
Issue number | 4 |
DOIs | 10.1016/j.jmb.2007.12.076 |
State | Published - Apr 4 2008 |
Keywords
- analogy
- discrimination
- homology
- protein structures
- support vector machines
ASJC Scopus subject areas
- Virology
Cite this
Discrimination between Distant Homologs and Structural Analogs: Lessons from Manually Constructed, Reliable Data Sets. / Cheng, Hua; Kim, Bong Hyun; Grishin, Nick V.
In: Journal of Molecular Biology, Vol. 377, No. 4, 04.04.2008, p. 1265-1278.
Research output: Contribution to journal › Article
TY - JOUR
T1 - Discrimination between Distant Homologs and Structural Analogs
T2 - Lessons from Manually Constructed, Reliable Data Sets
AU - Cheng, Hua
AU - Kim, Bong Hyun
AU - Grishin, Nick V.
PY - 2008/4/4
Y1 - 2008/4/4
KW - analogy
KW - discrimination
KW - homology
KW - protein structures
KW - support vector machines
UR - http://www.scopus.com/inward/record.url?scp=40849133659&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=40849133659&partnerID=8YFLogxK
U2 - 10.1016/j.jmb.2007.12.076
DO - 10.1016/j.jmb.2007.12.076
M3 - Article
C2 - 18313074
AN - SCOPUS:40849133659
VL - 377
SP - 1265
EP - 1278
JO - Journal of Molecular Biology
JF - Journal of Molecular Biology
SN - 0022-2836
IS - 4
ER -