Considering scores between unrelated proteins in the search database improves profile comparison

Ruslan I. Sadreyev; Yong Wang; Nick V. Grishin

doi:10.1186/1471-2105-10-399

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

Background: Profile-based comparison of multiple sequence alignments is a powerful methodology for the detection of remote protein sequence similarity, which is essential for the inference and analysis of protein structure, function, and evolution. Accurate estimation of statistical significance of detected profile similarities is essential for further development of this methodology. Here we analyze a novel approach to estimate the statistical significance of profile similarity: the explicit consideration of background score distributions for each database template (subject).

Results: Using a simple scheme to combine and analytically approximate query- and subject-based distributions, we show that (i) inclusion of background distributions for the subjects increases the quality of homology detection; (ii) this increase is higher when the distributions are based on the scores to all known non-homologs of the subject rather than a small calibration subset of the database representatives; and (iii) these all known non-homolog distributions of scores for the subject make the dominant contribution to the improved performance: adding the calibration distribution of the query has a negligible additional effect.

Conclusion: The construction of distributions based on the complete sets of non-homologs for each subject is particularly relevant in the setting of structure prediction where the database consists of proteins with solved 3D structure (PDB, SCOP, CATH, etc.) and therefore structural relationships between proteins are known. These results point to a potential new direction in the development of more powerful methods for remote homology detection.
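The idea of a subject-specific background distribution can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the combination scheme or the analytical form, so the sketch assumes a Gumbel (extreme value) approximation fitted by the method of moments, and it combines query and subject backgrounds simply by pooling their scores. The function names (`fit_gumbel`, `gumbel_pvalue`, `significance`) and all score values are hypothetical.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel(scores):
    """Method-of-moments Gumbel fit; returns (location mu, scale beta).

    Assumption: background scores are roughly Gumbel-distributed.
    """
    if len(scores) < 2:
        raise ValueError("need at least two background scores")
    sd = statistics.stdev(scores)
    if sd == 0:
        raise ValueError("background scores must not all be equal")
    beta = sd * math.sqrt(6) / math.pi
    mu = statistics.fmean(scores) - EULER_GAMMA * beta
    return mu, beta


def gumbel_pvalue(score, mu, beta):
    """P(S >= score) under a Gumbel distribution with parameters (mu, beta)."""
    z = (score - mu) / beta
    if z < -700:  # exp(-z) would overflow; the tail probability is 1 here
        return 1.0
    # 1 - exp(-exp(-z)), computed with expm1 for accuracy at small values
    return -math.expm1(-math.exp(-z))


def significance(score, subject_background, query_background=()):
    """P-value of a query-subject score against the pooled background.

    subject_background: scores of the subject against its known non-homologs.
    query_background: optional calibration scores of the query.
    Pooling is an illustrative choice, not the scheme used in the paper.
    """
    pooled = list(subject_background) + list(query_background)
    mu, beta = fit_gumbel(pooled)
    return gumbel_pvalue(score, mu, beta)


# Made-up scores for two subjects with different background levels.
subject_a = [10, 12, 11, 13, 9, 14, 10, 12]
subject_b = [20, 24, 22, 26, 18, 28, 20, 24]

# The same raw score is less significant against the subject that tends
# to score highly with unrelated proteins.
print(significance(30, subject_a))
print(significance(30, subject_b))
```

Under these assumptions, a raw score is judged against what the subject typically scores with unrelated proteins, so a subject that produces high scores indiscriminately no longer yields spuriously significant hits.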

Original language	English (US)
Article number	399
Journal	BMC Bioinformatics
Volume	10
DOIs	https://doi.org/10.1186/1471-2105-10-399
State	Published - Dec 4 2009

ASJC Scopus subject areas

Structural Biology
Biochemistry
Molecular Biology
Computer Science Applications
Applied Mathematics

Cite this

@article{e8db530298994e929b5caa19b2824466,

title = "Considering scores between unrelated proteins in the search database improves profile comparison",

abstract = "Background: Profile-based comparison of multiple sequence alignments is a powerful methodology for the detection of remote protein sequence similarity, which is essential for the inference and analysis of protein structure, function, and evolution. Accurate estimation of statistical significance of detected profile similarities is essential for further development of this methodology. Here we analyze a novel approach to estimate the statistical significance of profile similarity: the explicit consideration of background score distributions for each database template (subject). Results: Using a simple scheme to combine and analytically approximate query- and subject-based distributions, we show that (i) inclusion of background distributions for the subjects increases the quality of homology detection; (ii) this increase is higher when the distributions are based on the scores to all known non-homologs of the subject rather than a small calibration subset of the database representatives; and (iii) these all known non-homolog distributions of scores for the subject make the dominant contribution to the improved performance: adding the calibration distribution of the query has a negligible additional effect. Conclusion: The construction of distributions based on the complete sets of non-homologs for each subject is particularly relevant in the setting of structure prediction where the database consists of proteins with solved 3D structure (PDB, SCOP, CATH, etc.) and therefore structural relationships between proteins are known. These results point to a potential new direction in the development of more powerful methods for remote homology detection.",

author = "Sadreyev, {Ruslan I.} and Wang, Yong and Grishin, {Nick V.}",

note = "Funding Information: This study was supported by NIH grant GM67165 to NVG. The authors acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing high-performance computing resources.",

year = "2009",

month = dec,

day = "4",

doi = "10.1186/1471-2105-10-399",

language = "English (US)",

volume = "10",

journal = "BMC Bioinformatics",

issn = "1471-2105",

publisher = "BioMed Central",

}

TY - JOUR

T1 - Considering scores between unrelated proteins in the search database improves profile comparison

AU - Sadreyev, Ruslan I.

AU - Wang, Yong

AU - Grishin, Nick V.

N1 - Funding Information: This study was supported by NIH grant GM67165 to NVG. The authors acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing high-performance computing resources.

PY - 2009/12/4

Y1 - 2009/12/4

N2 - Background: Profile-based comparison of multiple sequence alignments is a powerful methodology for the detection of remote protein sequence similarity, which is essential for the inference and analysis of protein structure, function, and evolution. Accurate estimation of statistical significance of detected profile similarities is essential for further development of this methodology. Here we analyze a novel approach to estimate the statistical significance of profile similarity: the explicit consideration of background score distributions for each database template (subject). Results: Using a simple scheme to combine and analytically approximate query- and subject-based distributions, we show that (i) inclusion of background distributions for the subjects increases the quality of homology detection; (ii) this increase is higher when the distributions are based on the scores to all known non-homologs of the subject rather than a small calibration subset of the database representatives; and (iii) these all known non-homolog distributions of scores for the subject make the dominant contribution to the improved performance: adding the calibration distribution of the query has a negligible additional effect. Conclusion: The construction of distributions based on the complete sets of non-homologs for each subject is particularly relevant in the setting of structure prediction where the database consists of proteins with solved 3D structure (PDB, SCOP, CATH, etc.) and therefore structural relationships between proteins are known. These results point to a potential new direction in the development of more powerful methods for remote homology detection.

AB - Background: Profile-based comparison of multiple sequence alignments is a powerful methodology for the detection of remote protein sequence similarity, which is essential for the inference and analysis of protein structure, function, and evolution. Accurate estimation of statistical significance of detected profile similarities is essential for further development of this methodology. Here we analyze a novel approach to estimate the statistical significance of profile similarity: the explicit consideration of background score distributions for each database template (subject). Results: Using a simple scheme to combine and analytically approximate query- and subject-based distributions, we show that (i) inclusion of background distributions for the subjects increases the quality of homology detection; (ii) this increase is higher when the distributions are based on the scores to all known non-homologs of the subject rather than a small calibration subset of the database representatives; and (iii) these all known non-homolog distributions of scores for the subject make the dominant contribution to the improved performance: adding the calibration distribution of the query has a negligible additional effect. Conclusion: The construction of distributions based on the complete sets of non-homologs for each subject is particularly relevant in the setting of structure prediction where the database consists of proteins with solved 3D structure (PDB, SCOP, CATH, etc.) and therefore structural relationships between proteins are known. These results point to a potential new direction in the development of more powerful methods for remote homology detection.

UR - http://www.scopus.com/inward/record.url?scp=73149104157&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=73149104157&partnerID=8YFLogxK

U2 - 10.1186/1471-2105-10-399

DO - 10.1186/1471-2105-10-399

M3 - Article

C2 - 19961610

AN - SCOPUS:73149104157

SN - 1471-2105

VL - 10

JO - BMC Bioinformatics

JF - BMC Bioinformatics

M1 - 399

ER -
