A sequence family database built on ECOD structural domains

Yuxing Liao, R. Dustin Schaeffer, Jimin Pei, Nick V. Grishin

Research output: Contribution to journal (Article)

Abstract

Motivation: The ECOD database classifies protein domains based on their evolutionary relationships, considering both remote and close homology. The family group in ECOD provides classification of domains that are closely related to each other based on sequence similarity. Due to different perspectives on domain definition, direct application of existing sequence domain databases, such as Pfam, to ECOD struggles with several shortcomings.

Results: We created multiple sequence alignments and profiles from ECOD domains with the help of structural information in alignment building and boundary delineation. We validated the alignment quality by scoring structure superposition to demonstrate that they are comparable to curated seed alignments in Pfam. Comparison to Pfam and CDD reveals that 27 and 16% of ECOD families are new, but they are also dominated by small families, likely because of the sampling bias from the PDB database. There are 35 and 48% of families whose boundaries are modified comparing to counterparts in Pfam and CDD, respectively.

Original language: English (US)
Pages (from-to): 2997-3003
Number of pages: 7
Journal: Bioinformatics
Volume: 34
Issue number: 17
DOI: 10.1093/bioinformatics/bty214
State: Published - Sep 1 2018

Fingerprint

Databases
Alignment
Selection Bias
Sequence Alignment
Seeds
Multiple Sequence Alignment
Seed
Scoring
Superposition
Homology
Sampling
Proteins
Likely
Classify
Family
Protein
Demonstrate
Protein Domains

ASJC Scopus subject areas

  • Statistics and Probability
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Computational Theory and Mathematics
  • Computational Mathematics

Cite this

A sequence family database built on ECOD structural domains. / Liao, Yuxing; Schaeffer, R. Dustin; Pei, Jimin; Grishin, Nick V.

In: Bioinformatics, Vol. 34, No. 17, 01.09.2018, p. 2997-3003.

@article{337cf127d59b4215b9f7ff70a873118a,
title = "A sequence family database built on ECOD structural domains",
abstract = "Motivation: The ECOD database classifies protein domains based on their evolutionary relationships, considering both remote and close homology. The family group in ECOD provides classification of domains that are closely related to each other based on sequence similarity. Due to different perspectives on domain definition, direct application of existing sequence domain databases, such as Pfam, to ECOD struggles with several shortcomings. Results: We created multiple sequence alignments and profiles from ECOD domains with the help of structural information in alignment building and boundary delineation. We validated the alignment quality by scoring structure superposition to demonstrate that they are comparable to curated seed alignments in Pfam. Comparison to Pfam and CDD reveals that 27 and 16{\%} of ECOD families are new, but they are also dominated by small families, likely because of the sampling bias from the PDB database. There are 35 and 48{\%} of families whose boundaries are modified comparing to counterparts in Pfam and CDD, respectively.",
author = "Yuxing Liao and Schaeffer, {R. Dustin} and Jimin Pei and Grishin, {Nick V.}",
year = "2018",
month = "9",
day = "1",
doi = "10.1093/bioinformatics/bty214",
language = "English (US)",
volume = "34",
pages = "2997--3003",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "17",
}

TY - JOUR

T1 - A sequence family database built on ECOD structural domains

AU - Liao, Yuxing

AU - Schaeffer, R. Dustin

AU - Pei, Jimin

AU - Grishin, Nick V.

PY - 2018/9/1

Y1 - 2018/9/1

AB - Motivation: The ECOD database classifies protein domains based on their evolutionary relationships, considering both remote and close homology. The family group in ECOD provides classification of domains that are closely related to each other based on sequence similarity. Due to different perspectives on domain definition, direct application of existing sequence domain databases, such as Pfam, to ECOD struggles with several shortcomings. Results: We created multiple sequence alignments and profiles from ECOD domains with the help of structural information in alignment building and boundary delineation. We validated the alignment quality by scoring structure superposition to demonstrate that they are comparable to curated seed alignments in Pfam. Comparison to Pfam and CDD reveals that 27 and 16% of ECOD families are new, but they are also dominated by small families, likely because of the sampling bias from the PDB database. There are 35 and 48% of families whose boundaries are modified comparing to counterparts in Pfam and CDD, respectively.

UR - http://www.scopus.com/inward/record.url?scp=85055039299&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85055039299&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/bty214

DO - 10.1093/bioinformatics/bty214

M3 - Article

C2 - 29659718

AN - SCOPUS:85055039299

VL - 34

SP - 2997

EP - 3003

JO - Bioinformatics

JF - Bioinformatics

SN - 1367-4803

IS - 17

ER -