Identification and utilization of arbitrary correlations in models of recombination signal sequences

Lindsay G. Cowell; Marco Davila; Thomas B. Kepler; Garnett Kelsoe

doi:10.1186/gb-2002-3-12-research0072

Research output: Contribution to journal › Article › peer-review

52 Scopus citations

Abstract

Background: A significant challenge in bioinformatics is to develop methods for detecting and modeling patterns in variable DNA sequence sites, such as protein-binding sites in regulatory DNA. Current approaches sometimes perform poorly when positions in the site do not independently affect protein binding. We developed a statistical technique for modeling the correlation structure in variable DNA sequence sites. The method places no restrictions on the number of correlated positions or on their spatial relationship within the site. No prior empirical evidence for the correlation structure is necessary.

Results: We applied our method to the recombination signal sequences (RSS) that direct assembly of B-cell and T-cell antigen-receptor genes via V(D)J recombination. The technique is based on model selection by cross-validation and produces models that allow computation of an information score for any signal-length sequence. We also modeled RSS using order zero and order one Markov chains. The scores from all models are highly correlated with measured recombination efficiencies, but the models arising from our technique are better than the Markov models at discriminating RSS from non-RSS.

Conclusions: Our model-development procedure produces models that estimate well the recombinogenic potential of RSS and are better at RSS recognition than the order zero and order one Markov models. Our models are, therefore, valuable for studying the regulation of both physiologic and aberrant V(D)J recombination. The approach could be equally powerful for the study of promoter and enhancer elements, splice sites, and other DNA regulatory sites that are highly variable at the level of individual nucleotide positions.
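As an illustration of the two baseline models the abstract mentions, the following Python sketch fits order zero and order one Markov chains to a set of aligned sequences and scores a candidate sequence by its log-probability. It is not the authors' code and does not implement their correlation-modeling procedure. The function names, the pseudocount, the choice of position-specific chains, and the four toy heptamers are all assumptions made for this example.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def fit_order0(seqs, pseudo=1.0):
    """Order zero model: one nucleotide distribution per position,
    with positions treated as independent (pseudocount is an assumption)."""
    length = len(seqs[0])
    total = len(seqs) + pseudo * len(ALPHABET)
    model = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs)
        model.append({b: (counts[b] + pseudo) / total for b in ALPHABET})
    return model


def score_order0(model, seq):
    """Log-probability of seq under the independent-position model."""
    return sum(math.log(model[i][b]) for i, b in enumerate(seq))


def fit_order1(seqs, pseudo=1.0):
    """Order one model: P(x_i | x_{i-1}) estimated separately at each position."""
    length = len(seqs[0])
    tables = []
    for i in range(1, length):
        pair_counts = Counter((s[i - 1], s[i]) for s in seqs)
        prev_counts = Counter(s[i - 1] for s in seqs)
        table = {}
        for a in ALPHABET:
            denom = prev_counts[a] + pseudo * len(ALPHABET)
            for b in ALPHABET:
                table[(a, b)] = (pair_counts[(a, b)] + pseudo) / denom
        tables.append(table)
    return tables


def score_order1(first_pos, tables, seq):
    """Log-probability of seq: first position from the order zero
    distribution, every later position conditioned on its predecessor."""
    score = math.log(first_pos[seq[0]])
    for i in range(1, len(seq)):
        score += math.log(tables[i - 1][(seq[i - 1], seq[i])])
    return score


# Toy, made-up training set of heptamer-like sequences (not data from the paper).
training = ["CACAGTG", "CACAGTG", "CACAATG", "CACTGTG"]
order0 = fit_order0(training)
order1 = fit_order1(training)

for candidate in ("CACAGTG", "GTGTCAC"):
    s0 = score_order0(order0, candidate)
    s1 = score_order1(order0[0], order1, candidate)
    print(f"{candidate}  order0={s0:.3f}  order1={s1:.3f}")
```

In this sketch a sequence resembling the training set receives a higher (less negative) score under both models than an unrelated one. The order one model captures dependence only between adjacent positions, which is the limitation the abstract's method is meant to remove.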

Original language	English (US)
Article number	research0072.1
Journal	Genome Biology
Volume	3
Issue number	12
DOIs	https://doi.org/10.1186/gb-2002-3-12-research0072
State	Published - Dec 2002

Keywords

Gene Segment
Marginal Probability Distribution
Model Selection Procedure
Recombination Efficiency
Recombination Signal Sequence

ASJC Scopus subject areas

Genetics
Ecology, Evolution, Behavior and Systematics
Cell Biology

Cite this

@article{903a48b47a33471497119b51ac65be91,

title = "Identification and utilization of arbitrary correlations in models of recombination signal sequences",

abstract = "Background: A significant challenge in bioinformatics is to develop methods for detecting and modeling patterns in variable DNA sequence sites, such as protein-binding sites in regulatory DNA. Current approaches sometimes perform poorly when positions in the site do not independently affect protein binding. We developed a statistical technique for modeling the correlation structure in variable DNA sequence sites. The method places no restrictions on the number of correlated positions or on their spatial relationship within the site. No prior empirical evidence for the correlation structure is necessary. Results: We applied our method to the recombination signal sequences (RSS) that direct assembly of B-cell and T-cell antigen-receptor genes via V(D)J recombination. The technique is based on model selection by cross-validation and produces models that allow computation of an information score for any signal-length sequence. We also modeled RSS using order zero and order one Markov chains. The scores from all models are highly correlated with measured recombination efficiencies, but the models arising from our technique are better than the Markov models at discriminating RSS from non-RSS. Conclusions: Our model-development procedure produces models that estimate well the recombinogenic potential of RSS and are better at RSS recognition than the order zero and order one Markov models. Our models are, therefore, valuable for studying the regulation of both physiologic and aberrant V(D)J recombination. The approach could be equally powerful for the study of promoter and enhancer elements, splice sites, and other DNA regulatory sites that are highly variable at the level of individual nucleotide positions.",

keywords = "Gene Segment, Marginal Probability Distribution, Model Selection Procedure, Recombination Efficiency, Recombination Signal Sequence",

author = "Cowell, {Lindsay G.} and Davila, Marco and Kepler, {Thomas B.} and Kelsoe, Garnett",

note = "Publisher Copyright: {\textcopyright} 2002, Cowell et al., licensee BioMed Central Ltd.",

year = "2002",

month = dec,

doi = "10.1186/gb-2002-3-12-research0072",

language = "English (US)",

volume = "3",

journal = "Genome Biology",

issn = "1474-7596",

publisher = "BioMed Central",

number = "12",

}

TY - JOUR

T1 - Identification and utilization of arbitrary correlations in models of recombination signal sequences

AU - Cowell, Lindsay G.

AU - Davila, Marco

AU - Kepler, Thomas B.

AU - Kelsoe, Garnett

PY - 2002/12

Y1 - 2002/12

AB - Background: A significant challenge in bioinformatics is to develop methods for detecting and modeling patterns in variable DNA sequence sites, such as protein-binding sites in regulatory DNA. Current approaches sometimes perform poorly when positions in the site do not independently affect protein binding. We developed a statistical technique for modeling the correlation structure in variable DNA sequence sites. The method places no restrictions on the number of correlated positions or on their spatial relationship within the site. No prior empirical evidence for the correlation structure is necessary. Results: We applied our method to the recombination signal sequences (RSS) that direct assembly of B-cell and T-cell antigen-receptor genes via V(D)J recombination. The technique is based on model selection by cross-validation and produces models that allow computation of an information score for any signal-length sequence. We also modeled RSS using order zero and order one Markov chains. The scores from all models are highly correlated with measured recombination efficiencies, but the models arising from our technique are better than the Markov models at discriminating RSS from non-RSS. Conclusions: Our model-development procedure produces models that estimate well the recombinogenic potential of RSS and are better at RSS recognition than the order zero and order one Markov models. Our models are, therefore, valuable for studying the regulation of both physiologic and aberrant V(D)J recombination. The approach could be equally powerful for the study of promoter and enhancer elements, splice sites, and other DNA regulatory sites that are highly variable at the level of individual nucleotide positions.

KW - Gene Segment

KW - Marginal Probability Distribution

KW - Model Selection Procedure

KW - Recombination Efficiency

KW - Recombination Signal Sequence

UR - http://www.scopus.com/inward/record.url?scp=0003228073&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0003228073&partnerID=8YFLogxK

U2 - 10.1186/gb-2002-3-12-research0072

DO - 10.1186/gb-2002-3-12-research0072

M3 - Article

C2 - 12537561

AN - SCOPUS:0003228073

SN - 1474-7596

VL - 3

JO - Genome biology

JF - Genome biology

IS - 12

M1 - research0072.1

ER -
