Machine learning predicts translation initiation sites in neurologic diseases with nucleotide repeat expansions

Alec C. Gleason; Ghanashyam Ghadge; Jin Chen; Yoshifumi Sonobe; Raymond P. Roos

doi:10.1371/journal.pone.0256411

Machine learning predicts translation initiation sites in neurologic diseases with nucleotide repeat expansions

Alec C. Gleason, Ghanashyam Ghadge, Jin Chen, Yoshifumi Sonobe, Raymond P. Roos

Cecil H. and Ida Green Center For Reproductive Biology Sciences

Research output: Contribution to journal › Article › peer-review

9 Scopus citations

Abstract

A number of neurologic diseases associated with expanded nucleotide repeats, including an inherited form of amyotrophic lateral sclerosis, have an unconventional form of translation called repeat-associated non-AUG (RAN) translation. It has been speculated that the repeat regions in the RNA fold into secondary structures in a length-dependent manner, promoting RAN translation. Repeat protein products are translated, accumulate, and may contribute to disease pathogenesis. Nucleotides that flank the repeat region, especially ones closest to the initiation site, are believed to enhance translation initiation. A machine learning model has been published to help identify ATG and near-cognate translation initiation sites; however, this model has diminished predictive power due to its extensive feature selection and limited training data. Here, we overcome this limitation and increase prediction accuracy by the following: a) capture the effect of nucleotides most critical for translation initiation via feature reduction, b) implement an alternative machine learning algorithm better suited for limited data, c) build comprehensive and balanced training data (via sampling without replacement) that includes previously unavailable sequences, and d) split ATG and near-cognate translation initiation codon data to train two separate models. We also design a supplementary scoring system to provide an additional prognostic assessment of model predictions. The resultant models have high performance, with ~85–88% accuracy, exceeding that of the previously published model by >18%. The models presented here are used to identify translation initiation sites in genes associated with a number of neurologic repeat expansion disorders. The results confirm a number of sites of translation initiation upstream of the expanded repeats that have been found experimentally, and predict sites that are not yet established.

Original language	English (US)
Article number	e0256411
Journal	PloS one
Volume	17
Issue number	6 June
DOIs	https://doi.org/10.1371/journal.pone.0256411
State	Published - Jun 2022

ASJC Scopus subject areas

General

Access to Document

10.1371/journal.pone.0256411

Cite this

@article{29e1842f21d84e33a1d2eb76662ceeac,

title = "Machine learning predicts translation initiation sites in neurologic diseases with nucleotide repeat expansions",

abstract = "A number of neurologic diseases associated with expanded nucleotide repeats, including an inherited form of amyotrophic lateral sclerosis, have an unconventional form of translation called repeat-associated non-AUG (RAN) translation. It has been speculated that the repeat regions in the RNA fold into secondary structures in a length-dependent manner, promoting RAN translation. Repeat protein products are translated, accumulate, and may contribute to disease pathogenesis. Nucleotides that flank the repeat region, especially ones closest to the initiation site, are believed to enhance translation initiation. A machine learning model has been published to help identify ATG and near-cognate translation initiation sites; however, this model has diminished predictive power due to its extensive feature selection and limited training data. Here, we overcome this limitation and increase prediction accuracy by the following: a) capture the effect of nucleotides most critical for translation initiation via feature reduction, b) implement an alternative machine learning algorithm better suited for limited data, c) build comprehensive and balanced training data (via sampling without replacement) that includes previously unavailable sequences, and d) split ATG and near-cognate translation initiation codon data to train two separate models. We also design a supplementary scoring system to provide an additional prognostic assessment of model predictions. The resultant models have high performance, with ~85–88% accuracy, exceeding that of the previously published model by >18%. The models presented here are used to identify translation initiation sites in genes associated with a number of neurologic repeat expansion disorders. The results confirm a number of sites of translation initiation upstream of the expanded repeats that have been found experimentally, and predict sites that are not yet established.",

author = "Gleason, {Alec C.} and Ghanashyam Ghadge and Jin Chen and Yoshifumi Sonobe and Roos, {Raymond P.}",

note = "Publisher Copyright: {\textcopyright} 2022 Gleason et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.",

year = "2022",

month = jun,

doi = "10.1371/journal.pone.0256411",

language = "English (US)",

volume = "17",

journal = "PloS one",

issn = "1932-6203",

publisher = "Public Library of Science",

number = "6 June",

}

TY - JOUR

T1 - Machine learning predicts translation initiation sites in neurologic diseases with nucleotide repeat expansions

AU - Gleason, Alec C.

AU - Ghadge, Ghanashyam

AU - Chen, Jin

AU - Sonobe, Yoshifumi

AU - Roos, Raymond P.

N1 - Publisher Copyright: © 2022 Gleason et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

PY - 2022/6

Y1 - 2022/6

N2 - A number of neurologic diseases associated with expanded nucleotide repeats, including an inherited form of amyotrophic lateral sclerosis, have an unconventional form of translation called repeat-associated non-AUG (RAN) translation. It has been speculated that the repeat regions in the RNA fold into secondary structures in a length-dependent manner, promoting RAN translation. Repeat protein products are translated, accumulate, and may contribute to disease pathogenesis. Nucleotides that flank the repeat region, especially ones closest to the initiation site, are believed to enhance translation initiation. A machine learning model has been published to help identify ATG and near-cognate translation initiation sites; however, this model has diminished predictive power due to its extensive feature selection and limited training data. Here, we overcome this limitation and increase prediction accuracy by the following: a) capture the effect of nucleotides most critical for translation initiation via feature reduction, b) implement an alternative machine learning algorithm better suited for limited data, c) build comprehensive and balanced training data (via sampling without replacement) that includes previously unavailable sequences, and d) split ATG and near-cognate translation initiation codon data to train two separate models. We also design a supplementary scoring system to provide an additional prognostic assessment of model predictions. The resultant models have high performance, with ~85–88% accuracy, exceeding that of the previously published model by >18%. The models presented here are used to identify translation initiation sites in genes associated with a number of neurologic repeat expansion disorders. The results confirm a number of sites of translation initiation upstream of the expanded repeats that have been found experimentally, and predict sites that are not yet established.

AB - A number of neurologic diseases associated with expanded nucleotide repeats, including an inherited form of amyotrophic lateral sclerosis, have an unconventional form of translation called repeat-associated non-AUG (RAN) translation. It has been speculated that the repeat regions in the RNA fold into secondary structures in a length-dependent manner, promoting RAN translation. Repeat protein products are translated, accumulate, and may contribute to disease pathogenesis. Nucleotides that flank the repeat region, especially ones closest to the initiation site, are believed to enhance translation initiation. A machine learning model has been published to help identify ATG and near-cognate translation initiation sites; however, this model has diminished predictive power due to its extensive feature selection and limited training data. Here, we overcome this limitation and increase prediction accuracy by the following: a) capture the effect of nucleotides most critical for translation initiation via feature reduction, b) implement an alternative machine learning algorithm better suited for limited data, c) build comprehensive and balanced training data (via sampling without replacement) that includes previously unavailable sequences, and d) split ATG and near-cognate translation initiation codon data to train two separate models. We also design a supplementary scoring system to provide an additional prognostic assessment of model predictions. The resultant models have high performance, with ~85–88% accuracy, exceeding that of the previously published model by >18%. The models presented here are used to identify translation initiation sites in genes associated with a number of neurologic repeat expansion disorders. The results confirm a number of sites of translation initiation upstream of the expanded repeats that have been found experimentally, and predict sites that are not yet established.

UR - http://www.scopus.com/inward/record.url?scp=85131224666&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85131224666&partnerID=8YFLogxK

U2 - 10.1371/journal.pone.0256411

DO - 10.1371/journal.pone.0256411

M3 - Article

C2 - 35648796

AN - SCOPUS:85131224666

SN - 1932-6203

VL - 17

JO - PloS one

JF - PloS one

IS - 6 June

M1 - e0256411

ER -

Machine learning predicts translation initiation sites in neurologic diseases with nucleotide repeat expansions

Abstract

ASJC Scopus subject areas

Access to Document

Other files and links

Fingerprint

Cite this