Articulation-to-speech synthesis using articulatory flesh point sensors' orientation information

Beiming Cao, Myungjong Kim, Jun R. Wang, Jan Van Santen, Ted Mau, Jun Wang

Research output: Contribution to journal › Conference article

2 Citations (Scopus)

Abstract

Articulation-to-speech (ATS) synthesis generates an audio waveform directly from articulatory information. Existing work on ATS has used only articulatory movement information (spatial coordinates); the orientation information of articulatory flesh points has rarely been used, although some devices (e.g., electromagnetic articulography) provide it. Previous work indicated that orientation information carries significant information for speech production. In this paper, we explored the performance of adding the orientation information of flesh points on the articulators (i.e., tongue, lips, and jaw) to ATS. Experiments using articulatory movement information with and without orientation information were conducted with standard deep neural networks (DNNs) and long short-term memory recurrent neural networks (LSTM-RNNs). Both objective and subjective evaluations indicated that adding the orientation information of flesh points to movement information produced higher-quality speech output than using movement information alone.
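
To make the experimental setup concrete, below is a minimal sketch (Python with PyTorch; not the authors' code) of the two input configurations the abstract compares: movement-only features versus movement-plus-orientation features, each fed to an LSTM-RNN that predicts acoustic (vocoder) features frame by frame. The sensor count, per-sensor feature dimensions, network size, and acoustic feature dimension below are illustrative assumptions, and waveform generation from the predicted acoustic features is omitted.

# Sketch of the two ATS input configurations compared in the paper:
# articulatory movement features alone vs. movement plus orientation
# features. All dimensions are illustrative assumptions.

import torch
import torch.nn as nn

NUM_SENSORS = 6    # e.g., tongue, lip, and jaw flesh points (assumed count)
POS_DIM = 3        # x, y, z coordinates per EMA sensor
ORI_DIM = 2        # orientation angles per sensor (assumed parameterization)
ACOUSTIC_DIM = 25  # vocoder feature dimension per frame (assumed)

def build_input(positions, orientations=None):
    """Concatenate per-frame articulatory features.

    positions:    (batch, T, NUM_SENSORS * POS_DIM) movement features
    orientations: (batch, T, NUM_SENSORS * ORI_DIM) features, or None
    """
    if orientations is None:
        return positions                       # movement-only baseline
    return torch.cat([positions, orientations], dim=-1)

class ATSLSTM(nn.Module):
    """LSTM-RNN mapping articulatory feature sequences to acoustic features."""

    def __init__(self, input_dim, hidden_dim=256, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim,
                            num_layers=num_layers, batch_first=True)
        self.out = nn.Linear(hidden_dim, ACOUSTIC_DIM)

    def forward(self, x):                      # x: (batch, T, input_dim)
        h, _ = self.lstm(x)                    # per-frame hidden states
        return self.out(h)                     # (batch, T, ACOUSTIC_DIM)

# Example: one utterance of 200 frames under both configurations.
T = 200
pos = torch.randn(1, T, NUM_SENSORS * POS_DIM)
ori = torch.randn(1, T, NUM_SENSORS * ORI_DIM)

baseline = ATSLSTM(input_dim=NUM_SENSORS * POS_DIM)
with_ori = ATSLSTM(input_dim=NUM_SENSORS * (POS_DIM + ORI_DIM))

y_base = baseline(build_input(pos))
y_ori = with_ori(build_input(pos, ori))
print(y_base.shape, y_ori.shape)  # both torch.Size([1, 200, 25])

In this framing, adding orientation information only widens the input layer, so comparing the two configurations isolates the contribution of the orientation channels while the rest of the model stays fixed.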

Original language: English (US)
Pages (from-to): 3152-3156
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2018-September
DOI: 10.21437/Interspeech.2018-2484
State: Published - Jan 1 2018
Event: 19th Annual Conference of the International Speech Communication Association, INTERSPEECH 2018 - Hyderabad, India
Duration: Sep 2 2018 - Sep 6 2018

Keywords

  • Articulation-to-speech synthesis
  • Deep neural network
  • Orientation information

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modeling and Simulation

Cite this

Articulation-to-speech synthesis using articulatory flesh point sensors' orientation information. / Cao, Beiming; Kim, Myungjong; Wang, Jun R.; Van Santen, Jan; Mau, Ted; Wang, Jun.

In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Vol. 2018-September, 01.01.2018, p. 3152-3156.

@article{99d03ad2f9104387965c3990c0259215,
  title = "Articulation-to-speech synthesis using articulatory flesh point sensors' orientation information",
  abstract = "Articulation-to-speech (ATS) synthesis generates an audio waveform directly from articulatory information. Existing work on ATS has used only articulatory movement information (spatial coordinates); the orientation information of articulatory flesh points has rarely been used, although some devices (e.g., electromagnetic articulography) provide it. Previous work indicated that orientation information carries significant information for speech production. In this paper, we explored the performance of adding the orientation information of flesh points on the articulators (i.e., tongue, lips, and jaw) to ATS. Experiments using articulatory movement information with and without orientation information were conducted with standard deep neural networks (DNNs) and long short-term memory recurrent neural networks (LSTM-RNNs). Both objective and subjective evaluations indicated that adding the orientation information of flesh points to movement information produced higher-quality speech output than using movement information alone.",
  keywords = "Articulation-to-speech synthesis, Deep neural network, Orientation information",
  author = "Beiming Cao and Myungjong Kim and Wang, {Jun R.} and {Van Santen}, Jan and Ted Mau and Jun Wang",
  year = "2018",
  month = "1",
  day = "1",
  doi = "10.21437/Interspeech.2018-2484",
  language = "English (US)",
  volume = "2018-September",
  pages = "3152--3156",
  journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
  issn = "2308-457X",
}
