M. Sečujski, S. Ostrogonac, S. Suzić, D. Pekar, “Learning prosodic stress from data in neural network based text-to-speech synthesis”, Tr. SPIIRAN, 59 (2018), 192–215

Trudy SPIIRAN

Trudy SPIIRAN, 2018, Issue 59, Pages 192–215
DOI: https://doi.org/10.15622/sp.59.8 (Mi trspy1019)

Artificial Intelligence, Knowledge and Data Engineering

Learning prosodic stress from data in neural network based text-to-speech synthesis

M. Sečujski^a, S. Ostrogonac^b, S. Suzić^a, D. Pekar^ab

^a University of Novi Sad
^b AlfaNum – Speech Technologies

Abstract: Naturalness is one of the most important aspects of synthesized speech, and state-of-the-art parametric speech synthesizers must be trained on large quantities of annotated speech data to convey prosodic elements such as pitch accents and phrase boundary tones. The most frequently used framework for prosodic annotation of American English speech is Tones and Break Indices (ToBI), which has also been adapted for a number of other languages. This paper presents certain deficiencies of ToBI when applied to the synthesis of American English speech, which stem from the absence of tags specifically intended to mark differences in the level of prosodic stress (emphasis) on a particular sentence constituent. The paper proposes a set of tags for explicit modeling of the degree of prosodic stress: a sentence constituent can be particularly emphasized when it is the intended focus of the utterance, or de-emphasized, as is commonly the case with phrases reporting direct speech or with comment clauses. Several listening tests show that learning such prosodic events from data has distinct advantages over approaches that attempt to exploit the existing ToBI tags to convey the degree of emphasis in synthesized speech. Specifically, speech synthesized by a neural network trained on data tagged for the level of prosodic stress appears more natural, and listeners are more successful in locating the sentence constituent that carries prosodic stress.

Keywords: American English, prosodic stress, speech synthesis, ToBI.

Funding: The research is supported by the Ministry of Education, Science and Technological Development of the Republic of Serbia (grants TR32035 and OI178027).

Received: 15.05.2018

Document Type: Article

UDC: 004.5

Language: English

Citation: M. Sečujski, S. Ostrogonac, S. Suzić, D. Pekar, “Learning prosodic stress from data in neural network based text-to-speech synthesis”, Tr. SPIIRAN, 59 (2018), 192–215

Citation in format AMSBIB

\Bibitem{SecOstSuz18}
\by M.~Se{\v{c}}ujski, S.~Ostrogonac, S.~Suzi{\'c}, D.~Pekar
\paper Learning prosodic stress from data in neural network based text-to-speech synthesis
\jour Tr. SPIIRAN
\yr 2018
\vol 59
\pages 192--215
\mathnet{http://mi.mathnet.ru/trspy1019}
\crossref{https://doi.org/10.15622/sp.59.8}
\elib{https://elibrary.ru/item.asp?id=35358996}

Linking options:

https://www.mathnet.ru/eng/trspy1019

https://www.mathnet.ru/eng/trspy/v59/p192
