How Much Prosody Can You Learn from Twenty Utterances?

  • Eric Keller
  • Brigitte Zellner Keller


This study examined how much speech material is required to build a prosodic model for duration, fundamental frequency and intensity. For each of two speakers, fifty multiple linear regression models were built on the basis of seventy utterances per speaker (7'522 and 7'643 segments, respectively). Models based on eight and twenty utterances showed good stability, satisfactory prediction for novel material, and closeness of fit comparable to that reported by other researchers for much larger corpora. Linear regressions were typically based on about ten independent predictors per prosodic parameter, which had previously been ranked according to how well they predicted the dependent parameter. This ranking procedure advantageously replaced the more commonly used regression trees. Variation in the closeness of fit of models based on sliding windows eight and twenty utterances long was traced to variations in bias, i.e., in the degree to which models systematically under- or overestimate target values. Although the models in this study involved simple, non-optimized linear regressions without interactions, avenues are suggested for further improving the performance of this class of models. The results suggest that a series of well-adapted small-footprint models provides more accurate information about the individual use of prosody in specific speech situations than a single model based on abundant data.
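
The procedure summarized above (rank the predictors, fit a linear regression without interactions on a window of utterances, then measure fit and bias on novel material) can be illustrated with a minimal sketch. It is not the authors' implementation. The ranking criterion (absolute Pearson correlation with the target), the function names, and the synthetic data are all assumptions made for illustration; the abstract does not specify how predictors were ranked or evaluated.

```python
import numpy as np

def rank_predictors(X, y):
    """Order the columns of X by absolute Pearson correlation with y.
    Assumed ranking criterion; requires every column to have non-zero variance."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    corr = np.abs(Xc.T @ yc) / denom
    return np.argsort(corr)[::-1]

def fit_linear(X, y):
    """Ordinary least squares with an intercept and no interaction terms."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    return coef[0] + X @ coef[1:]

def sliding_window_models(X, y, utt_ids, window, n_predictors):
    """Fit one model per window of `window` consecutive utterances and
    evaluate it on all segments outside that window (window < number of utterances)."""
    utts = np.unique(utt_ids)
    results = []
    for start in range(len(utts) - window + 1):
        train = np.isin(utt_ids, utts[start:start + window])
        test = ~train
        top = rank_predictors(X[train], y[train])[:n_predictors]
        coef = fit_linear(X[train][:, top], y[train])
        pred = predict(coef, X[test][:, top])
        bias = float(np.mean(pred - y[test]))       # > 0: systematic overestimation
        r = float(np.corrcoef(pred, y[test])[0, 1])  # closeness of fit
        results.append((start, bias, r))
    return results

# Synthetic stand-in for a segment-level corpus (hypothetical data):
# 30 utterances of 20 segments each, 15 candidate predictors.
rng = np.random.default_rng(0)
n_utts, segs = 30, 20
utt_ids = np.repeat(np.arange(n_utts), segs)
X = rng.normal(size=(n_utts * segs, 15))
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(scale=0.5, size=len(X))

results = sliding_window_models(X, y, utt_ids, window=8, n_predictors=10)
```

Each entry of `results` holds the window's starting utterance, the bias on held-out segments, and the correlation between predicted and observed values. Comparing these across windows is one way to see whether changes in closeness of fit coincide with changes in bias.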
Keller, E., & Zellner Keller, B. (2013). How Much Prosody Can You Learn from Twenty Utterances? Linguistik Online, 17(5).