Fine-Tracker: Background

In everyday speech there are usually no pauses between lexical items: words flow smoothly into one another, with adjacent sounds coarticulated. If words are assumed to be constructed from a limited set of abstract phonemes, this means that virtually every contiguous phoneme string is compatible with many alternative word-sequence interpretations. Human listeners, however, appear to recognise the intended word sequences without much difficulty. Even in the case of fully embedded words, such as ham in hamster, listeners can distinguish between the two interpretations before the end of the first syllable “ham”.
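To make the scale of this ambiguity concrete, the sketch below enumerates every word-sequence parse of a contiguous phoneme string against a small lexicon. The mini-lexicon and the phone labels are invented for the example; they are not Fine-Tracker's lexicon or transcription scheme.

```python
# Illustrative sketch: enumerate every way a contiguous phoneme string
# can be parsed into known words. The mini-lexicon and phone labels are
# invented for this example.

LEXICON = {
    ("h", "ae", "m"): "ham",
    ("h", "ae", "m", "s", "t", "er"): "hamster",
    ("s", "t", "er"): "stir",
}

def parses(phones):
    """Return all segmentations of `phones` into lexicon words."""
    if not phones:
        return [[]]          # one parse: the empty word sequence
    results = []
    for end in range(1, len(phones) + 1):
        prefix = tuple(phones[:end])
        if prefix in LEXICON:
            for rest in parses(phones[end:]):
                results.append([LEXICON[prefix]] + rest)
    return results

print(parses(["h", "ae", "m", "s", "t", "er"]))
# -> [['ham', 'stir'], ['hamster']]: one phoneme string, two readings
```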

There is now considerable evidence from psycholinguistic and phonetic research that sub-segmental (i.e. subtle, fine-grained, acoustic-phonetic) and supra-segmental (i.e. prosodic) detail in the speech signal modulates human speech recognition (HSR) and helps the listener segment the speech signal into syllables and words (e.g. Davis et al., 2002; Kemps et al., 2005; Salverda et al., 2003). It is this kind of information that appears to help the human perceptual system distinguish short words (like ham) from the longer words in which they are embedded (like hamster). Salverda et al. (2003), for instance, showed that the lexical interpretation of an embedded sequence is related to its duration: a longer sequence tends to be interpreted as a monosyllabic word more often than a shorter one. Kemps et al. (2005) found that, in addition to duration, intonation helps the perceptual system distinguish singular forms from the stems of plural forms. These results call into question the validity of the phone(me) as the unit of recognition in human listeners.
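As a purely illustrative picture of how such a durational cue could enter a recognition model, the sketch below scores the two competing interpretations of /h ae m/ by how close the observed duration lies to a typical duration for each reading. All numbers are invented for the example; they are not estimates from Salverda et al. (2003).

```python
# Purely illustrative: duration as a cue to lexical interpretation.
# A longer realisation of /h ae m/ pushes the score toward the
# monosyllabic word "ham"; a shorter one toward embedded "ham(ster)".
# The reference durations are invented, not measured values.

def duration_scores(ms, mono_ms=300.0, embedded_ms=220.0):
    """Score the two interpretations by closeness to a typical duration."""
    score_ham = 1.0 / (1.0 + abs(ms - mono_ms))
    score_hamster = 1.0 / (1.0 + abs(ms - embedded_ms))
    return {"ham": score_ham, "hamster": score_hamster}

print(duration_scores(310.0))  # long /h ae m/: "ham" scores higher
print(duration_scores(210.0))  # short /h ae m/: "hamster" scores higher
```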

In the field of ASR, the validity of phone(me)s as the recognition unit is also debated, and in recent years alternative units of recognition have been investigated. One example is articulatory features (AFs), which describe properties of speech production and can be used to represent the acoustic signal in a compact manner. AFs are abstract classes which characterise the most essential articulatory properties of speech sounds (e.g. voicing, nasality, roundedness) in a quantised form, leading to an intermediate representation between the signal and the lexical units (Kirchhoff, 1999).
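To make the idea of such a quantised intermediate representation concrete, the sketch below encodes a few phones as vectors of articulatory feature values on a handful of tiers. The feature inventory and the values are simplified for illustration and do not reproduce the exact feature set of Kirchhoff (1999).

```python
# Illustrative sketch: phones encoded as quantised articulatory feature
# values on a small set of tiers. The feature inventory and values are
# simplified and do not reproduce any published feature set exactly.

AF_TIERS = ("manner", "place", "voice", "nasality", "rounding")

PHONE_TO_AFS = {
    "m": {"manner": "nasal",     "place": "bilabial", "voice": "+",
          "nasality": "+", "rounding": "-"},
    "s": {"manner": "fricative", "place": "alveolar", "voice": "-",
          "nasality": "-", "rounding": "-"},
    "u": {"manner": "vowel",     "place": "back",     "voice": "+",
          "nasality": "-", "rounding": "+"},
}

def af_vector(phone):
    """Return the phone's value on each tier, in a fixed tier order."""
    values = PHONE_TO_AFS[phone]
    return tuple(values[tier] for tier in AF_TIERS)

print(af_vector("m"))
# -> ('nasal', 'bilabial', '+', '+', '-')
```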

AFs are often put forward (Kirchhoff, 1999; Wester, 2003) as a more flexible and parsimonious alternative to modelling the variation in speech using the standard ‘beads-on-a-string’ paradigm (Ostendorf, 1999), in which the acoustic signal is described in terms of phones, and words in terms of phone sequences. AFs make it possible to represent speech phenomena such as coarticulation and assimilation effects as simple feature spreading.
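As a toy illustration of feature spreading, the sketch below models a simple place assimilation (an alveolar nasal taking on the place of a following bilabial stop, as when the /n/ at the end of “green” in “green boat” is produced as [m]): instead of rewriting the phone string, only the value on the place tier changes. The rule and the segment representation are invented for the example and are not Fine-Tracker's mechanism.

```python
# Toy example: place assimilation expressed as feature spreading.
# Rather than replacing /n/ with /m/ in a phone string, the "place"
# value of a following stop spreads leftward onto a preceding nasal.

def spread_place(segments):
    """Copy the place feature of a following stop onto a preceding nasal."""
    out = [dict(seg) for seg in segments]   # don't mutate the input
    for i in range(len(out) - 1):
        if out[i]["nasality"] == "+" and out[i + 1]["manner"] == "stop":
            out[i]["place"] = out[i + 1]["place"]   # the feature spreads
    return out

# /n/ followed by /b/, as at the word boundary in "green boat":
n = {"manner": "nasal", "place": "alveolar", "nasality": "+"}
b = {"manner": "stop",  "place": "bilabial", "nasality": "-"}
print(spread_place([n, b])[0]["place"])
# -> 'bilabial': the nasal surfaces with bilabial place, i.e. as [m]
```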

To date, however, there is no computational model of HSR able to capture the fine phonetic variation (Hawkins, 2003) that modulates human speech recognition (see also Scharenborg et al., 2005). Fine-Tracker tries to fill this gap. Its input representation consists of tiers of these AFs, so it can also be used as a tool to investigate the usability of AFs as the unit of recognition in ASR systems. Fine-Tracker is based on the theory of human word recognition set out in Norris (1994).
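The sketch below gives a deliberately simplified picture of what matching AF tiers against a lexicon could look like: each input frame is a vector of feature values, each lexical entry is a canonical AF sequence, and words are scored by the best monotonic alignment of their sequence to the frames. All names and numbers are assumptions made for the example; Fine-Tracker's actual lexical search and activation dynamics (following Norris, 1994) are considerably more sophisticated.

```python
# Highly simplified sketch of AF-based lexical matching. Each input
# frame is a dict of articulatory feature values; each lexical entry
# is a canonical sequence of AF targets. A word's score is the best
# monotonic alignment of its targets to the frames (simple DP).
# Illustration only; not Fine-Tracker's actual search.

def frame_match(frame, target):
    """Fraction of feature tiers on which frame and target agree."""
    return sum(frame.get(t) == v for t, v in target.items()) / len(target)

def word_score(frames, targets):
    """Best monotonic alignment of `targets` to `frames`."""
    n, m = len(frames), len(targets)
    NEG = float("-inf")
    # dp[i][j]: best score aligning the first i frames to the first j targets
    dp = [[NEG] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = frame_match(frames[i - 1], targets[j - 1])
            # a frame either extends the current target or starts the next
            dp[i][j] = local + max(dp[i - 1][j], dp[i - 1][j - 1])
    return dp[n][m]

# Toy input: two nasal-like frames followed by a fricative-like frame.
frames = [
    {"manner": "nasal",     "voice": "+"},
    {"manner": "nasal",     "voice": "+"},
    {"manner": "fricative", "voice": "-"},
]
lexicon = {                       # toy entries: final segments only
    "ham":  [{"manner": "nasal", "voice": "+"}],
    "hams": [{"manner": "nasal", "voice": "+"},
             {"manner": "fricative", "voice": "-"}],
}
for word, targets in lexicon.items():
    print(word, word_score(frames, targets))
# "hams" wins (3.0 vs 2.0): it also accounts for the fricative frame
```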

References

  • Davis, M.H., Marslen-Wilson, W.D., Gaskell, M.G., 2002. Leading up the lexical garden-path: Segmentation and ambiguity in spoken word recognition. Journal of Experimental Psychology: Human Perception and Performance, 28, 218-244.
  • Hawkins, S., 2003. Roles and representations of systematic fine phonetic detail in speech understanding. Journal of Phonetics, 31, 373-405.
  • Kemps, R.J.J.K., Ernestus, M., Schreuder, R., Baayen, R.H., 2005. Prosodic cues for morphological complexity: The case of Dutch plural nouns. Memory & Cognition, 33, 430-446.
  • Kirchhoff, K., 1999. Robust speech recognition using articulatory information. Ph.D. thesis, University of Bielefeld.
  • Norris, D., 1994. Shortlist: A connectionist model of continuous speech recognition. Cognition, 52, 189-234.
  • Ostendorf, M., 1999. Moving beyond the ‘beads-on-a-string’ model of speech. Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, Keystone, CO, pp. 79-84.
  • Salverda, A.P., Dahan, D., McQueen, J.M., 2003. The role of prosodic boundaries in the resolution of lexical embedding in speech comprehension. Cognition, 90, 51-89.
  • Scharenborg, O., Norris, D., ten Bosch, L., McQueen, J.M., 2005. How should a speech recognizer work? Cognitive Science, 29 (6), 867-918.
  • Wester, M., 2003. Syllable classification using articulatory-acoustic features. Proceedings of Eurospeech, Geneva, Switzerland, pp. 233-236.
