COMPUTATIONAL MODELS OF SPEECH PERCEPTION

By building computer models of the speech system and running simulations with them, we are able to better understand how phonetic categories are learned and how listeners process speech sounds during word recognition. We use several types of models to study speech perception, including neural network and statistical models. Two of the specific problems we are working on are unsupervised learning and cue weighting.


Unsupervised learning

Fig. 1. Learning speech sound categories over development.

The languages of the world use acoustic information in many different ways, and, as a result, different languages have different categories of speech sounds. For example, Japanese has a single category on the R/L dimension, while English has two. Similarly, English has only two categories of voicing (e.g. B and P), while Thai has three. In the process of learning language, one of the first tasks encountered by infants is to determine which sounds form different categories in their language and which do not. Careful measurement of the acoustic cues used in speech has revealed that there are statistical properties in the sound signal that reflect the categories of a given language. A number of researchers have suggested that infants may keep track of this information and use it to learn speech categories.

We have modeled this process using a type of computational model called a Gaussian mixture model (GMM), which represents each category as a Gaussian distribution along an acoustic dimension. The video below shows one of these models learning the voicing categories of English from voice onset time (VOT), the delay between the release of a stop consonant and the onset of voicing. The model starts out with a number of categories placed at random along the dimension, since, like infants, it does not know how many categories its native language has. It is then presented with VOT values and adjusts the parameters of its categories using unsupervised learning and competition between the categories.

The video below shows a simulation of the mixture model acquiring English voicing categories based on VOT. Over time, the model eliminates unnecessary categories to arrive at the correct two-category solution, and it adjusts the parameters of the mixture components (black lines) to capture the distributional statistics of the input (red line).

With enough exposure, the model determines the correct number of categories in the language and their properties. This model provides us with a solution to the problem of acquiring categories through unsupervised learning. In addition, the model demonstrates that statistical learning and competition are sufficiently powerful mechanisms for acquiring speech categories, and it allows us to examine the process of speech development over time.
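The general idea can be sketched in a few lines of Python. This is a minimal illustration rather than the model used in our simulations: it assumes batch expectation-maximization as the learning rule and treats the pruning of low-weight components as a crude stand-in for competition. The function name, the pruning threshold, and the synthetic VOT distributions (means and standard deviations in milliseconds) are all hypothetical values chosen for the example.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and s.d. sigma at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def fit_vot_mixture(vot, n_start=8, n_iter=200, prune_below=0.02, seed=0):
    """Fit a 1-D Gaussian mixture to VOT values, dropping weak components.

    Assumption: batch EM plus pruning, not necessarily the learning rule
    behind the simulations described in the text.
    """
    rng = np.random.default_rng(seed)
    # Start with more categories than needed, at random places on the dimension.
    mu = rng.uniform(vot.min(), vot.max(), size=n_start)
    sigma = np.full(n_start, vot.std())
    phi = np.full(n_start, 1.0 / n_start)  # mixing weights

    for _ in range(n_iter):
        # E-step: how strongly each category claims each VOT value.
        lik = phi * gaussian_pdf(vot[:, None], mu, sigma)
        resp = lik / (lik.sum(axis=1, keepdims=True) + 1e-300)

        # M-step: re-estimate each category from the values it claims.
        nk = resp.sum(axis=0) + 1e-12
        mu = (resp * vot[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (vot[:, None] - mu) ** 2).sum(axis=0) / nk)
        sigma = np.maximum(sigma, 1.0)  # floor (ms) so no category collapses
        phi = nk / nk.sum()

        # "Competition" (assumed form): remove categories with little support
        # and renormalize the remaining mixing weights.
        keep = phi >= prune_below
        mu, sigma, phi = mu[keep], sigma[keep], phi[keep] / phi[keep].sum()

    return mu, sigma, phi

# Hypothetical English-like input: a short-lag and a long-lag VOT cluster.
data_rng = np.random.default_rng(1)
vot = np.concatenate([data_rng.normal(0.0, 8.0, 1000),
                      data_rng.normal(50.0, 15.0, 1000)])

mu, sigma, phi = fit_vot_mixture(vot)
for m, s, p in sorted(zip(mu, sigma, phi)):
    print(f"mean={m:6.1f} ms  sd={s:5.1f} ms  weight={p:.2f}")
```

Plain EM has no built-in pressure to eliminate categories, so how many components survive in this sketch depends on the starting positions and on the pruning threshold. A learning rule with explicit competition between categories is what lets a model settle on the number of categories the input actually supports.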


Cue weighting

Fig. 2. Schematic diagram of the cue weighting model.

In addition to learning individual acoustic cues, we would also like to know how children and adults learn to weight and combine multiple acoustic cues in order to perceive speech. One possibility is that the weight assigned to each cue is determined by its reliability: more reliable cues are weighted more heavily, and less reliable ones less heavily. We have developed a GMM using this approach to determine cue weights. The model is able to learn the approximate weights for a variety of acoustic cues occurring in different contexts and different languages.
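As a rough illustration of reliability-based weighting, the sketch below scores each cue by how far apart its two category means are relative to their spread, then normalizes the scores into weights. The reliability metric is an assumption made for this example, not necessarily the one used in our model, and the function names and category statistics (in milliseconds) are hypothetical.

```python
import math

def reliability(mu_a, sd_a, mu_b, sd_b):
    """Separation of two categories along one cue, in pooled-s.d. units.

    Assumed metric for illustration: larger values mean the cue
    distinguishes the two categories more reliably.
    """
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2.0)
    return abs(mu_a - mu_b) / pooled_sd

def cue_weights(cues):
    """Turn per-cue reliabilities into weights that sum to 1."""
    scores = {name: reliability(*params) for name, params in cues.items()}
    total = sum(scores.values())
    if total == 0.0:
        raise ValueError("no cue separates the categories")
    return {name: score / total for name, score in scores.items()}

# Hypothetical category statistics per cue: (mean_a, sd_a, mean_b, sd_b).
cues = {
    "vot_ms": (0.0, 8.0, 50.0, 15.0),               # well-separated categories
    "vowel_length_ms": (180.0, 40.0, 170.0, 40.0),  # heavily overlapping
}

weights = cue_weights(cues)
for name, w in weights.items():
    print(f"{name}: {w:.3f}")
```

In a mixture-model setting, the means and standard deviations would come from the categories the model has learned for each cue rather than being specified by hand.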

The model has also provided us with clues about why certain acoustic cues may be weighted differently than we would expect based on their reliability. For example, listeners use vowel length information for word-initial voicing judgments, but the statistical reliability of this cue is very low. The GMM weights this cue similarly to human listeners.

Since learning in the model is unsupervised, it does its best to approximate the distribution of the inputs. However, through learning, the model arrives at a set of categories along the vowel length dimension that is more robust than the categories in the input. Human listeners may overweight certain cues in speech for the same reason. This suggests that listeners may not be perfectly optimal in their use of perceptual cues and that learning may play a crucial role in determining how they weight cues.

