Consonant Challenge: Baseline Recognition System

The baseline recognition system

The performance of various acoustic features (MFCC, FBANK, MELSPEC, Ratemaps) and recogniser architectures (monophone, triphone, gender-dependent/independent) was investigated. Two representative combinations were chosen as baselines for the Consonant Challenge: one system based on MFCCs, the other on Ratemaps.

For both systems, 30 models were trained: one for each of the 24 consonants, plus two models for each of the three vowels – one modelling the initial and one the final vowel context of the VCV. The maximum number of training items per consonant is 9 (vowel contexts) * 2 (stress conditions) * 16 (speakers) = 288 items.
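The per-consonant count above can be checked with a few lines (the three factors are taken directly from the text; with three vowels, a VCV token has 3 * 3 = 9 possible initial/final vowel combinations):

```python
# Training-item count per consonant, using the figures given in the text.
n_vowel_contexts = 3 * 3   # initial vowel x final vowel
n_stress = 2               # stress conditions
n_speakers = 16            # training speakers

max_items_per_consonant = n_vowel_contexts * n_stress * n_speakers
print(max_items_per_consonant)  # 288
```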

MFCC-based recognition system

The speech is parameterised with 12 MFCC coefficients and log energy, augmented with first and second temporal derivatives, resulting in a 39-dimensional feature vector. Each monophone consists of 3 emitting states with a 24-component Gaussian mixture output distribution. No silence or short pause models are employed, as the features are end-pointed. The HMMs were trained from a flat start using HTK.
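As a rough illustration of how the 39 dimensions arise, the sketch below appends HTK-style delta and delta-delta coefficients to a 13-dimensional static vector (12 MFCCs + log energy). The regression window width of ±2 frames and the edge padding are assumptions, not settings stated here:

```python
import numpy as np

def deltas(feat, width=2):
    """HTK-style delta regression over a +/- `width` frame window.

    feat: (n_frames, n_dims) array of static features.
    """
    n = feat.shape[0]
    # Repeat the edge frames so the regression is defined at the ends.
    padded = np.pad(feat, ((width, width), (0, 0)), mode="edge")
    denom = 2 * sum(t * t for t in range(1, width + 1))
    out = np.zeros_like(feat, dtype=float)
    for t in range(1, width + 1):
        out += t * (padded[width + t:width + t + n] - padded[width - t:width - t + n])
    return out / denom

# 13 static coefficients per frame, assumed already extracted.
static = np.random.randn(100, 13)
d1 = deltas(static)        # first temporal derivatives
d2 = deltas(d1)            # second temporal derivatives
obs = np.hstack([static, d1, d2])
print(obs.shape)  # (100, 39)
```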

Ratemap-based recognition system

Ratemaps are a filterbank-based representation derived from auditory excitation patterns. The feature vectors are 64-dimensional; the model architecture is the same as for the MFCC-based system.
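Ratemap extraction itself is distributed separately (see below), but the general idea can be sketched: rectify the output of each auditory filterbank channel, smooth it with a leaky integrator, and sample one value per frame. Everything concrete here – the leaky-integrator time constant, frame shift, cube-root compression, and the stand-in for the filterbank outputs – is an assumption for illustration, not the challenge's actual recipe:

```python
import numpy as np

def ratemap_like(channel_outputs, fs=8000, tau=0.008, frame_shift=0.010):
    """Rough rate-map-style smoothing.

    channel_outputs: (n_channels, n_samples) array of bandpass
    (e.g. gammatone) filter outputs. tau and frame_shift are assumed values.
    """
    rectified = np.maximum(channel_outputs, 0.0)   # half-wave rectification
    alpha = np.exp(-1.0 / (fs * tau))              # leaky-integrator pole
    smoothed = np.empty_like(rectified)
    acc = np.zeros(rectified.shape[0])
    for i in range(rectified.shape[1]):
        acc = alpha * acc + (1.0 - alpha) * rectified[:, i]
        smoothed[:, i] = acc
    hop = int(fs * frame_shift)
    frames = smoothed[:, ::hop]                    # one column per frame
    return np.cbrt(frames)                         # amplitude compression

sig = np.random.randn(64, 8000)                    # stand-in: 64 channels, 1 s
feats = ratemap_like(sig)
print(feats.shape)  # (64, 100)
```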

A .zip file containing the MFCC-based baseline model, training and testing scripts, and the evaluation script can be downloaded here. The ratemap generation scripts are available upon request. A short explanation of the scripts can be downloaded here. For any remaining questions please contact Ning Ma (University of Sheffield, UK).

Results

The overall consonant recognition accuracy on Test set 1 (clean) is 88.5% for the MFCC-based system and 84.4% for the Ratemap-based system.

The confusion matrices of the two baseline systems can be found here. The diagonal of each confusion matrix shows the number of correct responses (16 is the maximum). Each row corresponds to the phoneme that was produced; each column to the phoneme that was recognised.
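Given such a matrix, overall accuracy is the diagonal sum divided by the total token count, and row-wise division gives per-consonant accuracy. The 3x3 matrix below is a made-up example (the real matrices cover 24 consonants with up to 16 tokens per row):

```python
import numpy as np

# Hypothetical confusion matrix: rows = phoneme produced,
# columns = phoneme recognised, 16 test tokens per row.
conf = np.array([[15, 1, 0],
                 [2, 13, 1],
                 [0, 3, 13]])

overall_acc = np.trace(conf) / conf.sum()          # correct / total
per_phone_acc = np.diag(conf) / conf.sum(axis=1)   # accuracy per row
print(round(overall_acc, 3))  # 0.854
```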
