Bob recently bought an Amazon Echo Dot. He is very curious about the automatic speech recognition
technology used by Alexa. He wants to understand how it works. To start simple, he wants to know how
to recognize phonemes. He has a short speech recording. He needs your help to design an algorithm to
identify which phonemes were spoken in the recording. Before tackling this challenge, we first review
some background knowledge.
Speech Analysis
Let’s first briefly discuss some important properties of speech. First, speech signals are non-stationary,
i.e., their statistical properties change over time. However, they can be treated as quasi-stationary
over short segments, typically 5-20 ms. We therefore usually study the statistical and spectral
properties of speech over short segments, such as 20 ms.
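To make the idea of short-segment analysis concrete, here is a minimal sketch that splits a signal into 20 ms frames and computes the mean energy of each frame. The function names (frame_signal, short_time_energy), the 8 kHz sample rate, the non-overlapping frames, and the synthetic test tone are all assumptions chosen for illustration; they are not part of the problem statement.

```python
import math


def frame_signal(samples, sample_rate, frame_ms=20.0):
    """Split samples into consecutive, non-overlapping frames of frame_ms milliseconds.

    Non-overlapping frames are an assumption made for simplicity;
    a trailing partial frame is dropped.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


# Hypothetical test signal: a 200 Hz tone, 1 s long at 8 kHz, whose
# amplitude doubles halfway through, so the signal is not stationary.
sample_rate = 8000
signal = [
    (0.5 if n < sample_rate // 2 else 1.0)
    * math.sin(2 * math.pi * 200 * n / sample_rate)
    for n in range(sample_rate)
]

frames = frame_signal(signal, sample_rate, frame_ms=20.0)
energies = [short_time_energy(f) for f in frames]

print(len(frames), "frames of", len(frames[0]), "samples")
print("energy of first frame:", energies[0])
print("energy of last frame:", energies[-1])
```

Within each frame the energy is nearly constant, while it differs between the first and second half of the signal. This is the sense in which a non-stationary signal can be analyzed frame by frame.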
Speech can generally be classified as voiced (e.g., /a/, /i/), unvoiced (e.g., /sh/), or mixed. Time and
frequency domain plots for sample voiced and unvoiced segments are shown in Fig. 1. Voiced speech is
quasi-periodic in the time-domain and harmonically structured in the frequency-domain, while unvoiced
speech is random-like and broadband. In addition, the energy of voiced segments is generally higher
than the energy of unvoiced segments.
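The two properties above (higher energy and quasi-periodicity for voiced speech) can be measured on a single frame. The sketch below uses the peak of the normalized autocorrelation as a periodicity measure, which is one common choice and not something the text prescribes. The function names, the 50-400 Hz pitch search range, and the synthetic "voiced" and "unvoiced" frames are assumptions for illustration only; real speech would need thresholds tuned on data.

```python
import math
import random


def energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def periodicity(frame, sample_rate, f_min=50.0, f_max=400.0):
    """Peak of the normalized autocorrelation over lags in an assumed pitch range.

    Values near 1 suggest a quasi-periodic (voiced-like) frame,
    values near 0 suggest a noise-like (unvoiced-like) frame.
    """
    r0 = sum(x * x for x in frame)
    if r0 == 0.0:
        return 0.0
    lag_min = int(sample_rate / f_max)
    lag_max = min(int(sample_rate / f_min), len(frame) - 1)
    best = 0.0
    for lag in range(lag_min, lag_max + 1):
        r = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        best = max(best, r / r0)
    return best


sample_rate = 8000
frame_len = 160  # 20 ms at 8 kHz

# Hypothetical voiced-like frame: 200 Hz fundamental plus two harmonics.
voiced = [
    math.sin(2 * math.pi * 200 * n / sample_rate)
    + 0.5 * math.sin(2 * math.pi * 400 * n / sample_rate)
    + 0.25 * math.sin(2 * math.pi * 600 * n / sample_rate)
    for n in range(frame_len)
]

# Hypothetical unvoiced-like frame: low-amplitude Gaussian noise (fixed seed).
rng = random.Random(0)
unvoiced = [rng.gauss(0.0, 0.1) for _ in range(frame_len)]

for name, frame in (("voiced", voiced), ("unvoiced", unvoiced)):
    print(name, "energy:", energy(frame),
          "periodicity:", periodicity(frame, sample_rate))
```

On these synthetic frames the voiced-like frame shows both higher energy and a much stronger autocorrelation peak than the noise frame, mirroring the qualitative description above.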