Monday, 3 December 2012

Lecture 4: Speech

Human Speech:



A learned communication system consisting of three major components: voice, articulation and language skills. Voice (also known as phonation) is the sound produced when air moving through the voice box (larynx) sets the vocal cords vibrating. The vocal cords consist of elastic connective tissue covered by folds of mucous membrane. Vibration can occur with air passing into or out of the lungs; in this respect it is apt to liken the cords to the reed of a harmonica. The pitch at which the vibrations take place can be modified by muscles that shorten and tighten the vocal cords for high-pitched tones and leave them looser and longer for low-frequency tones. The frequency range of speech is approximately 80 Hz to 8 kHz, meaning the highest frequency in high-quality speech is one hundred times the lowest.
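
As a quick check of the arithmetic on the speech band, the sketch below computes the ratio between the two band edges and, as an extra illustration not taken from the lecture, the number of octaves the band spans. The function names are made up for this example.

```python
import math

# Approximate speech band quoted in the notes.
F_LOW_HZ = 80.0
F_HIGH_HZ = 8000.0

def band_ratio(f_low, f_high):
    """Ratio of the highest to the lowest frequency in a band."""
    return f_high / f_low

def band_octaves(f_low, f_high):
    """Number of octaves (doublings of frequency) spanned by the band."""
    return math.log2(f_high / f_low)

ratio = band_ratio(F_LOW_HZ, F_HIGH_HZ)
octaves = band_octaves(F_LOW_HZ, F_HIGH_HZ)
print(f"ratio = {ratio:.0f}, octaves = {octaves:.2f}")  # ratio = 100, octaves = 6.64
```

So a factor of one hundred in frequency corresponds to a little over six and a half octaves.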

There is a difference between speech and singing. A properly trained singer can produce a wider range of sounds than is produced in normal speech. Singers also aim to control their breath and keep more regulated tension in their vocal cords. Yodelling, for example, would not be achievable in normal speech, as it constantly alternates between high and low registers.

Articulation is the stage at which the airflow modified by the larynx is shaped into phonemes (basic speech sounds); the phonemes are then combined to form the words of a language. The movements of the tongue, lips, lower jaw and soft palate articulate speech by interrupting and shaping the voiced and unvoiced airflow. There are approximately 40 phonemes in the English language, classified into four categories: vowels, nasals, plosives and fricatives. This is specific to English and may vary for other languages.

Phoneme Types:

Vowels: Generated from oscillatory excitation of the vocal tract. In this process, the articulators remain static and the sound radiates from the mouth. Vowels are characterised acoustically by the first three or four vocal tract resonances, for which the correct term is formants.

Nasals: Generated when the tongue is raised and the velum (soft palate) is lowered, so that sound radiates from the nasal cavity.

Plosives: Produced by a sudden burst of pressure, released after the front of the vocal tract has been shut by the tongue or the lips.

Fricatives: Similar to plosives, but formed by partial restriction of the vocal tract rather than full closure.

Time Domain Features:

Speech shows bursts of localised, differentiated activity and, as a rule, is never stationary. Although there are no gaps between words in continuous speech, there can still be silent periods within the words themselves; the word "speech" itself contains two. These silent periods can sometimes aid the diagnosis of noise characteristics. The amplitude of speech varies over time, which can be described as a modulation process. Speech modulation tends to range from 0.25 Hz to 25 Hz, with the peak typically between 3 and 6 Hz.
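
To make the idea of a modulation rate concrete, here is a minimal sketch that recovers it from a signal's amplitude envelope. Everything in it is an assumption for illustration rather than part of the lecture: the signal is a synthetic 200 Hz tone standing in for speech, amplitude-modulated at 4 Hz, and the envelope is estimated by rectifying and smoothing with a moving average.

```python
import cmath
import math

FS = 1000           # sample rate in Hz (assumed)
DURATION = 2.0      # seconds of synthetic signal (assumed)
CARRIER_HZ = 200.0  # tone standing in for speech energy (assumed)
MOD_HZ = 4.0        # assumed modulation rate, inside the 3-6 Hz region
WINDOW = 25         # smoothing window in samples (25 ms at 1 kHz)

def make_signal():
    """Carrier tone whose amplitude rises and falls MOD_HZ times per second."""
    n = int(FS * DURATION)
    return [(1.0 + 0.8 * math.sin(2 * math.pi * MOD_HZ * i / FS))
            * math.sin(2 * math.pi * CARRIER_HZ * i / FS) for i in range(n)]

def envelope(x, window):
    """Rectify the signal, then smooth it with a moving average."""
    rect = [abs(v) for v in x]
    return [sum(rect[i:i + window]) / window
            for i in range(len(rect) - window + 1)]

def peak_modulation_hz(env, fs, candidates):
    """Return the candidate frequency with the largest DFT magnitude."""
    mean = sum(env) / len(env)
    centred = [v - mean for v in env]  # remove the constant (0 Hz) part

    def magnitude(f):
        return abs(sum(v * cmath.exp(-2j * math.pi * f * i / fs)
                       for i, v in enumerate(centred)))

    return max(candidates, key=magnitude)

# Search the 0.5 Hz to 25 Hz range in 0.5 Hz steps.
candidates = [0.5 * k for k in range(1, 51)]
peak_hz = peak_modulation_hz(envelope(make_signal(), WINDOW), FS, candidates)
print(peak_hz)
```

With real speech the envelope spectrum would be spread across the whole range rather than concentrated at a single frequency.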

Voice Production Summary:

Lungs take the role of air reservoir and bellows, forcing air between the vocal cords of the larynx. Unvoiced sounds form when the cords are relaxed and do not vibrate; in a simplified view this can be treated as silence.

A Model of Voice Production:

This is often known as signal generation and can be used to create synthetic speech. Learned patterns in the brain interact with the nervous system, which in turn interacts with the local musculature. The musculature controls three elements: the vocal cords, which generate the signal; the lungs, which act as the power supply; and the resonators and articulators, which process the signal.
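
The model can be sketched as a source feeding a filter. The example below is only an illustration under stated assumptions: an 8 kHz sample rate, a 100 Hz impulse train standing in for vibrating vocal cords, uniform noise standing in for an unvoiced source, and a single two-pole resonator with a made-up 500 Hz centre frequency standing in for the resonators and articulators. None of these values come from the lecture.

```python
import math
import random

FS = 8000  # sample rate in Hz (assumed)

def voiced_excitation(n_samples, pitch_hz, fs=FS):
    """Impulse train: one pulse per pitch period, like vibrating vocal cords."""
    period = int(round(fs / pitch_hz))
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]

def unvoiced_excitation(n_samples, seed=0):
    """Noise source: the cords do not vibrate, so there is no pitch."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]

def resonator(x, centre_hz, radius, fs=FS):
    """Two-pole recursive filter acting as one vocal tract resonance."""
    theta = 2 * math.pi * centre_hz / fs
    b1 = 2 * radius * math.cos(theta)
    b2 = -radius * radius
    y = []
    for n, v in enumerate(x):
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        y.append(v + b1 * y1 + b2 * y2)
    return y

n = int(0.05 * FS)                                  # 50 ms of signal
excitation = voiced_excitation(n, pitch_hz=100.0)   # the "vocal cords"
voiced = resonator(excitation, centre_hz=500.0, radius=0.97)
unvoiced = resonator(unvoiced_excitation(n), centre_hz=500.0, radius=0.97)
```

A more realistic synthesiser would cascade several resonators, one per formant, and vary them over time as the articulators move.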

Assumptions of Linear Predictive Coding:

The vocal tract's characteristics mean that its shape cannot change particularly quickly, so the current sound can often be derived from the sound produced shortly before, within approximately 25 ms. Over this short period, new speech samples can therefore be predicted to an extent from previous ones, which makes it possible to model the vocal tract as a recursive digital filter.
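
One common way to estimate such a predictor is the autocorrelation method with the Levinson-Durbin recursion; the notes do not say which method the lecture used, so the sketch below is an assumption. It builds a 200-sample test frame (25 ms at an assumed 8 kHz) from a known second-order recursive filter with made-up coefficients 1.3 and -0.8, then recovers those coefficients from the frame alone.

```python
def autocorrelation(x, max_lag):
    """r[lag] = sum over n of x[n] * x[n - lag], for lag = 0..max_lag."""
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve for predictor coefficients a[1..order] such that
    x[n] is approximated by sum(a[k] * x[n - k])."""
    a = [0.0] * (order + 1)  # a[0] is unused
    error = r[0]             # prediction error energy
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / error      # reflection coefficient for this stage
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        error *= (1.0 - k * k)
    return a[1:], error

def synth_frame(coeffs, n_samples):
    """Impulse response of the recursive filter defined by coeffs."""
    x = []
    for n in range(n_samples):
        excitation = 1.0 if n == 0 else 0.0
        past = sum(c * x[n - k]
                   for k, c in enumerate(coeffs, start=1) if n - k >= 0)
        x.append(excitation + past)
    return x

TRUE_COEFFS = [1.3, -0.8]            # hypothetical, chosen to be stable
frame = synth_frame(TRUE_COEFFS, 200)
coeffs, residual = levinson_durbin(autocorrelation(frame, 2), 2)
print([round(c, 4) for c in coeffs])  # [1.3, -0.8]
```

Real speech is not generated by an exact low-order filter, so in practice the recovered coefficients only approximate the vocal tract, and a higher order is typically used.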

As well as generating synthetic speech, the LPC approach has a coding advantage: the number of parameters needed to specify the content of a speech frame is significantly smaller than the number of samples contained in that frame. This reduces the storage requirement for digitised speech signals and the bandwidth needed for communication links in media such as mobile phone and internet systems.
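
A rough worked example of the saving, using assumed figures that are not from the lecture: an 8 kHz sample rate, 25 ms frames, and 12 parameters per frame (10 predictor coefficients plus a gain and a pitch value).

```python
def samples_per_frame(sample_rate_hz, frame_ms):
    """Number of raw samples in one frame."""
    return int(sample_rate_hz * frame_ms / 1000)

def parameter_ratio(sample_rate_hz, frame_ms, params_per_frame):
    """How many raw samples each LPC parameter replaces, on average."""
    return samples_per_frame(sample_rate_hz, frame_ms) / params_per_frame

SAMPLE_RATE_HZ = 8000   # assumed
FRAME_MS = 25           # frame length taken from the 25 ms figure above
PARAMS_PER_FRAME = 12   # assumed: 10 coefficients + gain + pitch

n_samples = samples_per_frame(SAMPLE_RATE_HZ, FRAME_MS)
ratio = parameter_ratio(SAMPLE_RATE_HZ, FRAME_MS, PARAMS_PER_FRAME)
print(n_samples, round(ratio, 1))  # 200 16.7
```

This counts values rather than bits; the actual saving also depends on how finely the samples and the parameters are quantised.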








