Summary
This book covers the basics of natural language processing and machine learning required to build a standard speech-based gender identification system. It briefly explains the signal processing techniques needed to understand the basics of natural language processing, including the various types of Fourier transform, basic speech enhancement techniques, voice activity detection and pitch estimation using the subharmonic-to-harmonic ratio. The machine learning part explains the relevant models: Support Vector Machines, Gaussian Mixture Models and adaptive boosting. Lastly, the results of different gender identification systems implemented using state-of-the-art techniques are presented and analysed.
Excerpt
Table of Contents

Abstract
Acknowledgements
1 Introduction
2 Background
   2.1 Speech
      2.1.1 Speech Signal
   2.2 Speech Signal Processing
      2.2.1 Fourier Transform
      2.2.2 Discrete Cosine Transform
      2.2.3 Digital Filters
      2.2.4 Nyquist Shannon Sampling Theorem
      2.2.5 Window Functions
3 Speech Enhancement
   3.1 Signal to Noise Ratio
   3.2 Spectral Subtraction
   3.3 Cepstral Mean Normalization
   3.4 RASTA Filtering
   3.5 Voice Activity Detector
      3.5.1 The Empirical Mode Decomposition Method
      3.5.2 The Hilbert Spectrum Analysis
      3.5.3 Voice Activity Detection
4 Gender Identification Systems
   4.1 Acoustic Features
      4.1.1 Mel Frequency Cepstral Coefficients (MFCC)
      4.1.2 Shifted Delta Cepstral (SDC)
      4.1.3 Pitch Extraction Method
   4.2 Pitch Based Models
   4.3 Models based on Acoustic Features
   4.4 Fused Models
5 Learning Techniques for Gender Identification
   5.1 Overview
   5.2 Adaboost
   5.3 Gaussian Mixture Model (GMM)
      5.3.1 GMM Training
      5.3.2 GMM Testing
   5.4 Decision Making
   5.5 Likelihood Ratio
   5.6 Universal Background Model
      5.6.1 UBM Training
6 System Design and Implementation
   6.1 Toolboxes
      6.1.1 Signal Processing Toolbox
      6.1.2 Machine Learning Toolbox
   6.2 System Design
      6.2.1 Requirement
      6.2.2 Initial Approach
      6.2.3 Algorithm
      6.2.4 Feature Selection
   6.3 Experiments and Results
      6.3.1 Pitch Based Models
      6.3.2 Models Based on Acoustic Features
      6.3.3 Fused Model
      6.3.4 YouTube Videos
7 Conclusion
   7.1 Summary
   7.2 Future Recommendation
Bibliography
A Appendix
List of Figures

2.1 Mechanism of the human speech system, representing the underlying phenomenon of speech generation and speech understanding; the grey boxes represent computer systems for natural language processing [HAH01]
2.2 Conversion of analogue signal to digital signal; red lines show the digital value of the analogue signal
2.3 DFT applied to a speech signal
2.4 DCT applied to a speech signal
2.5 Sampling of a Continuous Signal
2.6 Hamming window effect
3.1 A noisy speech signal [Vat12]
3.2 A clean speech signal [Vat12]
3.3 VAD applied to noisy speech [SZ12]
4.1 A block diagram of a gender identification model
4.2 A block diagram of MFCC computation [Vat12]
4.3 Mel frequency scale [Vat12]
4.4 Graph of Mel filterbank of 24 filters [Vat12]
4.5 Computational model of SDC [TcSKD02]
4.6 Plot of male and female pitch [Sun00]
4.7 Block diagram of a gender identification model trained using MFCC [Sun00]
4.8 A fused gender identification model [Sun00]
4.9 An AdaBoost score fusion model [IKJGY10]
5.1 Optimal decision boundary between two classes
5.2 One-dimensional Gaussian Mixture Model
6.1 Block diagram of SDC feature extraction
6.2 Block diagram of the final fused model
6.3 Snapshot of the graphical user interface of the system
List of Tables

6.1 Results from the pitch-based model trained with 1 male and 1 female speaker
6.2 Results from the pitch-based model trained with 9 male and 1 female speakers
6.3 Results from the pitch-based model trained with 1 male and 9 female speakers
6.4 Results from the pitch-based model trained with 8 male and 8 female speakers
6.5 Results from the MFCC model trained using 8 GMM components
6.6 Results from the MFCC model trained using 16 GMM components
6.7 Results from the MFCC model trained using 32 GMM components
6.8 Results from the SDC model trained using 8 GMM components
6.9 Results from the SDC model trained using 16 GMM components
6.10 Results from the SDC model trained using 32 GMM components
6.11 Results from the fused model trained using 8 GMM components on SDC features
6.12 Results from the fused model trained using 16 GMM components on SDC features
6.13 Results from the fused model trained using 32 GMM components on SDC features
6.14 Results from acoustic and fused models tested on a large amount of data
6.15 Accuracy of all the models that were tested on YouTube videos
Chapter 1
Introduction
As computers become more significant in our daily lives, the interaction between humans and machines grows more important day by day. The desire of humans to communicate with machines in a natural way has led to the evolution of natural language processing. As advancements in this field continue, it is likely that voice interaction systems will replace standard keyboards in the near future. Looking at today's technology market, we find state-of-the-art technologies such as Microsoft® Kinect and Apple® Siri, which perform remarkably well. But every speech system available today has its own drawbacks, and continuous work is being done to increase the performance of such systems. Increasing the performance of speech systems requires pre-processing steps such as gender and language identification.
This book focuses on automatic gender identification using speech. Identifying gender from the speech of a speaker means detecting whether the spoken speech comes from a male or a female speaker. Automatic Gender Identification (AGI) via speech has several applications in the field of natural language processing. It has been shown in [AH96] that gender-dependent speech recognition models are more accurate than gender-independent models. Google's latest speech recognition system, found in Android devices and Google Glass, first determines the gender of the speaker before performing speech recognition for search; its recognition accuracy is exceptionally high compared to their previous system, which was a unisex model. Recently a company launched a Kinect-based online fitting room that determines the gender of the person using it from their speech in order to offer them suitable clothes. In the context of multimedia indexing, gender recognition can decrease the search space by up to half [HC].
Automatic gender recognition is itself a complex task with its own problems and limitations; to date, no gender recognition system exists that can work in a real-time environment with 100% accuracy. In a real-world environment, or in the case of multimedia indexing, many acoustic conditions exist, such as noisy speech, compressed speech, silence, telephone speech, different languages and so on, which significantly reduce the performance of a general gender identification system. So ideally a system is required that can give acceptable performance under the acoustic conditions described above.
In general, there are three main approaches to building an automatic gender identification system. The first approach uses pitch as a discriminating factor and labelled data to identify the gender of the speaker. The second approach deals with acoustic features like MFCC and unlabelled data: the relevant features are extracted and then the model is trained; in this case, generally one GMM is trained for each gender and the score from one model is subtracted from the score of the other to find the gender (a minimal sketch of this decision rule follows below). The third approach, quite commonly used after 2005, combines pitch models with acoustic models to form a fused model.
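As an illustration of the second approach, below is a minimal MATLAB sketch of the per-gender GMM decision rule, assuming the fitgmdist function from the Statistics and Machine Learning Toolbox; the variable names, the 8-component count and the regularisation value are illustrative, not this book's exact configuration.

```matlab
% Minimal sketch of the GMM decision rule: one GMM per gender,
% subtract one utterance log-likelihood from the other.
% featMale/featFemale/featTest are assumed N-by-D feature matrices
% (e.g. MFCC frames); requires the Statistics and Machine Learning Toolbox.
gmMale   = fitgmdist(featMale,   8, 'RegularizationValue', 1e-3);
gmFemale = fitgmdist(featFemale, 8, 'RegularizationValue', 1e-3);

llMale   = sum(log(pdf(gmMale,   featTest)));   % utterance log-likelihoods
llFemale = sum(log(pdf(gmFemale, featTest)));

if llMale - llFemale > 0                        % sign of the difference decides
    disp('male');
else
    disp('female');
end
```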
This book is organised as follows. The first chapter presents the challenge of gender identification. The second chapter presents the necessary signal processing background. Each subsequent chapter then describes a step of a gender identification system, including speech enhancement techniques to reduce background noise, feature extraction, gender modelling methods and different decision-making techniques. Finally, the last chapter presents the implementations done for this project and the results obtained from testing different models on a large set of speakers and on YouTube videos.
Chapter 2
Background
To understand the gender identification process using speech, we first need to understand the structure of speech. This chapter covers human speech and the basic difference between female and male voices.
2.1 Speech

Spoken language, or human speech, is the natural form of human communication and requires the use of the voice. In terms of linguistics, human speech is a form of sound wave which is produced by the lungs and given its uniqueness by the tongue, lips and jaw [HAH01].
2.1.1 Speech Signal

Figure 2.1: Mechanism of the human speech system, representing the underlying phenomenon of speech generation and speech understanding; the grey boxes represent computer systems for natural language processing [HAH01]

Speech is produced when the air pressure generated by the lungs reaches the vocal cords. The speech then begins to resonate in the nasal cavities according to the position of the lips, tongue and other organs in the mouth. In terms of signal processing, the speech signal is an analogue signal that is the convolution of a source e[n] and a filter h[n], as shown in equation 2.1, where the lungs are modelled as the source e[n] and the resonance of speech in the mouth as the filter h[n]:

x[n] = e[n] \ast h[n] = \sum_{k=-\infty}^{\infty} e[k]\, h[n-k]    (2.1)

where x[n] is the speech signal.
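To make equation 2.1 concrete, here is a small MATLAB sketch convolving a toy excitation with a toy vocal-tract impulse response; both signals are synthetic stand-ins, not measured speech.

```matlab
% Toy illustration of the source-filter convolution in equation 2.1.
% e[n] is a synthetic impulse train (glottal pulses) and h[n] a
% decaying resonance; both are stand-ins, not measured speech.
e = zeros(1, 200); e(1:20:end) = 1;  % periodic excitation from the "lungs"
h = exp(-0.3 * (0:29));              % "vocal tract" impulse response
x = conv(e, h);                      % x[n] = e[n] * h[n]
plot(x);
```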
Human Speech Frequency
The part of the audio range occupied by human speech is 300 Hz to 3400 Hz, which means human speech lies in this range [Tit00a]. The audible range, i.e. the frequency range within which humans can hear sound, is 20 Hz to 20,000 Hz. Beyond 20,000 Hz comes the ultrasonic region, which humans are unable to hear.
Figure 2.2: Conversion of analogue signal to digital signal. Red lines show the digital value of the analogue signal
Fundamental Frequency (Pitch)
Generally, the fundamental frequency is defined as the lowest frequency of a periodic waveform. The fundamental frequency, usually known as pitch in natural language processing, is the biggest discriminating factor between male and female speech. A typical adult male has a fundamental frequency between 85 Hz and 180 Hz, while an adult female has a fundamental frequency in the range of 165 Hz to 225 Hz [BO00].
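As a rough illustration of how these ranges separate the genders, the MATLAB sketch below applies a single pitch threshold; the estimate f0 is assumed to come from a pitch extractor such as the subharmonic-to-harmonic ratio method discussed later, and the 160 Hz threshold is an assumed midpoint between the typical ranges, not a tuned value from this book.

```matlab
% Minimal sketch of a pitch-threshold gender decision. f0 is an
% assumed per-utterance pitch estimate in Hz; 160 Hz is an
% illustrative threshold between the typical ranges quoted above.
f0 = 120;                       % example pitch estimate (Hz)
if f0 < 160
    disp('male');
else
    disp('female');
end
```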
2.2 Speech Signal Processing

From [HAH01] we know that speech is an analogue signal, but today's computers work with digital signals, so speech is stored in digital form on computers. When speech is converted to digital form it loses some of its information, so an accurate representation of the analogue signal in digital form is required. A conversion of an analogue signal can be seen in figure 2.2.
2.2.1 Fourier Transform

According to Joseph Fourier, any signal can be represented as a linear combination of sinusoids, which means the Fourier transform can be described as transforming a function of time f(t) into a function of frequency F(\omega). This can be shown as

F(\omega) = \int_{-\infty}^{\infty} f(t)\, e^{-2\pi i \omega t}\, dt    (2.2)
There exist different types of Fourier transforms, but the most famous are:

1. Continuous Time Fourier Transform
2. Continuous Fourier Transform
3. Discrete Fourier Transform
4. Discrete Time Fourier Transform

In an automatic gender recognition system only the discrete Fourier transform is required, so only that one is explained.
Discrete Fourier Transform
For any periodic signal x[t], the discrete Fourier transform can be defined as

X(\omega) = \sum_{t=0}^{T-1} x[t]\, e^{-2\pi i \omega t}    (2.3)

A discrete Fourier transform applied to a signal can be seen in figure 2.3.

Figure 2.3: DFT applied to a speech signal
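The following MATLAB sketch computes the magnitude spectrum of one speech frame with the built-in fft; the file name speech.wav and the 1024-sample frame length are illustrative assumptions.

```matlab
% Sketch of a DFT of one speech frame using MATLAB's built-in fft.
% The file name speech.wav and the 1024-sample frame are assumptions.
[x, fs] = audioread('speech.wav');
frame = x(1:1024, 1);                % one frame from the first channel
X = fft(frame);                      % 1024-point DFT
f = (0:511) * fs / 1024;             % frequencies for the first half
plot(f, abs(X(1:512)));              % magnitude spectrum
xlabel('Frequency (Hz)'); ylabel('|X|');
```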
2.2.2 Discrete Cosine Transform

The Discrete Cosine Transform, commonly known as the DCT, is similar to the DFT. The DCT is used to transform a finite sequence of data points into a sum of sinusoids vibrating at different frequencies. The DCT is usually used for the compression of images and sound, where a number of the high-frequency components can be discarded; the transformed signal is mostly made up of lower frequencies, so the majority of the information can be found in the first few coefficients. More information about the usage of the DCT in speech processing can be found in [MAMG11].
Mathematically, the DCT can be defined as

X_T[k] = \sum_{n=0}^{T-1} x_T[n] \cos\left[\frac{\pi k}{T}\left(n + \frac{1}{2}\right)\right], \quad k \in [0, T-1]    (2.4)

where X_T[k] is the k-th coefficient. The DCT applied to a speech signal can be seen in figure 2.4.

Figure 2.4: DCT applied to a speech signal
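To illustrate the compression property described above, the sketch below keeps only the first DCT coefficients of a frame and reconstructs it; dct and idct are Signal Processing Toolbox functions, and the 64-coefficient cut-off is an arbitrary illustrative choice.

```matlab
% Sketch of the compression property: keep only the first 64 DCT
% coefficients of a 1024-sample frame and reconstruct it.
[x, ~] = audioread('speech.wav');
frame = x(1:1024, 1);
C = dct(frame);
C(65:end) = 0;                       % discard higher-frequency coefficients
approx = idct(C);                    % most of the frame is preserved
plot([frame, approx]); legend('original', '64 DCT coefficients');
```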
2.2.3 Digital Filters

Digital filters are mathematical models applied to a signal to remove some of its components or to enhance some of its aspects. In natural language processing, the most widely used filters are low pass, band pass and high pass filters [APHA96].
Low Pass Filter
A low pass filter is used to discard the frequencies higher than the cut-off frequency
in a speech signal.
High Pass Filter
A high pass filter is used to discard the frequencies lower than the cut-off frequency
in a speech signal.
Band Pass Filter
The band pass filter allows a certain range of frequencies to pass and discards all frequencies that are higher or lower than the cut-off frequencies.
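A hedged sketch of one such filter in MATLAB, using butter and filter from the Signal Processing Toolbox; the 6th order and the 3400 Hz cut-off (the upper edge of the speech band mentioned earlier) are illustrative choices.

```matlab
% Sketch of a low-pass filter applied to speech. The order and the
% 3400 Hz cut-off are illustrative values, not a tuned design.
[x, fs] = audioread('speech.wav');   % assumed input file
[b, a]  = butter(6, 3400/(fs/2));    % normalised cut-off; low-pass by default
y = filter(b, a, x);                 % frequencies above 3400 Hz attenuated
```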
Sampling
The human speech signal is naturally an analogue signal, but to perform any computational task on it, it must be converted to digital form. In signal processing, sampling means converting a continuous-time signal to a discrete-time signal by reading its value at regular intervals of time. This regular interval is called the sampling interval, denoted T_s, and is the reciprocal of the sampling frequency. The sampling frequency, generally known as the sampling rate, is defined as the number of discrete samples taken from a signal in one second and is denoted f_s = 1/T_s. The higher the sampling frequency, the better the digital signal, as more information is captured and less is lost. In speech processing, 44 kHz is usually considered a good sampling rate, which means that 44,000 samples are taken from one second of speech.
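The relationship between f_s and T_s can be seen in a few lines of MATLAB; the 440 Hz tone is an arbitrary stand-in for a continuous analogue signal.

```matlab
% Sketch of sampling at fs = 44 kHz, as described above.
fs = 44000;              % sampling rate (Hz)
Ts = 1/fs;               % sampling interval (s), the reciprocal of fs
t  = 0:Ts:1;             % one second of sampling instants
x  = sin(2*pi*440*t);    % 44,001 discrete samples of the "analogue" tone
```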
Quantization
In digital signal processing, quantization is the process of mapping a continuous range of values to a smaller set of discrete or integer values. The error induced by the loss of information during this mapping is called quantization error. Quantization is used in analogue-to-digital converters to turn discrete samples into digital values, using a quantization level specified in bits. As the loss of information during quantization is irreversible, it is good practice to set the quantization level to a higher number of bits. A good-quality compact disc is sampled at 44.1 kHz with a quantization level of 16 bits, which gives 65,536 possible values per sample.
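Continuing the sampling sketch above, 16-bit quantization can be imitated by snapping each sample in [-1, 1] to the nearest available level; this is a simplified model of what an analogue-to-digital converter does, not a production implementation.

```matlab
% Sketch of 16-bit quantization for samples in [-1, 1]. x is the
% sampled sinusoid from the sampling sketch above.
bits = 16;
step = 2^(1 - bits);            % spacing between adjacent levels
xq   = step * round(x / step);  % snap each sample to the nearest level
qErr = x - xq;                  % the irreversible quantization error
```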
Figure 2.5: Sampling of a Continuous Signal
2.2.4 Nyquist Shannon Sampling Theorem
The Nyquist Shannon sampling theorem, more generally known as the Nyquist sampling theorem, states that

"If a function x(t) contains no frequencies higher than B hertz, it is completely determined by giving its ordinates at a series of points spaced 1/(2B) seconds apart [Wik13b]."

This means that to reconstruct a continuous signal from its digital form, the sampling rate f_s should be greater than twice the bandwidth B of the signal:

f_s > 2B    (2.5)
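A small MATLAB sketch of what happens when equation 2.5 fails: a 5 kHz tone sampled at 8 kHz (below the required 10 kHz) shows up at 3 kHz instead.

```matlab
% Sketch of aliasing when the Nyquist condition is violated: a 5000 Hz
% tone sampled at fs = 8000 Hz (< 2*5000) appears as a 3000 Hz tone.
fs = 8000;
t  = 0:1/fs:0.1;                     % 0.1 s of sample instants
x  = sin(2*pi*5000*t);               % 5 kHz exceeds fs/2 = 4 kHz
X  = abs(fft(x, 8192));
[~, k] = max(X(1:4096));
fprintf('Apparent frequency: %.0f Hz\n', (k-1)*fs/8192);   % prints ~3000 Hz
```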
2.2.5 Window Functions
In signal processing, a window function is defined as a mathematical function whose value is zero outside a given interval. As a person talks, the sound they produce changes very quickly, so to study every change/segment of the speech it is divided into many frames with the help of a window function.
Figure 2.6: Hamming window effect
The best-known window functions are the rectangular and Hamming windows. The rectangular window is constant inside the given interval and always zero outside it, which means the changes around the edges are abrupt. To soften this abrupt effect, the Hamming window is used, which is given by equation 2.6:

h_N[n] = \begin{cases} 0.54 - 0.46\cos\left(\dfrac{2\pi n}{N-1}\right) & 0 \le n < N \\ 0 & \text{otherwise} \end{cases}    (2.6)

The effect of a Hamming window can be seen in figure 2.6. MATLAB's Signal Processing Toolbox provides a hamming function; note that it takes the window length rather than the signal itself, so hamming(N) returns an N-point window that is then multiplied element-wise with each frame.
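Putting the last two points together, a minimal framing-and-windowing sketch; the file name and the 1024-sample frame length are illustrative assumptions.

```matlab
% Minimal framing-and-windowing sketch, as described above.
[x, fs] = audioread('speech.wav');
N = 1024;
w = hamming(N);              % hamming takes the window LENGTH N
frame = x(1:N, 1) .* w;      % element-wise product: one windowed frame
```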
Details

- Pages:
- Type of Edition: First edition (Erstausgabe)
- Publication Year: 2014
- ISBN (eBook): 9783954897285
- ISBN (Softcover): 9783954892280
- File size: 3.2 MB
- Language: English
- Publication date: 2014 (March)
- Keywords: Natural Language Processing, Gender Identification, Machine Learning
- Product Safety: Anchor Academic Publishing