
Overview of Speech Based Gender Identification

©2014 Textbook 79 Pages

Summary

This book covers the basics of natural language processing and machine learning required to build a standard speech-based gender identification system. It briefly explains the signal processing techniques needed to understand the fundamentals of natural language processing, including the different types of Fourier transform, basic speech enhancement techniques, voice activity detection and pitch estimation using the subharmonic-to-harmonic ratio. In the machine learning part, the relevant models such as Support Vector Machines, Gaussian Mixture Models and adaptive boosting are explained. Lastly, the results of different gender identification systems implemented using state-of-the-art techniques are presented and analysed.

Excerpt

Table of Contents

Abstract
Acknowledgements
1 Introduction
2 Background
  2.1 Speech
    2.1.1 Speech Signal
  2.2 Speech Signal Processing
    2.2.1 Fourier Transform
    2.2.2 Discrete Cosine Transform
    2.2.3 Digital Filters
    2.2.4 Nyquist-Shannon Sampling Theorem
    2.2.5 Window Functions
3 Speech Enhancement
  3.1 Signal to Noise Ratio
  3.2 Spectral Subtraction
  3.3 Cepstral Mean Normalization
  3.4 RASTA Filtering
  3.5 Voice Activity Detector
    3.5.1 The Empirical Mode Decomposition Method
    3.5.2 The Hilbert Spectrum Analysis
    3.5.3 Voice Activity Detection
4 Gender Identification Systems
  4.1 Acoustic Features
    4.1.1 Mel Frequency Cepstral Coefficients (MFCC)
    4.1.2 Shifted Delta Cepstral (SDC)
    4.1.3 Pitch Extraction Method
  4.2 Pitch Based Models
  4.3 Models Based on Acoustic Features
  4.4 Fused Models
5 Learning Techniques for Gender Identification
  5.1 Overview
  5.2 Adaboost
  5.3 Gaussian Mixture Model (GMM)
    5.3.1 GMM Training
    5.3.2 GMM Testing
  5.4 Decision Making
  5.5 Likelihood Ratio
  5.6 Universal Background Model
    5.6.1 UBM Training
6 System Design and Implementation
  6.1 Toolboxes
    6.1.1 Signal Processing Toolbox
    6.1.2 Machine Learning Toolbox
  6.2 System Design
    6.2.1 Requirement
    6.2.2 Initial Approach
    6.2.3 Algorithm
    6.2.4 Feature Selection
  6.3 Experiments and Results
    6.3.1 Pitch Based Models
    6.3.2 Models Based on Acoustic Features
    6.3.3 Fused Model
    6.3.4 YouTube Videos
7 Conclusion
  7.1 Summary
  7.2 Future Recommendation
Bibliography
A Appendix

List of Figures

2.1 Mechanism of the human speech system, representing the underlying phenomenon of speech generation and speech understanding; the grey boxes represent computer systems for natural language processing [HAH01]
2.2 Conversion of an analogue signal to a digital signal; red lines show the digital value of the analogue signal
2.3 DFT applied to a speech signal
2.4 DCT applied to a speech signal
2.5 Sampling of a continuous signal
2.6 Hamming window effect
3.1 A noisy speech signal [Vat12]
3.2 A clean speech signal [Vat12]
3.3 VAD applied to noisy speech [SZ12]
4.1 A block diagram of a gender identification model
4.2 A block diagram of MFCC computation [Vat12]
4.3 Mel frequency scale [Vat12]
4.4 Graph of a Mel filterbank of 24 filters [Vat12]
4.5 Computational model of SDC [TcSKD02]
4.6 Plot of male and female pitch [Sun00]
4.7 Block diagram of a gender identification model trained using MFCC [Sun00]
4.8 A fused gender identification model [Sun00]
4.9 An Adaboost score fusion model [IKJGY10]
5.1 Optimal decision boundary between two classes
5.2 One-dimensional Gaussian Mixture Model
6.1 Block diagram of SDC feature extraction
6.2 Block diagram of the final fused model
6.3 Snapshot of the graphical user interface of the system

List of Tables

6.1 Results from the pitch based model trained with 1 male and 1 female speaker
6.2 Results from the pitch based model trained with 9 male and 1 female speakers
6.3 Results from the pitch based model trained with 1 male and 9 female speakers
6.4 Results from the pitch based model trained with 8 male and 8 female speakers
6.5 Results from the MFCC model trained using 8 GMM components
6.6 Results from the MFCC model trained using 16 GMM components
6.7 Results from the MFCC model trained using 32 GMM components
6.8 Results from the SDC model trained using 8 GMM components
6.9 Results from the SDC model trained using 16 GMM components
6.10 Results from the SDC model trained using 32 GMM components
6.11 Results from the fused model trained using 8 GMM components on SDC features
6.12 Results from the fused model trained using 16 GMM components on SDC features
6.13 Results from the fused model trained using 32 GMM components on SDC features
6.14 Results from the acoustic and fused models tested on a large amount of data
6.15 Accuracy of all the models that were tested on YouTube videos

Chapter 1
Introduction
As computers become more significant in our daily lives, the interaction between humans and machines is becoming more important day by day. The desire of humans to communicate with machines in a natural way has led to the evolution of natural language processing. As advancements in this field continue, it is likely that voice interaction systems will replace standard keyboards in the near future. Looking at the technology market today, we have some truly state-of-the-art technologies, like Microsoft® Kinect and Apple® Siri, which perform really well. But every speech system available today has its own drawbacks, and continuous work is being done to increase the performance of such systems. To increase the performance of speech systems, pre-processing steps such as gender and language identification are required.
This book focuses on automatic gender identification using speech. Identifying gender from the speech of a speaker means detecting whether the spoken speech comes from a male or a female speaker. Automatic Gender Identification (AGI) via speech has several applications in the field of natural language processing. [AH96] has shown that gender-dependent speech recognition models are more accurate than gender-independent models. Google's latest speech recognition system, found in Android devices and Google Glass, first determines the gender of the speaker before performing speech recognition for search. Its speech recognition accuracy is exceptionally high compared to their previous system, which was a unisex model. Recently a company launched a Kinect-based online fitting room that determines the gender of the person using their speech in order to offer suitable clothes. In the context of multimedia indexing, gender recognition can decrease the search space by up to half [HC].

Automatic gender recognition is itself a complex task with its own problems and limitations; to date, no gender recognition system exists which can work in a real-time environment with 100% accuracy. In a real-world environment, or in the case of multimedia indexing, many acoustic conditions exist, such as noisy speech, compressed speech, silence, telephone speech and different languages, all of which significantly reduce the performance of a general gender identification system. Ideally, then, a system is required which can give acceptable performance under all of these acoustic conditions.
In general, there are three main approaches to building an automatic gender identification system. The first approach uses pitch as the discriminating factor and labelled data to identify the gender of the speaker. The second approach uses acoustic features like MFCC together with unlabelled data: relevant features are extracted and then the model is trained. In this case a GMM is generally trained for each gender, and the score from one model is subtracted from that of the other to decide the gender. The third approach, quite commonly used after 2005, combines pitch models with acoustic models to form a fused model.
This book is organised as follows. The first chapter presents the challenge of gender identification. The second chapter presents the necessary signal processing background. The following chapters each describe a step of a gender identification system: speech enhancement techniques to reduce background noise, feature extraction, gender modelling methods and different decision making techniques. Finally, the last chapter presents the implementations done for this project and the results obtained from testing different models on a large set of speakers and on YouTube videos.

Chapter 2
Background
To understand the gender identification process using speech, we first need to understand the structure of speech. This chapter covers human speech and the basic difference between female and male voices.
2.1 Speech
Spoken language, or human speech, is the natural form of human communication and requires the use of the voice. In linguistic terms, human speech is a sound wave which is produced by the lungs and given its uniqueness by the tongue, lips and jaw [HAH01].
2.1.1 Speech Signal
Speech is produced when the air pressure generated by the lungs reaches the vocal cords. The speech then begins to resonate in the nasal cavities according to the position of the lips, tongue and other organs in the mouth. In terms of signal processing, the speech signal is an analogue signal formed as the convolution of a source e[n] and a filter h[n], shown in equation 2.1, where the lungs are modelled as the source e[n] and the resonance of speech in the mouth is modelled as the filter h[n]:

x[n] = e[n] * h[n] = \sum_{k=-\infty}^{\infty} e[k] h[n-k]    (2.1)

where x[n] is the speech signal.

Figure 2.1: Mechanism of the human speech system, representing the underlying phenomenon of speech generation and speech understanding; the grey boxes represent computer systems for natural language processing [HAH01]
Human Speech Frequency
The part of the audio range occupied by human speech is roughly 300 Hz to 3400 Hz [Tit00a]. The audible range, i.e. the frequency range within which humans can hear sound, is 20 Hz to 20,000 Hz. Beyond 20,000 Hz lies the ultrasonic region, which humans are unable to hear.

Figure 2.2: Conversion of an analogue signal to a digital signal; red lines show the digital value of the analogue signal
Fundamental Frequency (Pitch)
The fundamental frequency is generally defined as the lowest frequency of a periodic waveform. The fundamental frequency, usually known as pitch in natural language processing, is the biggest discriminating factor between male and female speech. A typical adult male has a fundamental frequency between 85 Hz and 180 Hz, while an adult female has a fundamental frequency in the range of 165 Hz to 225 Hz [BO00].
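The book's pitch extraction method (the subharmonic-to-harmonic ratio) is covered later; as a far simpler stand-in, the sketch below estimates pitch by plain autocorrelation and applies a crude threshold between the two ranges. The variable frame is assumed to hold one voiced speech frame at an assumed rate of 8000 Hz.

% Crude autocorrelation pitch estimate for one voiced frame.
% 'frame' is assumed to hold voiced speech samples at fs Hz.
fs = 8000;
fmin = 60; fmax = 300;                % search range covering both sexes
r = xcorr(frame, 'coeff');            % normalized autocorrelation
r = r(numel(frame):end);              % keep lags 0, 1, 2, ...
lags = round(fs/fmax):round(fs/fmin); % candidate pitch periods
[~, i] = max(r(lags + 1));            % lag with strongest periodicity
f0 = fs / lags(i);                    % estimated fundamental frequency
isMale = f0 < 160;                    % naive threshold between ranges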
2.2 Speech Signal Processing

From [HAH01] we know that speech is an analogue signal, but today's computers work with digital signals, so speech is stored in digital form. When speech is converted to digital form it loses some of the data, so an accurate representation of the analogue signal in digital form is required. A conversion of an analogue signal can be seen in figure 2.2.

2.2.1 Fourier Transform

According to Joseph Fourier, any signal can be represented as a linear combination of sinusoids, which means the Fourier transform can be described as transforming a function of time f(t) into a function of frequency F(\omega). This can be written as

F(\omega) = \int_{-\infty}^{\infty} f(t) e^{-2\pi i \omega t} dt    (2.2)
There exist different types of Fourier transform; the best known are

1. Continuous Time Fourier Transform
2. Continuous Fourier Transform
3. Discrete Fourier Transform
4. Discrete Time Fourier Transform

In an automatic gender recognition system only the discrete Fourier transform is required, so only that will be explained.
Discrete Fourier Transform

For a periodic signal x[t] of length T, the discrete Fourier transform can be defined as

X(\omega) = \sum_{t=0}^{T-1} x[t] e^{-2\pi i \omega t}    (2.3)

A discrete Fourier transform applied to a signal can be seen in figure 2.3.
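As a minimal sketch of the DFT applied to speech, the MATLAB lines below compute the magnitude spectrum of one frame with the built-in fft function (a fast implementation of the DFT); the file name speech.wav is a placeholder, not a file from the book.

% Magnitude spectrum of one speech frame using the FFT.
% 'speech.wav' is a placeholder name for any recorded utterance.
[x, fs] = audioread('speech.wav');  % samples and sampling rate
frame = x(1:1024, 1);               % one short frame, first channel
X = fft(frame);                     % discrete Fourier transform
f = (0:511) * fs / 1024;            % frequencies up to fs/2
plot(f, abs(X(1:512)));             % magnitude spectrum as in fig. 2.3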
Figure 2.3: DFT applied to a speech signal

2.2.2 Discrete Cosine Transform

The Discrete Cosine Transform, commonly known as the DCT, is similar to the DFT. The DCT is used to transform a finite sequence of data points into a sum of cosines oscillating at different frequencies. The DCT is usually used for compression of images and sound, where the small high-frequency components can be discarded; the transformed signal is mostly comprised of lower frequencies, so the majority of the information can be found in the first coefficients. More information about the usage of the DCT in speech processing can be found in [MAMG11]. Mathematically the DCT can be defined as

X_T[k] = \sum_{n=0}^{T-1} x_T[n] \cos\left[\frac{\pi k}{T}\left(n + \frac{1}{2}\right)\right], \quad k \in [0, T-1]    (2.4)

where X_T[k] is the kth coefficient. The DCT applied to a speech signal can be seen in figure 2.4.

Figure 2.4: DCT applied to a speech signal
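To see this energy compaction numerically, here is a short sketch using the Signal Processing Toolbox dct function; frame is assumed to hold one frame of speech samples, e.g. taken as in the DFT example above.

% Energy compaction of the DCT: most energy sits in the first
% few coefficients. 'frame' is assumed to hold speech samples.
c = dct(frame);                      % DCT of the frame
E = cumsum(c.^2) / sum(c.^2);        % cumulative energy fraction
k = find(E > 0.99, 1);               % coefficients holding 99% energy
fprintf('99%% of the energy lies in the first %d of %d coefficients\n', ...
        k, numel(c));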
2.2.3 Digital Filters

Digital filters are mathematical models that are applied to a signal to remove some components of the signal or to enhance some of its aspects. In natural language processing the widely used filters are low pass, band pass and high pass filters [APHA96].

Low Pass Filter

A low pass filter is used to discard the frequencies higher than the cut-off frequency in a speech signal.

High Pass Filter

A high pass filter is used to discard the frequencies lower than the cut-off frequency in a speech signal.

Band Pass Filter

The band pass filter allows a certain range of frequencies to pass and discards all the frequencies that are higher or lower than its cut-off frequencies.
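As a hedged MATLAB example (Signal Processing Toolbox), a low pass filter can be designed with butter and applied with filter; the order, sampling rate and cut-off below are illustrative choices, not recommendations from the book.

% Design and apply a 6th-order Butterworth low pass filter.
% Cut-off 3400 Hz keeps the speech band; x is the input signal.
fs = 16000;                      % assumed sampling rate in Hz
fc = 3400;                       % assumed cut-off frequency in Hz
[b, a] = butter(6, fc/(fs/2));   % normalized cut-off in (0, 1)
y = filter(b, a, x);             % low pass filtered speech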
Sampling

The human speech signal is naturally an analogue signal, but to perform any computational task on it, it must be converted to a digital form. In signal processing, sampling means converting a continuous time signal to a discrete time signal by measuring it at regular intervals of time. This regular interval is called the sampling interval, denoted T_s, and is the reciprocal of the sampling frequency. The sampling frequency, generally known as the sampling rate, is defined as the number of discrete samples taken from a signal in one second and is denoted by f_s = 1/T_s. The higher the sampling frequency, the better the digital signal, as more information is captured and less is lost. In speech processing, 44 kHz is usually considered a good sampling rate, which means that 44,000 samples are taken from one second of speech.
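A small MATLAB illustration of these definitions, sampling a pure tone every T_s = 1/f_s seconds; the tone frequency and rate are arbitrary choices for the sketch.

% Sample a pure tone at fs = 8000 Hz, i.e. every Ts = 1/fs seconds.
f = 440;                       % arbitrary tone frequency in Hz
fs = 8000; Ts = 1/fs;          % sampling rate and sampling interval
t = 0:Ts:0.01;                 % 10 ms of sampling instants
x = sin(2*pi*f*t);             % the discrete samples
stem(t, x);                    % plot the samples, as in figure 2.5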
Quantization

In digital signal processing, quantization is the process of mapping a continuous range of values to a smaller set of discrete or integer values. The error induced by the loss of information during this mapping is called the quantization error. Quantization is used in analogue-to-digital converters for converting discrete signals to digital signals, with a quantization level specified in bits. As the loss of information during quantization is irreversible, it is good practice to set the quantization level to a higher number of bits. A good quality compact disc is sampled at 44.1 kHz with a quantization level of 16 bits, which gives 65,536 possible values per sample.

Figure 2.5: Sampling of a continuous signal
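A minimal sketch of uniform quantization under the description above, assuming an input signal x with values in [-1, 1]:

% Uniformly quantize a signal with values in [-1, 1] to B bits.
% B = 16 gives 2^16 = 65,536 levels, as on a compact disc.
B = 16;
step = 2 / 2^B;                  % width of one quantization level
xq = step * round(x / step);     % quantized signal
err = x - xq;                    % irreversible quantization error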
2.2.4 Nyquist-Shannon Sampling Theorem

The Nyquist-Shannon sampling theorem, more generally known as the Nyquist sampling theorem, states:

"If a function x(t) contains no frequencies higher than B hertz, it is completely determined by giving its ordinates at a series of points spaced 1/(2B) seconds apart [Wik13b]."

This means that to reconstruct a continuous signal from its digital form, the sampling rate f_s should be greater than twice the bandwidth B of the signal:

f_s > 2B    (2.5)
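As a worked instance of equation 2.5: telephone speech is band-limited to roughly 3400 Hz, so any sampling rate above 6800 Hz satisfies the theorem, which is why 8000 Hz is the classic telephony rate. A one-line MATLAB check of these assumed numbers:

% Verify fs > 2B for telephone-band speech (assumed values).
B = 3400; fs = 8000;
assert(fs > 2*B, 'Sampling rate violates the Nyquist criterion');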
2.2.5 Window Functions

In signal processing, a window function is defined as a mathematical function whose value is zero outside a given interval. As a person talks, the sound produced changes very quickly, so to study every change/segment of speech, the signal is divided into many frames with the help of a window function.

The best-known window functions are the rectangular and Hamming windows. The rectangular window is constant inside the given interval and zero outside it, which means that the changes around the edges are abrupt. To decrease this abrupt effect, the Hamming window is used, which is given by equation 2.6:

h_N[n] = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right) for 0 \le n < N, and 0 otherwise    (2.6)

The Hamming window effect can be seen in figure 2.6.

Figure 2.6: Hamming window effect

In MATLAB's Signal Processing Toolbox a Hamming window function is available: hamming(N) returns an N-point window, which is then multiplied element-wise with each frame of the signal.

Details

Pages: 79
Type of Edition: First edition
Year: 2014
ISBN (eBook): 9783954897285
ISBN (Softcover): 9783954892280
File size: 3.2 MB
Language: English
Publication date: 2014 (March)
Keywords: Natural Language Processing, Gender Identification, Machine Learning