AUTOMATIC IDENTITY VERIFICATION USING FACES, FINGERPRINTS AND SPEECH

By

Yatin S. Kulkarni

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

Computer Science and Engineering

1998

ABSTRACT

AUTOMATIC IDENTITY VERIFICATION USING FACES, FINGERPRINTS AND SPEECH

By Yatin S. Kulkarni

In the age of Internet communication and commerce, ensuring the security of transactions is a major concern. To ensure the total security of the billions of transactions taking place every day, it is necessary to prevent the information being exchanged from being intercepted during transmission and to establish the identity of the sender and/or the receiver. While increasingly powerful and sophisticated data encryption algorithms are being made available to avoid the interception of information, age-old techniques of user-name and password are still the predominant means of establishing the identity of the sender and/or the receiver. Biometrics is a technique that is capable of establishing an individual's identity with a high degree of confidence on the basis of his/her physiological and/or behavioral characteristics. Of the various biometric techniques available [1], we are mainly interested in faces, fingerprints, and speech. Recognition by face is probably the most common method that humans use to establish a person's identity. Fingerprint verification has proved to be the most reliable way of verifying the identity of a person and is by far the most mature biometric technique. Speaker verification, wherein a person's voice patterns are used to authenticate his/her identity claim, has been shown to be a readily accepted and widely deployed biometric technique. In this thesis we have designed and implemented an automatic identity verification system that authenticates an individual's identity claim by integrating the above three biometric techniques through a robust decision fusion scheme. Our goal is to improve the performance of the integrated system, in terms of false accept and false reject rates, beyond those possible by using each of the techniques separately, while still meeting the response time requirements so as to make a real-time implementation possible.

To My Wife, Kaushika
ACKNOWLEDGMENTS

This thesis could not have been completed but for the time and patience of the numerous people who generously allowed me to collect face, fingerprint, and speech samples for use in my research. I sincerely thank each one of them for their cooperation. Special thanks are due to the members of the Pattern Recognition and Image Processing laboratory of the Department of Computer Science and Engineering for continually accommodating my requests for more and more data in their busy schedules and for giving me a number of useful insights into my research. Above all, I wish to express my gratitude to Professor Anil K. Jain, my thesis advisor, for his generous support, guidance and patience throughout my research. His attention to detail has inspired me time and again to critically analyze and improve my work. I have benefited greatly from his vast experience and valuable suggestions that he has shared with me over the course of this work. I will be forever indebted to him for providing me this wonderful opportunity to work in this field.

I would like to thank Profs. John Weng and Jack Deller for serving on my thesis committee. I am very grateful to Dr. Weng for his helpful suggestions and constructive criticism that contributed towards improving the quality of my research. Special thanks are due to Dr. Deller for sharing his knowledge about speech recognition using hidden Markov models, which contributed significantly to Chapter 4 of this thesis. I would also like to express my thanks to Prof. George Stockman for always keeping an open door for me to discuss research with him at any time throughout my work. I also want to thank all the faculty members from the Departments of Computer Science and Engineering, Electrical and Computer Engineering, and Statistics and Probability for the excellent instruction that I received during the course of my Masters degree.

I also wish to express my sincere gratitude to Dr. Lin Hong for his help and guidance on every aspect of my research and also for sharing his fingerprint data and source code. I would also like to thank Aditya Vailaya, Scott Connell, Nicolae Duta, Vera Bakic and my other PRIP colleagues for helping me throughout my work. Special thanks are also due to Stefaan Delcroix for helping me academically as well as for providing moral support and fun times outside school.

I am grateful to the Department of Computer Science and Engineering for its generous financial support, without which my research would not have been possible. I also wish to thank the system administrators in the Department for maintaining excellent computing facilities and satisfying my ever increasing disk space requirement. I also appreciate the enthusiastic helpfulness of the Departmental support staff, particularly Linda Moore, Cathy Davison, and Mary Gebbia, at all times.

Special thanks to my parents and my brother, Ketan, who stood by me in troubled times and made all this possible. My mother's undying love and faith enabled me to overcome a number of obstacles and achieve my goals. Special thanks are also due to my father-in-law Dr. S. V. Shanbhag and my mother-in-law Mrs. S. S. Shanbhag for their moral support over the past six years. Finally, I am most thankful to my wife, Kaushika, for her love and support over the last six years.
Her constant encouragement helped me tremendously during my studies. She also was a great help in the data collection process. I dedicate this thesis to her with all my love and sincere gratitude.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
1 Introduction
1.1 An Ideal Personal Identity Verification System
1.2 Biometrics
1.3 Requirements of a Biometric Characteristic
1.4 Biometric Technologies Prevalent Today
1.5 A Biometric based Personal Identity Authentication System
1.5.1 Operational Mode
1.5.2 Accuracy
1.6 Combination of multiple biometrics
1.7 Problem Statement
1.8 Intended use and Constraints
1.9 Contributions
1.10 Overview of the thesis
1.11 Summary
2 Face Verification
2.1 Face Detection
2.2 Face Recognition
2.2.1 Training Stage
2.2.2 Operational Stage
2.3 Experimental Results in Literature
2.4 Summary
3 Fingerprint Verification
3.1 Fingerprint Image Acquisition
3.2 Feature Extraction
3.2.1 Minutiae Extraction
3.3 Fingerprint Matching
3.3.1 Minutiae Matching
3.4 Experimental Results in Literature
3.5 Summary
4 Speaker Verification
4.1 Feature Extraction
4.1.1 Speech Acquisition and Preprocessing
4.1.2 Speech Segmentation
4.1.3 Cepstral Analysis
4.2 Speaker Modeling
4.3 Pattern Matching
4.4 Experimental Results in Literature
4.5 Summary
5 Decision Fusion
5.1 Formulation
5.2 The Neyman-Pearson Rule
5.3 Linear Discriminant Functions
5.4 Summary
6 Integrated Biometric System
6.1 Data Acquisition Module
6.2 Enrollment Module
6.3 Template Database
6.4 Verification Module
6.5 Commercially Available Biometric Systems
6.6 Summary
7 Experimental Results
7.1 Database
7.2 System Training
7.3 Performance Evaluation
7.3.1 Performance of the Neyman-Pearson decision fusion schema
7.3.2 Performance of the linear discriminant function decision fusion schema
7.3.3 Verification Speed
7.4 Summary
8 Summary and Future Research
8.1 Summary
8.2 Future Research

LIST OF TABLES

1.1 Comparison of various biometric technologies [2].
7.1 Various factors affecting the performance of different biometric systems.
7.2 Wall-clock times for the various verification systems.

LIST OF FIGURES

1.1 Examples of nine different biometric technologies.
1.2 A generic biometric identity verification system.
2.1 The architecture of a face-based verification system.
2.2 A typical facial image obtained using our imaging setup.
2.3 Our face image segmentation algorithm.
2.4 The first nine eigenfaces.
3.1 Architecture of the fingerprint verification system.
3.2 Different kinds of fingerprints: (a) inked, (b) latent, and (c) live-scan.
3.3 The live-scan fingerprint scanner from Digital Biometrics.
3.4 Examples of minutiae and their characterization.
3.5 Flowchart of the minutiae extraction algorithm after Jain et al. [2].
3.6 Fingerprint matching problem: (a) and (b) are two different impressions from the same finger.
3.7 Flowchart of the minutiae matching algorithm.
4.1 A block diagram of the HMM-based Speaker Verification Module.
4.2 A schematic diagram of the human speech production mechanism [3].
4.3 A discrete-time model representing the human speech production system [3].
4.4 Vocal-tract models for different sounds [3].
4.5 Vocal-tract magnitude spectrum for letter 'a' for two different speakers.
4.6 Plots of the log energy for speech acquired under different environments.
4.7 The flowchart of the algorithm for word boundary detection.
4.8 Output of the speech segmentation algorithm.
4.9 A fully connected three-state hidden Markov model.
4.10 A left-to-right five-state hidden Markov model.
5.1 Integration of different biometric characteristics (after Jain et al. [4]).
6.1 The block diagram of the integrated system for personal verification.
6.2 GUIs for data acquisition during enrollment.
6.3 User Database: (a) The users enrolled in the system are shown in the list, (b) For each user, three types of data are defined.
6.4 GUIs for the enrollment module.
6.5 GUIs for the verification module.
6.6 The result of a successful verification.
7.1 An ideal receiver operating characteristics curve.
7.2 A typical example of the nine face images acquired for each user.
7.3 A typical example of the ten fingerprint images of good quality acquired for each user.
7.4 The waveform of a typical speech sample in our database.
7.5 The genuine and impostor distributions for face matching.
7.6 The genuine and impostor distributions for fingerprint matching.
7.7 The genuine and impostor distributions for the speaker verification system.
7.8 The receiver operating characteristics for the Neyman-Pearson rule.
7.9 The receiver operating characteristics for the Linear Discriminant function.
8.1 Example of the temporal variation present between training and test face images.

Chapter 1

Introduction

Today, the Internet is changing the very nature of business. More and more routine business transactions, such as exchange of important documents, transfer of funds, etc., are being conducted electronically. While the companies, institutions, and people involved in these transactions stand to benefit from the speed and efficiency that the Internet has to offer, it has also made it possible for hi-tech computer criminals to gain access to sensitive information and monetary resources. Hence, with the growing volume of Internet commerce, there is a growing need for ensuring the security of the billions of electronic transactions taking place. To ensure foolproof security, it is necessary to establish the identity of the receiver and/or the sender and to ensure that the information is not intercepted during transit. Currently, it is possible to ensure that the information is not intercepted during transit by means of powerful and sophisticated data encryption programs. However, the age-old method of requesting a user-name and a password is still the only means of establishing the identity of the sender and/or the receiver.

Since the user-name and password are something that one knows, this method of establishing a person's identity is called a knowledge-based method. The other traditional method, which is very popular in point-of-access control applications, relies on the user supplying a token such as a passport, a driver's license, etc., as proof of identity and is correspondingly called a token-based method. Automatic personal identity authentication systems based on these two methods have been deployed on a large scale during the past few decades and have gained immense popularity, the most popular system being the combination of a magnetic stripe card and a personal identification number (PIN) that is still being used at millions of ATM machines worldwide [1]. The popularity of these methods can be attributed to their simplicity, ease of use, the ability to give absolute yes/no answers and the ease of integrating them into existing systems. However, their simplicity also makes them easily susceptible to fraud. Passwords and PINs can be stolen, misplaced or forgotten, and identity cards may be easily forged. Furthermore, since the proof of identity is something that one knows or one has, it is not possible to distinguish between an authorized user and an impostor who has fraudulently gained access to the relevant knowledge and/or token.

1.1 An Ideal Personal Identity Verification System

The pertinent question today, then, is "Can we have an automatic personal identity verification system that is easy to use, 100% accurate and impossible to circumvent?" The answer to this question lies in understanding how the three requirements can be satisfied. By "easy to use" we mean that the user is not required to memorize a cumbersome password or a personal identification number and neither is he required to carry around a magnetic stripe card or any other proof of identity.
By "100% accurate" we imply that the system never allows an impostor to gain access and never rejects a genuine individual. And finally, by "impossible to circumvent" we mean that an impostor can by no means fool the system into recognizing him/her as a genuine individual. Unfortunately, such a system is still, at best, an elusive goal.

1.2 Biometrics

As humans, we identify a person by his/her face and/or voice. When a more reliable proof of identity is required, use of fingerprints is preferred. Thus, we rely on the variations in various physiological and behavioral characteristics amongst different people for identification. Biometrics is a collection of technologies that attempts to measure these variations so as to make automatic personal identification, using physiological and behavioral characteristics, possible. It differs from knowledge-based and token-based personal identification systems in that it makes use of something that one is or one does to identify an individual. Since it is impossible to steal a physiological characteristic from an individual and difficult to imitate an individual's behavioral characteristic, biometrics is inherently better suited for automatic personal identification and/or identity authentication.

1.3 Requirements of a Biometric Characteristic

Based on the definition of biometrics given above, it is easy to see that not all human physiological and behavioral characteristics can be used as biometrics. The following requirements must be satisfied by a physiological or behavioral characteristic to be useful as a biometric [2]: (i) universality, which means that every individual must possess that characteristic, (ii) uniqueness, which implies that no two individuals should be identical with respect to that characteristic, (iii) permanence, which states that the characteristic must not change beyond reasonable limits over a period of time and neither must it be easily changeable by artificial means, and (iv) collectability, which requires that the characteristic must be suitable for being measured quantitatively. Additional constraints are imposed on the characteristics due to the need for a biometric system to be of practical use. These constraints are: (i) performance, which refers to the resources required for constructing the biometric system, the time required for the system to measure the biometric, process it and make a decision, the identification accuracy that can be achieved using that biometric, and the robustness of the system to variations in operating environment, (ii) acceptability, which indicates the ease of use and the willingness of the intended users towards using it on a regular basis, and (iii) circumvention, which indicates the possibility of the system being fooled by an impostor attempting to gain access by fraudulent means.

1.4 Biometric Technologies Prevalent Today

To date, nine biometrics, namely face, fingerprints, speech, iris, retinal blood vessel pattern, hand geometry, hand vein pattern, facial thermograms, and signature, have proved to be quite successful, and a number of practical systems have also been built using these biometrics. Face, fingerprints, iris, retinal patterns, hand geometry, hand vein patterns, and facial thermograms are physiological characteristics whereas signature is a behavioral characteristic. Speech is a physiological as well as behavioral characteristic.
Typically, biometrics that are of a physiological nature are preferred over those of a behavioral nature, since they exhibit smaller intra-class variations. Furthermore, systems based on physiological biometrics are more difficult to circumvent than those based on behavioral biometrics. Figure 1.1 depicts examples of each of the above nine biometrics.

Figure 1.1: Examples of nine different biometric technologies.

Each of the nine biometrics described above satisfies the seven requirements stated in Section 1.3 to varying degrees. Table 1.1 shows a brief comparison of the nine biometric techniques along the seven requirements [2]. Depending upon the intended application, the usefulness of a specific biometric technique varies. No single biometric technique can outperform all the other techniques in all application domains. For example, although fingerprint- and iris-based biometric techniques perform much better than speech-based systems in terms of accuracy, in telephony-based applications speaker verification is the most economical choice.

Besides the techniques mentioned above, a number of other biometric techniques have been investigated or are currently under study, which include ear shape, gait, keystroke dynamics, body odor, acoustic emission of writing, lip shape, DNA, etc. Although each of these techniques has its own advantages, so far, none of them can achieve an accuracy that is comparable to the nine different techniques mentioned above or can be implemented fully automatically. In fact, they do not have a strong potential to become valid biometric techniques to be used widely in the near future.

Table 1.1: Comparison of various biometric technologies [2].

1.5 A Biometric based Personal Identity Authentication System

Having studied the various biometric technologies, let us now address the issue of designing an automatic personal identity authentication system. A generic automatic personal identity authentication system based on biometrics is depicted in Figure 1.2. Clearly, before an individual can use the system, he/she must be enrolled in the system. This is similar, for example, to going to a bank and opening a bank account so that you may receive an ATM card, which would then allow you to withdraw money anytime at an ATM machine. Consequently, any biometric system must consist of at least two modules: (i) an enrollment module and (ii) a verification module. From an engineering perspective, the enrollment module must be able to (i) obtain a raw digital representation of the desired biometric, (ii) process the raw digital representation by means of a feature extractor so as to generate a compact but expressive representation, referred to as a template, and (iii) store the template in a central database or on a magnetic stripe card or smart card to be issued to the user. The verification module is to be used at the point-of-access, and it should be able to capture the same biometric characteristic as the enrollment module, convert it into the same format as the template, and finally compare the two so as to verify the identity claimed by the individual.
Apart from these two essential modules, a practical automatic personal identity verification system based on biometrics may also incorporate auxiliary modules, such as a module for database management (in case a central database is being used), modules for doing off-line training, etc. Such modules may not be directly related to the biometric technology being used, but are required nevertheless to make a biometric system practical to use on a daily basis.

Figure 1.2: A generic biometric identity verification system.

1.5.1 Operational Mode

There are two possible ways in which the identity of an individual can be established: (i) verification, wherein the individual submits a claim of identity and the system verifies the validity of the claim, and (ii) identification, wherein the individual does not make any prior claim of identity and the system assigns an identity to the individual. In the verification mode, the system makes a one-to-one comparison between the input representation and the template corresponding to the claimed identity and makes a decision as to accept or reject the claim. In the identification mode, the system makes one-to-many comparisons between the input representation and the entire template database. A measure of similarity or dissimilarity is computed for each comparison made, and then the individual is assigned the identity of the template that resulted in the highest (lowest) value of the similarity (dissimilarity) measure.
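The two modes can be made concrete with a small sketch. In the following Python fragment (illustrative only: the similarity measure, template store, and threshold are hypothetical stand-ins, not components of our system), verification performs a single comparison against the claimed identity's template, while identification scans the entire template database and returns the best-scoring identity.

```python
import numpy as np

def similarity(representation, template):
    # Hypothetical similarity measure: higher means more alike. Any of the
    # matchers described in Chapters 2-4 could stand in here.
    return -np.linalg.norm(representation - template)

def verify(representation, claimed_id, templates, threshold):
    """Verification: one-to-one comparison, accept or reject the claim."""
    return similarity(representation, templates[claimed_id]) >= threshold

def identify(representation, templates):
    """Identification: one-to-many comparison, assign the best-matching identity."""
    return max(templates, key=lambda uid: similarity(representation, templates[uid]))

# Toy usage with random 5-dimensional "templates".
rng = np.random.default_rng(0)
templates = {uid: rng.normal(size=5) for uid in ("alice", "bob", "carol")}
probe = templates["bob"] + 0.05 * rng.normal(size=5)    # a noisy sample of "bob"
print(verify(probe, "bob", templates, threshold=-0.5))  # True (claim accepted)
print(identify(probe, templates))                       # "bob"
```

In both modes the underlying comparison is the same; only the number of comparisons and the form of the output (an accept/reject decision versus an assigned identity) differ.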
1.5.2 Accuracy

Given a biometric system that is designed to operate either in the verification mode or the identification mode, we can divide the entire population of probable users into two sets: (i) genuine individuals, users enrolled in the system and authorized to access the resources being protected by the system, and (ii) impostors, users not enrolled in the system and not authorized to access the resources being protected by the system, but wanting to do so by fraudulent means. The biometric system either accepts or identifies a user as a genuine individual or rejects a user as an impostor. Consequently, the following four situations may occur: (i) a genuine individual is accepted, (ii) a genuine individual is rejected, (iii) an impostor is accepted, and (iv) an impostor is rejected. Clearly, situations (i) and (iv) are correct system responses whereas situations (ii) and (iii) are erroneous system responses. It is interesting to note that situations (ii) and (iii) never occur in a knowledge-based or token-based identification system: a PIN or a password is either correct or incorrect. However, all of the biometric characteristics studied above exhibit a significant amount of intra-class variation. Due to these variations, a biometric system cannot make an absolute yes or no decision, but rather it associates a confidence level with the decision, which may be represented as a probability value. Therefore, the performance of a biometric system is measured in terms of two error rates: (i) the false acceptance rate (FAR), the probability that an impostor is accepted as a genuine individual, and (ii) the false reject rate (FRR), the probability that a genuine individual is rejected as an impostor. These two error rates can be evaluated empirically for a given biometric system by generating a genuine distribution and an impostor distribution of the similarity (dissimilarity) measure used. It is easy to see that FAR and FRR are duals of each other. Ideally, we would like both of them to be zero so that an impostor is never accepted and a genuine individual is never rejected. However, in practical systems, a low FAR results in a high FRR and vice versa. Depending upon the intended domain of application, it may be desirable either to have a low FAR and a tolerable FRR, or to have a low FRR and an admissible FAR. In some cases the performance of a biometric system is also measured in terms of the authentic acceptance rate, which is given as (1 - FAR), and the equal error rate, which is defined as the error rate for which FAR = FRR.
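Once the genuine and impostor distributions have been generated, FAR and FRR can be read off directly as functions of the decision threshold. The sketch below is illustrative only: the scores are synthetic Gaussians standing in for matcher outputs, and the equal error rate is approximated as the point where the two error curves cross.

```python
import numpy as np

def far_frr(genuine, impostor, thresholds):
    """For a similarity measure: FAR is the fraction of impostor scores at or
    above the threshold (false accepts); FRR is the fraction of genuine scores
    below it (false rejects)."""
    genuine, impostor = np.asarray(genuine), np.asarray(impostor)
    far = np.array([(impostor >= t).mean() for t in thresholds])
    frr = np.array([(genuine < t).mean() for t in thresholds])
    return far, frr

# Synthetic score distributions: genuine pairs score higher on average.
rng = np.random.default_rng(1)
genuine = rng.normal(loc=0.8, scale=0.1, size=1000)
impostor = rng.normal(loc=0.4, scale=0.1, size=1000)

thresholds = np.linspace(0.0, 1.0, 501)
far, frr = far_frr(genuine, impostor, thresholds)

# Equal error rate: the threshold at which FAR and FRR coincide.
i = int(np.argmin(np.abs(far - frr)))
print(f"EER ~ {(far[i] + frr[i]) / 2:.3f} at threshold {thresholds[i]:.2f}")
```

Sweeping the threshold in this way also traces out the receiver operating characteristics curves used to report results in Chapter 7.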
1.6 Combination of multiple biometrics

The individual use of different biometric characteristics results in systems that vary widely in terms of FAR and FRR. At one extreme we have biometric characteristics such as face and dynamic signature that result in a system with low FRR but a high FAR, while at the other extreme, systems based on fingerprints, retinal scans and iris have a very low FAR but an unacceptably high FRR. The primary goal of combining multiple biometrics is, then, to improve the FAR and the FRR of the integrated system while meeting the response time requirement of a real-time system. A number of researchers have proposed automatic personal identification and/or verification systems that combine two or more different biometrics.

Brunelli and Falavigna [5] have proposed a system that integrates face recognition and speaker recognition. They have decomposed the face recognition and speaker recognition subsystems into three and two single-feature classifiers, respectively. Various methods of combining the scores resulting from these five classifiers are then studied. They report an authentic acceptance rate of 98% for the integrated system as opposed to the 88% and 91% authentic acceptance rates provided by the speaker and face recognition systems, respectively.

Dieckmann et al. [6] have proposed the personal authentication system, SESAM, that integrates three different biometric cues from two different data sources: one static cue derived from an image of the face and two dynamic cues, the spectrum of the sound and the lip motion of a person saying his/her name in front of the system. The results of the individual classifiers are combined using a 2-from-3 approach. They report a FRR of 6.6% and a FAR of 0.4% for the integrated identification system as compared to a FRR of 19.7%, 12.5% and 18.2% and a FAR of 2.1%, 1.3% and 7.2% for the speech, lip movement and face image systems, respectively.

Duc et al. [7] also propose a system that integrates speech-based authentication and face-based authentication. However, they model the joint distributions of the speech and face matching scores for genuine individuals and impostors as Gaussian distributions and then use a maximum a posteriori (MAP) classifier to make the accept/reject decision. They report a FAR of 0.54% and a FRR of 0.0% for the integrated system as opposed to a FAR of 3.6% and 6.7% and a FRR of 7.4% and 0.0% for the face and the speech systems, respectively.

Hong and Jain [8] have proposed an automatic personal identification system that integrates faces and fingerprints. First, the impostor distributions for faces and fingerprints are estimated from empirical data. The system then attempts to minimize FRR while holding FAR constant at a desired value. This is achieved by varying the decision threshold for the fingerprint system in accordance with the face matching score and the desired FAR. For a FAR of 0.001%, they report a FRR of 9.8% for the integrated system as opposed to a FRR of 64.1% for the face system and a FRR of 14.9% for the fingerprint system.

The gain in performance reported in all the above cases is impressive enough to warrant further research into this topic. We have chosen to integrate face, fingerprint, and speech for the following reasons:

1. Popularity: Face, fingerprint, and speech are the most popular and widely accepted biometrics today. These biometrics are accepted in a court of law as valid proofs of identity.

2. Cost: Speech is the most economical biometric and can be most easily integrated into several existing systems. The cost of the additional resources required for acquiring face and fingerprint is rapidly decreasing. Thus, amongst the nine biometrics discussed in Section 1.4, face, fingerprint, and speech offer the most cost-efficient solution.

1.7 Problem Statement

Our objective in this thesis is to design and implement an automatic identity verification system that integrates face, fingerprint, and speaker verification modules through a decision fusion scheme, with the objective of achieving a lower FAR and FRR than what is possible using these three individual biometrics.
Then in Chapter 5 we address the decision fusion schemes that integrate face, fingerprint, and speech so as to improve the verification accuracy. The design of the integrated system and some implementation details are discussed at length in Chapter 6, followed by the experimental results in Chapter 7. Finally, we conclude in Chapter 8 by addressing the limitations of our system and outline the future research topics in this field. 1.11 Summary Knowledge—based and token-based identity authentication systems are no longer se- cure in our modern electronic society. An identity authentication system based on biometrics overcomes the limitations of a knowledge-based or token-based system and holds the key to the security of various systems in the future. Furthermore, systems based on multiple biometrics are able to improve the accuracy of the system beyond that of a system based on a single biometric. Such multimodal biometric systems ate also limit: fully aut Ullli-t namely. f‘dl‘t" fusititn Si‘llt‘ll based on 11‘. 16 are also more robust to fraud. The goal of this thesis is to design and implement a fully automatic personal identity authentication system that uses three biometrics, namely, face, fingerprint, and speech and demonstrate that our proposed decision fusion scheme improves the performance of the system beyond that of the systems based on the individual biometrics. Chat Face face. as a it allows for fa system can i identitmtiot based on I'd" diagram of a The have video images database of f The fitllowirt; the given lit. and [2)] lift” 1'- id{' ' If against, Chapter 2 Face Verification Face, as a biometric, has been primarily used in identification systems since its use allows for fast indexing into a large database. Its use as a biometric in a verification system can be considered as a logical extension of the principles underlying its use in identification system. Towards this end, we shall first discuss an identification system based on faces and then extend our discussion to face-based verification. The block diagram of a face-based verification system is depicted in Figure 2.1. The face recognition problem can be formally stated as follows: “Given still or video images of a scene, identify one or more persons in the scene using a stored database of faces” [12]. In order to solve the above problem, it is necessary to solve the following two sub-problems: (i) detect the presence of one or more human faces in the given image or sequence of images, and segment the faces from the background, and (ii) identify the individuals, by matching a general unconstrained view of their face against a database of faces. The lack of a face recognition system, that can operate in an unconstrained environment, can be attributed to these two challenging 17 813‘ 18 sub-problems. The problem of face detection and segmentation has received a consid- erable amount of attention [13, 14, 15, 16, 17, 18, 19, 20]. Although the performance of these systems in terms of detection rate and false alarm rate is acceptable, the de- tection time is far too high to be used in a real-time biometric system. Furthermore, face recognition from a general view remains, to this date, an open research prob- lem. Hence, in the context of a real-time automatic personal identification system, face recognition is generally performed using static, controlled and well illuminated frontal facial images captured against a plain background. 
Such constraints, although unrealistic in a practical sense, allow for an efficient and fairly accurate solution to the face recognition problem. Let us now discuss the details of a face recognition system.

Figure 2.1: The architecture of a face-based verification system.

2.1 Face Detection

The choice of a suitable face detection algorithm is dependent upon the intended operational environment. In our system, the facial images are obtained in a controlled laboratory environment under sufficient illumination and against a plain background. Figure 2.2 shows a typical example of a facial image obtained using our imaging setup. Thus, we are able to employ a very simple edge-detection based algorithm in order to locate the face and segment it from the background. The algorithm consists of the following steps (a code sketch follows the list):

Figure 2.2: A typical facial image obtained using our imaging setup.

1. Re-sampling: The original image is 640x480 pixels in size. The size of the face portion in this image is approximately 320x360 pixels. The computational demand on the face recognition system for images of this size would render it incapable of operating in real time. Consequently, the original image is down-sampled to a size of 160x120 pixels. This results in a face sub-image of approximately 80x90 pixels.

2. Smoothing: The reduced image is smoothed using a Gaussian filter so as to retain the strong edges and eliminate the weak edges in the image.

3. Edge Detection: Two edge operators, one to detect the vertical edges and one to detect the horizontal edges, are then applied to the smoothed image and the resulting images are thresholded to obtain two binary edge images. The idea is to detect the left and the right extents of the face by using the vertical edge image and the top of the head using the horizontal edge image. The chin cannot be detected in this manner since it does not result in a significant edge in the horizontal edge image.

4. Face detection and segmentation: Two edge distributions (number of edge points per bin) are now computed: one along rows on the horizontal edge image and the other along columns on the vertical edge image. The bin size is determined empirically for the given imaging setup (for our imaging setup the bin size is 2 along rows as well as columns). Peaks in the two distributions are indicative of the presence of strong edges in the region spanned by the bins in which the peaks occur. To locate the left and the right bounds of the face, the edge distribution for the vertical edge image is examined sequentially from left to right. Similarly, the top of the head is located by examining the edge distribution for the horizontal edge image. The location of the center of the face is then hypothesized by computing the centroid of all the edge pixels that lie within the left and the right bounds; an 80x90 window placed at the estimated center is then used to crop the face image. Finally, the cropped image is normalized to have a certain mean and standard deviation.
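The following sketch outlines steps 1-4 for a 480x640 gray-level input, assuming a NumPy/SciPy environment. The smoothing sigma, edge threshold, and the 50%-of-peak rule for finding strong bins are illustrative placeholders, not the empirically tuned values of our setup.

```python
import numpy as np
from scipy import ndimage

def locate_face(image, bin_size=2, edge_thresh=20.0):
    """Edge-projection face localization, sketching steps 1-4 above.
    `image` is a 2-D gray-level array, e.g. 480x640."""
    # Step 1: down-sample to quarter size (640x480 -> 160x120).
    small = image[::4, ::4].astype(float)
    # Step 2: Gaussian smoothing to suppress weak edges.
    smooth = ndimage.gaussian_filter(small, sigma=1.0)
    # Step 3: thresholded vertical and horizontal edge maps (Sobel operators).
    vert = np.abs(ndimage.sobel(smooth, axis=1)) > edge_thresh
    horiz = np.abs(ndimage.sobel(smooth, axis=0)) > edge_thresh

    def binned(counts):
        # Edge distribution: number of edge points per bin of width `bin_size`.
        counts = counts[: len(counts) // bin_size * bin_size]
        return counts.reshape(-1, bin_size).sum(axis=1)

    col_hist = binned(vert.sum(axis=0))    # columns of the vertical edge map
    row_hist = binned(horiz.sum(axis=1))   # rows of the horizontal edge map
    # Step 4: left/right bounds from the first and last strong column bins,
    # head top from the first strong row bin (50%-of-peak rule is a placeholder).
    cols = np.flatnonzero(col_hist > 0.5 * col_hist.max())
    rows = np.flatnonzero(row_hist > 0.5 * row_hist.max())
    left, right = cols[0] * bin_size, (cols[-1] + 1) * bin_size
    head_top = rows[0] * bin_size
    # Face center: centroid of edge pixels between the left and right bounds.
    ys, xs = np.nonzero((vert | horiz)[head_top:, left:right])
    cy, cx = int(ys.mean()) + head_top, int(xs.mean()) + left
    # Crop an 80x90 window at the center, then normalize to zero mean, unit std.
    r0, c0 = max(cy - 45, 0), max(cx - 40, 0)
    face = small[r0:r0 + 90, c0:c0 + 80]
    return (face - face.mean()) / (face.std() + 1e-8)
```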
Although this simple method is highly susceptible to artifacts in the image (such as vertical stripes on the clothing of the individual), it works rather well for our imaging setup and for cooperative users. In addition, we shall see in Chapter 6, that in the presence of feedback during the data acquisition process, the above method can be used very effectively for accurately extracting the face. Resamplin g (160X120) Gaussian Smoothing Veritcal Edge Image Figure 2.3: Our face image segmentation algorithm. 2.2 Face Recognition Face is a natural choice of a physiological characteristic for use as a biometric and in fact, it is the most routinely used characteristic by humans to identify individuals. Consequently, a considerable amount of research has been devoted towards under- standing the recognition process used by humans and machine vision systems that DH If due I nique space will fr): sional : each or. between Space an of the (in such as p1 glilar vain SO as [O I'p' infer-Clam A rem; ill] elast; ”ISUlts. tl. sifier. “Tit 22 require face image [21, 12, 22, 23, 24]. Numerous studies in psychophysics and neuro- physiological literature have suggested that humans make use of facial features such as hair, eyes, nose, mouth, etc. to identify faces [12]. Based on these studies, early research in automatic face recognition systems focused on developing algorithms that attempted to measure the various facial features and represent a facial image based on these measurements [24]. However, these efforts met with limited success primarily due to the enormous computational complexity of the tasks involved. Popular tech- niques in face recognition now treat a facial image as a point in a high dimensional space [21, 23]. It is argued that the facial images belonging to a specific individual will form a set of points that are clustered in a compact region in this high dimen- sional space and clusters belonging to different individuals are well separated from each other. The recognition task involves: (i) computing a suitable distance metric between a given test image and a set of reference images in the high dimensional space and (ii) assigning the identity of the reference image with the smallest value of the chosen distance metric to the test image. Dimensionality reduction techniques such as principal component analysis (PCA), linear discriminant analysis (LDA), sin- gular value decomposition (SVD), etc. reduce the dimensionality of the feature space so as to reduce the computational complexity of the problem while maintaining the inter-class separation [21, 23]. A recent study [22] of three well-known face recognition techniques, (i) eigenface, (ii) elastic matching, and (iii) neural nets, has shown by analysis and experimental results, that “the eigenface algorithm, which is essentially a minimum distance clas- sifier, works well when lighting variation is small”. Our decision to use the eigenface 'dI—‘PI p673! grim: ICII ("d nique: ai't'lh'v‘i' Fat: Steigtf‘S :2 eigenvet IPSpUlltli Pigenspat the temp onto the . the test in eigent-alu. apllfoatrh, _. 
approach for face recognition is based on the following reasons: (i) in the context of personal identification, control can be exerted over the lighting conditions, the background and the pose of the subject, (ii) a very compact template can be generated for each user by using the eigenface approach, (iii) efficient database retrieval techniques can be employed to retrieve the templates quickly [23], and (iv) the eigenface approach has been shown to be more accurate than the attribute-based approach [24].

Face recognition using the eigenface-based approach consists of the following two stages [21]: (i) training stage: given a set of input images, compute the eigenvalues and eigenvectors of the covariance matrix of these images; retain the M eigenvectors corresponding to the M highest eigenvalues; project each face onto this M-dimensional eigenspace so as to obtain an M-element feature vector for each face, which then forms the template for that face, and (ii) operational stage: given a test image, project it onto the M-dimensional eigenspace so as to obtain an M-element feature vector for the test image; compute the Euclidean distance, in the M-dimensional eigenspace, between this feature vector and each of the templates in the database; use the 1-nearest neighbor decision rule to establish the identity of the test image. The M eigenvectors (represented as 2D images) corresponding to the M highest eigenvalues resemble face images and hence the approach is termed the eigenface approach. Examples of the first nine eigenfaces for a particular set of input images are depicted in Figure 2.4. We shall now discuss the mathematical details of the eigenface approach.

Figure 2.4: The first nine eigenfaces.

2.2.1 Training Stage

Given a face image, $I(x, y)$, as a $W \times H$ array of pixel intensities, it can be represented as a $(W \times H)$-dimensional feature vector by concatenating the rows of $I(x, y)$. For our image size of 80 x 90 pixels, each face image maps to a point in the 7200-dimensional feature space. Due to the inherent similarity between different face images, the distribution of face images in this high dimensional space will not be random. Consequently, principal component analysis can be used to project this high dimensional feature space to a lower dimensional subspace, also known as the eigenspace.

Let us denote the set of training face images as $\Gamma_1, \Gamma_2, \cdots, \Gamma_T$, where $T$ denotes the number of images in the training set. The average face of the above set of images is defined as

$$\Psi = \frac{1}{T} \sum_{n=1}^{T} \Gamma_n.$$

The covariance matrix is then calculated as

$$C = \frac{1}{T} \sum_{n=1}^{T} \Psi_n \Psi_n',$$

where $\Psi_n = \Gamma_n - \Psi$. The dimension of this covariance matrix is 7200 by 7200 and calculating the eigenvalues and eigenvectors of a matrix of this size is "an intractable task" [21]. Furthermore, if the number of training images $T < 7200$, there are only $T - 1$ rather than 7200 meaningful eigenvectors ("the remaining eigenvectors will have associated eigenvalues of zero" [21]). The eigenvectors are, therefore, computed by first computing the eigenvectors of a $T \times T$ matrix, $\Sigma$, whose elements are defined as

$$\Sigma_{mn} = \frac{1}{T} \Psi_m' \Psi_n.$$
The M eigenvectors of the covariance matrix C are then obtained as T u,=Zv,n\Iln, 1:1,... ,M. 1121 The projection matrix is then given as u = (u1,u2, - ~ ,uM), and the training vectors, \Ill, \Ilg, - -- ,\IIT, are then projected onto the M-dimensional eigenspace to obtain the M-dimensional templates, (1)1, (1)2, - ~ ,T, as n—u'\Iln, 7221, ,T 2.2.2 Operational Stage Given a set of T templates, (1)1, (1)2, - .. ,T, the mean face, \11, and the projection matrix, 11, the operational stage takes a test face image, I‘, and projects it onto the M -dimensional eigenspace as II = u'(F— \I'). Next, a l-nearest neighbor classifier, operating in eigenspace, is used to establish 1111 Or] I hes: [Ital hzu 27 the identity of the test face image. The distance between the test pattern, II, and a template, <1)", defined as ”(I)” — II”, where [I o I] denotes L2 norm, is called Distance From Feature Space (DFFS) [21]. Identity verification using face can be performed by simply computing the mini- mum DFFS for the input image against the templates for the claimed identity, and comparing this DFFS against a pre-determined threshold. The threshold can be determined for a desired FAR or FRR by generating the impostor and genuine dis- tributions for the DFFS for the given database of users. 2.3 Experimental Results in Literature The eigenface method for identification has been tested by a number of researchers on several different databases [21, 12, 22]. It is well known that the method works best when the lighting and scale variations are small [21]. Consequently, for databases that satisfy the above requirement (e.g., the MIT database), the reported performance figures are as high as 97% correct classification, whereas for databases that do not satisfy these constraints (e.g., the Weizmann and Olivetti databases), the reported performance figures are around 80% correct classification. Zhang et al. [22] report a 66% correct classification rate for a database that was obtained by pooling together the MIT, the Olivetti, the Weizmann and the Bern databases. 2.4 S fate teo "fl accepted l the most practice. z work: in: against a] merited iii if such Col more. thet and that s airtime}: 1003113 ence of at the fare [] have 50 f, ’fflOl llll] C13] llllitg. 300er.}. In 01;, 2’ [5 9!? it: in the 28 2.4 Summary Face recognition is a non-intrusive method for personal identification. It is widely accepted by people as a safe biometric technique and has the potential to become the most popular biometric technique. However, for face recognition to be useful in practice, a number of constraints need to be imposed. As of today, face recognition works best with static, controlled and well illuminated frontal facial images taken against a plain background. Under such conditions, the face images can be easily seg- mented from the background and the resulting intra-class variations are small. But if such constraints are imposed then the system looses its user-friendliness. Further- more, there is no conclusive evidence that facial images are a reliable proof of identity and that systems based on face recognition can achieve an acceptable identification accuracy. Ideally, a face recognition system should be able to automatically detect the pres- ence of a face in an image, segment the face, if one is present, and be able to identify the face from a general view point. Research efforts to solve these difficult problems have so far met with very limited success. 
2.3 Experimental Results in Literature

The eigenface method for identification has been tested by a number of researchers on several different databases [21, 12, 22]. It is well known that the method works best when the lighting and scale variations are small [21]. Consequently, for databases that satisfy the above requirement (e.g., the MIT database), the reported performance figures are as high as 97% correct classification, whereas for databases that do not satisfy these constraints (e.g., the Weizmann and Olivetti databases), the reported performance figures are around 80% correct classification. Zhang et al. [22] report a 66% correct classification rate for a database that was obtained by pooling together the MIT, the Olivetti, the Weizmann and the Bern databases.

2.4 Summary

Face recognition is a non-intrusive method for personal identification. It is widely accepted by people as a safe biometric technique and has the potential to become the most popular biometric technique. However, for face recognition to be useful in practice, a number of constraints need to be imposed. As of today, face recognition works best with static, controlled and well illuminated frontal facial images taken against a plain background. Under such conditions, the face images can be easily segmented from the background and the resulting intra-class variations are small. But if such constraints are imposed then the system loses its user-friendliness. Furthermore, there is no conclusive evidence that facial images are a reliable proof of identity and that systems based on face recognition can achieve an acceptable identification accuracy.

Ideally, a face recognition system should be able to automatically detect the presence of a face in an image, segment the face, if one is present, and be able to identify the face from a general viewpoint. Research efforts to solve these difficult problems have so far met with very limited success. We thus conclude that it is very difficult, if not impossible, to design an automatic personal identification system based on facial images alone that is capable of achieving an acceptable degree of recognition accuracy in a general and unconstrained environment.

In our system the eigenface technique for face recognition by Turk and Pentland [21] is employed. We have further extended the recognition system so as to operate it in the verification mode.

Chapter 3

Fingerprint Verification

A fingerprint is a pattern of ridges and furrows that is formed on the tip of the finger due to the accumulation of dead and cornified cells. The individuality of a fingerprint is completely defined by the local characteristics of the ridge pattern (minute details) and their relationships. The process of fingerprint verification involves (i) fingerprint image acquisition, (ii) the extraction and representation of the local ridge characteristics and (iii) the matching of two ridge characteristic patterns so as to measure the similarity/dissimilarity between them. A block diagram of a fingerprint verification system is shown in Figure 3.1. We shall now discuss the details of the various steps involved.

Figure 3.1: Architecture of the fingerprint verification system.

3.1 Fingerprint Image Acquisition

Fingerprint images can be acquired either in an off-line fashion, in which case they are referred to as inked fingerprints or latent fingerprints, or they can be acquired in an online fashion, in which case they are referred to as live-scan fingerprints. Although the inked method of fingerprint image acquisition is rather cumbersome and slow, it has been the standard technique for over a hundred years. Furthermore, the unavailability of direct feedback during the acquisition process makes it difficult to control the quality of the acquired images. Finally, this method is unacceptable in an automated verification system.

Live-scan fingerprints eliminate the intermediate step of getting an impression on paper by obtaining a digital image of the fingerprint directly from the finger. Thus, live-scan fingerprints are ideally suited for an automated verification system. A number of sensing mechanisms can be used for the acquisition of live-scan fingerprint images, including (i) optical frustrated total internal reflection (FTIR), (ii) ultrasonic total internal reflection, (iii) optical total internal reflection of edge-lit holograms, (iv) thermal sensing of the temperature differential (across the ridges and valleys), (v) sensing of differential capacitance, and (vi) non-contact 3D scanning. A number of live-scan fingerprint image acquisition devices based on the above techniques are currently available in the market.

Live-scan fingerprints vary markedly from inked fingerprints in that they are typically obtained using the dabs method, wherein a finger is impressed on the acquisition surface of a scanning device without rolling. Thus, a dabs live-scan fingerprint consists only of the ridge structure that is in contact with the acquisition surface at the time of acquisition.
Although such a fingerprint tends to have a smaller area of valid ridge structure as opposed to an inked fingerprint, it has a smaller amount of deformation than an inked fingerprint. Figure 3.2 shows examples of inked, latent and live-scan fingerprints.

Figure 3.2: Different kinds of fingerprints: (a) inked, (b) latent, and (c) live-scan.

Of the various methods of obtaining live-scan fingerprints described above, the most popular technique is based on the optical frustrated total internal reflection (FTIR) concept. When a finger is placed on one side of a glass platen (prism), the ridges on the fingertip are in contact with the glass surface, whereas the furrows are not in contact with the glass surface. A laser light source illuminates the glass at a certain angle. The points on the glass surface which are in contact with the ridges cause the incident light to be scattered randomly, whereas at the other points on the glass surface the incident light suffers total internal reflection. The light reflected from the glass is captured by a CCD camera, resulting in a corresponding fingerprint image on the imaging plane of the CCD. Figure 3.3 shows the FTIR fingerprint scanner used in our verification system.

Figure 3.3: The live-scan fingerprint scanner from Digital Biometrics.

3.2 Feature Extraction

The role of feature extraction is to derive a set of representative features from the input image that satisfy the following requirements: (i) retention of the discriminating power of the fingerprint image, (ii) compactness, (iii) suitability for use by a matching algorithm, (iv) robustness to noise and distortions, and (v) ease of computation. Since the individuality of fingerprints is completely determined by the local ridge characteristics, the first property postulates that a representation that captures these local characteristics is best suited for use in automatic fingerprint verification systems.

To date, a total of one hundred and fifty different local ridge characteristics, called minute details, have been identified. However, they are not evenly distributed and their detection depends on the image quality and impression conditions. The two most prominent ridge characteristics, called minutiae, are (i) the ridge ending and (ii) the ridge bifurcation. A ridge ending is defined as a point on the ridge at which a ridge terminates abruptly, and a ridge bifurcation is defined as a point on the ridge where a ridge forks or diverges into branch ridges. Given a fingerprint image of reasonable quality, minutiae can be identified easily. Each minutia is characterized by its type, its x and y coordinates, and its direction. Examples of minutiae along with the features that characterize them are shown in Figure 3.4.

Figure 3.4: Examples of minutiae and their characterization.

A representation based on minutiae is compact, suitable for use by a matching algorithm, robust to noise and distortions, and easy to compute. Due to varying lighting conditions and due to variations in the impression pressure, a ridge ending may be mistaken for a ridge bifurcation and vice versa; hence no distinction is made between the two kinds of minutiae. Each detected minutia is, thus, characterized by the following three parameters: (i) x-coordinate, (ii) y-coordinate, and (iii) orientation.
3.2.1 Minutiae Extraction

For our fingerprint verification system, we make use of the minutiae extraction algorithm proposed in [25]. The overall flowchart of the algorithm is depicted in Figure 3.5. It mainly consists of three stages: (i) orientation field estimation, (ii) ridge extraction, and (iii) minutiae extraction and post-processing. First, the local ridge orientation is estimated and the region of interest is located. Ridges are then extracted from the input gray-level images, processed to remove small speckles and holes, and thinned to obtain 8-connected, single-pixel-wide ridges. Finally, in the minutiae extraction stage, minutiae are extracted from the thinned ridges and refined using certain heuristics.

For each minutia detected by the above algorithm, the following parameters are recorded: (i) x-coordinate, (ii) y-coordinate, (iii) orientation, which is defined as the local ridge orientation of the associated ridge, and (iv) the associated ridge segment. The recorded ridges are represented as one-dimensional discrete signals, normalized by a preset length parameter that is approximately equal to the average inter-ridge distance in the fingerprints. About 10 locations on the ridge associated with each minutia are sampled. A fingerprint image represented in this way and stored in a compressed format takes, on average, about 250 bytes, a reduction by a factor of approximately 1,228 from the original size of 307,200 bytes for a 640 x 480 8-bit image.

Figure 3.5: Flowchart of the minutiae extraction algorithm, after Jain et al. [2].

3.3 Fingerprint Matching

The fingerprint matching problem is defined as follows: given two (a test and a template) fingerprint representations, determine whether the two fingerprints are impressions of the same finger [2].

3.3.1 Minutiae Matching

Minutiae matching is essentially a point pattern matching problem. In the ideal case, if (i) the correspondence between the template minutiae pattern and the input minutiae pattern is known, (ii) there are no transformations such as translation, rotation, or deformation between them, and (iii) each minutia present in a fingerprint image is exactly localized, then minutiae matching reduces to the relatively simple task of counting the number of spatially matching pairs between the two fingerprints and comparing this count against a pre-specified threshold value. These conditions are rarely satisfied in reality, and hence, under real-life conditions, minutiae matching is an extremely difficult problem (refer to Figure 3.6). The difficulty can be attributed mainly to two reasons: (i) given a template and a test minutiae pattern from the same finger, it is still necessary to establish the correspondence between the two, and (ii) the imaging system and the process of fingerprint image acquisition introduce a number of errors into the representation.

In order for the minutiae matching algorithm to operate reliably in practice, it is necessary to establish and characterize a realistic model of the variations that occur among the representations of mated pairs. The following properties are desirable in such a model [25]:

1. The finger may be placed at different locations on the sensor, resulting in a (global) translation between the minutiae from the test and the template representations.
Figure 3.6: Fingerprint matching problem: (a) and (b) are two different impressions of the same finger.

2. The finger may be placed in different orientations on the sensor, resulting in a (global) rotation between the minutiae from the test and the template representations.

3. The finger may exert a different (average) downward normal pressure on the sensor, resulting in a (global) spatial scaling between the minutiae from the test and the template representations.

4. The finger may exert a different (average) shear force on the sensor, resulting in a (global) shear transformation (characterized by a shear direction and magnitude) between the minutiae from the test and the template representations.

5. Spurious minutiae may be present in both the template and the test representations.

6. Genuine minutiae may be absent from the template or the test representations.

7. Minutiae may be locally perturbed from their true locations, and the perturbation may be different for each individual minutia.

8. The individual perturbations among the corresponding minutiae could be relatively large (with respect to the ridge spacing), while the perturbations among pairs of minutiae are spatially linear.

9. The individual perturbations among the corresponding minutiae could be relatively large (with respect to the ridge spacing), while the perturbations among pairs of minutiae are spatially non-linear.

10. Only a (ridge) connectivity-preserving transformation could characterize the relationship between the test and the template representations.

A large number of minutiae matching algorithms, which are essentially "Euclidean" matchers, have been proposed in the literature [26, 27, 28, 29, 30, 31]. These algorithms satisfy the above assumptions to varying degrees. However, they are either too slow for use in real-time systems or not reliable enough in terms of accuracy. In our system, an alignment-based matching algorithm developed in [25] is used. The algorithm is simple in theory, efficient in discrimination, and fast in execution. Given a template representation consisting of M minutiae, denoted as $P = ((x_1^P, y_1^P, \theta_1^P)^T, \ldots, (x_M^P, y_M^P, \theta_M^P)^T)$, and an input representation consisting of N minutiae, denoted as $Q = ((x_1^Q, y_1^Q, \theta_1^Q)^T, \ldots, (x_N^Q, y_N^Q, \theta_N^Q)^T)$, the algorithm performs minutiae matching by executing the following steps:

1. Estimate the translation and rotation parameters between the ridge associated with each input minutia and the ridge associated with each template minutia, and align the two minutiae patterns according to the estimated parameters.

2. Convert the template pattern and the input pattern into polar coordinate representations with respect to the corresponding minutiae on which the alignment is achieved, and represent them as two symbolic strings by concatenating the minutiae in increasing order of radial angle:

$$P_p = \left((r_1^P, e_1^P, \theta_1^P)^T, \ldots, (r_M^P, e_M^P, \theta_M^P)^T\right), \qquad (3.1)$$

$$Q_p = \left((r_1^Q, e_1^Q, \theta_1^Q)^T, \ldots, (r_N^Q, e_N^Q, \theta_N^Q)^T\right), \qquad (3.2)$$

where $r_*$, $e_*$, and $\theta_*$ represent the corresponding radius, radial angle, and normalized minutia orientation with respect to the reference minutia, respectively.

3. Match the resulting strings $P_p$ and $Q_p$ with a modified dynamic programming algorithm to find the 'edit distance' between $P_p$ and $Q_p$.

4. Use the minimum edit distance between $P_p$ and $Q_p$ to establish the correspondence of the minutiae between $P_p$ and $Q_p$. The matching score, S, is then defined as
$$S = \frac{100\,M_{PQ}^2}{M\,N}, \qquad (3.3)$$

where $M_{PQ}$ is the number of input minutiae that fall within the bounding boxes of template minutiae. The bounding box of a minutia specifies the possible positions of the corresponding input minutia with respect to the template minutia. Figure 3.7 depicts the above algorithm.

Figure 3.7: Flowchart of the minutiae matching algorithm.

3.4 Experimental Results in the Literature

The performance of the algorithm discussed above has been evaluated using the MSU fingerprint database and a portion of the NIST 9 fingerprint database. The MSU database consists of 150 individuals with 10 fingerprints per individual. The fingerprints were obtained using a live-scan fingerprint reader; no restrictions on the position and orientation of the finger were imposed. Approximately 90% of the fingerprints are estimated to be of fairly good quality. Jain et al. [2] report an authentic accept rate of 87.5% at a false accept rate of 0.01% for the MSU database. The NIST database consists of 1,350 mated fingerprint card pairs, where each mated pair consists of one fingerprint obtained using the rolled method and one obtained using the live-scan method. The reported performance figures are an 83.1% authentic accept rate at a false accept rate of 0.012% [2].

3.5 Summary

Fingerprints have been used for centuries for establishing the identity of an individual. Their biological properties are well understood, and extensive research has been conducted on fingerprint matching. The uniqueness of a fingerprint is completely defined by the local ridge characteristics. Minutiae, which are defined as ridge endings and ridge bifurcations, are the most prominent among the local ridge characteristics, and representations based on minutiae have been used successfully in the design of automatic fingerprint verification systems. We have decided to use the minutiae extraction algorithm proposed in [25] because it is fast, efficient, and tolerant to noise. Minutiae matching is essentially a point-pattern matching problem. The alignment-based elastic algorithm proposed in [25] has been used in our system for its speed and robustness. The algorithm is able to adaptively compensate for the nonlinear deformations and inexact transformations between mated fingerprints and is hence able to achieve good verification accuracy at an acceptable speed.

Chapter 4

Speaker Verification

The speaker verification problem can be stated as follows: "Given an utterance, verify the identity claimed by the speaker against a database of known speakers." Depending upon the application, the speaker may be prompted (visually or orally) to speak a phrase known to the system. Alternatively, the system may attempt to verify the identity claim without any knowledge of the actual words that were spoken. The former mode of operation is termed text-dependent speaker verification, whereas the latter is termed text-independent speaker verification.

Text-dependent speaker verification systems can either make use of a small vocabulary and require the user to speak certain words selected at random from that vocabulary, or allow the user to select his own "password" phrase and require him to use the same phrase each time. The use of a small vocabulary is generally preferred, since the random choice of words to be spoken at the time of verification makes the system harder to circumvent by fraudulent means.
Furthermore, it is also possible to build a speech recognition system for the chosen vocabulary, which is then used to validate the correctness of the acquired speech data at enrollment as well as during verification.

We have designed and implemented a text-dependent speaker verification system that uses a vocabulary of four digits: one, two, seven, and nine. This vocabulary was chosen for the following reasons: (i) Rabiner et al. [32] have demonstrated that hidden Markov models (HMMs) can be used for generating robust spoken-digit models for an individual, and (ii) given four digits, 24 combinations are possible, effectively increasing the length of the vocabulary from 4 to 24, which in turn makes the system less susceptible to tape-recording fraud (the use of prerecorded speech). There is no particular reason for the choice of these four digits, or for electing to use four digits instead of three or five. It can, however, be argued that the choice of the number of digits represents a trade-off between user convenience and the susceptibility of the system to tape-recording fraud: a large number of digits used in combinations would make the system practically impossible to break by tape-recording fraud, but it would also increase the time required for enrollment as well as verification and hence could render the system unusable in a practical scenario.

Input to the system consists of a combination of the four digits, visually presented to the speaker on a video monitor. During enrollment, twelve of the twenty-four possible combinations of the four digits are recorded for each user. Each composite utterance is then segmented to obtain four sets of training data, one for each digit. Each set is then used to generate a spoken-digit model for that speaker. Thus, four models are generated for each speaker. The training samples from all the users are then pooled to train a speech recognition system. During verification, the user is prompted visually to speak a combination chosen at random by the system from the 24 possible combinations. The composite utterance is segmented into four utterances, and each utterance is fed to the speech recognition system. If the four utterances are correctly recognized as the prompted combination, they are compared with the corresponding spoken-digit models for the user and a matching score is computed. The matching score is then fed to the decision fusion module, which makes the final accept/reject decision in conjunction with the results of the face verification system and the fingerprint verification system. Figure 4.1 depicts a block diagram of our speaker verification system. It consists of three modules: (i) speech acquisition and feature extraction, (ii) speech model generation, and (iii) pattern matching.

Figure 4.1: A block diagram of the HMM-based speaker verification module: an enrollment module performing pre-processing, cepstral analysis, and HMM training, and a verification module performing segmentation, pre-processing, cepstral analysis, and scoring to reach an accept/reject decision.

4.1 Feature Extraction

The speaker-specific characteristics of speech are due to differences in the physiological and behavioral aspects of the human speech-production system [33]. The main physiological aspect of the human speech production system is the vocal tract shape.
The vocal tract is generally considered to comprise the speech production organs above the vocal folds: (i) the laryngeal pharynx (beneath the epiglottis), (ii) the oral pharynx (behind the tongue, between the epiglottis and the velum), (iii) the oral cavity (forward of the velum and bounded by the lips, tongue, and palate), (iv) the nasal pharynx (above the velum, at the rear end of the nasal cavity), and (v) the nasal cavity (above the palate, extending from the pharynx to the nostrils). The shaded area in Figure 4.2 depicts the vocal tract [3].

Figure 4.2: A schematic diagram of the human speech production mechanism (soft palate (velum), nasal cavity, nostrils, lips, tongue, teeth, oral cavity, jaw, trachea, and lungs) [3].

The salient acoustic features of speech are contained in the spectral modifications made to the source excitation by the vocal tract system. Hence, it is common in speaker verification systems to make use of features derived only from the vocal tract. In order to characterize the features of the vocal tract, the human speech production mechanism is represented as a discrete-time system of the form depicted in Figure 4.3 [3]. The acoustic wave is produced when the airflow from the lungs is carried by the trachea through the vocal folds. This source of excitation can be characterized as phonation, whispering, frication, compression, vibration, or a combination of these [3]. Phonated excitation occurs when the airflow is modulated by the vocal folds. Whispered excitation is produced by airflow rushing through a small triangular opening between the arytenoid cartilages at the rear of the nearly closed vocal folds. Fricative excitation is produced by constrictions in the vocal tract. Compression excitation results from releasing a completely closed and pressurized vocal tract. Vibration excitation is caused by air being forced through a closure other than the vocal folds, especially at the tongue. Speech produced by phonated excitation is called voiced, speech produced by phonated excitation plus frication is called mixed, and speech produced by the other types of excitation is called unvoiced.

From Figure 4.3, the overall transfer function of the speech production system can be written as

$$S(z) = \begin{cases} H(z)\,R(z) & \text{unvoiced case,} \\ G(z)\,H(z)\,R(z) & \text{voiced case.} \end{cases} \qquad (4.1)$$

Thus, it is possible to represent the vocal tract in parametric form as the transfer function H(z).

Figure 4.3: A discrete-time model representing the human speech production system [3]: an impulse-train generator driving a glottal pulse model G(z) (with pitch period and voice-source gain) and a random noise generator (with noise-source gain) feed, through a voiced/unvoiced switch, the vocal-tract model H(z) and the radiation model R(z) to produce the speech signal S(z).

In order to estimate the parameters of H(z) from the observed speech waveform, it is necessary to assume some form for H(z). Ideally, the transfer function should contain poles as well as zeros. However, if only the voiced regions of speech are used, then an all-pole model for H(z) is sufficient. Furthermore, linear prediction analysis can be used to efficiently estimate the parameters of an all-pole model. Finally, it can also be noted that the all-pole model is the minimum-phase part of the true model and has a magnitude spectrum identical to that of the true model, which contains the bulk of the speaker-dependent information.
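To make the source-filter view above concrete, the following Python sketch synthesizes a "voiced" signal by driving an all-pole vocal-tract filter with an impulse-train excitation. The pitch period and filter coefficients are invented for illustration; in the actual system the coefficients come from the LPC analysis of Section 4.1.3.

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000                      # sampling rate (Hz), as used in Section 4.1.1
pitch = 100                    # hypothetical pitch period in samples
e = np.zeros(fs // 2)
e[::pitch] = 1.0               # impulse-train excitation (phonated source)

G = 1.0                        # overall gain term
a = np.array([1.3, -0.8])      # hypothetical all-pole coefficients a(1), a(2)
# H(z) = G / (1 - a(1) z^-1 - a(2) z^-2); denominator is [1, -a(1), -a(2)]
s = lfilter([G], np.concatenate(([1.0], -a)), e)   # synthetic "voiced" output
```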
Figure 4.4 illustrates the differences in the models for different sounds spoken by the same individual [3]. Figure 4.5 illustrates the differences in the models for two speakers saying the same vowel.

Figure 4.4: Vocal-tract models for different sounds [3] (waveforms and magnitude spectra for the vowels in "eve", "hate", "met", "at", "bird", etc.).

Figure 4.5: Vocal-tract models for two different speakers saying the same vowel (magnitude response in dB versus normalized frequency).

4.1.1 Speech Acquisition and Preprocessing

The spoken speech is converted to an analog signal through the use of a noise-cancellation headset microphone manufactured by Labtec. The analog speech signal is digitized using a Creative Labs SoundBlaster audio card. The sampling rate used is 8 kHz and the resolution is 16 bits/sample. The digitized speech signal, s(n), is then processed by a first-order FIR filter to spectrally flatten the signal and to make it less susceptible to finite-precision effects during the later stages of processing. The z-transform of the first-order FIR filter used in our system is

$$H(z) = 1 - az^{-1}, \qquad (4.2)$$

where a = 0.95. Thus, the preemphasized signal $\tilde{s}(n)$ is related to s(n) as

$$\tilde{s}(n) = s(n) - a\,s(n-1). \qquad (4.3)$$

The preemphasized signal, $\tilde{s}(n)$, is then blocked into frames of N = 300 samples, with a shift of M = 100 samples. This corresponds in time to 37.5 ms frames with a 12.5 ms shift between frames. If the lth frame of speech is denoted by $x_l(n)$ and there are L frames within the entire speech signal, then

$$x_l(n) = \tilde{s}(Ml + n), \quad n = 0, 1, \ldots, N-1, \quad l = 0, 1, \ldots, L-1. \qquad (4.4)$$

Each frame is then weighted by a Hamming window,

$$w(n) = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1. \qquad (4.5)$$

The windowed signal is

$$\tilde{x}_l(n) = x_l(n)\,w(n), \quad 0 \le n \le N-1. \qquad (4.6)$$

4.1.2 Speech Segmentation

Since the speech signal is a composite utterance consisting of four digits, it is segmented so as to obtain the four sub-utterances corresponding to the four digits. In order to segment the speech, we compute the log energy in dB (Equation (4.7)) of each frame. A typical plot of the log energy for an utterance consisting of four digits is shown in Figure 4.6(a). We note that corresponding to each spoken digit there is a peak in the energy waveform. Thus, in the absence of background noise and other speech artifacts, segmentation into digits is a relatively easy task [34]. However, in most practical scenarios, a fair amount of background noise and speech artifacts such as clicks made while opening the mouth to speak, breath noise, etc. are present. These noise sources result in an energy waveform like the one shown in Figure 4.6(b). Under such conditions, the segmentation problem is no longer trivial [35].

Figure 4.6: Plots of the log energy for speech acquired under different environments: (a) minimal background noise; (b) significant background noise.
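The preprocessing chain of Equations (4.2)-(4.6), together with the per-frame log energy of Equation (4.7) used for segmentation, can be summarized in a few lines of Python. This is a minimal sketch for illustration, not the code used in our system; the small constant added inside the logarithm is our own guard against all-zero frames.

```python
import numpy as np

def preprocess(s, a=0.95, N=300, M=100):
    """Pre-emphasis, framing, and Hamming windowing (Equations
    (4.2)-(4.6)); returns the windowed frames and the per-frame
    log energy in dB (Equation (4.7))."""
    s = np.asarray(s, dtype=float)
    s_pre = np.append(s[0], s[1:] - a * s[:-1])        # s~(n) = s(n) - a s(n-1)
    L = 1 + (len(s_pre) - N) // M                      # number of full frames
    frames = np.stack([s_pre[l * M : l * M + N] for l in range(L)])
    n = np.arange(N)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))  # Hamming window
    frames = frames * w
    E = 10 * np.log10(np.sum(frames ** 2, axis=1) + 1e-12)  # log energy (dB)
    return frames, E
```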
We employ a modified version of the top-down approach proposed by Wilpon et al. [35] to detect the word boundaries. A brief description of the algorithm is given below; its flowchart is shown in Figure 4.7. The log energy of the speech signal is computed for each frame as

$$E(l) = 10\log_{10}\!\left(\sum_{n=0}^{N-1}\tilde{x}_l(n)^2\right), \quad l = 0, 1, \ldots, L-1. \qquad (4.7)$$

In order to compensate for the background noise, adaptive-level equalization of the energy contour is performed. First, $E_{min}$ is computed as

$$E_{min} = \min_{0 \le l \le L-1} E(l),$$

and a histogram of the energy values in the vicinity of $E_{min}$ is computed. Finally, the modified energy contour is computed as

$$\tilde{E}(l) = E(l) - \mathrm{Mode},$$

where Mode denotes the mode of this histogram.

Figure 4.7: The flowchart of the algorithm for word boundary detection.

Under ideal recording conditions, the resulting energy contour has the property that during periods of silence the energy level oscillates around the 0 dB level (Figure 4.6(a)). However, in the presence of significant background noise, this may not be true (Figure 4.6(b)).

In the original algorithm [35], the equalized energy contour is scanned repeatedly for the maximum energy pulse. Next, the algorithm examines the frames to the left and the right of the frame with the peak energy until it finds frames in which the energy falls below a predefined threshold. Finally, the algorithm refines the pulse boundaries by checking the first few and the last few frames of the pulse for consistently low energy content. The choice of the predefined threshold is the key to the success of the algorithm. For our particular setup, a threshold below 5 dB needs to be chosen for the algorithm to work properly. However, we have observed that in the presence of widely variable background noise, any threshold below the 5 dB level gives inconsistent results. As a result, we have modified the algorithm as follows. Once the peak energy frame is detected, we scan the energy frames to the left and the right of that frame, but we choose a high threshold of 10 dB to obtain the initial estimates of the pulse boundaries. Next, we examine a certain number of frames beyond the left and the right boundaries and record the minimum energies in those intervals. The final pulse boundaries are then taken as the frames where these minimum energies occur. Furthermore, the number of frames to be examined is decided by the pulse width at the 10 dB level: for small, steep pulses we examine fewer frames than for large, wide pulses. This eliminates the need for selecting a threshold that may be sensitive to the background noise. This modified algorithm has been observed to give better results than the original algorithm.

Each detected pulse is then checked for a minimum width and magnitude and removed from the energy contour before the contour is rescanned. The algorithm stops scanning the energy contour when the last peak detected falls below a threshold (15 dB in our case).

The validated pulses are then arranged in decreasing order of their peak amplitudes. Since each utterance is supposed to contain four digits, if the number of detected pulses is greater than four, an attempt is made to combine some of the pulses to form longer pulses. Starting with the pulse of highest peak amplitude, the end points of the adjacent pulses are examined. If an adjacent pulse is found within a certain number of frames of the end point of the pulse under consideration, the two pulses are combined.
If the number of pulses beyond this point is still greater than four, then the magnitudes of the pulses beyond the fourth largest are checked and a decision is made either to discard those pulses or to reject the recording. The log energy waveform of a typical utterance of the four digits one, two, seven, and nine, along with the detected word boundaries, is shown in Figure 4.8.

Figure 4.8: Output of the speech segmentation algorithm: the log energy contour of a four-digit utterance with the detected word boundaries marked.

4.1.3 Cepstral Analysis

As discussed in Section 4.1, during a stationary frame of speech the vocal tract is generally modeled as an all-pole filter whose transfer function, H(z), is given as

$$H(z) = \frac{H_0}{1 - \sum_{i=1}^{p} a(i)\,z^{-i}}, \qquad (4.8)$$

where $H_0$ represents an overall gain term and the a(i) are the filter coefficients. The filter coefficients can be estimated to the pth order by using linear prediction analysis [3]. The coefficients thus estimated are called linear prediction coefficients (LPC).

The LPC features were very popular in early speech-recognition and speaker-verification systems. However, the comparison of two LPC feature vectors requires the use of computationally expensive similarity measures such as the Itakura-Saito distance [32], and hence LPC features are unsuitable for use in real-time systems. Furui [36] suggested the use of the cepstrum, defined as the inverse Fourier transform of the logarithm of the magnitude spectrum, in speech-recognition applications. The use of the cepstrum allows the similarity between two cepstral feature vectors to be computed using the Euclidean distance. Furthermore, Atal [37] has demonstrated that the cepstrum derived from the LPC features results in the best performance in terms of FAR and FRR for speaker verification. Consequently, we use the LPC-derived cepstrum for our speaker verification system.

The order, p, of the LPC analysis is a parameter whose choice depends upon the application. For spoken-digit recognition, Rabiner et al. [38] have demonstrated that p = 10 results in the best performance. Hence, for each frame a 10th-order LPC analysis using Durbin's recursive method is performed to obtain 10 LPC coefficients. Durbin's recursive method consists of the following steps:

1. The autocorrelation coefficients up to the 10th order of each frame of the windowed signal are obtained as

$$r_l(m) = \sum_{n=0}^{N-1-m} \tilde{x}_l(n)\,\tilde{x}_l(n+m), \quad m = 0, 1, \ldots, 10. \qquad (4.9)$$

2. Each frame of 11 autocorrelations is converted into 10 LPC coefficients by the following recursive algorithm:

$$E^{(0)} = r(0), \qquad (4.10)$$

$$k_i = \left\{ r(i) - \sum_{j=1}^{i-1} \alpha_j^{(i-1)}\,r(|i-j|) \right\} \Big/ E^{(i-1)}, \quad i = 1, \ldots, 10, \qquad (4.11)$$

$$\alpha_i^{(i)} = k_i, \qquad (4.12)$$

$$\alpha_j^{(i)} = \alpha_j^{(i-1)} - k_i\,\alpha_{i-j}^{(i-1)}, \qquad (4.13)$$

$$E^{(i)} = (1 - k_i^2)\,E^{(i-1)}, \qquad (4.14)$$

where the summation in Equation (4.11) is omitted for i = 1. The above set of equations is solved recursively for i = 1, 2, ..., 10, and the resulting 10 LPC coefficients are given as

$$a(m) = \alpha_m^{(10)}, \quad m = 1, \ldots, 10. \qquad (4.15)$$

Next, Q cepstral coefficients, c(m), m = 1, 2, ..., Q, are computed from the p = 10 LPC coefficients using the following recursion:

$$c_0 = \ln \sigma^2, \qquad (4.16)$$

$$c_m = a_m + \sum_{k=1}^{m-1} \left(\frac{k}{m}\right) c_k\,a_{m-k}, \quad m = 1, \ldots, p, \qquad (4.17)$$

$$c_m = \sum_{k=1}^{m-1} \left(\frac{k}{m}\right) c_k\,a_{m-k}, \quad m > p, \qquad (4.18)$$

where $\sigma^2$ is the gain term in the LPC model.
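For illustration, the following Python sketch transcribes Durbin's recursion and the LPC-to-cepstrum recursion directly from Equations (4.9)-(4.18). It is a straightforward, unoptimized rendering; the gain-dependent term $c_0$ of Equation (4.16) is omitted, since only $c_1, \ldots, c_Q$ enter the feature vector.

```python
import numpy as np

def lpc_cepstrum(frame, p=10, Q=12):
    """LPC analysis via Durbin's recursion (Equations (4.10)-(4.15))
    followed by the cepstral recursion (Equations (4.17)-(4.18))."""
    frame = np.asarray(frame, dtype=float)
    r = np.array([np.dot(frame[: len(frame) - m], frame[m:])
                  for m in range(p + 1)])          # autocorrelation (4.9)
    E = r[0]
    alpha = np.zeros(p + 1)                        # alpha[j] holds alpha_j
    for i in range(1, p + 1):
        k = (r[i] - np.dot(alpha[1:i], r[i - 1:0:-1])) / E   # (4.11)
        new = alpha.copy()
        new[i] = k                                 # (4.12)
        new[1:i] = alpha[1:i] - k * alpha[i - 1:0:-1]        # (4.13)
        alpha, E = new, (1.0 - k * k) * E          # (4.14)
    a = alpha[1:]                                  # a(m) = alpha_m^(p)  (4.15)

    c = np.zeros(Q + 1)
    for m in range(1, Q + 1):                      # (4.17)-(4.18)
        acc = a[m - 1] if m <= p else 0.0
        for k in range(1, m):
            if m - k <= p:                         # a_{m-k} = 0 beyond order p
                acc += (k / m) * c[k] * a[m - k - 1]
        c[m] = acc
    return a, c[1:]                                # p LPCs, Q cepstra
```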
Generally, Q is chosen to be greater than p and is typically given as Q = (3/2)p [38]. We choose Q = 12. The cepstral vector is then weighted by a raised-sine window W(m) of the form

$$W(m) = 1 + \frac{Q}{2}\sin\!\left(\frac{\pi m}{Q}\right), \quad m = 1, \ldots, Q.$$

Finally, the time derivative of the sequence of weighted cepstral vectors is approximated by a first-order orthogonal polynomial over a finite-length window of (2K + 1) frames (K = 2 in our case) to obtain 12 delta cepstrum coefficients. The 12 weighted cepstral coefficients are combined with the 12 delta cepstrum coefficients to form the feature vector for the given frame. The entire speech signal is thus represented by a sequence of feature vectors.

4.2 Speaker Modeling

Speech produced by the same person at different times results in similar, yet different, sequences of feature vectors. The purpose of speech modeling is to build models that capture these variations in the extracted set of features. Two types of models have been used extensively in speaker verification and speech recognition systems: stochastic models and template models. A stochastic model treats the speech production process as a parametric random process and assumes that the parameters of the underlying stochastic process can be estimated in a precise, well-defined manner. A template model attempts to model the speech production process in a non-parametric manner by retaining a number of sequences of feature vectors derived from multiple utterances of the same word by the same person. Template models dominated early work in speaker verification and speech recognition because the template model is intuitively more reasonable. However, recent work with stochastic models has demonstrated that these models are more flexible and result in better models of the speech production process [33].

A very popular stochastic model for modeling the speech production process is the hidden Markov model (HMM). HMMs are extensions of conventional Markov models wherein the observations are a probabilistic function of the state; i.e., the model is a doubly embedded stochastic process in which the underlying stochastic process is not directly observable (it is hidden) and can be viewed only through another set of stochastic processes that produce the sequence of observations. Thus, an HMM is a finite-state machine in which a probability density function $p(x \mid s_i)$ is associated with each state $s_i$. The states are connected by a transition network, where the state transition probabilities are $a_{ij} = P(s_j \mid s_i)$. A fully connected three-state HMM is depicted in Figure 4.9.

Figure 4.9: A fully connected three-state hidden Markov model.

For speech signals, another type of HMM, called a left-right model or a Bakis model, has been found to be more useful. A left-right model has the property that as time increases the state index increases (or stays the same); that is, the system states proceed from left to right. Since the properties of a speech signal change over time in a successive manner, this model is well suited to modeling the speech production process. The parameters required for a complete specification of the HMM depicted in Figure 4.10 are summarized below. Again, the choice of these parameters depends upon the application; the particular choices given below have been demonstrated by Rabiner et al. [38] to result in the best performance for isolated spoken-digit recognition.

Figure 4.10: A left-to-right five-state hidden Markov model (self-loops $a_{ii}$, forward transitions $a_{i,i+1}$, and skip transitions $a_{13}$, $a_{24}$, $a_{35}$).
1. N, the number of states in the model. In our case, N = 5.

2. $A = [a_{ij}]$, $i, j = 1, \ldots, N$, the state-transition matrix, where $a_{ij}$ is the probability of making a transition from state i to state j. For a left-to-right model we have the constraints $a_{ij} = 0$ for $j < i$ and for $j > i + 2$, and $a_{NN} = 1$. The state-transition matrix is thus of the form

$$A = \begin{pmatrix} a_{11} & a_{12} & a_{13} & 0 & 0 \\ 0 & a_{22} & a_{23} & a_{24} & 0 \\ 0 & 0 & a_{33} & a_{34} & a_{35} \\ 0 & 0 & 0 & a_{44} & a_{45} \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}. \qquad (4.19)$$

3. $\pi = \{\pi_i\}$, $i = 1, \ldots, N$, the initial state distribution, in which $\pi_i = P[q_1 = i]$. For a left-to-right model the initial state must have unity index, and hence we have

$$\pi_i = \begin{cases} 1 & \text{if } i = 1, \\ 0 & \text{if } i \ne 1. \end{cases} \qquad (4.20)$$

4. B, the observation-probability function. In our case, the observation-probability function for each state is a continuous mixture density of the form

$$b_i(o) = \sum_{k=1}^{M} c_{ik}\,\mathcal{N}(o;\,\mu_{ik},\,U_{ik}), \quad i = 1, \ldots, N, \qquad (4.21)$$

where o is the 24-dimensional observation vector of cepstral coefficients, $c_{ik}$ is the mixture coefficient for the kth mixture in state i, and $\mathcal{N}$ is a 24-dimensional multivariate Gaussian density with mean vector $\mu_{ik}$ and covariance matrix $U_{ik}$ for the kth mixture component in state i. The mixture gains $c_{ik}$ satisfy the constraints

$$\sum_{k=1}^{M} c_{ik} = 1, \quad i = 1, \ldots, N, \qquad (4.22)$$

$$c_{ik} \ge 0, \quad 1 \le i \le N, \quad k = 1, \ldots, M, \qquad (4.23)$$

so that the probability density function is properly normalized, i.e.,

$$\int_{-\infty}^{\infty} b_i(o)\,do = 1, \quad i = 1, \ldots, N. \qquad (4.24)$$

We choose M = 5 mixtures per state.

The complete parameter set of the model, as described above, is denoted compactly as

$$\lambda = (A, B, \pi). \qquad (4.25)$$

Given an HMM with a parameter set $\lambda$, we can now compute the probability that a particular sequence of observation vectors, denoted as

$$O = (o_1, o_2, \ldots, o_T), \qquad (4.26)$$

was generated by the model. This conditional probability is denoted as $P(O \mid \lambda)$. The training data for each digit is used to estimate the parameter set $\lambda$ for the corresponding HMM so as to maximize $P(O \mid \lambda)$. Before we discuss the training procedure, we first list some variables and describe procedures for evaluating them [39]:

1. The forward variable, $\alpha_t(i)$, is defined as

$$\alpha_t(i) = P(o_1, o_2, \ldots, o_t,\, q_t = i \mid \lambda), \qquad (4.27)$$

which is the probability of the partial observation sequence $o_1, o_2, \ldots, o_t$ (until time t) and state i at time t, given the model $\lambda$. $\alpha_t(i)$ is evaluated inductively as follows:

(a) Initialization:

$$\alpha_1(i) = \pi_i\,b_i(o_1), \quad i = 1, \ldots, N. \qquad (4.28)$$

(b) Induction:

$$\alpha_{t+1}(j) = \left[\sum_{i=1}^{N} \alpha_t(i)\,a_{ij}\right] b_j(o_{t+1}), \quad t = 1, \ldots, T-1, \quad j = 1, \ldots, N. \qquad (4.29)$$

The above procedure is referred to as the forward procedure [39].

2. The backward variable, $\beta_t(i)$, is defined as

$$\beta_t(i) = P(o_{t+1}, o_{t+2}, \ldots, o_T \mid q_t = i,\, \lambda), \qquad (4.30)$$

which is the probability of the partial observation sequence from t + 1 to the end, given state i at time t and the model $\lambda$. $\beta_t(i)$ is evaluated inductively as follows:

(a) Initialization:

$$\beta_T(i) = 1, \quad i = 1, \ldots, N. \qquad (4.31)$$

(b) Induction:

$$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j), \quad t = T-1, T-2, \ldots, 1, \quad i = 1, \ldots, N. \qquad (4.32)$$

The above procedure is referred to as the backward procedure [39].

3. The posterior probability variable, $\gamma_t(i, k)$, is defined as

$$\gamma_t(i, k) = P(q_t = i,\ \text{mixture } k \text{ at time } t \mid O,\, \lambda), \qquad (4.33)$$

which is the probability of being in state i at time t with the kth mixture component accounting for $o_t$, given the observation sequence O and the model $\lambda$. $\gamma_t(i, k)$ is evaluated using $\alpha_t(i)$ and $\beta_t(i)$ as follows [39]:
$$\gamma_t(i, k) = \left[\frac{\alpha_t(i)\,\beta_t(i)}{\sum_{i=1}^{N} \alpha_t(i)\,\beta_t(i)}\right]\left[\frac{c_{ik}\,\mathcal{N}(o_t;\,\mu_{ik},\,U_{ik})}{\sum_{m=1}^{M} c_{im}\,\mathcal{N}(o_t;\,\mu_{im},\,U_{im})}\right]. \qquad (4.34)$$

Rabiner et al. [39] also define $\gamma_t(j)$ as the probability of being in state j at time t, given the observation sequence O and the model $\lambda$. From the definition of $\gamma_t(j, k)$,

$$\gamma_t(j) = \sum_{k=1}^{M} \gamma_t(j, k). \qquad (4.35)$$

Based on the above definitions, the training procedure for an HMM, also known as the Baum-Welch method [40], is given as follows. First, define $\xi_t(i, j)$ as the probability of being in state i at time t and state j at time t + 1, given the model and the observation sequence, i.e.,

$$\xi_t(i, j) = P(q_t = i,\, q_{t+1} = j \mid O,\, \lambda). \qquad (4.36)$$

$\xi_t(i, j)$ can be written in terms of the forward and backward variables as

$$\xi_t(i, j) = \frac{\alpha_t(i)\,a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j)}{\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_t(i)\,a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j)}. \qquad (4.37)$$

Since $\gamma_t(i)$ is the probability of being in state i at time t, given the entire observation sequence and the model, we have

$$\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i, j). \qquad (4.38)$$

The summation of $\gamma_t(i)$ over t (from t = 1 to t = T - 1) is the expected number of transitions from state i in the observation sequence O, and the summation of $\xi_t(i, j)$ over t (from t = 1 to t = T - 1) is the expected number of transitions from state i to state j. Using the above formulas and an initial estimate of the parameters, re-estimation of the parameters is performed by the following equations:

$$\bar{\pi}_i = \text{expected number of times in state } i \text{ at time } (t = 1) = \gamma_1(i), \qquad (4.39)$$

$$\bar{a}_{ij} = \frac{\text{expected number of transitions from state } i \text{ to state } j}{\text{expected number of transitions from state } i} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}, \qquad (4.40)$$

$$\bar{c}_{ik} = \frac{\text{expected number of times in state } i \text{ using mixture } k}{\text{expected number of times in state } i} = \frac{\sum_{t=1}^{T} \gamma_t(i, k)}{\sum_{t=1}^{T} \sum_{k=1}^{M} \gamma_t(i, k)}, \qquad (4.41)$$

$$\bar{\mu}_{ik} = \frac{\sum_{t=1}^{T} \gamma_t(i, k)\,o_t}{\sum_{t=1}^{T} \gamma_t(i, k)}, \qquad (4.42)$$

$$\bar{U}_{ik} = \frac{\sum_{t=1}^{T} \gamma_t(i, k)\,(o_t - \mu_{ik})(o_t - \mu_{ik})^T}{\sum_{t=1}^{T} \gamma_t(i, k)}. \qquad (4.43)$$

Thus, given an initial model estimate $\lambda = (A, B, \pi)$, Equations (4.39)-(4.43) result in a re-estimated model denoted as $\bar{\lambda} = (\bar{A}, \bar{B}, \bar{\pi})$. It has been proven by Baum and his colleagues [40] that either (i) the initial model $\lambda$ defines a critical point of the likelihood function, in which case $\bar{\lambda} = \lambda$, or (ii) model $\bar{\lambda}$ is more likely than model $\lambda$ in the sense that $P(O \mid \bar{\lambda}) > P(O \mid \lambda)$. The training procedure therefore proceeds by iteratively using $\bar{\lambda}$ in place of $\lambda$ and repeating the re-estimation calculations. The re-estimation procedure is terminated when $(P(O \mid \bar{\lambda}) - P(O \mid \lambda)) < \epsilon$, where $\epsilon$ is some predefined threshold.

As with any iterative technique, a good choice of initial parameters is imperative in order to ensure the convergence of the re-estimation procedure. A number of algorithms for obtaining good initial estimates of the parameters have been proposed. For our system, we make use of the algorithm proposed in [32]. The main steps in the algorithm are summarized below:

1. Choose the state transition matrix such that the sum of each row is 1.0.

2. For a given set of utterances, compute the global mean and variance of all the feature vectors and use these as the mean and variance of the mixture densities for each state.

3. Use the Viterbi algorithm [41] to segment the utterances into states.

4. For each state, apply a clustering algorithm to the feature vectors assigned to that state so as to obtain the same number of clusters as the number of components in the mixture.

5. Update the means and variances of the mixture densities to be the means and variances of the corresponding clusters.

6. Repeat steps 3 through 5 until the change in the model parameters between two successive updates is less than a predefined value ($\epsilon$).

The training procedure described above is also used in our system to train four speaker-independent models corresponding to the four digits.
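The forward procedure defined above (Equations (4.28)-(4.29)) is the basic evaluation step underlying both training and matching. A minimal numpy sketch is given below; it assumes the per-frame emission likelihoods have been precomputed from the mixture densities, and it is illustrative only: a practical implementation scales the forward variables (or works in the log domain) to avoid the numerical underflow discussed in Section 4.3.

```python
import numpy as np

def forward_probability(A, pi, B):
    """The forward procedure, Equations (4.28)-(4.29).

    A  : (N, N) state-transition matrix, A[i, j] = a_ij
    pi : (N,)   initial state distribution
    B  : (T, N) B[t, j] = b_j(o_t), emission likelihood of frame t in
         state j, assumed precomputed from the Gaussian mixtures
    Returns P(O | lambda) = sum_i alpha_T(i)  (Equation (4.44)).
    """
    T, N = B.shape
    alpha = pi * B[0]                    # initialization (4.28)
    for t in range(1, T):
        alpha = (alpha @ A) * B[t]       # induction (4.29)
    return alpha.sum()
```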
These speaker-independent models are used for performing speech recognition and also for normalizing the resulting matching score, as described in the next section.

4.3 Pattern Matching

The pattern matching process involves comparing a given set of input feature vectors against the speaker model for the claimed identity and computing a matching score. For the hidden Markov models discussed above, the matching score is the probability that the given set of feature vectors was generated by the model. Let the set of feature vectors corresponding to the input speech be denoted as $O = (o_1, o_2, \ldots, o_T)$ and let the HMM be denoted as $\lambda = (A, B, \pi)$. We wish to compute the conditional probability $P(O \mid \lambda)$. From the definition of the forward variable $\alpha_T(i)$, this conditional probability is simply given as

$$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i). \qquad (4.44)$$

However, in most cases we are interested in determining the optimal sequence of states for the given observation sequence and the model, and then computing the probability of that state sequence. The Viterbi algorithm [41] is used to find the single best state sequence, $q = (q_1 q_2 \cdots q_T)$, for the given observation sequence $O = (o_1, o_2, \ldots, o_T)$ and the HMM $\lambda = (A, B, \pi)$ as follows. First, the quantity $\delta_t(i)$ is defined as

$$\delta_t(i) = \max_{q_1, q_2, \ldots, q_{t-1}} P[q_1 q_2 \cdots q_{t-1},\, q_t = i,\, o_1 o_2 \cdots o_t \mid \lambda], \qquad (4.45)$$

that is, the highest probability along a single path, at time t, that accounts for the first t observations and ends in state i. Based on this definition, the complete procedure for finding the best state sequence is:

1. Initialization:

$$\delta_1(i) = \pi_i\,b_i(o_1), \quad i = 1, \ldots, N, \qquad (4.46)$$

$$\psi_1(i) = 0. \qquad (4.47)$$

2. Recursion:

$$\delta_t(j) = \max_{1 \le i \le N}\left[\delta_{t-1}(i)\,a_{ij}\right] b_j(o_t), \quad t = 2, \ldots, T, \quad j = 1, \ldots, N, \qquad (4.48)$$

$$\psi_t(j) = \arg\max_{1 \le i \le N}\left[\delta_{t-1}(i)\,a_{ij}\right], \quad t = 2, \ldots, T, \quad j = 1, \ldots, N. \qquad (4.49)$$

3. Termination:

$$P^* = \max_{1 \le i \le N} \delta_T(i), \qquad (4.50)$$

$$q_T^* = \arg\max_{1 \le i \le N} \delta_T(i). \qquad (4.51)$$

4. Path (state sequence) backtracking:

$$q_t^* = \psi_{t+1}(q_{t+1}^*), \quad t = T-1, T-2, \ldots, 1. \qquad (4.52)$$

The probability of this single best path is the matching score for the given input against the HMM for the claimed identity and is given by Equation (4.50) above. However, on most machines a direct computation of this probability will result in numerical underflow; hence, it is customary to calculate the matching score as the logarithm of the probability. Thus, for a given digit the matching score against the speaker's model (S) is given as

$$L_r(S) = \log\!\left(P_r^*(S)\right), \quad r = 1, 2, 7, 9. \qquad (4.53)$$

In our system, the input consists of one utterance containing each of the four digits. Hence, the probability of the utterance being spoken by the user is the product of the individual probabilities, under the assumption that the digits were spoken independently. However, since we are dealing with the logarithm of the probability, the combined matching score can be computed simply by adding together the matching scores of the four sub-utterances against their corresponding models:

$$L(S) = L_1(S) + L_2(S) + L_7(S) + L_9(S). \qquad (4.54)$$
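For illustration, the Viterbi scoring of Equations (4.46)-(4.52) and the combined score of Equation (4.54) can be rendered in the log domain as follows. The per-frame log emission probabilities are assumed to be precomputed from the mixture densities, and the function and variable names are our own.

```python
import numpy as np

def viterbi_log_score(logA, logpi, logB):
    """Log-probability of the single best state path (Equation (4.50)),
    computed in the log domain to avoid numerical underflow."""
    T, N = logB.shape
    delta = logpi + logB[0]                      # initialization (4.46)
    psi = np.zeros((T, N), dtype=int)            # (4.47)
    for t in range(1, T):
        cand = delta[:, None] + logA             # cand[i, j] = delta(i) + log a_ij
        psi[t] = cand.argmax(axis=0)             # (4.49)
        delta = cand.max(axis=0) + logB[t]       # (4.48)
    score = float(delta.max())                   # log P*  (4.50)
    q = [int(delta.argmax())]                    # (4.51)
    for t in range(T - 1, 0, -1):                # backtracking (4.52)
        q.append(int(psi[t][q[-1]]))
    return score, q[::-1]

def combined_score(digit_params):
    """Equation (4.54): sum of the per-digit log-likelihoods, given a
    list of (logA, logpi, logB) tuples, one per digit model."""
    return sum(viterbi_log_score(*params)[0] for params in digit_params)
```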
Recent studies [42, 43, 44] have suggested that the log-likelihood score defined above can be normalized by the use of so-called cohort models: the test utterance is scored against the cohort models and the resulting score is subtracted from the log-likelihood score defined above. The normalized matching score thus obtained has been demonstrated to be more stable and less variable than the unnormalized score and thus results in better performance. Several cohort models have been proposed [43]. In our system, we make use of the speaker-independent models that are used for speech recognition as the cohort models. Cohort normalization is thus performed by computing the combined matching score for the test utterance against the speaker-independent models and subtracting that score from the combined matching score given by Equation (4.54) above.

4.4 Experimental Results in the Literature

The YOHO database [45] has been widely used for evaluating the performance of speaker verification systems. The database consists of combination-lock sequences (e.g., 26-47-94) collected from 138 individuals over a period of three months in an office environment using a STU-III electret-microphone telephone handset. Each individual has four enrollment sessions with 24 phrases per session and 10 verification sessions with four phrases per session. For the YOHO database, Che and Lin [46] report an equal error rate of 0.62% using cepstrum features and word-level HMM models. For the same database, Colombi et al. [44] report an equal error rate of 0.28% using a feature vector composed of the cepstrum, log frame energy, first-order delta cepstrum, and second-order delta cepstrum; however, they use phoneme-level rather than word-level HMM models, and they also employ cohort normalization. It should be noted that, in spite of the impressive performance figures stated above, speaker verification systems have not been able to achieve comparable accuracies in field tests.

4.5 Summary

The production of human speech is a complex process that involves several transformations occurring at different levels: semantic, linguistic, articulatory, and acoustic [33]. Anatomical variations among different people and differences in their learned speaking habits manifest themselves as differences in the acoustic properties of the speech signal. By analyzing and identifying these differences, it is possible to discriminate among speakers [33].

The verification process, using speech, can be text-dependent or text-independent. We have implemented a text-dependent speaker verification system. Our vocabulary consists of four digits: one, two, seven, and nine. Hidden Markov models are used to model the speech production process for each digit, and the matching score is the probability that a given speech sample was generated by the HMM for the claimed identity. Speech recognition is also implemented, both to ensure the validity of the input speech and for use in implementing cohort normalization.

Speech is probably the most readily accepted biometric; in fact, the largest-scale deployment of any biometric to date is Sprint's Voice FONCARD [33]. The reason for this popularity is simple: the acquisition of speech samples is a very simple, non-invasive procedure. Furthermore, speech acquisition requires very modest hardware, and, in most cases, it is possible to perform identity authentication over the existing telephone networks. However, compared to fingerprints or retinal scans, the discriminative capability of a voice-print is significantly lower. Consequently, the performance of a speaker verification system for a large population is limited. Furthermore, environmental factors such as background noise, the time of day, and the transmission channel, as well as the emotional state of the individual, cause significant changes in speech patterns. Also, people skilled in mimicking other people's voices may be able to fool a speaker verification system with relative ease.
Consequently, it is generally agreed that a speaker verification system will not be able to achieve an accuracy comparable to that of fingerprint-based or retinal pattern-based biometric systems.

Chapter 5

Decision Fusion

Each of the three biometric systems considered so far makes use of only one biometric characteristic to compute a confidence about the identity claim made by an individual. An integration scheme that combines the confidence levels expressed by the independent biometric systems is likely to improve the verification accuracy. In this chapter we propose two different decision fusion schemes that allow us to integrate face, fingerprint, and speech to improve the accuracy of the system in terms of FAR and FRR.

Identity authentication using multiple biometric indicators is essentially a decision fusion problem, which utilizes information from multiple systems to increase fault-tolerance, to reduce uncertainty, to reduce noise, and to overcome the limitations of the individual systems [47]. A multi-modal approach can increase the reliability of the decisions made by a biometric system. Multiple biometrics enable a user's identity to be verified even if some of the biometric characteristics used by the system are not available and/or not suitable for automatic processing. By using multiple biometric characteristics, the system becomes applicable to a larger target population. In addition, a multi-modal biometric system is generally more robust to fraudulent technologies, because it is more difficult to forge multiple biometric characteristics than to forge a single one. Figure 5.1 depicts the architecture of a generic biometric-based identity verification system employing multiple biometrics [4].

Figure 5.1: Integration of different biometric characteristics (after Jain et al. [4]).

Decision fusion can be carried out at different levels [5]: (i) Abstract level: the output from each module is only a YES/NO label without any associated confidence; in this case, a simple majority rule may be employed to reach a more reliable decision [48]. (ii) Rank level: the output from each module is a YES/NO label and the modules are ranked by decreasing confidence values, but the confidence values themselves are not specified. (iii) Measurement level: the output from each module is a YES/NO label with an associated confidence value; in this case, more accurate decisions can be made by integrating the different confidence measures into a single, more informative confidence measure.

Each of the biometric systems considered so far has very different characteristics and a different matching scheme. Therefore, it is more reasonable to integrate the three biometrics at the measurement level than at the other two levels. An identity claim, along with the requisite biometrics, is presented to the integrated system; the different modules within the system process the respective biometrics and produce accept/reject decisions along with scores indicating the confidence in their decisions. The decision fusion module then makes the final accept/reject decision.

5.1 Formulation

Let $\mathcal{S}$ denote a given biometric system, and let the N users enrolled in $\mathcal{S}$ be labeled by the numeric indicators $1, 2, \ldots, N$ [4]. Assume, for simplicity, that each enrolled user has only one template (for each type of indicator) stored in the system.
So the template for the ith user, $\Phi^i = \{\Phi_1^i, \Phi_2^i, \Phi_3^i\}$, has three components, where $\Phi_1^i$, $\Phi_2^i$, and $\Phi_3^i$ are the templates for the fingerprint, face, and speech biometrics, respectively. Let $(\Phi^0, I)$ denote the biometric indicators and the identity claimed by a user. Again, $\Phi^0$ has three components, $\Phi^0 = \{\Phi_1^0, \Phi_2^0, \Phi_3^0\}$, corresponding to the measurements of the three biometric indicators. The claimed identity, I, belongs to either category $\omega_1$ or category $\omega_2$, where $\omega_1$ indicates that the user claims a true identity (a genuine user) and $\omega_2$ indicates that the user claims a false identity (an impostor). The biometric system $\mathcal{S}$ matches $\Phi^0$ against $\Phi^I$ to determine into which category, $\omega_1$ or $\omega_2$, the claimed identity I falls, i.e.,

$$I \in \begin{cases} \omega_1 & \text{if } \mathcal{F}(\Phi^0, \Phi^I) > \epsilon, \\ \omega_2 & \text{otherwise,} \end{cases} \qquad (5.1)$$

where $\mathcal{F}(\cdot, \cdot)$ is a function that measures the similarity between $\Phi^0$ and $\Phi^I$ and $\epsilon$ is a threshold. Under the assumption that the three biometric indicators are independent, the function $\mathcal{F}(\cdot, \cdot)$ can be rewritten in terms of component similarity functions as

$$\mathcal{F}(\Phi^0, \Phi^I) = \mathcal{F}\!\left(\mathcal{F}_1(\Phi_1^0, \Phi_1^I),\ \mathcal{F}_2(\Phi_2^0, \Phi_2^I),\ \mathcal{F}_3(\Phi_3^0, \Phi_3^I)\right), \qquad (5.2)$$

where $\mathcal{F}_1(\Phi_1^0, \Phi_1^I)$, $\mathcal{F}_2(\Phi_2^0, \Phi_2^I)$, and $\mathcal{F}_3(\Phi_3^0, \Phi_3^I)$ are functions that measure the similarity between the corresponding biometric indicators. For the three biometrics considered so far, we have:

1. Fingerprints: Let $\Phi_1^I$ denote the extracted minutiae for the Ith identity and $\Phi_1^0$ denote the extracted minutiae for the input fingerprint image. Then

$$\mathcal{F}_1(\Phi_1^0, \Phi_1^I) = \frac{100\,C^2}{P\,Q}, \qquad (5.3)$$

where P and Q are the total numbers of minutiae in $\Phi_1^0$ and $\Phi_1^I$, respectively, and C is the total number of corresponding minutiae pairs between $\Phi_1^0$ and $\Phi_1^I$ established by the minutiae matching algorithm.

2. Face: Let $\Phi_2^I$ denote the projection of the face image for the Ith identity onto the eigenspace, and let $\Phi_2^0$ denote the projection of the input face image onto the same eigenspace. Then

$$\mathcal{F}_2(\Phi_2^0, \Phi_2^I) = -\left\|\Phi_2^0 - \Phi_2^I\right\|, \qquad (5.4)$$

where $\|\cdot\|$ denotes the $L_2$ norm.

3. Speech: Let $\Phi_3^I$ denote the hidden Markov model of a given word for the Ith identity, and let $\Phi_3^0 = (o_1, o_2, \ldots, o_T)$ denote the sequence of feature vectors extracted from the input speech. Then

$$\mathcal{F}_3(\Phi_3^0, \Phi_3^I) = P\!\left((o_1, o_2, \ldots, o_T) \mid \Phi_3^I\right). \qquad (5.5)$$

As explained in Chapter 4, the probability computation is performed using the Viterbi algorithm (i.e., the maximum-likelihood path is used).

When the input biometric indicator belongs to a genuine individual, the resulting similarity measure is called a genuine score; when it belongs to an impostor, it is called an impostor score. If we denote the three similarity measures defined above by the random variables $X_1$, $X_2$, and $X_3$, then the distributions of the genuine scores for fingerprint, face, and speech verification are given by the class-conditional probability density functions $p_j(X_j \mid \omega_1)$, $j = 1, 2, 3$. Similarly, the distributions of the impostor scores are given by the class-conditional probability density functions $p_j(X_j \mid \omega_2)$, $j = 1, 2, 3$. Further, under the assumption that $X_1$, $X_2$, and $X_3$ are statistically independent, the joint class-conditional probability density function of $X_1$, $X_2$, and $X_3$ has the form

$$p(X_1, X_2, X_3 \mid \omega_i) = \prod_{j=1}^{3} p_j(X_j \mid \omega_i), \quad i = 1, 2. \qquad (5.6)$$
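The component matchers of Equations (5.3)-(5.5) amount to the following; this is a schematic rendering with illustrative function names, not the system code.

```python
import numpy as np

def fingerprint_score(C, P, Q):
    """Equation (5.3): 100 C^2 / (P Q), where C is the number of
    corresponding minutiae pairs and P, Q are the minutiae counts."""
    return 100.0 * C * C / (P * Q)

def face_score(proj_input, proj_template):
    """Equation (5.4): negative L2 distance between the eigenspace
    projections of the input and template face images."""
    return -float(np.linalg.norm(np.asarray(proj_input)
                                 - np.asarray(proj_template)))

# Equation (5.5) is the HMM likelihood P(O | Phi_3^I); in our setting
# it would be computed with the Viterbi scoring sketched in Chapter 4.
```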
5.2 The Neyman-Pearson Rule

Depending on the application's requirements on verification accuracy, any one of a number of different statistical decision theory frameworks can be used. In biometrics, the performance requirement is usually specified in terms of the FAR [1]. In this case, the decision fusion scheme should establish a decision boundary that satisfies the FAR specification while minimizing the FRR. Let $R^3$ denote the three-dimensional space spanned by $(X_1, X_2, X_3)$; let $R_1^3$ and $R_2^3$ denote the $\omega_1$-region and the $\omega_2$-region, respectively ($R_1^3 + R_2^3 = R^3$); and let $\epsilon_0$ denote the specified FAR. According to the Neyman-Pearson rule [49], a given observation, $X^0 = (X_1^0, X_2^0, X_3^0)$, is classified as

$$X^0 \in \begin{cases} \omega_1 & \text{if } \dfrac{p(X_1^0, X_2^0, X_3^0 \mid \omega_1)}{p(X_1^0, X_2^0, X_3^0 \mid \omega_2)} > \lambda, \\ \omega_2 & \text{otherwise,} \end{cases} \qquad (5.7)$$

where $\lambda$ is a threshold on the likelihood ratio

$$\frac{p(X_1, X_2, X_3 \mid \omega_1)}{p(X_1, X_2, X_3 \mid \omega_2)}, \qquad (5.8)$$

chosen such that

$$\epsilon_0 = \int_{R_1^3} p(X_1, X_2, X_3 \mid \omega_2)\,dX_1\,dX_2\,dX_3. \qquad (5.9)$$

For a given biometric system, the genuine and the impostor class-conditional probability density functions are usually unknown. A critical issue in this decision fusion scheme is therefore to estimate the genuine and impostor class-conditional probability density functions from a set of training samples. Ideally, it would be desirable to characterize these probability density functions by known statistical models. In practice, however, the genuine and impostor probability density functions are estimated from empirical data using a non-parametric technique.

5.3 Linear Discriminant Functions

In addition to the statistical decision theory approach, decision fusion can also be performed using a discriminant function approach:

$$(X_1^0, X_2^0, X_3^0) \in \begin{cases} \omega_1 & \text{if } \mathcal{F}(X_1^0, X_2^0, X_3^0) > a_0, \\ \omega_2 & \text{otherwise,} \end{cases} \qquad (5.10)$$

where $\mathcal{F}$ is the discriminant function and $a_0$ is a threshold, which can be "learned" from a set of training samples using a number of techniques. While the discriminant function $\mathcal{F}$ can be of any form, in practice a linear function is used for its simplicity:

$$(X_1^0, X_2^0, X_3^0) \in \begin{cases} \omega_1 & \text{if } (a_1 X_1^0 + a_2 X_2^0 + a_3 X_3^0) > a_0, \\ \omega_2 & \text{otherwise,} \end{cases} \qquad (5.11)$$

where $a_0$, $a_1$, $a_2$, and $a_3$ are unknown parameters that need to be estimated from training data. An important problem in this approach is to find an appropriate discriminant function that satisfies the specified FAR; generally, there exists no systematic method for finding such a function. It is, however, possible to use a perceptron to obtain a decision surface that minimizes the equal error rate, and the desired decision scheme can then be established by moving the resulting decision surface parallel to itself so that the FRR is minimized for the specified FAR.

5.4 Summary

Due to the inherent limitations of each biometric characteristic (refer to Table 1.1), it is difficult for an automatic personal identity authentication system based on only a single biometric characteristic to achieve the desired verification accuracy. A multimodal biometric system that integrates more than one biometric characteristic through a decision fusion scheme can overcome the limitations of the individual biometric characteristics. Thus, if a person is unable to give a good fingerprint sample for various reasons (dryness of the skin, personal injury, etc.), the system may use only his/her face and speech samples for identity authentication. Similarly, in a noisy environment a biometric system based on speech alone may not be able to reach a reliable decision, while in a cluttered environment face verification may not work very well. We have proposed two decision fusion schemes that integrate faces, fingerprints, and speech so as to improve the verification accuracy. The proposed schemes also make the integrated system applicable to a larger target population and to diverse operating environments. Decision fusion can also be employed in automatic personal identification systems, wherein the goal is to improve the accuracy as well as the speed of the integrated system [8], and in systems wherein a single biometric characteristic is processed by multiple matching algorithms.
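As an illustration of measurement-level fusion, the decision rule of Equation (5.11) reduces to a few lines. The weights and scores below are made up for the example; the weights are assumed to have been learned from genuine/impostor training scores (e.g., with a perceptron), with the threshold then shifted to meet the specified FAR as described above.

```python
import numpy as np

def linear_fusion_decision(x, a, a0):
    """Equation (5.11): accept (omega_1) iff a1*X1 + a2*X2 + a3*X3 > a0,
    where x = (X1, X2, X3) are the three matcher scores."""
    return float(np.dot(a, x)) > a0

# Hypothetical example: fingerprint, face, and speech scores fused
# with illustrative weights and a zero threshold.
accept = linear_fusion_decision(x=[42.0, -3.1, -870.0],
                                a=[0.02, 0.30, 0.001], a0=0.0)
```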
Chapter 6

Integrated Biometric System

Our goal is to design and implement a fully automatic personal verification system that integrates three biometrics: face, fingerprint, and speech. The system must be able to operate in real time and must be user-friendly. By real time we mean that the response time of the system must be of the order of a few seconds. "User-friendliness" in this context implies that the system must be simple to use and easy to maintain. Furthermore, the proposed system is intended for use in a medium-sized environment wherein the number of enrolled users is of the order of a few thousand. The block diagram of such an integrated system is depicted in Figure 6.1. The system mainly consists of four components: (i) a data acquisition module, (ii) an enrollment module, (iii) a template database, and (iv) a verification module.

Figure 6.1: The block diagram of the integrated system for personal verification: the enrollment module (face, fingerprint, and speech acquisition; database browser; eigenface extractor, minutiae extractor, and cepstral analysis) feeds the template database, while the verification module (eigenspace comparison, minutiae matching, and HMM scoring) feeds the decision fusion module to produce an accept/reject decision.

The data acquisition module is responsible for acquiring the face and fingerprint images and the speech samples of a user who intends to use the system. The enrollment module is then responsible for converting them into a format suitable for use by the verification module and storing them in the template database. In addition, the enrollment module also allows for system maintenance operations such as user deletion, user update, system parameter specification, etc. The use of a template database allows for the efficient retrieval of templates for a given identity claim during verification and also allows for any off-line operations that may be required. The verification module receives a set of input data from the data acquisition module along with the identity claimed by the user. It then performs feature extraction on the input data and retrieves the representative templates for the claimed identity. Finally, it evaluates the input data against the templates and makes use of decision fusion to reach an accept/reject decision.

6.1 Data Acquisition Module

The performance of a biometric-based personal verification system depends critically on the data acquisition process. If the user is assumed to be cooperative, then the quality of the acquired data can be controlled by providing the following mechanisms to the user: (i) a feedback and self-regulation guide, and (ii) quality control. For a cooperative user, feedback provides an efficient mechanism for guaranteeing the quality of the input images: if the user does not present the corresponding biometric characteristic properly, the feedback mechanism allows the user to make the appropriate adjustments. We have designed an intuitive graphical user interface to acquire face and fingerprint images and speech data (Figure 6.2). For face and fingerprint images, real-time feedback allows the user to rapidly adjust the location of the finger or the position of the head so as to satisfy the placement requirements. For speech data, the quality of the input speech, in terms of signal-to-noise ratio, etc., is monitored and feedback is provided to the user accordingly.
6.2 Enrollment Module

In our system, the data acquisition process is separated from the enrollment process so as to allow data collection to be done in a distributed fashion. The enrollment module allows for centralized system administration. Once the data is collected for a particular user, the following events must occur before the user is considered to be enrolled in the system: (i) the eigenspace and the projections of the user onto the eigenspace must be computed, (ii) a representative image from the given set of fingerprint images must be selected and minutiae extraction must be applied, and (iii) hidden Markov models must be trained for the user.

[Figure 6.2: GUIs for data acquisition during enrollment. The interface prompts the user, e.g., "Please place the center of your finger at the center of the cross hair."]

The graphical user interface for the enrollment module has been designed to facilitate the above three tasks for an administrator. The administrator can select a user from the list of users in the database (Figure 6.3(a)). For the selected user, the particular data to be browsed can be selected from the three possible choices (Figure 6.3(b)). The GUI changes to adapt itself to the data type selected (Figure 6.4). The required tasks to complete the enrollment can then be performed from the appropriate screens.

[Figure 6.3: User database: (a) the users enrolled in the system are shown in a list; (b) for each user, three types of data (faces, fingerprints, speech) are defined.]

6.3 Template Database

Ideally, a relational database management system would be used for storing the templates generated for each user. However, since our system is intended for use in a medium-sized environment, it suffices to use the native file system of the given hardware and software platform for storing the templates. The template database is organized as a hierarchical directory structure which is indexed by the user name. For each user, the face template consists of the projections of the face images onto the eigenspace defined by the set of enrolled users. The fingerprint template consists of the set of extracted minutiae from the selected fingerprint image (the selection is done manually in the enrollment module). The speech template consists of the four hidden Markov models for that user. The combined storage requirement for the three templates of a single user is of the order of a few kilobytes (typically around 7 to 8 KB).

[Figure 6.4: GUIs for the enrollment module.]

6.4 Verification Module

During a verification session, the user is required to give one rendition of each of the three biometrics and submit an identity claim. Since the data acquisition method at the verification stage is different from that of the enrollment session, we have designed a different data acquisition GUI for the verification module (Figure 6.5). The verification module is then responsible for retrieving the appropriate templates from the database, computing the matching scores of the input renditions against the corresponding templates, and finally making an "Accept/Reject" decision through decision fusion.

[Figure 6.5: GUIs for the verification module.]
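The retrieval and matching flow of Sections 6.3 and 6.4 can be summarized by the following sketch. It is an illustration only: the thesis specifies a hierarchical directory indexed by user name, but the file names, the .npy storage format, and the function names below are assumptions, and the three matchers are supplied as callables standing in for the eigenface comparison, minutiae matching, and HMM scoring sub-systems.

    from pathlib import Path
    import numpy as np

    TEMPLATE_ROOT = Path("templates")  # assumed root of the template database

    def load_templates(user_name):
        """Fetch the three templates stored for a claimed identity."""
        d = TEMPLATE_ROOT / user_name
        return {
            "face": np.load(d / "eigen_projections.npy"),      # eigenspace projections
            "fingerprint": np.load(d / "minutiae.npy"),        # extracted minutiae set
            "speech": [np.load(d / f"hmm_{i}.npy", allow_pickle=True)
                       for i in range(4)],                     # four hidden Markov models
        }

    def verify(user_name, inputs, matchers, fusion_rule):
        """One verification session: compute one matching score per
        biometric and reach an accept/reject decision through fusion."""
        templates = load_templates(user_name)
        scores = [matchers[b](inputs[b], templates[b])
                  for b in ("face", "fingerprint", "speech")]
        return "Accept" if fusion_rule(scores) else "Reject"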
6.5 Commercially Available Biometric Systems

Today a large number of vendors are offering automatic personal identification and verification systems in the market. Some of the well-known systems are TrueFace from Miros [50] and FaceIt from Visionics [51], which are face-based systems; GateKey from Identicator Technology [52] and BioMouse from American Biometric Company Inc. [53], which are fingerprint-based systems; and VoiceCrypt from Veritel Corporation of America [54], SpeakEZ from T-Netix [55], and SpeakerKey from ITT [56], which are speech-based systems. All the above systems are available for the Microsoft Windows platform. Admittedly, our system does not have the same commercial appeal as these systems, but the functionality provided by our system is comparable to any of them. Also, all the above systems make use of a single biometric. We are not aware of any commercially available system for the Microsoft Windows platform that uses multiple biometrics.

[Figure 6.6: The result of a successful verification.]

6.6 Summary

In this chapter we have described a real-time, medium-sized automatic personal verification system that uses three different biometrics (face, fingerprint, and speech) to perform identity authentication. The system provides convenient graphical user interfaces for data acquisition, user enrollment, database management, and identity authentication. Real-time feedback is provided during the data acquisition process to ensure that the data thus obtained is of reasonable quality. User enrollment can be performed in an off-line fashion and at a centralized location so as to ensure the validity of the users being added to the system. The verification module requires one instance each of face, fingerprint and speech along with an identity claim to perform authentication. The template database is maintained as a hierarchical directory structure on the native file system of the host machine, in our case, an IBM-compatible personal computer running Windows NT 4.0. In the next chapter we will evaluate the performance of our system and demonstrate through experimental results that the system achieves the desired performance and also operates in real time.

Chapter 7

Experimental Results

In this chapter we evaluate the performance of our integrated system and assess the gain in performance that can be achieved by employing the two decision fusion schemes discussed in Chapter 5. We will demonstrate that our proposed decision fusion schemes significantly improve the performance of the integrated system over the individual biometric systems. It is, however, critical to understand that the performance of a biometric system depends upon the operating environment and the amount of cooperation that can be expected from the user. Hence, we must first understand the various factors that affect the performance of the biometric systems based on face, fingerprint, and speech. These factors are summarized in Table 7.1. The users are assumed to be cooperative. Controllability refers to the ease with which the various factors can be constrained to be within given limits for a cooperative user without causing undue inconvenience to the user. The performance evaluation of a biometric system is a difficult task [57, 58].
Given any biometric system, it is extremely time consuming and expensive for the manufacturer to test it under all possible conditions under which the system may have to operate. As of today, there are no standard benchmarks available for a biometric system.

Biometric      Factor                 Controllability
Face           Illumination           High
               Distance               High
               Orientation            Medium
               Cosmetics              Low
               Facial expressions     Low
Fingerprint    Illumination           High
               Impression pressure    Medium
               Orientation            Medium
Speech         Ambient noise          Medium
               Spectral variations    Low
               Device variations      Medium

Table 7.1: Various factors affecting the performance of different biometric systems.

The performance of a biometric system can be quantified by various measures such as FAR, FRR, the receiver operating curve, etc. However, these performance metrics are all dependent on the database available for training and testing and the conditions under which the data was collected. Hence, proper care needs to be taken while generating the database. The number of individuals in the database should be large enough to adequately represent the population, and enough samples must be available for each individual [57]. If these conditions are met, then the resulting performance metric can be a fair estimate of the true performance of the system when used with the entire population.

We shall make use of the receiver operating characteristics (ROC) to assess the performance of our system. The ROC curve is a plot of the authentic acceptance rate (1 - FRR) against the FAR. The ROC curve is obtained by varying the operating point of the system from a point where the FRR is 100% to a point where the FAR is 100%. Ideally, we would like to have a system with a 100% authentic acceptance rate for all values of FAR. This would imply that a genuine individual is never rejected. The ROC curve for such a system is depicted in Figure 7.1. In practice, due to the inherent variability in the data acquisition process, it is impossible for a system to have such an ROC curve. The objective is then to design a system with an ROC curve that is as close as possible to the ROC curve depicted in Figure 7.1.

[Figure 7.1: An ideal receiver operating characteristics curve; the authentic accept rate (%) stays at 100 for all values of the false accept rate (%).]

7.1 Database

We have collected a database of faces, fingerprints and speech samples from 50 individuals. For each of these 50 individuals, we have collected 9 samples of their frontal facial images, 10 samples of their fingerprints from a single finger, and 24 samples of their speech. The frontal facial images were obtained using a Panasonic camera under adequate light and against a plain background. The images were digitized using a frame-grabber from Data Translation and were re-sampled and cropped from their original size of 640 x 480 to the desired size of 80 x 90. Finally, the images were normalized to have zero mean. The subjects were asked to vary the orientation of the head and the direction of their gaze by small amounts across each of the nine samples. A typical set of face images obtained for an individual is depicted in Figure 7.2.

[Figure 7.2: A typical example of the nine face images acquired for each user.]
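As an illustration of the face preprocessing just described (the thesis does not state which resampling method was used, and the names below are invented), the following sketch reduces a digitized 640 x 480 frame to the 80 x 90 working size by nearest-neighbour subsampling and normalizes it to zero mean.

    import numpy as np

    def preprocess_face(frame, out_rows=90, out_cols=80):
        """Re-sample and crop a 640 x 480 gray-level frame to 80 x 90
        pixels (width x height assumed) and remove the mean gray level."""
        img = np.asarray(frame, dtype=float)
        rows = np.linspace(0, img.shape[0] - 1, out_rows).astype(int)
        cols = np.linspace(0, img.shape[1] - 1, out_cols).astype(int)
        small = img[np.ix_(rows, cols)]   # nearest-neighbour re-sampling
        return small - small.mean()       # zero-mean normalization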
The fingerprint images were obtained using a live-scan fingerprint scanner manufactured by Digital Biometrics. The fingerprint acquisition process was supervised in terms of controlling the position, orientation and quality of the fingerprints. Consequently, all of the acquired fingerprints are of reasonable quality. Figure 7.3 depicts typical fingerprints in the database.

[Figure 7.3: A typical example of the ten fingerprint images of good quality acquired for each user.]

Our speech database consists of 24 samples per speaker: two sets of 12 samples each, collected over two sessions held approximately a week apart. The recordings were done in a laboratory environment using a Labtec microphone employing noise reduction technology and a Sound Blaster audio card. The resulting speech samples have a very high signal-to-noise ratio. Figure 7.4 depicts a typical speech sample consisting of the four digits one, two, seven and nine.

[Figure 7.4: The waveform of a typical speech sample in our database.]

The database described above was used for generating the face, fingerprint and speech templates for the users and for training the decision fusion schemes. Next, for a group of 25 people, additional sets of data were collected over multiple sessions. In each such session, 5 samples each of face, fingerprint and speech were collected. On average, about 3 such sessions were conducted for each of the 25 users. Finally, a database of impostors was generated by pooling together face, fingerprint and speech samples of individuals not enrolled in the system. The performance of the system was also evaluated on this database of impostors. The genuine and the impostor databases were pooled together to evaluate the performance of the individual and the integrated systems.

7.2 System Training

The first step in the evaluation process is to generate the impostor and genuine distributions for the face, fingerprint and speech matching scores. We briefly discuss the methodologies used for generating the impostor and genuine distributions for the individual biometrics. We note that the matching scores for face, fingerprint and speech each have their own ranges. Hence, we normalize the genuine and impostor distributions for each biometric to the range [0, 100].

Face

In order to generate the genuine and impostor matching scores, the leave-one-out method was employed. A face template for a user is obtained by using eight of the nine facial images; the image left out is used for computing the genuine and impostor matching scores. The process is then repeated for each of the nine images. Thus, for a database of 50 individuals with 9 face images per individual, the comparison of each image with the 50 templates results in one genuine matching score and 49 impostor matching scores. There are a total of 450 facial images, resulting in a total of 450 genuine scores and 22,050 impostor scores. The genuine and impostor distributions for face matching are depicted in Figure 7.5.

[Figure 7.5: The genuine and impostor distributions for face matching.]
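The leave-one-out bookkeeping just described can be sketched as follows. This is an illustration, not our implementation: build_template and match stand in for eigenface template construction and matching, and images_per_user maps each of the 50 users to a list of their nine face images.

    import numpy as np

    def leave_one_out_scores(images_per_user, build_template, match):
        """Return (genuine, impostor) score arrays. With 50 users and nine
        images each, this yields 450 genuine and 22,050 impostor scores."""
        genuine, impostor = [], []
        # Full nine-image templates serve as the impostor templates.
        full = {u: build_template(imgs) for u, imgs in images_per_user.items()}
        for u, imgs in images_per_user.items():
            for k, probe in enumerate(imgs):
                # Genuine score: template built from the remaining eight images.
                own = build_template(imgs[:k] + imgs[k + 1:])
                genuine.append(match(probe, own))
                # Impostor scores: probe against every other user's template.
                impostor += [match(probe, t) for v, t in full.items() if v != u]
        return np.array(genuine), np.array(impostor)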
Fingerprint

Out of the ten fingerprints available for each user, we have manually selected one fingerprint as the template. The manual selection of the template fingerprint is a reasonable choice, as it enables us to represent each individual using the most expressive sample. The remaining nine fingerprints are used to generate the impostor and the genuine distributions for the matching scores in a manner similar to that described for the face verification system. The genuine and impostor distributions for fingerprint matching are depicted in Figure 7.6.

[Figure 7.6: The genuine and impostor distributions for fingerprint matching.]

Speech

Amongst the 24 samples for each user, the 12 samples from one session were used for generating the hidden Markov models for that user, and the 12 samples from the second session were used for generating the genuine and impostor distributions in a manner similar to the face verification system. The genuine and impostor distributions for speaker verification are depicted in Figure 7.7.

[Figure 7.7: The genuine and impostor distributions for the speaker verification system.]

The above distributions were then used to train the two decision fusion schemes discussed in Chapter 5 in the following fashion.

Neyman-Pearson Rule

The training procedure for the Neyman-Pearson rule essentially involves an exhaustive search over the 100 x 100 x 100 cells of the discretized space defined by the three discrete probability distributions. The objective is to define a region wherein the false acceptance rate is less than the desired value and then define a test on the likelihood ratio, $\lambda$. The Neyman-Pearson lemma [49] guarantees that the likelihood ratio test, as defined in Equation 5.7, with this $\lambda$ will be the most powerful test, resulting in the smallest possible false reject rate for the specified false accept rate. We simplify the exhaustive search into a linear search by sorting the likelihood ratios for the training data in ascending order and then sequentially adding the sorted data points to the acceptance region without exceeding the predefined false accept rate.

Linear Discriminant Functions

Since the three impostor and genuine distributions are assumed to be independent of each other, the joint impostor and genuine distributions are simply obtained by combining the three univariate distributions in a three-dimensional Euclidean space. In this space the parameters of the linear discriminant function are learned using a perceptron. Since a perceptron can only be trained to minimize the equal error rate, a trial-and-error method is used next, wherein the decision surface obtained through perceptron training is moved parallel to itself till the desired FAR is achieved. Although such an ad hoc method is unlikely to be optimal, experimental results demonstrate that the method works reasonably well.
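The linear discriminant training just described can be sketched as follows. The perceptron update rule is standard; the learning rate, the epoch count, and the use of an impostor-score quantile to shift the surface to the target FAR are illustrative choices, not the exact procedure used in our experiments.

    import numpy as np

    def train_linear_fusion(genuine, impostor, target_far, epochs=100, lr=0.01):
        """Learn the weights (a1, a2, a3) and threshold a0 of Eq. 5.11 on
        3-D score triples, then move the decision surface parallel to
        itself (only a0 changes) until the training FAR meets target_far."""
        genuine = np.asarray(genuine, dtype=float)
        impostor = np.asarray(impostor, dtype=float)
        X = np.vstack([genuine, impostor])
        y = np.hstack([np.ones(len(genuine)), -np.ones(len(impostor))])
        w, a0 = np.zeros(3), 0.0
        for _ in range(epochs):                    # classical perceptron passes
            for xi, yi in zip(X, y):
                if yi * (w @ xi - a0) <= 0:        # misclassified sample
                    w += lr * yi * xi
                    a0 -= lr * yi
        # Parallel shift: choose a0 so that only a fraction target_far of
        # the impostor projections falls above the surface.
        a0 = np.quantile(impostor @ w, 1.0 - target_far)
        return w, a0

A claim with score triple x is then accepted when w @ x > a0, which is the test of Equation 5.11.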
7.3 Performance Evaluation

The purpose of this thesis is to demonstrate that the FAR and FRR of an automatic personal verification system can be improved by integrating multiple biometrics. Consequently, for each decision fusion scheme, we first evaluate the performance of automatic personal verification systems based on each of the three biometrics by plotting the ROC curves for each system. Next, we evaluate the performance of the integrated system by plotting its ROC curve and demonstrate the gain in performance. The test data collected for the 25 individuals along with the impostor database was used to evaluate the various systems.

7.3.1 Performance of the Neyman-Pearson Decision Fusion Scheme

The training procedure described above results in one likelihood ratio test for each desired value of the false accept rate. The test data is now used to evaluate the false accept rate and the false reject rate for each test on previously unseen data. The resulting receiver operating characteristics for the various systems are depicted in Figure 7.8.

[Figure 7.8: The receiver operating characteristics for the Neyman-Pearson rule; authentic acceptance rate (%) versus false acceptance rate (%) for face, fingerprint, speech, and fusion.]

7.3.2 Performance of the Linear Discriminant Function Decision Fusion Scheme

The various linear discriminant functions obtained during the training phase are used along with the test data for performance evaluation. The resulting receiver operating characteristics for the various systems are depicted in Figure 7.9.

[Figure 7.9: The receiver operating characteristics for the linear discriminant functions; authentic acceptance rate (%) versus false acceptance rate (%) for speech, face, fingerprint, and fusion.]

7.3.3 Verification Speed

Since the integrated system is intended to operate in real time, it is necessary that the response time of the system be of the order of a few seconds. The average wall-clock time required by our system for one verification session on a Pentium 200MHz machine running Windows NT 4.0 is summarized in Table 7.2.

                 Face    Fingerprint    Speech    Integrated System
Time (seconds)   0.5     1.5            0.75      3.0

Table 7.2: Wall-clock times for the various verification systems.
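Receiver operating characteristics such as those in Figures 7.8 and 7.9 can be traced by sweeping the decision threshold over the pooled test scores. The following sketch (names invented) computes one (FAR, authentic acceptance rate) point per candidate threshold; for the fusion schemes, the "score" would be the likelihood ratio or the linear discriminant value rather than a raw matching score.

    import numpy as np

    def roc_curve(genuine_scores, impostor_scores):
        """Return FAR and authentic acceptance rate (1 - FRR), both in
        percent, as the decision threshold sweeps across all observed
        score values from accept-everyone to reject-everyone."""
        g = np.asarray(genuine_scores, dtype=float)
        im = np.asarray(impostor_scores, dtype=float)
        thresholds = np.unique(np.concatenate([g, im]))
        far = np.array([(im >= t).mean() * 100.0 for t in thresholds])
        aar = np.array([(g >= t).mean() * 100.0 for t in thresholds])
        return far, aar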
7.4 Summary

Obtaining a reliable estimate of the performance of a biometric system is a rather difficult task. To ensure that the estimated performance figures are reliable, the database used for evaluation must contain enough samples to adequately represent the intended population and the intended operating environment. Also, care must be taken while collecting the data to ensure that there is no bias towards or against any one of the three biometrics. The entire database is divided into three components: (i) data to be used for generating the templates, (ii) data to be used for training the various decision fusion schemes, and (iii) data to be used for evaluating the generalization ability of the trained systems.

In this chapter we have presented the experimental results for the two proposed decision fusion schemes on a limited database collected in a laboratory environment and from cooperative users. For this database, we have demonstrated that the proposed decision fusion schemes are able to significantly improve the performance of the integrated system over the individual biometric systems. The integrated system has a response time of approximately three seconds on a Pentium 200MHz machine running Windows NT 4.0. With code optimization and a faster processor, the system is capable of operating in real time.

Chapter 8

Summary and Future Research

In this chapter we summarize the work we have done, discuss the limitations of our proposed approach, and propose some directions for future research.

8.1 Summary

Biometrics, which is defined as the use of human physiological and behavioral characteristics to establish and/or verify the identity of an individual, is poised to become the popular security standard in the near future. It is inherently more secure than knowledge-based or token-based security systems, since it relies upon something that one is rather than something that one knows (e.g., passwords) or something that one has (e.g., magnetic stripe cards) to make an identification/verification. A number of different physiological and behavioral characteristics have been studied and identified as potential candidates for a biometric system. Amongst them, face, fingerprint, and speech are the most widely applied and accepted biometric techniques. However, each of these three biometrics has its limitations: face is the least accurate biometric; the performance of fingerprints is acceptable only when the fingerprint images obtained are of good quality; and the performance of speech depends heavily upon the operational environment and the data acquisition equipment. Thus, systems based on a single biometric are not able to achieve the performance required of a practical system. It is expected that a system that integrates these three biometrics can overcome the limitations of the individual biometrics and be suitable for practical use. Furthermore, such a system is also applicable to a larger target population.

In this thesis we have designed and implemented a real-time, fully automatic personal identity verification system that integrates the following three biometrics: face, fingerprint, and speech. The system uses the eigenface method of Turk and Pentland [21] for face recognition/verification, the alignment-based minutiae matching algorithm of Jain et al. [2] for fingerprint verification, and continuous mixture density hidden Markov models for speaker verification. A video camera is used to obtain frontal images of the individual's face under suitable lighting conditions and against a plain background. An optical fingerprint scanner is used to obtain the fingerprint images. The face and the fingerprint images are digitized using a frame grabber; the resulting images are of size 640 x 480. Speech is recorded using a noise cancellation microphone and a standard sound card, sampled at a frequency of 8000 Hz and at a resolution of 16 bits per sample. The data thus obtained is then processed by the respective sub-systems, resulting in three matching scores. These matching scores are then fed to a decision fusion module which uses one of the two proposed schemes to make the final "Accept/Reject" decision. Suitable graphical user interfaces have been provided to make the verification system user-friendly and to provide real-time feedback during the data acquisition process.

Experimental results have been obtained on a limited database collected under a controlled laboratory environment and from cooperative users. These results demonstrate that our system achieves the desired improvement in verification accuracy over the individual systems while maintaining the response-time requirements.
8.2 Future Research

In our system, face verification can only be done under a controlled operating environment. This is mainly because of the lack of a reliable and fast face detection algorithm. The practicality of the system can be increased by incorporating a more sophisticated face detection algorithm.

The performance of the face verification system is the worst amongst the three biometrics. The primary reason for this is the time elapsed between the training data collection and the test data collection. Figure 8.1 illustrates one instance of the difference between training and test data for the face system. Such changes are undesirable but unavoidable for a practical system.

[Figure 8.1: Example of the temporal variation present between training and test face images.]

Intra-class variations introduced by time affect the performance of all three biometrics, but to varying extents. Faces and speech are affected the most, whereas fingerprints are affected only to a limited extent. It is, however, possible to overcome these variations by employing some kind of template adaptation technique: after each verification session, the test data is used to update the stored template. If the system is used regularly, then the temporal changes can be tracked and the performance can be maintained at the desired level. (A minimal sketch of one such update is given at the end of this section.)

Decision fusion is currently being performed at the measurement level. It is possible to perform decision fusion at the feature level; it is expected that a classifier designed to operate in the combined feature space is likely to achieve a higher performance.

The integrated system is, at best, a useful tool for demonstration purposes. A number of enhancements need to be made to make it a commercially viable system. It would also be desirable to make a developer's toolkit available that could be used by application developers to easily integrate multimodal biometric authentication into their applications.

Finally, we note that the performance of our system has been tested on a limited database acquired in a laboratory environment. The performance on a larger database consisting of a few thousand users remains to be evaluated. We are currently in the process of combining a number of different face, fingerprint and speech databases so as to generate a larger and a more realistic database for testing our system.
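The template adaptation idea outlined above admits many concrete forms. One simple possibility, shown here purely as an illustration with an assumed blending weight, is to nudge the stored template (e.g., the eigenspace projection vector) toward the features of each successfully verified sample so that gradual temporal changes are tracked.

    import numpy as np

    def adapt_template(stored, verified_features, weight=0.1):
        """Blend the stored template toward a newly verified sample.
        The weight is an assumption: too large a value lets the template
        drift quickly and risks adapting toward a lucky impostor."""
        stored = np.asarray(stored, dtype=float)
        sample = np.asarray(verified_features, dtype=float)
        return (1.0 - weight) * stored + weight * sample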
Bibliography

[1] E. Newham, "The biometric report," tech. rep., SJB Services, New York, 1995.

[2] A. K. Jain, L. Hong, S. Pankanti, and R. Bolle, "An identity-authentication system using fingerprints," Proceedings of the IEEE, vol. 85, pp. 1365-1388, September 1997.

[3] J. R. Deller, J. G. Proakis, and J. H. Hansen, Discrete-Time Processing of Speech Signals. New York: Macmillan Pub. Co., 1993.

[4] A. Jain, L. Hong, and Y. Kulkarni, "A multimodal verification system using faces, fingerprints and speech," in Proceedings of AVBPA '99, (Washington, DC), March 1999. To be published.

[5] R. Brunelli and D. Falavigna, "Person identification using multiple cues," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 17, pp. 955-966, October 1995.

[6] U. Dieckmann, P. Plankensteiner, and T. Wagner, "SESAM: A biometric person identification system using sensor fusion," Pattern Recognition Letters, vol. 18, pp. 827-833, September 1997.

[7] B. Duc, E. S. Bigun, J. Bigun, G. Maitre, and S. Fischer, "Fusion of audio and video information for multi modal person authentication," Pattern Recognition Letters, vol. 18, pp. 835-843, September 1997.

[8] L. Hong and A. Jain, "Integrating faces and fingerprints for personal identification," IEEE Trans. on Pattern Analysis and Machine Intelligence, December 1998. To appear.

[9] S. Young, J. Odell, D. Ollason, V. Valtchev, and P. Woodland, The HTK Book. Cambridge, England: Entropic Cambridge Research Laboratories, 1997.

[10] Entropic, Entropic Speech Technology. http://www.entropic.com, 1998.

[11] A. Jain, L. Hong, and Y. Kulkarni, "F2ID: A personal identification system using faces and fingerprints," in Proceedings of the ICPR, vol. 2, (Brisbane, Australia), pp. 1373-1375, August 1998.

[12] R. Chellappa, C. Wilson, and A. Sirohey, "Human and machine recognition of faces: A survey," Proceedings of the IEEE, vol. 83, pp. 705-740, May 1995.

[13] K. Sung and T. Poggio, "Example-based learning for view-based human face detection," A.I. Memo 1521, MIT A.I. Laboratory, 1994.

[14] T. K. Leung, M. C. Burl, and P. Perona, "Finding faces in cluttered scenes using random labelled graph matching," in Proceedings of ICCV '95, (Cambridge, MA), 1995.

[15] H. Rowley, S. Baluja, and T. Kanade, "Neural network-based face detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 1, pp. 23-38, 1998.

[16] M. Lew and N. Huijsmans, "Information theory and face detection," in Proceedings of ICPR '96, (Vienna, Austria), pp. 601-610, 1996.

[17] A. Colmenarez and T. Huang, "Face detection with information-based maximum discrimination," in Proceedings of CVPR '97, (San Juan, Puerto Rico), 1997.

[18] K. C. Yow and R. Cipolla, "Feature-based human face detection," Image and Vision Computing, vol. 15, no. 9, pp. 713-735, 1997.

[19] S. H. Lin, S. Y. Kung, and L. J. Lin, "Face recognition/detection by probabilistic decision-based neural network," IEEE Transactions on Neural Networks, vol. 8, no. 1, pp. 114-131, 1997.

[20] N. Duta and A. K. Jain, "Learning the human face concept from black and white pictures," in Proceedings of the 14th ICPR, vol. 2, (Brisbane, Australia), pp. 1365-1367, August 1998.

[21] M. Turk and A. Pentland, "Eigenfaces for recognition," Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71-86, 1991.

[22] J. Zhang, Y. Yan, and M. Lades, "Face recognition: Eigenface, elastic matching and neural nets," Proceedings of the IEEE, vol. 85, pp. 1423-1435, September 1997.

[23] D. L. Swets and J. Weng, "Using discriminant eigenfeatures for image retrieval," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, pp. 831-836, August 1996.

[24] R. Brunelli and T. Poggio, "Face recognition: Features versus templates," IEEE Trans. PAMI, vol. 15, no. 10, pp. 1042-1052, 1993.

[25] L. Hong, Personal Identification Using Fingerprints. PhD thesis, Department of Computer Science and Engineering, Michigan State University, 1998.

[26] N. Ansari, M. H. Chen, and E. S. H. Hou, "A genetic algorithm for point pattern matching," in Dynamic, Genetic, and Chaotic Programming. John Wiley and Sons, 1992.

[27] H. Baird, Model Based Image Matching Using Location. Cambridge, MA: MIT Press, 1984.

[28] S. H. Chang, F. H. Cheng, W. H. Hsu, and G. Z. Wu, "Fast algorithm for point pattern matching: Invariant to translations, rotations, and scale changes," Pattern Recognition, vol. 30, no. 2, pp. 321-339, 1997.

[29] M. Eleccion, "Automatic fingerprint identification," IEEE Spectrum, vol. 10, no. 9, pp. 36-45, 1973.

[30] A. Ranade and A. Rosenfeld, "Point pattern matching by relaxation," Pattern Recognition, vol. 12, no. 2, pp. 269-275, 1983.

[31] N. Ratha, K. Karu, S. Chen, and A. K. Jain, "A real-time matching system for large fingerprint databases," IEEE Transactions on PAMI, vol. 18, no. 8, pp. 799-813, 1996.
[32] L. Rabiner and B. Juang, Fundamentals of Speech Recognition. Prentice Hall, 1993.

[33] J. P. Campbell, "Speaker recognition: A tutorial," Proceedings of the IEEE, vol. 85, pp. 1437-1462, September 1997.

[34] L. F. Lamel, L. R. Rabiner, A. E. Rosenberg, and J. G. Wilpon, "An improved endpoint detector for isolated word recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-29, pp. 777-785, August 1981.

[35] J. G. Wilpon, L. R. Rabiner, and T. Martin, "An improved word-detection algorithm for telephone-quality speech incorporating both syntactic and semantic constraints," AT&T Bell Labs Technical Journal, vol. 63, pp. 479-498, March 1984.

[36] S. Furui, "Cepstral analysis technique for automatic speaker verification," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-29, pp. 254-272, April 1981.

[37] B. Atal, "Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification," Journal of the Acoustical Society of America, vol. 55, no. 6, pp. 1304-1312, 1974.

[38] L. R. Rabiner, B. H. Juang, S. E. Levinson, and M. M. Sondhi, "Recognition of isolated digits using hidden Markov models with continuous mixture densities," AT&T Technical Journal, vol. 64, pp. 1211-1234, July-August 1985.

[39] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, pp. 257-286, February 1989.

[40] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. Roy. Stat. Soc., vol. 39, no. 1, pp. 1-38, 1977.

[41] G. D. Forney, "The Viterbi algorithm," Proc. IEEE, vol. 61, pp. 268-278, March 1973.

[42] A. E. Rosenberg, J. DeLong, C.-H. Lee, B.-H. Juang, and F. K. Soong, "The use of cohort normalised scores for speaker verification," in Proc. ICSLP '92, (Banff, Canada), October 1992.

[43] A. E. Rosenberg and S. Parthasarathy, "Speaker background models for connected digit password speaker verification," in Proc. ICASSP '96, (Atlanta, GA), pp. 81-84, May 1996.

[44] J. Colombi, D. Ruck, S. Rogers, M. Oxley, and T. Anderson, "Cohort selection and word grammar effects for speaker recognition," in Proc. ICASSP '96, (Atlanta, GA), pp. 85-88, May 1996.

[45] J. P. Campbell, Jr., "Testing with the YOHO CD-ROM voice verification corpus," in Proc. ICASSP '95, (Detroit, MI), pp. 341-344, 1995.

[46] C. Che and Q. Lin, "Speaker recognition using HMM with experiments on the YOHO database," in Proc. EUROSPEECH, (Madrid, Italy), pp. 625-628, 1995.

[47] A. K. Jain, R. Bolle, and S. Pankanti, eds., Biometrics: Personal Identification in Networked Society, ch. 16. Kluwer Academic Publishers, 1998.

[48] Y. A. Zuev and S. K. Ivanov, "The voting as a way to increase the decision reliability," in Proc. Foundations of Information/Decision Fusion with Applications to Engineering Problems, (Washington, DC), pp. 206-210, August 1996.

[49] T. Y. Young and T. W. Calvert, Classification, Estimation and Pattern Recognition. New York: American Elsevier Publishing Company, Inc., 1974.

[50] Miros, TrueFace by Miros. http://www.miros.com, 1998.

[51] Visionics, Welcome to the Visionics Website. http://www.visionics.com, 1998.

[52] Identicator, Welcome to Identicator Technology! http://www.identicator.com, 1998.
[53] American Biometric Company, American Biometric Company Home Page. http://www.biomouse.com, 1998.

[54] Veritel Corporation, Veritel Corporation of America Homepage. http://www.veritelcorp.com, 1998.

[55] T-Netix Inc., T-Netix Home Page. http://www.t-netix.com, 1998.

[56] ITT Industries, SpeakerKey by ITT Industries. http://www.speakerkey.com, 1998.

[57] J. G. Daugman and G. O. Williams, "A proposed standard for biometric decidability," in Proc. CardTech/SecurTech Conference, (Atlanta, GA), pp. 223-234, 1996.

[58] S. G. Davies, "Touching Big Brother: How biometric technology will fuse flesh and machine," Information Technology and People, vol. 7, no. 4, pp. 60-69, 1994.