Heterogeneous Face Recognition

By

Brendan F. Klare

A Dissertation Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Computer Science and Engineering

2012

Abstract

Heterogeneous Face Recognition

By Brendan F. Klare

One of the most difficult challenges in automated face recognition is computing facial similarities between face images acquired in alternate modalities. Called heterogeneous face recognition (HFR), successful solutions to this recognition paradigm would allow the vast collection of face photographs (acquired from driver's licenses, passports, mug shots, and other sources of frontal face images) to be matched against face images from alternate modalities (e.g., forensic sketches, infrared images, aged face images). This dissertation offers several contributions to heterogeneous face recognition algorithms.

The first contribution is a framework for matching forensic sketches to mug shot photographs. In developing a technique called Local Feature-based Discriminant Analysis (LFDA), we are able to significantly improve sketch recognition accuracies with respect to a state-of-the-art commercial face recognition engine. The improved accuracy of LFDA allows for facial searches of criminal offenders using a hand-drawn sketch based on a verbal description of the subject's appearance, called a forensic sketch.

The second contribution of this dissertation is a generic framework for heterogeneous face recognition. By representing images from alternate modalities with their non-linear similarities to a set of prototype subjects, each of whom provides an image in each of the corresponding modalities, the need to directly compare face images from alternate modalities is eliminated. This property generalizes the algorithm, called Prototype Random Subspaces (P-RS), to any HFR scenario. The viability of this algorithm is demonstrated on four separate HFR databases (near infrared, thermal infrared, forensic sketch, and viewed sketch).

The third contribution of this dissertation is a large scale examination of face recognition algorithms in the presence of aging. We study whether or not aging-invariant face recognition algorithms generalize to non-aging scenarios. By demonstrating that they do not generalize, we conclude that the heterogeneous appearance between face images of a subject who has aged places the aging-invariant face recognition problem in the same category as heterogeneous face recognition. That is, much like images acquired in alternate modalities, aged face images should be matched using specially trained algorithms.

The fourth contribution of this dissertation is an examination of how heterogeneous demographics (i.e., gender, race, and age) affect the recognition accuracy of face recognition systems. Using six different face recognition systems (including commercial systems, non-trainable systems, and a trainable face recognition system), the experiments conclude that all systems have a consistently lower recognition accuracy on the following demographic cohorts: (i) females, (ii) black subjects, and (iii) young subjects. This study also examined whether or not recognition accuracy could be improved for a specific demographic cohort by training a system exclusively on that cohort.

The fifth contribution of this dissertation is an examination of the problem of identifying a subject from a caricature.
A caricature is a facial sketch that exaggerates identifiable facial features beyond realism, yet humans still have a profound ability to identify subjects from their caricature sketches. We study automated caricature recognition with the intent of discovering improved facial feature representations for face recognition as a whole. To enable this task, we propose a set of qualitative facial features that encodes the appearance of both caricatures and photographs. We utilize crowdsourcing to assist in the labeling of the qualitative features. Using these features, we combine logistic regression, multiple kernel learning, and support vector machines to generate a similarity score between a caricature and a facial photograph. Experiments are conducted on a dataset of 196 pairs of caricatures and photographs, which we have made publicly available.

To my loving wife, Christina

Acknowledgments

My efforts towards this thesis would have been in vain had it not been for the contributions of a wide range of people. Foremost on the list of those to whom I am indebted is my advisor, Anil Jain (or as I refer to him, "Professor Jain"). I am not a superstitious person; instead, I believe that randomness (and hence, luck) shapes our life experiences. With this perhaps being the case, I consider myself infinitely lucky to have crossed paths with Professor Jain. While the more celebrated achievements of his career are clearly evident when reading his CV or doing any Google Scholar search on the topic of biometrics or pattern recognition, I believe that his passion and innate talent for mentoring and developing his students is his greatest quality. The causality is not dubious; the success of Professor Jain's students is the result of his hard work and passion towards developing them into greater researchers, engineers, and scientists. Professor Jain has taught me the value of collaborating with fellow researchers and with those who support our academic fields in an administrative capacity, such as program managers and government employees. I have learned that great research often involves looking ahead of your peers, and trying to find solutions to a problem before people realize the problem even exists. I have learned that it does not matter how great a particular algorithm or body of research is if its efficacy has been demonstrated with a flawed experimental design. I have implicitly been taught by Professor Jain to adhere to Occam's razor when developing a solution to a given problem. Finally, despite often being privy to catered lunches and Mrs. Jain's sweets in the lab, I have painfully learned the persistence of the "no free lunch" theorem, as the joys of research are constantly balanced by trying times. The most important thing I have learned from Professor Jain is that nothing trumps passion for your profession. Professor Jain is arguably the smartest person I know; however, his intelligence alone did not make him the person we all look up to. It is his day in, day out passion for research. I hope never to forget his example, and to always approach my profession with great passion.

Success in one's career is all for naught without someone to share it with. To this end, my hard work is always with the knowledge that at the end of the day I get to come home to my truly better half: my wife Christina. Having met Christina is the only thing that prevents me from completely denouncing fate, because I do not feel anyone is this lucky.
No matter what I achieve in the fields of pattern recognition, biometrics, and beyond, I will first and foremost be defined by Christina and our amazing relationship. I would not trade anything in the world for Christina, including this dissertation. The only joy greater than reflecting on the nearly six years we have spent together is thinking of the rest of our lives we have ahead of us.

While Professor Jain had a profound influence on the last four years of my life, my father, Fred Klare (or "Dad"), has had the largest single influence in my life. In life, one does well to follow the proverb "you should not count your eggs until they are hatched". However, if those eggs were in fact my father, this sage advice would not be necessary. You can always count on my Dad. Always. He is the most consistent and dependable person I will ever know in my life. My father is also endlessly selfless. He lives for his four children, and never showed it more than with the sacrifices he made after the untimely passing of our mother, Anna Klare, when I was around 13 years old. My mother was a truly amazing person herself: no one who met her ever seemed to forget her. I owe my (arguable) talents in math, and my intense personality (for better or worse), to her. However, after her passing, my father was left to raise four children while maintaining his career as a Special Agent in the Secret Service. No matter how much I would veer from the path of being a hard-working citizen, my father would consistently (and often painfully) remind me of this path. I have no doubt that I would not be completing a PhD had it not been for my father sacrificing so much time and effort in helping me understand the difference between right and wrong.

I am fortunate to have an amazing family. My younger sisters Kelly and Kristy are the best sisters anyone could ask for. My brother Kevin has always been there to give me advice and opinions when I need them. My Aunt Gracie has meant a great deal to our family, and I am very grateful for how much she has done to help get my mind off of tough times over the years. My step-mother Connie and my step-brother Erik make our family so much better. Christina's mother Diane Pearson has been so good to me over the years; I truly feel comfortable calling her "Mom". It has been wonderful calling the rest of Christina's family my own as well, including Rick, Andrew, Tony, Gloria, Nanny, and Poppop.

When I was 19 years old I dropped out of school and joined the Army Rangers, mostly at the advice of my father. Four years later I left a changed man who was ready for any challenge in life. My time in the 75th Ranger Regiment was profound. The rangers I served with taught me how to lead by example. They taught me self-sacrifice; many by paying the ultimate sacrifice to the airborne ranger in the sky. They forged in me the mindset to always strive to be better than anyone else, yet always be humble in any such endeavors. In 3rd Ranger Battalion I met some of my best friends in life (e.g., Chris K., Tony V., Andrew S., Paul O.). Finally, I learned that RLTW: Rangers Lead the Way. As I will always consider myself a ranger, my subsequent academic studies have all been with the mindset of RLTW.

When I left the Army, I applied to eight different universities. While I was not a serious student before I joined the Army, I was indeed quite focused and determined afterwards. Such was the message of my application letter. Only one institution believed me: the University of South Florida.
I am fully indebted to USF for taking a chance on me. I tried to reward them by getting an A in every course I took, but I fell short with a lone B+. My original plan upon graduating with my B.S. in Computer Science from USF was to get a position in industry. However, Christina still had a year remaining in her studies, so I decided to pursue an M.S. degree at USF. During this time I was able to work with Sudeep Sarkar. Working with Professor Sarkar allowed me to discover my love for research. I am very thankful to Professor Sarkar, and I learned a tremendous amount from him. Had I not taken the undergraduate computational geometry course from him my senior year (in which he did one of the finest teaching jobs I have ever seen), it is unlikely I would have ended up in the field of pattern recognition.

Upon arrival at the Pattern Recognition and Image Processing (PRIP) Lab at Michigan State, I quickly realized that my fellow lab mates were well trained in Professor Jain's ideals. There is a phrase for such scholars: pripies. I am privileged to be able to call myself a pripie. The two pripies who had the greatest influence on my studies are Dr. Unsang Park and Serhat Bucak. Unsang helped guide me through the vast field of face recognition when I arrived. I could always count on Unsang's help with any problem I had during my studies. When Unsang left for greener pastures, I felt like I lost my safety net in the lab. Serhat joined the lab as a PhD student when I arrived in Fall 2008. I will likely look back at our time here and think of our common love of Spartan basketball (which was kindly rewarded with two trips to the Final Four), his help guiding Christina and me through the streets of Istanbul during our honeymoon, and the enjoyment of being his friend. However, Serhat also played an invaluable role in my research. I consider Serhat's knowledge of pattern recognition to be second to none. Because of this, I have consistently relied on Serhat's feedback and advice in my studies.

I am very grateful to the many other pripies in the lab, including (but not limited to): Alessandra Paulino, Soweon Yoon, Abhishek Nagar, Pavan Mallapragada, Radha Chitta, Kate Bonnen, Scott Klum, Shengcai Liao, Zhifeng Li, Hu Han, Qijun Zhao, Tim Havens, Koichiro Niinuma, and Jianjiang Feng. I need to give an extra mention to Alessandra, who (thankfully) always seemed to catch the smallest of my mistakes. I am very thankful to have been able to work with Kate as well. While Kate is my peer, I learned a lot through my fortunate role of helping to advise her studies. Kate also spent considerable time reviewing this thesis.

I am very grateful to have such a capable and cooperative thesis committee. Professor Rong Jin, Professor Yiying Tong, and Professor Selin Aviyente have been a great help during my studies. I am truly proud to have them serving as committee members for this dissertation. Professor Jin and Professor Tong have each given me specific advice on different papers that comprise this thesis. Professor Emeritus George Stockman was also quite important in the early stages of this thesis.

Many outside collaborators have made this thesis possible. No one has been a bigger asset and colleague than Scott McCallum. His fingerprints are on nearly every project in this thesis. Unfortunately, all I have done in return was teach him the joy of an Oberon.
Others who have helped greatly along the way include: Scott Swann, Kelly Faddis, Richard Vorder Bruegge, John Manzo, Greg Michaud, Lois Gibson, Tayfun Akgul, Professor Arun Ross, Sheila Meese, Catyana Sawyer, Paul Moody, Mark Burge, and Josh Klontz. Mark Burge in particular has been a tremendous mentor and colleague during the past three years, and I am very thankful for the time I have spent working with him.

I want to thank all my friends. Sometimes my better ideas would come after enjoying an IPA from any of the fine craft breweries in the great state of Michigan. Christian Weeder, Adam Dorr, Jim Marr, Tom Robinson, Joe Sanchez, Jake Flynn, Srijana Pradhan, Kate Flynn, and many others have given me the proper balance I have needed between studies and a social life. Kent and Michaelle Rehmann are the greatest neighbors anyone could ask for. Michael Horton and his tribe at Spartan Crossfit have also been one of the best things that have ever happened to Christina and me, and I am not sure if I could have endured the stress from this last year without blowing off steam at "the box". Thank you everyone! RLTW.

Table of Contents

LIST OF TABLES
LIST OF FIGURES

1 Introduction
1.1 The Lineage of Face Recognition
1.2 Automated Face Recognition Algorithms
1.2.1 Detection, Alignment, and Normalization
1.2.2 Feature Representation
1.2.3 Feature Extraction
1.2.4 Matching
1.3 Heterogeneous Face Recognition
1.4 Contributions
1.5 Thesis Organization

2 Forensic Sketch Recognition
2.1 Introduction
2.2 Related Work
2.3 Feature-based Sketch Matching
2.3.1 Feature-based Representation
2.3.2 Local Feature-based Discriminant Analysis
2.4 Viewed Sketch Matching Results
2.5 Matching Forensic Sketches
2.5.1 Forensic Sketch Database
2.5.2 Human Memory and Forensic Sketches
2.5.3 Forensic Sketch Region Saliency
2.5.4 Large-Scale Forensic Sketch Matching
2.6 Forensic Sketch Matching Results
2.7 Summary

3 Heterogeneous Face Recognition using Kernel Prototype Similarities
3.1 Introduction
3.2 Related Work
3.2.1 Heterogeneous Face Recognition
3.2.2 Kernel Prototype Representation
3.2.3 Proposed Method
3.3 Preprocessing and Representation
3.3.1 Geometric Normalization
3.3.2 Image Filtering
3.3.3 Local Descriptor Representation
3.4 Heterogeneous Prototype Framework
3.4.1 Discriminant Analysis
3.5 Random Subspaces
3.5.1 Motivation
3.5.2 Prototype Random Subspaces
3.5.3 Recognition
3.5.4 Score Level Fusion
3.6 Baselines
3.6.1 Commercial Matcher
3.6.2 Direct Random Subspaces
3.7 Experiments
3.7.1 Databases
3.7.2 Results
3.8 Summary

4 Face Recognition Across Time Lapse
4.1 Introduction
4.2 Dataset
4.3 Random Subspace Face Recognition
4.3.1 Face Representation
4.3.2 Random Subspaces
4.4 Experiments
4.4.1 Computational Demands
4.5 Conclusions

5 Face Recognition Performance: Role of Demographic Information
5.1 Introduction
5.2 Prior Studies and Related Work
5.3 Face Database
5.4 Face Recognition Algorithms
5.4.1 Commercial Face Recognition Algorithms
5.4.2 Non-Trainable Face Recognition Algorithms
5.4.3 Trainable Face Recognition Algorithm
5.5 Experimental Results
5.6 Analysis
5.6.1 Gender
5.6.2 Race
5.6.3 Age Demographic
5.6.4 Impact of Training
5.7 Conclusions
6 Towards Automated Caricature Recognition
6.1 Introduction
6.2 Related Work
6.3 Caricature Dataset
6.4 Qualitative Feature Representation
6.4.1 Level 1 Qualitative Features
6.4.2 Level 2 Features
6.4.3 Feature Labeling
6.5 Matching Qualitative Features
6.5.1 Logistic Regression
6.5.2 Multiple Kernel Learning and SVM
6.6 Image Descriptor-based Recognition
6.7 Experimental Results
6.8 Summary

7 Summary and Conclusions
7.1 Contributions
7.2 Future Work
7.3 Conclusions

APPENDICES
A "R" Transform is a Special Case of Eigen-transform

BIBLIOGRAPHY

List of Tables

Table 1.1: Example features from each of the three different levels of facial features that are used to represent face images by (a) humans, and (b) machines.

Table 2.1: Rank-1 recognition rates for matching viewed sketches using the CUHK public dataset. The standard deviation across the five random splits for each method in the middle and right columns is less than 1%.

Table 2.2: Demographics of the 159 forensic sketch images and the 10,159 mugshot gallery images.

Table 3.1: Rank-1 accuracies for the proposed Prototype Random Subspace (P-RS) method across five recognition scenarios using an additional 10,000 subjects in the gallery.

Table 3.2: Rank-1 accuracies for the proposed Prototype Random Subspace (P-RS) method on a standard photograph to photograph matching scenario using an additional 10,000 subjects in the gallery.

Table 3.3: Effect of each component in the P-RS framework on recognition accuracy. Components tested are LDA, the transformation matrix R, and random subspaces (RS). Listed are the average Rank-1 accuracies for each scenario without the additional 10,000 gallery images.

Table 5.1: Number of subjects used for training and testing for each demographic category. Two images per subject were used. Training and test sets were disjoint. A total of 102,942 face images were used in this study.

Table 5.2: Listed are the true accept rates at a fixed false accept rate of 0.1% for each matcher on the gender demographic.

Table 5.3: Listed are the true accept rates at a fixed false accept rate of 0.1% for each matcher on the race dataset.
Table 5.4: Listed are the true accept rates at a fixed false accept rate of 0.1% for each matcher on the age dataset.

Table 6.1: Average verification accuracies of the proposed qualitative, image feature-based, and baseline methods. Shown are the true accept rates (TAR) at fixed false accept rates (FAR) of 1.0% and 10.0%. Average accuracies and standard deviations were measured over 10 random splits of 134 training subjects and 62 testing subjects (subjects in training and test sets are different).

Table 6.2: Average identification accuracies of the proposed qualitative, image feature-based, and baseline methods. Average accuracies and standard deviations were measured over 10 random splits of 134 training subjects and 62 testing subjects (subjects in training and test sets are different).

List of Figures

Figure 1.1: The reduction in error rates over the past 20 years for state of the art face recognition systems, as benchmarked by the National Institute of Standards and Technology [34]. For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation.

Figure 1.2: Some of the major challenges in automated face recognition. These challenges include (a) heterogeneous face recognition (top row shows face images of subjects in non-visible modalities and the bottom row shows corresponding faces in visible light), (b) unconstrained face recognition (images from [39]), and (c) aging-invariant face recognition (same subject at four different ages).

Figure 1.3: Pose, illumination and expression (PIE) challenges. (a) A face image with controlled pose, expression, and illumination. Face images with variations in (b) facial pose, (c) illumination, and (d) facial expression. These factors are common sources of error in automated face recognition systems.

Figure 1.4: The decrease in accuracy of a leading commercial face recognition algorithm as a function of the time lapse between the probe and gallery images. Measured on a mug shot database containing 94,631 face images of 28,031 subjects, this performance degradation illustrates one of several challenges in face recognition research.

Figure 1.5: Example images from three different heterogeneous face recognition scenarios. The top row contains probe images from (a) near-infrared, (b) thermal infrared, and (c) forensic sketch modalities. The bottom row contains the corresponding gallery (visible band face image) photographs.

Figure 1.6: The human visual system starts with the eyes (top of the image in red), which sense visible light waves. The path of information detected by the eyes ends with the visual cortex of the brain (bottom of the image in red). The visual cortex is the largest component in the brain, and is responsible for such intelligent tasks as object recognition, motion detection, and face recognition.

Figure 1.7: The Thatcher effect. (a) Most people will notice only minor differences between the two inverted face images shown. (b) However, when turned upright, the differences between the same two face images are noticed to be far more severe.
Our reduced ability to see the strong differences between the images in (a) is believed to be because inverted faces do not trigger the fusiform face area of the brain.

Figure 1.8: The common steps utilized by most face recognition algorithms.

Figure 1.9: Different methods for face alignment. (a) Face images before (left column) and after (right column) alignment through planar rotation and scaling. (b) Face images aligned using a morphable model (images from [13]), and (c) a video sequence aligned using a 3D model whose parameters were solved from a structure from motion algorithm (images from [102]).

Figure 1.10: Examples of the three levels of facial features [56]. (a) Level 1 features contain low dimensional appearance information that is useful for determining high-level identifying information such as ethnicity, gender, and the general shape of a face. (b) Level 2 features require detailed processing for face recognition and capture information regarding the structure and specific shape and texture of the face. (c) Level 3 features include marks, moles, scars, and other irregular micro features of the face.

Figure 1.11: A fingerprint image and its (a) Level 1, (b) Level 2, and (c) Level 3 features. The organization of facial features is analogous to such feature levels [56].

Figure 1.12: An example of how Level 1 features can easily filter out faces that exhibit large differences, but cannot distinguish faces that possess many similarities. The probe image in (a) was matched to each gallery image using a Level 1 image pixel representation (difference in PCA features using the Euclidean distance). Note that a larger PCA distance indicates that the faces are less similar. Using this Level 1 representation, the face in (a) matched more closely to an image of a similar looking subject (c) than to its true mate (b), but was easily differentiated from other subjects that looked largely different (d). The information in Level 1 features is sufficient for quickly discarding some subjects (d), but more detailed Level 2 features are needed to discriminate between similar looking subjects (c). These images are from the AR face database [89].

Figure 1.13: Face images of two identical twins. While their Level 1 and Level 2 features are the same, the facial mark information contained in the Level 3 features (shown in red circles) offers discriminating information [60].

Figure 1.14: Most heterogeneous face recognition scenarios leverage large visible light face image databases to determine a subject's identity from their face image acquired in some non-visible modality. (a) Between driver's licenses, passports, and mug shots, a visible light face image exists for a majority of the population. (b) Many forensic and law enforcement scenarios only have face images available from alternate imaging sources such as infrared, LIDAR, or forensic sketches.

Figure 2.1: The difference between viewed sketches and forensic sketches. (a) Viewed sketches and their corresponding photographs. (b) Two pairs of good quality forensic sketches and the corresponding photographs. (c) Two pairs of poor quality forensic sketches and the corresponding photographs.
Sketches were labeled as "good" if they (subjectively) exhibited a mostly accurate portrayal of a subject. Otherwise, if a sketch did not strongly resemble the subject, it was labeled as "poor".

Figure 2.2: An overview of training using the LFDA framework. Each sketch and photo is represented by SIFT and MLBP feature descriptors extracted from overlapping patches. After grouping "slices" of patches together into feature vectors Φ(k) (k = 1, ..., N), we learn a discriminant projection Ψk for each slice.

Figure 2.3: An overview of matching using the LFDA framework. Recognition is performed after combining each projected vector slice into a single vector ϕ and measuring the normed distance between a probe sketch and gallery photo.

Figure 2.4: An example of the internal (b) and external (c) features of the face image in (a). Humans tend to use the internal facial features for recognizing faces they are familiar with, and the external features for recognizing faces they are unfamiliar with [155]. Witnesses of a crime are generally unfamiliar with the culprit; therefore the external facial features should be more salient in matching forensic sketches.

Figure 2.5: Masks used for region based forensic sketch matching. Shown above are the mean photo patches of each patch used for a particular region. The mosaic effect is due to the fact that face patches are extracted in an overlapping manner.

Figure 2.6: Performance of matching forensic sketches that were labeled as good (49 sketches) and poor (110 sketches) against a gallery of 10,159 mugshot images without using race/gender filtering.

Figure 2.7: Performance of matching good sketches with and without using ancillary demographic information (race and gender) to filter the results.

Figure 2.8: Matching performance on the good sketches using race/gender filtering with SIFT and MLBP feature-based matching on only specific face regions.

Figure 2.9: Two examples of typical cases in which the true subject photo (third column) was not retrieved at rank 1, but the impostor subject (second column) retrieved at rank 1 visually looks more similar to the sketch (first column) than the true subject.

Figure 2.10: Examples of three of the best matches using LFDA. Below each example are the rank scores obtained by using the proposed LFDA method, FaceVACS, and component-based matching.

Figure 2.11: Examples of three of the worst matches using LFDA. Below each example are the rank scores obtained by using the proposed LFDA method, FaceVACS, and component-based matching.

Figure 3.1: Example images from each of the four heterogeneous face recognition scenarios tested in our study, as also shown in Chapter 1. The top row contains probe images from (a) near-infrared, (b) thermal infrared, (c) viewed sketch, and (d) forensic sketch modalities. The bottom row contains the corresponding gallery photograph (visible band face image, called VIS) of the same subject.

Figure 3.2: The proposed face recognition method describes a face as a vector of kernel similarities to a set of prototypes. Each prototype has one image in the probe and gallery modalities.
Figure 3.3: Example of thermal probe and visible gallery images after being filtered by a difference of Gaussian, center surround divisive normalization, and Gaussian image filters. The SIFT and MLBP feature descriptors are extracted from the filtered images, and kernel similarities are computed within this image descriptor representation.

Figure 3.4: The process of randomly sampling image patches is illustrated. (a) All image patches. (b), (c), (d) Bags of randomly sampled patches. The kernel similarity between SIFT and MLBP descriptors at each patch of an input image and the prototypes of the corresponding modality are computed for each bag. Images are from [89].

Figure 3.5: Proposed Prototype Random Subspace framework algorithm. Following the offline training phase, a face image I is enrolled and the vector Φ is returned for matching.

Figure 3.6: CMC plot for the NIR HFR scenario. Results use an additional 10,000 gallery images to better replicate real world matching scenarios. Listed are the accuracies for the proposed Prototype Random Subspace (P-RS) method, the Direct Random Subspace (D-RS) method [52], the sum-score fusion of P-RS and D-RS, and Cognitec's FaceVACS system [1].

Figure 3.7: CMC plot for the thermal HFR scenario. Results use an additional 10,000 gallery images.

Figure 3.8: CMC plot for the viewed sketch HFR scenario. Results use an additional 10,000 gallery images.

Figure 3.9: CMC plot for the forensic sketch HFR scenario. Results use an additional 10,000 gallery images.

Figure 3.10: Rank-1 accuracies (%) on the NIR and thermal modalities using the proposed P-RS framework. The rows list the features used to represent the probe images, and the columns list the features for the gallery images. The non-diagonal entries in each table (in bold) use different feature descriptor representations for the probe images than the gallery images. These results demonstrate another "heterogeneous" aspect of the proposed framework: recognition using heterogeneous features between the probe and gallery images.

Figure 3.11: Rank-1 accuracies (%) on the viewed sketch and forensic sketch modalities using the proposed P-RS framework. The rows list the features used to represent the probe images, and the columns list the features for the gallery images. The non-diagonal entries in each table (in bold) use different feature descriptor representations for the probe images than the gallery images. These results demonstrate another "heterogeneous" aspect of the proposed framework: recognition using heterogeneous features between the probe and gallery images.

Figure 3.12: Examples of thermal recognition not successfully matched by (a) FaceVACS, and (b) the proposed P-RS method. Examples of forensic sketch recognition not successfully matched by (c) FaceVACS, and (d) P-RS. In each image pair the left and right images are the probe and gallery, respectively.

Figure 3.13: CMC plot of matcher accuracies with an additional 10,000 gallery images when photos are used for both the probe and gallery (i.e., non-heterogeneous face recognition).
Figure 3.14: Face recognition results (%) when photos are used for both the probe and gallery (i.e., non-heterogeneous face recognition). The layout is the same as in Figure 3.10 (i.e., results shown are when different features are used to represent the probe and gallery images).

Figure 4.1: Multiple images of the same subject are shown, along with the match score (obtained by a leading face recognition system) between the initial gallery seed and the image acquired after a time lapse. As the time lapse increases, the recognition score decreases. This phenomenon is a common problem in face recognition systems. The work presented in this chapter (i) demonstrates this phenomenon on the largest aging dataset to date, and (ii) demonstrates that solutions to improve face recognition performance across large time lapse impact face recognition performance in scenarios without time lapse.

Figure 4.2: The performance of two commercial face recognition systems as a function of time lapse between probe and gallery images.

Figure 4.3: The true accept rates (TAR) at a fixed false accept rate (FAR) of 1.0% across datasets with different amounts of time lapse between the probe and gallery images. Four different RS-LDA subspaces were trained on a separate set of subjects with the different time lapse ranges tested above. The results suggest the need for multiple recognition subspaces depending on the time lapse.

Figure 4.4: Inherent separability of different facial regions with aging. (a) The mean pixel values at each patch where MLBP feature descriptors are computed. (b) The scale of the Fisher separability criterion used. (c) The heat map showing Fisher separability values at each image patch across different time lapses. As time lapse increases, the eyes and mouth regions seem to be the most stable sources of identifiable information.

Figure 4.5: The ability to improve face recognition performance by training on the same time lapse being tested on suggests face recognition systems should update templates over time. For example, at fixed intervals from the original acquisition date the template is updated to reside in a subspace trained for the time lapse that has occurred since acquisition. Probe images would be projected into each subspace and matched in the subspace corresponding to each gallery image.

Figure 5.1: Examples of the different demographics studied. (a-c) Age demographic. (d-e) Gender demographic. (f-h) Race/ethnicity demographic. Within each demographic, the following cohorts were isolated: (a) ages 18 to 30, (b) ages 30 to 50, (c) ages 50 to 70, (d) female gender, (e) male gender, (f) Black race, (g) White race, and (h) Hispanic ethnicity. The first row shows the "mean face" for each cohort. A "mean face" is the average pixel value computed from all the aligned face images in a cohort. The second and third rows show different sample images within the cohorts.

Figure 5.2: Dynamic face matcher selection. The findings in this study suggest that many face recognition scenarios may benefit from multiple face recognition systems that are trained exclusively on different demographic cohorts. Demographic information extracted from a probe image may be used to select the appropriate matcher, and improve face recognition accuracy.
Figure 5.3: Overview of the Spectrally Sampled Structural Subspace Features (4SF) algorithm. This custom algorithm is representative of state of the art methods in face recognition. By changing the demographic distribution of the training sets input into the 4SF algorithm, we are able to analyze the impact the training distribution has on various demographic cohorts.

Figure 5.4: Performance of the COTS-A commercial face recognition system on datasets separated by cohorts within the gender demographic.

Figure 5.5: Performance of the COTS-B commercial face recognition system on datasets separated by cohorts within the gender demographic.

Figure 5.6: Performance of the COTS-C commercial face recognition system on datasets separated by cohorts within the gender demographic.

Figure 5.7: Performance of the local binary pattern-based non-trainable face recognition system on datasets separated by cohorts within the gender demographic.

Figure 5.8: Performance of the Gabor-based non-trainable face recognition system on datasets separated by cohorts within the gender demographic.

Figure 5.9: Performance of the 4SF algorithm trained on an equal number of samples from each gender on datasets separated by cohorts within the gender demographic.

Figure 5.10: Performance of the different trained versions of the 4SF algorithm on the Female cohort.

Figure 5.11: Performance of the different trained versions of the 4SF algorithm on the Male cohort.

Figure 5.12: Performance of the COTS-A commercial face recognition system on datasets separated by cohorts within the race demographic.

Figure 5.13: Performance of the COTS-B commercial face recognition system on datasets separated by cohorts within the race demographic.

Figure 5.14: Performance of the COTS-C commercial face recognition system on datasets separated by cohorts within the race demographic.

Figure 5.15: Performance of the local binary pattern-based non-trainable recognition system on datasets separated by cohorts within the race demographic.

Figure 5.16: Performance of the Gabor-based non-trainable recognition system on datasets separated by cohorts within the race demographic.

Figure 5.17: Performance of the 4SF algorithm trained on an equal number of samples from each race on datasets separated by cohorts within the race demographic.

Figure 5.18: Performance of the different trained versions of the 4SF algorithm on the Black cohort.

Figure 5.19: Performance of the different trained versions of the 4SF algorithm on the White cohort.

Figure 5.20: Performance of the different trained versions of the 4SF algorithm on the Hispanic cohort.

Figure 5.21: Performance of the COTS-A commercial face recognition system on datasets separated by cohorts within the age demographic.
Figure 5.22: Performance of the COTS-B commercial face recognition system on datasets separated by cohorts within the age demographic.

Figure 5.23: Performance of the COTS-C commercial face recognition system on datasets separated by cohorts within the age demographic.

Figure 5.24: Performance of the local binary pattern-based non-trainable face recognition system on datasets separated by cohorts within the age demographic.

Figure 5.25: Performance of the Gabor-based non-trainable face recognition system on datasets separated by cohorts within the age demographic.

Figure 5.26: Performance of the 4SF algorithm trained on an equal distribution of samples across ages on datasets separated by cohorts within the age demographic.

Figure 5.27: Performance of the different trained versions of the 4SF algorithm on the Ages 18 to 30 cohort.

Figure 5.28: Performance of the different trained versions of the 4SF algorithm on the Ages 30 to 50 cohort.

Figure 5.29: Performance of the different trained versions of the 4SF algorithm on the Ages 50 to 70 cohort.

Figure 5.30: Match score distributions for the (a) male and (b) female genders using the 4SF system trained with an equal number of male and female subjects. All histograms are aligned on the same horizontal axis.

Figure 5.31: Genuine and impostor score distributions for the male and female genders using the 4SF system trained with an equal number of male and female subjects. The increased distances (dissimilarities) for the true match comparisons in the female cohort suggest increased within-class variance in the female cohort. All histograms are aligned on the same horizontal axis.

Figure 5.32: Shown are examples where dynamic face matcher selection improved the retrieval accuracy. The final two columns show the less frequent case where such a technique reduced the retrieval accuracy. Retrieval ranks are out of roughly 8,000 gallery subjects for each cohort. Leveraging demographic information (such as race/ethnicity in this example) allows a face recognition system to perform the matching using statistical models that are tuned to the differences within the specific cohort.

Figure 6.1: Examples of caricatures (top row) and photographs (bottom row) of four different personalities. Shown above are: (a) Angelina Jolie (drawn by Rok Dovecar), (b) Adam Sandler (drawn by Dan Johnson), (c) Bruce Willis (drawn by Jon Moss), and (d) Taylor Swift (drawn by Pat McMichael).

Figure 6.2: Different forms of facial sketches (b-d). (a) Photograph of a subject. (b) Portrait sketch. (c) Forensic sketch drawn by Semih Poroy from a verbal description. (d) Caricature sketch.

Figure 6.3: Illustration of features numbered one through twelve in the set of twenty-five qualitative features used to represent both caricatures and photographs. The similarity between sketches and photographs was measured within this representation.
Figure 6.4: Illustration of the features numbered thirteen through twenty-four in the set of twenty-five qualitative features used to represent both caricatures and photographs. The similarity between sketches and photographs was measured within this representation.

Figure 6.5: Overview of the caricature recognition algorithm.

Figure 6.6: The multiple kernel learning (MKL) weights (p), scaled by 10, for each of the qualitative features. Higher weights indicate more informative features.

Chapter 1

Introduction

Automated face recognition is a rapidly growing field that uses computer algorithms to determine the similarity between two face images [73]. Automating this process of facial identification has enormous implications towards improving public safety and security, and increasing the ubiquity of our interactions with intelligent machines. The progress of face recognition technology over the past two decades has been substantial, as benchmarked by the National Institute of Standards and Technology (NIST) [34] (see Figure 1.1). Because error rates are shown to have dropped at an exponential rate, one would justifiably assume that face recognition is becoming a largely solved problem. Unfortunately, this is far from the case, as many challenges in face recognition still remain (see Figure 1.2).

The reduction in error rates shown in Figure 1.1 is for face images captured in a controlled environment with cooperative subjects. However, face recognition performance significantly deteriorates when variations in facial pose, facial expression, and illumination (collectively known as PIE) are introduced [108]. Examples of such variations can be found in Figure 1.3. Other factors such as image quality (e.g., resolution, compression, blur), time lapse or facial aging (see Figure 1.4), and occlusion also contribute to face recognition errors [43, 44].

Figure 1.1: The reduction in error rates over the past 20 years for state of the art face recognition systems, as benchmarked by the National Institute of Standards and Technology [34]. For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation.

When considering face recognition in videos, issues such as segmenting the face in varying illuminations [63] and compression artifacts [61] must be considered as well. One of the most challenging tasks in automated face recognition is matching between two face images that have been sensed either in alternate imaging modalities (e.g., infrared images, hand drawn sketches, or depth images) or in different sensing environments and at different times (e.g., face images of the same person taken 10 years apart). Called heterogeneous face recognition, successful solutions to this face recognition paradigm extend the capabilities of face recognition to covert capture scenarios (e.g., face recognition at a distance and face recognition in nighttime environments), situations where no face image even exists (forensic sketch recognition), or situations where face images exhibit changes through the effects of aging (aging-invariant face recognition). Thus, while the majority of face recognition research seeks to mimic the capabilities of humans, heterogeneous face recognition offers the prospect of recognition capabilities beyond that of humans.
The goals and objectives of the research presented in this thesis are to develop representations and matching schemes that will improve the state of the art in heterogeneous face recognition.

Figure 1.2: Some of the major challenges in automated face recognition. These challenges include (a) heterogeneous face recognition (top row shows face images of subjects in non-visible modalities and the bottom row shows corresponding faces in visible light), (b) unconstrained face recognition (images from [39]), and (c) aging-invariant face recognition (same subject at four different ages).

Figure 1.3: Pose, illumination and expression (PIE) challenges. (a) A face image with controlled pose, expression, and illumination. Face images with variations in (b) facial pose, (c) illumination, and (d) facial expression. These factors are common sources of error in automated face recognition systems.

Figure 1.4: The decrease in accuracy of a leading commercial face recognition algorithm as a function of the time lapse between the probe and gallery images. Measured on a mug shot database containing 94,631 face images of 28,031 subjects, this performance degradation illustrates one of several challenges in face recognition research.

The remainder of the introduction to this thesis on heterogeneous face recognition is organized as follows. In Section 1.1 we will trace some of the lineage of face recognition research. In Section 1.2 we will provide an overview of how automated face recognition algorithms are designed. An overview of heterogeneous face recognition will be presented in Section 1.3. Section 1.4 will outline the contributions of this thesis. Finally, Section 1.5 will discuss the organization of the remaining chapters of this thesis.

1.1 The Lineage of Face Recognition

Dating as far back as the invention of the abacus, we have sought to build machines that replicate intelligent tasks performed by the human brain. The current state of intelligent automation is such that we are able to design machines that perform some of the most complicated human tasks, such as piloting vehicles, natural language processing, and face recognition. The key advancements that have allowed us to realize these technologies are the progression of computing capacities at a rate predicted by Moore's Law [106], and advancements in computer algorithms. The progression in computing algorithms that has enabled today's intelligent machines may be attributed to research in the broad field known as artificial intelligence. With an aim to automate intelligent tasks otherwise performed by biological lifeforms, artificial intelligence spans a host of engineering applications and has attempted to leverage mathematical developments from almost every academic field.

Figure 1.5: Example images from three different heterogeneous face recognition scenarios. The top row contains probe images from (a) near-infrared, (b) thermal infrared, and (c) forensic sketch modalities. The bottom row contains the corresponding gallery (visible band face image) photographs.

Figure 1.6: The human visual system starts with the eyes (top of the image in red), which sense visible light waves. The path of information detected by the eyes ends with the visual cortex of the brain (bottom of the image in red).
The visual cortex is the largest component in the brain, and is responsible for such intelligent tasks as object recognition, motion detection, and face recognition.

The academic discipline of pattern recognition is a field within artificial intelligence that (broadly) seeks to infer high level information from low level data. Many similarities lie between pattern recognition and the closely related field of machine learning. In theory, pattern recognition algorithms are not concerned with the particular source of the data (e.g., digital images, object measurements, depth fields), but instead with the structure of the data (e.g., real-valued, nominal, ordinal, unstructured) and what information needs to be inferred from the data (e.g., classification, clustering, regression) [24]. Typically, inference from data is achieved after implementing a learning stage, which uses empirical data samples exemplar to the inference task at hand in order to develop decision boundaries that best generalize from the aggregate of the training data to unseen test samples.

When applying pattern recognition algorithms to digital images we enter the field of computer vision [126]. Computer vision faces the bold task of replicating the largest processing system in the human brain: the visual system. The human visual system is the primary method of sensing (and hence responding to) the environment for most humans. The visual system operates by transmitting visible light waves detected by the eye to the visual cortex region of the brain (see Figure 1.6). When the signals arrive at the visual cortex, the information is transmitted through the dorsal and ventral streams of the brain. Studies have indicated that the dorsal stream is used to process information regarding a person's location in relation to his environment, while the ventral stream has been shown to perform object recognition related tasks on the visible light waves sensed [32]. Strong evidence shows that when the appearance of a face enters the ventral stream, the task of associating an identity with that face is performed by a dedicated region of the brain, called the fusiform face area [47, 91]. The Thatcher effect [140], illustrated in Figure 1.7, allows us to readily observe evidence supporting the suggestion that we have a dedicated area of the brain for face processing.

Shifting focus to automated face recognition algorithms, we realize that research in the field of automated face recognition goes beyond finding solutions to a typical application of pattern recognition and computer vision. Face recognition research seeks to replicate an entire region of the brain that is predominantly dedicated to this one task: extracting information from human faces. That the human brain has evolved to weight this identification task with such resources demonstrates the importance and the benefit of designing computer algorithms capable of replicating this task.

1.2 Automated Face Recognition Algorithms

The challenges in designing automated face recognition algorithms are numerous. Charged with the task of outputting a measure of similarity between a given pair of face images, such challenges manifest in the following stages performed by most face recognition algorithms: (i) face detection, (ii) face alignment, (iii) appearance normalization, (iv) feature description, (v) feature extraction, and (vi) matching.
This section provides an overview of each of the above mentioned stages in automated face recognition, and follows the same ordering illustrated in Figure 1.8. The predominant focus will be on the face representation and feature extraction stages. This is because our research on heterogeneous face recognition has generally relied on improvements in these two stages to increase recognition accuracies between heterogeneous face images.

1.2.1 Detection, Alignment, and Normalization

The first step in automated face recognition is the detection and alignment of face images. Often viewed as a preprocessing step, this stage is critical to both detecting the presence of a face in a digital image, and aligning the face with the spatial coordinate system used in the succeeding face description.

Figure 1.7: The Thatcher effect. (a) Most people will notice only minor differences between the two inverted face images shown. (b) However, when turned upright, the differences between the same two face images are noticed to be far more severe. Our reduced ability to see the strong differences between the images in (a) is believed to be because inverted faces do not trigger the fusiform face area of the brain.

Figure 1.8: The common steps utilized by most face recognition algorithms: face detection, face alignment, appearance normalization, feature description, feature extraction, and matching.

Figure 1.9: Different methods for face alignment. (a) Face images before (left column) and after (right column) alignment through planar rotation and scaling. (b) Face images aligned using a morphable model (images from [13]), and (c) a video sequence aligned using a 3D model whose parameters were solved from a structure from motion algorithm (images from [102]).

The face detector proposed by Viola and Jones [146], which uses a cascaded classifier in conjunction with images represented using a verbose set of Haar-like features, set the precedent for all modern detectors with its robust accuracy and scalable computational complexity. While many methods have been proposed to improve upon the Viola and Jones detector, it still serves as an optimistic baseline of state of the art performance [105].

Face alignment is typically performed by first detecting the location of some fixed set of anthropometric landmarks on the face. In its simplest form, these landmarks are the centers of the two eyes. Using the two eye locations, a 2D affine transformation is performed to fix the angle and distance between the two eyes. More advanced methods use 3D affine transformations, or Procrustes alignment, on a more verbose set of landmarks (such as a set of landmarks outlining the locations of the mouth, nose, and face outline). The landmarks are generally detected by active shape models (ASM) [21] or active appearance models (AAM) [22]. Additional techniques include the use of 3D morphable models [13] and structure from motion [102]. Examples of face image alignment are shown in Figure 1.9.

Appearance normalization seeks to compensate for variations in illumination. A variety of methods have been proposed to perform such compensation, including the contrast equalization and difference of Gaussian filters proposed by Tan and Triggs [136], cone models by Georghiades et al. [30], and light field modeling by Gross et al. [33].
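In its simplest form, the eye-based alignment described above reduces to a single 2D similarity transform. The following sketch (a minimal illustration assuming OpenCV and NumPy are available; the function name, output size, and eye placement are our own choices rather than part of any cited method) rotates and scales a face so that the eyes are level and a fixed distance apart:

```python
import numpy as np
import cv2

def align_face(image, left_eye, right_eye, ipd=75, out_size=(200, 250)):
    """Rotate/scale the face so the eye line is horizontal and the eyes are
    ipd pixels apart, then crop to a canonical output size."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    angle = np.degrees(np.arctan2(dy, dx))        # in-plane rotation of the eye line
    scale = ipd / np.hypot(dx, dy)                # scale to the target interocular distance
    center = ((left_eye[0] + right_eye[0]) / 2.0,
              (left_eye[1] + right_eye[1]) / 2.0)
    M = cv2.getRotationMatrix2D(center, angle, scale)
    # Shift so the eye midpoint lands at a fixed position in the output crop.
    M[0, 2] += out_size[0] / 2.0 - center[0]
    M[1, 2] += 0.35 * out_size[1] - center[1]
    return cv2.warpAffine(image, M, out_size)
```

The 75 pixel interocular distance and 200 x 250 crop are chosen to anticipate the geometric normalization used for sketch matching in Chapter 2.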
1.2.2 Feature Representation

The feature representation stage encodes different facial characteristics (often implicitly) in a feature descriptor vector. Such descriptive information can range from a vector of ordered image pixel values, to distance measurements between facial components (e.g. the distance from the nose to the mouth), or to even more complex features such as convolutions of a face image with a set of Gabor filters. The range of representations used in face recognition is quite vast. Klare and Jain developed an organization of such features to facilitate studies of facial individuality and help standardize the face recognition process [56]. Below we introduce this taxonomy to provide a better understanding of the different methods by which a face image can be represented.

Klare and Jain's taxonomy organized the vast gamut of facial feature representations leveraged in automated and manual face recognition into three levels: Level 1, Level 2, and Level 3. Level 1 features consist of gross facial characteristics that are easily observable in a face, such as skin color, gender, and the general appearance of the face. Level 2 features consist of localized face information that requires specialized cortex processing, such as the structure of the face, the relationship among facial components, and the precise shape of the face. Level 3 features consist of certain irregularities in the facial skin, which includes micro features such as facial marks, skin discoloration, and moles. An example of this proposed feature grouping can be found in Figure 1.10.

This categorization of facial features is intended to provide a better understanding and standardization of both manual and automated face recognition processes. The benefit of this categorization is twofold: (i) facilitating an individuality measure for face images that can be used in legal testimony, and (ii) improving the accuracy of commercial matchers through a more careful selection of facial features. The current fingerprint feature categorization [87], accepted by both forensic scientists as well as fingerprint vendors, served as a guiding principle for our categorization of facial features. Compared to face recognition, fingerprint matching has over 100 years of history and success. Furthermore, features used in automatic fingerprint matchers (AFIS) are compact and have a physical interpretation in terms of the ridge flow patterns in the fingerprint. Indeed, state-of-the-art AFIS utilize essentially the same features that are utilized by human fingerprint examiners. This is not necessarily true for face recognition; features extracted by humans are not easy to precisely describe, and thus cannot be utilized in automatic face recognition systems.

Figure 1.10: Examples of the three levels of facial features [56]. (a) Level 1 features contain low dimensional appearance information that is useful for determining high-level identifying information such as ethnicity, gender, and the general shape of a face. (b) Level 2 features (e.g. Gabor, LBP, and shape representations) require detailed processing for face recognition and capture information regarding the structure and specific shape and texture of the face. (c) Level 3 features include marks, moles, scars, and other irregular micro features of the face.

Salient features in fingerprints are categorized into three levels: Level 1 features encompass the global structure or ridge pattern (e.g. arch, loop, whorl). Level 2 features consist of minutiae location and orientation, and are primarily used for matching. Level 3 features consist of information available at higher spatial resolutions, such as dots, incipient ridges, and ridge width.
An example of these fingerprint features can be found in Figure 1.11. The analogy between these widely accepted fingerprint feature levels and the proposed face feature levels will be established below.

A major benefit of Klare and Jain's facial feature taxonomy is that the same feature levels can be defined for both face recognition engines as well as human face examiners. The lack of a well defined and accepted method used in human face identification is being noticed as automated face recognition systems continue to mature [132]. The rapid growth in the use of face images captured from surveillance cameras in legal proceedings has also drawn into question the methods by which human face examiners determine a person's identity using typically low quality video frames [25]. The absence of a defined set of face features prevents: (i) a generally well accepted method of human face examination, (ii) an understanding of the statistical uniqueness of face features derivable by humans [132], and (iii) an estimate of the likelihood of a false association occurring in automated face recognition systems. Ongoing studies on the individuality of fingerprints [101] are also motivated by challenges to fingerprint evidence in court cases. A report from the National Academy of Sciences on forensics [23] highlights the need for such individuality studies not only for fingerprints but for other biometric traits as well. A recent volume on forensic facial comparison [26] also mentions this report among other motivating factors for developing face individuality models. The organization of facial features assists in conducting a study on the individuality of facial features.

Figure 1.11: A fingerprint image and its (a) Level 1 (e.g. core and delta), (b) Level 2 (e.g. ridge ending and bifurcation minutiae), and (c) Level 3 (e.g. pores and incipient ridges) features. The organization of facial features is analogous to such feature levels [56].

Face Feature Levels

Level 1

Level 1 facial features encompass the global nature of the face, and can be extracted from low resolution face images (< 30 pixels interpupillary distance (IPD)). In automated face recognition, Level 1 features include appearance-based methods such as PCA (Eigenfaces [143]) and LDA (Fisherfaces [10]). For example, these features can generally discriminate between: (i) a short round face and an elongated thin face; (ii) faces possessing predominantly male and female characteristics; or (iii) faces from members of different ethnicities. Level 1 features cannot, however, accurately identify an individual over a large population of candidates. This is illustrated in Figure 1.12, where a query image can easily be differentiated from a subject that has a very different appearance, but cannot be distinguished from a more similar looking subject. Level 1 facial features derivable by humans and machines are gender, race, and general age. The postulated feedforward nature of human face recognition also uses Level 1 features, where the initial layers can quickly discard a match candidate if they have a largely different facial appearance [18].

Table 1.1: Example features from each of the three different levels of facial features that are used to represent face images by (a) humans, and (b) machines.

(a) Source: Humans and machines
Level 1: gender, race, age
Level 2: anthropometric features
Level 3: moles, scars, freckles, birth marks

(b) Source: Machines only
Level 1: appearance-based methods (PCA, LDA, etc.)
Level 2: distribution-based feature descriptors (LBP, SIFT, etc.), shape distribution models, texture descriptors
Level 3: high spatial frequency features
Level 1 face features are quite analogous to Level 1 fingerprint features. In each of these two traits, Level 1 features are simple to compute even in low resolution images. However, Level 1 features alone are generally only useful for indexing or reducing the search space. Level 1 features should be explicitly leveraged to improve the matching speed by using them in early stages of a cascaded face recognition system.

Level 2

Level 2 features are representations that are specific to face recognition, and require more detailed face observations. These features are locally derived and describe structures in the face that are only relevant to face recognition (as opposed to general object recognition) due to their spatial uniqueness. Examples of such face features in automated face recognition include the use of Gabor wavelets in elastic bunch graph matching (EBGM) [151], local binary patterns (LBP) [3], SIFT feature descriptors [59, 93], point distribution models [22], texture appearance models [22], biologically inspired features proposed by Riesenhuber and Poggio (R&P) [93, 122], and explicit face geometry [124] (which includes the Bertillon system [11]).

Level 2 features are essential for face recognition, given the strong evidence suggesting that face recognition activity in humans takes place in the fusiform face area [47, 142], a cortical region that appears to be dedicated to face recognition. In an attempt to replicate human visual processing for face recognition, Level 2 biologically inspired features in the form of Gabor wavelets have been successfully utilized in machine face recognition [93]. Along with other features such as local binary patterns and gradient-based methods, these features are face specific, provided they are defined with respect to their spatial coordinates on the face. For example, EBGM extracts Gabor descriptors at specified locations on the face [151], and LBP and SIFT-descriptor methods extract these descriptors at uniformly distributed locations on a face that has been normalized using the eye coordinates [3, 59].

Level 2 face features are analogous to minutiae location and orientation in fingerprint recognition. In both face and fingerprint, the Level 2 features are defined with respect to a particular spatial coordinate reference, and in each case the local features can generally be computed independently of one another. While Level 2 features are the most discriminative face features, and are predominantly used for face recognition, certain matching scenarios exist in which they alone are not sufficient. One example is face recognition in monozygotic (i.e. identical) twins [60, 134]. Because the facial appearance of monozygotic twins is nearly identical at medium resolutions (roughly 20 to 100 pixels IPD), Level 2 features alone are generally not sufficient for such a task. Another example where Level 2 features alone may be insufficient is age-invariant face recognition [57, 104]. As humans age, the bone structure (in early aging) and cartilage (in late aging) of the face expand and the skin wrinkles, causing both the facial shape and texture to change. While humans extract "biological features" (i.e.
neuron encodings of facial features) to recognize faces, we are limited in our knowledge of how to precisely describe these features. As a result, expert testimony for face recognition in the legal system is generally restricted to the geometric Level 2 features, such as face measurements and ratios (e.g. the ratio of the distance between the eyes to the nose width). These anthropometric methods were applied to systematic face recognition, prior to the advent of fingerprint identification, in the Bertillon system [11, 119]. While the uniqueness of such anthropometric features is not currently leveraged in face examination, anthropometric features: (i) have gained informal acceptance in the legal system, and (ii) are computable by both humans and machines [22]. Thus, despite the fact that anthropometric-based face recognition (i) is not a typical approach to automated face recognition, and (ii) is currently used without a consistent and proven methodology in court cases [25], a thorough examination of their uniqueness must be undertaken. Such a study could be guided by the similar statistical studies on the uniqueness of fingerprints [101] that are critical for the acceptance of fingerprint evidence in the legal system.

Figure 1.12: An example of how Level 1 features can easily filter out faces that exhibit large differences, but cannot distinguish faces that possess many similarities. The probe image in (a) was matched to each gallery image using a Level 1 image pixel representation (difference in PCA features using the Euclidean distance); the PCA distances to the gallery images in (b), (c), and (d) were 0.1723, 0.1379, and 0.7673, respectively. Note that a larger PCA distance indicates that the faces are less similar. Using this Level 1 representation, the face in (a) matched better to an image of a similar looking subject (c) than to its true mate (b), but was easily differentiated from other subjects that looked largely different (d). The information in Level 1 features is sufficient for quickly discarding some subjects (d), but more detailed Level 2 features are needed to discriminate between similar looking subjects (c). These images are from the AR face database [89].

Level 3

Level 3 features contain unstructured, micro level features on the face, which include scars and facial marks. Only recently has this identifiable information been explicitly considered for face recognition [60, 103]. One challenging face recognition problem where Level 3 features are critical is the discrimination of monozygotic (i.e. identical) twins [60]. Because identical twins are extremely difficult for even humans to distinguish, the presence of any small identifying mark on a face could be the difference between successful and mistaken identification. Research in the medical community has shown that while the number of moles (or nevi) in monozygotic twins is correlated, the locations of these moles are not [159] (see Figure 1.13). Level 3 features have been shown to also improve the matching accuracy in standard face recognition scenarios [103].

Figure 1.13: Face images of two identical twins. While their Level 1 and Level 2 features are the same, the facial mark information contained in the Level 3 features (shown in red circles) offers discriminating information [60].

Level 3 features in the form of marks should be relatively easy to extract by both humans and computers. Given a good quality face image, the presence of freckles, moles, marks, and scars can be manually marked.
An automated approach to mark extraction is also viable [103], though more attention is needed to develop robust solutions. For high resolution images (> 100 pixels IPD) machines are also able to extract micro texture information, though very few studies have been conducted to explicitly understand how micro texture analysis can improve face recognition. Results from the 2006 Face Recognition Vendor Test (FRVT) [110] demonstrated that high resolution face images are able to improve the matching accuracy of most commercial matchers, supporting the usefulness of micro texture information.

In fingerprints, Level 3 features include micro information such as incipient ridges and pores, and irregular information such as scars, creases and other permanent details [41]. This information is typically used by latent fingerprint examiners. In the case of AFIS matching, higher resolution fingerprint images (1000 ppi) are necessary to extract pore and ridge information to improve the matching accuracy, which is generally consistent with the proposed Level 3 face features: many moles and facial marks are not detectable at lower image resolutions. In the context of latent examination, the partial fingerprints available may require the use of Level 3 features to make a reliable determination of a subject's identity since there may not be a sufficient number of Level 2 features (minutiae) available. Similarly, forensic examination of face images may need to leverage face mark information to make a successful identity determination [133].

It is clear now that no optimal feature representation or encoding exists for face images. However, the feature description stage is consistent across all such representations in that each representation outputs some feature vector x that describes the face. It is from this feature vector representation that automated algorithms ultimately measure how similar two face images are.

1.2.3 Feature Extraction

With a face image I now represented by a vector x, where x is the feature vector from the above mentioned feature descriptor encodings (LBP descriptors, pixel values, etc.), a host of subspace and manifold methods exist for leveraging training data (i.e. exemplar face images) to extract feature combinations which project the original features into a feature space with improved face class separation.

Principal Component Analysis

Dating back to the original Eigenfaces method [143], principal component analysis (PCA) has played a vital role in the field of face recognition. PCA seeks to find an orthogonal subspace Ψ that reduces the dimensionality of the original feature space while preserving the majority of the data variance. This is achieved by performing an eigendecomposition on the covariance matrix computed from samples in the feature space. Given n samples $x_i \in \mathbb{R}^d$, $i = 1 \ldots n$, where $x_i$ can be any of the feature representations discussed previously (LBP, pixels, etc.), the first step in PCA is to compute the sample mean $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$. Letting $X \in \mathbb{R}^{d \times n}$ be a matrix containing each data instance centered at the mean (i.e. $X = [x_1 - \mu, x_2 - \mu, \ldots, x_n - \mu]$), we compute the scatter matrix as $S = X X^T$. Finally, we seek the subspace Ψ, where

$$\Psi = \arg\max_{\Psi} \; \Psi^T S \Psi \quad (1.1)$$

Ψ can be solved by performing an eigendecomposition on S, which yields the matrix of eigenvectors Ψ and the diagonal matrix of eigenvalues Λ, where the eigenvector in the i-th column of Ψ corresponds to the i-th diagonal entry in Λ.
That is,

$$S \Psi = \Psi \Lambda \quad (1.2)$$

Generally, only $d' < d$ eigenvectors are retained, such that

$$d' = \arg\min_{\tilde{d}} \left( \frac{\sum_{i=1}^{\tilde{d}} \Lambda(i,i)}{\sum_{i=1}^{d} \Lambda(i,i)} > V_e \right) \quad (1.3)$$

where $V_e \in (0, 1)$ is the fraction of data variation to be retained (e.g. 0.98).

A computational burden in solving for Ψ lies in the computation of S. This is due to the fact that often $d \gg n$. That is, the dimensionality d of the feature vectors x is much greater than the number of images n. For example, if we had 1,000 images to learn the PCA space Ψ, and x was an image pixel representation for 128 x 128 sized images, then $n = 10^3$ and $d \approx 1.6 \cdot 10^4$. Thus, d is an order of magnitude larger than n. This means that the computational complexity for computing S is $O(d^2 n)$. Though S is a d x d dimensional matrix, the rank of S will only be n. Turk and Pentland [143] showed that Ψ could instead be solved from

$$X^T X \Psi' = \Psi' \Lambda' \quad (1.4)$$

because

$$X X^T (X \Psi') = (X \Psi') \Lambda' \quad (1.5)$$

which means $\Psi = X \Psi'$. Solving Ψ in this manner reduces the computational complexity to $O(d n^2)$.

The chief benefit of PCA lies in reducing the feature dimensionality from d to d' (d' < d). Typically the majority of the data variation is captured in the eigenvectors associated with large eigenvalues, and the eigenvectors associated with small eigenvalues correspond to noisy measurements. By discarding the eigenvectors associated with small eigenvalues, the feature dimensionality is greatly reduced without losing data variance information.

Linear Discriminant Analysis

While PCA is effective in reducing the feature dimensionality to a more tractable size, it is not able to leverage the category information (class labels) in the training data to improve recognition accuracy. Belhumeur et al. [10] adapted linear discriminant analysis (LDA) as a face recognition subspace technique that seeks a linear subspace projection Ψ that maximizes the discriminability of the feature space with respect to the Fisher criterion

$$\Psi = \arg\max_{\Psi} \frac{\Psi^T S_B \Psi}{\Psi^T S_W \Psi} \quad (1.6)$$

where $S_B$ is the between-class scatter matrix and $S_W$ is the within-class scatter matrix. That is, $S_B$ is the scaled covariance between images of different subjects, and $S_W$ is the scaled covariance between images of the same subject. By solving Eq. (1.6), a subspace projection is learned where (ideally) the images of the same subject form compact groups, and images of different subjects are well separated.

An LDA subspace projection is learned from a training set of face images of $n_S$ different subjects, with at least two images per subject. For each subject i, the $n_i$ feature vectors $x_i^j$, $j = 1 \ldots n_i$, for the i-th subject are used to compute the mean vector $\mu_i = \frac{1}{n_i} \sum_{j=1}^{n_i} x_i^j$. From this, we compute the between-class scatter matrix $S_B$ and the within-class scatter matrix $S_W$ as

$$S_B = \sum_{i=1}^{n_S} n_i (\mu_i - \mu)(\mu_i - \mu)^T \quad (1.7)$$

$$S_W = \sum_{i=1}^{n_S} \sum_{j=1}^{n_i} (x_i^j - \mu_i)(x_i^j - \mu_i)^T \quad (1.8)$$

where µ is the mean of all feature vectors $x_i^j$. Finally, Ψ is determined from the solution of the generalized eigenvalue problem

$$S_B \Psi = S_W \Psi \Lambda \quad (1.9)$$

Generally, this is equivalent to performing an eigendecomposition on $S_W^{-1} S_B$. However, the rank $r_W$ of $S_W$ will be $r_W = (\sum_{i=1}^{n_S} n_i) - n_S$. If $r_W < d$ (as is typically the case), then $S_W$ will be singular and non-invertible. The common solution to this problem is to first perform PCA on the feature space to reduce the dimensionality to $d'$ where $d' \leq r_W$. After this dimensionality reduction, LDA can be applied on the PCA-reduced feature space.
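To make the PCA-then-LDA recipe above concrete, the following sketch (a minimal NumPy illustration under simplifying assumptions; the function names and the exact variance-retention logic are ours) learns a PCA subspace with the snapshot trick of Eqs. (1.4)-(1.5) and then fits LDA in the reduced space:

```python
import numpy as np

def pca_snapshot(X, var_retained=0.98):
    """PCA via the n x n matrix X^T X (Turk-Pentland trick); X is d x n, mean-centered."""
    evals, evecs = np.linalg.eigh(X.T @ X)               # ascending eigenvalues
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]
    keep = np.searchsorted(np.cumsum(evals) / evals.sum(), var_retained) + 1
    Psi = X @ evecs[:, :keep]                             # back-project to d-dimensional eigenvectors
    return Psi / np.linalg.norm(Psi, axis=0)              # unit-norm columns

def lda(Y, labels):
    """Fisher LDA on the PCA-reduced (d' x n) data Y; returns the projection of Eq. (1.6)."""
    labels = np.asarray(labels)
    mu = Y.mean(axis=1, keepdims=True)
    dim = Y.shape[0]
    Sb, Sw = np.zeros((dim, dim)), np.zeros((dim, dim))
    for c in np.unique(labels):
        Yc = Y[:, labels == c]
        mu_c = Yc.mean(axis=1, keepdims=True)
        Sb += Yc.shape[1] * (mu_c - mu) @ (mu_c - mu).T
        Sw += (Yc - mu_c) @ (Yc - mu_c).T
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))  # eigendecomposition of Sw^-1 Sb
    return np.real(evecs[:, np.argsort(-np.real(evals))])
```

Here X would hold the mean-centered descriptor vectors of the training faces as columns, Y = Psi.T @ X the PCA-reduced data, and labels the subject identity of each column; the leading columns of the returned LDA matrix give the Fisherface-style subspace.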
Random Sampling Linear Discriminant Analysis

Ideally, the aforementioned LDA algorithm will learn subspace projections that offer improved recognition accuracies over a PCA subspace. However, in practice, this is often not the case due to the small sample size (SSS) problem [115]. Specifically, the problem being solved by LDA is often ill-posed because the number of training samples per subject (i.e. face images for each subject) is too small with respect to the number of feature dimensions. Because of this, subspaces learned through LDA may have high generalization errors.

A solution to the SSS problem is random sampling linear discriminant analysis (RSLDA). This approach was first introduced by Wang and Tang [148] using image pixel representations. Li et al. later extended this method in order to make it applicable to Level 2 features (SIFT and LBP) [76]. Klare and Jain showed the effectiveness of the RSLDA framework on a large aging dataset [57]. Random sampling LDA mitigates the small sample size problem by decomposing the feature space into more compact and solvable subsets. This approach follows the concept of an ensemble classifier, where Schapire combined multiple weak classifiers into a single strong classifier [125]. RSLDA learns B different subspaces. For each subspace, the d-dimensional feature space spanned by x is randomly sampled so that a subset of size $d_r < d$ of the original d features is retained. LDA is then performed using this reduced feature space. However, in addition to sampling the feature space, the training subjects are also randomly sampled. Thus, only a portion of the original subjects is used in each of the B stages. Once the set of subspaces has been learned, a face feature vector is projected into each of the B subspaces (resulting in B feature vectors describing the face). These feature vectors may then be concatenated into a single vector for matching.

1.2.4 Matching

The matching stage outputs a measure of similarity (or dissimilarity) between two face images, where the feature vectors used to compute such (dis)similarities are the outputs from the feature extraction stage discussed above. Most simply, matching is performed using the nearest neighbor classification algorithm [24]. That is, a probe (or query) image is matched against a gallery (or database) by finding the face image in the gallery with the minimum distance (such as the Euclidean or cosine distance) or maximum similarity.

Often the matching stage can be augmented by an additional stage of statistical learning (that is, in addition to the learning that occurred in the feature extraction stage). A common notion here is to map the task of generating a measure of similarity between two face images to a binary classification problem that determines whether two face images are of the 'same subject' or of 'different subjects'. This notion can easily leverage a host of binary classification algorithms from the machine learning and pattern recognition literature by creating new feature vectors that are the difference between the feature vectors extracted from two face images. The method by Moghaddam et al. was seminal in using this binary classification approach by modeling the difference vectors with Bayesian maximum a posteriori density estimation [96] to generate a probabilistic measure of similarity between two face images. This technique has also been applied using support vector machines [36].
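As a concrete illustration of the difference-vector formulation (a minimal sketch only; it uses scikit-learn's linear SVM as a generic stand-in rather than the Bayesian MAP model of [96], and the helper names are ours):

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_verifier(pairs, labels):
    """pairs: list of (x1, x2) feature-vector tuples; labels: 1 for 'same subject',
    0 for 'different subjects'. The classifier is trained on |x1 - x2|."""
    D = np.array([np.abs(x1 - x2) for x1, x2 in pairs])
    clf = LinearSVC(C=1.0)
    clf.fit(D, labels)
    return clf

def similarity(clf, x1, x2):
    """Signed distance to the decision boundary, used as a match score
    (larger means more likely to be the same subject)."""
    return float(clf.decision_function(np.abs(x1 - x2).reshape(1, -1)))
```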
Fusion techniques, as discussed by Ross and Jain [123], may also be exploited to improve face matching. While typically applied to multi-biometric scenarios, the same approach is viable for face recognition scenarios where, for example, the use of multiple face representations (such as LBP and Gabor), multiple views of a face, or multiple RSLDA subspaces can be consolidated to achieve better discrimination.

1.3 Heterogeneous Face Recognition

Now that we better understand the face recognition process, we can shift our attention to the topic of heterogeneous face recognition (HFR). The key difficulty in matching face images from alternate modalities is that face images of the same subject may differ in appearance due to the change in image modality. Heterogeneous face recognition algorithms must develop representation schemes that are invariant to such intra-class variations. Two of the most intuitive methods for achieving this invariance are the selection of feature descriptor encodings that are stable between heterogeneous modalities, and learning feature extractions from such encodings that further compensate for such undesired differences.

Figure 1.14: Most heterogeneous face recognition scenarios leverage large visible light face image databases to determine a subject's identity from their face image acquired in some non-visible modality. (a) Between driver's licenses, passports, and mug shots, a visible light face image exists for a majority of the population. (b) Many forensic and law enforcement scenarios only have face images available from alternate imaging sources such as infrared, LIDAR, or forensic sketches.

The most frequent heterogeneous face recognition scenario involves gallery databases with visible light face images, and probe images from some alternate modality such as infrared, sketch, or depth images (see Figure 1.14). The motivation behind solutions to these scenarios is that through sources such as state DMV driver license photos, law enforcement mug shot records, the FBI's Next Generation Identification initiative, and the US-VISIT program, visible photograph databases cover the majority of the U.S. population. In fact, visible face image databases cover the population of most other developed nations as well. While the standard face recognition paradigm is to match homogeneous face images (e.g. surveillance images, mug shots, images from social networking sites) against the above mentioned face databases, heterogeneous face recognition seeks to query these databases with images captured from imaging devices of an alternate modality.

Many scenarios exist in which the only available probe images are not visible light face images. For example, when no face image exists of a subject (suspect), a forensic sketch may be developed through a verbal description of the subject's appearance. In nighttime environments infrared imaging must be used to capture a subject's face biometric. In order to identify subjects in these scenarios, specialized algorithms for heterogeneous face recognition must be employed. The collection of solutions to heterogeneous face recognition can be organized into three approaches:

• Synthesis methods: Synthesis methods seek to generate a synthetic visible light photograph from the alternate modality face image. Once a synthetic visible face image has been generated, it can be matched using standard face recognition algorithms.
Synthesis solutions to heterogeneous face recognition are generative methods, and have been solved using local linear embedding [82] or Markov random fields [149]. Park et al. handled the heterogeneity in facial aging by synthesizing the aging process [104].

• Feature-based methods: Feature-based methods encode face images from both modalities using feature descriptors that are largely invariant to changes between the two domains. For example, local binary patterns [99] and SIFT feature descriptors [84] have been shown to be stable between sketches and visible photographs [59], near-infrared face images and visible photographs [77], and time-separated (aged) face images [57, 76]. Once face images from both modalities are represented using feature descriptors, feature extraction methods such as LDA can be used to improve the discriminative abilities of the representation. The matching stage of the feature-based methods is performed by measuring the distance or similarity between the feature vector representations of two face images.

• Prototype similarity methods: As we will discuss in Chapter 3, prototype similarity methods represent a face image as a vector of similarities to a collection of prototype face images [54]. The prototypes are a collection of subjects that each contain a face image from both the probe and gallery modalities. The prototypes are analogous to a training set: in this case they help approximate the distribution of faces. Because each prototype has a face image from each modality, the vector of similarities for a face image is measured against the prototype images from the corresponding modality. Similarities are measured between feature-based representations (e.g. LBP, SIFT) of the face images. The use of similarities naturally extends to kernel similarities, with the kernel space offering a non-linear feature space. Linear discriminant analysis may also be applied on the vectors of prototype similarities to improve the recognition accuracy. A chief benefit of prototype similarity algorithms is that the feature representation in which the similarities are computed may be different for the probe and gallery modalities. This property is important for scenarios such as 3D to 2D matching, where common feature descriptors do not exist between the modalities.

1.4 Contributions

The contributions of this dissertation are as follows:

1. A framework for feature-based heterogeneous face recognition is developed. This framework, called Local Feature-based Discriminant Analysis (LFDA), achieves state of the art accuracy when applied to the heterogeneous face recognition task of matching sketches and photographs.

2. An approach to heterogeneous face recognition is developed which uses prototype similarities to eliminate the need to directly measure the similarity between images from heterogeneous modalities. In requiring that face similarities be computed within each modality only, the prototype heterogeneous face recognition method generalizes to any HFR scenario.

3. In showing that aging-invariant face recognition systems do not generalize to face images that have not aged, it is demonstrated that face recognition in the presence of time lapse can be viewed as a heterogeneous face recognition problem.

4. Different sources of demographic information (race, gender, and age) are exploited to perform dynamic face matcher selection. This paradigm for face recognition uses the available demographics of the probe image to improve face recognition accuracies.
5. A set of qualitative facial features is developed to enable matching caricature sketches to photographs. These features are: (i) able to encode key facial characteristics that are used in caricatures to convey a subject's identity, and (ii) robust to facial variations such as the unconstrained exaggerations performed by caricaturists.

1.5 Thesis Organization

In Chapter 2 we present a solution to the problem of heterogeneous face recognition between forensic sketches and mug shot photographs. In Chapter 3, a framework for heterogeneous face recognition is introduced which uses prototype similarity features to generalize the heterogeneous face recognition task to any scenario. Chapter 4 presents a study on the generalization of aging-invariant face recognition to non-aging scenarios. Chapter 5 presents a study on how heterogeneous demographics may be exploited to improve face recognition performance. Chapter 6 studies the task of matching a caricature sketch to a facial photograph. Finally, we conclude with the findings of this dissertation and suggestions for future research in Chapter 7.

Chapter 2

Forensic Sketch Recognition

2.1 Introduction

Progress in biometric technology has provided law enforcement agencies additional tools to help determine the identity of criminals. In addition to DNA and circumstantial evidence, if a latent fingerprint is found at an investigative scene, or a surveillance camera captures an image of a suspect's face, then these cues may be used to help determine the culprit's identity using automated biometric identification. However, many crimes occur where none of this information is present, but instead an eyewitness account of the crime is available. In these circumstances a forensic artist is often used to work with the witness in order to draw a sketch that depicts the facial appearance of the culprit according to the verbal description. Once the sketch image of the transgressor is complete, it is then disseminated to law enforcement officers and media outlets with the hope that someone who knows the suspect will come forward. These sketches are known as forensic sketches, and this chapter describes a robust method for matching forensic sketches to large mug shot (image) databases maintained by law enforcement agencies.

Improving forensic sketch recognition performance is perhaps the most impactful area of heterogeneous face recognition research [55]. This is because enabling a search of a large mug shot or driver license database using a forensic sketch is equivalent to a face search using a verbal description. That is, we are able to search digital face image databases without even having a face image as a query. As such, the research presented in this chapter offers a strong contribution to the goals of this dissertation.

Two different types of face sketches are discussed in this chapter: viewed sketches and forensic sketches (see Figure 2.1). Viewed sketches are sketches that are drawn while viewing a photograph of the person or the person himself. Forensic sketches, on the other hand, are drawn by interviewing a witness to gain a description of the suspect. Published research on sketch to photo matching to this point has primarily focused on matching viewed sketches [53] [138] [82] [158] [149], despite the fact that real world scenarios only involve forensic sketches.
Both forensic sketches and viewed sketches pose challenges to face recognition due to the fact that probe sketch images contain different textures compared to the gallery photographs they are being matched against. However, forensic sketches pose additional challenges due to a witness's inability to exactly remember the appearance of a suspect and her subjective account of the description, which often results in inaccurate and incomplete forensic sketches. Experimental results on viewed sketches are included primarily for historical reasons, since all available research to date on sketch recognition has focused on viewed sketches. (A viewed sketch is a facial sketch drawn while viewing a photograph of the subject. The scenario is not particularly interesting because the photograph itself could be queried in the FR system.)

We highlight two key difficulties in matching forensic sketches: (1) matching across image modalities, and (2) performing face recognition despite possibly inaccurate depictions of the face. In order to solve the first problem we use local feature-based discriminant analysis (LFDA) to perform minimum distance matching between sketches and photos, which is described in Section 2.3 and summarized in Figure 2.2 and Figure 2.3. The second problem is considered in Section 2.5, where analysis and improvements are offered for matching forensic sketches against large mugshot galleries.

Figure 2.1: The difference between viewed sketches and forensic sketches. (a) Viewed sketches and their corresponding photographs. (b) Two pairs of good quality forensic sketches and the corresponding photographs. (c) Two pairs of poor quality forensic sketches and the corresponding photographs. Sketches were labeled as "good" if they (subjectively) exhibited a mostly accurate portrayal of a subject. Otherwise, if a sketch did not strongly resemble the subject, it was labeled as "poor".

The contributions of the chapter are summarized as follows: (i) we observe a substantial improvement in matching viewed sketches to photos over published algorithms using the proposed local feature-based discriminant analysis; (ii) we present the first large-scale published experiment on matching operational forensic sketches; (iii) using a mugshot gallery of 10,159 images, we perform race and gender filtering to improve the matching results; (iv) all experiments are validated by comparing the proposed method against a leading commercial face recognition engine. The last point is significant since earlier studies on viewed sketches used a PCA (Eigenface) matcher as the baseline. It is now well known that the performance of a PCA matcher can be easily surpassed by other face matchers.

2.2 Related Work

Most research on sketch matching has dealt with viewed sketches. Much of the early work in matching viewed sketches was performed by Tang et al. [137] [138] [82] [149] [78]. These studies share a common approach in that a synthetic photograph is generated from a sketch (or vice-versa), and standard face recognition algorithms are then used to match the synthetic photographs to gallery photographs. The different synthesis methods used include an eigentransformation method (Tang and Wang [137] [138]), local linear embedding (Liu et al. [82]), and belief propagation on a Markov random field (Wang and Tang [149]). Other synthesis methods have been proposed as well [158] [29] [152] [83] [74]. The impact of matching sketches drawn by different artists was studied by Al Nizami et al. [98].
We also proposed a method of sketch matching that uses the same feature-based approach that has been successful in other heterogeneous face recognition scenarios (specifically, matching near infrared face images to visible light images) [53]. In using SIFT feature descriptors [84], the intrapersonal variations between the sketch and photo modalities were diminished while still maintaining sufficient information for interclass discrimination. Such an approach is similar to other methods proposed in the literature [77] [68] [52] for matching near infrared (NIR) images to visible light (VIS) images, where local binary pattern (LBP) [99] feature descriptors are used to describe both NIR and VIS images.

In this chapter we extend our previous feature-based approach to sketch matching [53]. This is achieved by using local binary patterns (LBP) in addition to the SIFT feature descriptor, which is motivated by LBP's success in a similar heterogeneous matching application by Liao et al. [77]. Additionally, we extend our feature-based matching to learn discriminant projections on "slices" of feature patches, which is similar to the method proposed by Lei and Li [68].

2.3 Feature-based Sketch Matching

Feature descriptors describe an image or image region using a feature vector that captures the distinct characteristics of the image [95]. Image-based features have been shown to be successful in face recognition, most notably with the use of local binary patterns [3].

2.3.1 Feature-based Representation

We will now describe how to represent a face with image descriptors. Because most image descriptors are not sufficiently verbose to fully describe a face image, the descriptors are computed over a set of uniformly distributed subregions of the face. The feature vectors at sampled regions are then concatenated together to describe the entire face.

Figure 2.2: An overview of training using the LFDA framework (training set of sketch/photo correspondences → overlapping patches → SIFT and MLBP feature descriptors → grouping of patch vectors into slices → discriminant projection per slice). Each sketch and photo is represented by SIFT and MLBP feature descriptors extracted from overlapping patches. After grouping "slices" of patches together into feature vectors Φ(k) (k = 1 · · · N), we learn a discriminant projection Ψk for each slice.

Figure 2.3: An overview of matching using the LFDA framework (probe sketch and gallery photos → feature extraction and grouping into slices → discriminant projection → matching). Recognition is performed after combining each projected vector slice into a single vector ϕ and measuring the normed distance between a probe sketch and gallery photo.

The feature sampling points are chosen by setting two parameters: a region (or patch) size s, and a displacement size δ. The region size s defines the size of the square window over which the image feature is computed. The displacement size δ states the number of pixels the patch is displaced for each sample, thus (s − δ) is the number of overlapping pixels in two adjacent patches. This is analogous to sliding a window of size s × s across the face image in a raster scan fashion. For an H × W image the number of horizontal (N) and vertical (M) sampling locations are given by N = (W − s)/δ + 1 and M = (H − s)/δ + 1.
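The patch grid defined by s and δ can be enumerated directly from the formulas above; the sketch below (a minimal NumPy illustration with hypothetical function names; the descriptor itself is left as a pluggable callable) extracts the overlapping patches in raster-scan order and concatenates their descriptors into a single vector Φ:

```python
import numpy as np

def dense_patch_features(image, descriptor, s=32, delta=16):
    """Slide an s x s window across the image with step delta and concatenate
    the per-patch descriptors into one vector (Phi). image is an H x W array;
    descriptor maps an s x s patch to a 1-D feature vector."""
    H, W = image.shape
    N = (W - s) // delta + 1    # horizontal sampling locations
    M = (H - s) // delta + 1    # vertical sampling locations
    features = []
    for row in range(M):        # raster-scan ordering of the M x N patch grid
        for col in range(N):
            y, x = row * delta, col * delta
            patch = image[y:y + s, x:x + s]
            features.append(descriptor(patch))
    return np.concatenate(features)   # the (M * N * d)-dimensional vector Phi
```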
At each of the M · N patches, we compute the d-dimensional image feature vector φ. These image feature vectors are concatenated into one single (M · N · d)-dimensional image vector Φ. Whereas f(I) : I → φ denotes the extraction of a single feature descriptor from an image, sampling multiple features using overlapping patches is denoted as F(I) : I → Φ. Minimum distance sketch matching can be performed directly using this feature-based representation of subjects i and j by computing the normed vector distance ||F(I_i) − F(I_j)|| [53].

In our sketch matching framework, two feature descriptors are used: SIFT and LBP. The SIFT feature descriptor quantizes both the spatial location and gradient orientations within an s × s sized image patch, and computes a histogram in which each bin corresponds to a combination of a particular spatial location and orientation. For each image pixel, the histogram bin corresponding to its quantized orientation and location is incremented by the product of: (i) the magnitude of the image gradient at that pixel, and (ii) the value of a Gaussian function centered on the patch with a standard deviation of s/2. Tri-linear interpolation is used on the quantized location of the pixel, which addresses image translation noise. The final vector of histogram values is normalized to sum to one. The reader is referred to [84] for a more detailed description of how the SIFT feature descriptor is designed. It is important to reiterate that because we are sampling SIFT feature descriptors from a fixed grid, we do not use SIFT keypoint detection; the SIFT feature descriptor is computed at predetermined locations.

For the local binary pattern feature descriptor [99], we extended the LBP to describe the face at multiple scales, by combining the LBP descriptors computed with radii r ∈ {1, 3, 5, 7}. We refer to this as the multi-scale local binary pattern (MLBP). MLBP is similar to other variants of the LBP, such as MB-LBP [77], but we obtained slightly improved face recognition accuracy using MLBP. The choice of the MLBP and SIFT feature descriptors was based on their reported success in heterogeneous face recognition and on a quantitative evaluation of their ability to discriminate between subjects in sketches and photos [58]. Though variants of LBPs have led to substantial success in previous heterogeneous face recognition scenarios, the use of SIFT feature descriptors for this application is quite novel. However, recent work [53] clearly demonstrates the success of SIFT feature descriptors for viewed sketch recognition. SIFT feature descriptors have also been shown to perform comparably to LBP feature descriptors in a standard face recognition scenario [93]. These feature descriptors are well suited for sketch recognition because they describe the distribution of the direction of edges in the face; this information is contained in both sketches and photos. By densely sampling these descriptors, sufficient discriminatory information is retained to determine a subject's identity more accurately than the previously used synthesis methods [53].

The feature-based representation requires each sketch and photo image to be normalized by rotating the angle between the two eyes to 0°, scaling the images to a 75 pixel interocular distance, and cropping the image size to 200 by 250 pixels.
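The multi-scale LBP described above simply concatenates ordinary LBP histograms computed at several radii. A minimal sketch (assuming scikit-image is available; the 8 sampling points and 59-bin uniform-pattern histogram are our assumptions, chosen because 4 × 59 = 236 would account for the MLBP descriptor size mentioned in the next section, though the exact binning used in the experiments is not specified here):

```python
import numpy as np
from skimage.feature import local_binary_pattern

def mlbp_descriptor(patch, radii=(1, 3, 5, 7), points=8):
    """Concatenate LBP histograms computed at several radii over one patch."""
    hists = []
    for r in radii:
        # 'nri_uniform' yields the standard 59 uniform LBP codes for P = 8
        codes = local_binary_pattern(patch, P=points, R=r, method="nri_uniform")
        h, _ = np.histogram(codes, bins=59, range=(0, 59))
        hists.append(h / max(h.sum(), 1))   # normalize each scale's histogram
    return np.concatenate(hists)
```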
The experimental results reported in Sections 2.4 and 2.6 for each of the two descriptors are based on a sum of score fusion of the match scores generated from computing descriptors with patch sizes of s = 16 and s = 32. This also holds for the global discriminant described in Section 2.3.2; we fuse the matching scores computed using the two separate patch sizes of 16 and 32. When combining the SIFT and MLBP features, sum of score fusion is used as well.

2.3.2 Local Feature-based Discriminant Analysis

With both sketches and photos characterized using SIFT and MLBP image descriptors, we further refine this feature space using discriminant analysis. This is done to reduce the large dimensionality of the feature vector Φ. A straightforward approach would be to apply classical subspace analysis (such as LDA) directly on Φ, and to extract discriminant features for classification. However, there are several problems with this approach. First, the feature dimensionality is too high for direct subspace analysis. In our experiments each image is divided into either 154 overlapping patches (for s = 32) or 720 overlapping patches (for s = 16), with each patch producing a 128-dimensional SIFT descriptor or a 236-dimensional MLBP descriptor. The second problem is the possibility of overfitting due to the small sample size (SSS) problem [115].

In order to handle the combination of a large dimensionality (feature size) and small sample size, an ensemble of linear discriminant classifiers, called local feature-based discriminant analysis (LFDA), is proposed. Other discriminant analysis methods have been proposed to handle the SSS problem, such as random sampling LDA [148], regularized LDA [85], and direct LDA [45]. However, we choose the proposed LFDA method because it is designed to work with a feature descriptor representation (as opposed to an image pixel representation), and it resulted in high recognition accuracy.

In the LFDA framework each image feature vector Φ is first divided into "slices" of smaller dimensionality, where slices correspond to the concatenation of feature descriptor vectors from each column of image patches. Next, discriminant analysis is performed separately on each slice by performing the following three steps: PCA, within-class whitening, and between-class discriminant analysis. Finally, PCA is applied to the new feature vector to remove redundant information among the feature slices and extract the final feature vector.

To train the LFDA, we use a training set consisting of pairs of a corresponding sketch and photo of n subjects (which are the n classes). This results in a total of 2n training images with two supports for each subject i: the image feature representation of the sketch $\Phi_s^i = F(I_s^i)$ and of the photo $\Phi_p^i = F(I_p^i)$. We combine these feature vectors as columns of training matrices and refer to them as $X_s = [\Phi_s^1 \, \Phi_s^2 \ldots \Phi_s^n]$ for the sketches, $X_p = [\Phi_p^1 \, \Phi_p^2 \ldots \Phi_p^n]$ for the photos, and $X = [\Phi_s^1 \ldots \Phi_s^n \; \Phi_p^1 \ldots \Phi_p^n]$ for the photos and sketches combined.

The first step in LFDA is to separate the image feature vector into multiple sub-vectors or slices. Given the M × N array of patches consisting of SIFT or MLBP descriptors, we create one slice for each of the N patch columns. With a d-dimensional feature descriptor, each of the N slices is of dimensionality (M · d). We call this a "slice" because it is similar to slicing an image into N pieces.
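Concretely, if Φ is laid out patch by patch in raster-scan order (as in the sampling sketch given earlier), slicing amounts to regrouping the patch descriptors by column. A minimal illustration (our own helper; it only reshuffles indices and assumes that raster-scan ordering):

```python
import numpy as np

def split_into_slices(Phi, M, N, d):
    """Regroup a (M*N*d,) raster-scan feature vector Phi into N column slices,
    each of dimensionality M*d (one slice per column of image patches)."""
    patches = Phi.reshape(M, N, d)      # patches[row, col] is one d-dim descriptor
    return [patches[:, k, :].reshape(M * d) for k in range(N)]
```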
After separating the feature vectors into slices, the per-slice training matrices become $X_k^s \in \mathbb{R}^{M \cdot d \times n}$, $X_k^p \in \mathbb{R}^{M \cdot d \times n}$, and $X_k \in \mathbb{R}^{M \cdot d \times 2n}$ (k = 1 . . . N), which are all mean centered. We next reduce the dimensionality of each training slice matrix $X_k$ using the PCA matrix $W_k \in \mathbb{R}^{M \cdot d \times r}$ with r eigenvectors. The purpose is to remove the noisy features which are usually associated with the trailing eigenvectors with the smallest eigenvalues. In our experiments we use the 100 eigenvectors with the largest eigenvalues (which preserves about 90% of the variance). The discriminant extraction proceeds by generating the mean projected class vectors

$$Y_k = W_k^T (X_k^s + X_k^p)/2 \quad (2.1)$$

which are used to center the sketch and photo training instances of each class by

$$\tilde{X}_k^s = W_k^T X_k^s - Y_k, \qquad \tilde{X}_k^p = W_k^T X_k^p - Y_k \quad (2.2)$$

To reduce the intra-personal variation between the sketch and photo, a whitening transform is performed. Whitening the within-class scatter matrix reduces the dimensionality by discarding features that represent the principal intra-personal variations, which in this case correspond to intra-personal differences between sketches and photos. To do so, we recombine the training instances into $\tilde{X}_k = [\tilde{X}_k^s \; \tilde{X}_k^p]$. PCA is performed on $\tilde{X}_k$, such that the computed PCA projection matrix $\tilde{V}_k \in \mathbb{R}^{100 \times 100}$ retains all of the data variance from $\tilde{X}_k$. Let $\Lambda_k \in \mathbb{R}^{100 \times 100}$ be a diagonal matrix whose entries are the eigenvalues of the corresponding PCA eigenvectors $\tilde{V}_k$. The whitening transform matrix is then $V_k = \tilde{V}_k \Lambda_k^{-1/2}$.

The final step is to compute a projection matrix that maximizes the inter-person scatter by performing PCA on $V_k^T Y_k$ (which is the whitening transform of the mean class vectors). Using all but one of the eigenvectors in the PCA projection matrix, the resultant projection matrix is denoted as $U_k \in \mathbb{R}^{100 \times 99}$. This results in the final projection matrix for slice k

$$\Psi_k = W_k V_k U_k \quad (2.3)$$

With each local feature-based discriminant trained, we match sketches to photos using nearest neighbor matching on the concatenated slice vectors. We first separate the feature representation of an image into individual slices

$$\Phi = [\Phi(1)^T \; \Phi(2)^T \ldots \Phi(N)^T]^T \quad (2.4)$$

where $\Phi(k) \in \mathbb{R}^{M \cdot d}$ is the k-th slice feature vector. We then project each slice using its LFDA projection matrix $\Psi_k$, yielding the new vector representation $\varphi \in \mathbb{R}^{N \cdot 99}$

$$\varphi = \left[ (\Psi_1^T \Phi(1))^T \; (\Psi_2^T \Phi(2))^T \ldots (\Psi_N^T \Phi(N))^T \right]^T \quad (2.5)$$

With the LFDA representations of a sketch $\varphi_s$ and a photo $\varphi_p$, the normed distance $||\varphi_s - \varphi_p||$ is used to select the gallery photo with the minimum distance to the probe sketch.

Table 2.1: Rank-1 recognition rates (%) for matching viewed sketches using the CUHK public dataset. The standard deviation across the five random splits for each method in the middle and right columns is less than 1%.
Baseline methods: FaceVACS [1], 90.37; BP Synthesis [149], 96.30; SIFT Descriptor-based [53], 97.87.
Without LFDA: SIFT, 97.00; MLBP, 96.27; SIFT + MLBP, 97.33.
With LFDA: SIFT, 99.27; MLBP, 98.60; SIFT + MLBP, 99.47.

The proposed LFDA algorithm is a simple yet effective method. From the results in Section 2.4, we can clearly see that LFDA is able to significantly improve the recognition performance over the basic feature-based sketch matching framework. Similar to other variants of LDA that are designed to handle the small sample size problem [45] [85] [148], LFDA has several advantages over traditional linear discriminant analysis (LDA).
First, LFDA is more effective in handling large feature vectors. The idea of segregating the feature vectors into slices allows us to work on more manageably sized data with respect to the number of training images. Second, because the subspace dimensionality is fixed by the number of training subjects, when dealing with the smaller sized slices the LFDA algorithm is able to extract a larger number of meaningful features. This is because the dimensionality of each slice subspace is bounded by the same number of subjects as a subspace for the entire feature representation.

2.4 Viewed Sketch Matching Results

In order to compare our proposed LFDA framework to published methods on sketch matching, we evaluated our method using viewed sketches from the CUHK dataset [149]. (The CUHK Face Sketch Database is available for download at http://mmlab.ie.cuhk.edu.hk/facesketch.html.) This dataset consists of 606 corresponding sketch/photo pairs that were drawn from three face datasets: (1) 123 pairs from the AR face database [89], (2) 295 pairs from the XM2VTS database [92], and (3) 188 pairs from the CUHK student database [137]. Each of these sketch images was drawn by an artist while looking at the corresponding photograph of the subject. Two examples of these viewed sketches are shown in Figure 2.1(a). For the methods presented in this chapter, all results shown are the recognition rates averaged over five separate random splits of 306 training subjects and 300 test subjects.

The results of the viewed sketch matching experiment are summarized in Table 2.1. The first column of the table shows the baseline methods, which include the top two performing methods in the literature [53] [149] (each used 306 training subjects and 300 test subjects) and Cognitec's FaceVACS commercial face recognition engine [1]. FaceVACS has been shown [53] to perform at the same level as earlier solutions specifically trained for viewed sketch recognition [138]. In the second column the matching accuracies achieved by directly comparing SIFT and MLBP feature vectors Φ are listed. The method 'SIFT + MLBP' indicates a sum of score fusion [123] of the match scores from SIFT matching and MLBP matching. While both the SIFT and MLBP methods offer similar levels of performance, using LFDA (third column) the accuracy increases to the point where (on average) fewer than two sketches are incorrectly identified out of the 300 sketches in the probe set.

While LFDA was able to reduce the error by half, the use of LDA actually induced higher error. In the same experiment shown in Table 2.1, we applied LDA on the entire feature vector Φ instead of breaking it into slices and performing LDA on each slice vector as is done in LFDA. The accuracy of LDA+SIFT was 95.47%, LDA+MLBP was 91.53%, and (SIFT+MLBP)+LDA was 97.07%. In each case LDA actually lowered the accuracy relative to LFDA. The decrease in accuracy observed when applying the standard LDA is due to the small sample size problem and the resulting curse of dimensionality [115]. Given our large feature representation (for a 32 pixel patch size, the SIFT representation contains 19,712 components and the MLBP representation contains 36,344 components), the subspace projections are overfit to the training data. Because LFDA is an ensemble method, it is better suited to overcome this overfitting problem. Other LDA variants have been shown to handle the small sample size problem as well, such as RSLDA [148] and regularized LDA (R-LDA) [85].
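The 'SIFT + MLBP' entries above are a simple sum-of-score fusion. A minimal sketch of that step (our own helper; min-max normalization is one common choice, and the thesis does not commit to a specific normalization here):

```python
import numpy as np

def sum_score_fusion(score_lists):
    """Fuse score vectors from several matchers (each over the same gallery)
    by min-max normalizing each one and summing."""
    fused = np.zeros(len(score_lists[0]), dtype=float)
    for s in score_lists:
        s = np.asarray(s, dtype=float)
        span = s.max() - s.min()
        if span > 0:
            s = (s - s.min()) / span   # min-max normalize to [0, 1]
        fused += s
    return fused                        # rank-1 match = gallery index with the largest fused score
```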
2.5 Matching Forensic Sketches

The available methods for matching forensic sketches to photos are limited. Uhl and Lobo [144] proposed a now antiquated method of matching sketches drawn by forensic artists using photometric standardization and facial features. Yuen and Man [156] matched lab generated forensic composites to photographs based on point distribution models.

Table 2.2: Demographics of the 159 forensic sketch images and the 10,159 mugshot gallery images.

                        Forensic Sketches    Mugshot Gallery
  Caucasian             58.49%               46.43%
  African American      31.45%               46.93%
  Other                 10.06%               6.64%
  Male                  91.19%               84.33%
  Female                8.81%                15.52%
  Unknown               0.00%                0.03%

Figure 2.4 (panels (a)-(c)): An example of the internal (b) and external (c) features of the face image in (a). Humans tend to use the internal facial features for recognizing faces they are familiar with, and the external features for recognizing faces they are unfamiliar with [155]. Witnesses of a crime are generally unfamiliar with the culprit, therefore the external facial features should be more salient in matching forensic sketches.

2.5.1 Forensic Sketch Database

In our study we used a dataset consisting of 159 forensic sketches, each with a corresponding photograph of the subject who was later identified by the law enforcement agency as the suspect. All of these sketches were drawn by forensic sketch artists working with witnesses who provided verbal descriptions after crimes were committed by an unknown culprit. The corresponding photographs (mugshots) are the result of the subject later being identified, possibly due to citizens coming forward to provide clues. The forensic sketch data set used here comes from four different sources: (1) 73 images from forensic sketch artist Lois Gibson [31], (2) 43 images from forensic sketch artist Karen Taylor [139], (3) 39 forensic sketches provided by the Michigan State Police Department, and (4) 4 forensic sketches provided by the Pinellas County Sheriff's Office. In addition to these 159 corresponding forensic sketch and photo pairs, we also made use of a dataset of 10,159 mugshot images provided by the Michigan State Police to enlarge the gallery size. Thus, the matching experiments attempt to replicate real world scenarios where a law enforcement agency would query a large gallery of mugshot images with a forensic sketch. Examples of the forensic sketches used in our experiments are shown in Figures 2.1, 2.9, 2.10, and 2.11.

In some cases a witness's memory, and hence the description of a suspect, is inaccurate. This causes forensic sketches drawn from such descriptions to be of poor quality in terms of not accurately capturing all the facial features of the suspect. For most of these sketches, it is unlikely that they can be successfully matched automatically to the corresponding photos because they barely resemble the subject. For this reason, we separated our forensic sketches into two categories: good quality and poor quality. This separation was performed subjectively by looking at the corresponding pairs (sketch and photo) and labeling them as "good" if the sketch possessed a reasonable resemblance to the subject in the photo, and labeling them as "poor" if the sketch was grossly inaccurate. We believe this leads to a more accurate portrayal of the performance of the proposed automatic sketch to photo matching. Figure 2.1 shows the difference between good quality and poor quality sketches.
2.5.2 Human Memory and Forensic Sketches

A distinct difference between a viewed sketch and a forensic sketch is that the forensic sketch may have many inaccuracies due to the witness's inability to correctly remember the suspect's face. A body of psychological research exists that focuses on a person's ability to successfully recall the appearance of an individual she is unfamiliar with and whom she viewed only momentarily [17, 27, 155]. A consistent finding of these studies is that the facial features used to recognize someone depend on the level of familiarity. In this respect, facial features are separated into internal and external features (see Figure 2.4). When we are familiar with the person we are attempting to recognize (e.g. a co-worker, family member, or celebrity), we predominantly make use of the internal facial features for identification [17, 155]. These features include the nose, eyes, eyebrows, and mouth. Most research in automatic face recognition has observed that these internal features are also the most discriminative areas of the face [67]. When we are attempting to recognize someone who is unfamiliar to us, the external features of the face are predominantly used to establish identity [17, 155]. External features consist of the outer region of the face, including the chin, hairstyle, and general shape of the face.

Frowd et al. [27] studied whether humans are best able to match forensic sketches using the internal or external features of the face. In their experiments, test subjects were shown the photograph of a celebrity they were unfamiliar with and given approximately one minute to remember the appearance. Two days later the subjects worked with a forensic sketch artist to draw a sketch of the person they viewed earlier in the photograph. Using these composites, a separate set of subjects that had familiarity with the same celebrities were asked to identify two different versions of the sketches: (i) sketches with only the interior regions of the face shown, and (ii) sketches with only the exterior regions of the face shown. The experiments concluded that higher identification rates were achieved using the exterior regions of the face [27].

Frowd et al.'s results are based on a controlled experiment, so we must tread lightly in using them for automated face recognition. One of the most important properties for a biometric trait is permanence [40], which the external regions of the face do not satisfy well. By growing or removing a beard, changing hairstyles, or donning headgear, a person can drastically change the appearance of their external facial features. Therefore, assigning a higher prior probability to the decisions made from external forensic sketch regions over internal regions may not be a wise choice.

Figure 2.5 (panels: (a) Internal, (b) External, (c) Eyes, (d) Nose, (e) Mouth, (f) Chin): Masks used for region based forensic sketch matching. Shown are the mean photo patches at each patch location used for a particular region. The mosaic effect is due to the fact that face patches are extracted in an overlapping manner.

2.5.3 Forensic Sketch Region Saliency

Due to the observation in the human cognition studies that different regions of the face have different saliency, we measure the performance of automatic sketch matching using only certain regions of the face. For our feature-based framework, it is quite easy to implement this by selecting only the patches in the face that correspond to a given region.
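As a rough illustration of such region-restricted matching, the snippet below compares two faces using only the patches that fall inside a chosen mask. The per-patch feature arrays, the region_masks dictionary, and the region names are hypothetical placeholders introduced for this sketch; the actual masks used in our experiments are those shown in Figure 2.5.

    import numpy as np

    def region_distance(probe_feats, gallery_feats, patch_idx):
        # Distance between two faces computed over a subset of patches only.
        p = np.concatenate([probe_feats[i] for i in patch_idx])
        g = np.concatenate([gallery_feats[i] for i in patch_idx])
        return np.linalg.norm(p - g)

    def rank_gallery_by_region(probe_feats, gallery_feats_list, region_masks, region):
        # region_masks maps a region name (e.g. 'chin') to the indices of its patches.
        idx = region_masks[region]
        dists = [region_distance(probe_feats, g, idx) for g in gallery_feats_list]
        return np.argsort(dists)   # gallery indices, most similar first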
We considered six separate face regions for localized identification: (1) internal, (2) external, (3) eyes, (4) nose, (5) mouth, and (6) chin. Figure 2.5 shows the patches used for each of these face regions (with patch size s = 32) and the average intensity for each patch (the mean patch). Thus, when matching using one of the masks, we performed distance matching using only the patches shown in each mask. In Section 2.6 we will show the results of forensic sketch matching using only these face regions.

Figure 2.6: Performance of matching forensic sketches that were labeled as good (49 sketches) and poor (110 sketches) against a gallery of 10,159 mugshot images without using race/gender filtering. (CMC curves, accuracy versus rank, for LFDA and FaceVACS on the good and poor sketches.)

2.5.4 Large-Scale Forensic Sketch Matching

Matching forensic sketches to large mugshot galleries differs in several respects from traditional face identification scenarios. When presenting face recognition results in normal recognition scenarios, we are generally concerned with exactly identifying the subject in question in a fully automated manner. For example, when preventing multiple passports from being issued to the same person, human interaction should be limited to only ambiguous cases. This is due to the large volume of requests such a system must process. The same is true for matching arrested criminals against existing mugshot databases to confirm their identity. However, when matching forensic sketches it is not critical for the top retrieval result to be the correct subject, as long as it is in the top R retrieved results, say R = 50. This is because the culprit depicted in a forensic sketch has typically committed a heinous crime (e.g. murder, rape, armed robbery) that will receive a large amount of attention from investigators. Instead of accepting or dismissing only the top retrieved photo, law enforcement officers will consider the top R retrieval results as potential suspects. Generally, many of the returned subjects can be immediately eliminated as suspects for various reasons, such as if they are currently incarcerated or deceased. The remaining candidates can each then be investigated for their culpability of committing the crime. This scenario is also true of crimes in which a photograph of a suspect is available. Investigators will consider the top R retrieval results instead of only the highest match. Based on the practice followed in forensics, we would like R to be around 50; that is, we are mainly concerned with whether or not the true subject is within the top 50 retrieved images.

In order to improve the accuracy of matching forensic sketches, we utilize ancillary or demographic information provided by the witness, to be used as a soft biometric [42]. For example, suppose the witness reports that the race of the culprit is Caucasian; then we can eliminate all non-Caucasian members of the gallery, not only to speed up the matching but also to improve the matching performance. The same is true for gender: if the suspect is reported to be a female then we disregard any male subjects in the gallery. To use this approach, we manually labeled all of the 10,159 mugshot images and all the forensic sketch/photo pairs in our database with race and gender.
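The filtering itself is straightforward; a minimal sketch is given below, assuming each gallery record carries the manually assigned race and gender labels. Retaining entries labeled 'unknown' so that an unlabeled true mate is never discarded is a design choice of this illustration, not a statement of how the experiments were run.

    def filter_gallery(gallery, race=None, gender=None):
        """Keep gallery entries consistent with the witness-reported demographics.
        gallery: list of dicts with keys 'features', 'race', and 'gender'."""
        kept = []
        for entry in gallery:
            if race is not None and entry['race'] not in (race, 'unknown'):
                continue
            if gender is not None and entry['gender'] not in (gender, 'unknown'):
                continue
            kept.append(entry)
        return kept

    # Example: a witness reports a Caucasian male suspect.
    # candidates = filter_gallery(mugshot_gallery, race='caucasian', gender='male')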
For gender, we considered one of three possible categories: male, female, and (in rare cases) unknown. For race we considered one of three categories: Caucasian, African-American, and "other"; the "other" category includes individuals who are of Hispanic, Asian, or multiple races. Table 2.2 lists the percentage of members from each race and gender category in the forensic sketches and the mugshot gallery used in our experiments. We lack additional ancillary information (e.g., age, height, scars, marks and tattoos) that could potentially be used to further improve the matching accuracy.

2.6 Forensic Sketch Matching Results

Forensic sketch recognition performance using the 159 forensic sketch images (probe set) and 10,159 mugshot images (gallery) will now be presented. In these matching experiments we use the local feature-based discriminant analysis (LFDA) framework presented in Section 2.3. Our matching uses sum-score fusion of MLBP and SIFT LFDA, as this provided the highest recognition performance for matching viewed sketches (Table 2.1).

The performance of matching sketches classified as good and poor can be found in Figure 2.6. There is a substantial difference in the matching performance of good sketches and poor sketches. Despite the fact that poor sketches are extremely difficult to match, the CMC plots in Figure 2.6 show that the proposed method performs roughly the same on the poor sketches as a state-of-the-art commercial matcher (FaceVACS) performs on the good sketches.

Figure 2.7 shows the recognition performance when race and gender information is used to filter the gallery. By utilizing this ancillary information, we can significantly increase the performance of forensic sketch recognition. We noticed a larger performance gain from using race information than from gender information. This is likely due to the more uniform distribution of race membership than gender membership in our gallery. The use of other demographic information such as age and height should offer further improvements.

Figure 2.7: Performance of matching good sketches with and without using ancillary demographic information (race and gender) to filter the results. (CMC curves, accuracy versus rank, for LFDA, LFDA with gender filtering, LFDA with race filtering, LFDA with race and gender filtering, and FaceVACS.)

Discriminatory information contained in individual face regions (eyes, nose, mouth, etc.) is shown in Figure 2.8. Again, this is achieved by first applying the masks in Figure 2.5 to the face feature patches. These results mostly agree with cognitive science research (Section 2.5.2) indicating that external regions of the face provide more discriminating information in matching forensic sketches. Between the eyes, nose, mouth, and chin, we found the chin to be the most informative region of the face.

Figure 2.8: Matching performance on the good sketches using race/gender filtering with SIFT and MLBP feature-based matching on only specific face regions (eyes, nose, mouth, chin, internal, external, and whole face).

In fact, using only the chin region for recognition we were able
to achieve a Rank-50 accuracy of 22.45% with a gallery size of 10,159 images, which is interesting given the fact that the chin is not generally regarded as an overly valuable feature in face recognition research.

Examples of failed retrievals are shown in Figure 2.9. While the top retrieved mugshot is not correct in these two examples, the probe sketch appears to be more similar to the top matched photo than the true photograph. This was nearly always the case: in the incorrect matchings, the top retrieved images appeared highly similar to the probe sketch. This can be explained by the subjective and often incorrect verbal description of the suspect provided by the witness. Figure 2.10 shows three of the best matches and Figure 2.11 shows three of the worst matches amongst all the good sketches using the proposed LFDA recognition method. For each image, we have listed the match rank returned by LFDA and FaceVACS.

Figure 2.9 (columns: Probe Sketch, Top Retrieval, True Subject): Two examples of typical cases in which the true subject photo (third column) was not retrieved at rank 1, but the impostor subject (second column) retrieved at rank 1 visually looks more similar to the sketch (first column) than the true subject.

Figure 2.10: Examples of three of the best matches using LFDA. Below each example are the rank scores obtained by the proposed LFDA method, FaceVACS, and component-based matching. Ranks (LFDA / FaceVACS / Eyes / Nose / Mouth / Chin / Internal / External): example 1: 1 / 320 / 210 / 152 / 1 / 33 / 3 / 3; example 2: 1 / 299 / 198 / 159 / 31 / 304 / 2 / 31; example 3: 1 / 2131 / 5 / 2 / 823 / 24 / 1 / 6.

Figure 2.11: Examples of three of the worst matches using LFDA. Below each example are the rank scores obtained by the proposed LFDA method, FaceVACS, and component-based matching. Ranks (LFDA / FaceVACS / Eyes / Nose / Mouth / Chin / Internal / External): example 1: 775 / 2255 / 166 / 298 / 1776 / 3508 / 101 / 2231; example 2: 1599 / 215 / 3237 / 2974 / 1018 / 3742 / 3012 / 2402; example 3: 1617 / 429 / 1992 / 3634 / 3725 / 52 / 2246 / 692.

One limitation of our study is the small number of forensic sketches in our dataset; obtaining a large collection of forensic sketches and the mated photographs from law enforcement agencies has not been easy. Not only does a small database limit the evaluation of our method, but it also affects the performance of our local feature-based discriminant analysis. LFDA needs a reasonably large number of training examples to learn the most discriminative projections. In the case of viewed sketch recognition we used 306 pairs of sketches and photos for training. For the forensic sketches, even if we performed leave-one-out cross validation there would still be only a small number of good quality training samples. For this reason, we trained the discriminant on the viewed sketches when matching forensic sketches. However, we believe that with a larger number of forensic sketches we could more properly train our discriminant and further improve the matching performance. The bottleneck in finding additional forensic sketches for our experiments is in obtaining the photograph mates for the sketches of the suspects who have not yet been identified (cold cases). While forensic sketches exist from numerous crimes, even if there is an eventual identification of the subject, the mated sketch and photo are often not stored together in a central database. We are currently working with various law enforcement agencies to increase our dataset of forensic sketch pairs.

2.7 Summary

We have presented methods and experiments in matching forensic sketches to photographs. Matching forensic sketches is a very difficult problem in heterogeneous face recognition for two main reasons: (1) forensic sketches are often an incomplete portrayal of the subject's face; and (2) we must match across image modalities, since the gallery images are photographs and the probe images are sketches.

One of the key contributions of this chapter is using SIFT and MLBP feature
descriptors to represent both sketches and photos. We have improved the accuracy of this representation by applying an ensemble of discriminant classifiers, and termed this framework local feature-based discriminant analysis (LFDA). The LFDA feature-based representation of sketches and photos was clearly shown to perform better on a public domain viewed sketch data set than previously published approaches.

Another major contribution of the chapter is the large-scale experiment on matching forensic sketches. While previous research efforts have focused on viewed sketches, most real world problems only involve matching forensic sketches. Using a collection of 159 forensic sketches, we matched the sketches against a gallery populated with 10,159 mugshot images. Further improvements to the LFDA method were achieved by utilizing ancillary information such as race and gender to filter the 10,159 member gallery. For a fair evaluation of our methods, we used a state-of-the-art face recognition system, FaceVACS [1]. Together, these improvements in forensic sketch recognition advance the state of the art and demonstrate the utility of heterogeneous face recognition, which is the focus of this dissertation. In developing a sketch recognition algorithm with substantially improved recognition accuracy, we offer a tool that is critical for assisting law enforcement agencies in apprehending suspects.

Chapter 3

Heterogeneous Face Recognition using Kernel Prototype Similarities

3.1 Introduction

In the previous chapter we discussed a solution to forensic sketch recognition. The solution provided was specific to forensic sketch recognition, and will generally not extend to other face recognition scenarios. The heterogeneous face recognition algorithm presented in this chapter is not built for any specific HFR scenario. Instead, it is designed to generalize to any HFR scenario. Further, this framework can be used for homogeneous face recognition (e.g. visible to visible face recognition) as well. Such a framework offers a strong contribution to the thesis of this dissertation by providing an improvement to the problem of heterogeneous face recognition as a whole.

Again, the motivation behind heterogeneous face recognition is that circumstances exist in which the face image to be identified is available only in a particular modality. For example, when a subject's face can only be acquired in nighttime environments, the use of infrared imaging may be the only modality for acquiring a useful face image of the subject.

Figure 3.1 (panels (a)-(d)): Example images from each of the four heterogeneous face recognition scenarios tested in our study, as also shown in Chapter 1. The top row contains probe images from (a) near-infrared, (b) thermal infrared, (c) viewed sketch, and (d) forensic sketch modalities. The bottom row contains the corresponding gallery photograph (visible band face image, called VIS) of the same subject.
Another example is situations in which no imaging system was available to capture the face image of a suspect, as addressed in Chapter 2. In this case a forensic sketch, drawn by a police artist based on a verbal description provided by a witness or the victim, is likely to be the only available source of a face image. Despite continued progress in the accuracy of face recognition systems [110], most commercial off-the-shelf (COTS) face recognition systems (FRS) are not designed to handle HFR scenarios. The need for face recognition systems specifically designed for the task of matching heterogeneous face images is of substantial interest.

This chapter proposes a unified approach to heterogeneous face recognition that (i) achieves high accuracy on multiple HFR scenarios, (ii) does not necessitate feature descriptors that are invariant to changes in image modality, (iii) facilitates recognition using different feature descriptors in the probe and gallery modalities, and (iv) naturally extends to additional HFR scenarios due to properties (ii) and (iii) above.

3.2 Related Work

3.2.1 Heterogeneous Face Recognition

A flurry of research has emerged providing solutions to various heterogeneous face recognition problems. This began with sketch recognition using viewed sketches, and has continued into other modalities such as near-infrared (NIR) and forensic sketches. In this section we highlight a representative selection of studies in heterogeneous face recognition as well as studies that use kernel based approaches for classification.

Tang et al. spearheaded the work in heterogeneous face recognition with several approaches to synthesize a sketch from a photograph (or vice-versa) [82, 138, 149]. Tang and Wang initially proposed an eigen-transformation method [138]. Later, Liu et al. performed the transformation using local linear embedding to estimate the corresponding photo patch from a sketch patch [82]. Wang and Tang proposed a Markov random field model for converting a sketch into a photograph [149]. Other synthesis methods have been proposed as well [29, 157]. A key advantage of synthesis methods is that once a sketch has been converted to a photograph, matching can be performed using existing face recognition algorithms. The proposed prototype framework is similar in spirit to these methods in that no direct comparison between face images in the probe and gallery modalities is needed.

The generative transformation-based approaches have generally been surpassed by discriminative feature-based approaches. A number of discriminative feature-based approaches to HFR have been proposed [12, 52, 59, 77], which have shown good matching accuracies in both the sketch and NIR domains. These approaches first represent face images using local feature descriptors, such as variants of local binary patterns (LBP) [99] and SIFT descriptors [84]. Liao et al. [77] first used this approach on NIR to VIS face recognition by processing face images with a difference of Gaussian filter and encoding them using multi-block local binary patterns (MB-LBP); Gentle AdaBoost feature selection was used in conjunction with R-LDA to improve the recognition accuracy. Klare and Jain followed this work on NIR to VIS face recognition by also incorporating SIFT feature descriptors and an RS-LDA scheme [52]. Bhatt et al. introduced an extended uniform circular local binary pattern to the viewed sketch recognition scenario [12]. Klare et al.
encoded both viewed sketches and forensic sketches using SIFT and MLBP feature descriptors, and performed local feature-based discriminant analysis (LFDA) to improve the recognition accuracy [59]. Yi et al. [154] offered a local patch-based method to perform HFR on partial NIR face images. The synthesis method by Li et al. is the only known method to perform recognition between thermal IR and visible face images [71]. The only method to perform recognition between forensic sketches and visible face images is Klare et al. [59], which is also one of only two methods, to our knowledge, that have been tested on two different HFR scenarios (viewed sketch and forensic sketch). The other method is Lin and Tang's [78] common discriminant recognition framework, which was applied to viewed sketches and near-infrared images. In this work the proposed prototype random subspace framework is tested on four different HFR scenarios.

3.2.2 Kernel Prototype Representation

The core of the proposed approach involves using a relational feature representation for face images (illustrated in Figure 3.2). By using kernel similarities between a novel face pattern and a set of prototypes, we are able to exploit the kernel trick [9], which allows us to generate a high dimensional, non-linear representation of a face image using compact feature vectors. A theoretical justification of the prototype-based approach is provided by Balcan et al. [9]: given access to the data distribution and a kernel similarity function, a prototype representation is shown to approximately maintain the desired properties of the high dimensional kernel space in a more efficient representation by using the kernel trick. While it is not common to refer to kernel methods as prototype representations, in this work we emphasize the fact that kernel methods use a training set of images (which serve as prototypes) to implicitly estimate the distribution of the patterns in a non-linear feature space. One key to our framework is that each prototype has one pattern for each image modality.

The proposed kernel prototype approach is similar to the object recognition method of Quattoni et al. [112]. Kernel PCA [48] and Kernel LDA [46, 81] approaches to face recognition have used a similar approach, where a face is represented as the kernel similarity to a collection of prototype images in a high dimensional space. These differ from the proposed method because only a single prototype is used per training subject, and because our approach is designed for heterogeneous face recognition. Our earlier work [53] utilized a similar approach but did not exploit the benefit of non-linear kernels; like the proposed method, however, it used a separate pattern from each image modality (sketch and photo) for each prototype.

3.2.3 Proposed Method

The proposed method presents a new approach to heterogeneous face recognition, and extends existing methods in face recognition. The use of a kernel similarity representation is well suited for the HFR problem because a set of training subjects with an image from each modality can be used as the prototypes, and, depending on the modality of a new image (probe or gallery), the image from each prototype subject can be selected from the corresponding modality. Unlike previous feature-based methods, where an image descriptor invariant to changes between the two HFR modalities was needed, the proposed framework only needs descriptors that are effective within each domain.
Further, the proposed method is effective even when different feature descriptors are used in the probe and gallery domains. The proposed prototype framework is described in detail in Section 3.4.

Figure 3.2: The proposed face recognition method describes a face as a vector of kernel similarities to a set of prototypes. Each prototype has one image in the probe and gallery modalities.

The accuracy of the HFR system is improved using a random subspace framework in conjunction with linear discriminant analysis, as described in Section 3.5. The previous method of feature-based random subspaces [52] is revisited in Section 3.6. Experimental results on four different heterogeneous face recognition scenarios (thermal, near-infrared, viewed sketch, and forensic sketch) are provided in Section 3.7, and all the results are benchmarked against a commercial face matcher. We demonstrate the strength of the proposed framework on many different HFR scenarios; however, the parameters controlling the framework are the same across all tested scenarios. This is due to the fact that the contribution of this work is a generic framework for improving solutions to the general HFR problem. Future use of the proposed framework will benefit from tuning the parameters to a specific scenario.

3.3 Preprocessing and Representation

All face images are initially represented using a feature-based representation. The use of local feature descriptors has been argued to closely resemble the postulated representation of the human visual processing system [122], and they have been shown to be well suited for face recognition [56].

3.3.1 Geometric Normalization

The first step in representing face images using feature descriptors is to geometrically normalize the face images with respect to the location of the eyes. This step reduces the effect of scale, rotation, and translation variations. The eye locations for the face images from all modalities are automatically estimated using Cognitec's FaceVACS SDK [1]. The only exceptions are the thermal face images, where the eyes are manually located for both the proposed method and the FaceVACS baseline because the available eye detectors do not work on thermal imagery.

Face images are geometrically normalized by (i) performing planar rotation to set the angle between the eyes to 0 degrees, (ii) scaling the images so that the distance between the two pupils is 75 pixels, and (iii) cropping the images to a height of 250 pixels and a width of 200 pixels, with the eyes horizontally centered and vertically placed at row 115.

3.3.2 Image Filtering

Face images are filtered with three different image filters. These filters are intended to help compensate for both intensity variations within an image domain (such as non-uniform illumination changes), as well as appearance variations between image domains. The second aspect is of particular importance for the direct random subspace framework (Section 3.6).

Figure 3.3: Example of thermal probe and visible gallery images after being filtered by a difference of Gaussian, center surround divisive normalization, and Gaussian image filters. The SIFT and MLBP feature descriptors are extracted from the filtered images, and kernel similarities are computed within this image descriptor representation.
An example of the effects of each image filter can be seen in Figure 3.3. The three image filters used are:

1. Difference of Gaussian. A difference of Gaussian (DoG) image filter has been shown by Tan and Triggs to improve face recognition performance in the presence of varying illumination [136], as well as in an NIR to VIS matching scenario by Liao et al. [77]. A difference of Gaussian image is generated by convolving an image with a filter obtained by subtracting a Gaussian filter of width $\sigma_1$ from a Gaussian filter of width $\sigma_2$ ($\sigma_2 > \sigma_1$). In this chapter, $\sigma_1 = 2$ and $\sigma_2 = 4$.

2. Center-Surround Divisive Normalization. Meyers and Wolf [93] introduced the center-surround divisive normalization (CSDN) filter in conjunction with their biologically inspired face recognition framework. The CSDN filter divides the value of each pixel by the mean pixel value in the $s \times s$ neighborhood surrounding the pixel. The non-linear nature of the CSDN filter is seen as a complement to the DoG filter. In our implementation $s = 16$.

3. Gaussian. The Gaussian smoothing filter has long been used in image processing applications to remove noise contained in high spatial frequencies while retaining the remainder of the signal. The width of the filter used in our implementation is $\sigma = 2$.

3.3.3 Local Descriptor Representation

Once an image is geometrically normalized and filtered using one of the three filters, local feature descriptors are extracted from uniformly distributed patches across the face. In this work we use two different feature descriptors to represent the face image: the SIFT descriptor [84] and Local Binary Patterns (LBP) [99]. The SIFT feature descriptor has been used effectively in face recognition [56], sketch to VIS matching [59], and NIR to VIS matching [52]. LBP features have a longer history of successful use in face recognition: Ahonen et al. originally proposed their use for face recognition [3], Li et al. demonstrated their use in NIR to NIR face matching [72], and they have also been successfully applied to several HFR scenarios [12, 52, 59, 77]. The SIFT and LBP feature representations are effective in describing face images due to their ability to encode the structure of the face and their stability in the presence of minor external variations [56].

Each feature descriptor describes an image patch as a d-dimensional vector that is normalized to sum to one. The face image is divided into a set of N overlapping patches of size 32x32. Each patch overlaps its vertical and horizontal neighbors by 16 pixels. With a face image of size 200x250, this results in a total of 154 patches. Multi-scale local binary patterns (MLBP) [59], a variant of the LBP descriptor, is used in place of LBP in this work. MLBP is the concatenation of LBP feature descriptors with radii r = {1, 3, 5, 7}.

Let I be a (normalized and filtered) face image. Let $f_{F,D}(I, a)$ denote the local feature descriptor extracted from image I at patch a, $1 \leq a \leq N$, using image filter F and feature descriptor D. The DoG, CSDN, and Gaussian image filters are, respectively, referred to as $F_d$, $F_c$, and $F_g$. The MLBP and SIFT descriptors are, respectively, referred to as $D_m$ and $D_s$. The SIFT descriptor yields a 128-dimensional feature descriptor, $f_{F,D_s}(I, a) \in \mathbb{R}^{128}$. The LBP descriptor yields a 59-dimensional feature descriptor, resulting in a 236-dimensional MLBP feature descriptor ($f_{F,D_m}(I, a) \in \mathbb{R}^{236}$). Finally, we have

$f_{F,D}(I) = [f_{F,D}(I, 1)^T, \ldots, f_{F,D}(I, N)^T]^T$   (3.1)

which is the concatenation of all N feature descriptors.
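A rough sketch of this filter-then-describe pipeline is shown below. The difference of Gaussian filter and the patch grid follow the parameters given above; patch_descriptor is a stand-in for any per-patch descriptor (e.g. a SIFT or MLBP implementation would be substituted) and is an assumption of the example, not the code used in our experiments.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def dog_filter(img, sigma1=2.0, sigma2=4.0):
        # Difference of Gaussian: narrow-minus-wide Gaussian smoothing.
        return gaussian_filter(img, sigma1) - gaussian_filter(img, sigma2)

    def extract_patches(img, size=32, step=16):
        # Overlapping 32x32 patches with a 16-pixel step (154 patches for a 250x200 face).
        h, w = img.shape
        return [img[r:r + size, c:c + size]
                for r in range(0, h - size + 1, step)
                for c in range(0, w - size + 1, step)]

    def face_feature_vector(img, image_filter, patch_descriptor):
        """f_{F,D}(I): concatenation of per-patch descriptors of the filtered image (Eq. 3.1).
        patch_descriptor maps a patch to a vector normalized to sum to one."""
        filtered = image_filter(img)
        descs = [patch_descriptor(p) for p in extract_patches(filtered)]
        return np.concatenate(descs)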
Thus, $f_{F,D_s}(I) \in \mathbb{R}^{128\cdot N}$ and $f_{F,D_m}(I) \in \mathbb{R}^{236\cdot N}$. Using the three filters and two descriptors, we have six different representations available for face image I, namely $f_{F_d,D_m}(I)$, $f_{F_c,D_m}(I)$, $f_{F_g,D_m}(I)$, $f_{F_d,D_s}(I)$, $f_{F_c,D_s}(I)$, and $f_{F_g,D_s}(I)$.

3.4 Heterogeneous Prototype Framework

The heterogeneous prototype framework begins with images from the probe and gallery modalities represented by (possibly different) feature descriptors for each of the N image patches, as described in the previous section. For compactness, let $f(I)$ represent $f_{F,D}(I)$. The similarity between two images is measured using a kernel function $k: f(I) \times f(I) \rightarrow \mathbb{R}$.

Let $T_1$ be a set of training images consisting of $n_{t_1}$ subjects. The training set contains a probe image $P_i$ and gallery image $G_i$ for each of the $n_{t_1}$ subjects. That is,

$T_1 = \{P_1, G_1, \ldots, P_{n_{t_1}}, G_{n_{t_1}}\}$   (3.2)

For both the probe and gallery modalities, two positive semi-definite kernel matrices $K^P$ and $K^G$ are computed between the training subjects. The probe kernel matrix is $K^P \in \mathbb{R}^{n_{t_1} \times n_{t_1}}$, and the gallery kernel matrix is $K^G \in \mathbb{R}^{n_{t_1} \times n_{t_1}}$. The entries in the i-th row and j-th column of $K^P$ and $K^G$ are

$K^P(i, j) = k(f(P_i), f(P_j))$   (3.3)
$K^G(i, j) = k(f(G_i), f(G_j))$   (3.4)

where $k(\cdot, \cdot)$ is the kernel similarity function. All the experiments in this chapter use the third degree polynomial kernel $k(f(P_i), f(G_i)) = (f(P_i)^T \cdot f(G_i))^3$, which was empirically chosen over a radial basis function kernel and a second degree polynomial kernel. Again, a generic framework is being presented, and parameter choices such as the kernel function should be optimized when this framework is engineered into a solution for a specific problem.

Let P and G, respectively, be new probe and gallery face images, i.e. $P, G \notin T_1$. The function $\phi'(P)$ returns a vector containing the kernel similarity of image P to each image $P_i$ in $T_1$. For gallery image G, $\phi'(G)$ returns a vector of kernel similarities to the gallery prototypes $G_i$. Thus, face images are represented as the relational vector $\phi'(P) \in \mathbb{R}^{n_{t_1}}$ for a probe image, and $\phi'(G) \in \mathbb{R}^{n_{t_1}}$ for a gallery image. More precisely, we have

$\phi'(P) = [k(f(P), f(P_1)), \ldots, k(f(P), f(P_{n_{t_1}}))]^T$   (3.5)
$\phi'(G) = [k(f(G), f(G_1)), \ldots, k(f(G), f(G_{n_{t_1}}))]^T$   (3.6)

Because the feature vectors $\phi'(P)$ and $\phi'(G)$ are a measure of the similarity between the image and the prototype training images, the feature spaces for similarity computation do not have to be the same for the probe and gallery modalities. For example, the probe images could be represented using $f_{F,D_s}(P)$ and the gallery images could be represented using $f_{F,D_m}(G)$. Despite the fact that the SIFT and MLBP feature descriptors are heterogeneous features, the relational representation allows them to be represented in a common feature space. This is based on the assumption that

$k(f(P), f(P_i)) \approx k(f(G), f(G_i))$   (3.7)

In practice we find that Eq. (3.7) does not precisely hold. To compensate for this, we introduce a method called the "R" transform to better align the probe and gallery modalities. The "R" transform uses a matrix R to align the probe prototype feature space with the gallery prototype feature space by:

$R = K^G ((K^P)^T K^P)^{-1} (K^P)^T$   (3.8)

We prove in Appendix A that the R transform is, in fact, a special case of Tang and Wang's eigen-transformation method [138].
Thus, while this transformation was originally applied to synthesize the appearance of a sketch in the photo domain [138], we improve this linear transformation method by incorporating (i) a non-linear feature space (i.e., the kernel prototype similarities), and (ii) a feature descriptor based representation (i.e., the LBP or SIFT representation used to measure the kernel similarities). We do not call the R transform an eigen-transformation because our special case allows for a simpler solution that does not make use of an eigen-decomposition. The importance of this R transform step is experimentally demonstrated in Section 3.7.

It is important to note that the scale (or distribution) of $K^P$ and $K^G$ will already be similar because the $\sigma$ parameter in the RBF kernel is tuned for each modality. Any extreme input values to the system (e.g. a non-face image) will cause the kernel similarity to degenerate to 0, thus allowing the system to remain stable with respect to scale.

The strength of the R transformation lies in its ability to leverage the constraint that the prototype representation is $n_{t_1}$-dimensional while the number of training samples at this phase is also $n_{t_1}$. This allows the R transformation to exactly align the probe prototype feature space to the gallery feature space (with respect to the training set). While this may raise the concern that the solution is too tightly fit to the training data, the extension to random sampling provided below alleviates this concern. The benefit of the R transformation is demonstrated quantitatively in the experimental results. Qualitatively, the R transformation is seen as a method to handle additional heterogeneous properties remaining in the new prototype similarity vectors. Despite the fact that $\phi'(\cdot)$ offers a common representation for both modalities, issues such as the similarities in each modality having different scales (e.g. from the use of different descriptors in the probe and gallery modalities) are addressed by the R transformation.

Using R, we now introduce the final prototype based representation $\phi(\cdot)$ as

$\phi(P) = R \cdot \phi'(P)$   (3.9)
$\phi(G) = \phi'(G)$   (3.10)

We alter the tersely presented notation to $\phi_{F,D}(I)$ to specify which feature descriptor and image filter are initially being used to represent the image I. For example, $\phi_{F_c,D_s}(I)$ denotes the prototype similarity of image I when represented using the CSDN image filter and SIFT descriptors.

3.4.1 Discriminant Analysis

A second training set is used to enhance the discriminative capabilities of the prototype representation. This independent training set $T_2$ consists of probe and gallery images of $n_{t_2}$ subjects such that $\forall \{P_i, G_i\} \in T_2$, $\{P_i, G_i\} \notin T_1$. A linear subspace of the prototype representation $\phi(\cdot)$ is learned using linear discriminant analysis (LDA) [10] on the images in $T_2$. LDA (and its variants) has consistently demonstrated its ability to improve the recognition accuracy of various algorithms. The benefits of LDA in the context of face recognition have been demonstrated on image pixel representations [10, 147], global Gabor features [75], and image descriptors [59, 77].

We learn the linear projection matrix W by following the conventional approach for high dimensional data, namely by first applying PCA, followed by LDA [10]. In all experiments the PCA step was used to retain 99.0% of the data variance. Let X
, φ(Pnt ), φ(Gnt ) 2 2 (3.11) Let X denote the mean-centered version of X. The initial step involves learning the subspace projection matrix W1 by performing principal component analysis (PCA) on X to reduce the dimensionality of the feature space. Next, the within-class and T between-class scatter matrices of W1 · X , respectively, SW and SB , are computed. The dimension of the subspace W1 is such that SW will be of full rank. The scatter matrices are built using each subject as a class, where one image each from the probe and gallery modality represents each class. Lastly, the matrix W2 is learned by solving the generalized eigenvalue problem SB · W2 = Λ · SW · W2 (3.12) This yields the LDA projection matrix W , where T W = W2 · W1 T T (3.13) Letting µ denote the mean of X, the final representation for an unseen probe or gallery image I using the prototype framework is W T · (φ(I) − µ). Subsequent uses of W in this chapter will assume the appropriate removal of the mean µ from φ(I) for terseness. 76 3.5 3.5.1 Random Subspaces Motivation The proposed heterogeneous prototype framework uses training data to (i) define the prototypes, (ii) learn the prototype transformation matrix R, and (iii) learn the linear subspace projection matrix W . The reliance on training data raises two (somewhat exclusive) issues in the prototype representation framework. The first issue is that the number of subjects in T1 (i.e. the number of prototypes) is generally too small for an expressive prototype representation. Balcan et al. demonstrated that the number of prototypes does not need to be large (with respect to the margin) to approximately replicate the data distribution [9]. However, their applications primarily dealt with binary classification and a small number of features. When applying a prototype representation to face recognition, a large number of classes (or subjects) and features are present. The small sample size problem implies that the number of prototypes needed to approximate the underlying data distribution should be large [115]. The second issue is also related to the small sample size problem [115]. This common problem in face recognition arises from too few training subjects to learn model parameters that are not susceptible to generalization errors. In the heterogeneous prototype framework this involves learning the R and W matrices that generalize well. A number of solutions exist to the small sample size problem in face recognition. Most are designed to handles deficiencies in the subspace W , such as dual-space LDA [147], and direct LDA [45]. Regularization methods such as R-LDA [85] also address degenerative properties of W , and could potentially be extended to the learned matrix R as well. However, these methods do not address the issue of too few prototypes for an expressive representation. 77 Another approach to handle deficiencies in learning parameters is the use of random subspaces [37]. The random subspace method samples a subset of features and performs training in this reduced feature space. Multiple sets (or bags) of randomly sampled features are generated, and for each bag the parameters are learned. This approach is similar to the classical bagging classification scheme [15], where the training instances are randomly sampled into bags multiple times and training occurs on each bag separately. 
Ensemble methods such as Ho’s random subspaces [37] and Breiman’s bagging classifiers have been demonstrated to increase the generalization ability of an arbitrary classifier [125]. Wang and Tang demonstrated the effectiveness of random sampling LDA (RSLDA) for face recognition. Their approach combined random subspaces and bagging by sampling both features and training instances. For each random sample space, a linear subspace was learned. Klare and Jain utilized this approach in the HFR scenario of NIR/VIS by using multiple subset samples of face patches described by local feature descriptors [52]. We consider random sampling ideal for the prototype recognition framework because it is able to satisfactorily address the two limitations: (i) the number of prototypes is multiplied by the number of bags, which improves the expressiveness of the prototype representation, and (ii) the use of an ensemble method improves deficiencies in the W and R matrices. Further unification of these two separate problems into a single solution offers a simpler framework. 3.5.2 Prototype Random Subspaces The prototype random subspace (P-RS) framework uses B different bags (or samples) of the N face patches. Each sample consists of α · N patches, 0 ≤ α ≤ 1. For bag b, b = 1 . . . B, we have the integer vector κb ∈ Zα·N , where each component of κb is a unique randomly sampled value from 1 . . . N . It is assumed that α is selected such 78 (a) (b) (c) (d) Figure 3.4: The process of randomly sampling image patches is illustrated. (a) All image patches. (b), (c), (d) Bags of randomly sampled patches. The kernel similarity between SIFT and MLBP descriptors at each patch of an input image and the prototypes of corresponding modality are computed for each bag. Images are from [89] that α · N is an integer. An example of randomly sampled face patches is shown in Figure 3.4. Let f (I, κb ) denote the concatenation of the α · N descriptors from the randomly selected patch indices in κb . That is, f (I, κb ) = f (I, κb (1))T , . . . , f (I, κb (α · N ))T T (3.14) G P Letting Kb and Kb denote the probe and gallery kernel similarity matrices for bag b, we modify Eqs. (3.3) and (3.4) to P Kb (i, j) = k(f (Pi , κb ) , f (Pj , κb )) (3.15) G Kb (i, j) = k(f (Gi , κb ) , f (Gj , κb )) (3.16) The preliminary prototype representation φ (·) is now modified to φ (·, ·) as 79 φ (P, κb ) = [ k (f (P, κb ), f (P1 , κb )) , . . . , k (f (P, κb ), f (Pnt , κb )) 1 (3.17) T φ (G, κb ) = [ k (f (G, κb ), f (G1 , κb )) , . . . , k (f (G, κb ), f (Gnt , κb )) 1 (3.18) T A separate transformation matrix Rb is now learned for each bag as G P Rb = Kb · (Kb )−1 (3.19) resulting in the final prototype representation (modification of Eqs. (3.9) and (3.10)) as φ(P, κb ) = Rb · φ (P, κb ) (3.20) φ(G, κb ) = φ (G, κb ) (3.21) Linear discriminant analysis is performed separately for each bag. Using training set T2 , we learn B subspace projection matrices Wb , b = 1 . . . B. A new face image I is represented in the random subspace prototype framework as Φ(I), where Φ(I) is the concatenation of each linearly projected prototype representation from each of the B random subspace bags. That is, Φ(I) = T W1 · φ(I, κ1 ) T , ..., T WB · φ(I, κB ) T T (3.22) For terseness we have omitted the subscripts F and D in the above equations. For example, in Eq. 
(3.22), $\Phi_{F,D}(I)$ is abbreviated to $\Phi(I)$ by omitting the image filter F and descriptor D used to represent I. A summary of the training and image enrollment steps can be found in Figure 3.5.

Figure 3.5: Proposed Prototype Random Subspace framework algorithm. Following the offline training phase, a face image I' is enrolled and the vector Φ is returned for matching.

  Global parameters: number of bags B, random sample vectors κ_b, image filter F, feature descriptor D
  Training
  Input: training sets T_1 = {P_1, G_1, ..., P_{n_t1}, G_{n_t1}}, T_2 = {P_1, G_1, ..., P_{n_t2}, G_{n_t2}}
  Output: R_1, ..., R_B, W_1, ..., W_B
  FOR b = 1 ... B:
    - Compute kernel matrices K^P_b, K^G_b using the prototypes in T_1 (Eqs. (3.15), (3.16))
    - Solve R_b using K^P_b and K^G_b (Eq. (3.19))
    - FOR EACH image I in T_2: compute φ_{F,D}(I, κ_b) (Eqs. (3.20), (3.21))
    - Using all I in T_2, learn the LDA subspace W_b from the representation φ_{F,D}(I, κ_b)
  Face enrollment
  Input: image I', T_1 (prototypes), R_1, ..., R_B, W_1, ..., W_B
  Output: Φ
  FOR b = 1 ... B:
    - IF I' is a probe: φ'_{F,D}(I') = [k(f_{F,D}(I', κ_b), f_{F,D}(P_1, κ_b)), ..., k(f_{F,D}(I', κ_b), f_{F,D}(P_{n_t1}, κ_b))] (Eq. (3.17)); φ_{F,D}(I') = R_b · φ'_{F,D}(I', κ_b) (Eq. (3.20))
    - ELSE (I' is a gallery image): φ_{F,D}(I') = [k(f_{F,D}(I', κ_b), f_{F,D}(G_1, κ_b)), ..., k(f_{F,D}(I', κ_b), f_{F,D}(G_{n_t1}, κ_b))] (Eq. (3.18))
    - Φ_b = W_b^T · φ_{F,D}(I')
  Concatenate the vectors: Φ = [Φ_1; ...; Φ_B] (Eq. (3.22))

3.5.3 Recognition

Given a probe face image P and a gallery face image G, we define their similarity S(P, G) using the cosine similarity measure

$S(P, G) = \frac{\langle \Phi(P), \Phi(G) \rangle}{\|\Phi(P)\| \cdot \|\Phi(G)\|}$   (3.23)

Further, we let $S^{F_2,D_2}_{F_1,D_1}(P, G)$ denote the similarity between the probe P represented using filter $F_1$ and descriptor $D_1$, and gallery image G represented in terms of filter $F_2$ and descriptor $D_2$. That is,

$S^{F_2,D_2}_{F_1,D_1}(P, G) = \frac{\langle \Phi_{F_1,D_1}(P), \Phi_{F_2,D_2}(G) \rangle}{\|\Phi_{F_1,D_1}(P)\| \cdot \|\Phi_{F_2,D_2}(G)\|}$   (3.24)

This similarity measure facilitates recognition using a threshold for a verification scenario (claimed identity for the probe is true or false), or a nearest neighbor matcher for an identification scenario (which one of N identities (classes) should be assigned to the probe).

3.5.4 Score Level Fusion

The proposed framework naturally lends itself to fusion of the different feature representations. For example, given one image filter F and two feature descriptors $D_1$ and $D_2$, we can utilize the following sum of similarity scores between probe image P and gallery image G: $\{S^{F,D_1}_{F,D_1}(P, G) + S^{F,D_2}_{F,D_2}(P, G) + S^{F,D_2}_{F,D_1}(P, G) + S^{F,D_1}_{F,D_2}(P, G)\}$. Min-max score normalization is performed prior to fusion.

3.6 Baselines

3.6.1 Commercial Matcher

The accuracy of the proposed prototype random subspace framework is compared against Cognitec's FaceVACS [1] COTS FRS. Comparing the accuracy of our system against this leading COTS FRS offers an unbiased baseline for each HFR scenario. FaceVACS was chosen because in our internal tests it excels at HFR scenarios (with respect to other commercial matchers). For example, the accuracy of FaceVACS on NIR to VIS [52] and viewed sketch to VIS [59] was on par with some previously published HFR methods.

3.6.2 Direct Random Subspaces

In addition to a commercial face recognition system, the proposed prototype recognition system is compared against a recognition system that directly measures the difference between probe and gallery images using a common feature descriptor representation.
As discussed previously, most recent approaches to heterogeneous face recognition involve directly measuring the similarity between two face images from alternate modalities using feature descriptors [12, 52, 59, 77]. The random subspace framework from [52] is used as the baseline because it is most similar to the proposed prototype framework, thus helping to isolate the difference between using kernel prototype similarities versus directly measuring the similarity. Further, because most of the datasets tested in Section 3.7 are in the public domain, the proposed framework may also be compared against any other published method on these data sets.

To briefly summarize the direct random subspace (D-RS) approach using our notation, for each bag b the D-RS framework represents an image as $f_{F,D}(I, \kappa_b)$. LDA is performed on each bag to learn the projection matrix $\tilde{W}_b$. Because only one training set is needed, LDA is learned from the images in $T_1$ and $T_2$ combined. The final representation $\Psi(\cdot)$ is the concatenation of the projected vector on the subspace for each bag

$\Psi_{F,D}(I) = [(\tilde{W}_1^T \cdot f_{F,D}(I, \kappa_1))^T, \ldots, (\tilde{W}_B^T \cdot f_{F,D}(I, \kappa_B))^T]^T$   (3.25)

The dissimilarity $\tilde{S}$ between probe image P and gallery image G (each represented with filter F and descriptor D) is

$\tilde{S}_{F,D}(P, G) = \|\Psi_{F,D}(P) - \Psi_{F,D}(G)\|_2$   (3.26)

Unlike P-RS, D-RS must use the same D for the probe and gallery images. This is obvious, as $f_{F,D_1}(I)$ and $f_{F,D_2}(I)$ will be of different dimensionality and also have a different interpretation. D-RS will be used in conjunction with the six filter/descriptor representations presented in Section 3.3 (SIFT + DoG, MLBP + CSDN, etc.). Results will be presented from the sum-score fusion of the min-max normalized scores from these six representations.

Table 3.1: Rank-1 accuracies for the proposed Prototype Random Subspace (P-RS) method across five recognition scenarios using an additional 10,000 subjects in the gallery. (*) Results for forensic sketch are the Rank-50 accuracy.

                        Rank-1 Accuracy (%)
  Method                NIR            Thermal        Sketch         Forensic*
  P-RS                  88.4 ± 4.99    55.3 ± 2.62    99.4 ± 0.54    19.6 ± 6.06
  D-RS                  90.1 ± 2.71    20.1 ± 2.23    97.2 ± 1.03    28.7 ± 4.09
  (P-RS) + (D-RS)       91.9 ± 2.91    57.4 ± 2.25    99.6 ± 0.41    26.8 ± 9.66
  FaceVACS              79.7 ± 3.75    20.7 ± 1.54    82.4 ± 2.39    4.2 ± 3.38

Table 3.2: Rank-1 accuracies for the proposed Prototype Random Subspace (P-RS) method on a standard photograph to photograph matching scenario using an additional 10,000 subjects in the gallery.

  Method                Standard (%)
  P-RS                  95.0 ± 1.58
  D-RS                  94.0 ± 1.30
  (P-RS) + (D-RS)       95.3 ± 1.42
  FaceVACS              99.5 ± 0.31

3.7 Experiments

The results reported below use the parameter values $\alpha = 0.1$ and B = 200. A third degree polynomial kernel was used to compute the prototype similarity and 99.0% of the variance was retained in the PCA step of LDA.

3.7.1 Databases

Five different matching scenarios are tested in this chapter: four heterogeneous face recognition scenarios, and one standard (homogeneous) face recognition scenario. Example images from each HFR dataset can be found in Figure 3.1. Results shown on each dataset are the average recognition accuracy and the standard deviation over five random splits of training and testing subjects. No subject that was used in training was used for testing.

Dataset 1 - Near-Infrared to Visible (Fig. 3.1(a)). The first dataset consists of 200 subjects with probe images captured in the near-infrared spectrum (780-1,100 nm) and gallery images captured in the visible spectrum.
Portions of this dataset are publicly available for download at http://www.cbsr.ia.ac.cn/english/Databases.asp. This dataset was originally used by Li et al. [72, 77]. Our experiments used only one NIR and one VIS image per subject, making the scenario more difficult than previous experiments, which benefited from multiple images per subject in training and gallery enrollment [52, 77]. The data was split as follows: $n_{t_1} = 67$ subjects were used for training set $T_1$, $n_{t_2} = 66$ subjects were used for training set $T_2$, and the remaining 67 subjects were used for testing.

Dataset 2 - Thermal to Visible (Fig. 3.1(b)). The second dataset is a private dataset collected by the Pinellas County Sheriff's Office, and consists of 1,000 subjects with thermal infrared probe images and visible (mug shot) gallery images. The thermal infrared images were collected using a FLIR Recon III ObservIR camera, which has sensitivity in the range of 3-5 µm and 8-12 µm. The data was split as follows: $n_{t_1} = 333$ subjects were used for training set $T_1$, $n_{t_2} = 334$ subjects were used for training set $T_2$, and the remaining 333 subjects were used for testing.

Dataset 3 - Viewed Sketch to Visible (Fig. 3.1(c)). The third dataset is the CUHK sketch dataset, which was used by Tang and Wang [138, 149] and is publicly available for download at http://mmlab.ie.cuhk.edu.hk/facesketch.html. The CUHK dataset consists of 606 subjects with a viewed sketch image for probe and a visible photograph for gallery. A viewed sketch is a hand drawn sketch of a face which is drawn while looking at a photograph of the subject. The photographs in the CUHK dataset are from the AR [89], XM2VTS [92], and CUHK student [138, 149] datasets. The 606 subjects were equally divided to form the training sets $T_1$, $T_2$, and the test set.

Dataset 4 - Forensic Sketch to Visible (Fig. 3.1(d)). The fourth and final heterogeneous face dataset consists of real-world forensic sketches and mug shot photos of 159 subjects. This dataset is described in [59]. Forensic sketches are drawn by an artist based only on an eye witness description of the subject. The forensic sketch dataset is a collection of images from Gibson [31], Taylor [139], the Michigan State Police, and the Pinellas County Sheriff's Office. Each sketch depicts a suspect involved in a real crime, and the mug shot photo was only available after the subject had later been identified by means other than face recognition. Forensic sketches contain incomplete information regarding the subject, and are one of the most difficult HFR scenarios because the sketches often do not closely resemble the photograph of the true suspect. Here 53 different subjects each are used in $T_1$, $T_2$, and the test set.

Dataset 5 - Standard Face Recognition. A fifth, non-heterogeneous (i.e. homogeneous) dataset is used to demonstrate the ability of the proposed approach to operate in standard face recognition scenarios. The dataset consists of one probe and one gallery photograph of 876 subjects, where 117 subjects were from the AR dataset [89], 294 subjects were from the XM2VTS dataset [92], 193 subjects were from the FERET dataset [109], and 272 subjects were from a private data set collected at the University of Notre Dame. This is the same dataset used in [56].

Enlarged Gallery. A collection of 10,000 mug shot images was used in certain experiments to increase the size of the gallery. These mug shot images were provided by the Michigan State Police, and were also used in [59].
Any experiment using these additional images will have a gallery with the number of testing subjects plus these additional 10,000 mug shot images. Experiments with a large gallery are meant to present results that more closely resemble real-world face matching scenarios.

3.7.2 Results

Tables 3.1 and 3.2 list the results of P-RS, D-RS, and FaceVACS for each dataset using the additional 10,000 gallery images for each experiment. The results for P-RS are the fusion of the match scores from {S_{F_d,D_s} + S_{F_c,D_s} + S_{F_g,D_s} + S_{F_d,D_m} + S_{F_c,D_m} + S_{F_g,D_m}}, i.e. the same features are used in the probe and gallery images (here F_d, F_c, and F_g denote the DoG, CSDN, and Gaussian image filters, and D_s and D_m denote the SIFT and MLBP descriptors). Similarly, D-RS is the fusion of the match scores from {S̃_{F_d,D_s} + S̃_{F_c,D_s} + S̃_{F_g,D_s} + S̃_{F_d,D_m} + S̃_{F_c,D_m} + S̃_{F_g,D_m}}. Results from these same matchers are also displayed in CMC (cumulative match characteristic) plots in Figures 3.6, 3.7, 3.8, and 3.9. Again, the P-RS method represents face images using their similarity to a set of prototype subjects, while the D-RS method directly measures the similarity between two face images using SIFT and LBP descriptors.

Figure 3.6: CMC plot for the NIR HFR scenario. Results use an additional 10,000 gallery images to better replicate real world matching scenarios. Listed are the accuracies for the proposed Prototype Random Subspace (P-RS) method, the Direct Random Subspace (D-RS) method [52], the sum-score fusion of P-RS and D-RS, and Cognitec's FaceVACS system [1].

Figure 3.7: CMC plot for the thermal HFR scenario. Results use an additional 10,000 gallery images.

Figure 3.8: CMC plot for the viewed sketch HFR scenario. Results use an additional 10,000 gallery images.

Figure 3.9: CMC plot for the forensic sketch HFR scenario. Results use an additional 10,000 gallery images.

The CMC results of matching NIR face images to standard face images are shown in Figure 3.6. The Rank-1 accuracy of 88.4% from Table 3.1 and Figure 3.6 demonstrates that the proposed P-RS matcher is able to perform at a similar level as D-RS and FaceVACS. FaceVACS was earlier benchmarked as performing at the same level as the top methods [52]. Thus, the proposed P-RS method is on par with leading methods in NIR to VIS matching.

The CMC results of matching thermal face images to standard face images are shown in Figure 3.7. P-RS is able to achieve an average Rank-1 accuracy of 55.3%. By comparison, the D-RS method achieves a Rank-1 accuracy of only 20.1% and FaceVACS has a Rank-1 accuracy of 20.7%. This drastic improvement demonstrates the benefit of P-RS's notable property of not requiring a feature descriptor that is invariant to changes in the probe and gallery modalities. A Rank-1 accuracy of 55.3% still falls short of the accuracy desired in lights-out systems; however, the examples in Figures 3.1 and 3.12 show that even humans would have difficulty in this recognition task.
The only previous method on thermal to visible matching achieved a Rank-1 accuracy of 50.06%, but it was evaluated with only 47 subjects in the gallery [71]. By contrast, the Rank-1 accuracy of 55.3% of the proposed P-RS method used a gallery consisting of 10,333 subjects.

The CMC results of matching viewed sketch face images to standard face images are shown in Figure 3.8. P-RS achieved near perfect accuracy with an average Rank-1 accuracy of 99.4%. Other methods have also achieved nearly 99% Rank-1 accuracy [59, 157], though the results in Figure 3.8 are based on a gallery with over 10,000 subjects compared to a gallery size of less than 1,000 in previous studies.

The CMC results of matching forensic sketch face images to standard face images are shown in Figure 3.9. For forensic sketches the Rank-50 accuracy is most relevant because the Rank-1 accuracy is too low to be useful in practice: forensic investigators generally examine roughly the top 50 retrieved matches from a query. This is the one scenario in which P-RS (Rank-50 accuracy of 19.6%) was outperformed by D-RS (Rank-50 accuracy of 28.7%). The only previous method to publish results on forensic sketch matching also used the same extended gallery and achieved a Rank-50 accuracy of 13.4% [59] (this number is the weighted average of a 32.65% Rank-50 accuracy on 49 good sketches and an 8.16% accuracy on 110 poor sketches). It is important to note that the matcher in [59] was trained on viewed sketches, and not forensic sketches like P-RS and D-RS.

The decreased accuracy of P-RS compared to D-RS on the forensic sketch dataset is attributed to two factors. The primary factor is the small size of the dataset. While both methods utilize learning, D-RS is able to leverage the a priori knowledge that SIFT and MLBP perform well for direct similarity measurement. Further, D-RS is able to use both training sets to learn the LDA subspaces. By contrast, P-RS must use the first training set to develop the prototypes. An additional reason for P-RS's lower accuracy on forensic sketch matching is that these sketches are often not completely accurate due to the inability of a witness to adequately describe the face of a suspect, which undermines the assumption in Eq. (3.7). Despite these limitations, P-RS still achieved an approximately four-fold improvement in accuracy over a leading COTS FRS. Example cases where (i) P-RS succeeds but FaceVACS fails and (ii) P-RS fails but FaceVACS succeeds are shown for the two most difficult HFR scenarios (thermal and forensic sketch) in Figure 3.12.

Figures 3.10 and 3.11 demonstrate the ability of the P-RS framework to perform recognition using different feature descriptors for the probe and gallery images.

NIR (Rank-1 accuracy, %)
Probe \ Gallery    DoG SIFT  DoG MLBP  CSDN SIFT  CSDN MLBP  Gauss SIFT  Gauss MLBP
DoG SIFT             77.9      60.6      75.8       66.6       62.4        52.8
DoG MLBP             72.8      85.4      63.9       77.9       49.0        56.1
CSDN SIFT            77.9      59.1      81.8       69.6       72.5        59.1
CSDN MLBP            76.7      78.8      76.7       84.8       72.5        70.4
Gauss SIFT           66.6      48.7      74.0       69.0       72.2        63.0
Gauss MLBP           58.8      57.0      68.1       75.5       66.9        67.5

Thermal (Rank-1 accuracy, %)
Probe \ Gallery    DoG SIFT  DoG MLBP  CSDN SIFT  CSDN MLBP  Gauss SIFT  Gauss MLBP
DoG SIFT             50.8      42.0      46.7       41.9       29.0        24.4
DoG MLBP             46.1      57.8      35.7       49.0       20.5        22.6
CSDN SIFT            49.1      35.9      50.2       44.7       37.4        31.6
CSDN MLBP            47.7      49.7      47.3       55.1       33.2        36.3
Gauss SIFT           36.0      23.6      40.1       33.6       36.0        28.8
Gauss MLBP           34.7      30.0      36.4       37.1       32.1        32.6

Figure 3.10: Rank-1 accuracies (%) on the NIR and thermal modalities using the proposed P-RS framework.
The rows list the features used to represent the probe images, and the columns list the features for the gallery images. The non-diagonal entries in each table use different feature descriptor representations for the probe images than the gallery images. These results demonstrate another "heterogeneous" aspect of the proposed framework: recognition using heterogeneous features between the probe and gallery images.

Viewed Sketch (Rank-1 accuracy, %)
Probe \ Gallery    DoG SIFT  DoG MLBP  CSDN SIFT  CSDN MLBP  Gauss SIFT  Gauss MLBP
DoG SIFT             98.6      87.5      98.4       85.8       96.6        90.5
DoG MLBP             95.6      96.3      91.0       92.3       74.2        83.7
CSDN SIFT            98.3      82.3      98.7       87.6       98.3        94.8
CSDN MLBP            95.1      92.5      95.1       96.2       91.6        95.1
Gauss SIFT           96.8      60.8      97.6       75.1       97.8        95.6
Gauss MLBP           89.4      66.8      93.2       81.2       94.6        98.0

Forensic Sketch (Rank-1 accuracy, %)
Probe \ Gallery    DoG SIFT  DoG MLBP  CSDN SIFT  CSDN MLBP  Gauss SIFT  Gauss MLBP
DoG SIFT              6.4       6.0       7.5        6.8        4.9         6.4
DoG MLBP              7.2      10.6       4.5        8.3        3.8         5.3
CSDN SIFT             8.7       6.0       5.7        8.7        6.4         7.9
CSDN MLBP             6.4       9.4       6.4       11.3        5.3         5.3
Gauss SIFT            7.5       5.7       5.7        5.7        7.5         9.1
Gauss MLBP            9.1       8.7       6.0        4.9        9.1        10.9

Figure 3.11: Rank-1 accuracies (%) on the viewed sketch and forensic sketch modalities using the proposed P-RS framework. The rows list the features used to represent the probe images, and the columns list the features for the gallery images. The non-diagonal entries in each table use different feature descriptor representations for the probe images than the gallery images. These results demonstrate another "heterogeneous" aspect of the proposed framework: recognition using heterogeneous features between the probe and gallery images.

Figure 3.12: Examples of thermal recognition not successfully matched by (a) FaceVACS, and (b) the proposed P-RS method. Examples of forensic sketch recognition not successfully matched by (c) FaceVACS, and (d) P-RS. In each image pair the left and right images are the probe and gallery, respectively. (Retrieval ranks: (a) P-RS Rank 1, FaceVACS Rank 7622; (b) P-RS Rank 891, FaceVACS Rank 1; (c) P-RS Rank 1, FaceVACS Rank 7622; (d) P-RS Rank 891, FaceVACS Rank 1.)

Figure 3.10 lists the Rank-1 accuracy for the NIR and thermal HFR scenarios, and Figure 3.11 lists the same for the viewed sketch and forensic sketch scenarios. These scores are averaged over five random training/testing splits but do not use the additional 10,000 gallery images. The columns indicate each of the six different image filter and feature descriptor combinations used to represent the gallery, and the rows indicate the representations used for the probe images. Thus, the non-diagonal entries for each scenario are when the probe and gallery images are represented with different features. The accuracy is generally higher when the same features are used for faces in the probe and gallery (i.e. the diagonal entries). Various levels of accuracy are achieved when using different image features, ranging from poor to high. The ability to perform face recognition with the probe and gallery images using different representations is a property that previous feature-based methods did not possess. This property is important to mention because it demonstrates the proposed method's ability to generalize to other unknown HFR scenarios.
For example, in the case of thermal to visible recognition, if a local feature descriptor is developed that performs at a very high level in matching thermal to thermal, it can be incorporated into this framework even if it does not work well in the visible domain. As other HFR scenarios are attempted (such as matching a 3D depth map to a 2D visible photograph), this property could prove extremely useful in overcoming the hurdle of finding a feature descriptor that is invariant to changes between the two domains, which feature-based methods rely on.

Table 3.3 lists the Rank-1 accuracy (without the additional gallery) for each scenario with and without various components of the prototype random subspace framework (namely LDA, the transformation matrix R, and random subspaces, RS). The improvement in recognition accuracy when using the R transformation quantitatively demonstrates the importance of this step in our algorithm.

Table 3.3: Effect of each component in the P-RS framework on recognition accuracy. Components tested are LDA, the transformation matrix R, and random subspaces (RS). Listed are the average Rank-1 accuracies for each scenario without the additional 10,000 gallery images.

                          LDA                                    No LDA
                   R              No R                   R                No R
                 RS    No RS    RS    No RS            RS    No RS      RS    No RS
NIR             0.904  0.901   0.725  0.699           0.722  0.472     0.600  0.319
Thermal         0.643  0.637   0.508  0.474           0.217  0.174     0.178  0.120
Viewed Sketch   0.994  0.992   0.981  0.970           0.970  0.867     0.939  0.618
Forensic Sketch 0.136  0.147   0.143  0.140           0.102  0.064     0.094  0.068

The proposed P-RS framework also generalizes to standard face recognition scenarios. Using the standard dataset, Figure 3.14(a) compares the accuracy of P-RS, D-RS, and FaceVACS. FaceVACS clearly outperforms P-RS and D-RS, as it is consistently one of the top performers in NIST face recognition benchmarks. However, using four different face datasets, we see that P-RS and D-RS both achieve Rank-1 accuracies around 95% with 10,876 subjects in the gallery, compared to 99.5% accuracy for FaceVACS. In Figure 3.14(b) the results of matching using different feature descriptors in the probe and gallery domains are shown. The ability to match probe and gallery images using different feature representations is novel and could benefit situations in which only the face templates, instead of the face images, are available.

The proposed P-RS method is computationally scalable to meet the demands of real world face recognition systems. Running in Matlab and using a single core of a 2.8GHz Intel Xeon processor, the breakdown of the compute time needed to enroll a single face image is as follows. Image filtering requires roughly 0.008 sec for DoG, 1.1 sec for CSDN, and 0.004 sec for Gauss. The MLBP and SIFT feature descriptors each take roughly 0.35 sec to compute. Because each image filtering is performed once, and each feature descriptor is computed three times (once for each filter), computing all six filter/descriptor combinations takes around 3.2 sec.

Figure 3.13: CMC plot of matcher accuracies with an additional 10,000 gallery images when photos are used for both the probe and gallery (i.e. non-heterogeneous face recognition).
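The CMC plots reported in Figures 3.6-3.9 and 3.13 show rank-k identification accuracy. The following is a minimal sketch of how such a curve can be computed from a probe-by-gallery similarity matrix; the assumption of one true mate per probe and all variable names are illustrative and not a description of the actual implementation used in this chapter.

```python
import numpy as np

def cmc_curve(scores, true_gallery_idx, max_rank=100):
    """Rank-k identification accuracy from a similarity matrix.

    scores           : (num_probes, num_gallery) similarity matrix,
                       where a higher score means more similar.
    true_gallery_idx : index of each probe's true mate in the gallery.
    Returns an array whose (k-1)-th entry is the fraction of probes whose
    true mate is retrieved within the top-k gallery candidates.
    """
    num_probes = scores.shape[0]
    ranks = np.empty(num_probes, dtype=int)
    for i in range(num_probes):
        order = np.argsort(-scores[i])                       # most similar first
        ranks[i] = int(np.where(order == true_gallery_idx[i])[0][0]) + 1
    return np.array([(ranks <= k).mean() for k in range(1, max_rank + 1)])

# Usage sketch with random scores (100 probes, 10,000 gallery entries).
rng = np.random.default_rng(0)
scores = rng.normal(size=(100, 10_000))
true_idx = np.arange(100)          # probe i's mate is gallery entry i (illustrative)
cmc = cmc_curve(scores, true_idx)
print("Rank-1: %.3f  Rank-50: %.3f" % (cmc[0], cmc[49]))
```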
Standard Photograph Matching (Rank-1 accuracy, %)
Probe \ Gallery    DoG SIFT  DoG MLBP  CSDN SIFT  CSDN MLBP  Gauss SIFT  Gauss MLBP
DoG SIFT             93.0      78.4      90.7       73.8       80.5        71.0
DoG MLBP             85.6      92.9      79.8       87.3       53.3        65.7
CSDN SIFT            90.8      73.0      93.6       79.7       89.8        84.9
CSDN MLBP            84.4      89.2      88.4       94.1       81.0        88.6
Gauss SIFT           81.8      53.5      90.4       73.4       93.2        89.9
Gauss MLBP           72.9      61.6      85.7       81.6       90.6        95.1

Figure 3.14: Face recognition results (%) when photos are used for both the probe and gallery (i.e. non-heterogeneous face recognition). The layout is the same as in Figure 3.10 (i.e. results shown are when different features are used to represent the probe and gallery images).

The prototype random subspace representation with 200 bags takes roughly 0.3 sec to compute for a single filter/descriptor combination. Thus, all six filter/descriptor combinations take roughly 1.8 sec. In total, a face image needs around 5.0 sec to enroll in Matlab. With a gallery of n_g subjects and the final feature vector Φ of size d′, identification of a subject takes O(d′ · n_g) time. Depending on the number of bags, the number of prototypes for each scenario, and the variance retained in the PCA step, d′ is of the order of 1,000.

3.8 Summary

A method for heterogeneous face recognition, called Prototype Random Subspaces (P-RS), is proposed. Probe and gallery images are initially filtered with three different image filters, and two different local feature descriptors are then extracted. A training set of prototypes is selected, in which each prototype subject has an image in both the gallery and probe modalities. The non-linear kernel similarity between an image and the prototypes is measured in the corresponding modality. A random subspace framework is employed in conjunction with LDA subspace analysis to further improve the recognition accuracy.

The proposed method leads to excellent matching accuracies across four different HFR scenarios (near infrared, thermal infrared, viewed sketch, and forensic sketch). Results were compared against a leading commercial face recognition engine. In most of our experiments the gallery size was increased with an additional 10,000 subjects to better replicate real matching scenarios. In addition to excellent matching accuracies, one key benefit of the proposed P-RS method is that different feature descriptors can be used to represent the probe and gallery images. Finally, the P-RS method performed comparably to a leading commercial face recognition engine on a visible to visible matching scenario (i.e. non-heterogeneous face recognition).

Future work will focus on (i) improving the accuracy of each of the tested HFR scenarios separately, (ii) improving the runtime complexity of the prototype representation, and (iii) incorporating additional HFR scenarios. Tailoring the P-RS parameters and learning weighted fusion schemes for each HFR scenario separately should offer further accuracy improvements. Another potential technique to improve the recognition accuracies is to allow the P-RS method to leverage multiple training samples per subject. That is, in the conducted experiments each training subject has only one image per modality. However, in many operational scenarios training data will be available with multiple images per modality for each subject. This additional information will improve the ability to estimate the within-class scatter in our discriminant analysis. We will also continue to improve aspects of the D-RS method (which achieved high accuracy on several of the HFR scenarios), such as the similarity metrics and image filters.
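As a concrete illustration of the prototype representation summarized above, the sketch below computes, for a single filter/descriptor combination, the polynomial-kernel similarities between an image and the prototype subjects of its own modality (a third degree polynomial kernel is used in this chapter). The per-bag random sampling, the transformation R, and the LDA projections are omitted, and the exact kernel form, array sizes, and names are illustrative assumptions rather than the implementation used here.

```python
import numpy as np

def poly_kernel(x, Y, degree=3):
    """Polynomial kernel values between vector x and each row of Y (assumed form)."""
    return (Y @ x + 1.0) ** degree

def prototype_representation(x, prototypes_same_modality):
    """Represent a face by its kernel similarity to each prototype subject.

    x                        : feature vector of the input image (e.g., filtered
                               SIFT or MLBP descriptors, concatenated).
    prototypes_same_modality : (num_prototypes, dim) matrix of feature vectors from
                               the training prototypes, taken from the SAME modality
                               as the input image.
    """
    return poly_kernel(x, prototypes_same_modality)

# A probe (e.g., thermal) and a gallery image (visible) are each compared only
# against prototypes of their own modality, so both end up in the same
# "prototype similarity" space and can be compared directly.
rng = np.random.default_rng(1)
protos_probe_modality   = rng.normal(size=(133, 512))   # hypothetical sizes
protos_gallery_modality = rng.normal(size=(133, 512))
probe_vec, gallery_vec  = rng.normal(size=512), rng.normal(size=512)

phi_p = prototype_representation(probe_vec, protos_probe_modality)
phi_g = prototype_representation(gallery_vec, protos_gallery_modality)
dissimilarity = np.linalg.norm(phi_p - phi_g)
```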
Improvements to the runtime complexity of the P-RS method should be explored through an examination of the maximum number of prototypes needed to achieve the highest recognition accuracy on a given scenario, and through kernel approximation methods such as the Nystrom method [150]. One additional HFR scenario that should be considered is 3D to 2D face matching. P-RS should be particularly impactful in this scenario because heterogeneous features will be required to represent faces in the 3D and 2D modalities.

While the vast majority of previous algorithms for heterogeneous face recognition have been designed for a specific application (e.g. sketch recognition [12, 59, 82, 138, 149] and near-infrared recognition [52, 77]), the algorithm presented in this chapter generalizes to any HFR scenario. By providing a generalized approach to heterogeneous face recognition, this chapter offers a strong contribution to the field. As new heterogeneous face recognition scenarios are attempted (such as matching depth images acquired from LIDAR), the P-RS algorithm proposed in this chapter should allow for success in these future endeavors.

Chapter 4

Face Recognition Across Time Lapse

4.1 Introduction

In addition to changes in image modality, another variate that is known to greatly impact face recognition performance is the alteration in facial appearance that occurs through the aging process [120]. This chapter will look at the performance of aging-invariant face recognition systems in both aging and non-aging scenarios. By showing that aging-invariant face recognition systems do not generalize to non-aging scenarios, we are able to pose aging-invariant face recognition as a heterogeneous face recognition problem. That is, the identifiable facial features for faces that have not undergone aging are largely heterogeneous from the identifiable features of faces that have aged. This is evidenced by the different discriminative subspaces that are learned from aged face images than from non-aged face images.

Unlike pose, expression, and illumination, aging factors cannot be constrained in order to improve face recognition performance. For example, many years may pass before a released prisoner recidivates, resulting in a large time lapse between the mug shot image in the gallery and the current booking image (probe).

Figure 4.1: Multiple images of the same subject are shown, along with the match score (obtained by a leading face recognition system) between the initial gallery seed and the image acquired after a time lapse. As the time lapse increases, the recognition score decreases. This phenomenon is a common problem in face recognition systems. The work presented in this chapter (i) demonstrates this phenomenon on the largest aging dataset to date, and (ii) demonstrates that solutions to improve face recognition performance across large time lapse impact face recognition performance in scenarios without time lapse. (Acquisition dates and scores shown: Jan 1995 gallery seed; Jul 1998, score 0.99; Nov 1999, score 0.62; Nov 2003, score 0.41; Feb 2005, score 0.26.)

Similarly, a U.S. passport is valid for ten years, and most state driver's licenses only need to be renewed every five to ten years. Thus, in many critical applications the success of face recognition technology may be impacted by the large time lapse between a probe image and its true mate in the gallery.
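The experiments in this chapter partition probe/gallery pairs by the elapsed time between their acquisition dates (0 to 1, 1 to 5, 5 to 10, and 10+ years). A small sketch of that bookkeeping is given below; the date handling and record format are illustrative assumptions, not the actual metadata interface.

```python
from datetime import date

# Time-lapse ranges (in years) analyzed in this chapter.
RANGES = [("0-1", 0, 1), ("1-5", 1, 5), ("5-10", 5, 10), ("10+", 10, float("inf"))]

def time_lapse_years(gallery_date: date, probe_date: date) -> float:
    """Elapsed time between gallery seed and probe acquisition, in years."""
    return (probe_date - gallery_date).days / 365.25

def lapse_bucket(gallery_date: date, probe_date: date) -> str:
    """Assign a probe/gallery pair to one of the time-lapse ranges."""
    lapse = time_lapse_years(gallery_date, probe_date)
    for name, lo, hi in RANGES:
        if lo <= lapse < hi:
            return name
    raise ValueError("probe acquired before the gallery seed")

# Example: a mug shot gallery seed from Jan 1995 matched against a Feb 2005 image.
print(lapse_bucket(date(1995, 1, 15), date(2005, 2, 1)))   # -> "10+"
```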
Over the past five years there has been a growing interest in understanding the impact of aging on face recognition performance and proposing solutions to mitigate any negative impact from aging. A major contributor to these advances has been the availability of the MORPH database by Ricanek et al. [116, 121]. The MORPH database consists of two albums, which, in total, contains roughly 100,000 images of about 25,000 subjects. The MORPH dataset has facilitated studies on synthetic aging [104,135], age invariant face recognition [76,104], age estimation [35], and aging analysis [107]. A broader examination of facial aging methods in the literature can be found in the summary provided by Ramanathan et al. [114]. 103 Various approaches for improving face recognition performance in the presence of aging can be dichotomized into two groups. The first contains generative synthesis methods which seek to learn an aging model that can estimate the appearance of an aged face from an input image. While these methods have shown some success in mimicking the aging process [104, 129, 135], generative methods are challenging due to the large number of parameters that must be estimated. Synthesis methods also rely on the appearance of the face in order to simulate the aging process, which can suffer from the minor pose and illumination variations that are encountered in large datasets. Further, synthesis methods do not handle the problem of face recognition, and need a separate face engine to perform matching. Of course, this also speaks to one of the advantages of synthetic aging methods: they can be easily integrated with existing face recognition engines. An alternative solution to improving face recognition performance across time lapse is through discriminative learning methods [76, 79, 86, 113]. Such methods seek to find the weighted combination of features that are most stable across a particular time lapse. Discriminative approaches are able to leverage both the wide range of facial feature representations [56], as well as the family of learning methods in face recognition. Beginning with Belhuemer’s FisherFaces approach [10], discriminative learning approaches have been critical to the advancement of face recognition over the past two decades. Li et al. used a discriminative random subspace method that outperforms a leading commercial face recognition engine on the MOPRH dataset [76]. This work helped demonstrate that a face recognition system could be trained to improve performance in the presence of aging [76]. While these contributions helped advance the state of the art in face recognition in the presence of time lapse, they also raise another question regarding the design of face recognition systems: does the learned subspace for face recognition across time lapse impact the face recognition performance in a 104 non-aging scenario? In other words, while we can improve face recognition perfor- mance in the presence of a large time lapse between the probe and gallery images, we also want to maintain the performance on two images with minimal time lapse. This question, to our knowledge, has not yet been addressed before. The contributions of the research presented in this chapter are motivated by the need to answer the question posed above. This question is answered by providing the largest study to date on the impact of aging on face recognition performance. 
Leveraging a dataset of 200,000 mug shot images from 64,000 subjects, we demonstrate (i) a degradation in face recognition performance from two leading commercial-off-the-shelf (COTS) face recognition systems (FRS) on match sets partitioned by the amount of time lapse occurring between the probe and gallery images, and (ii) that training to improve performance on a particular time lapse range impacts performance on other time lapse ranges. These findings suggest that face recognition systems should update face templates after a certain time interval has passed from the original acquisition date in order to maximize the benefit of age-invariant subspaces without impacting face recognition in non-aging scenarios.

The remainder of this chapter is outlined as follows. In Section 4.2 we discuss the face dataset used in this research. In Section 4.3 we revisit the random subspace framework and discuss how it was adapted for this work. Section 4.4 presents experiments on the impact of training for age-invariant face recognition, as well as the computational demands that arose from undertaking such a large scale study.

4.2 Dataset

This study leverages a set of 200,000 mug shot images from roughly 64,000 subjects collected in the state of Florida, U.S.A. Each image contains a subject id and an image acquisition date, which enables the time lapse between any two images to be determined. The 200,000 images are a subset of a larger 1.5 million image dataset available to us; these 200,000 images were selected so that different time lapse ranges were equally represented in this study.

The time lapse ranges (between a probe and gallery image) analyzed in this study were (i) 0 to 1 year, (ii) 1 to 5 years, (iii) 5 to 10 years, and (iv) more than 10 years. Training sets for each of these time lapse ranges were generated so that each range has 8,000 subjects. The only exception is the 10+ year time lapse range, where only around 2,000 subjects were available in the database for training. Similarly, test sets were generated to represent each of the time lapse ranges listed above. For each time lapse range, 12,000 subjects were used for testing. Again, the 10+ year time lapse test set contained only 2,000 subjects, similar to the training set. For each subject in the study, their oldest face image was used as the gallery seed image. Multiple probe images that fell within the time lapse range for a subject were often available as well. For example, the 1 to 5 year test set contained 12,000 gallery images and 33,443 probe images, where each probe image was taken between one to five years after the corresponding gallery image.

All parameter tuning in this work was performed using the training set. This was done by using the first half of the training set to train on different parameter values, and the second half of the training set to determine the optimal parameter combination (with respect to face recognition performance). Thus, the second half of the training set also served as a validation set.

The analysis performed on this dataset is the largest such study reported to date. Further, because the images are pulled from a larger pool of an operational database of mug shot images, the study is unique in that it controls the time lapse variate so that the same number of subjects are available to analyze 5 to 10 years aging as 0 to 1 year aging (for example).
As such, measuring the performance of COTS FRS on this dataset will provide a convincing demonstration of how commercially available face recognition technology performs in the presence of aging.

Figure 4.2: The performance of two commercial face recognition systems as a function of time lapse between probe and gallery images (TAR at FAR = 1.0% versus time lapse in years, for COTS 1 and COTS 2).

Because both Li et al. [76] and Ling et al. [79] have been able to surpass the performance of COTS FRS by performing discriminative learning on face images with time lapse, it is generally accepted that face recognition performance degrades monotonically as the time between image acquisitions increases. We analyzed the performance of two commercial-off-the-shelf face recognition systems: Cognitec's FaceVACS SDK [1] and PittPatt's Face Recognition SDK [2]. Both matchers were competitive participants in the 2010 NIST Multi-Biometrics Evaluation (MBE) [34]. Recognition results reported here list the two matchers as "COTS 1" and "COTS 2" in order to anonymize each matcher's performance relative to the other. Figure 4.2 shows the matching accuracies of the two COTS matchers as a function of the time lapse between the probe and gallery image. The decrease in performance as the time lapse increases clearly shows the difficulty face recognition systems have with age variation.

4.3 Random Subspace Face Recognition

In this work we adopt a random subspace linear discriminant analysis (RS-LDA) algorithm, based on Wang and Tang's original face subspace method [148]. More recently, Li et al. [76] have used a variant of this approach to improve face recognition in the presence of aging. Klare and Jain have also demonstrated the benefit of RS-LDA on a heterogeneous face recognition scenario [52, 54].

RS-LDA is based on the FisherFaces linear discriminant analysis algorithm [10], where a linear subspace Ψ is learned from the original feature space by solving the generalized eigenvalue problem S_b · Ψ = Λ · S_w · Ψ, with the between-class and within-class scatter matrices S_b and S_w built from a set of training images. In RS-LDA, multiple subspaces Ψ_b, b = 1 . . . B, are learned using both randomly sampled subsets of the original feature space as well as randomly sampled subjects from the set of training instances. The motivation for using RS-LDA over standard LDA is the degenerative properties that often manifest in S_w (which must be of full rank to solve S_w^{-1} · S_b). While Level 2 facial feature representations [56] (such as the local binary patterns [99] used in this work) offer improved recognition accuracies, they also increase the dimensionality of the facial feature vectors. This in turn increases the likelihood that S_w is degenerate, and further necessitates a method such as RS-LDA. Other LDA variants offer solutions to this small sample size problem [45, 85]; however, RS-LDA is preferred due to its ease of implementation and wider range of successful applications in face recognition [52, 54, 76, 148].

The approach used in this work is mostly based on the method by Li et al. [76]; however, we had to modify their method in order to reduce the computational requirements, because the number of images handled in this experiment is an order of magnitude larger than in their work. Again, the intent of this work is not to provide a method that can improve on commercially available face recognition technology (this capability has already been demonstrated [76, 79]).
Instead, we wish to understand how training a face recognition system to improve recognition accuracies on a particular time lapse scenario performs on scenarios with a larger or smaller amount of time lapse than the training time lapse.

4.3.1 Face Representation

We represent face images in this experiment with multi-scale local binary patterns (MLBP), which is the concatenation of local binary patterns [99] of radius 1, 3, 5, and 7. Ahonen et al. first demonstrated the effectiveness of representing face images with LBP descriptors [3]. In order to represent a face with MLBP feature descriptors, the face is first geometrically normalized using the eye locations to (i) perform planar rotation so the angle between the eyes is 0 degrees, (ii) scale the face so the inter-pupillary distance between the eyes is 75 pixels, and (iii) crop the face to 250x200 pixels. Once geometrically normalized, MLBP feature descriptors are densely sampled from patches of size 24x24 across the face, with an overlap of 12 pixels. In total, this yields 285 MLBP descriptors representing the face. The size of the patch (24x24) was selected by using the training set to perform parameter validation.

To reduce the total feature vector size, principal component analysis (PCA) was performed on one half of the training set to learn a subspace for each of the 285 MLBP feature sampling locations. The second half of the training set was used to determine the minimum amount of variance that needed to be retained without impacting face recognition performance. It was determined that 98% of the variance could be retained without impacting the recognition performance. The original MLBP descriptor is 236-dimensional (4 · 59). After PCA dimensionality reduction, the descriptor size, on average, was reduced to 99 dimensions at each of the 285 sampling locations. After the dimensionality of the MLBP descriptor for each face patch was reduced, all descriptors are concatenated together, resulting in a feature vector of dimensionality d = 28,187. Without this PCA step, the feature dimensionality would have been 67,260.

4.3.2 Random Subspaces

A total of B random LDA subspaces Ψ_b are learned from B random samples of the d-dimensional feature space. The eigenvalues corresponding to each feature dimension extracted from the PCA step were used to weight the random sampling, so that features with higher variation energy have a larger likelihood of being selected. The benefit of this approach was confirmed by evaluation on the validation set. The number of features sampled with the weighted random sampling was controlled by the parameter ρ (0 < ρ < 1) in order to select d′ = ρ · d features at each stage b = 1, . . . , B. Additionally, from the N training subjects available, a subset of size N′ < N was randomly sampled to build the between-class scatter matrix S_Btwn^b ∈ R^{d′×d′} and the within-class scatter matrix S_Wthn^b ∈ R^{d′×d′} at each stage b. Finally, we learn the subspace Ψ_b as

Ψ_b = argmax_Ψ ( ||Ψ^T · S_Btwn^b · Ψ|| / ||Ψ^T · S_Wthn^b · Ψ|| )        (4.1)

After learning the set of B subspaces Ψ_b, b = 1 . . . B, a new face image is represented as the concatenation of each of the B subspace projections. The dissimilarity between two faces is then measured by the L2 norm. Despite reducing the feature dimensionality and only using a fraction ρ of the features (d′ = ρ · d), we still have a feature vector that is too large to accurately solve Eq. 4.1.
To resolve this, a second PCA step was applied at each stage b to perform feature reduction on the d′-dimensional feature vector. This second PCA step was performed by retaining 0 < p < 1 percent of the variance in the training instances at stage b.

The parameters in the RS-LDA framework are the number of training subjects to use at each stage (N′), the percentage of features to sample at each stage (ρ), the number of random sample stages (B), and the percentage of variance retained in the PCA step for each stage (p). Using the training set for validation to find the highest recognition accuracies, the following parameter values were selected: N′ = 300, ρ = 0.45, B = 20, and p = 0.95.

4.4 Experiments

Figure 4.2 shows the negative correlation between face recognition accuracy and the amount of time lapse between probe and gallery image capture. A strong case has been made to handle this problem by training discriminative face recognition systems [76, 79]. Here we will use the random subspace framework developed in Section 4.3 to understand if training a face recognition system to improve performance on aging impacts standard face recognition scenarios. Using the training set splits discussed in Section 4.2, we trained five different versions of the RS-LDA matcher using the algorithm presented in Section 4.3.

• The first RS-LDA matcher was trained on the 8,000 training subjects with 0 to 1 year time lapse between probe and gallery image.
• The second matcher was trained on 8,000 subjects with 1 to 5 year time lapse.
• The third matcher was trained on 8,000 subjects with 5 to 10 year time lapse.
• A fourth matcher was trained on 2,000 subjects with over 10 years time lapse (again, only 2,000 subjects were available with such a large time lapse).
• A final matcher was trained using 8,000 subjects whose time lapse was equally distributed amongst the four time lapse splits considered above. Thus, this matcher trained on subjects with 0 year time lapse up to 17 years (17 years is the maximum time lapse in the 10+ aging set).

Figure 4.3: The true accept rates (TAR) at a fixed false accept rate (FAR) of 1.0% across datasets with different amounts of time lapse between the probe and gallery images. Four different RS-LDA subspaces were trained on a separate set of subjects with the different time lapse ranges tested above. The results suggest the need for multiple recognition subspaces depending on the time lapse.

(a) Test set: 0 to 1 year time lapse (19,996 match comparisons; 239,572,034 non-match comparisons)
RS-LDA trained on (0-1): 94.5%   (1-5): 94.1%   (5-10): 93.1%   (10+): 91.8%   (All): 94.1%
Baselines: MLBP Only 71.2%   COTS1 96.3%   COTS2 89.8%

(b) Test set: 1 to 5 year time lapse (33,443 match comparisons; 401,282,557 non-match comparisons)
RS-LDA trained on (0-1): 90.3%   (1-5): 90.5%   (5-10): 89.1%   (10+): 87.7%   (All): 90.2%
Baselines: MLBP Only 62.9%   COTS1 94.3%   COTS2 84.6%

(c) Test set: 5 to 10 year time lapse (24,036 match comparisons; 215,795,208 non-match comparisons)
RS-LDA trained on (0-1): 75.2%   (1-5): 81.2%   (5-10): 82.0%   (10+): 80.4%   (All): 81.3%
Baselines: MLBP Only 46.7%   COTS1 88.6%   COTS2 75.5%

(d) Test set: 10+ year time lapse (6,221 match comparisons; 12,995,669 non-match comparisons)
RS-LDA trained on (0-1): 65.6%   (1-5): 72.2%   (5-10): 72.4%   (10+): 71.0%   (All): 71.2%
Baselines: MLBP Only 39.2%   COTS1 80.5%   COTS2 61.7%
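Figure 4.3 reports true accept rates at a fixed false accept rate of 1.0%. The following is a minimal sketch of how such an operating point can be computed from genuine and impostor score sets; the thresholding convention and variable names are illustrative and do not describe the exact procedure used in this study.

```python
import numpy as np

def tar_at_far(genuine_scores, impostor_scores, far=0.01):
    """True accept rate at a fixed false accept rate.

    genuine_scores  : similarity scores for matching (same-subject) pairs.
    impostor_scores : similarity scores for non-matching pairs.
    The threshold is chosen so that roughly `far` of the impostor
    comparisons are (incorrectly) accepted.
    """
    impostor_scores = np.sort(impostor_scores)
    idx = int(np.ceil((1.0 - far) * len(impostor_scores))) - 1
    threshold = impostor_scores[idx]
    return float(np.mean(genuine_scores > threshold))

# Usage sketch with synthetic score distributions.
rng = np.random.default_rng(2)
genuine  = rng.normal(loc=2.0, size=19_996)      # e.g., number of match comparisons
impostor = rng.normal(loc=0.0, size=1_000_000)   # subsample of non-match comparisons
print("TAR at 1.0%% FAR: %.1f%%" % (100 * tar_at_far(genuine, impostor)))
```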
Figure 4.3 shows the accuracy on each of the four test sets using the five trained systems. The first test set (Fig. 4.3(a)) has 0 to 1 year time lapse between the probe and gallery images for 12,000 subjects. The results show that the best performance from the five trained systems is the system trained on 0 to 1 year time lapse. As the time lapse between the training set and the test set increases, the face recognition accuracy decreases. These results help provide the following answer to the question originally posed: training a face recognition system to improve on face aging does seem to reduce its performance when facial aging has not occurred.

The recognition performance on face images that have 1 to 5 years time lapse (Fig. 4.3(b)) shows the best performance from the five RS-LDA systems is the system trained on 1 to 5 year lapse. However, the performance from 0 to 1 year time lapse training is not much lower. In fact, the difference between training and testing on 0 to 1 year and 1 to 5 years is rather minimal. This is likely due to the fact that only minor aging changes have occurred in these time spans.

The recognition performance on face images with 5 to 10 years time lapse (Fig. 4.3(c)) shows how learning can help improve recognition accuracies in the presence of a large amount of aging. The true accept rate improves by nearly 7.0% when trained on the 5 to 10 years set compared with the 0 to 1 year training set. Thus, the feature subspaces learned on data with minimal aging did not generalize well to data with larger amounts of aging.

The recognition results on aging over 10 years (Fig. 4.3(d)) are the only scenario in which the subspace trained on the same time lapse as tested on did not offer the highest results. However, the 10+ year subspace only had 2,000 subjects to train on, while the other subspaces had 8,000 subjects available for training. This could also be explained by the complex nature of face aging, which manifests itself in different ways for different individuals, especially when the time lapse is large.

In each testing scenario the subspace labeled (All) is the one trained on 8,000 subjects exhibiting all the time lapse ranges considered above. While this subspace never had the top accuracy with respect to the other RS-LDA subspaces, it consistently performed well on all time lapses. This indicates that training with an equally distributed amount of time lapse is a viable solution when learning multiple subspace models is not feasible.

The performance of COTS1 exceeded the RS-LDA system in each testing scenario. However, the RS-LDA system was purposely designed to be relatively simple to help facilitate the scope of this study. Incorporating additional features, such as the SIFT descriptors and multiple patch sizes that Li et al. used in their aging-invariant recognition system [76], would result in improved performance. Despite this, the role of the training set in RS-LDA is clearly established when examining the performance of the RS-LDA subspaces over the baseline MLBP-only performance. MLBP Only makes use of the initial MLBP feature representation to measure the (dis)similarity between the faces, but does not perform training. Through the use of RS-LDA the recognition accuracy is improved substantially.

The large time lapse dataset with a large number of subjects presented in this study also enabled us to examine which regions of the face remained the most persistent, or retained the most discriminative power, over time.
To examine this stability, we measured the Fisher separability at each patch where the MLBP feature descriptors were computed. For a given face patch, we measured the Fisher separability as the ratio of the sum of eigenvalues from the between-class scatter to the sum of eigenvalues from the within-class scatter. This indicates the inherent separability provided by the Level 2 MLBP features at different regions of the face. These Fisher separability values at different time lapses are shown in Figure 4.4.

Figure 4.4: Inherent separability of different facial regions with aging. (a) The mean pixel values at each patch where MLBP feature descriptors are computed. (b) The scale of the Fisher separability criterion used (0.0 to 19.12). (c) Heat maps showing Fisher separability values at each image patch across different time lapses (0 to 1, 1 to 3, 3 to 5, 5 to 7, 7 to 9, and 9+ years of aging). As time lapse increases, the eyes and mouth regions seem to be the most stable sources of identifiable information.

The results show that while, as expected, the inherent separability decreases for each facial region as time increases, the mouth region has more discriminative information than the nose region, especially with the progression of time or aging. This also confirms the discriminative information contained in the region of the face around the eyes. Such information could be useful in explicitly assigning different weights to different face regions.

4.4.1 Computational Demands

Future work will attempt to leverage the additional face images contained in the 1.5 million mug shot image dataset available to us. However, one of the major difficulties we anticipate in this analysis is the computational demand of processing such a wide corpus of data. In this section we briefly highlight some of the challenges of processing a large scale face database.

In this study, each of the roughly 120,000 test images used was enrolled by Cognitec's and PittPatt's FRS. After enrollment, 869 million match comparisons were performed by each matcher to measure the performance on each time lapse dataset. The analysis of RS-LDA on the MLBP feature representation used all 200,000 images. This, in turn, required all images to be geometrically normalized using the eye locations automatically detected by the FaceVACS system. Once the images were aligned, the MLBP feature descriptors were extracted. With a 236-dimensional MLBP descriptor extracted at 285 patches across each face, roughly 48 GB of space was needed for storing these features.

For analyzing RS-LDA performance on each of the five time lapse training sets, a total of 869 million test set comparisons needed to be performed five times, resulting in a total of 4.34 billion comparisons. Other computational demands arose from the training of the RS-LDA subspaces on various sets of 8,000 subjects, performing parameter validation on four different parameter combinations1 in the RS-LDA framework, and generating the ROC curves for each score matrix. Machines with large amounts of RAM were also required to efficiently process the data. For example, the covariance analysis necessary for RS-LDA needed the MLBP features from all 8,000 subjects to be loaded into main memory. For testing, a major bottleneck occurred when loading the MLBP feature descriptors for each of the 12,000 subjects from disk.
This made it necessary to keep the MLBP features in memory as each of the 20 random subspaces were being processed (as opposed to releasing the memory as each image was projected into one of the subspaces). Efficient code design helped overcome some of these computational challenges. However, this study was primarily made possible by MSU’s High Performance Computing Center (HPCC), which provides a cloud computing service where at times over 40 different compute nodes, each with over 10gb of RAM, were used at the same time to meet the computational demands of this study. 4.5 Conclusions This chapter presents studies on the largest face aging dataset reported to date. These results demonstrate that (i) face recognition systems degrade monotonically as the time lapse between face images to be matched increases, (ii) training to improve face recognition performance in the presence of aging can lower its performance in non-aging scenarios, and (iii) the best performance on a particular amount of time lapse is achieved by training a system on that particular time lapse. Indeed, we see that face recognition across time lapse is similar to more traditional heterogeneous face recognition problems in that a different sets of feature subspaces are necessary to maximize the recognition accuracies. Similar to heterogeneous face recognition 1 Recognition accuracies based on training on roughly 10,000 subjects and testing on 10,000 subjects was explored on over two hundred parameter combinations. 117 Image enrolled in gallery with template in Age Space 1 April 1995 Template updated to Age Space 2 Template updated to Age Space 3 Template updated to Age Space4 April 1996 April 2001 April 2006 Figure 4.5: The ability to improve face recognition performance by training on the same time lapse being tested on suggests face recognition systems should update templates over time. For example, at fixed intervals from the original acquisition date the template is updated to reside in a subspaces trained for the time lapse that has occurred since acquisition. Probe images would be projected into each subspace and matched in the subspace corresponding to each gallery image. between different image modalities, these feature subspaces do not generalize well to the more constrained case (i.e. minimal time lapse). The findings presented in this chapter suggest a periodic update of face templates (see Figure 4.5). With a significant time lapse, updating the face template to reside in a subspace designed to capture the most discriminative features is likely to help improve the recognition performance in the presence of aging without compromising performance in cases where only a minimal amount of aging has occurred. Thus, much like heterogeneous face recognition between images from differing modalities, expanding face recognition algorithms to also handle different time lapses between face images requires multiple system configurations that are designed for specific recognition scenarios (e.g. matching faces with large amounts of aging, matching sketches, infrared images to photos). 118 Chapter 5 Face Recognition Performance: Role of Demographic Information 5.1 Introduction In the previous chapter we examined heterogeneity with respect to differences in age of the same person. In this chapter we will examine heterogeneity with the respect to differences in the race/ethnicity, gender, and age of different persons. That is, previously we examined the impact of within-class demographic variations (namely, age). 
The chapter presents the complement study: an examination of the impact of between-class demographic variations. As discussed in the first chapter, sources of errors in automated face recognition algorithms are generally attributed to the well studied variations in pose, illumination, and expression [108], collectively known as PIE. Other factors such as image quality (e.g., resolution, compression, blur), time lapse (facial aging), and occlusion also contribute to face recognition errors [43]. Previous studies have also shown within a specific demographic group (e.g., race/ethnicity, gender, age) that certain cohorts are more susceptible to errors in the face matching process [34, 111]. However, there has 119 yet to be a comprehensive study that investigates whether or not we can train face recognition algorithms to exploit knowledge regarding the demographic cohort of a probe subject. This study presents a large scale analysis of face recognition performance on three different demographics (see Figure 5.1): (i) race/ethnicity, (ii) gender, and (iii) age. For each of these demographics, we study the performance of six face recognition algorithms belonging to three different types of systems: (i) three commercial off the shelf (COTS) face recognition systems (FRS), (ii) face recognition algorithms that do not utilize training data, and (iii) a trainable face recognition algorithm. While the COTS FRS algorithms leverage training data, we are not able to re-train these algorithms; instead they are black box systems that output a measure of similarity between a pair of face images. The non-trainable algorithms use common feature representations to characterize face images, and similarities are measured within these feature spaces. The trainable face recognition algorithm used in this study also outputs a measure of similarity between a pair of face images. However, different versions of this algorithm can be generated by training it with different sets of face images, where the sets have been separated based on demographics. Both the trainable algorithms, and (presumably) the COTS FRS, initially use some variant of the non-trainable representations. The study of COTS FRS performance on each of the demographics considered is intended to augment previous experiments [34, 111] on whether these algorithms, as used in government and other applications, exhibit biases. Such biases would cause the performance of commercial algorithms to vary across demographic cohorts. In evaluating three different COTS FRS, we confirmed that not only do these algorithms perform worse on certain demographic cohorts, they consistently perform worse on the same cohorts (females, Blacks, and younger subjects). Even though biases of COTS FRS on various cohorts were observed in this study, 120 Age Gender Young Middle-Aged Old Female Male (a) (b) (c) (d) (e) Race/Ethnicity Black White Hispanic (f) (g) (h) Figure 5.1: Examples of the different demographics studied. (a-c) Age demographic. (d-e) Gender demographic. (f-h) Race/ethnicity demographic. Within each demographic, the following cohorts were isolated: (a) ages 18 to 30, (b) ages 30 to 50, (c) ages 50 to 70, (d) female gender, (e) male gender, (f) Black race, (g) White race, and (h) Hispanic ethnicity. The first row shows the “mean face” for each cohort. A “mean face” is the average pixel value computed from all the aligned face images in a cohort. The second and third rows show different sample images within the cohorts. 
121 these algorithms are black boxes that offer little insight into to why such errors manifest on specific demographic cohorts. To understand this, we also study the performance of non-commercial trainable and non-trainable face recognition algorithms, and whether statistical learning methods can leverage this phenomenon. By studying non-trainable face recognition algorithms, we gain an understanding of whether or not the errors are inherent to the specific demographics. This is because non-trainable algorithms operate by measuring the (dis)similarity of face images based on a specific feature representation that, ideally, encodes the structure and shape of the face. This similarity is measured independent of any knowledge of how face images vary for the same subject and between different subjects. Thus, cases in which the non-trainable algorithms have the same relative performance within a demographic group as the COTS FRS indicates that the errors are likely due to one of the cohorts being inherently more difficult to recognize. Relative differences in performance between the non-trainable algorithms and the COTS FRS indicate that the lower performance of COTS FRS on a particular cohort may be due to imbalanced training of the COTS algorithm. We explore this hypothesis by training the Spectrally Sampled Structural Subspace Features (4SF) face recognition algorithm [50] (i.e., the trainable face recognition algorithm used in this study) on image sets that consist exclusively of a particular cohort (e.g., White only). The learned subspaces in 4SF are applied to test sets from different cohorts to understand how unbalanced training with respect to a particular demographic impacts face recognition accuracy. The 4SF trained subspaces also help answer the following question: to what extent can statistical learning improve accuracy on a demographic cohort? For example, it will be shown that females are more difficult to recognize than males. We will investigate how much training on only females, for example, can improve face recognition accuracy when matching females. Such improvements suggest the use of multiple dis122 criminative subspaces (or face recognition algorithms), with each trained exclusively on different cohorts. The results of these experiments indicate we can improve face recognition performance on the race/ethnicity cohort by using an algorithm trained exclusively on different demographic cohorts. This finding leads to the notion of dynamic face matcher selection, where demographic information may be submitted in conjunction with a probe image in order to select the face matcher trained on the same cohort. This framework, illustrated in Figure 5.2, should lead to improved face recognition accuracies. The remainder of this chapter is organized as follows. In Section 5.2 we discuss previous studies on demographic introduced biases in face recognition algorithms and the design of face recognition algorithms. Section 5.3 discusses the data corpus that was utilized in this study. Section 5.4 identifies the different face recognition algorithms that were used in this study (commercial systems, trainable and nontrainable algorithms). Section 6.7 describes the matching experiments conducted on each demographic. Section 5.6 provides analysis of the results in each experiment and summarizes the contributions of this chapter. 
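The dynamic face matcher selection scheme described above can be viewed as a thin dispatch layer over a set of cohort-specific matchers. The sketch below illustrates that idea; the cohort keys, matcher interface, and function names are illustrative assumptions and are not the 4SF or COTS interfaces used in this study.

```python
from typing import Callable, Dict, Tuple

# A matcher maps (probe_image, gallery_image) to a similarity score.
Matcher = Callable[[object, object], float]

class DynamicMatcherSelector:
    """Route a probe to the matcher trained on its demographic cohort."""

    def __init__(self, cohort_matchers: Dict[Tuple[str, str, str], Matcher],
                 fallback: Matcher):
        self.cohort_matchers = cohort_matchers   # keyed by (race, gender, age band)
        self.fallback = fallback                 # e.g., a matcher trained on all cohorts

    def score(self, probe, gallery, demographics: Tuple[str, str, str]) -> float:
        matcher = self.cohort_matchers.get(demographics, self.fallback)
        return matcher(probe, gallery)

# Usage sketch: the operator submits the probe's demographics with the query.
def white_male_18_30_matcher(probe, gallery) -> float:
    return 0.0   # placeholder for a cohort-specific trained matcher

def generic_matcher(probe, gallery) -> float:
    return 0.0   # placeholder for a matcher trained on all cohorts

selector = DynamicMatcherSelector(
    {("White", "Male", "18-30"): white_male_18_30_matcher},
    fallback=generic_matcher)
# score = selector.score(probe_img, gallery_img, ("White", "Male", "18-30"))
```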
5.2 Prior Studies and Related Work Over the last twenty years the National Institute of Standards and Technology (NIST) has run a series of evaluations to quantify the performance of automated face recognition algorithms. Under certain imaging constraints these tests have measured a relative improvement of over two orders of magnitude in performance over the last two decades [34]. Despite these improvements, there are still many factors known to degrade face recognition performance (e.g., PIE, image quality, aging). In order to maximize the potential benefit of face recognition in forensics and law enforcement applications, we need to improve the ability of face recognition to sort through facial 123 images more accurately and in a manner that will allow us to perform more specialized or targeted searches. Facial searches leveraging demographics represents one such avenue for performance improvement. While there is no standard approach to automated face recognition, most face recognition algorithms follow a similar pipeline [73]: face detection, alignment, appearance normalization, feature representation (e.g., local binary patterns [99], Gabor features [151]), feature extraction [10, 148]), and matching [96]. Feature extraction generally relies on an offline training stage that utilizes exemplar data to learn improved feature combinations (such as feature subspaces). For example, variants of the linear discriminant analysis (LDA) algorithm [10, 148] use training data to compute between-class and within-class scatter matrices. Subspace projections are then computed to maximize the separability of subjects based on these scatter matrices. This study examines the impact of training on face recognition performance. Without leveraging training data, face recognition algorithms are not able to discern between noisy facial features and facial features which offer consistent cues to a subject’s identity. As such, automated face recognition algorithms are ultimately based on statistical models of the variance between individual faces. These algorithms seek to minimize the measured distance between facial images of the same subject, while maximizing the distance between the subject’s images and those of the rest of the population. However, the feature combinations discovered are functions of the data used to train the recognition system. If the training set is not representative of the data a face recognition algorithm will be operating on, then the performance of the resulting system may deteriorate. For example, the most distinguishing features for Black subjects may differ from White subjects. As such, if a system was predominantly trained on White faces, and later operated on Black faces, the learned representation may discard information useful for discerning Black faces. The observation that the performance of face recognition algorithms could suffer 124 Probe Image 1:N Match Biometric System Operator Gallery Database Subject Demographic: White, Male, 18-30 y.o. Black Male Black Female White Female White Male Suite of Face Recognition Systems Trained Exclusively on Different Demographics Figure 5.2: Dynamic face matcher selection. The findings in this study suggest that many face recognition scenarios may benefit from multiple face recognition systems that are trained exclusively on different demographic cohorts. Demographic information extracted from a probe image may be used to select the appropriate matcher, and improve face recognition accuracy. 
Without leveraging training data, face recognition algorithms are not able to discern between noisy facial features and facial features which offer consistent cues to a subject's identity. As such, automated face recognition algorithms are ultimately based on statistical models of the variance between individual faces. These algorithms seek to minimize the measured distance between facial images of the same subject, while maximizing the distance between the subject's images and those of the rest of the population. However, the feature combinations discovered are functions of the data used to train the recognition system. If the training set is not representative of the data a face recognition algorithm will be operating on, then the performance of the resulting system may deteriorate. For example, the most distinguishing features for Black subjects may differ from those for White subjects. As such, if a system was predominantly trained on White faces, and later operated on Black faces, the learned representation may discard information useful for discerning Black faces.

Figure 5.2: Dynamic face matcher selection. The findings in this study suggest that many face recognition scenarios may benefit from multiple face recognition systems that are trained exclusively on different demographic cohorts. Demographic information extracted from a probe image may be used to select the appropriate matcher, and improve face recognition accuracy. (The diagram shows an operator routing a probe image, with its demographics (e.g., White, male, 18-30 y.o.), to a suite of face recognition systems trained exclusively on different demographic cohorts before a 1:N search of the gallery database.)

The observation that the performance of face recognition algorithms could suffer if the training data is not representative of the test data is not new. One of the earliest studies reporting this phenomenon is not in the automated face recognition literature, but instead in the context of human face recognition. Coined the "other-race effect", humans have consistently demonstrated a decreased ability to recognize subjects from races different from their own [14, 127]. While there is no generally agreed upon explanation for this phenomenon, many researchers believe the decreased performance on other races is explained by the "contact" hypothesis, which postulates that the lower performance on other races is due to decreased exposure [20]. While the validity of the contact hypothesis has been disputed [97], the presence of the "other-race effect" has not.

From the perspective of automated face recognition, the findings of Phillips et al. in the 2002 government sponsored NIST Face Recognition Vendor Test (FRVT) are believed to be the first evidence that face recognition algorithms have different recognition accuracies depending on a subject's demographic cohort [111]. Among other findings, this study demonstrated for commercial face recognition algorithms, on a dataset containing roughly 120,000 images, that (i) female subjects were more difficult to recognize than male subjects, and (ii) younger subjects were generally more difficult to recognize than older subjects. More recently, Grother et al. measured the performance of seven commercial face recognition algorithms and three university face recognition algorithms in the 2010 NIST Multi-Biometric Evaluation [34]. The experiments conducted also concluded that females were more difficult to recognize than males. This study also measured the recognition accuracy on different races and ages.

Previous studies have investigated what impact the distribution of a training set has on recognition accuracy. Furl et al. [28] and O'Toole et al. [100] conducted studies to investigate the impact of cross training and matching on White and Asian races. Similar training biases were investigated by Klare and Jain [57], who showed that aging-invariant face recognition algorithms suffer from decreased performance in non-aging scenarios. The study in [100] was motivated by a rather surprising result in the 2006 NIST Face Recognition Vendor Test (FRVT) [110]. In this test, the various commercial and academic face recognition algorithms tested exhibited a common characteristic: algorithms which originated in East Asia performed better on Asian subjects than did algorithms from the West. The reverse was true for White subjects: algorithms developed in the Western Hemisphere performed better. O'Toole et al. suggested that this discrepancy was due to the different racial distributions in the training sets for the Western and Asian algorithms.

The impact of these training sets on face recognition algorithms cannot be overemphasized; face recognition algorithms do not generally rely upon explicit physiological models of the human face for determining match or non-match. Instead, the measure of similarity between face images is based on statistical learning, generally in the feature extraction stage [10, 80] or during the matching stage [96]. In this work, we expand on previous studies to better demonstrate and understand the impact of a training set on the performance of face recognition algorithms.
While previous studies [28, 100] only isolated the race variate, and only considered two races (i.e., Asian and White), this study explores both the inherent biases and training biases across gender, race/ethnicity (three different races/ethnicities), and age. To our knowledge, no studies have investigated the impact of gender or subject age on the training of face recognition algorithms.

5.3 Face Database

This study was enabled by a collection of over one million mug shot face images from the Pinellas County Sheriff's Office¹ (examples of these images can be found in Figure 5.1). Accompanying these images are complete subject demographics. The demographics provide the race/ethnicity, gender, and age of the subject in each image, as well as a subject ID number.

Given this large corpus of face images, we were able to use the metadata provided to control the three demographics studied: race/ethnicity, gender, and age. For gender, we partitioned image sets into cohorts of (i) male only and (ii) female only. For age, we partitioned the sets into three cohorts: (i) young (18 to 30 years old), (ii) middle-age (30 to 50 years old), and (iii) old (50 to 70 years old). There were very few individuals in this database younger than 18 or older than 70. For race/ethnicity², we partitioned the sets into cohorts of (i) White, (ii) Black, and (iii) Hispanic³. A summary of these cohorts and the number of subjects available for each cohort can be found in Table 5.1. Asian, Indian, and Unknown race/ethnicities were not considered because an insufficient number of samples were available.

¹ The mug shot data used in this study was acquired in the public domain through Florida's "Sunshine" laws. Subjects shown in this manuscript may or may not have been convicted of a criminal charge, and thus should be presumed innocent of any wrongdoing.
² Racial identifiers (i.e., White, Black, and Hispanic) follow the FBI's National Crime Information Center code manual.
³ Hispanic is not technically a race, but instead an ethnic category.

Table 5.1: Number of subjects used for training and testing for each demographic category. Two images per subject were used. Training and test sets were disjoint. A total of 102,942 face images were used in this study.

  Demographic   Cohort      # Training   # Testing
  Gender        Female      7995         7996
                Male        7996         7998
  Race          Black       7993         7992
                White       7997         8000
                Hispanic    1384         1425
  Age           18 to 30    7998         7999
                30 to 50    7995         7997
                50 to 70    2801         2853

For each of the eight cohorts (i.e., male, female, young, middle-aged, old, White, Black, and Hispanic), we created independent training and test sets of face images. Each set contains a maximum of 8,000 subjects, with two images (one probe and one gallery) for each subject. Table 5.1 lists the number of subjects included in each set. Cohorts with far fewer than 8,000 subjects (i.e., the Hispanic and old cohorts) reflect a lack of data available to us. Cohorts containing only slightly fewer than 8,000 subjects are the result of removing a few images that could not be successfully enrolled in the COTS FRS.

The dataset of mug shot images did not contain a large enough number of Asian subjects to measure that particular race/ethnicity cohort. However, studies by Furl et al. [28] and O'Toole et al. [100] investigated matching performance on White and East Asian subjects. As previously discussed, these studies concluded that algorithms developed in the Western Hemisphere did better on White subjects and Asian algorithms did better on Asian subjects.
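To illustrate how the metadata can drive the cohort construction summarized in Table 5.1, here is a minimal sketch; the record field names and the handling of the boundary ages (30 and 50) are assumptions of the sketch, not details taken from the study.

```python
def assign_cohorts(record):
    """Map one subject record (hypothetical fields: 'gender', 'race', 'age')
    to its gender, race/ethnicity, and age cohorts; returns None for any
    demographic that falls outside the cohorts studied in this chapter."""
    gender = record["gender"] if record["gender"] in ("Male", "Female") else None
    race = record["race"] if record["race"] in ("White", "Black", "Hispanic") else None
    age = record["age"]
    if 18 <= age <= 30:
        age_cohort = "18 to 30"
    elif 30 < age <= 50:
        age_cohort = "30 to 50"
    elif 50 < age <= 70:
        age_cohort = "50 to 70"
    else:
        age_cohort = None  # subjects younger than 18 or older than 70 were excluded
    return gender, race, age_cohort
```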
5.4 Face Recognition Algorithms

In this section we discuss each of the six face recognition algorithms used in this study. We have organized these algorithms into commercial algorithms (Sec. 5.4.1), non-trainable algorithms (Sec. 5.4.2), and trainable algorithms (Sec. 5.4.3).

5.4.1 Commercial Face Recognition Algorithms

Three commercial face recognition algorithms were evaluated in this study: (i) Cognitec's FaceVACS v8.2, (ii) PittPatt v5.2.2, and (iii) Neurotechnology's MegaMatcher v3.1. The results in this study obfuscate the names of the three commercial matchers. These commercial algorithms are three of the ten algorithms evaluated in the NIST sponsored Multi-Biometric Evaluation (MBE) [34]. As such, these algorithms are representative of the state of the art in face recognition technology.

5.4.2 Non-Trainable Face Recognition Algorithms

Two non-trainable face recognition algorithms were used in this study: (i) local binary patterns (LBP), and (ii) Gabor features. Both of these methods operate by representing the face with Level 2 facial features (LBP and Gabor), where Level 2 facial features are features that encode the structure and shape of the face, and are critical to face recognition algorithms [56].

These non-trainable algorithms perform an initial geometric normalization step (also referred to as alignment) by using the automatically detected eye coordinates (eyes were detected using the FaceVACS SDK) to scale, rotate, and crop a face image. After this step, the face image has a height and width of 128 pixels. Both algorithms are custom implementations by the authors.

Local Binary Patterns

A seminal method in face recognition is the use of local binary patterns [99] (LBP) to represent the face [3]. Local binary patterns are Level 2 features that represent small patches across the face with histograms of binary patterns that encode the structure and texture of the face. Local binary patterns describe each pixel using a p-bit binary number. Each bit is determined by sampling p pixel values at uniformly spaced locations along a circle of radius r, centered at the pixel being described. For each sampling location, the corresponding bit receives the value 1 if it is greater than or equal to the center pixel, and 0 otherwise.

A special case of LBP, called the uniform LBP [99], is generally used in face recognition. Uniform LBP assigns every non-uniform binary number to the same value, where a binary number is non-uniform if more than u transitions between the values 0 and 1 occur in the (circular) binary number. In the case of p = 8 and u = 2, the uniform LBP has 58 uniform binary numbers, and a 59th value is reserved for the remaining 256 − 58 = 198 non-uniform binary numbers. Thus, each pixel will take on a value ranging from 1 to 59. Two different radii are used (r = 1 and r = 2), resulting in two different local binary pattern representations that are subsequently concatenated together (called Multi-scale Local Binary Patterns, or MLBP).

In the context of face recognition, LBP values are first computed at each pixel in the (normalized) face image as previously described. The image is tessellated into patches with a height and width of 12 pixels. For each patch $i$, a histogram of the LBP values $S_i \in \mathbb{Z}^{d_s}$ is computed (where $d_s = 59$). This feature vector is then normalized to the feature vector $\bar{S}_i \in \mathbb{R}^{d_s}$ by $\bar{S}_i = S_i / \|S_i\|_1$. Finally, we concatenate the $N$ patch vectors into a single vector $x$ of dimensionality $d_s \cdot N$.
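The following is a minimal sketch of the MLBP representation described above (uniform LBP codes at radii r = 1 and r = 2, pooled into L1-normalized 59-bin histograms over 12 x 12 patches). It is not the authors' implementation; the circular-sampling details and border handling are simplifying assumptions.

```python
import numpy as np

def uniform_lbp_codes(img, radius=1, points=8):
    """Assign each interior pixel one of 59 codes: 58 uniform patterns plus a
    shared code for all non-uniform patterns (p = 8, u = 2)."""
    def transitions(pattern):
        bits = [(pattern >> k) & 1 for k in range(points)]
        return sum(bits[k] != bits[(k + 1) % points] for k in range(points))

    lut = np.full(2 ** points, 58, dtype=np.int32)   # 58 = bin shared by non-uniform patterns
    code = 0
    for pattern in range(2 ** points):
        if transitions(pattern) <= 2:                # "uniform" pattern
            lut[pattern] = code
            code += 1

    h, w = img.shape
    codes = np.full((h, w), 58, dtype=np.int32)      # border pixels left in the shared bin
    for y in range(radius, h - radius):
        for x in range(radius, w - radius):
            pattern = 0
            for k in range(points):
                angle = 2.0 * np.pi * k / points
                yy = int(round(y + radius * np.sin(angle)))
                xx = int(round(x + radius * np.cos(angle)))
                if img[yy, xx] >= img[y, x]:
                    pattern |= (1 << k)
            codes[y, x] = lut[pattern]
    return codes

def mlbp_feature_vector(face, patch=12, radii=(1, 2)):
    """Concatenate L1-normalized 59-bin histograms over non-overlapping patches,
    once per radius (multi-scale LBP)."""
    feats = []
    for r in radii:
        codes = uniform_lbp_codes(face, radius=r)
        for y0 in range(0, face.shape[0] - patch + 1, patch):
            for x0 in range(0, face.shape[1] - patch + 1, patch):
                hist = np.bincount(codes[y0:y0 + patch, x0:x0 + patch].ravel(),
                                   minlength=59).astype(float)
                feats.append(hist / max(hist.sum(), 1.0))
    return np.concatenate(feats)
```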
Figure 5.3: Overview of the Spectrally Sampled Structural Subspace Features (4SF) algorithm (training images → densely sampled local binary patterns → PCA decomposition per LBP histogram → random spectral sampling → linear discriminant analysis per spectral sample). This custom algorithm is representative of state of the art methods in face recognition. By changing the demographic distribution of the training sets input into the 4SF algorithm, we are able to analyze the impact the training distribution has on various demographic cohorts.

In our implementation, the illumination filter proposed by Tan and Triggs [136] is used prior to computing the LBP codes in order to suppress non-uniform illumination variations. This filter resulted in improved recognition performance.

Gabor Features

Gabor features are one of the first Level 2 facial features [56] to have been used with wide success in representing facial images [80, 128, 151]. One reason Gabor features are popular for representing both facial and natural images is their similarity with human neurological receptive fields [94, 122]. A Gabor image representation is computed by convolving a set of Gabor filters with an image (in this case, a face image). The Gabor filters are defined as

$$G(x, y, \theta, \eta, \gamma, f) = \frac{f^2}{\pi \gamma \eta} \exp\!\left( -\frac{f^2}{\gamma^2} x'^2 - \frac{f^2}{\eta^2} y'^2 \right) \exp\!\left( j 2 \pi f x' \right) \qquad (5.1)$$

$$x' = x \cos\theta + y \sin\theta \qquad (5.2)$$

$$y' = -x \sin\theta + y \cos\theta \qquad (5.3)$$

where f sets the filter scale (or frequency), $\theta$ is the filter orientation along the major axis, $\gamma$ controls the filter sharpness along the major axis, and $\eta$ controls the sharpness along the minor axis. Typically, combinations across the following values for the scale f and orientation $\theta$ are used: $f = \{0, 1, \ldots, 4\}$ and $\theta = \{\pi/8, \pi/4, 3\pi/8, \ldots, \pi\}$. This creates a set (or bank) of filters with different scales and orientations. Given the bank of Gabor filters, the input image is convolved with each filter, which results in a Gabor image for each filter. The combination of these scale and orientation values results in 40 different Gabor filters, which in turn results in 40 Gabor images.

In this chapter, the recognition experiments using a Gabor image representation operate by: (i) performing illumination correction using the method proposed by Tan and Triggs [136], (ii) computing the phase response of the Gabor images with $f = \{1, 2\}$ and $\theta = \{0, \pi/4, \pi/2, 3\pi/4\}$, (iii) tessellating the Gabor image(s) into patches of size 12x12, (iv) quantizing the phase response (which ranges from 0 to $2\pi$) into 24 values and computing the histogram within each patch, and (v) concatenating the histogram vectors into a single feature vector. Given two (aligned) face images, the distance between their corresponding Gabor feature vectors is used to measure the dissimilarity between the two face images.
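Below is a minimal sketch of a Gabor filter bank built from Eqs. (5.1)-(5.3). The window size and the literal frequency values are assumptions (the text indexes scales 0 through 4 rather than giving numeric frequencies), and scipy is used only for the convolution.

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_filter(size, f, theta, gamma=1.0, eta=1.0):
    """Complex Gabor filter following Eqs. (5.1)-(5.3)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    x_r = x * np.cos(theta) + y * np.sin(theta)      # Eq. (5.2)
    y_r = -x * np.sin(theta) + y * np.cos(theta)     # Eq. (5.3)
    envelope = np.exp(-(f ** 2 / gamma ** 2) * x_r ** 2 - (f ** 2 / eta ** 2) * y_r ** 2)
    carrier = np.exp(1j * 2.0 * np.pi * f * x_r)
    return (f ** 2 / (np.pi * gamma * eta)) * envelope * carrier   # Eq. (5.1)

def gabor_bank(frequencies=(0.05, 0.1, 0.2, 0.3, 0.4), n_orientations=8, size=31):
    """Bank of 5 x 8 = 40 filters; orientations are pi/8, 2*pi/8, ..., pi."""
    thetas = [np.pi * (k + 1) / n_orientations for k in range(n_orientations)]
    return [gabor_filter(size, f, t) for f in frequencies for t in thetas]

def gabor_phase_images(face, filters):
    """Convolve the face with each filter and keep the phase response in [0, 2*pi)."""
    return [np.angle(fftconvolve(face, flt, mode="same")) % (2.0 * np.pi)
            for flt in filters]
```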
5.4.3 Trainable Face Recognition Algorithm

The trainable algorithm used in this study is the Spectrally Sampled Structural Subspace Features algorithm [50], which is abbreviated as 4SF. This algorithm uses multiple discriminative subspaces to perform face recognition. After geometric normalization of a face image using the automatically detected eye coordinates (eyes were detected using the FaceVACS SDK), illumination correction is performed using the illumination correction filter presented by Tan and Triggs [136]. Face images are then represented using histograms of local binary patterns at densely sampled face patches [3] (to this point, 4SF is the same as the non-trainable LBP algorithm described in Sec. 5.4.2). For each face patch, principal component analysis (PCA) is performed so that 98.0% of the variance is retained. Given a training set of subjects, multiple stages of weighted random sampling are performed, where the spectral densities (i.e., the eigenvalues) from each face patch are used for weighting. The randomly sampled subspaces are based on Ho's original method [37]; however, the proposed approach is unique in that the sampling is weighted based on the spectral densities. For each stage of random sampling, LDA [10] is performed on the randomly sampled components. The LDA subspaces are learned using subjects randomly sampled from the training set (i.e., bagging [15]). Finally, distance-based recognition is performed by projecting the LBP representation of face images into the per-patch PCA subspaces, and then into each of the learned LDA subspaces. The sum of the Euclidean distances in each subspace is the dissimilarity between two face images. The 4SF algorithm is summarized in Figure 5.3.

As shown in the experiments conducted in this study, the 4SF algorithm performs on par with several commercial face recognition algorithms. Because 4SF is initially the same approach as the non-trainable LBP matcher, the improvement in recognition accuracies (in this study) between the non-trainable LBP matcher and the 4SF algorithm clearly demonstrates the ability of 4SF to leverage training data. Thus, a high matching accuracy and the ability to leverage training data make 4SF an ideal face recognition algorithm for studying the effects of training data on face recognition performance. The 4SF algorithm was developed in-house.
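The sketch below outlines the 4SF training and matching stages just described (per-patch PCA retaining 98% of the variance, spectrally weighted random sampling of the pooled PCA components, one LDA per sampling stage, and a summed subspace distance). The number of stages, sampling fractions, and subspace dimensionality are assumptions, and the `lda_subspace` helper from the earlier sketch is reused; this is an illustration, not the in-house implementation.

```python
import numpy as np

def pca_per_patch(patch_histograms, retain=0.98):
    """Fit one PCA per patch, keeping enough components for `retain` of the variance.
    patch_histograms: list over patches of (n_images x 59) LBP histogram arrays."""
    models = []
    for H in patch_histograms:
        mu = H.mean(axis=0)
        evals, evecs = np.linalg.eigh(np.cov(H - mu, rowvar=False))
        order = np.argsort(-evals)
        evals, evecs = evals[order], evecs[:, order]
        k = int(np.searchsorted(np.cumsum(evals) / evals.sum(), retain)) + 1
        models.append((mu, evecs[:, :k], evals[:k]))   # mean, components, spectral densities
    return models

def pooled_pca_coords(patch_histograms, pca_models):
    """Project every patch histogram into its PCA subspace and pool the coordinates."""
    return np.hstack([(H - mu) @ W for H, (mu, W, _) in zip(patch_histograms, pca_models)])

def train_4sf_like(patch_histograms, labels, n_stages=20, sample_frac=0.5,
                   bag_frac=0.8, n_lda=64, seed=0):
    """Spectrally weighted random sampling of PCA components, then LDA per stage."""
    rng = np.random.default_rng(seed)
    pca_models = pca_per_patch(patch_histograms)
    coords = pooled_pca_coords(patch_histograms, pca_models)
    spectra = np.concatenate([lam for (_, _, lam) in pca_models])
    probs = spectra / spectra.sum()                    # eigenvalue-weighted sampling
    subjects = np.unique(labels)
    stages = []
    for _ in range(n_stages):
        idx = rng.choice(coords.shape[1], size=int(sample_frac * coords.shape[1]),
                         replace=False, p=probs)
        bag = rng.choice(subjects, size=int(bag_frac * len(subjects)), replace=False)
        mask = np.isin(labels, bag)                    # bagging over training subjects
        P = lda_subspace(coords[mask][:, idx], labels[mask], n_components=n_lda)
        stages.append((idx, P))
    return pca_models, stages

def dissimilarity_4sf_like(coords_a, coords_b, stages):
    """Sum of Euclidean distances over the learned discriminative subspaces."""
    return sum(np.linalg.norm((coords_a[idx] - coords_b[idx]) @ P) for idx, P in stages)
```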
(Figures 5.4 through 5.29 are ROC curves plotting True Accept Rate against False Accept Rate; only their captions are reproduced here.)

Figure 5.4: Performance of the COTS-A commercial face recognition system on datasets separated by cohorts within the gender demographic.
Figure 5.5: Performance of the COTS-B commercial face recognition system on datasets separated by cohorts within the gender demographic.
Figure 5.6: Performance of the COTS-C commercial face recognition system on datasets separated by cohorts within the gender demographic.
Figure 5.7: Performance of the local binary pattern-based non-trainable face recognition system on datasets separated by cohorts within the gender demographic.
Figure 5.8: Performance of the Gabor-based non-trainable face recognition system on datasets separated by cohorts within the gender demographic.
Figure 5.9: Performance of the 4SF algorithm trained on an equal number of samples from each gender on datasets separated by cohorts within the gender demographic.
Figure 5.10: Performance of the different trained versions of the 4SF algorithm on the Females cohort.
Figure 5.11: Performance of the different trained versions of the 4SF algorithm on the Male cohort.
Figure 5.12: Performance of the COTS-A commercial face recognition system on datasets separated by cohorts within the race demographic.
Figure 5.13: Performance of the COTS-B commercial face recognition system on datasets separated by cohorts within the race demographic.
Figure 5.14: Performance of the COTS-C commercial face recognition system on datasets separated by cohorts within the race demographic.
Figure 5.15: Performance of the local binary pattern-based non-trainable face recognition system on datasets separated by cohorts within the race demographic.
Figure 5.16: Performance of the Gabor-based non-trainable face recognition system on datasets separated by cohorts within the race demographic.
Figure 5.17: Performance of the 4SF algorithm trained on an equal number of samples from each race on datasets separated by cohorts within the race demographic.
Figure 5.18: Performance of the different trained versions of the 4SF algorithm on the Black cohort.
Figure 5.19: Performance of the different trained versions of the 4SF algorithm on the White cohort.
Figure 5.20: Performance of the different trained versions of the 4SF algorithm on the Hispanic cohort.
Figure 5.21: Performance of the COTS-A commercial face recognition system on datasets separated by cohorts within the age demographic.
Figure 5.22: Performance of the COTS-B commercial face recognition system on datasets separated by cohorts within the age demographic.
Figure 5.23: Performance of the COTS-C commercial face recognition system on datasets separated by cohorts within the age demographic.
Figure 5.24: Performance of the local binary pattern-based non-trainable face recognition system on datasets separated by cohorts within the age demographic.
Figure 5.25: Performance of the Gabor-based non-trainable face recognition system on datasets separated by cohorts within the age demographic.
Figure 5.26: Performance of the 4SF algorithm trained on an equal distribution of samples across age on datasets separated by cohorts within the age demographic.
Figure 5.27: Performance of the different trained versions of the 4SF algorithm on the Ages 18 to 30 cohort.
Figure 5.28: Performance of the different trained versions of the 4SF algorithm on the Ages 30 to 50 cohort.
Figure 5.29: Performance of the different trained versions of the 4SF algorithm on the Ages 50 to 70 cohort.
5.5 Experimental Results

For each demographic (gender, race/ethnicity, and age), three separate matching experiments are conducted. The results of these experiments are presented per demographic. Figures 5.4 to 5.11 delineate the results for all the experiments on the gender demographic. Figures 5.12 to 5.20 delineate the results for all experiments on the race/ethnicity demographic. Finally, Figures 5.21 to 5.29 delineate the results for all experiments on the age demographic. The true accept rates at a fixed false accept rate of 0.1% for all the aforementioned plots are summarized in Tables 5.2, 5.3, and 5.4.
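For reference, this is how an operating point such as "TAR at FAR = 0.1%" can be read off a set of genuine and impostor similarity scores; a minimal sketch, assuming higher scores indicate greater similarity.

```python
import numpy as np

def tar_at_far(genuine_scores, impostor_scores, far=0.001):
    """True accept rate at a fixed false accept rate: threshold at the (1 - far)
    quantile of the impostor scores, then count genuine scores above it."""
    threshold = np.quantile(np.asarray(impostor_scores, dtype=float), 1.0 - far)
    return float(np.mean(np.asarray(genuine_scores, dtype=float) >= threshold))
```

The values reported in Tables 5.2 to 5.4 correspond to this operating point with far = 0.001.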
The first experiment conducted on each demographic measures the relative performance within the demographic cohort for each COTS FRS. That is, for a particular commercial matcher (e.g., COTS-A), we compare its matching accuracy on each cohort within that demographic. For example, on the gender demographic, this experiment will measure the difference in recognition accuracy for commercial matchers on males versus females. The results from this set of experiments can be found in Figures 5.4, 5.5, and 5.6 for the gender demographic, Figures 5.12, 5.13, and 5.14 for the race/ethnicity demographic, and Figures 5.21, 5.22, and 5.23 for the age demographic.

The second experiment conducted on each demographic measures the relative performance within each cohort for the non-trainable face recognition algorithms. Because the non-trainable algorithms do not leverage statistical variability in faces, they are not susceptible to training biases. Instead, they reflect the inherent (or a priori) difficulty of recognizing cohorts of subjects within a specific demographic group. The results from this set of experiments can be found in Figures 5.7 and 5.8 for the gender demographic, Figures 5.15 and 5.16 for the race/ethnicity demographic, and Figures 5.24 and 5.25 for the age demographic.

The final experiment investigates the influence of the training set on recognition performance. Within each demographic, we train several versions of the 4SF algorithm (one for each cohort). These differently trained versions of the 4SF algorithm are then applied to separate testing sets from each cohort within the particular demographic. This enables us to understand, within the gender demographic (for example), how much training exclusively on females (i) improves performance on females, and (ii) decreases performance on males. In addition to training 4SF exclusively on each cohort, we also use a version of 4SF trained on an equal representation of the specific demographic cohorts (referred to as "Trained on All"). For example, in the gender demographic, this means that for "All", 4SF was trained on 4,000 male subjects and 4,000 female subjects. The results from this set of experiments can be found in Figures 5.9 to 5.11 for the gender demographic, Figures 5.17 to 5.20 for the race/ethnicity demographic, and Figures 5.26 to 5.29 for the age demographic.
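A minimal sketch of this cross-cohort protocol is given below; `train_matcher` and `evaluate_tar` are placeholders standing in for training a 4SF-style matcher on one cohort and measuring its TAR at FAR = 0.1% on a disjoint test cohort.

```python
def cross_cohort_experiment(train_sets, test_sets, train_matcher, evaluate_tar):
    """train_sets / test_sets map a cohort name (e.g. 'Female', 'All') to its
    image set; returns a grid of accuracies indexed by (train cohort, test cohort)."""
    results = {}
    for train_cohort, train_data in train_sets.items():
        matcher = train_matcher(train_data)            # one matcher per training cohort
        for test_cohort, test_data in test_sets.items():
            results[(train_cohort, test_cohort)] = evaluate_tar(matcher, test_data)
    return results
```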
5.6 Analysis

In this section we provide an analysis of the findings of the experiments described in Section 5.5. A strength of this study is the large face dataset leveraged; accuracies measured on each cohort (except the Hispanic and old cohorts) are from roughly 8,000 subjects.

5.6.1 Gender

Each of the three commercial face recognition algorithms performed significantly worse on the female cohort than the male cohort (see Figures 5.4, 5.5, and 5.6). Additionally, both non-trainable algorithms (LBP and Gabor) performed significantly worse on females (see Figures 5.7 and 5.8). The agreement in relative accuracies of the COTS FRS and the non-trainable LBP method on the gender demographic suggests that the female cohort is more difficult to recognize using frontal face images than the male cohort. That is, if the results for the COTS algorithms were due to imbalanced training sets (i.e., training on more males than females), then the LBP matcher should have yielded similar matching accuracies on males and females. Instead, the non-trained LBP and Gabor matchers performed worse on the female cohort. When training on males and females equally (Figure 5.9), the 4SF algorithm also did significantly worse on the female cohort. Together, these results strongly suggest that the female cohort is inherently more difficult to recognize.

The results of the 4SF algorithm on the female cohort (Figure 5.10) offer additional evidence about the nature of the discrepancy. The performance of training on only females is not higher than the performance of training on a mix of males and females (labeled "All"). Further, the difference in performance when training on only males versus training on only females is much lower than the difference in performance between males and females for the non-trainable algorithms. In other words, the difficulty in recognizing females seems to be due to a lower ratio of inter-class variance to intra-class variance in the initial face image representations.

Different factors may explain why females appear more difficult to recognize than males. One explanation may be that because females often use cosmetics (i.e., makeup), and males generally do not, there is a higher within-class variance for females. This hypothesis is supported by the match score distributions for males and females (see Figures 5.30 and 5.31). A greater difference between the male and female distributions is noticed for the true match distributions than for the false match distributions. The increased dissimilarities between images of the same female subjects demonstrate higher intra-class variability. Again, a cause of this may be the application of cosmetics.

Table 5.2: Listed are the true accept rates at a fixed false accept rate of 0.1% for each matcher on the gender demographic.

  Matcher                  Females   Males
  COTS-A                   89.5      94.4
  COTS-B                   81.6      89.3
  COTS-C                   70.3      80.9
  LBP                      54.4      74.0
  Gabor                    56.0      68.2
  4SF trained on All       73.0      86.2
  4SF trained on Females   71.5      85.0
  4SF trained on Males     69.0      86.3

Table 5.3: Listed are the true accept rates at a fixed false accept rate of 0.1% for each matcher on the race dataset.

  Matcher                   Black   White   Hispanic
  COTS-A                    88.7    94.4    95.7
  COTS-B                    81.3    89.0    90.7
  COTS-C                    74.0    79.8    87.3
  LBP                       65.3    70.5    73.5
  Gabor                     61.6    63.7    70.9
  4SF trained on All        78.4    83.0    86.3
  4SF trained on Black      80.2    81.0    59.8
  4SF trained on White      75.4    84.5    59.9
  4SF trained on Hispanic   74.5    80.2    60.1

Table 5.4: Listed are the true accept rates at a fixed false accept rate of 0.1% for each matcher on the age dataset.

  Matcher                        18 to 30 y.o.   30 to 50 y.o.   50 to 70 y.o.
  COTS-A                         91.7            94.6            94.4
  COTS-B                         86.1            89.1            87.5
  COTS-C                         76.5            80.7            83.6
  LBP                            69.4            74.7            75.1
  Gabor                          61.7            68.2            65.7
  4SF trained on All             81.5            85.6            83.6
  4SF trained on 18 to 30 y.o.   83.3            85.9            80.7
  4SF trained on 30 to 50 y.o.   82.1            86.0            82.2
  4SF trained on 50 to 70 y.o.   78.7            84.5            82.0

5.6.2 Race

When examining the race/ethnicity demographic, all three commercial face recognition algorithms achieved the lowest matching accuracy on the Black cohort (see Figures 5.12 to 5.14). The two non-trained algorithms had similar results (Figures 5.15 and 5.16). When matching against only Black subjects (Figure 5.18), 4SF has higher accuracy when trained exclusively on Black subjects (about a 5% improvement over the systems trained on Whites and Hispanics only). Similarly, when evaluating 4SF on only White subjects (Figure 5.19), the system trained on only the White cohort had the highest accuracy. However, when comparing the 4SF algorithm trained equally on all race/ethnicity cohorts (Figure 5.17), we see that the performance on the Black cohort is still lower than on the White cohort. Thus, even with balanced training, the Black cohort is still more difficult to recognize. The key finding in the training results shown in Figures 5.17 to 5.20 is the ability to improve recognition accuracy by training exclusively on subjects of the same race/ethnicity.
Compared to balanced training (i.e., training on "All"), the performance of 4SF when trained on the same race/ethnicity it is recognizing is higher. Thus, by merely changing the distribution of the training set, we can improve the recognition rate by nearly 2% on the Black cohort and 1.5% on the White cohort (see Table 5.3). The inability to effectively train on the Hispanic cohort is likely due to the insufficient number of training samples available for this cohort. However, the biogeographic ancestry of the Hispanic ethnicity is generally attributed to a three-way admixture of Native American, European, and West African populations [88]. Even with an increased number of training samples, we believe this mixture of races would limit the ability to improve recognition accuracy through race/ethnicity-specific training.

5.6.3 Age Demographic

All three commercial algorithms had the lowest matching accuracy on subjects grouped in the ages 18 to 30 (see Figures 5.21 to 5.23). The COTS-A matcher performed nearly the same on the 30 to 50 year old cohort as on the 50 to 70 year old cohort. However, COTS-B had slightly higher accuracy on the 30 to 50 age group than the 50 to 70 age group, while COTS-C performed slightly better on the 50 to 70 than the 30 to 50 age group. The non-trainable algorithms (Figures 5.24 and 5.25) also performed the worst on the 18 to 30 age cohort.

When evaluating 4SF on only the 18 to 30 year old cohort (Figure 5.27) and the 30 to 50 year old cohort (Figure 5.28), the highest performance was achieved when training on the same cohort. Table 5.4 lists the exact accuracies. Similar to race, we were able to improve recognition accuracy by merely changing the distribution of the training set. When comparing the 4SF system that is trained with an equal number of subjects from all age cohorts, the performance on the 18 to 30 year old cohort is the lowest. This is consistent with the accuracies of the commercial face recognition algorithms. The less effective results from training on the 50 to 70 year old cohort are likely due to the small number of training subjects. This is consistent with the training results on the Hispanic cohort, which also had a small number of training subjects.

Figure 5.30: Match score distributions for the (a) male and (b) female genders using the 4SF system trained with an equal number of male and female subjects. All histograms are aligned on the same horizontal axis. (The panels plot score frequency against dissimilarity for the true match and false match distributions; plots omitted.)

Figure 5.31: Genuine and impostor score distributions for the male and female genders using the 4SF system trained with an equal number of male and female subjects. The increased distances (dissimilarities) for the true match comparisons in the female cohort suggest increased within-class variance in the female cohort. All histograms are aligned on the same horizontal axis. (Plots omitted.)

5.6.4 Impact of Training

The demographic distribution of the training set generally had a clear impact on the performance on different demographic groups. Particularly in the case of race/ethnicity, we see that training on a set of subjects from the same demographic cohort as the one being matched offers an increase in the True Accept Rate (TAR). This finding is particularly important because in most operational scenarios, particularly those
dealing with forensics and law enforcement, the use of face recognition is not being done in a fully automated, "lights out" mode. Instead, an operator is usually interacting with a face recognition system, performing a one-to-one verification task, or exploring the gallery to group together candidates in clusters for further exploitation. Each of these scenarios can benefit from the use of demographic-enhanced matching algorithms, as described below.

Scenario 1 - 1:N Search: In many large face recognition database searches, the objective is to have the true match candidates ranked high enough to be found by the analyst performing the candidate adjudication. While it will not always be the case, under many conditions the analyst will be able to categorize the demographics of the probe image based on age, gender, and/or race/ethnicity. In such a situation, if the analyst has the option to select a different matching algorithm that has been trained for that specific demographic group, then improved matching results should be expected. A schematic of this is shown in Figure 5.2. For example, a probe subject could be searched using an algorithm trained on male, White subjects aged 18 to 30. If a true match is not found using that algorithm, then a more generic algorithm might be used as a follow-up to further search the gallery. Note that this scenario does not require that the gallery images be pre-classified based on specific demographic information. Instead, the algorithm should simply generate higher match scores for subjects that share the characteristics of that demographic cohort. We call this method of face search dynamic face matcher selection. In cases where the demographic is unclear (e.g., a mixed race/ethnicity subject), the matcher trained on all cohorts equally can be used. Examples of improved retrieval instances through applying this technique can be found in Figure 5.32.

Figure 5.32: Shown are examples where dynamic face matcher selection improved the retrieval accuracy. The final two columns show the less frequent case where such a technique reduced the retrieval accuracy. Retrieval ranks are out of roughly 8,000 gallery subjects for each cohort. Leveraging demographic information (such as race/ethnicity in this example) allows a face recognition system to perform the matching using statistical models that are tuned to the differences within the specific cohort. (The figure shows probe images, their gallery mates, and the retrieval ranks for 4SF trained on all cohorts equally versus 4SF trained exclusively on the (a) White and (b) Black cohorts; images omitted.)

Scenario 2 - 1:1 Verification: It is often the case that investigators will identify a possible match to a known subject and will request an analyst to perform a 1:1 verification of the match. This also happens as a result of a 1:N search, once a potential match to a probe is identified. In either case, the analyst must reach a determination of match or no-match. In fully automated systems, this decision is based on a numerical similarity threshold.
In some environments, the analyst is prevented from seeing the similarity score out of concern that his judgment will be biased. But in others, the analyst is permitted to incorporate this score into his analysis. In either case, it is anticipated that an algorithm trained on a specific demographic group will return higher match scores for true matches than one that is more generic. As a result, the analyst is more likely to get a hit and the 1:1 verification process will be improved.

Scenario 3 - Verification at Border Crossings: The results presented here provide support for further testing of additional demographic groups, potentially including specific country or geographic region of origin. Assuming such demographics proved effective at improving match scores, the use of dynamic face matcher selection could be extended to immigration or border checks on entering subjects to verify that their passport or other documents accurately reflect their country of origin.

Scenario 4 - Face Clustering: Another analyst-driven application involves the exploitation of large sets of uncontrolled face imagery. Images encountered in intelligence or investigative applications often include large sets of videos or arbitrary photographs taken with no intention of enrolling them in a face recognition environment. Such image sets offer a great potential for the development of intelligence leads by locating multiple pictures of specific individuals and giving analysts an opportunity to link subjects who may be found within the same photographs. Clustering methods are now being used on these datasets to group faces that appear to represent the same subject. Implementations of such clustering methods today usually rely upon a single algorithm to perform the grouping, and an analyst must perform the quality control step to determine if a particular cluster contains only a single individual. By combining multiple demographic-based algorithms into a sequential analysis, it may be possible to improve the clustering of large sets of face images and thereby reduce the time required for the analyst to perform the adjudication of individual clusters.

5.7 Conclusions

In this chapter we examined face recognition performance on different demographic cohorts using a large operational database of 102,942 face images. Three demographics were analyzed: gender (male and female), race/ethnicity (White, Black, and Hispanic), and age (18 to 30 years old, 30 to 50 years old, and 50 to 70 years old). For each demographic cohort, the performances of three commercial face recognition algorithms were measured. The performances of all three commercial algorithms were consistent in that they all exhibited lower recognition accuracies on the following cohorts: females, Blacks, and younger subjects (18 to 30 years old).

Additional experiments were conducted to measure the performance of non-trainable face recognition algorithms (local binary pattern-based and Gabor-based), and a trainable subspace method (the Spectrally Sampled Structural Subspace Features (4SF) algorithm). These experiments offered additional evidence to form hypotheses about the observed discrepancies between certain demographic cohorts. Some of the key findings in this study are:

• The female, Black, and younger cohorts are more difficult to recognize for all matchers used in this study (commercial, non-trainable, and trainable).

• Face recognition performance on race/ethnicity and age cohorts generally improves when training exclusively on that same cohort.
• The above finding suggests the use of dynamic face matcher selection, where multiple face recognition systems, trained on different demographic cohorts, are available as a suite of systems for operators to select from based on the demographic information of a given query image (see Figure 5.2).

• In scenarios where dynamic matcher selection is not possible, training face recognition systems on datasets that are well distributed across all demographics is critical to reduce face matcher vulnerabilities on specific demographic cohorts.

Finally, as with any empirical finding, additional ways to exploit the findings of this research are likely to be found. Of particular interest is the observation that women appear to be more difficult to identify through facial recognition than men. If we can determine the cause of this difference, it may be possible to use that information to improve the overall matching performance.

The experiments conducted in this chapter should have a significant impact on the design of face recognition algorithms. Similar to the large body of research on algorithms that improve face recognition performance in the presence of other variates known to compromise recognition accuracy (e.g., pose, illumination, and aging), the results in this study should motivate the design of algorithms that specifically target different cohorts within the race/ethnicity, gender, and age demographics. By focusing on improving the recognition accuracy on such confounding cohorts (i.e., females, Blacks, and younger subjects), researchers should be able to further reduce the error rates of state of the art face recognition algorithms and reduce the vulnerabilities of such systems used in operational environments.

Chapter 6

Towards Automated Caricature Recognition

6.1 Introduction

Among the remarkable capabilities possessed by the human visual system, perhaps none is more compelling than our ability to recognize a person from a caricature. A caricature is a face image in which certain facial attributes and features have been exaggerated to a degree that is often beyond realism, and yet the face is still recognizable (see Fig. 6.1). As Leopold et al. discussed [69], the caricature generation process can be conceptualized by considering each face to lie in a face space. In this space, a caricature face lies along the line connecting the mean face¹ and a subject's face, extended beyond the subject's face. In other words, a caricature is an extrapolated version of the original face. Despite the (often extreme) exaggeration of facial features, the identity of a subject in a caricature is generally obvious, provided the original face is known to the viewer. In fact, studies have suggested that people may be better at recognizing a familiar person through a caricature portrait than from a veridical portrait² [90, 118].

¹ A mean face is the average appearance of all faces.
² A veridical portrait is a highly accurate facial sketch of a subject.

Figure 6.1: Examples of caricatures (top row) and photographs (bottom row) of four different personalities. Shown are: (a) Angelina Jolie (drawn by Rok Dovecar), (b) Adam Sandler (drawn by Dan Johnson), (c) Bruce Willis (drawn by Jon Moss), and (d) Taylor Swift (drawn by Pat McMichael).

So why is it that an exaggerated, or extrapolated, version of a face can be so easy to recognize? Studies in human cognition have suggested this phenomenon is correlated with how humans represent and encode facial identity [90].
Empirical studies suggest that this representation involves the use of prototype faces, where a face image is encoded in terms of its similarity to a set of prototype face images [69, 130, 145]. Under this assumption, the effectiveness of a caricature would be due to its ability to emphasize deviations from prototypical faces. This would also explain why faces that are "average" looking, or typical, are more difficult to recognize [145].

Automated face recognition, despite its significant progress over the past decade [34], still has many limitations. State of the art face recognition algorithms are not able to meet the performance requirements in uncontrolled and non-cooperative face matching scenarios, such as surveillance. We believe clues on how we can better compute the similarity between faces may be found through investigating the caricature matching process [51].

In this chapter we further expand our contributions to heterogeneous face recognition by studying the process of automatically matching a caricature to a facial photograph. To accomplish this, we define a set of qualitative facial attributes (e.g., "nose to mouth distance") that are used to encode the appearance of a face (caricature or photograph). These features, called "qualitative features", are generally on a nominal scale (and occasionally on an ordinal scale) and characterize whether a particular facial attribute is typical or atypical (deviates from the mean face). Statistical learning is performed to learn feature weightings and the optimal subset of these features. While several methods exist for automating the caricature generation process, to our knowledge, this is the first attempt to automate the caricature recognition process. In addition to reporting impressive accuracies on this difficult heterogeneous face recognition task, we are also releasing a caricature recognition dataset, experimental protocol, and the qualitative feature labels to the research community. Through the design and performance evaluation of caricature recognition algorithms, it is our belief that we will help advance the state of automatic face recognition through the discovery of additional facial representations and feature weightings [5].

6.2 Related Work

Caricature recognition belongs to a face recognition paradigm known as heterogeneous face recognition (HFR) [54], which has been well discussed in this dissertation. In brief, heterogeneous face recognition is the task of matching two faces from alternate modalities. Solutions to heterogeneous face recognition problems generally follow one of two approaches. The first approach, popularized by Wang and Tang [149], seeks to synthesize an image from one of the modalities (e.g., sketch) in the second modality (e.g., photograph). Once this synthesis has occurred, standard matching algorithms can be applied in the now common modality. The second approach to HFR is to densely sample feature descriptors (such as local binary patterns (LBP) [99]) from the images in each modality. The feature descriptor is selected such that it varies little when moving between the imaging modalities, while still capturing key discriminative information. A benefit of this feature-based approach is that it facilitates statistical subspace learning (such as linear discriminant analysis (LDA) [10] and its variants) to further improve the class separability. This approach has been successfully used by Liao et al. [77], Klare and Jain [52, 54, 59], and Bhatt et al. [12].
In the context of caricature recognition, an image feature descriptor-based approach is challenged because the descriptors from the caricature and photograph may not be highly correlated due to misalignment caused by feature exaggerations (e.g., the nose in the caricature may extend to where the mouth or chin is in the photograph). However, the application of LDA, in a manner similar to other HFR studies [52, 59], somewhat compensates for these misalignments. Further, LDA offers a solution to the intra-artist variability through the modeling of the within-class scatter. For these reasons, the study in this chapter also makes use of the image feature descriptor-based approach in addition to the qualitative feature-based approach (see Section 6.6).

A major contribution of this chapter is the definition of a set of categorical, or nominal, facial attributes. This approach is similar to the attribute and simile features proposed by Kumar et al. [65], who demonstrated the benefit of this nominal feature representation for recognizing face images. While we present a similar representation, the features proposed here have been carefully defined by a professional artist with experience in drawing caricatures.

A number of methods in graphics have been developed for automatically generating caricature images [6, 7, 16, 64, 70]. However, to our knowledge, no previous research on matching caricatures to photographs has been conducted. The method proposed by Hsu and Jain [38] was the closest attempt, where facial photographs were matched by first synthesizing them into caricature drawings. Klare and Jain considered the task of matching facial carvings [62] and avatar face images [153], both of which exhibited some facial disproportions that are similar to caricatures.

6.3 Caricature Dataset

In this section we describe the dataset that was used in this study. Future studies comparing accuracies on this dataset should follow the protocol detailed in Section 6.7. The dataset consists of pairs of a caricature sketch and a corresponding facial photograph from 196 subjects (see Fig. 6.1 for examples). Two sources were used to collect these images. The first was through contacts with various artists who drew the caricatures. For these images, permission was granted to freely distribute the caricatures. In total, 89 caricatures were collected from this source. The second source of caricature images was Google Image searches. For these caricatures, the URL of the image was recorded, and is included in the dataset release (along with the actual image). There were 107 pairs from this source. The corresponding face image for each subject was provided by the caricature artist for caricatures from the first source, and by Google Image search for the second source. When selecting face photographs, care was taken to find images that had minimal variations in pose, illumination, and expression. However, such "ideal" images do not always exist. Thus, many of the PIE (pose, illumination, and expression) factors still persist.

Figure 6.2: Different forms of facial sketches (b-d). (a) Photograph of a subject. (b) Portrait sketch. (c) Forensic sketch drawn by Semih Poroy from a verbal description. (d) Caricature sketch.

6.4 Qualitative Feature Representation

We define a set of categorical facial features for representing caricature images and face photographs. These features were developed by one of the authors, who is a cartoonist (in addition to being a professor of electrical engineering [4]).
While face images are typically encoded by high-dimensional numerical features (such as local binary patterns [99]), the tendency of a caricature to exaggerate distinctive facial features [117] makes such numerical encodings inappropriate for representing caricature images. Instead, the proposed qualitative features describe the facial characteristics that a caricature artist may portray in terms of whether or not each is present. Thus, if a "large distance between the nose and mouth" is a feature the artist chooses to emphasize, the proposed representation is able to capture this without being impacted by exactly how much the artist extrapolates this distance from the norm [5].

A caricaturist can be likened to a "filter" that only retains the information in a face that is useful for identification. As a filter, the artist uses his talent to analyze a face, eliminate insignificant facial features, and capture the identity through exaggeration of the prominent features. Most caricaturists start with a description of the general shape of the head. They assemble the eyes, nose, eyebrows, lips, chin, and ears, with some exaggerations, in geometrically correct locations (always maintaining the appropriate ratios amongst them); finally, they include the hair, moustache, and beard (depending on the gender or their presence in the face).

In this study, following the caricaturists' methodology [117], we define a set of 25 qualitative facial features that are classified into two levels (see Figures 6.3 and 6.4). The first level (Level 1) is defined for the general shapes and sizes of the facial components and the second level (Level 2) is defined for the size and appearance of facial components, as well as ratios amongst the locations of different components (e.g., distance of the mouth from the nose).

Figure 6.3: Illustration of features numbered one through twelve in the set of twenty-five qualitative features used to represent both caricatures and photographs. The similarity between sketches and photographs was measured within this representation. (The illustrated features are: Face Length, Face Shape, Hairstyle 1, Hairstyle 2, Beard, Mustache, Eye Separation, Nose to Eye Distance, Nose to Mouth Distance, Mouth to Chin Distance, Mouth Width, and Nose Width; images omitted.)

Figure 6.4: Illustration of the features numbered thirteen through twenty-four in the set of twenty-five qualitative features used to represent both caricatures and photographs. The similarity between sketches and photographs was measured within this representation. (The illustrated features are: Nose (Up or Down), Forehead Size, Thick Eyebrows, Eyebrows (Up or Down), Eyebrows Connected, Eyebrow Shape, Eye Color, Sleepy Eyes, Almond Eyes, Slanted Eyes, Sharp Eyes, Baggy Eyes, and Cheeks; images omitted.)

6.4.1 Level 1 Qualitative Features

Level 1 features describe the general appearance of the face. These features can be more quickly discerned than Level 2 features. In standard face recognition tasks, Level 1 features are less informative than Level 2 features [56] due to their lack of persistence and uniqueness. However, in the caricature recognition experiments these features are shown to be the most informative (see Section 6.7).

The length of the face is captured by the Face Length feature (narrow or elongated). The shape of the face is described by the Face Shape feature (boxy, round, or triangular). Two different features are used to capture the hair style, with values including: short bangs, parted left, parted right, parted middle, bald, nearly bald, thin middle, and curly.
Facial hair is represented with the Beard feature (none, normal, Abraham Lincoln, thin, thick, and goatee) and the Mustache feature (normal, none, thin, and thick). See Figure 6.4 for visual examples of these features.

6.4.2 Level 2 Features

Specific facial details are captured by the Level 2 facial features. Level 2 facial features offer more precise descriptions of specific facial components (such as the eyes, nose, etc.) compared to Level 1 features. Several binary features are used to represent the appearance of the eyes. These include whether or not the eyes are dark, sleepy, "almond" shaped, slanted, sharp, or baggy. Similarly, the eyebrows are represented by their thickness, connectedness, direction (up or down), and general shape (normal, rounded, or pointed). The nose is represented by its width (normal, thin, or wide) and direction (normal, points up, or points down). The mouth is characterized by its width (normal, thin, or wide). The cheeks are described as being either normal, thin, fat, or baggy.

Several features are used to capture the geometric relationships among the facial components. They describe the distance between the nose and the eyes, the nose and the mouth, and the mouth and the chin. Two additional features describe the distance between the eyes, and the length of the forehead.

6.4.3 Feature Labeling

Each image (caricature and photo) was labeled with qualitative features by annotators provided through Amazon's Mechanical Turk³. Several annotators combined to label the entire set of image pairs with each of the 25 qualitative features. Each annotator was asked to label a single image with a single feature value at a time. Thus, the annotator was shown an image of either a caricature or a photograph, and each of the possible feature values (along with their verbal descriptions) for the current feature being labeled. To compensate for differences in annotator opinions on less obvious image/feature combinations, each image was labeled three times by three different annotators. Thus, given 25 qualitative features and three labelers per feature, a total of 75 feature labels were available per image. In all, 29,400 labeling tasks were performed through this crowdsourcing method (costing roughly $300 USD).

³ https://www.mturk.com/

Figure 6.5: Overview of the caricature recognition algorithm. (The caricature and photograph are each encoded as qualitative feature histograms, a difference vector is formed, and a fusion of classifiers (logistic regression, MKL, SVM) produces a similarity score; diagram omitted.)

6.5 Matching Qualitative Features

With each image labeled with 25 qualitative attributes u times (u = 3, see Sec. 6.4.3), each image (photo or caricature) can be represented by a u × 25 matrix $C \in \mathbb{Z}_{+}^{u \times 25}$. Note that the matrix elements are nonnegative integers since each feature is of categorical type. In order to improve the matching performance, we adopt machine learning techniques for feature subset selection and weighting. To facilitate this, we encode the categorical attributes into binary features by using $r_i$ bits for each attribute, where $r_i$ is the number of possible choices for the $i$th attribute. For example, $r_i = 2$ for "Thick Eyebrows" and $r_i = 3$ for "Nose Width" (see Figure 6.4). Ideally, the binary valued feature vector should lead to a vector with only one nonzero element per feature. However, the annotators may give contradicting annotations (e.g., one subject can be labeled as having a "Wide Nose" by one annotator and a "Normal Nose" by another). Hence, we accumulate the binary valued feature vectors into histogram feature vectors. Thus, a single feature is no longer represented by an $r_i$-bit binary number, but instead by an $r_i$-dimensional feature vector. Each component has a minimum value of 0 and a maximum value of u. Finally, for each image, we concatenate the 25 individual attribute histograms to get a 77-dimensional feature vector ($x \in \mathbb{Z}_{+}^{77}$, $\|x\|_1 = 25u$).
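The encoding just described can be sketched as follows; the label matrix and the list of per-attribute sizes r_i (which sum to 77) are passed in, and the use of 0-based label indices is an assumption of the sketch.

```python
import numpy as np

def encode_qualitative(labels, attribute_sizes):
    """labels: (u x 25) matrix of annotator choices (0-based) for one image.
    attribute_sizes: the r_i values, one per attribute, summing to 77.
    Returns the concatenated histogram vector x with ||x||_1 = 25 * u."""
    histograms = [np.bincount(labels[:, i], minlength=r_i).astype(float)
                  for i, r_i in enumerate(attribute_sizes)]
    return np.concatenate(histograms)

def difference_vector(caricature_x, photo_x):
    """Absolute difference vector fed to the binary classifiers of Section 6.5."""
    return np.abs(caricature_x - photo_x)
```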
Hence, we accumulate the binary valued feature vectors into histogram feature vectors. Thus, a single feature is no longer represented by an $r_i$-bit binary number, but instead by an $r_i$-dimensional feature vector. Each component has a minimum value of 0 and a maximum value of u. Finally, for each image, we concatenate the 25 individual attribute histograms to get a 77-dimensional feature vector ($x \in \mathbb{Z}_+^{77}$, $\|x\|_1 = 25u$). Given this representation, the simplest method for matching is to perform nearest neighbor search with Euclidean distance (referred to as NN_L2).

Next, we convert the caricature-photo matching problem into a binary classification task by calculating the absolute difference vector for every possible caricature-photo pair in the training set. In this binary classification setting, the difference vector for a caricature and photo pair of the same subject (i.e., a true match) is labeled as '1', whereas the difference vector for a caricature-photo pair of two different subjects (i.e., a false match) is labeled as '-1'. This gives us n positive samples (genuine matches) and $n^2 - n$ negative samples (impostor matches), where n is the number of subjects in the training set. With the caricature recognition problem reformulated as a binary classification task, we leverage a fusion of several binary classifiers. Let $\{(x_i, y_i),\ x_i \in \mathbb{R}^d,\ y_i \in \{-1, 1\},\ i = 1, 2, \ldots, m\}$ be the m pairs of difference vectors and labels, where d = 77. Again, if $x_i$ is a difference vector between a caricature and photograph of the same subject then $y_i = 1$, otherwise $y_i = -1$.

6.5.1 Logistic Regression

Logistic regression seeks to find a function that maps the difference vectors to their numerical label (+1 or -1). The output of this regression can be interpreted as a similarity score, which facilitates fusion and receiver operating characteristic (ROC) analysis. The objective function of the logistic regression is

$\min_{\beta} \sum_{i=1}^{m} \left\{ -y_i x_i \beta + \log(1 + \exp(x_i \beta)) \right\} + \lambda R(\beta)$,   (6.1)

where $\beta$ is the vector of feature weights to be learned, $R(\beta)$ is a regularization term (to avoid overfitting and impose structural constraints), and $\lambda$ is a coefficient that controls the contribution of the regularizer to the cost. Two different regularizers are commonly used: (i) the $L_1$-norm regularizer, $R(\beta) = \|\beta\|_1$ (also known as the Lasso [141]), which imposes sparseness on the solution by driving most of the coefficients to zero for large values of $\lambda$, and (ii) the $L_2$-norm regularizer, $R(\beta) = \|\beta\|_2$, which leads to non-sparse solutions.

Method                                            TAR at FAR=10.0%   TAR at FAR=1.0%
Qualitative Features (no learning):
  NN_L2                                           39.2 ± 5.4
Qualitative Features (learning):
  Logistic Regression                             50.3 ± 2.4         11.3 ± 2.9
  MKL                                             39.5 ± 3.2          7.4 ± 3.9
  NN_MKL                                          46.6 ± 3.9         10.3 ± 3.6
  SVM                                             52.6 ± 5.0         12.1 ± 2.8
  Logistic Regression+NN_MKL+SVM                  56.9 ± 3.0         15.5 ± 4.6
Image Descriptors (learning):
  LBP with LDA                                    33.4 ± 3.9         11.5 ± 2.5
Qualitative Features + Image Descriptors:
  Logistic Regression+NN_MKL+SVM+LBP with LDA     61.9 ± 4.5         22.7 ± 3.5

Table 6.1: Average verification accuracies of the proposed qualitative, image feature-based, and baseline methods. Shown are the true accept rates (TAR) at fixed false accept rates (FAR) of 1.0% and 10.0%. Average accuracies and standard deviations were measured over 10 random splits of 134 training subjects and 62 testing subjects (subjects in training and test sets are different).
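To make the difference-vector formulation and the regularized logistic regression of Eq. (6.1) concrete, the following minimal sketch builds the genuine and impostor difference vectors and fits an L2-regularized logistic regression. The use of scikit-learn and the random stand-in data are assumptions made for illustration; the thesis itself relied on the implementation of [49].

```python
# Illustrative sketch; numpy and scikit-learn are assumed, and the data are
# random stand-ins for the 77-dimensional qualitative-feature histograms.
import numpy as np
from sklearn.linear_model import LogisticRegression

def difference_vectors(caricatures, photos):
    """Absolute difference vectors |c_i - p_j| for all caricature-photo pairs.

    Row k of each array is assumed to belong to subject k, giving n genuine
    pairs (label +1) and n^2 - n impostor pairs (label -1).
    """
    n = caricatures.shape[0]
    X, y = [], []
    for i in range(n):
        for j in range(n):
            X.append(np.abs(caricatures[i] - photos[j]))
            y.append(1 if i == j else -1)
    return np.asarray(X), np.asarray(y)

# Stand-in training data: 134 subjects, 77-dimensional histograms.
rng = np.random.default_rng(0)
car = rng.integers(0, 4, size=(134, 77)).astype(float)
pho = rng.integers(0, 4, size=(134, 77)).astype(float)
X, y = difference_vectors(car, pho)

# L2-regularized logistic regression; the signed decision value serves as a
# similarity score for ROC analysis and later fusion.
clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(X, y)
score = clf.decision_function(np.abs(car[0] - pho[0]).reshape(1, -1))
print(float(score[0]))
```

Note that the two classes are highly imbalanced (134 genuine versus roughly 17,800 impostor pairs); class weighting is one way this could be handled, but that detail is left aside here.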
Our experimental results with the implementation of [49] favored the $L_2$-norm regularizer, which we refer to in Section 6.7 as Logistic Regression. Having solved for $\beta$ using a gradient descent method, we compute the similarity value of the difference vector x between a caricature and photograph as $f(x) = x\beta - \log(1 + \exp(x\beta))$.

Method                                            Rank-1         Rank-10
Qualitative Features (no learning):
  NN_L2                                           12.1 ± 5.2     52.1 ± 7.1
Qualitative Features (learning):
  Logistic Regression                             17.7 ± 4.2     62.1 ± 3.8
  MKL                                             11.0 ± 3.9     50.5 ± 4.0
  NN_MKL                                          14.4 ± 2.9     59.5 ± 3.9
  SVM                                             20.8 ± 5.6     65.0 ± 3.8
  Logistic Regression+NN_MKL+SVM                  23.7 ± 3.5     70.5 ± 4.4
Image Descriptors (learning):
  LBP with LDA                                    15.5 ± 4.6     42.6 ± 4.6
Qualitative Features + Image Descriptors:
  Logistic Regression+NN_MKL+SVM+LBP with LDA     32.3 ± 5.1     74.8 ± 3.4

Table 6.2: Average identification accuracies of the proposed qualitative, image feature-based, and baseline methods. Average accuracies and standard deviations were measured over 10 random splits of 134 training subjects and 62 testing subjects (subjects in training and test sets are different).

6.5.2 Multiple Kernel Learning and SVM

One limitation of the logistic regression method is that it is restricted to finding linear dependencies between the features. In order to learn non-linear dependencies we use support vector machines (SVM) and multiple kernel learning (MKL) [8]. Given m training images, we let $\{K_j \in \mathbb{R}^{m \times m},\ j = 1, \ldots, 25\}$ represent the set of base kernels, $p = (p_1, \ldots, p_s) \in \mathbb{R}_+^s$ denote the coefficients used to combine these base kernels, and $K(p) = \sum_{j=1}^{s} p_j K_j$ be the combined kernel matrix. We learn the coefficient vector p by solving the convex-concave optimization of the MKL dual formulation [66]:

$\min_{p \in \Delta} \max_{\alpha \in Q} L(\alpha, p) = \mathbf{1}^{\top} \alpha - \frac{1}{2} (\alpha \circ y)^{\top} K(p) (\alpha \circ y)$,   (6.2)

where $\circ$ denotes the Hadamard (element-wise) product, $\mathbf{1}$ is a vector of all ones, and $Q = \{\alpha \in [0, C]^m\}$ is the domain of the dual variables $\alpha$. Note that this formulation can be considered as the dual formulation of the SVM for the combined kernel. One popular choice for the domain $\Delta$ is $\Delta_2 = \{p \in \mathbb{R}_+^s : \|p\|_2 \le 1\}$. Often the $L_1$ norm is used instead to generate a sparse solution; however, in our application, the small sample size impacted the accuracy of this approach. For MKL, each individual attribute is considered as a separate feature by constructing one kernel for each attribute (resulting in 25 base kernels). Our MKL classifier was trained using an off-the-shelf MKL tool [131]. Once this training is complete, we are able to measure the similarity of a caricature and photograph (represented as the difference vector x) by $f(x) = \sum_{i=1}^{n_s} \alpha_i y_i K_p(x_i, x)$, where $n_s$ is the number of support vectors and $K_p(\cdot)$ is the combined kernel. In Section 6.7, we refer to this method as MKL.

In addition to the MKL algorithm, we also use the standard SVM algorithm [19] by replacing the multiple kernel matrix $K_p(\cdot)$ with a single kernel $K(\cdot)$ that utilizes all feature components together. In Section 6.7, we refer to this approach as SVM. Both the MKL and SVM algorithms used RBF kernels (the kernel bandwidth was determined empirically).

Finally, we introduce a method known as the nearest neighbor MKL (NN_MKL). Because the vector p in Eq. (6.2) assigns a weight to each of the 25 qualitative features, we can explicitly use these weights to perform weighted nearest neighbor matching. Thus, the dissimilarity between a caricature and photograph is measured as the sum of weighted differences between each of the qualitative feature vectors.
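The NN_MKL matcher amounts to a weighted nearest-neighbor distance in which the learned MKL coefficients p weight the per-attribute differences. The sketch below illustrates this idea; the attribute sizes, weights, and vectors are hypothetical stand-ins rather than values from this work.

```python
# Illustrative sketch of a weighted per-attribute distance in the spirit of
# NN_MKL; all concrete numbers below are hypothetical.
import numpy as np

def attribute_slices(sizes):
    """Turn per-attribute sizes r_i into index slices over the stacked vector."""
    slices, start = [], 0
    for r in sizes:
        slices.append(slice(start, start + r))
        start += r
    return slices

def nn_mkl_dissimilarity(a, b, slices, p):
    """Sum of per-attribute differences between two images, weighted by p."""
    return sum(p[k] * np.linalg.norm(a[s] - b[s]) for k, s in enumerate(slices))

# Toy example with three attributes (r_i = 2, 3, 4) and stand-in MKL weights.
sizes = [2, 3, 4]
slices = attribute_slices(sizes)
rng = np.random.default_rng(0)
a, b = rng.random(sum(sizes)), rng.random(sum(sizes))
p = np.array([0.5, 1.0, 0.2])
print(nn_mkl_dissimilarity(a, b, slices, p))
```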
6.6 Image Descriptor-based Recognition

As discussed, encoding facial images with low-level feature descriptors such as local binary patterns [99] is challenging for matching caricatures to photographs due to the misalignments caused by the feature exaggeration in caricatures. However, since this approach has seen success in matching facial sketches to photographs [12, 54, 59], we also employ a similar technique for the caricature matching task.

The first step in the image descriptor-based algorithm is to align the face images using the two eye locations. These locations are marked manually due to the wide variations of pose in both the photographs and (especially) the caricatures. Using the centers of the eyes, we performed planar rotation to fix the face upright, scaled the image to 75 pixels between the eyes, and cropped the image to a height of 250 pixels and a width of 200 pixels. For both caricatures and photographs, we densely sampled local binary pattern histograms from image patches of 32 by 32 pixels. Next, all of the LBP histograms computed from a single image are concatenated into a single feature vector. Finally, we performed feature-based random subspace analysis [52] by randomly sampling the feature space b times. For each of the b subspaces, linear discriminant analysis (LDA) is performed to extract discriminative feature subspaces [10]. In Section 6.7 we will refer to this method as LBP with LDA.

6.7 Experimental Results

The 196 pairs of caricatures and photographs (see Section 6.3) were randomly split such that 134 pairs (roughly two-thirds) were made available for training and 62 pairs (roughly one-third) were available for testing. These sets were non-overlapping (i.e., no subject used in training was used for testing). We partitioned the data into training and testing sets 10 different times, resulting in 10 different matching experiments. The results shown in this section are the mean and standard deviation of the matching accuracies from those 10 random partitions. The precise splits used for these experiments are included with the release of the caricature image dataset.

The performance of each matching algorithm was measured using both the cumulative match characteristic (CMC) and the receiver operating characteristic (ROC) curves. For the CMC scores, we list the Rank-1 and Rank-10 accuracies. With 62 subjects available for testing, the gallery size was 62 images (photographs), and the scores listed are the average rank retrieval when querying the gallery with the 62 corresponding caricatures. The ROC analysis is listed as the true accept rate (TAR) at fixed false accept rates (FAR) of 1.0% and 10.0%.

Table 6.1 and Table 6.2 list the verification and identification accuracies (respectively) for each of the recognition algorithms discussed in this work. Even without learning, the qualitative features (NN_L2) still had a higher accuracy than the image descriptor-based method (LBP with LDA). Thus, while image descriptor-based methods work well in matching veridical sketches to photographs [59], the misalignments caused by the exaggerations in the caricatures challenge this method. At a false accept rate of 10.0%, several of the proposed learning methods (Logistic Regression, NN_MKL, and SVM) are able to improve the accuracy of the qualitative features by around 10%. Despite the inability of the MKL method to improve the matching accuracy, using the weights from MKL with the nearest neighbor matcher (NN_MKL) improves the matching accuracy.
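Stepping back to the image descriptor-based pipeline of Section 6.6, the following sketch outlines the dense LBP histogram extraction and random-subspace LDA steps. The libraries (scikit-image, scikit-learn), the LBP parameters (8 neighbors, radius 1, uniform patterns), and the use of non-overlapping patches are assumptions made for brevity; only the 32 x 32 patch size and the 250 x 200 crop come from the text above.

```python
# Illustrative sketch of an "LBP with LDA"-style pipeline; parameter values
# other than the 32x32 patch size and 250x200 crop are assumptions.
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def lbp_descriptor(gray, patch=32, n_points=8, radius=1):
    """Concatenate LBP histograms from (non-overlapping) patches of a face crop."""
    codes = local_binary_pattern(gray, n_points, radius, method="uniform")
    n_bins = n_points + 2                       # number of uniform-pattern codes
    hists = []
    h, w = codes.shape
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            block = codes[y:y + patch, x:x + patch]
            hist, _ = np.histogram(block, bins=n_bins, range=(0, n_bins))
            hists.append(hist / hist.sum())
    return np.concatenate(hists)

def random_subspace_lda(X, y, b=10, frac=0.2, seed=0):
    """Train b LDA projections, each on a random subset of the feature dimensions."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    models = []
    for _ in range(b):
        idx = rng.choice(d, size=int(frac * d), replace=False)
        models.append((idx, LinearDiscriminantAnalysis().fit(X[:, idx], y)))
    return models

# Descriptor for one stand-in face crop aligned to 250 x 200 pixels.
rng = np.random.default_rng(0)
gray = (rng.random((250, 200)) * 255).astype(np.uint8)
print(lbp_descriptor(gray).shape)
```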
Because the classification algorithms used in this study output numerical values that indicate the similarity of a caricature image and a photograph, we are able to leverage fusion techniques to further improve the accuracy. Fusions of algorithms in Table 6.1 and Table 6.2 are denoted by a '+' symbol between algorithm names. This indicates the use of sum-of-score fusion with min-max score normalization [123]. Using only qualitative features, the matching accuracy (at FAR=10.0%) was improved to nearly 57% (using Logistic Regression+NN_MKL+SVM). While the image descriptor-based method performed poorly with respect to the qualitative features, it proved valuable when added to the fusion process: Logistic Regression+NN_MKL+SVM+LBP with LDA had an accuracy of 61.9%.

Using the estimated p vector in the multiple kernel learning (MKL) algorithm, we are able to interpret the relative importance of each of the qualitative features. Since each component of p corresponds to the weight assigned to one of the 25 qualitative features, we can loosely interpret this vector to understand which features provided the most discriminative information. Figure 6.6 lists the weights for each of the 25 facial features. Surprisingly, we see that the Level 1 qualitative features are more discriminative than the Level 2 facial features. While this is counter to a standard face recognition task [56], caricatures are different in nature from face images. We believe the relative importance of Level 1 facial features in this setting is akin to the information an artist filters from the face.

Feature Name               Weight     Feature Name               Weight
Hairstyle 1                2.86       Almond Eyes                0.21
Beard                      0.85       Nose (Up or Down)          0.21
Mustache                   0.81       Face Shape                 0.20
Hairstyle 2                0.70       Forehead Size              0.19
Eyebrows (Up or Down)      0.45       Eye Color                  0.18
Nose to Mouth Distance     0.43       Sleepy Eyes                0.14
Eye Separation             0.43       Sharp Eyes                 0.13
Nose Width                 0.42       Baggy Eyes                 0.12
Face Length                0.27       Nose to Eye Distance       0.12
Cheeks                     0.27       Thick Eyebrows             0.11
Mouth Width                0.26       Eyebrows Connected         0.10
Mouth to Chin Distance     0.23       Slanted Eyes               0.10
Eyebrow Shape              0.22

Figure 6.6: The multiple kernel learning (MKL) weights (p), scaled by 10, for each of the qualitative features. Higher weights indicate more informative features.

6.8 Summary

This chapter introduced a challenging new problem in heterogeneous face recognition: matching facial caricatures to photographs. Unlike the other heterogeneous face recognition scenarios encountered in this thesis, the development of a caricature recognition system does not have direct societal benefits. However, the indirect benefits of such research may be substantial. Given the human ability to ascertain identity from these extremely exaggerated sketches, designing common facial representations for both caricatures and photographs is akin to mimicking human facial representations. To facilitate research in caricature matching, we have released the dataset of 196 pairs of caricatures and photographs used in this study so that other researchers can study this problem. A major contribution of this research is the definition of a set of qualitative facial features for representing both caricatures and photographs. Given these representations, a suite of statistical learning algorithms was adopted to learn the most salient combinations of these features from a training set of caricature and photograph pairs.
Chapter 7

Summary and Conclusions

This thesis studied the problem of heterogeneous face recognition across both specific and broad applications. The primary contributions were realized by advancing facial feature representations to better handle heterogeneous face images, and by adapting feature extraction algorithms (i.e., statistical learning) to better leverage training data representative of a particular heterogeneous face matching task.

7.1 Contributions

In Chapter 2 we developed a framework for matching forensic sketches to facial photographs that offered the following contributions:

• Presented the first large-scale experiment on automated identification using operational forensic sketches.

• In encoding sketches and photographs with SIFT and LBP feature descriptors, we proposed the first feature-based approach to automated sketch recognition.

• Developed a recognition framework called local feature-based discriminant analysis, which demonstrated a substantial improvement in matching viewed sketches to photos over both previously published algorithms and a state-of-the-art commercial face recognition system.

• Applied race and gender filtering to further improve sketch recognition accuracy.

The prototype-based framework in Chapter 3 offered the following contributions:

• Presented a prototype-based representation for heterogeneous face images. This approach represents images from each modality as a vector of their similarities to a common set of prototypes.

• Improved the recognition accuracy of the prototype features by applying linear discriminant analysis on randomly sampled prototype features, resulting in a final framework called prototype random subspaces (P-RS).

• The P-RS framework provides a method for computing inter-modality similarities by using only intra-modality similarity measures, thus extending it (conceptually) to any heterogeneous face recognition scenario in which intra-modality similarities can be computed.

• Demonstrated the ability of the P-RS framework to perform face recognition using feature templates from alternate facial feature representations (e.g., matching LBP to SIFT).

The studies of facial aging presented in Chapter 4 offered the following contributions:

• Performed the largest facial aging study to date by using a dataset of 200,000 mug shot images from 64,000 subjects with time lapses of up to 17 years between images of the same subject.

• Demonstrated a degradation in face recognition performance from two leading commercial off-the-shelf (COTS) face recognition systems (FRS) as a function of the amount of time lapse that has occurred between two face images.

• Showed that training a face recognition system on a particular amount of time lapse resulted in the highest recognition accuracy on that time lapse.

• The previous finding suggests the notion of periodically updating face templates to reside in feature subspaces that are trained for the amount of time that has lapsed since the image was acquired.

The study on the impact of demographic distributions on face recognition performance, conducted on over 50,000 subjects and presented in Chapter 5, offered the following contributions:

• A demonstration that female, black, and younger cohorts are more difficult to recognize for six different face matchers (commercial, non-trainable, and trainable).

• That training a matcher exclusively on the images of a particular racial cohort will result in improved recognition accuracies on that cohort.
• The notion of dynamic face matcher selection is presented, where multiple face recognition systems, each trained on a different demographic cohort, are available for an operator to use in matching with the goal of improving face retrieval performance.

Finally, the study on caricature recognition in Chapter 6 offered the following contributions:

• The first study on matching caricatures to photographs was presented.

• Developed a set of qualitative facial features for representing both caricatures and photographs. These features resulted in the highest accuracy for matching caricatures to their photographs.

• Adapted classification algorithms to categorize the difference vector between a caricature and a photograph as being either a match or a non-match.

• Collected a database of caricature and photograph pairs that is freely available to other researchers.

7.2 Future Work

Though contributions were offered across a range of challenges in heterogeneous face recognition, the studies presented in this thesis also segue into many new research challenges, both within the scope of heterogeneous face recognition and beyond.

The studies on forensic sketch recognition developed a system that had near-perfect accuracy on accurate viewed sketches, while the accuracy on real-world forensic sketches was significantly lower. The chief similarity between these two types of data is that they are both hand-drawn sketches. Thus, we see that the true heterogeneous aspect of this problem (i.e., matching a sketch to a photo) is largely solved. However, the difference between viewed and forensic sketches is that forensic sketches rely on human memory. While many researchers are still seeking to improve the trivial problem of sketch recognition on viewed sketches, it is only through a more in-depth examination of forensic sketches that we can make the gains needed to close the gap between the two forms of sketches. Researchers must make greater use of previous findings in the cognitive science literature on human witness memory to help shape the next generation of sketch recognition algorithms.

The prototype-based framework for performing heterogeneous face recognition demonstrated that heterogeneous facial features could be used to match heterogeneous face images. While this was notably demonstrated on the difficult scenario of matching thermal images to photographs (a task with which humans struggle), more ambitious scenarios can now be considered. One such example is the matching of face depth maps acquired from a LIDAR (Light Detection and Ranging) sensor. LIDAR sensors have demonstrated the ability to acquire high-resolution images across large distances, making for an ideal heterogeneous face recognition scenario in intelligence and law enforcement applications. The prototype framework may also be extended to problems outside the scope of face recognition. For example, the prototype-based representation should be explored in the context of cross-lingual retrieval, as well as heterogeneous image retrieval challenges (such as querying a database of large images with much smaller thumbnail images).

Both the study on facial aging and the study on demographics demonstrated the tight coupling between face matcher accuracy and the dataset on which the matcher was trained. Though quite intuitive, the idea that the demographic distribution of a dataset could be controlled to generate several different versions of a face recognition system had not previously been explored.
While the experimental findings in the two chapters on aging and demographics indirectly demonstrated the benefit of these approaches (namely, template update over time and the use of dynamic face matcher selection), more explicit experiments should be conducted. While such follow-on studies are perhaps better suited for system-related research (as opposed to pattern recognition research), they are, nonetheless, important.

Perhaps the study on caricature recognition presented in this thesis offers the most avenues for future research. The release of the dataset will hopefully prompt other researchers to explore orthogonal ideas on how to solve the problem. The union of many different approaches to this problem should in turn yield a wide set of approaches that can then be extended to the standard face recognition problem. Within our approach of encoding caricatures and photographs using qualitative features, several new avenues for research exist. One of these is the application of the qualitative features to face matching and retrieval.

7.3 Conclusions

A thesis comprising a set of studies on heterogeneous face recognition has been presented. In each study, contributions are made to improve face recognition performance given heterogeneous forms of data. The results of these studies are improvements to specific problems that are of interest in law enforcement, defense, and intelligence applications.

Appendices

Appendix A

"R" Transform is a Special Case of the Eigen-transform

Given the matrix of probe features $K_p$ and gallery features $K_g$, Tang and Wang's eigen-transformation method [138] performs a transformation from the probe face modality to the gallery face modality by first performing the eigen-decomposition using the dominant eigenvector method:

$(K_p K_p^{\top}) K_p V_p = K_p V_p \Lambda_p$   (A.1)

$(K_g K_g^{\top}) K_g V_g = K_g V_g \Lambda_g$   (A.2)

Here $K_p$ and $K_g$ are the matrices containing the kernel prototype similarities from the training instances, as described in Eqs. (3.3) and (3.4). Given a feature vector $\phi$ from the probe modality, the eigen-transformation method transforms (or synthesizes) $\phi$ to the vector $\phi'$ in the gallery modality by

$U_p = K_p V_p \Lambda_p^{-1/2}$   (A.3)

$b_p = U_p^{\top} \phi$   (A.4)

$c_p = V_p \Lambda_p^{-1/2} b_p$   (A.5)

$\phi' = K_g c_p$   (A.6)

We will now prove that Eq. (A.6) is equivalent to the R transform shown in Eq. (3.9), given the special case that $K_p$ and $K_g$ are square matrices (as is always the case in our framework). For terseness, $\phi(P)$ and $\phi'(P)$ (from Eq. (A.6)) are simply represented as $\phi$ and $\phi'$ (respectively). Expanding Eq. (A.6) we have

$\phi' = K_g c_p$   (A.7)
$\quad = K_g V_p \Lambda_p^{-1/2} b_p$   (A.8)
$\quad = K_g V_p \Lambda_p^{-1/2} (\Lambda_p^{-1/2})^{\top} V_p^{\top} K_p^{\top} \phi$   (A.9)
$\quad = K_g V_p \Lambda_p^{-1} V_p^{\top} K_p^{\top} \phi$   (A.10)

Now, turning to the R transform in Eq. (3.9), we transform $\phi$ to $\phi'$ as

$\phi' = R\phi$   (A.11)
$\quad = K_g (K_p^{\top} K_p)^{-1} K_p^{\top} \phi$   (A.12)

If Eq. (A.6) is to be equivalent to Eq. (3.9), then, from Eq. (A.10) and Eq. (A.12), it must be true that $V_p \Lambda_p^{-1} V_p^{\top} = (K_p^{\top} K_p)^{-1}$. By definition, we have

$K_p^{\top} K_p = V_p \Lambda_p V_p^{\top}$   (A.13)

Because $V_p$ is a square, orthonormal matrix (and, thus, $V_p^{\top} = V_p^{-1}$), we see that

$(K_p^{\top} K_p)^{-1} = (V_p \Lambda_p V_p^{\top})^{-1}$   (A.14)
$\quad = V_p \Lambda_p^{-1} V_p^{\top}$   (A.15)

Thus, given the special case that $K_p$ and $K_g$ are square matrices, the proposed "R" transform in Chapter 3 is in fact equivalent to Tang and Wang's eigen-transformation method [138].

Bibliography

[1] FaceVACS Software Developer Kit, Cognitec Systems GmbH, http://www.cognitec-systems.de.

[2] PittPatt Face Recognition SDK, Pittsburgh Pattern Recognition, http://www.pittpatt.com.

[3] T. Ahonen, A.
Hadid, and M. Pietikainen. Face description with local binary patterns: Application to face recognition. IEEE Trans. Pattern Analysis & Machine Intelligence, 28(12):2037–2041, Dec. 2006. [4] T. Akgul. Introducing the cartoonist, Tayfun Akgul. IEEE Antennas and Propagation Magazine, 49(3):162, 2007. [5] T. Akgul. Can an algorithm recognize montage portraits as human faces? IEEE Signal Processing Magazine, 28(1):160 –158, 2011. [6] E. Akleman. Making caricatures with morphing. In Proc. ACM SIGGRAPH, 1997. [7] E. Akleman. Modeling expressive 3d caricatures. In Proc. ACM SIGGRAPH, 2004. [8] F. R. Bach. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179–1225, 2008. [9] M.-F. Balcan, A. Blum, and S. Vempala. Kernels as features: On kernels, margins, and low-dimensional mappings. Machine Learning, 65:79–94, 2006. [10] P. Belhumeur, J. Hespanda, and D. Kriegman. Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Trans. Pattern Analysis & Machine Intelligence, 19(7):711–720, 1997. [11] A. Bertillon. The Bertillon System of Identification. Chicago, IL, 1896. [12] H. Bhatt, S. Bharadwaj, R. Singh, and M. Vatsa. On matching sketches with digital face images. In Proc. of IEEE Conference on Biometrics: Theory, Applications and Systems, pages 1 –7, 2010. [13] V. Blanz and T. Vetter. Face recognition based on fitting a 3d morphable model. IEEE Trans. Pattern Anal. Mach. Intell., 25(9):1063 – 1074, sept. 2003. 204 [14] R. K. Bothwell, J. Brigham, and R. Malpass. Cross-racial identication. Personality & Social Psychology Bulletin, 15:1925, 1989. [15] L. Breiman. Bagging predictors. Machine Learning, 24:123–140, 1996. [16] S. Brennan. Caricature generator: The dynamic exaggeration of faces by computer. Leonardo, 18:170–178, 1985. [17] V. Bruce, Z. Henderson, K. Greenwood, P. Hancock, A. Burton, and P. Miller. Verification of face identities from images captured on video. Journal of Experimental Psychology: Applied, 5(4):339–360, 1999. [18] V. Bruce and A. Young. Understanding face recognition. British Journal of Psychology, 77(3), 1986. [19] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Trans. Intelligent Systems and Technology, 2(3):27:1–27:27, 2011. [20] P. Chiroro and T. Valentine. An investigation of the contact hypothesis of the own-race bias in face recognition. Quarterly Journal of Experimental Psychology, A, Human Experimental Psychology, 48A:879894, 1995. [21] T. Cootes, C. Taylor, D. Cooper, and J. Graham. Active shape models-their training and application. Computer Vision and Image Understanding, 61(1):38 – 59, 1995. [22] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. IEEE Trans. Pattern Analysis and Machine Intelligence, 23(6):681–685, 2001. [23] N. R. Council. Strengthening Forensic Science in the United States: A Path Forward. National Academies Press, 2009. [24] R. Duda, P. Hart, and D. Stork. Pattern Classification. Wiley-Interscience, 2 edition, 2000. [25] G. Edmond, K. Biber, R. Kemp, and G. Porter. Laws looking glass: Expert identification evidence derived from photographic and video images. Current Issues in Criminal Justice, 20(3):337–377, 2009. [26] M. P. Evison and R. W. Vorder Bruegge, editors. Computer-aided Forensic Facial Comparison. CRC Press, 2010. [27] C. Frowd, V. Bruce, A. McIntyr, and P. Hancock. The relative importance of external and internal features of facial composites. British Journal of Psychology, 98(1):61–77, 2007. [28] N. 
Furl, P. J. Phillips, and A. J. O’Toole. Face recognition algorithms and the other-race effect: computational mechanisms for a developmental contact hypothesis. Cognitive Science, 26(6):797 – 815, 2002. 205 [29] X. Gao, J. Zhong, J. Li, and C. Tian. Face sketch synthesis algorithm based on e-hmm and selective ensemble. IEEE Transactions on Circuits and Systems for Video Technology, 18(4):487–496, April 2008. [30] A. Georghiades, P. Belhumeur, and D. Kriegman. From few to many: illumination cone models for face recognition under variable lighting and pose. IEEE Trans. Pattern Analysis & Machine Intelligence, 23(6):643 –660, jun 2001. [31] L. Gibson. Forensic Art Essentials. Elsevier, 2008. [32] M. A. Goodale and A. Milner. Separate visual pathways for perception and action. Trends in Neurosciences, 15(1):20–25, 1992. [33] R. Gross, I. Matthews, and S. Baker. Appearance-based face recognition and light-fields. IEEE Trans. Pattern Analysis & Machine Intelligence, 26(4):449 –465, 2004. [34] P. J. Grother, G. W. Quinn, and P. J. Phillips. MBE 2010: Report on the evaluation of 2d still-image face recognition algorithms. National Institute of Standards and Technology, NISTIR, 7709, 2010. [35] G. Guo and G. Mu. Human age estimation: What is the influence across race and gender? In Proc. of IEEE Conference on Computer Vision & Pattern Recognition, 2010. [36] B. Heisele, P. Ho, and T. Poggio. Face recognition with support vector machines: global versus component-based approach. In Proc. of Int. Conf. on Computer Vision, 2001. [37] T. K. Ho. The random subspace method for constructing decision forests. IEEE Trans. Pattern Analysis & Machine Intelligence, 20(8):832–844, Aug 1998. [38] R.-L. Hsu and A. Jain. Generating discriminating cartoon faces using interacting snakes. IEEE Trans. Pattern Analysis and Machine Intelligence, 25(11):1388 – 1398, 2003. [39] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. In Technical Report 07-49, University of Massachusetts, Amherst, 2007. [40] A. Jain, A. Ross, and S. Prabhakar. An introduction to biometric recognition. IEEE Trans. Circuits and Systems for Video Technology, 14(1):4–20, Jan. 2004. [41] A. K. Jain, Y. Chen, and M. Demirkus. Pores and ridges: High-resolution fingerprint matching using level 3 features. IEEE Trans. Pattern Analysis and Machine Intelligence, 29(1):15–27, 2007. [42] A. K. Jain, S. C. Dass, K. Nandakumar, and K. N. Soft biometric traits for personal recognition systems. In Proc. of International Conference on Biometric Authentication, 2004. 206 [43] A. K. Jain, B. Klare, and U. Park. Face matching and retrieval: Applications in forensics. IEEE Multimedia, 19(1):20–28, 2012. [44] A. K. Jain, B. F. Klare, and U. Park. Face recognition: Some challenges in forensics. In Proc. Int. Conference on Automatic Face and Gesture Recognition, 2011. [45] H. Y. Jie, H. Yu, and J. Yang. A direct LDA algorithm for high-dimensional data – with application to face recognition. Pattern Recognition, 34:2067–2070, 2001. [46] L. Juwei, K. Plataniotis, and A. Venetsanopoulos. Face recognition using kernel direct discriminant analysis algorithms. IEEE Trans. Neural Networks, 14(1):117 – 126, 2003. [47] N. Kanwisher, J. McDermott, and M. M. Chun. The fusiform face area: A module in human extrastriate cortex specialized for face perception. Journal of Neuroscience, 17(11):4302–4311, 1997. [48] K. I. Kim, K. Jung, and H. J. Kim. 
Face recognition using kernel principal component analysis. IEEE Signal Processing Letters, 9(2):40 –42, 2002. [49] Y. Kim, J. Kim, and Y. Kim. Blockwise sparse regression. Staistica Sinica, 16:375–390, 2006. [50] B. Klare. Spectrally sampled structural subspace features (4SF). In Michigan State University Technical Report, MSU-CSE-11-16, 2011. [51] B. Klare, S. S. Bucak, T. Akgul, , and A. K. Jain. Towards automated caricature recognition. In Proc. Int. Conference on Biometrics, 2012. [52] B. Klare and A. Jain. Heterogeneous face recognition: Matching NIR to visible light images. In Proc. International Conference on Pattern Recognition, 2010. [53] B. Klare and A. Jain. Sketch to photo matching: A feature-based approach. In Proc. SPIE Conference on Biometric Technology for Human Identification VII, 2010. [54] B. Klare and A. Jain. Heterogeneous face recognition using kernel prototype similarities. IEEE Trans. on Pattern Analysis and Machine Intelligence (under review), pages 639–646, 2011. [55] B. Klare and A. Jain. Matching forensic sketches and mug shots to apprehend criminals. IEEE Computer, 44(5):94–96, 2011. [56] B. Klare and A. K. Jain. On a taxonomy of facial features. In Proc. of IEEE Conference on Biometrics: Theory, Applications and Systems, 2010. [57] B. Klare and A. K. Jain. Face recognition across time lapse: On learning feature subspaces. In Int. Joint Conference on Biometrics, 2011. 207 [58] B. Klare, Z. Li, and A. Jain. On matching forensic sketches to mugshot photos. In MSU Technical Report, MSU-CSE-10-3, 2010. [59] B. Klare, Z. Li, and A. Jain. Matching forensic sketches to mugshot photos. IEEE Trans. on Pattern Analysis and Machine Intelligence, 33(3):639–646, 2011. [60] B. Klare, A. Paulino, and A. K. Jain. Analysis of facial features in identical twins. In Int. Joint Conference on Biometrics, 2011. [61] B. F. Klare and M. Burge. Assessment of H.264 video compression on automated face recognition performance in surveillance and mobile video scenarios. In Proc. of SPIE, Biometric Technology for Human Identification VII, 2010. [62] B. F. Klare, P. Mallapragada, A. Jain, , and K. Davis. Clustering face carvings: Exploring the devatas of angkor wat. In Proc. Int. Conference on Pattern Recognition, 2010. [63] B. F. Klare and S. Sarkar. Background subtraction in varying illuminations using an ensemble based on an enlarged feature set. In Proc. IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2009. [64] H. Koshimizu, M. Tominaga, T. Fujiwara, and K. Murakami. On kansei facial image processing for computerized facial caricaturing system picasso. In Proc. IEEE Conference on Systems, Man, and Cybernetics, 1999. [65] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar. Attribute and simile classifiers for face verification. In Proc. IEEE Int. Conference on Computer Vision, 2009. [66] G. Lanckriet, N. Cristianini, P. Bartlett, and L. E. Ghaoui. Learning the kernel matrix with semi-definite programming. Journal of Machine Learning Research, 5:27–72, 2004. [67] P.-H. Lee, G.-S. Hsu, and Y.-P. Hung. Face verification and identification using facial trait code. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pages 1613–1620, June 2009. [68] Z. Lei and S. Li. Coupled spectral regression for matching heterogeneous faces. In Proc. of IEEE Conference on Computer Vision & Pattern Recognition, pages 1123 –1128, june 2009. [69] D. A. Leopold, A. J. O’Toole, T. Vetter, and V. Blanz. 
Prototype-referenced shape encoding revealed by high-level aftereffects. Nature Neuroscience, 4:89– 94, 2001. [70] T. Lewiner, T. Vieira, D. Martinez, A. Peixoto, V. Mello, and L. Velho. Interactive 3d caricature from harmonic exaggeration. Computers and Graphics, 35(3):586–595, 2011. 208 [71] J. Li, P. Hao, C. Zhang, and M. Dou. Hallucinating faces from thermal infrared images. In Proc. Int. Conference on Image Processing, pages 465 –468, 2008. [72] S. Li, R. Chu, S. Liao, and L. Zhang. Illumination invariant face recognition using near-infrared images. IEEE Trans. on Pattern Analysis and Machine Intelligence,, 29(4):627 –639, 2007. [73] S. Z. Li and A. K. Jain, editors. Handbook of Face Recognition. Springer, 2nd edition, 2011. [74] Y. Li, M. Savvides, and V. Bhagavatula. Illumination tolerant face recognition using a novel face from sketch synthesis approach and advanced correlation filters. In Proc. IEEE Int’l Conf. on Acoustics, Speech and Signal Processing, 2006. [75] Z. Li, D. Lin, and X. Tang. Nonparametric discriminant analysis for face recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence,, 31(4):755 –761, 2009. [76] Z. Li, U. Park, and A. K. Jain. A discriminative model for age invariant face recognition. IEEE Trans. Information Forensics and Security, 6(3):1028–1037, 2011. [77] S. Liao, D. Yi, Z. Lei, R. Qin, and S. Li. Heterogeneous face recognition from local structures of normalized appearance. In Proc. Int. Conference on Biometrics, 2009. [78] D. Lin and X. Tang. Inter-modality face recognition. In Proc. of European Conference on Computer Vision, 2006. [79] H. Ling, S. Soatto, N. Ramanathan, and D. Jacobs. Face verification across age progression using discriminative methods. IEEE Trans. on Information Forensics and Security, 5(1):82 –91, 2010. [80] C. Liu and H. Wechsler. Gabor feature based classification using the enhanced fisher linear discriminant model for face recognition. IEEE Trans. on Image Processing, 11(4):467–476, 2002. [81] Q. Liu, H. Lu, and S. Ma. Improving kernel fisher discriminant analysis for face recognition. IEEE Trans. on Circuits and Systems for Video Technology, 14(1):42 – 49, 2004. [82] Q. Liu, X. Tang, H. Jin, H. Lu, and S. Ma. A nonlinear approach for face sketch synthesis and recognition. In Proc. of IEEE Conference on Computer Vision & Pattern Recognition, pages 1005–1010, 2005. [83] W. Liu, X. Tang, and J. Liu. Bayesian tensor inference for sketch-based facial photo hallucination. In Proc. of 20th International Joint Conference on Artificial Intelligence, 2007. 209 [84] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004. [85] J. Lu, K. Plataniotis, and A. Venetsanopoulos. Regularization studies of linear discriminant analysis in small sample size scenarios with application to face recognition. Pattern Recognition Letters, 26(2):181 – 191, 2005. [86] G. Mahalingam and C. Kambhamettu. Age invariant face recognition using graph matching. In Proc. of IEEE Conference on Biometrics: Theory, Applications and Systems, 2010. [87] D. Maltoni, D. Maio, A. Jain, and S. Prabhakar. Handbook of Fingerprint Recognition. pages 39–40. Springer, 2009. [88] X. Mao, A. W. Bigham, R. Mei, G. Gutierrez, K. M. Weiss, T. D. Brutsaert, F. Leon-Velarde, L. G. Moore, E. Vargas, P. M. McKeigue, M. D. Shriver, and E. J. Parra. A genomewide admixture mapping panel for hispanic/latino populations. The American Journal of Human Genetics, 80(6):1171 – 1178, 2007. [89] A. 
Martinez and R. Benavente. The AR face database. In CVC Technical Report 24, 1998. [90] R. Mauro and M. Kubovy. Caricature and face recognition. Memory & Cognition, 20(4):433–440, 1992. [91] G. McCarthy, A. Puce, J. C. Gore, and T. Allison. Face-specific processing in the human fusiform gyrus. Journal of Cognitive Neuroscience, 9(5):605–610, 1997. [92] K. Messer, J. Matas, J. Kittler, and K. Jonsson. XM2VTSDB: The extended M2VTS database. In Proc. of Audio and Video-based Biometric Person Authentication, 1999. [93] E. Meyers and L. Wolf. Using biologically inspired features for face processing. Int. Journal of Computer Vision, 76(1):93–104, 2008. [94] E. Meyers and L. Wolf. Using biologically inspired features for face processing. Int. Journal of Computer Vision, 76(1):93–104, 2008. [95] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. IEEE Trans. Pattern Analysis & Machine Intelligence, 27(10):1615–1630, Oct. 2005. [96] B. Moghaddam, T. Jebara, and A. Pentland. Bayesian face recognition. Pattern Recognition, 33(11):1771 – 1782, 2000. [97] W. Ng and R. C. Lindsay. Cross-race facial recognition: Failure of the contact hypothesis. Journal of Cross-Cultural Psychology, 25:217232, 1994. 210 [98] H. Nizami, J. Adkins-Hill, Y. Zhang, J. Sullins, C. McCullough, S. Canavan, and L. Yin. A biometric database with rotating head videos and hand-drawn face sketches. In Proc. of IEEE Conference on Biometrics: Theory, Applications and Systems, 2009. [99] T. Ojala, M. Pietik¨inen, and T. M¨enp¨a. Multiresolution gray-scale and a a a¨ rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Analysis & Machine Intelligence, 24(7):971–987, 2002. [100] A. O’Toole, P. J. Phillips, A. Narvekar, F. Jiang, and J. Ayyad. Face recognition algorithms and the other-race effect. Journal of Vision, 8(6), 2008. [101] S. Pankanti, S. Prabhakar, and A. K. Jain. On the individuality of fingerprints. IEEE Trans. Pattern Analysis and Machine Intelligence, 24:1010–1025, 2002. [102] U. Park and A. K. Jain. 3d model-based face recognition in video. In Proc. International Conference on Biometrics, 2007. [103] U. Park and A. K. Jain. Face matching and retrieval using soft biometrics. IEEE Trans. on Information Forensics and Security, 6(3):1028–1037, 2011. [104] U. Park, Y. Tong, and A. Jain. Age-invariant face recognition. IEEE Trans. Pattern Analysis and Machine Intelligence, 32(5):947–954, 2010. [105] J. Parris, M. Wilber, B. Helfin, H. Rara, A. E. barkouky Aly Farag, J. Movellan, M. Santana, J. Lorenzo, M. N. Teli, S. Marcel, and C. Atanasoaei. Face and eye detection on hard datasets. In Int. Joint Conference on Biometrics, 2011. [106] D. A. Patterson and J. L. Hennessy. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1990. [107] E. Patterson, A. Sethuram, M. Albert, K. Ricanek, and M. King. Aspects of age variation in facial morphology affecting biometrics. In Proc. of IEEE Conference on Biometrics: Theory, Applications and Systems, 2007. [108] P. Phillips, J. Beveridge, B. Draper, G. Givens, A. O’Toole, D. Bolme, J. Dunlop, Y. M. Lui, H. Sahibzada, and S. Weimer. An introduction to the good, the bad, & the ugly face recognition challenge problem. In Proc. of Automatic Face Gesture Recognition, 2011. [109] P. Phillips, H. Moon, P. Rauss, and S. Rizvi. The FERET evaluation methodology for face-recognition algorithms. In Proc. of IEEE Conference on Computer Vision & Pattern Recognition, 1997. [110] P. 
Phillips, W. Scruggs, A. O’Toole, P. Flynn, K. Bowyer, C. Schott, and M. Sharpe. FRVT 2006 and ICE 2006 large-scale experimental results. IEEE Trans. Pattern Analysis & Machine Intelligence, 32(5):831 –846, 2010. 211 [111] P. J. Phillips, P. J. Grother, R. J. Micheals, D. Blackburn, E. Tabassi, and J. M. Bone. Face recognition vendor test 2002: evaluation report. National Institute of Standards and Technology, NISTIR, 6965, 2003. [112] A. Quattoni, M. Collins, and T. Darrell. Transfer learning for image classification with sparse prototype representations. In Proc. of IEEE Conference on Computer Vision & Pattern Recognition, 2008. [113] N. Ramanathan and R. Chellappa. Face verification across age progression. IEEE Trans. on Image Processing, 15(11):3349 –3361, 2006. [114] N. Ramanathan, R. Chellappa, and S. Biswas. Computational methods for modeling facial aging: A survey. Journal of Visual Languages & Computing, 20(3):131 – 144, 2009. [115] S. Raudys and A. Jain. Small sample size effects in statistical pattern recognition: Recommendations for practitioners. IEEE Trans. Pattern Analysis & Machine Intelligence, 13(3):252 –264, 1991. [116] A. Rawls and K. Ricanek. Morph: Development and optimization of a longitudinal age progression database. In Biometric ID Management and Multimodal Communication, 2009. [117] L. Redman. How to draw caricatures. McGraw-Hill, 1984. [118] G. Rhodes, S. Brennan, and S. Carey. Identification and ratings of caricatures: Implications for mental representations of faces. Cognitive Psychology, 19(4):473–497, 1987. [119] H. T. F. Rhodes. Alphonse Bertillon, Father of Scientific Detection. AbelardSchuman, New York, 1956. [120] K. Ricanek, A. Sethuram, E. K. Patterson, A. M. Albert, and E. J. Boone. Craniofacial Aging. John Wiley & Sons, Inc., 2008. [121] K. Ricanek and T. Tesafaye. Morph: a longitudinal image database of normal adult age-progression. In Proc. of Automatic Face and Gesture Recognition, 2006. [122] M. Riesenhuber and T. Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11):10191025, 1999. [123] A. Ross and A. Jain. Information fusion in biometrics. Pattern Recognition Letters, 24(13):2115–2125, 2003. [124] T. Sakai, M. Nagao, and T. Kanade. Computer analysis and classification of photographs of human faces. In Proc. First USA-JAPAN Computer Conference, pages 55–62, 1972. 212 [125] R. E. Schapire. The strength of weak learnability. Machine Learning, 5:197–227, 1990. [126] L. G. Shapiro and G. C. Stockman. Computer Vision. Prentice Hall, 2001. [127] P. N. Shapiro and S. D. Penrod. Meta-analysis of face identication studies. Psychological Bulletin, 100:139156, 1986. [128] L. Shen and L. Bai. A review on gabor wavelets for face recognition. Pattern Analysis & Applications, 9:273–292, 2006. [129] R. Singh, M. Vatsa, A. Noore, and S. Singh. Age transformation for improving face recognition performance. In Pattern Recognition and Machine Intelligence, 2007. [130] R. L. Solso and J. E. McCarthy. Prototype formation of faces: A case of pseudo-memory. British Journal of Psychology, 72(4):499–503, 1981. [131] S. Sonnenburg, G. R¨tsch, C. Sch¨fer, and B. Sch¨lkopf. Large scale multiple a a o kernel learning. Journal of Machine Learning Research, 7:1531–1565, 2006. [132] N. Spaun. Facial comparisons by subject matter experts: Their role in biometrics and their training. In Int. Conference on Advances in Biometrics, 2009. [133] N. A. Spaun. Forensic biometrics from images and video at the federal bureau of investigation. In Int. 
Conference on Biometrics: Theory, Applications and Systems, page 13, 2007. [134] Z. Sun, A. Paulino, J. Feng, Z. Chai, T. Tan, and A. K. Jain. A study of multibiometric traits of identical twins. In Proc of SPIE, Biometric Technology for Human Identification VII, 2010. [135] J. Suo, S.-C. Zhu, S. Shan, and X. Chen. A compositional and dynamic model for face aging. IEEE Trans. Pattern Analysis & Machine Intelligence, 32(3):385 –401, 2010. [136] X. Tan and B. Triggs. Enhanced local texture feature sets for face recognition under difficult lighting conditions. IEEE Trans. on Image Processing, 19(6):1635 –1650, 2010. [137] X. Tang and X. Wang. Face sketch synthesis and recognition. In Proc. of IEEE International Conference on Computer Vision, pages 687–694, 2003. [138] X. Tang and X. Wang. Face sketch recognition. IEEE Trans. Circuits and Systems for Video Technology, 14(1):50–57, 2004. [139] K. Taylor. Forensic Art and Illustration. CRC Press, 2001. [140] P. Thompson. Margaret Thatcher: A new illusion. Perception, 9(4):483–484, 1980. 213 [141] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1994. [142] D. Y. Tsao, W. A. Freiwald, R. B. Tootell, and M. S. Livingstone. A cortical region consisting entirely of face-selective cells. Science, 311(5761):670–674, Feb 2006. [143] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71–86, 1991. [144] R. Uhl and N. Lobo. A framework for recognizing a facial image from a police sketch. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 1996. [145] T. Valentine and V. Bruce. The effects of distinctiveness in recognising and classifying faces. Perception, 15(5):525–535, 1986. [146] P. Viola and M. J. Jones. Robust real-time face detection. Int. Journal of Computer Vision, 57:137–154, 2004. [147] X. Wang and X. Tang. Dual-space linear discriminant analysis for face recognition. In Proc. of IEEE Conference on Computer Vision & Pattern Recognition, 2004. [148] X. Wang and X. Tang. Random sampling for subspace face recognition. Int. Journal of Computer Vision, 70(1):91–104, 2006. [149] X. Wang and X. Tang. Face photo-sketch synthesis and recognition. IEEE Trans. Pattern Analysis & Machine Intelligence, 31(11):1955–1967, Nov. 2009. [150] C. Williams and M. Seeger. Using the Nystrom method to speed up kernel machines. Advances in Neural Information Processing Systems, 15(13), 2001. [151] L. Wiskott, J.-M. Fellous, N. Kuiger, and C. von der Malsburg. Face recognition by elastic bunch graph matching. IEEE Trans. Pattern Analysis & Machine Intelligence, 19(7):775 –779, 1997. [152] B. Xiao, X. Gao, D. Tao, and X. Li. A new approach for face recognition by sketches in photos. Signal Processing, 89(8):1576 – 1588, 2009. [153] R. Yampolskiy, B. F. Klare, , and A. K. Jain. Face recognition in the virtual world: recognizing avatar faces. In Proc. of SPIE, Biometric Technology for Human Identification IX, 2012. [154] D. Yi, S. Liao, Z. Lei, J. Sang, and S. Li. Partial face matching between near infrared and visual images in mbgc portal challenge. In Proc. Int. Conference on Biometrics, pages 733–742. 2009. 214 [155] A. W. Young, D. Hay, K. H. McWeeny, B. M. Flude, and A. W. Ellis. Matching familiar and unfamiliar faces on internal and external features. Perception, 14:737–746, 1985. [156] P. Yuen and C. Man. Human face image searching system using sketches. IEEE Trans. Systems, Man and Cybernetics, 37(4):493–504, July 2007. 
[157] W. Zhang, X. Wang, and X. Tang. Lighting and pose robust face sketch synthesis. In Proc. European Conference on Computer Vision, 2010. [158] J. Zhong, X. Gao, and C. Tian. Face sketch synthesis using e-hmm and selective ensemble. In Proc. of IEEE Conference on Acoustics, Speech and Signal Processing, 2007. [159] G. Zhu, D. L. Duffy, A. Eldridge, M. Grace, C. Mayne, L. O'Gorman, J. F. Aitken, M. C. Neale, N. K. Hayward, A. C. Green, and N. G. Martin. A major quantitative-trait locus for mole density is linked to the familial melanoma gene CDKN2A: A maximum-likelihood combined linkage and association analysis in twins and their sibs. The American Journal of Human Genetics, 65(2):483–492, 1999.