This is to certify that the thesis entitled A NON-OBTRUSIVE HEAD MOUNTED FACE CAPTURE SYSTEM presented by CHANDAN REDDY has been accepted towards fulfillment of the requirements for the M.S. degree in Computer Science and Engineering, Michigan State University.

A NON-OBTRUSIVE HEAD MOUNTED FACE CAPTURE SYSTEM

By Chandan Reddy

A THESIS submitted to Michigan State University in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE, Department of Computer Science and Engineering, 2003.

ABSTRACT

A NON-OBTRUSIVE HEAD MOUNTED FACE CAPTURE SYSTEM

By Chandan Reddy

Capturing a face in a multi-user collaborative environment has been a problem of concern for several years. The problem becomes more severe in mobile applications, where the capturing system may occlude the user's field of view. In this thesis, a system is proposed that captures two side views of the face simultaneously and generates a frontal view. Applications of the facial capture system include teleconferencing, wearable computing, collaborative work and other mobile applications. The system is designed to produce in real-time a stable, quality frontal view of an HMD user whose face is captured with little obstruction of the field of view. The frontal views are generated by warping and blending the side views after a calibration step. In tests, the generated views compared well with real video based on both normalized cross correlation and the Euclidean distance between some of the prominent feature points. Preliminary qualitative assessment of these generated views also concludes that the generated video is adequate to support the intended applications. A 3D face model that can support the generation of arbitrary views was also constructed.

Copyright by CHANDAN REDDY 2003

To my parents

ACKNOWLEDGMENTS

I would like to express my sincere thanks to Dr. George Stockman, my main advisor, for his expert guidance and mentorship. I am grateful to my co-advisor, Dr. Frank Biocca, for his moral and financial support throughout my stay at MSU. I would like to express my sincere gratitude to Dr. Jannick Rolland, my external faculty member, for her discussions related to optics. I would also like to thank Dr. Charles Owen for providing some of the hardware components required for this project. Special thanks to Zena Biocca and Joy Mulvaney for their help during my stay in the MIND Lab. I would like to express my gratitude towards the graduate secretaries Kim Bassa and Linda Moore. Help from my colleagues played a major role in the success of this thesis. In this regard, I would like to thank the students of the MIND Lab, MET Lab and PRIP Lab at Michigan State University and the ODA Lab at the University of Central Florida for their help and valuable technical discussions. I would also like to thank my friends Prasanna, Raja, Badri and Shankar for their continuous support. Finally, I thank my family for being with me and supporting me all the time.
TABLE OF CONTENTS

LIST OF FIGURES viii
LIST OF TABLES x

1 Introduction 1
1.1 Motivation 1
1.2 Broad Objectives 3
1.3 Problem Definition 5
1.3.1 Face Capture System (FCS) 5
1.3.2 Virtual View Synthesis 8
1.3.3 Head Mounted Projection Display 8
1.3.4 3D Virtual and Tele-immersive Environments 10
1.3.5 Broadband Network Connections 10
1.4 Thesis Contributions 11
1.5 Organization of the Thesis 12

2 Relevant Background 13
2.1 Face Capture Systems 13
2.2 Virtual View Synthesis 15
2.3 Depth Extraction and 3D Face Modeling 16
2.4 Head Mounted Displays and Tele-immersion 17

3 System Overview 21
3.1 Equipment Required 21
3.1.1 Hardware 21
3.1.2 Software 22
3.1.3 Network Connections 22
3.2 Optics Design Issues 23
3.2.1 General System Layout 23
3.2.2 Specification Parameters 25
3.2.3 Estimation of the Variable Parameters Dmf and Dm 27
3.2.4 Customization of the Cameras and Mirrors 31
3.3 Experimental Prototype 31
3.3.1 Environment-static Camera Face Capture 32
3.3.2 Porting to a Head Mounted System 34

4 Methodology 36
4.1 Description 37
4.2 Off-line Calibration 38
4.2.1 Color Balancing 38
4.2.2 Calibration for Virtual Video Synthesis 39
4.2.3 Calibration for 3D Face Modeling 42
4.3 Virtual Video Synthesis 46
4.3.1 Face Warping 48
4.3.2 Face Mosaicking 48
4.3.3 Post-Processing 49
4.4 3D Face Model Construction 49
4.4.1 Stereo Computations 49
4.4.2 3D Model Generation 52
4.5 Implementation Details 54

5 Experimental Results 55
5.1 Calibration 55
5.1.1 Virtual Video Synthesis 55
5.1.2 Face Modeling 58
5.2 Virtual Video Synthesis 61
5.2.1 Face Warping 61
5.2.2 Virtual View Synthesis 62
5.2.3 Video Synchronization 63
5.3 3D Face Model Construction 66
5.3.1 Stereovision Computations 66
5.3.2 Customization of the 3D Face Model 67

6 Assessment of the Results 69
6.1 Evaluation Schemes 69
6.1.1 Objective Evaluation 70
6.1.2 Subjective Evaluation 75
6.2 Discussion of the Results 77
6.2.1 Time Taken 77
6.2.2 Positioning of Cameras and Mirrors 77
6.2.3 Depth of Field Issues 78

7 Conclusion 79
7.1 Summary 79
7.2 Future Work 80
APPENDICES 83
A Conversion of Spherical to Cartesian Coordinates 83
BIBLIOGRAPHY 84

LIST OF FIGURES

1.1 Face capture system with two convex mirrors and two lipstick cameras
1.2 A prototype of the Head Mounted Projection Display
1.3 Complete integrated system of the head mounted display with the face capture unit
2.1 A collaborative room with many cameras. (Left: National tele-immersion project. Right: Sea of cameras) 14
2.2 Head mounted face capture systems (Left: Facecap3D product from the Standard Deviation company. Right: Head mounted optical face tracker from Adaptive Optics) 15
3.1 General layout of the face capture system 23
3.2 Unfolded layout of the face capture system 24
3.3 Estimation of the variable parameters Dmf and Dm 27
3.4 Experimental bench prototype of the FCS 33
4.1 Top view of the face capture system 37
4.2 Demonstration of the behaviour of the grid pattern 40
4.3 Illustration of the bilinear interpolation technique 40
4.4 The off-line calibration stage during the synthesis of the virtual frontal view 42
4.5 The calibration sphere with labeled calibration points 43
4.6 Operational stage during the synthesis of the virtual frontal view 47
4.7 The closest approach method 51
4.8 Front view and side view of the 3D generic mesh model of the face 53
4.9 HCS with the origin (O) and three perpendicular axes (x, y and z) 53
5.1 A square grid with alternating three colors is projected onto the face. Each grid cell has a row-width of 24 pixels and a col-width of 18 pixels 57
5.2 Face images captured during the calibration stage using the environment-static FCS 57
5.3 The images that are captured from the left camera and the right camera during the camera calibration 58
5.4 Frontal view generation during the calibration stage and reconstruction of the frontal image from the side view using the grid: (a) left image captured during the calibration stage. (b) virtual image constructed using the transformation tables and the right image during the calibration stage. (c) right image captured during the operational stage. (d) result of the reconstructed frontal view from the transformation tables and the right image during the operational stage 62
5.5 Face images captured during the operational phase using the ESFCS 62
5.6 (a) Frontal view that is obtained from the camcorder and (b) virtual frontal view generated from our algorithm 63
5.7 Face images captured using the HMFCS: (a) left image and (b) right image 63
5.8 Virtual frontal view generated from the side views captured through the HMFCS 64
5.9 (a) Top row: images captured from the left camera. (b) Second row: images captured using the right camera. (c) Third row: images captured using a camcorder that is placed in front of the face. (d) Final row: virtual frontal views generated from the images in the first two rows 64
5.10 Synchronization of the eyeball movements: real video is in the top row and the virtual video is in the bottom row 65
5.11 Synchronization of the eyelids during blinking: real video is in the top row and the virtual video is in the bottom row 65
5.12 Frontal texture used for 3D face model construction 67
5.13 Different views rendered from the texture mapped 3D face model 68
6.1 Images considered for objective evaluation: (a) Top row: real video frames (b) Bottom row: virtual video frames 70
6.2 Facial regions compared using normalized cross-correlation (Left: real view and Right: virtual view) 71
6.3 Facial feature points and the distances that are considered for evaluation using the Euclidean distance measure (Left: real view. Right: virtual view) 73
7.1 Conclusion and future work. Solid blocks indicate implemented subsystems. Dashed block indicates the future subsystem 80
A.1 Spherical coordinate system 84

LIST OF TABLES

3.1 Estimated values of the variable parameters obtained by varying the f-number (Fc = 12 mm). All dimensions are in mm 30
3.2 Estimated values of the variable parameters obtained by varying the f-number (Fc = 4 mm). All dimensions are in mm 30
5.1 Results of left camera calibration for points on the calibration sphere 60
5.2 Results of projector calibration for points on the calibration sphere 61
5.3 Depth estimation from the left camera and projector for points on the calibration sphere. 3D coordinate dimensions are in inches 66
6.1 Results of normalized cross-correlation between the real and the virtual frontal views. The normalized cross-correlation is applied in various regions of the face, concentrating more on the eye and mouth regions 72
6.2 Euclidean distance measurement of the prominent facial distances in the real image and the virtual image, and the defined average error. All dimensions are in pixels 74

Chapter 1

Introduction

The overall aim of this work is to design an augmented reality based face-to-face teleconferencing system. The main advantage of such a system is the ability to produce stable video-based images of all the remote participants, whose faces are captured without obstructing the users' field of view. Such a system can be used in the fields of augmented reality, wearable computing, and other mobile applications. This thesis advances the work in face-to-face telecommunication by creating a non-obtrusive real-time Face Capture System (FCS).

1.1 Motivation

One key motivation for the creation of advanced collaborative environments is the increase in computer-mediated communications. Recent concerns over terrorism, highway gridlock and delays at airports have dramatically increased the demand for social presence technologies. Consider this evidence of how social disruptions, such as terrorism, increased demand for telepresence technologies. Due to the fear of flying after the Sept. 11, 2001 attack:

- The use of collaborative servers increased 300% immediately [Ham01].
- According to CNN's report, a "National Business Travel Association [poll] showed that 88% of companies planned to increase use of videoconferencing" [Lin01].
- American suppliers reported an increase of 140% in videoconferencing bookings [Bor01].
- Based on a report from British Telecomm, there was an increase of 85% in video conferencing and 30% in audio conferencing worldwide.
- A poll by Osterman Research found that 60% of business organizations had greater interest in teleconferencing and 41% reported drops in air travel [Ost01].

Although we may have face-to-face interactions with workmates or others, many of our social interactions include an increasing number of purely virtual interactions with others whom we rarely or never meet face-to-face. When it comes to communication from remote places, the human face is one of the most important representative parts of the human body because it has great expressive ability that can provide clues to the personality and emotions of a person. The facial expressions of a remote collaborator or a mobile user can convey a sense of urgency, lack of understanding or confidence in action, and other non-verbal elements of communication. According to Gary Faigin, "There is no landscape that we know as well as the human face. The twenty-five-odd square inches containing the features is the most intimately scrutinized piece of territory in existence, examined constantly, and carefully, with far more than an intellectual interest" [Fai90]. With the increase in telecommunication systems, research is being directed toward more advanced capture technologies. Even the most advanced teleconferencing and telepresence systems transmit frames of video, and these frames are nothing but 2D images. In order to get additional views, such systems use either a panoramic system and/or interpolation between a set of views [CW93, SD96]. 3D teleconferencing systems are still in the research and development stage. Efforts are being made in this direction at Michigan State University, in collaboration with the University of Central Florida, to develop a tele-collaborative environment called the "Teleportal System".

1.2 Broad Objectives

The overall goal of the teleportal system is to allow multiple users to enter a room-sized display and use a broadband telecommunication link to engage in face-to-face interaction with other remote users in a 3D augmented reality environment. The most fundamental challenge of capture and display technologies is creating a compelling sense of interacting with spaces and people that are not directly present physically in our proximity. Advanced media technologies are designed to give a strong sense of telepresence [OYTY98], defined as the sense of "being there" in a remote location, or social presence [WBL+96], the sense of "being with others" who are not in the same room or place as the user. The teleportal system will provide remote communication between at least two users. A facial capture system and a projection display provide high quality video of remote participants' faces to the user. The teleportal system will provide a channel for the transmission of non-verbal cues through an unobtrusive capture of the face. The remote presentation module will display quality frontal views of the remote user in an augmented reality based environment. A principal feature of this teleportal system is that single or multiple users at a local site and a remote site use a telecommunication link to engage in face-to-face interaction with other users in a 3D augmented reality environment. Each user utilizes a system that includes a display such as a Head Mounted Projection Display (HMPD) and a facial expression video capture system.
The video capture system allows the participants to view an image of the face of all remote participants and hear their voices, view the local participants, and view a room that blends physical with virtual objects with which users can interact. The HMPD projects high quality graphics in real-time towards a screen that is covered by a fine-grain retro-reflective fabric. The HMPD and video capture system do not occlude vision of the physical environment in which the user is located. This system allows users to see both virtual and physical objects, so that the objects appear to occupy the same space.

1.3 Problem Definition

In the preferred embodiment of the teleportal system, multiple local and remote users can interact in a room-sized space draped in a fine-grained retro-reflective fabric. The Teleportal Face-to-Face System allows individuals to see 3D stereoscopic images of remote participants in an augmented reality environment. The teleportal system will accomplish this through the following key components or sub-systems:

- Face capture system
- Virtual view synthesis
- Head Mounted Projection Display
- 3D virtual and tele-immersive environments
- High bandwidth network connections

1.3.1 Face Capture System (FCS)

Capturing a face in a multi-user collaborative environment has been a topic of research for several years. The problem becomes more severe in mobile applications, where the capturing system must be designed in such a way that the user's field of view is not occluded. The FCS is responsible for obtaining and transmitting a quality frontal face video of a remote user involved in the communication. The FCS proposed here captures the two side views of the face simultaneously and generates the frontal view. The face capture equipment consists of two miniature video cameras and convex mirrors [BR00]. A flexible bracket attached to the ear rests of the system wraps over the top of the user's head. Two mirrors are attached to the cap assembly and point down towards the user's face. The two cameras are placed near the ears and they image the side views of the face through the convex mirrors.

Figure 1.1: Face capture system with two convex mirrors and two lipstick cameras

Figure 1.1 illustrates the face-capture cameras and the mirrors with respect to the user's head. Each of the cameras is pointed towards the respective convex mirror, which is angled to reflect an image of one side of the face. The convex mirrors produce a slight distortion of the side view of the face. The left and right video cameras capture the corresponding side views of the human face in real-time. Optics issues concerned with the mirrors and camera lenses are studied in Section 3.2. The side views captured from the cameras will introduce some additional distortion to the images that must be removed in the virtual frontal video.

Advantages

In contrast with conventional capturing techniques, where either the capture system is static with respect to the environment or the capture system is huge and consumes enormous horsepower, our system is static with respect to the user's head movements and works on any basic processor. The primary goal of the FCS is to increase the telepresence of two remote collaborators. This thesis focuses on providing quality face video in real-time for a human who is in a modestly equipped environment. Since the proposed face capture system is light and portable, it can be used more effectively in mobile applications.
It is a simple and user-friendly system that captures the human face in real-time in a mobile environment without obstructing the field of view of the user. A long-term goal of such a system is to also achieve a video see-through view by flipping the mirror and changing the focus of the camera lenses.

Applications

Telephones and teleconferencing systems are widely used communication facilities. High quality video of participants' faces will significantly improve collaborative communication. The current and projected applications of our system will improve the means of communication between two people located at remote places. The main application areas include multi-user collaborative work, mobile environments and teleconferencing. This system can also yield benefits in other areas such as biometrics and e-business. Using the system, a remote expert can view a surgical procedure while observing the surgeon actually performing the procedure, allowing the expert to assess not only the performance, but also how well instructions are being comprehended. For mobile communications, even the latest cell phone technology forces the user to hold a small web camera in front of his face, showing his face to the camera at all times. This is an undue burden on the user and is not an effective way of capturing, because it can fail when the user turns around suddenly or fails to place the camera in front of him. Our proposed system will be more effective in such an application. This system can also be used effectively in medical and military applications. In these kinds of applications, it is common to need a system that can capture the person's face as well as the area the person is viewing. If we flip the mirror, the camera can view the area that the person is viewing; so the FCS solves the problem of changing the camera positions every time to view the area of concern. The cameras move with the head motion, and we can view the face again by simply flipping the mirrors back.

1.3.2 Virtual View Synthesis

This thesis focuses mainly on the vision-based algorithms that are applied to construct the frontal view from the two side views captured by the cameras through the mirrors, and on developing techniques for camera calibration to extract information. A 3D head model of the face is constructed so that the face may be rendered from many frontal viewpoints.

1.3.3 Head Mounted Projection Display

Most of the advanced 3D environments are based on either a "CAVE" based technology [CNSD+92] or head-mounted displays [BC94]. A new type of head-mounted display, which combines the advantages of both technologies, has been designed by Rolland et al. [HGGR00]. The HMPD consists of a pair of miniature projection lenses and displays mounted on the helmet, and retro-reflective sheeting materials placed strategically in the environment. The novel properties of this technology suggest solutions to some of the problems of state-of-the-art visualization devices and make it suitable for multi-user collaborative environments.

Figure 1.2: A prototype of the Head Mounted Projection Display

Figure 1.3: Complete integrated system of the head mounted display with the face capture unit

1.3.4 3D Virtual and Tele-immersive Environments

One essential aspect of the teleportal system is to design a tele-immersive environment that can accommodate spatially matched volumetric datasets.
These datasets are transmitted between two (or more) remote sites and are projected by a Head Mounted Projection Display in an augmented reality environment. The projection display system requires a retro-reflective surface to bounce the projected light directly back to the user's eyes. Spatially distributed volumetric datasets such as medical, engineering, and scientific data can be shared across different sites. The optical properties of the retro-reflective material reflect the light back to its source with little diffusion, whatever computer-generated image is being projected onto it. The configuration of the display surfaces is the most flexible part of the system. (A cylindrical volumetric display has been implemented and tested to project and track a medical dataset of a virtual skeleton. A tabletop display has also been tested to project an architectural model of the Beaumont Tower, a landmark of the MSU campus. Several other displays are being conceptualized for other volumetric datasets.) Display surfaces can be created using various forms of the retro-reflective material such as wallpaper forms, cloth-like forms, etc. Hence, surfaces of all types can be prepared as display surfaces for the system.

1.3.5 Broadband Network Connections

The Internet2 test bed has been implemented and tested using MPEG-2 video streams between the MIND Lab at Michigan State University and the ODA Lab at the University of Central Florida. The test bed has been formed to test the reliability and performance of sharing stereoscopic video signals. These Internet2 [web] connections are capable of transmitting full broadcast quality video streams using MPEG-2 video encoding and decoding technology between remote collaborative sites. This Internet2 test bed aims to support real-time remote collaboration by allowing the sharing of bi-directional video streams capable of carrying enormous amounts of data that can be used for medical visualization, teleconferencing, or other applications that make use of large bandwidth data transmission. Two teleportal rooms are being developed, one at Michigan State University and the other at the University of Central Florida. These will be linked using an Internet2 connection.

1.4 Thesis Contributions

The scientific contributions of this thesis include the following:

1. Establishment of the basic hardware required for the FCS.
2. Estimation of camera-mirror parameters for the optimal FCS configuration.
3. Algorithms for generating a quality frontal video from two side videos.
4. 3D face model construction based on two side views.

The overall framework of the teleportal project is to design innovative, advanced 3D interfaces among people sharing a common space as well as people located far apart. The combination of quality facial capture, communications, and graphical display technologies allows for full interaction between remote and local users. The application where this new technology can be used most effectively is a face-to-face distributed collaboration system. There is always an acute need for more effective collaboration and more effective knowledge sharing systems among geographically scattered people. This thesis forms a stepping-stone for the creation of a complete 3D augmented reality based face-to-face communication system that can produce stereoscopic views of the users via a real-time augmented reality display.
Even though the overall framework of the proposed system is to support distributed collaborative work, the major focus of this thesis is on capturing the human face and providing quality face views in real-time for a human who is in an unobtrusively instrumented environment.

1.5 Organization of the Thesis

The remainder of this thesis is organized as follows. Chapter 2 describes the relevant background for the FCS and the display environments. Chapter 3 describes the system analysis: the experimental setup is detailed and the optics issues that are involved in the configuration of the face capture system are discussed. The algorithms and methods used in the FCS are explained in Chapter 4. Chapter 5 illustrates some results of using the prototype developed. Assessment of the generated frontal videos and a discussion of some issues regarding the portability of the system are given in Chapter 6. Finally, Chapter 7 presents conclusions of the work with the FCS and suggests ideas for future work.

Chapter 2

Relevant Background

This chapter describes the related work that has been done in the subparts of the FCS. Section 2.1 describes the existing face capture systems and explains the different ways of capturing a human face using state-of-the-art devices. Background for synthesizing novel views without reconstructing the complete 3D model is discussed in Section 2.2. Section 2.3 describes some of the traditional methods for depth extraction and face modeling. Section 2.4 describes the relevant work in the areas related to HMDs and tele-immersive environments.

2.1 Face Capture Systems

Extensive research work has been done in the areas of face modeling [PW96, DMS98, GGW+98] and face recognition [BP93]. However, capturing a human face in real-time has been a topic of less concern when compared to other areas of face analysis. Face capture can be done in a collaborative environment using a sea of cameras in a highly equipped environment [FBA+94]. Recently, the National Tele-immersion project used many cameras that are statically placed in the environment to develop the 3D head model of a user in a collaborative scenario (see Figure 2.1). Sometimes even such a highly calibrated environment might fail to produce quality frontal views, mainly because of the orientation of the head.

Figure 2.1: A collaborative room with many cameras. (Left: National tele-immersion project. Right: Sea of cameras)

In a mobile scenario, it will be quite difficult to capture the human face, even though some solutions exist (see Figure 2.2). The user wears a head mounted setup and hence the capturing device is assumed to be static with respect to the head movements. The main problem that arises in using the head mounted face capture systems shown in Figure 2.2 is that they obstruct the field of view of the user who is wearing the head mounted set. These devices are used mostly for character animation and are not used for day-to-day tele-collaboration scenarios. In this thesis, the design and construction of "a non-obtrusive head mounted face capture system" is described.

Figure 2.2: Head mounted face capture systems (Left: Facecap3D product from the Standard Deviation company. Right: Head mounted optical face tracker from Adaptive Optics)

2.2 Virtual View Synthesis

Novel views can be synthesized either by a panoramic system [ZT92, Sei01] and/or by interpolating between a set of views [CW93, SD96]. Using image-based rendering techniques, one need not explicitly derive the 3D information of an object [GGSC96].
These techniques concentrate on rendering novel views without actually reconstructing the 3D structure. In recent years, techniques from computer graphics and computer vision have been combined to produce interesting results [Len98]. Producing novel views in a dynamic scenario was successfully shown for highly rigid motion [MD99]. These techniques extend the interpolation techniques to the temporal domain rather than restricting them strictly to the spatial domain. A novel view at a new time instant was generated by interpolating views at nearby time intervals using spatio-temporal view interpolation [VBK02], where a dynamic 3D scene is modeled and novel views are generated at intermediate time intervals. In this thesis, we study, develop, and test techniques for capturing the human face and reconstructing novel views in a real-time video sequence.

2.3 Depth Extraction and 3D Face Modeling

Structured light is a commonly used technique in computer vision [DT96, PGD98]. Grid patterns have been used successfully in the past for reconstruction and pose estimation of 3D surfaces [HS89, SS89]. Color coding of a structured pattern was first described in [BK87]. Structure from motion [TK92] estimates the structure of a 3D object from an image sequence. The binocular stereo method [BF82, OK93] is the most commonly used method for estimating the 3D coordinates of points on an object; the depth is determined using two images taken by two cameras from different angles. For depth estimation, even though a silicon range finder measures 3D coordinates quickly, it requires expensive hardware for scanning purposes [Sat94]. The shape from shading method [HB86] and the photometric stereo method [Woo80] can be used for measuring the normal vectors of objects. These measurements are based on photometric properties. Shape from shading techniques use a single camera, and two or more images are taken of the face in a fixed position but under different lighting conditions. Even though these methods can provide accurate 3D information about complicated shapes, they are not recommended for our application because the accuracy of the measured 3D data is highly influenced by the various photometric properties. In our case, the traditional stereo method between the two cameras will not work because of the occlusion of the facial features. A light projection method has also been used for face shape estimation [SN97] with a stereo vision algorithm. In this technique, multiple light stripes are projected onto the face using a slit pattern projector [Jar83, RK75]. A 3D individualized head model can be constructed from two orthogonal views [IY96]. A three-dimensional face model has also been created from a video sequence of face images [LC01]. Realistic expressions have been synthesized from photographs [PHL+98]. Various vision techniques have been used effectively for visualizing the 3D world in a collaborative application [XLH02].

2.4 Head Mounted Displays and Tele-immersion

The main aim of 3D visualization devices is to visualize computer-generated objects and make them appear as real objects. Interest in 3D visualization devices has endured and permeated various virtual and augmented reality domains. The first attempt to create a virtual reality (VR) environment was the production of an augmented reality (AR) navigational aid for helicopter pilots [Sut65].
Since the first head-mounted display (HMD) originated by Ivan Sutherland in the 1960s [Sut65], the 3D visualization devices most commonly used in virtual and augmented reality domains have evolved into three typical formats: standard monitors accompanied by shutter glasses, head-mounted displays (HMDs), and projection-based displays. HMDs provide a fine balance of affordability and unique capabilities, such as spanning the continuum proposed by [MK94] from reality, via mixed reality, to immersive environments, creating mobile displays [Fei02] and enabling teleportal capability with face-to-face interaction [BR00]. There are two major categories of HMDs: immersive and see-through [BC94]. Immersive HMDs present the user with a view that is under full control of the computer at the expense of the physical view. These systems require a virtual representation of the user's hand to manipulate the virtual world, and avatars of collaborative team members in multi-user environments [BF98]. See-through HMDs superimpose virtual objects on the existing scene to enhance rather than replace the real scene. Video and optical fusion are the two basic approaches to combining real and virtual images. The main trade-offs include the resolution, the field of view (FOV), the presence of large distortion for wide FOV designs, the inaccurate eye point representation, the conflict of accommodation and convergence, and the occlusion contradiction between virtual and real objects.

The VIDEOPLACE system used vision algorithms to track users within the environment to generate visual and auditory responses [KGH85]. The most popularly used 3D teleconferencing system is the CAVE system. It uses multiple screens arranged in a room configuration to display virtual information. It is often implemented as a cube of approximately 12 feet on each side [CNSD+92], and it uses four CRT projection systems and crystal shutter glasses. As a viewer moves within its display boundaries, the correct perspective and stereo projections of the environment are updated, and the image moves with and surrounds the viewer. In CAVE systems there is only one correct viewpoint; all other local users have a distorted perspective on the virtual scene. Scenes in the CAVE are only projected onto a wall, so two local users can view a scene on the wall, but an object cannot be presented in the space between the users. These systems also use multiple rear-screen projectors, and therefore are bulky and expensive.

An alternative to a cave is to create an immersive virtual environment through a head-mounted display (HMD). While conventional types of head-mounted displays employ eyepiece optics to create the virtual images [RF00], an emerging technology known as the head-mounted projection display (HMPD) has fairly recently been demonstrated to yield 3D visualization capability [KO97, PR98]. The concept of head-mounted projection displays (HMPDs) was initially patented by Fisher in 1996 [Fis96] and was proposed as an alternative to remote displays, head-mounted displays and stereo projection systems for 3D visualization applications. Potentially, the HMPD concept provides solutions to some of the problems existing in state-of-the-art visualization devices. The FCS will be integrated with the HMPD to provide a tele-immersive environment. A thorough discussion of traditional video mediated communication is given in [FSVV97].
A number of teleconferencing technologies support collaborative virtual environments that allow interaction between individuals in local and remote sites. For example, video-teleconferencing systems use simple video screens and wide screen displays to allow interaction between individuals in local and remote sites. However, wide screen displays are disadvantageous because virtual 3D Objects presented on the screen are not blended into the environment of the room of the users. A mixed reality computer supported collaborative environments, which enable transitions along the virtuality continuum was first illustrated in the Magic Book [BKPOl]. The Magic Book provides an experience where users have the capability and the incentive to travel from real to fully immersive environments within the same application. A 19 mean to travel along the virtuality continuum is also discussed in [DRHL+03]. The most fundamental challenge of face capture and display technology is to create a compelling sense of interacting with environment and remote people who are not directly present physically in our proximity. Quality presentation of the remote users and an effective user interface is critical for all media communications. The combination of quality facial capture, high bandwidth network communication channel, and graphical display technology will provide a complete neat interaction between local and remote participants thus emphasizing the concept Of “people need not physically be there”. 20 Chapter 3 System Overview This chapter describes the equipment required by the FCS designed and details the design issues concerned with the cameras and mirrors that are used. Section 3.3 explains the experimental setup and the procedure followed during the face capture. 3.1 Equipment required The F CS requires some hardware, software and network connections. The network connections are required to transmit the videos to remote locations. 3.1.1 Hardware Our prototype uses an Intel Pentium III processor running at 746 MHz with 384 MB RAM. It is installed with two Matrox Meteor 11 standard cards 1. These cards are 1Matrox Meteor 11 Multi Channel can capture multiple video streams simultaneously in real—time. However, this can be done only using monochrome videos. Hence, two Matrox Meteor II standard cards were chosen for implementing FCS. 21 connected to the control units of the lipstick cameras through a general cable. The camera that is used is a Sony DXC LSl NTSC camera with 12 mm focal length lenses. We use Matrox Meteor II Standard that supports both multiple composite and s-video inputs. The video is digitized by a Matrox Meteor II standard capture card, yielding interlaced 320 X 240 video fields at 60 Hz. During the off-line calibration stage, the system also used an Infocus LP350 projector to project a grid onto the user’s face. A calibration sphere is used in the process of extracting the depth information. Voice is recorded in the same system using a microphone. 3. 1 .2 Software The API required to do the programming for controlling this hardware is MIL-LITE 7.02. The standard Windows based sound recording software is used to record the voice of the user during the conversation. The sound file is appended to the .avi file using Adobe Premiere 6.0, which is a popularly used video editing software. Using this hardware and software, two videos are captured Simultaneously at the rate of 30 frames per second. 
(Footnote: MIL is another API provided by Matrox. The main difference between MIL and MIL-LITE is that MIL is used for high-level programming. In our application, we need to read each frame and apply our own algorithms. This is treated as low-level programming, in which we need to have control of the data at each frame. Hence, MIL-LITE 7.0 is appropriate for our application.)

3.1.3 Network Connections

The Internet2 testbed has been implemented and tested, encoding MPEG-2 video streams at 3 Mbps and decoding video streams at 4 Mbps, between the MIND Lab at Michigan State University and the ODA Lab at the University of Central Florida.

3.2 Optics Design Issues

3.2.1 General System Layout

The general layout of the system is shown in Figure 3.1.

Figure 3.1: General layout of the face capture system

The calculations for estimating the variable parameters are simplified by unfolding the overall system (see Figure 3.2). When the system is unfolded, the mirror can be represented as a negative lens. The face is imaged through the mirror, and the camera is actually focusing on the face through the mirror. We shall now consider various parameters for each of the components that are involved in the system. The components of this system are (a) the human face, (b) the camera and (c) the mirror.

Figure 3.2: Unfolded layout of the face capture system

(a) Human face: The main parameters of the face that will affect the geometry of the system are the height and the width of the face. Even though some other factors such as skin color and illumination might affect the performance of the system, they will not have any effect on the geometry of the system.

(b) Camera: The camera, made up of a sensor and a focusing lens, requires careful study. The various parameters that significantly affect the geometry of the system are the sensing area, the field of view, the pixel dimensions, the focal length, the f-number or lens diameter, the minimum working distance, and the depth of field.

(c) Mirror: This is the most flexible component of the system. Hence, all the parameters of this component are estimated and the component is manufactured based on the estimated values of the parameters. The parameters of this component are the focal length, the f-number or mirror diameter, the radius of curvature, and the magnification factor.

3.2.2 Specification Parameters

The various parameters that are involved in the calculations are as follows.

(a) Human face: Without loss of generality, let us assume that we can take the dimensions of an average face for further calculations.
- H - Height of the head to be captured (approximately 250 mm)
- W - Width of the head to be captured (approximately 175 mm)

(b) Camera: There are several parameters that significantly affect the geometry of the system. For this application, the main parameters to be taken into consideration are the miniature size, the light weight, the minimum working distance, and the field of view. Based on the approximate values of these parameters, we obtained the off-the-shelf lipstick camera, the Sony DXC-LS1. For this camera, two lenses are available, one with a 4 mm and the other with a 12 mm focal length. Because of its wide field of view (45° x 33°), the 4 mm lens is not well suited for our application when compared to the 12 mm lens. The parameters of the camera with the 12 mm lens are the following:
- Sensing area: The sensing area is 1/4", or equivalently 3.2 mm (y) x 2.4 mm (x).
- Pixel dimensions: According to the specifications, the image sensed has a resolution of 768 x 494. However, when this image is captured using a Matrox Meteor II Standard frame grabbing card, the image is digitized into 320 x 240. For any further evaluation of other parameters (e.g. depth of field) the resolution of the image is considered to be 320 x 240. Even though higher resolution images (640 x 480) can be captured, restrictions of the RAM size force us to capture low resolution images.
- Focal length (Fc): The focal length of the lens that was selected is 12 mm (VCL-12UVM).
- Field of view (FOV): The field of view of the camera with the above mentioned lens is 15.2° x 11.4°.
- Diameter (Dc): The diameter of the lens and the camera is 12 mm.
- f-number (Nc): The f-number of this camera lens is 1. While in practice we will adjust the iris to obtain optimum illumination of the face given the external room illumination, we shall consider an f-number of 1 in the estimation of the other parameters.
- Minimum working distance (MWD): The minimum working distance of the selected lens is 200 mm.
- Depth of field (DOF): This parameter depends on all the above mentioned parameter values. It helps in making the system portable: if the system has a large depth of field, then it will be more portable and can accommodate many users without much change in the position and focus of the cameras.

(c) Mirror: This part of the system can be customized. The parameters of the mirror that will affect the geometry of the system are the
- Diameter (Dm) / f-number (Nm)
- Focal length (Fm), or equivalently the radius of curvature (Rm)
- Magnification factor (Mm)

(d) Distances: There are basically two distances that can be adjusted within a visible range. From Figure 3.1, we can approximate the following distances:
- Dcm - Distance between the camera and the mirror (approximately 150 mm)
- Dmf - Distance between the mirror and the face (approximately 200 mm)

3.2.3 Estimation of the Variable Parameters Dmf and Dm

Figure 3.3: Estimation of the variable parameters Dmf and Dm

From the theory of pupils and windows [Coo95], the camera is the limiting aperture from the intermediary image plane located behind the mirror. Hence, the camera acts as the pupil of the system and the mirror is the window. In the unfolded configuration, the mirror is represented as a negative lens with image focal length $f'_m$ equal in magnitude to that of the mirror but with opposite sign. The imaging equation [Mk/197] for the lens equivalent to the mirror yields

$$\frac{1}{x'} = \frac{1}{D_{mf}} + \frac{1}{f'_m} \qquad (3.1)$$

where $x'$ is negative because the values $D_{mf}$ and $f'_m$ are negative. Hence, the image in the unfolded case is virtual and thus always lies between the lens and the human face. In the case of the mirror, the image will be optically located behind the mirror. Let the FOV of the lens be $\theta_y \times \theta_x$. To maximize the FOV of the capture, we shall image the height (H) onto $\theta_y$ and the width (W) onto $\theta_x$. Let $H'$ be the maximum size of the intermediary image formed through the mirror.
H’ is given by H’ = 2 x tan(6y/2) x MWD Also, the magnification factor of the mirror is given by HI NImirr : _ or H and, 28 .r.’ : AImir-ror Dmf (3.2) Substituting Equation 3.2 in Equation 3.1, we get, D — ( 1 1) * f’ mf — Almirror m (3.3) Also, from the definition of f-number, f,’,, = Nm x Dm Based on the similar triangles ABC and AEF Shown in Figure 3.3, we can write Dm/2 — DC/2 _ H’/2 — 00/2 MWD + :r’ ‘ MWD where, Dm must be written as a function of N. The FOV of the lens is approximately 15° x 11.5°. Hence Mm"... 252.6 / 250 = 0.21 Taking MWD equal to 200mm and Dc equal to 12mm, Dm can be written in terms of N as shown below 26.3 x 2 Dm = (1+ 0.16 x N) Table 3.1 presents a summary of estimated values for Dm as a function of the f-number. A similar computation for a 4mm focal length lens, which was our other Off-the- shelf option for the camera considered, is summarized in Table 3.2. 29 Table 3.1: number(Fc =12mm). All dimensions are in mm. Estimated values Of the variable parameters Obtained by varying the f- 9 f- Diameter of Distance from X Focal Length Radius of Cur- numbcr(N,n) the mirror Mirror to Face of the Mirror vature of the (D...) (Dmi) (Fm) Mirror (R...) 45.34 -170 -35.80 -45.34 -90.69 1.5 42.42 -238.6 -50.23 —63.63 -127.26 2 39.85 -298.9 -62.92 -79.70 -l59.39 2.5 37.57 -352.2 -74.15 -93.93 -187.86 3 35.54 -399.8 -84.17 -106.62 -213.24 3.5 33.72 -442.5 -93.17 -118.01 -236.03 4 32.07 -481.1 -101.28 -128.29 -256.59 4.5 30.58 -516.1 -108.64 -137.62 -275.23 5 29.22 -547.9 -115.35 -146.11 -292.22 5.5 27.98 -577.1 -121.49 -153.88 -307.77 6 26.84 -603.8 -127.12 -161.02 -322.04 6.5 25.78 -628.5 -132.31 -167.60 -335.20 7 24.81 -651.3 —137.12 -173.68 -347.36 Table 3.2: Estimated values of the variable parameters obtained by varying the f- number(Fc 24mm). All dimensions are in mm. f. Diameter of Distance from X, Focal Length Radius of Cur- number-(Nm) the mirror Mirror to Face of the Mirror vature of the (Dm) (Dmf) (Fm) Mirror (Rm) 1 29.59 -l39.1 -24.40 -29.59 -59.19 1.5 25.47 -179.5 -31.50 -38.20 -76.40 2 22.35 ~210.1 -36.85 -44.69 —89.39 2.5 19.91 -233.9 -41.04 -49.77 -99.55 3 17.95 -253.1 -44.40 -53.85 -107.70 3.5 16.34 -268.8 -47.17 -57.20 -114.40 4 15.00 -282 -49.47 -60.00 -120.00 4.5 13.86 ~293.2 ~5l.43 —62.37 -124.75 5 12.88 -302.7 -53.11 -64.41 -128.82 30 Based on the practical values that are optimal for the size of the mirror(Dm) and the distances (Dm f and Dem), the third row (corresponding to f-number = 2) in Table 3.1 represents the most suitable values for the parameters. The parameters of the mirror are customized using these values. 3.2.4 Customization of the Cameras and Mirrors The mirrors were manufactured according to the specification table for the 12 mm focal length camera, and available off-the—shelf components. A convex mirror of radius of curvature 155.04 mm was selected, corresponding to a f-number of 2. The convex side of the mirror was coated for the visible light spectrum. 3.3 Experimental Prototype Even though the overall goal of the FCS is to achieve quality frontal views from cameras and mirrors placed on a headset, as a start, we have simplified the problem into environment based face capture. The general problem is to generate a virtual frontal view from two side views. Hence, in our further discussions, we will have two kinds of Face Capture Systems 0 Environment Static Face Capture System (ESFCS) 0 Head Mounted Face Capture System (HMF CS) We used a structured light approach to synthesize novel frontal views. 
The basic idea is to project a structured pattern onto a human face and capture the corre- 31 sponding grid points on the face from the side views. Based on the distortions of the grid pattern on the face, transform functions are generated to reconstruct the virtual frontal view. In order to generate the 3D model, a stereo method is applied. According to the design of our system, there is not. much overlap of the face region in the two side views. Hence, it is not possible to have stereo between the two cameras. Hence, stereo computation is made between a projector (which is used to project the grid) and each of the cameras. We shall now discuss the procedure of the overall experiment. The experimental procedure that is followed during the face capture using envi- ronment static cameras is discussed in Section 3.3.1. Section 3.3.2 discusses the issues regarding the design of the HMFCS. 3.3.1 Environment-static Camera Face Capture The experimental bench shown in Figure 3.4 is designed to maintain the same virtual geometry between the subject’s head and the cameras / projector. This prototype fixes FCS relative to the user’s head but is not a head mounted set. Hence, it has the basic requirement that the user should not move his head. Once, the algorithms are tested thoroughly, we will apply them to a head mounted FCS. Experimental Procedure 1. The user is asked to sit in a chair. The chair is adjusted according to the convenience of the user and workspace of the cameras. 2. The cameras are focused on the face of the user. 32 h ‘ W.- __....M . a. . m ' W; ~17an . N... ,. .-:-: .--.-- .- .53.. _ Figure 3.4: Experimental bench prototype of the FCS 3. A microphone is used to record the user’s voice. 4. The projector is switched on and a grid is projected onto the face. The grid is projected in such a way that a vertical line passes through the center of the face and bisects the face into two halves. 5. After these initial setup procedures, a “start” command is issued and the cameras start recording the user’s face. 7. After 1-2 seconds, the projector is switched off and the grid is no longer projected onto the face. 8. The user is asked to say “The quick brown fox jumped over the lazy dog” continuously for 10 seconds. 9. For the evaluation of the results, a camcorder records the entire user’s face during 33 the experiment. 10. The user’s head must be as static as possible with only facial movements and without any head motion. 3.3.2 Porting to a Head Mounted System In the HMFCS, delicate mirrors off to the side of the user’s face reflect the side face image to a camera near each ear. However, the overall problem to reconstruct the novel frontal View from the side view remains the same except that the two side view images are captured via the mirrors. Placing the cameras and mirrors in appropriate positions is crucial to obtain quality results. Since, this prototype is still in development, it is important for us to have flexibility in the design of the system. Placement of the mirrors and the cameras must be flexible enough to capture quality side views. Flexible design can be achieved by translation and rotation of the cameras and the mirrors. Basically, this design helps us to estimate effective positions of the cameras and mirrors. Ideally, one would like to have a system that can be used by a wide range of human head sizes. Placement of these cameras and mirrors will significantly affect the quality of the results that are obtained. 
Most importantly, the features of the face must be viewed completely through the mirrors and should be in focus. The experimental procedure followed to capture the face remains similar to the procedure discussed in the previous section. The main advantage is the freedom of head motion for the user wearing the HMFCS. In both cases, the calibration of the projector and the cameras has to be done only for one frame in the video sequence. Details of the calibration procedure are discussed in Section 4.2.2.

Chapter 4

Methodology

This chapter explains the methods that are used during calibration and generation of the virtual frontal views and the 3D face model. Section 4.1 describes the overall system and introduces the notations and the coordinate systems that are used in the rest of the chapter. The steps followed during the calibration procedure are discussed in Section 4.2. Synthesizing virtual videos primarily consists of two phases, namely (1) the calibration phase (discussed in Section 4.2.2) and (2) the operational phase (discussed in Section 4.3). Similarly, the steps required during the calibration for the 3D face model are discussed in Section 4.2.3, and Section 4.4 describes the various steps followed during the generation of a texture mapped 3D face model. Various implementation issues, including the details about the video capture and processing, are discussed in Section 4.5.

4.1 Description

We shall now discuss the notations that are followed in our algorithms.

Figure 4.1: Top view of the face capture system (solid lines: real views; dashed lines: imaginary view)

There are basically five coordinate systems involved in our system:
1. World Coordinate System (WCS)
2. Head Coordinate System (HCS)
3. Left Camera Coordinate System (LCS)
4. Right Camera Coordinate System (RCS)
5. Projector Coordinate System (PCS)

The WCS and HCS are 3D coordinate systems and the rest are 2D coordinate systems. For generating the 2D virtual frontal video, the mapping is done between the 2D coordinate systems; in this case, the 3D coordinate systems are not used. However, to construct the 3D face model, the mapping is done with respect to the WCS and HCS, as shown in later sections. The origin and the coordinate axes of the HCS are shown in Figure 4.9. The coordinate axes of the WCS are aligned in the same manner as those of the HCS. The origin of the HCS is defined to be the center of the calibration sphere shown in Figure 4.5. From Figure 4.1, we can see that there are two real images (side views) and a virtual image V that has to be generated. The coordinates in these images are described as follows:

V[x, y] - virtual image with x, y coordinates (defined in the PCS)
IL[s, t] - left image with s, t coordinates (defined in the LCS)
IR[u, v] - right image with u, v coordinates (defined in the RCS)

4.2 Off-line Calibration

Section 4.2.1 deals with the color balancing technique that is used in our application. The calibration procedure for synthesizing the virtual video is described in Section 4.2.2. Section 4.2.3 discusses the calibration procedures required for the generation of a texture mapped 3D face model.

4.2.1 Color Balancing

Before calibrating the cameras and the projector geometrically, one has to make sure that the cameras are color balanced. Even though several software based approaches to color balancing could be taken, the color balancing in our work is done at the hardware level. Before the cameras are used for calibration, they are balanced using the white balancing technique.
A single white paper is shown to both cameras and the cameras are white balanced instantly. This hardware solution is more reliable than a software-based solution. It has the advantage over a software-based approach that it is much faster, it can handle varying lighting conditions more effectively, nothing is assumed about the color distribution, and it gives more natural colors. A software-based approach might reveal that the virtual frontal view has been synthesized when it is used under varying lighting conditions, because such an approach typically has to store some prior knowledge about the color of the images; when the lighting changes, the color balancing is not handled as effectively [YLW98]. The main reasons for a change in the skin color are

- variation in the lighting conditions
- change in the input video camera
- change in the white balance of the camera

In our case the video cameras remain the same throughout the process and the white balance of the cameras is not changed at all. Hence, the only variation is in the lighting conditions. In such a case, a predefined skin color model would become problematic; hence, a hardware-based solution is more reliable than a software-based solution.

4.2.2 Calibration for Virtual Video Synthesis

During the calibration phase, the transformation tables are generated using the grid pattern coordinates. To get the transformation, a rectangular grid is projected onto the face and the two side views are captured as shown in Figure 4.2. To generate the virtual video, the cameras and the projector have to be calibrated relative to each other. In essence, the transformation has to be done between the PCS and the LCS and between the PCS and the RCS. Since 3D information is not required to generate the novel frontal view, this calibration does not involve the HCS and WCS. The grid enables the transformation of corresponding points between the coordinate systems mentioned above. Using these transformation tables, one can map every pixel in the frontal view to the side views.

Figure 4.2: Demonstration of the behaviour of the grid pattern

Figure 4.3: Illustration of the bilinear interpolation technique.

The behaviour of a single gridded cell in the original side view and the virtual frontal view is demonstrated in Figure 4.2. A grid cell in the frontal image maps to a quadrilateral with curved edges in the side image. Bilinear interpolation is used to reconstruct the original frontal grid pattern; bilinear interpolation (see Figure 4.3) is the traditional way of warping a quadrilateral into a square or a rectangle. The number of pixels of the side image inside a single grid cell might be fewer than, equal to, or more than the number of pixels in the corresponding frontal grid cell. This distortion is caused by the 3D shape of the human face.

\[ s = f_L(x, y) \quad \text{and} \quad t = g_L(x, y) \tag{4.1} \]

\[ u = f_R(x, y) \quad \text{and} \quad v = g_R(x, y) \tag{4.2} \]

Equations 4.1 and 4.2 represent four functions that are to be determined during the calibration stage (off-line). Once the four functions are obtained, the transformation tables can be generated. These transformation tables will be used in the operational stage described in Section 4.3.
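To make the cell-wise mapping concrete, the sketch below shows one way the four functions of Equations 4.1 and 4.2 can be realized with bilinear interpolation and collected into a lookup table. This is a minimal illustration in Python/NumPy under the assumptions stated in the comments, not the thesis implementation; all function and variable names are chosen for exposition only.

```python
import numpy as np

def bilinear_cell_map(x, y, cell, corners):
    """Map a frontal-view pixel (x, y) inside one grid cell to the
    corresponding side-view location (Equations 4.1/4.2).

    cell    -- (x0, y0, x1, y1): frontal coordinates of the cell corners
    corners -- the four side-view coordinates measured at the frontal
               corners, ordered (x0,y0), (x1,y0), (x0,y1), (x1,y1)
    """
    x0, y0, x1, y1 = cell
    a = (x - x0) / float(x1 - x0)        # horizontal offset in [0, 1]
    b = (y - y0) / float(y1 - y0)        # vertical offset in [0, 1]
    c00, c10, c01, c11 = [np.asarray(c, dtype=float) for c in corners]
    # blend the four measured correspondences bilinearly
    return (1 - a) * (1 - b) * c00 + a * (1 - b) * c10 \
         + (1 - a) * b * c01 + a * b * c11

def build_table(width, height, cells, cell_corners):
    """Fill a (height, width, 2) transformation table T with
    T[y, x] = side-view coordinates for every covered frontal pixel."""
    table = np.full((height, width, 2), -1.0)   # -1 marks uncovered pixels
    for cell, corners in zip(cells, cell_corners):
        x0, y0, x1, y1 = cell
        for y in range(y0, y1):
            for x in range(x0, x1):
                table[y, x] = bilinear_cell_map(x, y, cell, corners)
    return table
```

One such table would be built for the left image and one for the right, matching the procedure described next.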
A calibration procedure is used to define the mapping of the grid cells. The procedure followed during the calibration is as follows:

1. Capture the two side views (with a grid projected on the face) from the two cameras and store them in the corresponding images (I_L[s, t] and I_R[u, v]).

2. Take some grid intersection points and define transform functions for determining the (s, t) coordinates in the left image (I_L) and the (u, v) coordinates in the right image (I_R) (see Equations 4.1 and 4.2).

3. Apply bilinear interpolation to map any points inside the grid coordinates.

4. To implement the transformation functions, construct two transformation tables (one for the left image and one for the right) which are indexed by (x, y) and give the corresponding (s, t) of I_L and (u, v) of I_R.

5. These transformation tables define the mapping M_F[x, y] of each frontal pixel in the virtual view to the corresponding pixel in the side views (I_L or I_R).

[Block diagram: the projector casts a grid pattern onto the human face; left and right calibration face images are captured by the two cameras and used to build the transformation tables.]

Figure 4.4: The off-line calibration stage during the synthesis of the virtual frontal view.

As shown in Figure 4.4, a projector is used to project a grid pattern onto the human face. The face with the gridded pattern is captured from the two side cameras. Using the coordinates of the original grid pattern and the corresponding coordinates in the side views, the transformation tables are generated.

4.2.3 Calibration for 3D Face Modeling

For generating the 3D model of the face, some depth information of prominent facial features has to be estimated. The technique used here for depth extraction is structure from stereo. For extracting the depth information, the three components (namely the LCS, RCS and PCS) have to be calibrated with respect to the HCS. A calibration sphere (see Figure 4.5) is used for calibrating the system. The origin of this calibration sphere is considered to be the origin of the WCS. A detailed discussion of the conversion of spherical coordinates to Cartesian coordinates is given in Appendix A.

Camera Calibration

In order to estimate the depth information, the system has to be calibrated with respect to the WCS. To calibrate the system, a calibration sphere is used; the origin of this calibration sphere is the origin of the WCS. The depth values are first estimated in the WCS and then transformed into the HCS. There are 17 calibration points (A-Q) shown in Figure 4.5.

Figure 4.5: The calibration sphere with labeled calibration points

In Figure 4.5, the calibration points are chosen in such a way that the polar angle is varied in steps of 30° and the azimuthal angle is varied in steps of 45°.
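For reference, the 3D world coordinates of such calibration points can be generated directly from this spherical description using the conversion given in Appendix A. The following is only a sketch: the sphere radius is not stated here, so R = 4.65 inches is assumed from the tabulated coordinates of point A in Chapter 5, and the exact A-Q labeling order is likewise assumed.

```python
import numpy as np

def sphere_point(R, polar_deg, azimuth_deg):
    """Cartesian coordinates of a calibration point on the sphere
    (polar angle measured from the z-axis, azimuth from the x-axis)."""
    phi = np.radians(polar_deg)
    theta = np.radians(azimuth_deg)
    return np.array([R * np.sin(phi) * np.cos(theta),
                     R * np.sin(phi) * np.sin(theta),
                     R * np.cos(phi)])

R = 4.65                                  # sphere radius in inches (assumed from point A)
points = [sphere_point(R, 0.0, 0.0)]      # the point at the pole
for polar in (30.0, 60.0):                # two rings below the pole
    for azimuth in range(0, 360, 45):     # eight points per ring, 45 degree steps
        points.append(sphere_point(R, polar, azimuth))
print(len(points))                        # 17 points, matching the labels A-Q
```

With these values, for example, the point at polar angle 30° and azimuth 90° evaluates to approximately (0.00, 2.33, 4.03), which matches point B in the calibration tables.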
The equations for camera calibration [SS01] are explained below.

\[
s\begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
=
\begin{bmatrix} c_{11} & c_{12} & c_{13} & c_{14} \\ c_{21} & c_{22} & c_{23} & c_{24} \\ c_{31} & c_{32} & c_{33} & 1 \end{bmatrix}
\begin{bmatrix} {}^{W}P_x \\ {}^{W}P_y \\ {}^{W}P_z \\ 1 \end{bmatrix}
\tag{4.3}
\]

Eliminating the homogeneous coordinate s from Equation 4.3, we get

\[ u_j = (c_{11} - c_{31}u_j)x_j + (c_{12} - c_{32}u_j)y_j + (c_{13} - c_{33}u_j)z_j + c_{14} \tag{4.4} \]

\[ v_j = (c_{21} - c_{31}v_j)x_j + (c_{22} - c_{32}v_j)y_j + (c_{23} - c_{33}v_j)z_j + c_{24} \tag{4.5} \]

During the calibration, the 2D image and the 3D world coordinates of the calibration points are given, and we have to determine the values of the calibration matrix. Hence, we have two linear equations for each of the calibration points. Equations 4.4 and 4.5 can be rearranged such that all the known values are placed on one side and the unknown values on the other. The new representation of Equations 4.4 and 4.5 is shown in Equation 4.6:

\[
\begin{bmatrix}
x_j & y_j & z_j & 1 & 0 & 0 & 0 & 0 & -x_j u_j & -y_j u_j & -z_j u_j \\
0 & 0 & 0 & 0 & x_j & y_j & z_j & 1 & -x_j v_j & -y_j v_j & -z_j v_j
\end{bmatrix}
\begin{bmatrix} c_{11} \\ c_{12} \\ c_{13} \\ c_{14} \\ c_{21} \\ c_{22} \\ c_{23} \\ c_{24} \\ c_{31} \\ c_{32} \\ c_{33} \end{bmatrix}
=
\begin{bmatrix} u_j \\ v_j \end{bmatrix}
\tag{4.6}
\]

In the above matrix representation, all the entities on the left are known from the calibration tuples, while the calibration matrix entries c_ij are unknown. Using n calibration points, we can obtain 2n linear equations. In our case, we need at least 6 calibration points to estimate the values of the 11 unknown calibration parameters. As discussed earlier, we have 12 points for calibrating the cameras and 17 points for calibrating the projector. In both cases we have more than 10 calibration points, so this is an over-determined system, which can be solved using the least squares approach.

Calibration of the Projector

The calibration of the projector is done in the same way as that of the cameras. The basic difference is that the image coordinates of the calibration points on the sphere are not obtained directly from a 2D image. Instead, the image coordinates are obtained by projecting a "blank image" onto the calibration sphere. All the labeled points on the calibration sphere can be "seen" from the projector. The image coordinates of these points are noted by clicking on the 2D screen image coordinates in the PCS while the image is projected onto the calibration sphere in the WCS.

4.3 Virtual Video Synthesis

The transformation tables that are generated in the off-line calibration phase are used in the operational phase to generate each virtual frontal frame in the video. The algorithm is described as follows:

1. Get the two side views, without a grid projected on the face, from the two cameras (I_L and I_R).

2. Reconstruct each (x, y) coordinate in the virtual view by accessing the corresponding location in the transformation table and retrieving the pixel in I_L (or I_R) using the mapping M_F[x, y].

3. Smooth the geometrical and lighting variations across the vertical midline in V by applying a linear (one-dimensional) filter.

4. Continue this reconstruction of V[x, y] for every frame of the videos to produce the final virtual frontal video.

Figure 4.6 shows the complete block diagram of the operational phase. The operational stage can be split into three main steps:

- Face warping
- Face mosaicking
- Post-processing

[Block diagram: the face is captured via the left and right mirrors and cameras; the left and right face images are warped using the transformation tables, mosaicked into a single face image, and post-processed into the final virtual frontal face image.]

Figure 4.6: Operational stage during the synthesis of the virtual frontal view

4.3.1 Face Warping

Each grid cell in the side views is warped into the frontal rectangular grid. The transformation tables define the mapping that is required at the pixel level; these tables are responsible for generating each frontal pixel in the virtual view. Since the transformation is based on the bilinear interpolation technique, each pixel can be generated only when it lies inside four grid coordinate points whose transformation is defined by the transformation tables. This is the main reason why our algorithm is unable to generate the ears and the hair portion of the face.
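As a rough sketch of this table-driven warping (the function names, the nearest-neighbour lookup, and the handling of uncovered pixels are illustrative choices, not the thesis code), each frontal pixel is filled by a simple lookup into the calibrated table:

```python
import numpy as np

def warp_frame(side_image, table):
    """Reconstruct the warped frontal half from one side view.

    side_image -- H x W x 3 side-view frame (I_L or I_R)
    table      -- H_f x W_f x 2 transformation table mapping each frontal
                  pixel (x, y) to side-view coordinates, -1 where undefined
    """
    h_f, w_f = table.shape[:2]
    frontal = np.zeros((h_f, w_f, 3), dtype=side_image.dtype)
    for y in range(h_f):
        for x in range(w_f):
            s, t = table[y, x]
            if s >= 0:                       # pixel covered by a calibrated grid cell
                # assumes the first table entry is the column and the second the row
                frontal[y, x] = side_image[int(round(t)), int(round(s))]
    return frontal
```

The left and right halves produced this way are then joined and blended along the vertical midline, as described in the mosaicking step below.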
4.3.2 Face Mosaicking

The two warped side views are placed adjacent to each other. The reference points are the horizontal intersection points that lie on the vertical line passing through the center of the face, thus bisecting the face into two halves. If the transformation tables are not created with the vertical line passing through the center of the face during the calibration stage, the mosaicking stage is significantly affected. After these two views are placed side by side, a smoothing algorithm is applied at the edge. This is important because the smoothing algorithm evens out variations in intensity and geometry. The geometrical variations, especially in the lip region, become more pronounced when the person speaks fast with large lip movements.

4.3.3 Post-Processing

This is the video editing stage. The video that is obtained contains the grid pattern in the first few frames; after the grid projection is stopped, there is a color transition on the skin. These frames with the gridded pattern have to be deleted from the final output. A microphone records the voice of the user, which is stored in a separate .wav file. This file is appended to the video file and the final output is obtained.

4.4 3D Face Model Construction

In order to construct a 3D face model, a generic 3D mesh model is used. Once the vertices and the edges of the mesh are defined, it can be customized to any individual user based on that particular user's facial features [IY96]. Hence, for constructing the 3D model from the 2D frontal image, one has to know the depth information of some of the facial features. This section deals with the following subtopics:

- Depth estimation using the stereo algorithm
- Customization of the generic head model
- Texture mapping the 3D mesh model

4.4.1 Stereo Computations

Since a projector is used to project a pattern onto the facial surface, stereo can be established between the projector and each camera. As per our earlier discussion, the two side views do not share many common face features, so it is necessary to establish stereo between the cameras and the projector. Using two calibrated cameras, an unknown 3D point [x, y, z] can be computed from its two images. Let [x, y, z] be the 3D point whose coordinates are to be found, and let this point be projected at [r_1, c_1] and [r_2, c_2] in the two images. Using the camera model discussed above, we get

\[
s\begin{bmatrix} r_1 \\ c_1 \\ 1 \end{bmatrix}
=
\begin{bmatrix} b_{11} & b_{12} & b_{13} & b_{14} \\ b_{21} & b_{22} & b_{23} & b_{24} \\ b_{31} & b_{32} & b_{33} & 1 \end{bmatrix}
\begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}
\tag{4.7}
\]

\[
t\begin{bmatrix} r_2 \\ c_2 \\ 1 \end{bmatrix}
=
\begin{bmatrix} c_{11} & c_{12} & c_{13} & c_{14} \\ c_{21} & c_{22} & c_{23} & c_{24} \\ c_{31} & c_{32} & c_{33} & 1 \end{bmatrix}
\begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}
\tag{4.8}
\]

Eliminating the homogeneous coordinates from Equations 4.7 and 4.8, we get

\[ r_1 = (b_{11} - b_{31}r_1)x + (b_{12} - b_{32}r_1)y + (b_{13} - b_{33}r_1)z + b_{14} \tag{4.9} \]

\[ c_1 = (b_{21} - b_{31}c_1)x + (b_{22} - b_{32}c_1)y + (b_{23} - b_{33}c_1)z + b_{24} \tag{4.10} \]

\[ r_2 = (c_{11} - c_{31}r_2)x + (c_{12} - c_{32}r_2)y + (c_{13} - c_{33}r_2)z + c_{14} \tag{4.11} \]

\[ c_2 = (c_{21} - c_{31}c_2)x + (c_{22} - c_{32}c_2)y + (c_{23} - c_{33}c_2)z + c_{24} \tag{4.12} \]

Equations 4.9, 4.10, 4.11 and 4.12 can be solved for the three unknowns (i.e. x, y and z). However, any three of these equations will yield slightly different results, because of the approximations that are made during the calibration stage.
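One direct way to use all four equations is to stack them as an over-determined linear system in (x, y, z) and solve it in a least-squares sense. The sketch below does only that; it is an illustration of this option rather than the thesis implementation, which instead refines the estimate with the closest-approach construction described next.

```python
import numpy as np

def triangulate(B, C, r1, c1, r2, c2):
    """Least-squares solution of Equations 4.9-4.12 for the 3D point.

    B, C -- 3x4 projection matrices of the two calibrated devices
    (r1, c1), (r2, c2) -- image coordinates of the point in the two views
    """
    rows, rhs = [], []
    for M, (r, c) in ((B, (r1, c1)), (C, (r2, c2))):
        # row from the r-equation: (m11 - m31 r)x + (m12 - m32 r)y + (m13 - m33 r)z = r - m14
        rows.append([M[0, 0] - M[2, 0] * r, M[0, 1] - M[2, 1] * r, M[0, 2] - M[2, 2] * r])
        rhs.append(r - M[0, 3])
        # row from the c-equation
        rows.append([M[1, 0] - M[2, 0] * c, M[1, 1] - M[2, 1] * c, M[1, 2] - M[2, 2] * c])
        rhs.append(c - M[1, 3])
    xyz, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return xyz
```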
A better way of solving for the unknown values is to use the "closest-approach" algorithm, which is described as follows. Let P_1 and P_2 be two points on one line and Q_1 and Q_2 be two points on another line. Let u_1 and u_2 be the two unit vectors in the directions of the lines joining these points, as shown in Figure 4.7.

Figure 4.7: The closest approach method

Let v be the shortest vector connecting the two lines P_1P_2 and Q_1Q_2. Hence, v can be written as

\[ v = P_1 + a_1 u_1 - (Q_1 + a_2 u_2) \tag{4.13} \]

Now, a_1 and a_2 can be expressed in terms of the known values:

\[ \big((P_1 + a_1 u_1) - (Q_1 + a_2 u_2)\big) \cdot u_1 = 0 \tag{4.14} \]

\[ \big((P_1 + a_1 u_1) - (Q_1 + a_2 u_2)\big) \cdot u_2 = 0 \tag{4.15} \]

Solving Equations 4.14 and 4.15, we get

\[ a_1 = \frac{(Q_1 - P_1)\cdot u_1 - \big((Q_1 - P_1)\cdot u_2\big)(u_1\cdot u_2)}{1 - (u_1\cdot u_2)^2} \tag{4.16} \]

\[ a_2 = \frac{\big((Q_1 - P_1)\cdot u_1\big)(u_1\cdot u_2) - (Q_1 - P_1)\cdot u_2}{1 - (u_1\cdot u_2)^2} \tag{4.17} \]

From Equations 4.16 and 4.17 the values of a_1 and a_2 are obtained, and hence the vector v is estimated. If |v| is less than some threshold value, then the intersection of the two rays is reported as shown in Equation 4.18:

\[ [x, y, z]^{t} = \tfrac{1}{2}\big[(P_1 + a_1 u_1) + (Q_1 + a_2 u_2)\big] \tag{4.18} \]

4.4.2 3D Model Generation

A generic 3D face mesh, shown in Figure 4.8, was obtained from the University of Sheffield's 3D Computer Graphics Research Lab. In this head model there are 395 vertices and 818 triangles. The HCS is defined in Figure 4.9; the three axes and the origin are shown in the figure. To construct an individualized texture-mapped 3D head model, the depth z of the facial parameters estimated in the WCS is converted into the HCS. The frontal view is texture mapped onto the 3D mesh in the HCS. The coordinate axes of the calibration sphere are decided by the orientation of the sphere; during the calibration stage, the sphere is positioned in such a way that the z-axes of the WCS and the HCS are parallel to each other.

Figure 4.8: Front view and side view of the 3D generic mesh model of the face

Figure 4.9: HCS with the origin (O) and three perpendicular axes (x, y and z)

4.5 Implementation Details

File Sizes: As described earlier, two videos are captured in real time. The image sequences are captured without any compression for effective CPU usage; moreover, any processing of the video is much easier in uncompressed form. Each video is captured for 10 seconds, which means 300 frames. Each frame has a resolution of 320 x 240 and contains 3 channels (RGB). Hence the total size of each file = 300 * 320 * 240 * 3 / (1024 * 1024) = 65.9 MB.

Data Processing: Currently, the data is processed off-line. This is due to the restrictions of the hardware, specifically the hard disk writing speed. The process of creating the calibration table takes a couple of minutes. The practical writing speed of the hard disk was measured to be 9 MB/s. For every second, an uncompressed .avi (audio-video interleave) file of size 6.59 MB is generated per camera, and hence the hard disk and the data bus must be capable of transferring the data at a rate of about 14 MB/s. If the resolution of the video were 640 x 480, the total required write speed would be at least 54 MB/s. The voice is captured using a microphone and stored in a .wav file. The voice is synchronized with the video in the post-processing stage using Adobe Premiere 6.0 software.
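The storage and bandwidth figures above follow directly from the frame geometry; the short sketch below restates the same arithmetic for two uncompressed RGB streams at 30 frames per second (the frame rate implied by 300 frames in 10 seconds). The rounding to 14 MB/s in the text simply adds some margin.

```python
width, height, channels, fps, seconds, streams = 320, 240, 3, 30, 10, 2

bytes_per_frame = width * height * channels
file_mb = bytes_per_frame * fps * seconds / (1024 * 1024)     # one 10 s stream
rate_mb_s = bytes_per_frame * fps * streams / (1024 * 1024)   # both streams, per second

print(round(file_mb, 1))    # ~65.9 MB per captured video
print(round(rate_mb_s, 1))  # ~13.2 MB/s sustained write rate for two streams
```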
Chapter 5

Experimental Results

This chapter shows some results obtained from our system. Section 5.1 gives the results of the calibration process. Frames from synthesized virtual videos are shown in Section 5.2. Some preliminary results of the face modeling are shown in Section 5.3.

5.1 Calibration

Section 5.1.1 describes the results obtained during the calibration for virtual video synthesis. The calibration results for the 3D face modeling are described in Section 5.1.2.

5.1.1 Virtual Video Synthesis

During the calibration, a rectangular grid of dimension 400 x 400 is projected onto the face. The grid is made by repeating three colored lines. Colored lines were used because it is easy to distinguish the lines (see Figure 5.1) on the captured side views. We used white, green and cyan for this purpose; these colors were chosen because of their bright appearance over the skin color. The first few frames have the grid projected onto the face and then the grid is turned off. One of the frames with the grid is taken and the transformation tables are generated. We calculate the transform functions represented by Equations 4.1 and 4.2 from the grid coordinates in both side views and the original grid coordinates. Each pixel in the frontal virtual grid is then obtained from the corresponding inverse transform functions.

The size of the grid pattern that is projected in the calibration stage plays a significant role in the quality of the video. This size is decided based on the trade-off between the quality of the video and the time taken. An appropriate grid size has to be chosen by trial and error. We started by projecting a sparse grid pattern onto the face and then increased the density of the grid pattern. At some point, a further increase in density no longer significantly improves the quality of the face image but starts consuming more time; at that point the grid is finalized. The grid is also decided based upon the average face size. In our experiments, we settled on a grid cell size with a row-width of 24 pixels and a column-width of 18 pixels. Figure 5.2 shows the frames that are captured during the calibration stage of the experiment.

Figure 5.1: A square grid with three alternating colors (white, cyan and green lines) is projected onto the face. Each grid cell has a row-width of 24 pixels and a column-width of 18 pixels.

Figure 5.2: Face images captured during the calibration stage using the environment-static FCS

5.1.2 Face Modeling

As discussed earlier (Section 4.2.3), to apply the structure-from-stereo method, we need to calibrate both cameras and the projector with respect to the WCS. We shall now discuss the results of the camera and projector calibration.

Camera Calibration

This section shows some sample results of the camera and projector calibration procedures that were discussed in Section 4.2.3. Figure 5.3 shows some of the images of the calibration sphere that were captured from the left camera and the right camera during the calibration stage.

Figure 5.3: Images captured from the left camera and the right camera during the camera calibration.

Equations 5.1 and 5.2 show the camera matrices obtained from the 12 calibration points shown in Figure 5.3.

\[
\begin{bmatrix}
32.150 & 0.921 & 8.699 & 135.927 \\
2.809 & -32.033 & -4.347 & 124.153 \\
0.015 & 0.003 & -0.023 & 1.000
\end{bmatrix}
\tag{5.1}
\]

\[
\begin{bmatrix}
28.197 & 1.474 & -16.748 & 164.733 \\
-0.509 & -32.143 & -4.232 & 118.646 \\
-0.011 & 0.002 & -0.023 & 1.000
\end{bmatrix}
\tag{5.2}
\]

After obtaining the calibration matrices using the methods described in the previous chapter, one can evaluate the accuracy of these matrices using the calibration points as test points. The calibration points can be estimated in the 2D image using their given 3D coordinates and the calibration matrix. The difference between the actual 2D image coordinates of these calibration points and the estimated 2D image coordinates is termed the "residual". Table 5.1 shows the residuals obtained while calibrating the left camera.

Table 5.1: Results of left camera calibration for points on the calibration sphere. The input data are the measured 2D image coordinates (u, v) and the 3D coordinates (x, y, z); the output data are the fitted image coordinates (u1, v1) and the residuals.

Point | u | v | x | y | z | u1 | v1 | u_err | v_err | Total err
A | 198 | 116 | 0.00 | 0.00 | 4.65 | 198.0 | 116.7 | 0.0 | -0.7 | 0.7
B | 190 | 36 | 0.00 | 2.33 | 4.03 | 189.8 | 35.3 | 0.2 | 0.7 | 0.9
C | 135 | 56 | -1.64 | 1.64 | 4.03 | 135.0 | 55.7 | -0.0 | 0.3 | 0.3
D | 109 | 115 | -2.33 | 0.00 | 4.03 | 110.4 | 114.9 | -1.4 | 0.1 | 1.5
E | 134 | 176 | -1.64 | -1.64 | 4.03 | 132.9 | 176.4 | 1.1 | -0.4 | 1.5
F | 187 | 202 | 0.00 | -2.33 | 4.03 | 187.7 | 201.4 | -0.7 | 0.6 | 1.3
G | 240 | 177 | 1.64 | -1.64 | 4.03 | 240.3 | 177.2 | -0.3 | -0.2 | 0.5
H | 262 | 121 | 2.33 | 0.00 | 4.03 | 261.6 | 120.5 | 0.4 | 0.5 | 0.9
I | 241 | 62 | 1.64 | 1.64 | 4.03 | 241.3 | 62.7 | -0.3 | -0.7 | 1.0
K | 74 | 16 | -2.85 | 2.85 | 2.33 | 73.8 | 16.3 | 0.2 | -0.3 | 0.5
L | 30 | 116 | -4.03 | 0.00 | 2.33 | 30.1 | 115.9 | -0.1 | 0.1 | 0.2
M | 69 | 220 | -2.85 | -2.85 | 2.33 | 69.2 | 220.1 | -0.2 | -0.1 | 0.3
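For reference, both the least-squares solution of Equation 4.6 and the residual check reported in Table 5.1 can be reproduced with a few lines of linear algebra. The following is an illustrative NumPy sketch, not the code used in this work.

```python
import numpy as np

def calibrate(world_pts, image_pts):
    """Solve Equation 4.6 for the 11 unknown entries of the calibration
    matrix from n >= 6 correspondences, using least squares."""
    A, b = [], []
    for (x, y, z), (u, v) in zip(world_pts, image_pts):
        A.append([x, y, z, 1, 0, 0, 0, 0, -x * u, -y * u, -z * u]); b.append(u)
        A.append([0, 0, 0, 0, x, y, z, 1, -x * v, -y * v, -z * v]); b.append(v)
    p, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return np.append(p, 1.0).reshape(3, 4)      # the last entry c34 is fixed to 1

def residuals(C, world_pts, image_pts):
    """Project the calibration points with C and report (u_err, v_err)."""
    errs = []
    for (x, y, z), (u, v) in zip(world_pts, image_pts):
        p = C @ np.array([x, y, z, 1.0])
        errs.append((u - p[0] / p[2], v - p[1] / p[2]))
    return np.array(errs)
```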
Calibration of the Projector

Table 5.2 shows the errors in the calibration of the projector. The errors are slightly higher when compared to both cameras; these errors might be due to the radial distortion of the projector. Also, in the case of the projector, all 17 calibration points (shown in Figure 4.5) are chosen for calibration. The main difference from the previously discussed camera calibration method is that the (u, v) coordinates are obtained by projecting an image with known image points onto the 3D calibration points. The mouse was used to click on the 2D image points while the image was projected onto the 3D world points. Equation 5.3 shows the projector calibration matrix.

\[
\begin{bmatrix}
35.434 & 0.613 & -9.550 & 512.584 \\
0.256 & -35.130 & -11.145 & 449.176 \\
0.000 & 0.000 & -0.018 & 1.000
\end{bmatrix}
\tag{5.3}
\]

Table 5.2: Results of projector calibration for points on the calibration sphere. The input data are the measured 2D image coordinates (u, v) and the 3D coordinates (x, y, z); the output data are the fitted image coordinates (u1, v1) and the residuals.

Point | u | v | x | y | z | u1 | v1 | u_err | v_err | Total err
A | 511 | 432 | 0.00 | 0.00 | 4.65 | 510.3 | 433.1 | 0.7 | -1.1 | 1.8
B | 512 | 348 | 0.00 | 2.33 | 4.03 | 511.9 | 347.3 | 0.1 | 0.7 | 0.8
C | 449 | 373 | -1.64 | 1.64 | 4.03 | 448.9 | 372.7 | 0.1 | 0.3 | 0.4
D | 420 | 435 | -2.33 | 0.00 | 4.03 | 422.0 | 434.8 | -2.0 | 0.2 | 2.2
E | 448 | 496 | -1.64 | -1.64 | 4.03 | 447.0 | 497.4 | 1.0 | -1.4 | 2.4
F | 509 | 524 | 0.00 | -2.33 | 4.03 | 509.4 | 523.7 | -0.4 | 0.3 | 0.7
G | 573 | 498 | 1.64 | -1.64 | 4.03 | 572.4 | 498.2 | 0.6 | -0.2 | 0.8
H | 600 | 437 | 2.33 | -0.00 | 4.03 | 599.3 | 436.0 | 0.7 | 1.0 | 1.7
I | 573 | 374 | 1.64 | 1.64 | 4.03 | 574.2 | 373.5 | -1.2 | 0.5 | 1.7
J | 513 | 292 | 0.00 | 4.03 | 2.33 | 513.6 | 293.7 | -0.6 | -1.7 | 2.3
K | 409 | 336 | -2.85 | 2.85 | 2.33 | 407.9 | 336.2 | 1.1 | -0.2 | 1.3
L | 363 | 441 | -4.03 | 0.00 | 2.33 | 362.7 | 440.5 | 0.3 | 0.5 | 0.8
M | 405 | 547 | -2.85 | -2.85 | 2.33 | 404.8 | 545.5 | 0.2 | 1.5 | 1.7
N | 508 | 588 | 0.00 | -4.03 | 2.33 | 509.4 | 589.6 | -1.4 | -1.6 | 3.0
O | 616 | 548 | 2.85 | -2.85 | 2.33 | 615.2 | 546.8 | 0.8 | 1.2 | 2.0
P | 660 | 441 | 4.03 | 0.00 | 2.33 | 660.2 | 442.4 | -0.2 | -1.4 | 1.6
Q | 618 | 339 | 2.85 | 2.85 | 2.33 | 618.1 | 337.6 | -0.1 | 1.4 | 1.5

5.2 Virtual Video Synthesis

This section discusses the results of the virtual video synthesis algorithm that is described in Section 4.3. It also discusses the synchronization issues that might affect the quality of the novel virtual views.

5.2.1 Face Warping

The results¹ of the warping during the calibration and operational stages are shown in Figure 5.4.

¹ If this thesis work was accessed through the University Microfilm (UMI) database, the images will not be in color. Even though the color information is lost, the author feels that the concept is well explained using gray-scale images.

Figure 5.4: Frontal view generation during the calibration stage and reconstruction of the frontal image from the side view using the grid: (a) left image captured during the calibration stage. (b) virtual image constructed using the transformation tables and the right image during the calibration stage. (c) right image captured during the operational stage.
(d) result of the reconstructed frontal view from the transformation tables and the right image during the operational stage 5.2.2 Virtual View Synthesis Figure 5.6 shows the output image of the frontal view that is generated by our algo- rithm. This output is obtained by applying our warping and mosaicking algorithms to the left and right views shown in Figure 5.5. Figure 5.5: Face images captured during the operational phase using the ESFCS Figure 5.7 shows the side views of the human face captured using the HMFCS. The main problems of capturing the faces using HMFCS are: 1. Lighting variations 2. Distortion caused by the mirrors 62 Figure 5.6: (a) Frontal view that is obtained from the camcorder and (b) virtual frontal view generated from our algorithm 3. Vibrations of the cameras and the mirrors Figure 5.7: Face images captured using the HMFCS (a) left image and (b) right image Figure 5.8 shows the output of the virtual view generated from the images cap- tured using HMFCS. 5.2.3 Video Synchronization Synchronization in the two videos is crucial in our application. Since, two views of a face with lip movements are merged together, any small changes in the synchroniza- tion will have high impact on the misalignment of the lips. This synchronization can 63 Figure 5.8: Virtual frontal view generated from the side views captured through HMFCS Figure 5.9: (a) Top row: images captured from the left camera. (b) Second row: images captured using the right camera. (c) Third row: images captured using cam- corder that is placed in front of the face. (d) Final row: virtual frontal views generated from the images in the first two rows 64 be evaluated based on sensitive movements such as eyeball movements (see Figure 5.10) and blinking eyelids (see Figure 5.11). Similarly, mouth movements can be analyzed from the virtual videos. Figure 5.10: Synchronization of the eyeball movements: real video is in the top row and the virtual video is in the bottom row. Figure 5.11: Synchronization of the eyelids during blinking: real video is in the top row and the virtual video is in the bottom row. 65 5.3 3D Face Model Construction A 3D individualized head model can be constructed by adjusting a generic mesh model to fit the 3D feature points of an individual face. Depth information at the eye corners, mouth corners and nose tip is estimated using the stereo method. Section 5.3.1 deals with the results of the stereovision computations. 5.3.1 Stereovision Computations Table 5.3 shows the error obtained during the estimation of 3D world coordinates on the calibration sphere. On average, the error in coordinates was less than 0.1 inches. The maximum total error was not more than 0.2 inches. For estimating the depth information of the facial feature points this accuracy might be sufficient. Table 5.3: Depth estimation from left camera and projector for points on the calibration sphere. 3D coordinate dimensions are in inches. 
Point | x | y | z | Left camera u | Left camera v | Projector u | Projector v | x1 | y1 | z1 | Error
A | 0.00 | 0.00 | 4.65 | 198 | 116 | 511 | 432 | 0.00 | 0.02 | 4.69 | 0.06
B | 0.00 | 2.33 | 4.03 | 190 | 36 | 512 | 348 | -0.02 | 2.30 | 4.12 | 0.14
C | -1.64 | 1.64 | 4.03 | 135 | 56 | 449 | 373 | -1.65 | 1.64 | 4.07 | 0.05
D | -2.33 | 0.00 | 4.03 | 109 | 115 | 420 | 435 | -2.38 | -0.01 | 4.09 | 0.12
E | -1.64 | -1.64 | 4.03 | 135 | 176 | 448 | 496 | -1.62 | -1.62 | 4.14 | 0.15
F | 0.00 | -2.33 | 4.03 | 187 | 202 | 509 | 524 | -0.02 | -2.34 | 4.04 | 0.04
G | 1.64 | -1.64 | 4.03 | 240 | 177 | 573 | 498 | 1.65 | -1.64 | 4.04 | 0.02
H | 2.33 | 0.00 | 4.03 | 262 | 121 | 600 | 437 | 2.32 | -0.03 | 4.11 | 0.12
I | 1.64 | 1.64 | 4.03 | 241 | 62 | 573 | 374 | 1.59 | 1.64 | 4.17 | 0.19
K | -2.85 | 2.85 | 2.32 | 74 | 16 | 409 | 336 | -2.83 | 2.86 | 2.30 | 0.05
L | -4.03 | 0.00 | 2.33 | 30 | 116 | 363 | 441 | -4.03 | 0.00 | 2.31 | 0.02
M | -2.85 | -2.85 | 2.33 | 69 | 220 | 405 | 547 | -2.84 | -2.86 | 2.26 | 0.09

5.3.2 Customization of the 3D Face Model

The generic model used to generate the customized texture-mapped face model was described in Figure 4.8. This model is made up of 395 vertices and a total of 2454 edges forming 818 triangles. Each triangle is defined by three vertices. This model is a complete head model that contains the ears and the back portion of the head. It also includes the eyeballs, which are not common in some other face models. The customization of the generic head model is done by distorting the generic model as described in [IY96]. As discussed earlier, the 3D coordinates of some of the prominent facial feature points are estimated and the 3D model is distorted to obtain the customized head model. The frontal view texture used for texture mapping the 3D model is shown in Figure 5.12. This texture map was obtained by extending the grid lines during the calibration stage. Figure 5.13 shows different rendered views of the 3D face model that was constructed from the two side views.

Figure 5.12: Frontal texture used for 3D face model construction

Figure 5.13: Different views rendered from the texture-mapped 3D face model
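The deformation itself follows [IY96] and is not reproduced here. As a loose illustration of the general idea only, a generic mesh can be pulled toward the measured 3D feature points by a simple per-axis scaling; the sketch below makes that simplifying assumption, and its function and feature names are hypothetical.

```python
import numpy as np

def scale_generic_mesh(vertices, generic_feats, measured_feats):
    """Crude per-axis scaling of a generic head mesh so that its feature
    points better match the measured 3D features (expressed in the HCS).

    vertices, generic_feats, measured_feats -- arrays of 3D points;
    the feature spans are assumed to be non-degenerate on every axis.
    """
    g_span = generic_feats.max(axis=0) - generic_feats.min(axis=0)
    m_span = measured_feats.max(axis=0) - measured_feats.min(axis=0)
    scale = m_span / g_span                   # one scale factor per axis
    center = generic_feats.mean(axis=0)
    target = measured_feats.mean(axis=0)
    return (vertices - center) * scale + target
```

A full customization, as in [IY96], would instead move individual mesh regions toward their corresponding feature points rather than applying a single global scaling.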
Chapter 6

Assessment of the Results

Section 6.1 describes some evaluation procedures used to assess the quality of the generated virtual frontal views. Some further discussion of the results and issues related to the stability and portability of the FCS is given in Section 6.2.

6.1 Evaluation Schemes

Some researchers have worked on the evaluation of facial expressions [POM99, SCW+01]. Evaluating novel views has not been studied much in the literature. In our case, we need to evaluate the synthesized videos in comparison with the real videos. This evaluation must assess the accuracy of facial alignment, lip and eye movements, and the perceptual quality of the synthesized videos. One can broadly classify the evaluation procedures into two kinds:

- Objective evaluation
- Subjective evaluation, or quality assessment

Section 6.1.1 describes the methods used for objective evaluation. The quality assessment of the videos is discussed in Section 6.1.2.

6.1.1 Objective Evaluation

One approach to assessing the video is to evaluate the system theoretically. This approach does not require any human intervention or feedback. An error is obtained by comparing the virtual video frames with the real video frames of the frontal face. For effective comparison, the real video frames captured using a camcorder and the virtual video frames are normalized to a size of 200 x 200. The five images that were considered for evaluation are shown in Figure 6.1.

Figure 6.1: Images considered for objective evaluation. (a) Top row: real video frames. (b) Bottom row: virtual video frames.

This evaluation can give some information regarding the facial feature alignment and facial movements, which form the basis for facial interpretation and recognition. This can be done in two ways:

- Normalized cross-correlation of the 2D intensity arrays
- Euclidean distance measure between facial feature points

Figure 6.2 shows the bounding boxes of the regions that are considered for evaluation using the normalized cross-correlation method. The entire window was also used for evaluation.

Figure 6.2: Facial regions compared using normalized cross-correlation (Left: real view. Right: virtual view.)

Normalized Cross-correlation

Let h be the height of the image and w the width of the image. The cross-correlation between a virtual image V and a real image R of width w and height h is given by Equation 6.1:

\[ CC_{VR} = \sum_{i=0}^{w-1}\sum_{j=0}^{h-1} V[i,j]\,R[i,j] \tag{6.1} \]

\[ \|V\| = \sqrt{\sum_{i=0}^{w-1}\sum_{j=0}^{h-1} V[i,j]^2} \tag{6.2} \]

\[ \|R\| = \sqrt{\sum_{i=0}^{w-1}\sum_{j=0}^{h-1} R[i,j]^2} \tag{6.3} \]

Equations 6.2 and 6.3 define the magnitudes of the images V and R respectively. Equation 6.4 gives the normalized cross-correlation between the two images V and R:

\[ NCC_{VR} = \frac{CC_{VR}}{\|V\|\,\|R\|} \tag{6.4} \]

Table 6.1: Results of normalized cross-correlation between the real and the virtual frontal views. The normalized cross-correlation is applied to various regions of the face, concentrating on the eye and mouth regions.

Frame | left eye | right eye | mouth | eyes + mouth | complete face
Frame1 | 0.988 | 0.987 | 0.993 | 0.989 | 0.989
Frame2 | 0.969 | 0.972 | 0.985 | 0.978 | 0.985
Frame3 | 0.969 | 0.967 | 0.992 | 0.978 | 0.986
Frame4 | 0.991 | 0.989 | 0.993 | 0.990 | 0.990
Frame5 | 0.985 | 0.986 | 0.992 | 0.988 | 0.989

The value of the normalized cross-correlation ranges between -1 and 1, with values of low absolute value indicating low similarity and absolute values near 1 indicating high similarity. In general, there was a high correlation between the real and the virtual images. Frames 2 and 3 shown in Figure 6.1 contain facial expressions (eye and lip movements) that are quite different from the expression used during the calibration stage, and hence the generated views gave lower correlation values when compared to the other frames. The facial expressions in frames 1 and 4 were similar to the expression in the calibration frame; hence, these frames have a high correlation value compared to the rest. The eye and lip regions were considered for evaluating the system because during any facial movement these regions change significantly.
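The region scores in Table 6.1 follow directly from Equations 6.1-6.4; the short sketch below shows the computation, with the region boxes supplied by hand as in the evaluation above (an illustration only, not the evaluation code used here).

```python
import numpy as np

def ncc(V, R):
    """Normalized cross-correlation of two equally sized gray-scale
    regions (Equations 6.1-6.4)."""
    V = V.astype(float).ravel()
    R = R.astype(float).ravel()
    return float(np.dot(V, R) / (np.linalg.norm(V) * np.linalg.norm(R)))

def region_scores(virtual, real, boxes):
    """Score each labeled region; boxes maps a name to (x0, y0, x1, y1)."""
    return {name: ncc(virtual[y0:y1, x0:x1], real[y0:y1, x0:x1])
            for name, (x0, y0, x1, y1) in boxes.items()}
```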
Euclidean distance measure

Using the Euclidean distance measure, the error is estimated as the difference in the normalized Euclidean distances between some of the most prominent feature points. The feature points are chosen in such a way that one of them is relatively static with respect to the other. This helps us to evaluate the facial movements more accurately, since the difference is calculated using a pseudo-reference (static) point. For example, if we consider some prominent feature points of the face (such as the corners of the eyes, the nose tip and the corners of the mouth), the corners of the eyes are relatively static when compared to the corners of the mouth.

Figure 6.3: Facial feature points and the distances that are considered for evaluation using the Euclidean distance measure (Left: real view. Right: virtual view.)

Figure 6.3 shows the most prominent facial feature points and the distances between those points that are considered for evaluation using the Euclidean distance measure. The basic assumption here is that the corners of the eyes and the nose tip are static with respect to the cameras. However, there is a lot of mouth movement, and hence the position of the lip corners is not static. When the distance between two feature points is measured, one point is chosen to be static and the other to be dynamic. The Euclidean distance between two points whose coordinates are (x_i, y_i) and (x_j, y_j) is given by Equation 6.5:

\[ ED = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2} \tag{6.5} \]

Let R_ij represent the Euclidean distance between two feature points i and j in the real frontal image and V_ij represent the Euclidean distance between the same two feature points in the virtual frontal image. The difference in the Euclidean distance is given by Equation 6.6:

\[ D_{ij} = |R_{ij} - V_{ij}| \tag{6.6} \]

The total error e for comparing the face images is defined by Equation 6.7:

\[ e = \tfrac{1}{6}\left[ D_{af} + D_{bf} + D_{ef} + D_{cg} + D_{dg} + D_{eg} \right] \tag{6.7} \]

Table 6.2: Euclidean distance measurement of the prominent facial distances in the real image and the virtual image and the defined average error. All dimensions are in pixels.

Frame | D_af | D_bf | D_ef | D_cg | D_dg | D_eg | Total error (e)
Frame1 | 2.00 | 0.80 | 4.15 | 3.49 | 2.95 | 3.46 | 2.80
Frame2 | 0.59 | 3.00 | 0.79 | 4.91 | 0.63 | 0.80 | 1.79
Frame3 | 1.88 | 3.84 | 4.29 | 4.34 | 2.68 | 1.83 | 3.14
Frame4 | 1.09 | 2.97 | 2.10 | 6.33 | 3.01 | 4.08 | 3.36
Frame5 | 1.62 | 2.21 | 5.57 | 4.99 | 1.24 | 1.90 | 2.92

The results in Table 6.2 indicate a small error in the Euclidean distance measurements. An error of 3 pixels is not a significant quantity in an image of size 200 x 200. The facial feature points in the five frames were selected manually, and hence some of the error may also have been caused by the instability of manual selection. One can also note that the error values of D_ef and D_cg are larger than the others. This is probably because the nose tip is not as robust a point as the eye corners; hence, the errors in the distances involving the nose tip are larger. Some of the errors obtained with both of the above methods might also be due to the difference in the resolution of the images: the virtual frontal view has a resolution of 162 x 192, while the real frontal view is of higher resolution (about 320 x 240).

6.1.2 Subjective Evaluation

Subjective evaluation involves some kind of human intervention. The response of a human evaluating the system can vary from a simple "yes/no" to a numerical rating of the quality (1-6). Essentially, this is a test that evaluates the quality of the virtual videos for supporting perception. The main factors that might affect the quality of the virtual frontal face videos include

- eye and lip movements
- facial expressions
- synchronization of the two halves of the face
- color and texture of the face
- quality of the audio
- synchronization of the audio
The main reason is that the cameras used to capture the side view have smaller focal length than that of the camcorder and hence the distortion in the images is more when compared to the real images captured using the camcorder. Further analysis of these videos is a future task of the teleportal project. Tests can be conducted where the virtual videos and real videos are displayed in random order and human judges are asked to identify or evaluate the real and the virtual videos. This test can ask the subjects to evaluate the expressiveness of the video. Judgements can be rated on a scale of 1-6 where 6 represents highest confidence in evaluation of the expressions. Some standard expressions (e.g. joy, angry, surprise, sadness) could be judged. 76 6.2 Discussion of the Results 6.2.1 Time Taken The time taken for processing the video is one important aspect of our system. Our goal is to make the system work in real-time. The total time taken can be mainly split into three parts. Pre—buffering estimates the time taken to transfer the images into the corresponding buffers. The next part is the time taken for doing the actual warping. This is the time taken for interpolating each of the grid blocks in the frontal image. The final part is the time taken for post processing. The post-processing consumes little time (less than 5% of the total time) because a linear filter is applied to smooth the image. The time taken for the overall procedure is directly proportional to the density of the grid. In our case, the average time taken per frame for processing the videos is around 60 ms using a computer with 746 MHz. Any processing that consumes less than 30 ms is considered to be real-time processing. However, the time taken in our case can be optimized and can be made to work in real-time in a high speed computer (with 2.6 GHz processing speed). 6.2.2 Positioning of Cameras and Mirrors Placing the cameras and mirrors in the appropriate positions is crucial to the quality of the results. Most importantly, the cameras must focus the image of the face inside the mirror. The mirrors are to be adjusted in such a way that the face is well captured. Also, the vibrations of the cameras and the mirrors are to be minimized, especially when the person is speaking. 77 The angle at which the two side views are captured will have significant impact on the generated frontal view. Even though, it is possible to create a frontal view from widely separated side views using our algorithm, the facial expressions will be more distorted in the case of widely separated views. Hence, it is important that care should be taken to capture images that are not widely separated. On the other hand, the accuracy in the depth measurements will be improved significantly if the cameras are widely separated because the depth information is extracted from the stereo between the cameras and the projector. As the cameras move farther apart, the rays from the camera and the projector tend to intersect at wider angles and thus the intersection points of the rays will be more accurate. If the views are widely separated then there will not be many common feature points between the two images captured from the cameras. In such a case, there cannot be stereo between the images captured from the cameras alone. 6.2.3 Depth of Field Issues When the face capture is done through the mirror, lot of importance has to be given to the issue of depth of field. Distant facial features like the nose tip and the ear corners are focussed well. 
To make the system portable, one has to make sure that it can be used by various individuals without making any changes. This system was successfully ported to various individuals (with different head sizes) without changing the position and the focus of the cameras and the mirrors.

Chapter 7

Conclusion

This chapter gives a discussion of the completed work on the facial capture system and the future directions. Figure 7.1 summarizes both accomplishments and future work.

7.1 Summary

The proposed facial capture system will be able to capture the human face in a mobile environment in real time. A real-time video stream of the frontal view of the face is obtained by merging the two side views captured by two side cameras. We have developed customized mirrors based on the calculations made from the optical layout of the system. The algorithm being used can be made to work in real time because of its computational simplicity and has been demonstrated to run at near real time on a PIII computer. This working prototype has been tested on a diverse set of 10 individuals. For comparisons of the virtual videos with real videos, we expect that important facial expressions will be represented adequately and that feature locations will not be distorted by more than 2%. To further demonstrate the promise of this approach, a 3D head model has been created from the two captured side views.

[Diagram: implemented subsystems, arranged from static to dynamic in time and from 2D to 3D: a virtual frontal image/video path and a texture-mapped 3D face model path (stereo plus a generic head model); a dashed future block shows 3D facial animation driven by motion changes and geometric data.]

Figure 7.1: Conclusion and future work. Solid blocks indicate implemented subsystems. Dashed block indicates future subsystem.

7.2 Future Work

We need to simplify the calibration procedure by automating some of the steps during the calibration stage. This can be done by applying thresholding algorithms to the two side views captured during the calibration. Online processing of the video in real time and real-time remote collaboration should be demonstrated in the near future. Customizing the camera lenses might improve system quality. Algorithms for color balancing should model the lighting conditions more effectively.
A hierarchial grid that can project more dense grid patterns onto the eye and mouth regions may improve the quality of the output. The bilinear interpolation technique tends to be a block-wise operation. Cubic interpolation might give better results because cubic functions are used for modeling curved surfaces. The equipment has to be stabilized by fixing the capture system onto a more robust headset. An extra calibration step might be required to handle the distortion produced by the convex mirrors. This face capture system has to be integrated with the Head Mounted Projection Display [HGBROI]. In the years to come, one can enable a stereo field-of-view to be transmitted in time slices alternating between the human face view and the field of view of the user. This can be achieved by using an electro—optical glass in place of the mirrors [K802]. 81 Perhaps, one can achieve this by flipping the mirrors mechanically, allowing the same cameras to transmit the scene viewed by the mobile user. Thus, the FCS will be able to perform dual duty (1)to capture the face and (2)to capture the user’s field of view. The vibrations of the cameras and the mirrors are to be minimized especially when the person is speaking. These vibrations might be more while the person is in motion. If there are significant vibrations, then there will be a necessity for a video stabilization algorithm that has to be implemented while the image sequences are being captured. Compression of the video streams will help in effective transmission of the data. Some parameters for an Internet2 transmission channel can be optimized for effective data communication for specific applications. 82 Appendix A Conversion of Spherical to Cartesian Coordinates The points on the calibration sphere are represented in spherical coordinates. These spherical coordinates are converted into cartesian coordinates with origin as center of the calibration sphere. Figure A.1 shows how a point P is defined in a spherical coordinate system. Let R be the radial distance of P from the origin. 6 is the azimuthal angle in the xy-plane from the x-axis. d) is the polar angle from the z-axis. This is also called the “colatitude” of point P. The ranges of these angles are as follows 036$2Hand0$¢§fl Using basic trigonometry, the Cartesian coordinates (P1, Py, Pz) for the point P are defined from the spherical coordinates. (R, (7’, o) 83 px = RSin(6)COS(¢) Py : RSi-n(0)5in(¢l P2 = RCos(o) bl Figure A.1: Spherical coordinate system 84 Bibliography [BC94] [BF82] [BF98] [BK87] [BKPOI] [Bor01] [BP93] [BROO] [CNSD+92] [ewes] [Du/1898] C. Burdea and P. Coiffet. Virtual Reality Technology. VViley-Interscience, 1994. S. T. Barnard and M. A. F ischler. Computational stereo. ACM Com- puting Surveys, 14(4):553—572, 1982. D. Buxton and G. W. Fitzmaurice. HMDS, caves and chameleon: a human-centric analysis of interaction in virtual space. Computer Graph- ics, 32(4):69—74, 1998. K. L. Boyer and A. C. Kak. Color-encoded structured light for rapid active ranging. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9(1):14—28, 1987. M. Billinghurst, H. Kato, and I. Poupyrev. The magicbook: Moving seamlessly between reality and virtuality. IEEE Computer Graphics and Applications, 21(3):6—8, 2001. M. Bordenaro. ASPs help make ‘virtual meetings’ successful. Chicago Tribune, Nov. 19, 2001. R. Brunelli and T. Poggio. Face recognition: Features vs. templates. IEEE Trans. on Pattern Analysis and Machine Intelligence, 15(10):1042— 1052, April 1993. F. Biocca and J. P. 
Rolland. Teleportal face-to-face system. Patent Filed, August, 2000. C. Cruz-Neira, D. J. Sandin, T.A. DeFanti, R.V. Kenyon, and J. C. Cart. The cave: Audio visual experience automatic virtual environments. Communications of the ACM, 35(6):65—72, 1992. S. E. Chen and L. Williams. View interpolation for image synthesis. In Proceedings of SIGG'RPAH93, pages 279—288, 1993. D. DeCarlo, D. Metaxas, and M. Stone. An anthropometric face model using variational techniques. In Proceedings of SICGRAPH, pages 67—74, 1998. 85 [DRHLWB] L. Davis, J. Rolland, F. Hamza—Lup, Y. Ha, J. Norfleet, and C. Imielin- [DT96] [F ai90] [FBA+94] [Fei02] [Fis96] [FSW97] [GGSC96] [Gew+98] [Goo95] [HamOl] [H886] [HGBROI] [HGGROO] ska. Alices adventures in wonderland: A unique technology enabling a continuum of virtual environment experiences. Computer Graphics and Applications, 23(2):10—12, 2003. F. W. DePiero and M. M. Trivedi. 3-d computer vision using structured light: Design, calibration, and implementation issues. IEEE Computer, 43:243—278, 1996. G. Faigin. The Artist’s Complete Guide to Facial Animation. \Natson- Guptill Publications, 1990. H. Fuchs, G. Bishop, K. Arthur, L. McMillan, R. Bajcsy, S. Lee, H. F arid, and T. Kanade. Virtual space teleconferencing using a sea of cameras. In Proceedings of the First International Symposium on Medical Robotics and Computer Assisted Surgery” 1994. S. Feiner. Augmented reality: a new way of seeing. Scientific American, 286:48—55, 2002. R. Fisher. Head-mounted projection display system featuring beam split- ter and method of making same. US Patent 5,572,229, November 5, 1996. K. E. Finn, A. J. Sellen, and S. B. Wilbur. Video-mediated communica- tion. Lawrence Erlbaum Associates, 1997. S. Gortler, R. Grzeszczuk, R. Szeliski, and M. Cohen. The lumigraph. In Proceedings of SIGGRAPH, pages 43—54, 1996. B. Guenter, C. Grimm, D. Wood, H. Malvar, , and F. Pighin. Making faces. In Proceedings of SIGGRAPH, pages 55—66, 1998. D. Goodman. Handbook of Optics 2nd Ed, chapter General Principles of Geometric Optics. New York: MeGraw-Hill, 1995. M. Hamblem. Avoiding travel, users turn to communications technol- ogy: Videoconferencing, Web collaboration use increasing in aftermath of attacks. Computer World, Sept. 24, 2001. B. K. P. Horn and M. J. Brooks. The variational approach to shape from shading. Computer Vision, Graphics, and Image Processing, 33(2):]74— 208, 1986. H. Hua, C. Gao, F. Biocca, and J. Rolland. An ultralight and com- pact design and implementation of head-mounted projective displays. In Proceedings of IEEE Virtual Reality 2001, pages 175—182, 2001. H. Hua, A. Girardot, C. Gao, and J. P. Rolland. Engineering of head- mounted projective displays. Applied Optics, 39(22):3814—3824, 2000. 86 [H889] [IY96] [J ar83] [KGH85] [K097] [K802] [LC01] [Len98] [Lin01] [MD99] [MK94] [MIN/197] [OK93] G. Hu and G. Stockman. 3-d surface solution using structured light and constraint propagation. IEEE Trans. on Pattern Analysis and Machine Intelligence, 11(4):390—402, April 1989. H.H.S. Ip and L. Yin. Constructing a 3d individualized head model from two orthogonal views. Visual Computer, 12:254—266, 1996. R. A. Jarvis. A perspective on range finding techniques for computer vi- sion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5(2):122—139, 1983. M. Krueger, T. Gionfriddo, and K. Hinrichsen. Videoplace - an artificial reality. In Proceedings of ACM CHI ’85 Conference on Human Factors in Computing Systems, pages 35—40, 1985. R. Kijima and T. Ojika. 
Transition between virtual environment and workstation environment with projective head-mounted display. In Pro- ceedings of IEEE 1997 Virtual Reality Annual International Symposium, pages 130—137, 1997. A. M. Kunz and C. P. Spagno. Simultaneous projection and picture acquisition for a distributed collaborative environment. In Proceedings of IEEE Virtual Reality 2002, pages 279—280, 2002. SH. Lai and CM. Cheng. Three-dimensional face model creation from video. In Proceedings of SPIE Conference on Three-dimensional Image Capture and Applications IV, 2001. J. Lengyel. Telepresence by real-time view-dependent image generation from omnidirectional video streams. IEEE Computer, 31(7):46—53, 1998. C. Lindquist. Analysis: 8 hot technologies for 2002. CNN Online, Dec. 31,2001. R.A. Manning and CR. Dyer. Interpolating view and scene motion by dynamic view morphing. In Proceedings of International Conference on Computer Vision and Pattern Recognition, pages 388—394, 1999. P. Milgram and F. Kishino. A taxonomy of mixed reality visual displays. IEICE Transactions on Information Systems, 77(12), 1994. P. Mouroulis and J. Macdonald. Geometrical Optics and Optical Design. Oxford Univ. Press, 1997. M. Okutomi and Takeo Kanade. A multiple-baseline stereo. IEEE Trans. on Pattern Analysis and Machine Intelligence, 15(4):353—363, April 1993. 87 [OstOI] [ovrv9s] [PGD98] [PHL+98] [POM99] [PR98] [PW96] [RFOO] [RK75] [Sat94] [scw+91] [SD96] [Sei01] M. Osterman. Messaging subs for travel, snail mail since attacks. Net- work World Messaging Newsletter, Dec. 03, 2001. Y. Onoe, K. Yamazawa, H. Takemura, and N. Yokoya. Telepresence by real-time view—dependent. image generation from omnidirectional video streams. Computer Vision and Image Understanding, 71(2):154—165, 1998. M. Proesmans, L. Van Cool, and F. Defoort. Reading between the lines —— a method for extracting dynamic 3d with texture. In Proceedings of International Conference on Computer Vision, pages 1081—1086, 1998. F. Pighin, J. Hecker, D. Lischinski, R. Szeliski, and D. Salesin. Syn- thesizing realistic facial expressions from photographs. In Proceedings of SIGGRAPH, 1998. I. Pandzic, J. Ostermann, and D. Millen. User evaluation: Synthetic talking faces for interactive servicess. Visual Computer, 15(7/8):330— 340, April 1999. J. Parson and J. P. Rolland. A non-intrusive display technique for pro- viding real—time data within a surgeon’s critical area of interest. In Pro- ceedings of Medicine Meets Virtual Reality 98, pages 246—251, 1998. F. I. Parke and K. Waters. Appendix 1: Three-dimensional muscle model facial animation. A. K. Peters, 1996. J. P. Rolland and H. Fuchs. Optical versus video see-through head- mounted displays in medical visualization. Presence: Teleoperators and Virtual Environments, 9(3):287—309, 2000. F. Rocker and A. Kiessling. Methods for analyzing three dimensional scenes. In Proceedings of 4th. International Joint Conference on Artificial Intelligence, pages 669—673, 1975. K. Sato. Silicon range finder :a realtime range finding VLSI sensor. In Proceedings of IEEE Custom Integrated Circuits Conference, pages 339— 342, 1994. M.A. Sayette, J. Cohn, J \/I Wertz, M.A. Perrott, and DJ Parrott. A psychometric evaluation of the facial action coding system for assessing spontaneous expression. Journal of Nonverbal Behavior, 252167 — 186, 2001. S. M. Seitz and C. R. Dyer. View morphing. In Proceedings of SIG- GRAPH96, pages 21—30, 1996. SM. Seitz. The space of all stereo images. 
In Proceedings of International Conference on Computer Vision, pages 26-33, 2001.

[SN97] [SS89] [SS01] [Sut65] [TK92] [VBK02] [WBL+96] [web] [Woo80] [XLH02] [YLW98] [ZT92]

H. Saji and H. Nakatani. Measuring three-dimensional shapes of a moving human face using photometric stereo method with two light sources and slit patterns. In Proceedings of IEICE Transactions on Information and Systems, pages 795-801, 1997.

N. Shrikhande and G. Stockman. Surface orientation from a projected grid. IEEE Trans. on Pattern Analysis and Machine Intelligence, 11(6):650-655, April 1989.

L. G. Shapiro and G. C. Stockman. Computer Vision. Prentice-Hall, 2001.

I. Sutherland. The ultimate display. In Proceedings of IFIP 65, pages 506-508, 1965.

C. Tomasi and T. Kanade. Shape and motion from image streams under orthography: a factorization method. International Journal on Computer Vision, 9(2):137-154, 1992.

S. Vedula, S. Baker, and T. Kanade. Spatio-temporal view interpolation. In Proceedings of the 13th ACM Eurographics Workshop on Rendering, June 2002.

R. Welch, T. T. Blackmon, A. Lin, B. A. Mellers, and L. W. Stark. The effects of pictorial realism, delay of visual feedback, and observer interactivity on the subjective sense of presence. Presence: Teleoperators and Virtual Environments, 5(3):263-273, 1996.

The Internet2 website. http://www.internet2.edu.

R. J. Woodham. Photometric method for determining surface orientation from multiple images. Optical Engineering, 19(1):139-144, 1980.

L. Q. Xu, B. Lei, and E. Hendriks. Computer vision for a 3-d visualization and telepresence collaborative working environment. BT Technology Journal, 20(1):64-74, 2002.

J. Yang, W. Lu, and A. Waibel. Skin-color modeling and adaptation. In Asian Conference on Computer Vision, pages 687-694, 1998.

J. Y. Zheng and S. Tsuji. Panoramic representation for route recognition by a mobile robot. International Journal of Computer Vision, 9:55-76, 1992.