LIBRARY
Michigan State University

This is to certify that the thesis entitled VISION-BASED TRACKING OF FIDUCIALS FOR AUGMENTED REALITY presented by PAUL W. MIDDLIN has been accepted towards fulfillment of the requirements for the M.S. degree in Computer Science.

Major professor
Date 12/13/02

VISION-BASED TRACKING OF FIDUCIALS FOR AUGMENTED REALITY

By

Paul W. Middlin

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

Computer Science

2002

ABSTRACT

VISION-BASED TRACKING OF FIDUCIALS FOR AUGMENTED REALITY

By Paul W. Middlin

Visible fiducial images are a common method for supporting vision-based tracking in augmented reality systems. This thesis describes algorithmic improvements in fiducial-based tracking, including an improved fiducial design, better fiducial location, and improved pose computation. A set of criteria that are desirable in an optically tracked fiducial is presented, and a new fiducial image set is designed that meets these criteria. The images in this set utilize a square black-border pattern with a 15% border width and an interior image that supports orientation determination and unique identification. The interior image is constructed from orthogonal Discrete Cosine Transform basis images chosen to minimize the probability of misidentification and to be robust to noise and occlusion.
This image could be integrated into an Augmented Reality software system such as the well-known and widely used ARToolKit to improve accuracy in identification of fiducials. Fiducial tracking involves more than simply creating a good fiducial image. The tracking includes methods to accurately locate the fiducial in the image, then use this information to calculate the location and orientation of the fiducial in relation to the camera. This location and orientation is known as pose. The ability of this system to track and calculate the pose of fiducials has been evaluated and compared to the ARToolKit as well. The system has proved to be generally better than the ARToolKit in terms of locating, identifying, and calculating the pose of a fiducial.

Dedicated to Stuart Griffin, whose ambition inspires us all.

ACKNOWLEDGMENTS

I would like to thank Dr. Charles Owen for his extensive contributions to the implementation of this project, as well as for his advice and direction. I would also like to thank the students of CSE891, Augmented Reality, whose ideas and discussion led to the adoption of the fiducial criteria used, and the eventual DCT method itself. Thanks go to Tony Lambert for helping set up the testing environment, and for the use of his laptop and truck. Michael Malinak was helpful when doing the background research necessary for this thesis.

TABLE OF CONTENTS

Table of Figures ........ vii
Table of Tables ........ ix
1 Introduction ........ 1
1.1 Background ........ 1
1.2 Contributions ........ 4
1.3 Outline of Chapters ........ 4
2 Related Work ........ 6
2.1 ARToolKit ........ 6
2.2 CyberCode ........ 9
2.3 HOM System ........ 11
2.4 IGD System ........ 11
2.5 SCR System ........ 12
2.6 TRIP System ........ 13
2.7 Multi-resolution Colored Rings ........ 14
2.8 Other Systems ........ 15
2.9 Pose Calculation Methods ........ 15
3 Criteria for a Good Fiducial ........ 17
3.1 Fiducial Shape ........ 17
3.2 Fiducial Color ........ 20
3.3 Locating the Fiducial ........ 21
3.4 Fiducial Identification ........ 23
3.5 Fiducial Identification Range ........ 27
3.6 A Large Fiducial Identification Space ........ 28
3.7 Human Identification ........ 28
3.8 Summary of Desirable Characteristics ........ 29
4 A "Good" Fiducial Interior Image ........ 31
4.1 Deriving the Image ........ 31
4.2 Detection ........ 35
5 A Functioning System ........ 38
5.1 Finding the Border ........ 39
5.2 Tracing the Border ........ 40
5.3 Accounting for Camera Distortion ........ 42
5.4 Locating the Quadrilateral Corners ........ 43
5.5 Quadrilateral Test ........ 45
5.6 Line Fitting ........ 45
5.7 Warping ........ 46
5.8 Identifying with the DCT ........ 50
5.9 Calculating Pose ........ 50
6 Evaluation ........ 55
6.1 Distance and Rotation Test Setup ........ 55
6.2 Distance Results ........ 58
6.3 Rotation Results ........ 59
6.4 Identification Results ........ 65
6.5 Speed ........ 73
6.6 Discussion of Results ........ 74
7 Future Work ........ 76
7.1 Basis Set ........ 76
7.2 Pose Estimation ........ 77
7.3 Finding Potential Fiducials ........ 78
8 Conclusions ........ 79
9 Appendix A - Camera Frequency Response ........ 81
9.1 Background ........ 81
9.2 Test Setup ........ 82
9.3 Results ........ 84
9.4 Discussion ........ 92
10 Appendix B - Testing Data ........ 94
11 Appendix C - Camera Calibration ........ 96
References ........ 100

TABLE OF FIGURES

Figure 2-1 - Example ARToolKit Fiducial ........ 7
Figure 2-2 - CyberCode recognition steps ........ 10
Figure 2-3 - Example HOM Fiducials ........ 11
Figure 2-4 - Example IGD Fiducials ........ 12
Figure 2-5 - Example SCR Fiducials ........ 12
Figure 2-6 - TRIP Target representing 1160407 ........ 13
Figure 2-7 - Multi-size color fiducials ........ 14
Figure 3-1 - Equivalence of interior images for orientation determination ........ 23
Figure 3-2 - Example images for correlation tests ........ 26
Figure 4-1 - Example DCT fiducial images ........ 35
Figure 5-1 - Region with extraneous pixels ........ 42
Figure 5-2 - (a) Estimated first, actual third (b) Finding 2 and 4 ........ 44
Figure 5-3 - Special Case ........ 44
Figure 5-4 - Solution for special case ........ 44
Figure 5-5 - Line fitting and intersection ........ 46
Figure 5-6 - Pseudo-code for Finding Warped Image ........ 50
Figure 5-7 - Finding the X axis ........ 51
Figure 5-8 - Pose Finding Method Comparison ........ 54
Figure 6-1 - Test Setup ........ 56
Figure 6-2 - Finding the Center of Projection ........ 57
Figure 6-3 - DCT Test Fiducial ........ 58
Figure 6-4 - ARToolKit Test Fiducial ........ 58
Figure 6-5 - Distance Error Comparison ........ 59
Figure 6-6 - Angular Error, All Images ........ 61
Figure 6-7 - Angular Error, 0 Degrees ........ 61
Figure 6-8 - Angular Error, 15 Degrees ........ 62
Figure 6-9 - Angular Error, 30 Degrees ........ 62
Figure 6-10 - Angular Error, 45 Degrees ........ 63
Figure 6-11 - Angular Error, 60 Degrees ........ 63
Figure 6-12 - Angular Error, 75 Degrees ........ 64
Figure 6-13 - ARToolKit Misidentification, 3 foot, #1 ........ 68
Figure 6-14 - ARToolKit Misidentification, 3 foot, #2 ........ 69
Figure 6-15 - ARToolKit Misidentification, 6 foot ........ 70
Figure 6-16 - ARToolKit Correct Identification Using DCT, 3 feet ........ 71
Figure 6-17 - ARToolKit Correct Identification Using DCT, 6 feet ........ 72
Figure 6-18 - Example Test Image ........ 73
Figure 9-1 - Effect of point spread function ........ 82
Figure 9-2 - Testing pattern ........ 83
Figure 9-3 - Ideal image, bands 165 through 179 (close to half the Nyquist frequency) ........ 90
Figure 9-4 - Logitech Image ........ 91
Figure 11-1 - Jig used for calibration ........ 97
Figure 11-2 - Radial Distortion ........ 97
Figure 11-3 - Corner locations after calibration ........ 98

TABLE OF TABLES

Table 5-1 - Starting Directions for Tracing ........ 41
Table 6-1 - ARToolKit Fiducial Shape Associations ........ 65
Table 6-2 - Identification test results ........ 67
Table 6-3 - Speed Test Results ........ 74

1 Introduction

1.1 Background

Augmented reality (AR) is the blending of computer-generated virtual elements with reality [1]. A common example AR application is rendering computer graphics onto existing imagery such that the graphics appear to be seamless additions to or augmentations of the real image, registered in space and matching in scale. One of the most difficult challenges in this application is aligning the real and virtual worlds so as to achieve this seamless registration. The parameters of the rendering environment must exactly match those of the camera system that captured the image. Vision-based tracking uses images of the world to support this computation, either through tracking of natural image features [2, 3] or through the use of markers or fiducials placed in the scene.

This thesis proposes a set of criteria to use when designing a fiducial and a vision-based tracking system for the fiducial design, making arguments for a specific type of fiducial that was created with optimization of these criteria in mind. Further, a system has been designed that uses these fiducials for tracking, and has been optimized for performance in a way that is consistent with the fiducial criteria.

Existing fiducial tracking systems use ad hoc fiducial images based on either comparison to a library of template images or simple bar-code-based mechanisms. The designs are typically based on human, not machine, identification and on ease of identification at high resolutions. The images tend to be highly correlated and often are misidentified.
This thesis recognizes the need for a set of fiducial images that can be systematically produced, has a small chance of being misidentified, and can be easily and accurately tracked. These images are two-dimensional forms of the Discrete Cosine Transform (DCT) basis set. The shape, border width, color, and method of locating the fiducial have been chosen after analysis of the criteria set forth in this thesis. The choices were made in an attempt to satisfy the general majority of fiducial tracking needs, though they will not be ideal for all situations. The fiducials utilize a square shape with a black border that is 15% of the width of the fiducial. The interior images are monochrome and are based on the DCT basis set, with an orientation component built in.

The design choices made for the tracking system are shown to be theoretically superior. However, theoretical superiority is not a guarantee of performance in a real implementation. To verify the performance of this system, it has been compared to the ARToolKit [4] with tests in pose calculation accuracy and fiducial identification. The ARToolKit was used as a benchmark since it is one of the most popular and widely used fiducial tracking systems for Augmented Reality; it will in fact be mentioned numerous times throughout this thesis as a basis for performance comparison. The testing has shown that the system created for this thesis was more capable both in terms of fiducial identification and pose estimation. Additionally, this system executes much more quickly than the ARToolKit, allowing more time to do the three-dimensional rendering required in most AR applications.

The need for such an improved system stems from the wide variety of AR applications that use this technology. For instance, Fjeld and Voegtli [5] have created a system that uses fiducials to allow a user to view chemical models in a more interactive way.
A series of fiducials are used to identify different chemical compounds, and a graphical overlay of a model for these compounds is placed over the fiducial. This addition of graphics is done on a viewscreen. The user can interact with the models by moving the fiducials, or by using another cube that has fiducials on it. This cube can be rotated with a person's hand, and will cause the chemical model to rotate synchronously.

Using a viewscreen is not the only option for displaying Augmented Reality. Some systems use a Head-Mounted Display (HMD), which is like having two small monitors in front of the user (one for each eye). HMDs come in two major forms: video see-through and optical see-through. A video see-through HMD uses a camera to record video, then passes this video to the eye with small LCD displays. In this case, the user is seeing the video as the camera(s) see it. For fiducial tracking, this means that the fiducials could be replaced with a virtual element before the video is seen by the user, thereby augmenting the reality that the user sees. An optical see-through display is similar, but the user can see the world directly through the HMD. Here, graphics are overlaid using a half-silvered mirror and LCDs to combine the computer display's light with light coming from the actual objects. Again, fiducials could be used to calculate where the user's head is in relation to the objects he or she is viewing so that the virtual elements can be registered with the real objects in the user's line of sight. This is also an example of a situation in which the fiducials being tracked do not need to be in the same space as the virtual elements. That is, a separate camera can be used strictly to track the user's HMD, so the fiducial on the HMD would never be seen by the user; the fiducial is never shown on the HMD display.
Fiducial tracking can be extended to many other media, such as video monitors, handheld devices, or systems that do not use visual representations at all. The purpose behind using the fiducials is tracking, which implies that any application in which the location and orientation of an object needs to be known can benefit from vision-based tracking.

1.2 Contributions

Contributions of this thesis are as follows:

- A set of criteria that define the qualities of a good fiducial tracking system
- A set of fiducial design choices that optimize those criteria
- A specific set of fiducial images based on the DCT that perform well with respect to the given criteria
- A system implemented using these fiducials, optimized for performance
- Testing of the system and comparison to a well-established fiducial tracking system (ARToolKit)
- Evaluation results that demonstrate improved accuracy, stability, and reliability

1.3 Outline of Chapters

Chapter 2 outlines a representative set of existing vision-based fiducial tracking systems. Chapter 3 describes a set of criteria created based on the needs of such systems as those described in Chapter 2. Chapter 4 utilizes these criteria to derive a new type of fiducial that performs well relative to those criteria. Chapter 5 describes this system in detail and Chapter 6 presents an evaluation of the system performance. The fiducial system created is still not necessarily ideal, and ideas for the improvement of this system are described in Chapter 7.

2 Related Work

Fiducial-based tracking is a key enabling technology for a wide variety of applications. The motion-picture industry uses fiducials to track camera movement in support of augmented imagery. Manufacturing applications track fiducial images on circuit boards and other components so as to support accurate assembly alignment. Because of this general utility, there are many fiducial systems that have been proposed in both commercial and research areas.
Some systems exist only to support location of a single point in an image. Others support only two axes of alignment for parts placement. Only a limited number of systems support full pose computation as described in this thesis. This chapter describes a set of the major systems described in the literature.

2.1 ARToolKit

One of the most well-known and widely used fiducial tracking systems is the ARToolKit. It was created by H. Kato and M. Billinghurst at the University of Washington in the Human Interface Technology Lab [4, 6] and supports full pose calculation in addition to identification of a set of fiducials. The ARToolKit is widely distributed as open source for a variety of target platforms. Between its free distribution, documentation, and ease of use, it has become the center of a wide variety of AR applications that depend on vision-based tracking. It is used both for research and commercial purposes, and for development of other systems in the form of its compiled tracking libraries.

ARToolKit markers are square fiducial images with a fixed, black band exterior surrounding a unique image interior. Figure 2-1 is an example ARToolKit fiducial. The outer black band contrasts against a light background and is used to locate a candidate fiducial in a captured image. The interior image enables the identification of the candidate from a set of expected images and determination of the four possible orientations. The four corners of the located fiducial are used to unambiguously determine the position and orientation of the fiducial relative to a calibrated camera.

Figure 2-1 - Example ARToolKit Fiducial

Design of the distinguishing interior image is completely up to the user. This content is ad hoc, in that there is no systematic process to generate it or to choose good alternatives. Frequently, single letters or numbers are used. The ARToolKit requires several steps to find and match a fiducial image.
The image is thresholded against a constant value and all connected components are labeled. The edges of the connected regions are located using contour following. These contours are then fitted to lines to form a quadrilateral. If a quadrilateral is found, the pixels in this quadrilateral are resampled into a 16x16 upright square image that is compared with the fiducial patterns registered with the system. The comparison is done by calculating the correlation coefficient between the captured candidate image and a stored template pattern.

In the following equations, $I(x,y)$ is the candidate image and $P(x,y)$ is the pattern, each of size $X \times Y$. First, the means and standard deviations for the image and pattern are computed (clearly the pattern data can be pre-computed). $\mu_I$ and $\mu_P$ from equations (2-1) and (2-2) are substituted into the standard deviation equations in (2-3):

$$\mu_I = \frac{1}{XY}\sum_x\sum_y I(x,y) \qquad (2\text{-}1)$$

$$\mu_P = \frac{1}{XY}\sum_x\sum_y P(x,y) \qquad (2\text{-}2)$$

$$\sigma_I = \sqrt{\sum_x\sum_y \left(I(x,y)-\mu_I\right)^2} \qquad \sigma_P = \sqrt{\sum_x\sum_y \left(P(x,y)-\mu_P\right)^2} \qquad (2\text{-}3)$$

Then, the correlation coefficient $\rho$ is computed as:

$$\rho = \frac{\sum_x\sum_y \left(I(x,y)-\mu_I\right)\left(P(x,y)-\mu_P\right)}{\sigma_I \, \sigma_P} \qquad (2\text{-}4)$$

The correlation coefficient ranges from -1 to 1; larger values indicate similarity of the image based on an L2 norm. If the coefficient for one image is maximal for the image set and exceeds a fixed threshold (0.5), then the image is accepted.

Obviously, this process is a complex calculation. More importantly, using this process means that to find a best match the system must calculate a coefficient between the candidate image and each of the expected patterns, an O(N) operation. The more patterns in the system, the longer it will take to perform this calculation. The ARToolKit actually has a hard limit on the number of fiducials that can be registered with the system.
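The matching test in equations (2-1) through (2-4) can be sketched in a few lines of code. This is an illustrative reimplementation derived from the equations above, not the ARToolKit source; it assumes the candidate has already been resampled to the same size as each stored pattern, and represents images as flat lists of pixel values:

```python
def correlate(candidate, pattern):
    """Correlation coefficient between a resampled candidate image and a
    stored template pattern, following equations (2-1) through (2-4)."""
    n = len(candidate)  # both images flattened to the same number of pixels
    mu_i = sum(candidate) / n                      # (2-1)
    mu_p = sum(pattern) / n                        # (2-2)
    dev_i = [v - mu_i for v in candidate]
    dev_p = [v - mu_p for v in pattern]
    sigma_i = sum(d * d for d in dev_i) ** 0.5     # (2-3)
    sigma_p = sum(d * d for d in dev_p) ** 0.5
    # (2-4): normalized cross-correlation
    return sum(a * b for a, b in zip(dev_i, dev_p)) / (sigma_i * sigma_p)

def best_match(candidate, patterns, threshold=0.5):
    """O(N) scan over all registered patterns; accept the maximal
    coefficient only if it exceeds the fixed threshold."""
    scores = [(correlate(candidate, p), idx) for idx, p in enumerate(patterns)]
    best_score, best_idx = max(scores)
    return best_idx if best_score > threshold else None
```

The `best_match` loop makes the O(N) cost discussed above explicit: every registered pattern is correlated against every candidate region in every frame.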
This limit helps to prevent matching from taking too long, but limits the flexibility of the system because of the small number of fiducials that can be used.

2.2 CyberCode

The CyberCode system was created at Sony Computer Science Laboratories [7]. CyberCode is based on a two-dimensional bar code fiducial. Here, the interest was more in producing a large number of unique fiducials. A CyberCode fiducial consists of a square area for the patterned code, with a black bar alongside the square region to help determine orientation. There is no surrounding border as with the ARToolKit fiducials. Figure 2-2(a) shows an example of a CyberCode fiducial. The guide bar is pointed out in (b). The four corners of the square area are always black (c), so the code pattern is the cross-shaped area inside of this (d).

Figure 2-2 - CyberCode recognition steps

The tags are found by adaptively thresholding the image, then applying a connected-components algorithm. The connected regions are then searched for a specific second-order moment, indicating the guide bar. From there, the algorithm locates the four corners, and uses these locations to account for distortion from tilt and viewing angle. The last step is, of course, to decode the bitmap inside the four corners. Sony claims to be able to use 24 bits to encode the identification, meaning that there are over 16 million possible CyberCode markers. This is a very wide space. Sony has published little about the performance of this system in terms of adaptability to different lighting conditions, low-resolution images, or 3D location accuracy.

2.3 HOM System

Similar to CyberCode, the HOM system created by Siemens uses a 2D code with a side bar [8]. In this case, however, the sidebar also contains 6 bits of additional coding information and the square part of the fiducial has a solid border. See Figure 2-3 for an example.
Figure 2-3 - Example HOM Fiducials

2.4 IGD System

The IGD system is another coded fiducial system using a black border and a bitmap in the middle [9]. The IGD marker system was implemented at the Institute for Computer Graphics (Institut Graphische Datenverarbeitung) in Darmstadt, which is an ARVIKA partner. ARVIKA is a German government-supported research project to develop AR-related applications in industry. Many ARVIKA-related applications are developed using the IGD marker system. An IGD marker is a square divided into 6x6 square tiles of equal size. The inner 4x4 tiles are used to determine the orientation and the code of the marker. Figure 2-4 shows an example of this fiducial. The precompiled libraries of the IGD marker system are available to ARVIKA participants [10].

Figure 2-4 - Example IGD Fiducials

2.5 SCR System

The SCR marker system was developed by Siemens Corporate Research for AR applications [11]. It also uses a coded matrix to identify the fiducial, as seen in Figure 2-5. Additionally, it locates 8 feature points instead of the usual 4 found in most square fiducial systems. The additional points might help to increase the accuracy of the location of the fiducial, which in turn can help make 3D translations more accurate.

Figure 2-5 - Example SCR Fiducials

2.6 TRIP System

The TRIP (Target Recognition using Image Processing) system is a circle-based system. It was developed at Cambridge University in the Laboratory for Communications Engineering [12]. It uses a sector-based circular system of bar coding. The innermost part of the target is a "bull's-eye", which is used to locate the fiducial. The TRIP algorithm thresholds the image, does edge detection, and then edge following. The connected edges are examined and only those that are circular (or ovular) are kept. Finally, the bull's-eye is identified when two concentric circles are found. After finding the fiducial, the two concentric rings around the bull's-eye are examined.
They are broken into 16 sectors, as shown in Figure 2-6. One of these is used as a synchronization sector; two others are used for even parity. The remaining 13 sectors are used as a ternary code. There are therefore 3^13 = 1,594,323 ≈ 2^20 possible codes.

Figure 2-6 - TRIP Target representing 1160407 (the original figure labels ring codes 1 and 2, the even-parity sectors, and the synchronization sector)

Despite providing only one real location point, the TRIP system does indeed calculate the 3D position of the target in relation to the camera. It does this using the POSE_FROM_CIRCLE algorithm described by Forsyth et al [13]. The synchronization sector is used to find the orientation of the circle.

2.7 Multi-resolution Colored Rings

Cho, Lee, and Neumann at the University of Southern California have created a system that uses nested colored rings [14]. The purpose is to make fiducials that can be found over a wide viewing range. Each fiducial consists of a center circle, with three rings of increasing width surrounding the center (Figure 2-7).

Figure 2-7 - Multi-size color fiducials (first, second, and third levels)

Their algorithm searches for the smaller rings first. If the center circle with a single ring is found (first level), then there is no need to look for the surrounding rings. If the center cannot be found, then the fiducial must be too far away to distinguish such a small feature, so the algorithm locates the second level instead. Likewise, the third level will be found for a smaller fiducial. The effective range for each level overlaps, but the smaller should be found first in the case of an overlap because this requires less processing time. The range of sizes for identifiable fiducials is about 24 to 56 pixels in diameter.

It should also be noted that each fiducial returns only a single point, so any calculation of 3D location would require 3 or more fiducials in the scene.
In fact, using strictly the three points, there are often up to four solutions [15]. This implies that such a system must employ extra processing between frames to rule out the other solutions, or that it is sometimes inaccurate because it lacks a fourth point for correspondence.

2.8 Other Systems

Many simple approaches using fixed color squares, circles, or cross patterns have been demonstrated. Most projects approach the problem either from the standpoint of selecting a set of images (as in ARToolKit) or choosing a way to encode data into images (as in CyberCode). There are many other systems that do fiducial-based tracking; see the following references: [14, 16-19].

2.9 Pose Calculation Methods

Pose is the location and orientation of an object. The location implies three degrees of freedom: positions along the X, Y, and Z axes. This alone does not reveal the way the object is situated at that location, so the orientation component is needed as well. Orientation contributes three degrees of freedom as well: rotations about the X, Y, and Z axes. Therefore, pose involves six degrees of freedom.

There are many methods for calculating the 3D location of points in relation to a calibrated camera given the screen coordinates of these points and a model of the object. It is assumed that a single fiducial (or a set of fiducials in some systems) represents a coordinate system. It is typical that a single corner of a fiducial image is declared to be the origin of the system and all points are considered to lie in the (x, y) plane. Any fiducial tracking system used for AR must use some form of pose calculation to estimate the 3D location of the fiducial. Three particular methods are examined in this thesis (see Section 5.9), but there are many methods in existence. The methods described here relate mainly to those presented by Shapiro and Stockman [20] and that used by the ARToolKit [4].
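To make the six degrees of freedom concrete, a pose can be packed into a single 4x4 homogeneous transform built from three translations and three rotation angles. This is an illustrative sketch only; the Z·Y·X rotation order is one common convention chosen here as an assumption, not a convention stated in this thesis:

```python
import numpy as np

def pose_matrix(tx, ty, tz, rx, ry, rz):
    """Build a 4x4 rigid transform from the six pose parameters:
    a translation (tx, ty, tz) and rotations (in radians) about the
    X, Y, and Z axes, applied in X-then-Y-then-Z order."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    T = np.eye(4)
    T[:3, :3] = Rz @ Ry @ Rx   # composed rotation (3 DOF)
    T[:3, 3] = [tx, ty, tz]    # translation (3 DOF)
    return T
```

The upper-left 3x3 block is always orthonormal with determinant 1, which is what distinguishes a rigid pose from a general projective transform.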
It seems valid to mention the work in this area by Ji et al. [21], which describes methods for doing pose calculation from a variety of geometric shapes. Also important is the work of Quan and Lan [22], who have developed a linear method for pose calculation (instead of an iterative approach, as described in Section 5.9). This method solves the systems of equations using the classical Sylvester resultant [23] and quaternions. This solution is not an exact least-squares solution; it is an estimate. See also [24-29] for examples of other methods and applications in the subject of pose calculation.

3 Criteria for a Good Fiducial

Clearly there are tradeoffs among the criteria for a good fiducial image. Existing designs for AR fiducials have been ad hoc and have not started with specific design criteria other than support for some level of tracking (planar, pose, etc.). This thesis approaches the problem by asking questions and proposing answers consistent with many applications in augmented reality and commonly available hardware. The questions addressed in this section are:

- What is a good fiducial shape?
- What colors should be utilized in a fiducial image?
- How should a specific fiducial be located in an image?
- How should a specific fiducial be identified?
- Over what range of sizes should the fiducial be identified?
- Should a human be able to decode/identify a fiducial?

The answers to these questions can vary depending on the application or domain in which the fiducials will be used. Some applications may require fiducials with anthropomorphic characteristics; others may be optimized for computer tracking only. Care will be taken, however, to try to make the answers to these questions as generally applicable as possible. Additionally, points that may influence one's decision on the best choice to meet a given criterion will be presented to help make this decision.

3.1 Fiducial Shape

The purpose of a fiducial image is to provide automatic correspondences between points in a camera frame and points in a captured image. Clearly, any visual feature can be used as a fiducial if its location is known (or can be computed) and it can be automatically identified. Indeed, tracking systems designed for use in unprepared environments have been proposed that use regions, lines, and other natural environmental features [30, 31]. Most applications for fiducial images, however, assume a prepared space with specific images placed in the environment, with the assumption that the relative transformation between a camera frame and frames indicated by the fiducials needs to be determined. In tracking terminology, the position and orientation (six degrees of freedom) of the frame marked by fiducials needs to be identified relative to the camera. This problem is also commonly referred to as pose estimation.

Determination of the position and orientation of a physical object relative to a camera frame requires the correspondence of at least four non-collinear points. As an example, estimating the pose of a camera relative to a physical environment will require the identification of four 2D points in the camera image and knowledge of their 3D coordinates in the world coordinate system. It is possible to compute pose from only three points. However, the result is ambiguous, generally having two, and often three or four, solutions [15]. Hence, any ideal fiducial solution supporting 6DOF pose estimation should always emit a minimum of four located points, no three of which are collinear. Additional points can be used to compute least-squares solutions that can average out errors and increase the estimate's accuracy. Many fiducial methods utilize a single, typically very simple, fiducial image such as a ring or disk, with the requirement that multiple fiducials must be simultaneously tracked [14].
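Four coplanar, non-collinear correspondences determine the plane-to-image mapping (a homography) exactly, which is the usual algebraic starting point for planar pose estimation. A hedged direct linear transform (DLT) sketch using numpy; this is a standard textbook formulation, not the implementation used in this thesis:

```python
import numpy as np

def homography_dlt(src, dst):
    """Estimate the 3x3 homography H with dst ~ H * src from four (or
    more) 2D point correspondences via the standard DLT null-space
    formulation."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        # Each correspondence contributes two linear constraints on H.
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    # The homography is the null vector of the stacked constraint matrix.
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]  # fix the arbitrary overall scale
```

With exactly four correspondences the constraint matrix has an exact null vector; with more, the SVD yields a least-squares estimate, matching the point above that additional points average out error.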
Since the location of fiducials in camera images will always be perturbed by noise and quantization error, there is a clear advantage to tracking additional points, so fiducials that emit multiple tracking points seem advantageous. Also, many applications require tracking of styli, independent marked locations, or multiple users, where placement of a large number of fiducial images is prohibitive. An assertion of this thesis is that an ideal fiducial image should emit at least four points.

Beyond that, it is clear that the points should approximate a square. The size of the fiducial equates to resolution in the captured image. Four points not in the form of a square will result in some elements of the image presenting a lesser resolution to the camera than others, thereby decreasing tracking accuracy in the corresponding orientations. This requirement does not necessarily imply that the fiducial image itself must be square. Any image that can emit four points would suffice. However, there are clear computational advantages to simplicity, and a square fiducial image is the simplest possible fiducial emitting four points. The straight edges of a square can be used to compute best-fit lines, allowing corners to be computed with greater, potentially sub-pixel, accuracy. Indeed, the ARToolKit standard fiducial image is a square image.

It should be noted that a circular marker can be used to determine pose if a point on the circle can be determined. The POSE_FROM_CIRCLE algorithm provides a robust solution given circle edge points [13]. However, an interior image for identification is more difficult to implement and cannot be represented in a rectangular array. Most implementations based on pose estimation from circles are based on barcodes (or, more precisely, ringcodes) [12].

3.2 Fiducial Color

The question of fiducial color is much more difficult to address. Clearly, choosing a color fiducial as opposed to monochrome increases the possible set of fiducial images.
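The fit-lines-then-intersect idea can be sketched with a total least-squares fit: every edge pixel contributes to the line estimate, so per-pixel noise averages out and the intersected corner can be more accurate than any single detected edge pixel. A hedged numpy sketch with illustrative function names, not this thesis's implementation:

```python
import numpy as np

def fit_line(points):
    """Total least-squares line fit: returns a point on the line (the
    centroid) and a unit direction vector."""
    points = np.asarray(points, dtype=float)
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    return centroid, vt[0]  # dominant direction of the point cloud

def corner_from_edges(edge1, edge2):
    """Intersect the best-fit lines through two edges to localize the
    shared corner with sub-pixel accuracy."""
    p1, d1 = fit_line(edge1)
    p2, d2 = fit_line(edge2)
    # Solve p1 + t*d1 = p2 + s*d2 for the intersection parameter t.
    t, _ = np.linalg.solve(np.column_stack([d1, -d2]), p2 - p1)
    return p1 + t * d1
```

The SVD-based fit handles near-vertical edges gracefully, which a slope-intercept regression would not.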
Indeed, both color and monochrome images have been utilized in existing systems. However, there are several technical reasons to favor a monochrome fiducial:

- Varying chroma resolution in camera systems
- A smaller image representation
- Higher-performance localization algorithms

The spatial frequency sensitivity of the human visual system for luminance components is much greater than for chrominance components [32]. Unfortunately, many imaging systems designed for computers mimic this characteristic, transmitting chrominance information in lower-bandwidth channels or representing chrominance information with lower resolution. This necessarily decreases the detection resolution for color fiducials. Use of inexpensive web-cams has become very popular for fiducial-based tracking. These cameras clearly exhibit decreased color resolution. Hence, for the most accurate results using a wide range of cameras, a monochrome fiducial image is the best choice. When high-quality cameras are available, color fiducial images can increase the information available in the fiducial image. Even if an RGB color presentation is captured at full resolution, the resulting color image will increase the memory usage and, consequently, the analysis time, by a factor of three (or four). This is a consequence of the increased memory bandwidth requirements.

An additional element in the choice of color or monochrome is the choice of localization algorithms. High-performance algorithms have been developed for color fiducials, but they assume very simple shapes that can be identified by cross-sectional lines [14]. One advantage of color fiducials is the use of color to identify the specific fiducial, as in the multi-ring approach. However, the number of colors that can be uniquely identified varies greatly depending on lighting conditions, and is likely to be small. Specular reflection will not only affect the luminance of an image, but can also modify the hue of imaged colors.
Additionally, the colors must contrast with colors naturally occurring in the scene. One option for color is to utilize retro-reflective fiducials and infrared illumination [33] or direct imaging of infrared emitters [34]. This option is a very different technological approach from visible-image tracking, requiring special camera, illumination, and reflective technologies. In addition, IR fiducials based on retro-reflective materials do not lend themselves well to patterned individual fiducials other than simple binary patterns. As the focus of this thesis is visible-image fiducials, IR approaches are beyond the scope of this discussion.

3.3 Locating the Fiducial

The shape and color of a fiducial are directly related to the algorithm utilized to locate it in the camera image. As mentioned previously, the ARToolKit contains a fiducial tracking system using a square image with a black border, as illustrated in Figure 2-1. An interior image contained within the border provides identification for the particular fiducial image. It is assumed that the marker will contrast with a surrounding region when converted to a binary image. Typically, this contrast can be achieved by simply ensuring that the fiducial is mounted on a white surface or is printed on a larger white sheet of paper. More details of the ARToolKit approach will be included in later sections. The approach of Kato and Billinghurst [4], which assumes a monochrome fiducial image, allows the fiducial corners to be rapidly and accurately located in a camera image.

Is this the best fiducial design for localization, the location of the fiducial in an image? There are several distinct advantages to this design. The square shape yields four corner points for tracking purposes, and the edges are straight between the corner points. This allows the corners to be determined by line fitting to the edges, yielding measurements that are less sensitive to noise in the vicinity of the corner and to quantization errors.
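The locating step described above (binarize the frame, then collect dark connected regions that might be borders) can be sketched as follows. This is a minimal flood-fill sketch; the threshold and minimum-area values are illustrative parameters, not values from this thesis:

```python
import numpy as np
from collections import deque

def dark_regions(gray, threshold=100, min_area=25):
    """Binarize a grayscale image and return bounding boxes
    (x0, y0, x1, y1) of 4-connected dark regions large enough to be
    candidate fiducial borders."""
    mask = gray < threshold          # 1 = candidate dark (border) pixel
    h, w = mask.shape
    seen = np.zeros_like(mask, dtype=bool)
    boxes = []
    for sy in range(h):
        for sx in range(w):
            if mask[sy, sx] and not seen[sy, sx]:
                # Breadth-first flood fill of one connected region.
                queue = deque([(sy, sx)])
                seen[sy, sx] = True
                ys, xs = [], []
                while queue:
                    y, x = queue.popleft()
                    ys.append(y)
                    xs.append(x)
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            queue.append((ny, nx))
                if len(ys) >= min_area:  # discard small noise specks
                    boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes
```

A real detector would follow this with quadrilateral fitting on each region's contour; the region pass simply narrows the search.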
The black border also yields a maximum contrast relative to the background, particularly a white background. Once the corners have been located, the interior can be warped to a common frame of reference (16 by 16 in the ARToolKit approach) for comparison to a database of marker images.

This fiducial approach does not emit an orientation other than through analysis of the interior image; hence the offset of the interior text in the marker image in Figure 2-1. Would it be better to design the outline to emit orientation independent of the interior text? This could be accomplished in a variety of ways, including offsetting the interior image, adding an orientation image in addition to the interior image, or using varying colors on the edge. Varying colors is not considered a good choice, for the reasons mentioned in the previous section and because it would eliminate the homogeneity of the design: detection performance would be determined by the least common denominator of detection of the two types of borders. Offsetting the image or adding an image component for orientation is equivalent to using a larger interior image and determining orientation from the interior image alone. Figure 3-1 illustrates this equivalence. When either the interior image is offset or a special orientation pattern is added, the fiducial can be considered equivalent to a simple border with a larger interior image, as indicated by the dotted lines.

[Figure: three equivalent designs labeled "Equivalent interior region", "Offset interior image", and "Orientation pattern"]

Figure 3-1 - Equivalence of interior images for orientation determination

Given these criteria, the square ARToolKit fiducial outline seems to be a "good" approach. The border width and the interior image will be adjusted in this research, though.

3.4 Fiducial Identification

Once an individual fiducial image is located, it must be identified. The identification of the interior image is simplified if a border has been located.
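The warp of the interior to a common 16 by 16 frame of reference, mentioned above, can be sketched as follows, given a homography H that maps canonical interior coordinates to image pixels. This is a hedged nearest-neighbor sketch (ARToolKit's own resampling may differ), with `warp_interior` and its parameters as illustrative names:

```python
import numpy as np

def warp_interior(gray, H, size=16):
    """Resample the fiducial interior into a size x size canonical image.
    H maps canonical homogeneous coordinates (u, v, 1), with u and v in
    [0, size), to image pixel coordinates; nearest-neighbor sampling
    keeps the sketch short."""
    out = np.zeros((size, size), dtype=gray.dtype)
    for v in range(size):
        for u in range(size):
            # Sample at the center of each canonical cell.
            x, y, w = H @ np.array([u + 0.5, v + 0.5, 1.0])
            xi, yi = int(round(x / w)), int(round(y / w))
            if 0 <= yi < gray.shape[0] and 0 <= xi < gray.shape[1]:
                out[v, u] = gray[yi, xi]
    return out
```

Bilinear interpolation would reduce aliasing at the cost of a slightly longer inner loop; nearest-neighbor suffices to show the structure.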
The interior image can then be warped to a square image with a fixed scale. Clearly, marking a space with identical fiducials would require the analysis of relative placement for identification, so it is advantageous if fiducials are unique. Uniqueness can be accomplished in a variety of ways, including color combinations, bar codes, or patterns. The pattern must be unique and accurately identifiable at a variety of resolutions. Several desirable characteristics for fiducial identification have been collected:

- Orientation identification
- Minimal inter-fiducial correlation
- Resistance to noise or partial obscuring
- A large identification range
- A large fiducial identification space

As discussed, using a fixed monochrome square image, as in the ARToolKit fiducials, is a preferred method. The identification image is then set inside this box. It is also preferred that the orientation, and thereby the correspondence of detected image corners with physical coordinates, is determined by an interior image. Consequently, the image must support determination of a unique orientation. In ARToolKit, fiducials are commonly designed with offset text or blocks that make the orientation unique. A candidate image is then compared to the known images in each of the four possible orientations. This method of comparison necessarily limits what can be selected as a fiducial, particularly if users desire fiducial images with visually perceptible meaning. A key characteristic of fiducial images is that there is minimal inter-fiducial correlation in all orientations.

A variety of methods are possible for comparing images. Mean squared error (MSE) is a common measure of image similarity, particularly when measuring image degradation:

c(I, P) = [ Σ_i Σ_j ( I(i, j) − P(i, j) )² ]^(1/2)        (3-4)
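The rotate-and-compare identification step described above can be sketched as follows, using an RMS error as the similarity score. This is a hedged numpy sketch; ARToolKit's actual matcher differs in detail, and `best_match` is an illustrative name:

```python
import numpy as np

def best_match(candidate, templates):
    """Compare a warped square interior image against each known template
    in all four 90-degree orientations; return (template index, rotation
    count, RMS error) for the best match."""
    best = None
    for idx, template in enumerate(templates):
        for k in range(4):  # the four possible square orientations
            rotated = np.rot90(candidate, k).astype(float)
            err = np.sqrt(np.mean((rotated - template) ** 2))
            if best is None or err < best[2]:
                best = (idx, k, err)
    return best
```

Note that this search only disambiguates orientation if each template is asymmetric under 90-degree rotation, which is exactly why ARToolKit fiducials use offset text or blocks.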