PARALLEL IMPLEMENTATION OF 3-D IMAGE PROCESSING

By

Huan Lin

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

Electrical Engineering

2012

ABSTRACT

PARALLEL IMPLEMENTATION OF 3-D IMAGE PROCESSING

By Huan Lin

A structured light based single-directional 3-D navigation system consists of a projector, a camera, and an infrared light filter. The system serves as a navigation sensor that helps the robot grab an object in front of it. It uses a unique algorithm to calculate the 3-D world coordinates of each point viewed by the camera from the pixel information obtained from the camera. Experiments show that the computation time of the existing algorithm of this navigation system is relatively long, so the system needs to be optimized. The existing algorithm executes nine functions. Comparing the processing time of each function, decoding takes more than half of the entire program's computation time. The existing decoding algorithm in the single-directional 3-D navigation system is designed and implemented as a sequential serial algorithm on the CPU. To improve it, a GPU (graphics processing unit), which has a unique parallel architecture, can be added to the system. However, modifying and re-designing the serial algorithm to efficiently utilize the GPU's parallel architecture is a challenge. In this paper, I present a parallel decoding algorithm using a CUDA-based GPU that accelerates the existing serial decoding algorithm by more than 2000 times.

ACKNOWLEDGMENTS

To Dr. Xi for always guiding me and supporting me during my study time at MSU. To the committee members who spent time reviewing my thesis and giving me suggestions. To all my professors in the ECE Department for always coaching me and supporting me. To all the faculty and staff for always helping me and assisting me. To my awesome parents for always believing in me and loving me. To my grandparents for inspiring me in my childhood, and for always loving me and supporting me.

TABLE OF CONTENTS

List of Tables
List of Figures
1. An Introduction to Parallel Computing
   1.1 What is Parallel Computing
   1.2 Motivation
   1.3 Requirements of the Application
   1.4 Challenge and Difficulties
2. CUDA Parallel Computing Architecture
   2.1 History of GPU
   2.2 NVIDIA's GPU Architecture
   2.3 Data Parallelism
   2.4 CUDA Program Structure
3. Parallel Computing Algorithm of 3-D Image Processing
   3.1 Software Working Principle
   3.2 Sorting Algorithm
   3.3 Decoding Algorithm
      3.3.1 k-NN Algorithm
      3.3.2 Decoding Implementation
4. Experiments and Improvements
5. Conclusion and Future Work
6. Bibliography

LIST OF TABLES

Table 2.1 Four prerequisites for data parallelism implementation
Table 3.1 Sorting Algorithm Conditions
Table 3.2 Time consumption comparison (Decoding vs. Complete Program Implementation)
Table 3.3 Decoding Implementation steps
Table 4.1 Time comparison with 128 threads/block
Table 4.2 Time comparison with 512 threads/block
Table 4.3 Total Application Time comparison w/ 128 threads/block
Table 4.4 Total Application Time comparison w/ 512 threads/block

LIST OF FIGURES

Figure 1.1 Single-directional Camera System
Figure 1.2 Single-directional Camera System used as a navigation sensor on the Robot
Figure 1.3 Software Working Principle
Figure 1.4 Projected Marker Pattern
Figure 1.5 Markers orientation
Figure 1.6 3×3 Matrices and Codeword formation
Figure 1.7 Serial Decoding Algorithm Flow Chart
Figure 2.1 Comparison of CPU and GPU Architecture
Figure 2.2 Nvidia CUDA-capable GPUs - Fermi Architecture
Figure 2.3 Comparison of traditional GPU Architecture and Fermi Architecture
Figure 2.4 SM Structure
Figure 2.5 Detail Structure of each SM
Figure 2.6 Fermi Memory Hierarchy
Figure 2.7 Data parallelism in matrix multiplication
Figure 2.8 CUDA Program Execution Structure
Figure 2.9 Memory transfer process
Figure 2.10 The Grid of thread blocks
Figure 3.1 Decoding Algorithm
Figure 3.2 k-NN Sorting of Decoding Algorithm
Figure 4.1 Computation time comparison (128 threads/block)
Figure 4.2 Computation time comparison (512 threads/block)
Figure 4.3 Average Computation Time Comparison
Figure 4.4 Total Application Performance Time Comparison
Figure 4.4.1 Input Image #1, Output image comparison
Figure 4.4.2 Input Image #2, Output image comparison
Figure 4.4.3 Input Image #3, Output image comparison
Figure 4.4.4 Input Image #4, Output image comparison
Figure 4.4.5 Input Image #5, Output image comparison
Figure 4.4.6 Input Image #6, Output image comparison
Figure 4.4.7 Input Image #7, Output image comparison
Figure 4.4.8 Input Image #8, Output image comparison
Figure 4.4.9 Input Image #9, Output image comparison
Figure 4.4.10 Input Image #10, Output image comparison
Figure 4.4.11 Input Image #11, Output image comparison
Figure 4.4.12 Input Image #12, Output image comparison
Figure 4.4.13 Input Image #13, Output image comparison
Figure 4.4.14 Input Image #14, Output image comparison
Figure 4.4.15 Input Image #15, Output image comparison
Figure 4.4.16 Input Image #16, Output image comparison
Figure 4.4.17 Input Image #17, Output image comparison
Figure 4.4.18 Input Image #18, Output image comparison
Figure 4.4.19 Input Image #19, Output image comparison
Figure 4.4.20 Input Image #20, Output image comparison
Figure 4.4.21 Input Image #21, Output image comparison

Chapter 1

An Introduction to Parallel Computing

1.1 What is Parallel Computing

Microprocessors based on a single central processing unit (CPU) have played a major role in the rapid computation speed increases and cost reductions of computer applications. In the past, scientists focused on developing CPU application software to improve computer performance. These performance improvements have allowed application software to provide more functionality, better user interfaces, and more useful results. During this period, CPU serial computing was used to develop the algorithms that sped up applications under the hood. Software written for serial computation, to be run on a single central processing unit (CPU), first breaks a problem into a discrete series of instructions and then executes the instructions sequentially, one at a time. For large-scale data streams, the CPU's hardware architecture and serial program structure put applications at a disadvantage in both computation time and cost.

To improve on CPU serial computing, scientists developed a new computing method called parallel computing. Parallel computing is an evolution of serial computing that significantly decreases the computation time and the cost of applications. Parallel computing was first introduced as the use of a single computer with multiple processors to solve a computational problem simultaneously. The program structure is to first break a problem into discrete pieces of work which can be solved concurrently, and then execute multiple program instructions from each piece simultaneously on different CPUs. Using multiple compute resources (CPUs) within a single computer to solve a computational problem simultaneously significantly improved on the computation ability of CPU serial computing. In this approach, the CPUs need to perform the computation tasks as well as the data transfer tasks between CPUs. However, for many applications in areas such as atmosphere, earth, and environment analysis, applied and nuclear physics, and biomedical image processing, which involve very large-scale data stream computation and transfer, the performance level of parallel computing using multiple CPUs is still not enough. This problem can be solved by adding hardware that handles the parallel computing tasks independently of the CPU; the CPU can then perform the other necessary serial computing tasks and handle data transfer. In this approach, much of the large data stream transfer time is saved. The add-on hardware for this approach has to have high computation ability and good communication ability with the CPU.
The GPU (graphics processing unit), traditionally used mainly for image processing, has been found to be well-developed hardware for speeding up the computation of large data stream programs. It serves as an accelerator to the CPU in parallel computing applications, and this type of parallel computing was named GPU parallel computing. Due to its significant improvements, the characteristics of parallel computing using GPUs have been increasingly evaluated and researched by hardware and software developers. When developing software using GPU parallel computing, scientists need to consider not only the program structure of the programming language but also the hardware architecture characteristics. Not every serial computing algorithm can be directly translated into a GPU parallel computing algorithm. How to implement and redesign an existing serial computing algorithm using GPU parallel computing, and how to efficiently utilize the GPU for the best parallel computing performance, are the main focus of researchers.

1.2 Motivation

Structured light vision systems have been successfully used to obtain the 3-D model of an object. To achieve accurate 3-D measurement in real time, fast computation speed is important. A structured light based single-directional 3-D camera system, consisting of a projector, a camera, and an infrared light filter, was used in this research as a navigation sensor on the robot to generate 3-D image information in real time for robot remote control, as shown in Figure 1.2 [13]. Figure 1.1 shows the hardware components of this 3-D camera system.

The flow chart of the software working principle of this single-directional 3-D camera system is shown in Figure 1.3. The camera takes a picture of the object in front of the robot and returns the pixel information of the image to the computer. After switching the camera to infrared mode, we retake the image and assign the pre-defined projector marker pattern to the image. The pre-defined projector marker pattern is shown in Figure 1.4. The computer extracts each marker's camera coordinates to calculate the orientation 'o' of each marker, then passes the camera coordinates and orientation of each marker to the next step, decoding. The pre-defined projector marker pattern has four types of markers, shown in Figure 1.5. The orientation 'o' is defined by the angle between the central moment of each marker and the x-axis, shown in Figure 1.5. On a 360° basis, if the angle is 0° or 180°, we assign the orientation 0; if the angle is 45° or 225°, orientation 1; if the angle is 90° or 270°, orientation 2; and if the angle is 135° or 315°, orientation 3.

Decoding then reads the information of each marker and decodes it into several 3×3 matrices, shown in Figure 1.6. Each marker is treated as a target marker: each target marker's camera coordinates and orientation are assigned to the center of its own 3×3 matrix, and the remaining eight positions are filled by the eight markers at the shortest distances from the target marker. These eight markers are allocated according to their angle 'θ', shown in Figure 1.6. After the matrices are formed, we use the nine markers' orientations to form a nine-digit codeword, shown in Figure 1.6 (a hedged sketch of this mapping is given below). Then we send these codewords to the computer and perform the next step, correspondence matching with the codewords extracted from the projector.
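To make the mapping concrete, here is a minimal sketch in CUDA-compatible C of one way the angle-to-orientation assignment and nine-digit codeword assembly could be written. The helper names orientation_from_angle and make_codeword are hypothetical illustrations, not the thesis implementation, and angles are assumed to lie in [0°, 360°):

    #include <stdio.h>

    /* Hypothetical helper: map a marker's central-moment angle (degrees,
       assumed in [0, 360)) to its orientation code 'o'. Angles 180 degrees
       apart share a code: 0/180 -> 0, 45/225 -> 1, 90/270 -> 2, 135/315 -> 3. */
    int orientation_from_angle(float deg)
    {
        int folded = ((int)(deg + 22.5f)) % 180; /* round to nearest 45-degree bin, fold mod 180 */
        return folded / 45;                      /* orientation 0, 1, 2, or 3 */
    }

    /* Hypothetical helper: concatenate nine orientations (target marker first)
       into the nine-digit codeword used for correspondence matching. */
    void make_codeword(const int o[9], char out[10])
    {
        for (int i = 0; i < 9; ++i)
            out[i] = (char)('0' + o[i]);
        out[9] = '\0';
    }

    int main(void)
    {
        int o[9] = { orientation_from_angle(225.0f), 0, 1, 2, 3, 0, 1, 2, 3 };
        char cw[10];
        make_codeword(o, cw);
        printf("codeword: %s\n", cw);  /* prints 101230123 */
        return 0;
    }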
After matching is confirmed, we perform 3-D reconstruction to get the 3-D world coordinates of the object and transfer this information to form a color fusion image for the robot to view.

For all 3-D sensing methods, speed and accuracy are the main challenges. To overcome these challenges, we have to consider how to compute large data arrays quickly enough for real-time operation. The goal of this research is to speed up the computation time of the entire program and ultimately achieve video rate, so the robot can quickly and accurately grab the object in front of it. Experiments show that, among all nine functions, decoding takes the most processing time: on average, about 57% of the entire software execution time. Therefore, optimizing the decoding algorithm is the first and most important step. The details of the serial decoding algorithm are shown in the flow chart in Figure 1.7. This paper introduces a new parallel decoding algorithm for the single-directional 3-D camera system to achieve significantly faster computation speed on high-dimensional, large data streams.

1.3 Requirements of the Application

The requirements of this application are hardware selection and the development and implementation of a software sorting algorithm. Data mining has become a hot research domain. It has been widely applied to a variety of fields, such as business intelligence, customer relationship management, navigation systems, scientific simulation, statistical model configuration, e-commerce, and biomedical data computation. Instead of implementing large data input on a serial CPU, the parallel GPU has emerged as a co-processor to the CPU, achieving a high overall throughput that speeds up general-purpose applications. GPUs provide tremendous memory bandwidth and computational horsepower, not only for vertex and pixel processing pipelines but also for non-graphical general-purpose applications.

Sorting is used in a wide range of applications, such as geographic information systems, computational biology, and search engines. Any application that relies on database operations may benefit from an efficient sorting algorithm. In this research, I present an efficient parallel sorting algorithm to quickly decode the 3-D image information generated by the single-directional camera system.

1.4 Challenge and Difficulties

As parallel architectures become common in consumer hardware, parallel algorithms such as parallel sorting are becoming more and more important to researchers. Various parallel data mining algorithms have been introduced and studied, such as parallel clustering, parallel decision trees, k-NN (k nearest neighbor), and parallel reduction. CUDA-based general-purpose parallel computing has become popular in the past few years. The challenge of implementing a one-shot algorithm for this single-directional 3-D camera system in real time is to minimize the computation time of decoding. How to accurately and quickly decode the large data input from the camera is the focus of this research. To overcome these difficulties, the key is selecting and implementing an efficient sorting method for the large data stream to optimize the existing decoding method. The single-directional 3-D camera system Chi presented [13] was implemented with OpenCV-based serial CPU code.
This decoding algorithm is a serial computing algorithm which breaks the problem into a discrete series of instructions and then executes each instruction sequentially. The program structure of this algorithm executes one instruction at a time, which introduces substantial wait time between instructions. Using this serial decoding algorithm, it is impossible to achieve the goal of video rate for the 3-D navigation sensor. Therefore, to solve the problem I need not only to implement a new efficient parallel sorting method but also to add a computation accelerator to the 3-D camera system. Studies show the computing power of a modern GPU is equivalent to that of a medium-sized supercomputer. As introduced in the previous section on parallel computing, GPU parallel computing is the best approach to achieve fast computation for the 3-D camera system on high-dimensional, large data streams. NVIDIA's GPU with the CUDA environment, which uses extended C as the programming language for the GPU, is the newest and most powerful parallel computing platform, and its unique hardware architecture is well known in the market for performing parallel computing tasks. In this research, I use NVIDIA's GeForce GTX 470 GPU to implement a parallel decoding algorithm for the single-directional 3-D camera system and achieve a significant speedup over the existing serial decoding algorithm. In addition, for consistency, the newly designed parallel decoding algorithm has to have the same accuracy as the existing serial decoding approach.

Figure 1.1 Single-directional Camera System (detail view: infrared light filter, projector, camera). For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this thesis.

Figure 1.2 Single-directional Camera System used as a navigation sensor on the Robot

Figure 1.3 Software Working Principle

Figure 1.4 Projected Marker Pattern

Figure 1.5 Markers orientation

Figure 1.6 3×3 Matrices and Codeword formation

Figure 1.7 Serial Decoding Algorithm Flow Chart

Chapter 2

CUDA Parallel Computing Architecture

2.1 History of GPU

GPU stands for graphics processing unit. For many years, GPUs have played an important role in displaying images and motion on computer displays. Common usages of GPUs include embedded systems, mobile phones, personal computers, workstations, and game consoles. However, modern GPUs are efficiently used not only for manipulating computer graphics but also for large data stream parallel computing algorithms. GPU parallel computing is considered "the high end of computing." It has been widely used to model difficult problems in many areas of science and engineering, such as atmosphere and earth structure analysis, applied physics applications, particle movement examinations, and biotechnology. Beyond the science and engineering fields, the GPU is also commonly used to satisfy the requirements of commercial applications that process large amounts of data in sophisticated ways, such as data mining, oil exploration, web search engines, web-based business services, medical imaging and diagnosis, financial and economic modeling, networked video and multimedia technologies, and collaborative work environments.

The term GPU was defined and popularized by Nvidia in 1999, when it marketed the GeForce 256 as "the world's first GPU".
It was a single-chip processor which integrated transform, lighting, triangle setup, clipping, and rendering engines, capable of processing a minimum of 10 million polygons per second. Several parallel computing architectures have since been developed for GPUs. The most powerful one, developed by Nvidia, is called the Compute Unified Device Architecture (CUDA). CUDA gives developers access to the virtual instruction set and memory of the parallel computation elements in CUDA GPUs. Rather than executing instructions sequentially in a single thread, as in the CPU structure, GPUs have a parallel throughput architecture which executes many threads concurrently, carrying out the instructions on many data elements at the same time.

2.2 NVIDIA's GPU Architecture

The large performance gap between many-core GPUs and general-purpose multicore CPUs stems from the differences in the fundamental design philosophies of the two types of processors, illustrated in Figure 2.1. The design of a CPU is optimized for sequential code performance: it makes use of sophisticated control logic to allow instructions from a single thread of execution to execute in parallel while maintaining the appearance of sequential execution, and large cache memories are provided to reduce the instruction and data access latencies of large, complex applications. Neither control logic nor cache memories contribute to the peak calculation speed [1].

A CUDA-capable GPU is organized into an array of highly threaded streaming multiprocessors (SMs), illustrated in Figure 2.2. In this example, four SMs form a building block; however, the number of SMs in a building block can vary from one generation of CUDA GPUs to another. Each SM has a number of streaming processors (SPs) that share control logic and an instruction cache. Each GPU currently comes with up to 4 gigabytes of graphics double data rate (GDDR) DRAM, referred to as global memory [1]. These GDDR DRAMs are essentially the frame buffer memory used for graphics; they can store video images and texture information for three-dimensional rendering.

The NVIDIA GeForce GTX 400 family of GPUs is based on NVIDIA's Fermi architecture. The G80 was NVIDIA's initial vision of what a unified graphics and compute processor should look like, and the subsequent GT200 extended the performance and functionality of the G80. The Fermi architecture is the most significant leap forward in GPU architecture since then, and it is the key to what makes parallel computing on NVIDIA's GPUs more reliable and efficient than on the others [2].

Traditional, non-scalable GPU designs use a single geometry engine to perform tessellation, just as early GPU designs performed pixel shading with a single pixel pipeline. Noting how much impact the growth of the pixel pipeline from a single unit to many parallel units had on 3-D realism, Nvidia designed its tessellation architecture to be parallel [3]. The most successful parallel tessellation units Nvidia has made are in the Fermi architecture GPUs, which implement up to fifteen parallel tessellation units, as shown in Figure 2.3. In this architecture, each tessellation unit has its own dedicated shading resources. Up to four parallel raster engines transform newly tessellated triangles into a fine stream of pixels for shading [3]. The close coupling of the tessellation, shading, and raster units provides enormous on-chip bandwidth and high execution efficiency [3].
The result is a breakthrough in tessellation performance at over 1.6 billion triangles per second [3]. Compared to other products, Fermi GPUs are 2-8X faster, as measured by independent reviews using Microsoft's DirectX 11 software development kit [3]. This is why I decided to use a Fermi GPU as the hardware on which to optimize the decoding approach.

The heart of the GPU is the streaming multiprocessor (SM), which performs the vital functions such as tessellation, pixel shading, physics, and compute calculations. The structure of the SM can be seen in Figure 2.4 and Figure 2.5. The graphics card used for this research is the GeForce GTX 470. The SM of the GeForce GTX 470 is a highly parallel processor utilizing superscalar execution for optimal performance. Unlike thread-level parallelism, superscalar execution is a technique that allows the program to execute sequential instructions in parallel. Since the same program executes in less time, superscalar execution not only improves throughput but also improves latency [3]. The GTX 470 SM is designed to perform two instructions per thread per clock, and two warps (groups of 32 threads) execute concurrently in the SM, so a peak of four instructions per clock is realized [3]. The GeForce GTX 470 has fourteen tessellation engines, 448 CUDA cores, and 1280 MB of memory [3]; the 448 CUDA cores are organized as fourteen streaming multiprocessors (SMs) of 32 cores each. The GeForce GTX 480, introduced by NVIDIA alongside it, facilitates greater instruction throughput by enabling a fifteenth SM [3]. Each SM also contains texture units and eight special function units (SFUs).

Fermi is the first GPU architecture with fully cached memory access, available to all graphics and compute programs. As shown in Figure 2.6 (Fermi Memory Hierarchy), programs can access a texture cache, an L1 cache, and an L2 cache. The texture cache enables fast and efficient texture filtering [3], and the L1 and L2 caches improve performance for programs with random memory access patterns such as ray tracing and physics [3]. General-purpose GPU applications such as video transcoding and photo processing can also quickly access the shared memory for storage and data transfer.

Figure 2.1 Comparison of CPU and GPU Architecture

Figure 2.2 Nvidia CUDA-capable GPUs - Fermi Architecture

Figure 2.3 Comparison of traditional GPU Architecture and Fermi Architecture

Figure 2.4 SM structure of a GPU

Figure 2.5 Detail Structure of each SM

Figure 2.6 Fermi Memory Hierarchy

2.3 Data Parallelism

Nowadays, data parallelism has been widely and successfully implemented to improve human life. Consider digital high-definition television: all the processing necessary for HDTV is a highly parallel process, as are 3-D imaging and audio coding and manipulation. Another benefit of parallelism is that it allows developers to build much better user interfaces, enabled by greater computing speed than serial computing offers. The Apple iPhone interface is one example: because of the increased speed, customers enjoy a much more natural touch-screen interface compared to other cell phone devices.
In the future, these devices will have more functions incorporating higher definition, 3-D imaging, and computer vision. GPU-accelerated computer vision leverages the computational resources of the GPU to speed up the frame rate of the camera pipeline. How to exploit the computational resources and parallelism offered by modern programmable GPUs in the context of computer vision is a challenge. A good implementation on a GPU can be more than 100 times faster than sequential execution on a CPU; however, to achieve this, the application must be suitable for parallel execution. Therefore, how to design the algorithm and implement it to fit and exploit the GPU's parallel computing characteristics is a challenge.

Data parallelism refers to a program performing many arithmetic operations on data structures simultaneously. A simple example used in many books to illustrate the concept of data parallelism is matrix-matrix multiplication, and we can use the same example here to explain how data parallelism is processed. In the example of Figure 2.7, we generate the product matrix P by performing a dot product between each row of matrix M and each column of matrix N; in this example, M, N, and P are square matrices of the same width. All of the dot products can be performed simultaneously, since none of the elements of matrix P affects the result of any other. This is called parallel multiplication. For large amounts of data parallelism, a GPU with CUDA can significantly accelerate the execution of the matrix multiplication over a host CPU.
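To make the idea concrete, the following is a minimal CUDA sketch of such a data-parallel matrix multiplication kernel, assuming square width × width matrices stored in row-major order; the kernel name matMulKernel and the launch configuration are illustrative assumptions, not code from this thesis:

    /* Each thread computes one element of P = M * N independently of all
       the others, which is exactly the data parallelism described above. */
    __global__ void matMulKernel(const float *M, const float *N, float *P,
                                 int width)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < width && col < width) {
            float sum = 0.0f;
            for (int k = 0; k < width; ++k)
                sum += M[row * width + k] * N[k * width + col];
            P[row * width + col] = sum;
        }
    }

    /* Example launch: one thread per element of P, 16 x 16 threads per block.
       dim3 block(16, 16);
       dim3 grid((width + 15) / 16, (width + 15) / 16);
       matMulKernel<<<grid, block>>>(d_M, d_N, d_P, width);               */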
The benefit depends on how much of the application can be parallelized. According to [1], if the portion of the application that can be parallelized accounts for 30% of the execution time, a 100-times speedup of that portion only reduces the total execution time by 29.7%, so the entire application runs only about 1.4 times faster. However, if 99% of the program runs in parallel computation mode, a 100-times speedup reduces the application execution time to 1.99% of the original, which means the entire application is about 50 times faster [1]. This is Amdahl's law: with a fraction p of the execution time parallelized and a speedup of s on that fraction, the overall speedup is 1 / ((1 - p) + p/s). Therefore, the more of the execution that runs in parallel computation mode, the faster the entire application can be [1].

Various approaches to data parallelism exist; however, four prerequisites must be satisfied before implementing the program, as shown in Table 2.1:

1) Select supported hardware which suits the programming model and facilitates parallel implementation.
2) Determine the portion of the application that can be parallelized.
3) The programming model must not hinder parallel implementation.
4) Data delivery must be properly managed in order to optimize the use of GPUs.

Table 2.1 Four prerequisites for data parallelism implementation

Figure 2.7 Data parallelism in matrix multiplication

2.4 CUDA Program Structure

A CUDA program has two types of phases, executed on either the host (CPU) or a device such as a GPU. The phases that exhibit little or no data parallelism are implemented in host code [20], while the phases that exhibit a large amount of data parallelism are implemented in device code [20]. A CUDA program is a single structured source code containing both host and device code; the two are separated by the NVIDIA C compiler during the compilation process. The host code is straight ANSI C code, compiled with the standard C compilers on the host CPU.

A function compiled for the device is called a kernel. It is written in ANSI C, extended with keywords for labeling data-parallel functions and their associated data structures. A kernel is executed on the device as many different threads, and these threads work in parallel. The kernel code is compiled by nvcc and executed on the GPU device: any source file containing CUDA language extensions must be compiled with nvcc, the compiler driver of CUDA. From such a file, nvcc generates host CPU code, written in pure C, which is compiled together with the other parts of the application. Executables with CUDA code require the CUDA runtime library (cudart) and the CUDA core library (cuda).

To exploit data parallelism, kernel functions generate a large number of threads. In the matrix multiplication example of Figure 2.7, the parallel computation over the elements of matrix M and matrix N can be implemented as a kernel where each thread computes one element of matrix P, so the number of threads used by the kernel is determined by the dimension of the matrix. For a 1000×1000 matrix multiplication, invoking the kernel generates 1,000,000 threads, which compute the elements of matrix P simultaneously. CPU threads typically require thousands of clock cycles to generate and schedule; in contrast, CUDA GPU threads can be assumed to take very few cycles to generate and schedule, thanks to the efficient hardware architecture design.

A device (GPU) is viewed as a compute device operating as a coprocessor to the main host (CPU). Compute-intensive functions should be off-loaded to the device [2]; functions that are executed many times, but independently on different data, are prime candidates [2]. Once a kernel is invoked, execution moves to the device (GPU), which performs the calculation over abundant data parallelism using a large number of threads. All the threads generated by a kernel during an invocation are collectively called a grid [20]. A launch of the form KernelA<<<dimGrid, dimBlock>>>(...) tells the device how many blocks to generate and how many threads to generate in each block. In Figure 2.8 there are two grids of threads. Once all threads of the first kernel complete their execution, the corresponding grid terminates, and execution continues with the CPU serial code on the host until the second kernel is invoked.

Figure 2.10 illustrates the structure of the grid of thread blocks in more detail. Each computational grid consists of a grid of thread blocks, and each thread executes the kernel. The application specifies the grid and block dimensions; the shape of the grid can be 1, 2, or 3-dimensional, and the maximum size of each grid is determined by the GPU hardware's memory and the kernel complexity. Each block has a unique block ID, and within one block each thread has a unique thread ID. The design challenge is how to specify the shape of the grid, the number of blocks, the number of threads, and the data handled by each thread.

The host and device each manage their own memory, called host memory and device memory, and data can be copied between them [2]. The CUDA device memory model is illustrated in Figure 2.9. How the various memory types of a device are allocated and used is very important for accelerating the application.
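As a hedged illustration of this host/device memory flow (the function and variable names are assumptions, not the thesis code), the sketch below allocates device global memory, copies the input in, launches a kernel over a grid of blocks, and copies the result back to the host:

    #include <cuda_runtime.h>

    __global__ void kernelA(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = 2.0f * in[i];            /* placeholder per-element work */
    }

    void runOnDevice(const float *h_in, float *h_out, int n)
    {
        float *d_in, *d_out;
        size_t bytes = n * sizeof(float);

        cudaMalloc(&d_in, bytes);             /* allocate device global memory */
        cudaMalloc(&d_out, bytes);
        cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

        int threadsPerBlock = 128;
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        kernelA<<<blocks, threadsPerBlock>>>(d_in, d_out, n);

        cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
        cudaFree(d_in);
        cudaFree(d_out);
    }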
As shown in Figure 2.9, global memory and constant memory are the memories through which the host code can transfer data to and from the device. Constant memory allows read-only access by the device code; mostly, only constants are placed in constant memory. Most variables are placed in global memory, because these variables can be transferred to and from the device, as shown in Figure 2.9. The transfer is asynchronous.

Figure 2.8 CUDA Program Execution Structure

Figure 2.9 Memory transfer process

Figure 2.10 The Grid of thread blocks

Chapter 3

Parallel Computing Algorithm of 3-D Image Processing

3.1 Software Working Principle

The flow chart of the software working principle of this single-directional 3-D camera system is shown in Figure 1.3. The camera takes a picture of the object in front of the robot and returns the pixel information of the image to the computer; the camera then switches to infrared mode, retakes the image, and assigns the pre-defined projector marker pattern to the image. The pre-defined projector marker pattern is shown in Figure 1.4. The computer extracts each marker's camera coordinates to calculate the orientation 'o' of each marker, then passes the camera coordinates and orientation of each marker to the next step, decoding. The pre-defined projector marker pattern has four types of markers, shown in Figure 1.5. The orientation 'o' is defined by the angle between the central moment of each marker and the x-axis, shown in Figure 1.5. On a 360° basis, if the angle is 0° or 180°, we assign the orientation 0; if the angle is 45° or 225°, orientation 1; if the angle is 90° or 270°, orientation 2; and if the angle is 135° or 315°, orientation 3.

Decoding then reads the information of each marker and decodes it into several 3×3 matrices, shown in Figure 1.6. Each marker is treated as a target marker: each target marker's camera coordinates and orientation are assigned to the center of its own 3×3 matrix, and the remaining eight positions are filled by the eight markers at the shortest distances from the target marker. These eight markers are allocated according to their angle 'θ', shown in Figure 1.6. After the matrices are formed, we use the nine markers' orientations to form a nine-digit codeword, shown in Figure 1.6. Then we send these codewords to the computer and perform correspondence matching with the codewords extracted from the projector. After matching is confirmed, we perform 3-D reconstruction to get the 3-D world coordinates of the object and transfer this information to form a color fusion image for the robot to view.

3.2 Sorting Algorithm

A sorting algorithm is an algorithm that puts the elements of a list in a certain order. As one of the most widely studied problems in computer science, sorting is fundamental to many applications that use a database. Applications such as geographic information systems, search engines, and computational biology all need efficient sorting approaches to accelerate searching as a preprocessing step. The output of any sorting algorithm must satisfy the two conditions listed in Table 3.1:

1) The output is in nondecreasing order (each element is no smaller than the previous element according to the desired total order);
2) The output is a permutation (reordering) of the input.
Table 3.1 Sorting Algorithm Conditions

Compared with the main trend in current CPU architectures, the highly parallel GPU architecture supports and executes thousands of concurrent threads to optimize the utilization of computational resources. Parallel architectures have started to find their way into the development of formerly non-parallel applications, as they provide efficient implementations that allow data-parallel processing of individual input tiles by blocks of fine-grained concurrent threads. Parallel algorithms such as parallel sorting are therefore becoming more and more important. There are many different sorting approaches for different usages, such as bubble sort, quicksort, shaker sort, and heapsort. To benefit our application, choosing, modifying, and creating a parallel sorting approach is the key to speeding up the computation time while maintaining the same accuracy.

3.3 Decoding Algorithms

The goal of the single-directional 3-D camera system is to provide the robot with real-time 3-D coordinate information about the tested object image captured by the vision sensor. This type of vision sensor can be utilized in many applications, such as robot navigation, remote manipulation, auto-driving systems, map reconstruction, and pedestrian detection. These applications all require fast, accurate, real-time response. Therefore, for the large-scale data obtained from the camera, an efficient algorithm which meets these requirements is the key to implementing this application.

As described in Section 1.2, there are nine steps in the real-time algorithm for this single-directional 3-D camera, shown in Figure 1.3. Comparing the time each step takes, decoding is the most time consuming, as can be seen in Table 3.2. For example, for an image which takes about 13 seconds to run through the entire program, decoding takes about 8.7 seconds; the average computation time of decoding is about 57% of the entire program. Therefore, to give the single-directional 3-D camera system faster processing speed while maintaining the same accuracy, re-designing the previous decoding algorithm is necessary.

The existing serial decoding algorithm contains six steps, shown in Figure 1.7. Summarizing these necessary steps, we can re-organize the decoding algorithm into the two major computation steps shown in Table 3.3:

1) Sort the point clouds to form, for each point, a list of the nine nearest neighbor points. This list carries each point's 2-D coordinates (x, y) and orientation information 'o'.

2) Perform the angle calculation within each group in the sorted list from step 1. Use these angles and the previously generated sorting list to find the necessary points' coordinates and form the codeword list for the next step, correspondence matching.

Table 3.3 Decoding Implementation steps

An efficient sorting approach can lead to a major improvement in the computation speed of the entire decoding algorithm, and it is important for optimizing the use of the other algorithms in the following steps. Analyzing these two steps of decoding, the first step, sorting, takes up most of the decoding time, due to the serial computation over large-scale data. Therefore, implementing an efficient sorting algorithm which meets the speed and accuracy requirements is important. There are many data sorting algorithms for finding minimum-cost paths through point clouds.
Given the sorting purpose of step 1 in Table 3.3, the k-nearest neighbor algorithm (k-NN), a method for classifying objects based on the closest training examples in the feature space, is the technique we decided to implement for our application.

3.3.1 k-NN Algorithm

The k-nearest neighbor (k-NN) algorithm has been widely applied in classification, where its computational complexity and accuracy compare with techniques such as decision trees and neural networks. It is a machine learning algorithm in which an object is classified by a majority vote of its k neighbors, where k is a positive integer. For example, if k = 1, the output is simply the class of each object's nearest neighbor. This algorithm is commonly applied to small databases, because to find the k nearest neighbors of each target object, k-NN has to scan all the objects among the reference data points: if there are n target objects and we look for m nearest neighbors of each object, m × n computation steps are needed. Because of the complexity of serial k-NN computation on the CPU, the larger the database, the slower the application responds. The challenge of this research is to implement the decoding process quickly on large-scale datasets. Therefore, the slow response of k-NN computation on a large-scale database needs to be improved, and parallel computing with the GPU helps us resolve this issue. In Section 3.3.2 we present an efficient parallel sorting implementation built on k-NN to accelerate the computation process of the 3-D single-directional camera system.

The k-NN algorithm computes in the following steps (a hedged CUDA sketch follows this list):

1) For each target object m, compute the Euclidean distance from m to every other object in the reference database.

2) Order the samples by the calculated distance.

3) Choose the optimal k closest points to the target object m.
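The following is a minimal CUDA sketch of this brute-force k-NN procedure with one thread per target point; the kernel name knnKernel, the Point structure, and the insertion-based selection of the k smallest distances are illustrative assumptions, not the thesis implementation:

    #define K 9   /* the target point itself plus its eight nearest neighbors */

    struct Point { float x, y; int o; };   /* 2-D coordinates and orientation */

    /* Each thread scans the whole reference set for one target point and
       keeps the K smallest squared distances via sorted insertion. */
    __global__ void knnKernel(const Point *pts, int n, int *nbrIdx /* n*K */)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= n) return;

        float bestD[K];
        int   bestI[K];
        for (int j = 0; j < K; ++j) { bestD[j] = 1e30f; bestI[j] = -1; }

        for (int i = 0; i < n; ++i) {
            float dx = pts[i].x - pts[t].x;
            float dy = pts[i].y - pts[t].y;
            float d  = dx * dx + dy * dy;      /* squared Euclidean distance */
            for (int j = 0; j < K; ++j) {      /* insert (d, i) if it qualifies */
                if (d < bestD[j]) {
                    for (int s = K - 1; s > j; --s) {
                        bestD[s] = bestD[s - 1];
                        bestI[s] = bestI[s - 1];
                    }
                    bestD[j] = d;
                    bestI[j] = i;
                    break;
                }
            }
        }
        for (int j = 0; j < K; ++j)
            nbrIdx[t * K + j] = bestI[j];      /* bestI[0] is the target itself */
    }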
3.3.2 Decoding Implementation

The input data structure for the decoding function is a three-dimensional array which contains each marker's camera coordinates (x, y) and orientation 'o', generated by the previous image processing function. In the C++ serial program, these data are defined as vectors (x, y, o). Even though 'o' does not represent a depth dimension z, the parallel decoding algorithm still treats the data as a three-dimensional structure, because all three pieces of information are needed to perform the decoding.

CPU serial decoding algorithm and implementation

The previous sequential CPU decoding algorithm first sequentially finds the nearest nine points of each point, puts these nine points into a 3×3 matrix named mask[0] to mask[8] (mask[0] represents the original point), and then calculates the angles from the remaining eight points to the original point mask[0]. It compares these angles to the eight pre-defined angle positions (0°, 45°, 90°, 135°, 180°, 225°, 270°, 315°) to classify the codeword for the next function in the single-directional camera system algorithm. For any group of nine whose neighbors cover all eight angle positions, it returns a nine-digit codeword built from their corresponding orientation values 'o', then pushes back the respective x and y coordinates with this codeword to form a decoding list for future use. As described in the previous chapter, if I can implement this decoding algorithm on the GPU's parallel architecture, the computation time will be shorter and the whole application's response speed will improve.

However, the question is: if I directly run this sequential algorithm on the GPU, will it utilize the GPU well and fit the GPU's parallel architecture? The answer is no. The reasons it cannot be directly ported to the GPU as-is are as follows:

1) GPU parallel computing handles simple computations well, but not long looping sequential calculations. When the kernel is invoked, many threads are generated and calculate at the same time, and each thread would have to read all the data points in the reference database and compare its point with all the others to find the 9 nearest points. This creates a heavy sequential workload for each thread; if the data set is very large, the GPU may fail.

2) The data type cannot be defined as three-dimensional on the GPU, because in (x, y, o), 'o' is not a direction z. It is therefore hard to use 'o' to design the grid and block shape. However, 'o' is a key characteristic for creating the decoding list and cannot be ignored, so each combination (x, y, o) cannot be represented as a vector array, which increases the difficulty of this task.

Since the original sequential C++ algorithm does not fit the GPU's parallel computing architecture, modification and re-design of the current arithmetic is needed. Keeping the same logic while using different arithmetic, and implementing the application to match the GPU's parallel computing characteristics, is the challenge.

GPU parallel decoding algorithm and implementation

The parallel decoding algorithm based on the CUDA GPU architecture needs to be designed with the following considerations (a hedged sketch of the data structure and timing follows this list):

1) Use each thread to perform all the necessary calculation steps for one point, and allocate the number of threads based on the number of points. The number of threads has to be greater than or equal to the number of points.

2) Change the data type: build a structure. Instead of using a 3-dimensional vector array, put the coordinates (x, y), the orientation 'o', and the angle 'θ' into a structure, so that each of them can be pulled individually from the structure for calculation without conflicting with the others. In addition, the shape of the grid and the blocks can then be designed using only the dimension sizes of x and y; 'o' can be ignored at this step, so it will not affect any future calculation.

3) Define a structure which contains the 9 nearest points, defined as a neighborhood. Give up the 3×3 matrix and use the k-nearest neighbor algorithm (k = 9) to compute the distances and sort the points, decreasing the sequential calculation loops in each thread.

4) Try smaller data in a single block first, then implement large-scale data across multiple blocks. Modification of the code may be needed due to the heavy sequential calculation in each thread.

5) Data in __global__ memory can be read freely by both host and device. In the decoding computation, the k-nearest neighbor sorting is the most time-consuming process, while the final decoding list is only used to store the final output for the next program function. To allocate the GPU memory accordingly, I put the final decoding list in __device__ memory and keep the k-nearest neighbor sorting process in __global__ memory. In this way I reduce the memory-reading workload, reduce the kernel invocation time, and in effect accelerate the decoding implementation.

6) A timer needs to be added to the program to measure the kernel invocation time plus the time of transferring data back to the host (CPU) from the device (GPU), in order to compare the computation time with the CPU serial decoding algorithm.
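As a hedged sketch of items 2) and 6) above (the names Marker, timeDecode, and decodeKernel are illustrative assumptions, not the thesis code), the marker data can be grouped into a structure, and the kernel invocation plus the copy back to the host can be timed with CUDA events:

    #include <cuda_runtime.h>
    #include <stdio.h>

    /* Item 2): one structure per marker instead of a 3-D vector array. */
    struct Marker {
        float x, y;    /* camera coordinates                */
        int   o;       /* orientation code 0-3              */
        float theta;   /* angle to the current target point */
    };

    /* Item 6): time a kernel launch plus the device-to-host copy. */
    void timeDecode(Marker *d_markers, Marker *h_markers, int n)
    {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        /* decodeKernel<<<blocks, threads>>>(d_markers, n);  placeholder */
        cudaMemcpy(h_markers, d_markers, n * sizeof(Marker),
                   cudaMemcpyDeviceToHost);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);   /* elapsed milliseconds */
        printf("decode + copy back: %f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }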
6) A timer needs to add to the program to calculate the time of invoking kernel and the time of transferring data back to the Host (CPU) from Device(GPU) to compare the computation time with the CPU serial decoding algorithm. 39 The designed parallel decoding algorithm is only implemented in two steps shown in figure 3.1. Comparing to the serial decoding algorithm, the complexity of the computing processing is decreased from 6 steps to 2 steps. Figure 3.1 Decoding Algorithm The computation of the distance can be calculated using the Euclidean Distance. The information of the reference data base is loaded from the global memory to the shard memory of each block. Each thread takes care of only one target point to compute the distance with the other points in the reference data sheet shared with the other threads. Therefore, the distance calculation can be parallelized computed using GPU. The number of threads need to be generated in this kernel depends on how many points in the database. It has to be greater or equal than the number of the reference points in the data sheet. This distances are stored in the shared memory of each block for future sorting kernel. After computing the distances between each target point to the other points in the reference data sheet, I use k-NN algorithm to find the 8 nearest neighbors of the target point. Since I stored all the distances for each target point in the shared memory in each block, the sorting of 8 nearest neighbors for each target point can pull the distance data inside of its block. Therefore, time of 40 transferring data from memory is being saved. All the blocks invoke this sorting kernel concurrently. Also all the threads generated by the kernel in a common block performs the sorting algorithm simultaneously. This parallel k-NN sorting process can be shown in Figure 3.2. Figure 3.2 k-NN Sorting of Decoding Algorithm 41 In the k-NN sorting step, I have generated a data sheet contains groups of 9 nearest neighbors shown as in Figure 3.2. Then I perform the angle comparison sorting using arctan function. In terms of the standard arctan function, whose range is (−π/2, π/2), I use atan2 function to prevent over eliminating the points in the range of (−π, π].Compare these angle to the 9 positions defined by angle "45°, 225°, 90°, 270°, 135°, 315°." Each thread only handles one groups of 9-nearest neighbors. All threads in the same block and all blocks invoke this angle comparison kernel simultaneously in the GPU. After the result is obtained, I return a nine digit codeword built by their corespondent orientation value 'o'. The final decoding list will contain groups of 9-nearest neighbor's x and y coordinates and the nine digit codeword in each row. And this list is stored in the _device_memory. Whenever this list is needed, it can be pulled and transferred into the Host (CPU) memory for future use. Since the data is large, it saves the memory space of the CPU to do other six computations for this 3-D single-directional camera system. 42 Table 3.3 Time consuming comparison (Decoding vs. Complete Program Implementation) 43 Chapter 4 Experiments and Improvements I captured 21 different images with different patterns from different angle in different light source environments with the 3-D single-directional Camera as experiments to prove the parallel decoding algorithm improvements. Comparing to the serial CPU decoding algorithm, our parallel decoding algorithm efficiently utilizes CUDA-based GPU program structure to speed up the decoding process. 
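A minimal sketch of this angle classification, assuming each neighbor's offset (dx, dy) from its target marker is available from the k-NN step; the function name slot_from_angle and the 45° binning are illustrative assumptions, not the thesis code:

    #include <math.h>

    /* Map a neighbor's offset from the target marker to one of the eight
       surrounding positions (0 = 0 deg, 1 = 45 deg, ..., 7 = 315 deg).
       atan2f returns the full-range angle in (-pi, pi], unlike atanf. */
    __host__ __device__ int slot_from_angle(float dx, float dy)
    {
        float deg = atan2f(dy, dx) * 180.0f / 3.14159265f;
        if (deg < 0.0f)
            deg += 360.0f;                       /* fold into [0, 360)    */
        return ((int)(deg + 22.5f) % 360) / 45;  /* nearest 45-degree bin */
    }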
Table 3.2 Time consumption comparison (Decoding vs. Complete Program Implementation)

Chapter 4

Experiments and Improvements

I captured 21 different images, with different patterns, from different angles, in different light-source environments, with the 3-D single-directional camera as experiments to demonstrate the parallel decoding algorithm's improvement. Compared to the serial CPU decoding algorithm, our parallel decoding algorithm efficiently utilizes the CUDA-based GPU program structure to speed up the decoding process. In fact, as demonstrated in Section 3.3, since decoding is the most time-consuming of the nine functions, speeding it up largely determines the total implementation time of the whole 3-D single-directional camera system.

Table 4.1 and Table 4.2 show the comparison of the computation times of the parallel decoding algorithm on the GPU and the serial decoding algorithm on the CPU for the 21 images. Our parallel decoding algorithm on the CUDA-based GPU achieves speedups over the original serial CPU decoding algorithm ranging from 764 to 2199 times. From Figure 4.1 and Figure 4.2, I can see that whether the kernel invokes 128 threads/block or 512 threads/block, the computation of our parallel decoding algorithm on the CUDA-based GPU is far faster than that of the serial decoding algorithm on the CPU. Calculating the average time of each algorithm, Figure 4.3 shows that if each kernel generates 128 threads/block, the decoding algorithm takes about 3.85 ms, a speedup of about 1942 times, whereas with 512 threads/block it takes about 8.27 ms, a speedup of about 905 times. Implementing the parallel decoding algorithm with 128 threads/block is thus about twice as fast as with 512 threads/block. This is because, in the k-NN sorting algorithm, each thread in a common block has to sequentially examine every other element inside the block to sort the distances, while handling only one point. Therefore, the fewer points inside a block, the less sequential computation needs to be performed inside the common block, the more blocks can be used to perform parallel computing on the GPU, and the faster the program runs.

Since our parallel decoding algorithm yields such a significant improvement in computation speed, and decoding is the most time-consuming function in the whole application program, the total application implementation time should be optimized significantly as well. Table 4.3 and Table 4.4 show how much each experiment improved. From Figure 4.4, I can tell that whether each kernel generates 128 threads/block or 512 threads/block, the total application time of the single-directional camera system is reduced by about 7.5 s on average, speeding up the whole system by about 2.4 times.

kernel invoke: 128 threads/block

Image #   Serial Decoding   Parallel Decoding   Parallel vs. Serial Decoding
          w/ CPU (s)        w/ GPU (ms)         (# of Times Faster)
 1        8.70              4.047776            2149
 2        7.10              3.644288            1948
 3        5.30              2.975008            1782
 4        9.00              4.092448            2199
 5        8.60              4.291360            2004
 6        6.00              3.576512            1678
 7        7.10              3.597312            1974
 8        7.30              3.357568            2174
 9        5.10              2.933344            1739
10        7.20              4.020896            1791
11        8.40              3.917984            2144
12        7.30              4.071840            1793
13        8.50              4.009440            2120
14        8.30              4.195232            1978
15        8.20              3.982304            2059
16        7.60              3.959104            1920
17        7.30              3.890592            1876
18        7.90              4.229504            1868
19        7.80              4.234304            1842
20        8.40              4.347168            1932
21        6.00              3.510752            1709

Table 4.1 Time comparison with 128 threads/block
kernel invoke: 512 threads/block

Image #   Serial Decoding   Parallel Decoding   Parallel vs. Serial Decoding
          w/ CPU (s)        w/ GPU (ms)         (# of Times Faster)
 1        8.70              9.484000             917
 2        7.10              8.088288             878
 3        5.30              3.942016            1344
 4        9.00              9.363200             961
 5        8.60              9.205408             934
 6        6.00              7.046048             852
 7        7.10              7.367872             964
 8        7.30              5.888064            1240
 9        5.10              3.884032            1313
10        7.20              9.428192             764
11        8.40              9.248288             908
12        7.30              9.475680             770
13        8.50              9.268384             917
14        8.30              9.394784             883
15        8.20              9.363968             876
16        7.60              9.222912             824
17        7.30              8.713696             838
18        7.90              9.430720             838
19        7.80              9.424864             828
20        8.40              9.480160             886
21        6.00              6.846112             876

Table 4.2 Time comparison with 512 threads/block

Figure 4.1 Computation time comparison (128 threads/block)

Figure 4.2 Computation time comparison (512 threads/block)

Figure 4.3 Average Computation Time Comparison (average serial decoding time vs. average parallel decoding time with 128 and 512 threads/block)

Figure 4.4 Total Application Performance Time Comparison

kernel invoke: 128 threads/block

Image #   Total Application   Total Application   Time Saved w/        Speedup
          Time w/ CPU (s)     Time w/ GPU (s)     Parallel Decoding(s) (# of Times Faster)
 1        13.30               4.604               8.696                2.9
 2        12.60               5.504               7.096                2.3
 3        10.80               5.503               5.297                2.0
 4        14.80               5.804               8.996                2.5
 5        15.80               7.204               8.596                2.2
 6        10.20               4.204               5.996                2.4
 7        11.90               4.804               7.096                2.5
 8        12.10               4.803               7.297                2.5
 9        10.40               5.303               5.097                2.0
10        12.40               5.204               7.196                2.4
11        12.80               4.404               8.396                2.9
12        12.10               4.804               7.296                2.5
13        15.30               6.804               8.496                2.2
14        14.90               6.604               8.296                2.3
15        15.80               7.604               8.196                2.1
16        12.80               5.204               7.596                2.5
17        12.20               4.904               7.296                2.5
18        13.30               5.404               7.896                2.5
19        13.20               5.404               7.796                2.4
20        14.20               5.804               8.396                2.4
21        10.50               4.504               5.996                2.3

Table 4.3 Total Application Time comparison w/ 128 threads/block

kernel invoke: 512 threads/block

Image #   Total Application   Total Application   Time Saved w/        Speedup
          Time w/ CPU (s)     Time w/ GPU (s)     Parallel Decoding(s) (# of Times Faster)
 1        13.30               4.609               8.691                2.9
 2        12.60               5.508               7.092                2.3
 3        10.80               5.504               5.296                2.0
 4        14.80               5.809               8.991                2.5
 5        15.80               7.209               8.591                2.2
 6        10.20               4.207               5.993                2.4
 7        11.90               4.807               7.093                2.5
 8        12.10               4.806               7.294                2.5
 9        10.40               5.304               5.096                2.0
10        12.40               5.209               7.191                2.4
11        12.80               4.409               8.391                2.9
12        12.10               4.809               7.291                2.5
13        15.30               6.809               8.491                2.2
14        14.90               6.609               8.291                2.3
15        15.80               7.609               8.191                2.1
16        12.80               5.209               7.591                2.5
17        12.20               4.909               7.291                2.5
18        13.30               5.409               7.891                2.5
19        13.20               5.409               7.791                2.4
20        14.20               5.809               8.391                2.4
21        10.50               4.507               5.993                2.3

Table 4.4 Total Application Time comparison w/ 512 threads/block

Beyond comparing performance speed, I also compared the final output fusion images to verify the stability of the accuracy. I compared the output fusion images generated by the 3-D single-directional camera system using the serial CPU decoding algorithm against those generated using our parallel decoding algorithm on the CUDA-based GPU. All the output fusion images carry the same 3-D image coordinate information, which means the parallel decoding algorithm I designed not only improves the computation speed of the original serial decoding algorithm but also maintains the same accuracy; therefore, this is an efficient decoding algorithm. Each input image and the comparison of the output fusion images from both algorithms are shown below, from Figure 4.4.1 to Figure 4.4.21.
Figure 4.4.1 Input Image #1, Output image comparison
Figure 4.4.2 Input Image #2, Output image comparison
Figure 4.4.3 Input Image #3, Output image comparison
Figure 4.4.4 Input Image #4, Output image comparison
Figure 4.4.5 Input Image #5, Output image comparison
Figure 4.4.6 Input Image #6, Output image comparison
Figure 4.4.7 Input Image #7, Output image comparison
Figure 4.4.8 Input Image #8, Output image comparison
Figure 4.4.9 Input Image #9, Output image comparison
Figure 4.4.10 Input Image #10, Output image comparison
Figure 4.4.11 Input Image #11, Output image comparison
Figure 4.4.12 Input Image #12, Output image comparison
Figure 4.4.13 Input Image #13, Output image comparison
Figure 4.4.14 Input Image #14, Output image comparison
Figure 4.4.15 Input Image #15, Output image comparison
Figure 4.4.16 Input Image #16, Output image comparison
Figure 4.4.17 Input Image #17, Output image comparison
Figure 4.4.18 Input Image #18, Output image comparison
Figure 4.4.19 Input Image #19, Output image comparison
Figure 4.4.20 Input Image #20, Output image comparison
Figure 4.4.21 Input Image #21, Output image comparison

Chapter 5

Conclusion and Future Work

In this paper, I presented a parallel decoding algorithm, using a CUDA-based GPU, for three-dimensional data streams in real time. The calculation steps involved in this algorithm are distance computation, k-NN sorting, and angle position sorting. The point clouds in these kernels are processed under the CUDA-based GPU parallel computing architecture. Many experiments were run to test the stability of this program. The parallel decoding algorithm reduces the computation complexity from six to two and accelerates the serial decoding algorithm with speedups of up to 2199 times. Summarizing and analyzing the experimental results, I have successfully implemented an efficient and stable parallel decoding algorithm with a CUDA-based GPU for the single-directional camera system. Furthermore, the parallel k-NN sorting algorithm can also be applied to any other application that needs to order large-scale data by distance.

Also, as shown in Figures 4.1 and 4.2, the implementation speed is also determined by how the blocks and threads are allocated. As illustrated in Tables 4.1 and 4.2, implementing decoding with 128 threads/block is faster than implementing it with 512 threads/block. This is because, on the GPU, each thread computes in parallel with the others, as does each block, and no communication takes place between blocks. Therefore, increasing the number of blocks while decreasing the number of threads per block decreases the number of data points that must be computed in each block and increases the data parallelism for each frame, which increases the parallel computing power. At the same time, for large-scale data sets, using more threads per block can still result in less processing time than putting all the data in one block.

To accelerate the single-direction camera system with the GPU, designing an algorithm that efficiently utilizes the CUDA-based GPU parallel program structure is very important. This includes pre-defining the data structures, designing the shape of the grid and the blocks, modifying the existing sequential arithmetic, and allocating memory on the host and the device in an optimized way; a minimal sketch of this host-side pattern follows.
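The sketch below shows the standard host-side pattern referred to above: allocate device memory, copy the input over, launch with a chosen threads-per-block value, and copy the result back. The kernel is a placeholder and the names are illustrative assumptions, not the thesis code.

```cuda
// Minimal sketch of the host-side CUDA pattern: device allocation,
// host-to-device copy, grid-shape design, kernel launch, copy back.
#include <cuda_runtime.h>

__global__ void decodeKernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];   // placeholder for the real decoding work
}

void runDecode(const float* hostIn, float* hostOut, int n, int threadsPerBlock)
{
    float *devIn = nullptr, *devOut = nullptr;
    size_t bytes = n * sizeof(float);

    cudaMalloc(&devIn,  bytes);
    cudaMalloc(&devOut, bytes);
    cudaMemcpy(devIn, hostIn, bytes, cudaMemcpyHostToDevice);

    // Grid shape: enough blocks to cover all n points. Lowering
    // threadsPerBlock (e.g. 128 instead of 512) raises the block count,
    // which is the trade-off measured in Tables 4.1 and 4.2.
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    decodeKernel<<<blocks, threadsPerBlock>>>(devIn, devOut, n);
    cudaDeviceSynchronize();

    cudaMemcpy(hostOut, devOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(devIn);
    cudaFree(devOut);
}
```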
As seen in Figure 4.3, I can conclude that the choice of the number of blocks and the number of threads per block makes a large difference in processing time.

The final goal is to achieve video rate for the single-directional 3-D camera system, so that it can navigate the robot arm quickly and accurately to grab the object in front of the robot. The parallel decoding algorithm gives the program a significant speed improvement, roughly 2000x on the decoding step. However, this is not enough for the whole system to respond at video rate: video rate of about 30 frames per second allows only about 33 ms per frame, while the total application time is still several seconds per image. Future work will therefore focus on implementing the other eight functions in the system in parallel using the CUDA-based GPU; the calculation below shows why those functions now set the limit.
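The gap can be quantified with Amdahl's law. Under the assumption (derived here from the averages of Table 4.3, not stated in the thesis) that decoding accounts for roughly 7.5 s of the 12.9 s average serial total, the parallelized fraction is p ≈ 0.58, so even a decoding speedup of s ≈ 2000 bounds the overall speedup at

\[
S_{\text{overall}} = \frac{1}{(1 - p) + p/s}
                   = \frac{1}{(1 - 0.58) + 0.58/2000}
                   \approx 2.4,
\]

which matches the measured factor in Tables 4.3 and 4.4 and confirms that the remaining eight functions now dominate the runtime.

BIBLIOGRAPHY

[1] D. B. Kirk and W. W. Hwu, Programming Massively Parallel Processors. London: Morgan Kaufmann, 2010, pp. 1-279.

[2] Nvidia, "Compute Unified Device Architecture Programming Guide Version 4.0," http://developer.download.nvidia.com/compute/cuda/4.0/docs/NVIDIACUDAProgrammingGuide 4.0.pdf

[3] Nvidia, "NVIDIA GeForce GTX 470 GPU Architectural Overview," http://www.nvidia.com/docs/IO/55506/GeForce GTX 470 GPU Technical Brief.pdf, May 2009.

[4] ——, "The CUDA Compiler Driver NVCC," http://www.nvidia.com/object/io_1213955090354.html.

[5] W. J. van der Laan, "Decuda," http://wiki.github.com/laanwj/decuda/.

[6] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, "NVIDIA Tesla: A Unified Graphics and Computing Architecture," IEEE Micro, vol. 28, no. 2, pp. 39-55, 2008.

[7] V. Volkov and J. W. Demmel, "Benchmarking GPUs to Tune Dense Linear Algebra," in SC '08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing. Piscataway, NJ, USA: IEEE Press, 2008, pp. 1-11.

[8] Z. S. Hakura and A. Gupta, "The Design and Analysis of a Cache Architecture for Texture Mapping," SIGARCH Comput. Archit. News, vol. 25, no. 2, pp. 108-120, 1997.

[12] H. Goto, "GT200 Overview," http://pc.watch.impress.co.jp/docs/2008/0617/kaigai 10.pdf, 2008.

[9] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA Workloads Using a Detailed GPU Simulator," in Performance Analysis of Systems and Software, ISPASS 2009, IEEE International Symposium on, April 2009, pp. 163-174.

[10] D. Kirk and W. W. Hwu, "ECE 498AL Lectures 8-9: The CUDA Hardware Model," http://courses.ece.illinois.edu/ece498/al/Archive/Spring2007/lectures/lecture8-9-hardware.ppt, 2007.

[11] H. Jégou, M. Douze, and C. Schmid, "Hamming embedding and weak geometric consistency for large scale image search," in European Conference on Computer Vision, Marseille, France, 2008.

[12] S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W. W. Hwu, "Optimization Principles and Application Performance Evaluation of a Multithreaded GPU using CUDA," in PPoPP '08: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. New York, NY, USA: ACM, 2008, pp. 73-82.

[13] J. Xu, N. Xi, C. Zhang, and Q. Shi, "Real-time 3-D shape measurement system based on single structure light pattern," in IEEE International Conference on Robotics and Automation (ICRA), 2010.

[14] A. McCallum and K. Nigam, "A comparison of event models for Naive Bayes text classification," in Proceedings of the AAAI-98 Workshop on Learning for Text Categorization, 1998, pp. 41-48.

[15] T. Joachims, Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms. Berlin: Springer, 2002.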
[16] D. Aha, D. Kibler, and M. K. Albert, "Instance-based learning algorithms," Machine Learning, vol. 6, pp. 37-66, 1991.

[17] E.-H. Han, G. Karypis, and V. Kumar, "Text categorization using weight adjusted k-nearest neighbor classification," Technical Report, University of Minnesota, 2000.

[18] H. V. Jagadish, B. C. Ooi, K.-L. Tan, C. Yu, and R. Zhang, "iDistance: An adaptive B+-tree based indexing method for nearest neighbor search," ACM Transactions on Database Systems, vol. 30, no. 2, pp. 364-397, June 2005.

[19] X. Li, D. Shi, V. Charastrakul, and J. Zhou, "Advanced P-Tree based K-nearest neighbors for customer preference reasoning analysis," Journal of Intelligent Manufacturing, vol. 20, no. 5, pp. 569-579, 2009.

[20] A. Gong and Y. Liu, "Improved KNN classification algorithm by dynamic obtaining K," Communications in Computer and Information Science, vol. 143, pp. 320-324, 2011.

[21] J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd ed. Morgan Kaufmann, 2005.

[22] B. Catanzaro, N. Sundaram, and K. Keutzer, "Fast Support Vector Machine Training and Classification on Graphics Processors," in Proc. of International Conference on Machine Learning, 2008, pp. 104-111.

[23] G. Vasiliadis, S. Antonatos, M. Polychronakis, et al., "Gnort: High Performance Network Intrusion Detection Using Graphics Processors," in Recent Advances in Intrusion Detection (RAID), 2008, vol. 5230, pp. 116-134.

[24] V. Garcia, É. Debreuve, and M. Barlaud, "Fast k nearest neighbor search using GPU," in CVPR Workshop on Computer Vision on GPU, Anchorage, AK, USA, 2008.

[25] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vision, vol. 60, pp. 91-110, 2004.

[26] V. Garcia, É. Debreuve, and M. Barlaud, "Fast K Nearest Neighbor Search using GPU," in IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2008, vol. 1-3, pp. 1107-1112.

[27] H. Zhang, A. C. Berg, M. Maire, and J. Malik, "SVM-KNN: Discriminative nearest neighbor classification for visual category recognition," in International Conference on Computer Vision and Pattern Recognition, New York, NY, USA, 2006.

[28] M. N. Goria, N. N. Leonenko, V. V. Mergel, and P. L. Novi Inverardi, "A new class of random vector entropy estimators and its applications in testing statistical hypotheses," J. Nonparametr. Stat., vol. 17, pp. 277-297, 2005.

[29] F. Pan, B. Wang, X. Hu, and W. Perrizo, "Comprehensive vertical sample-based KNN/LSVM classification for gene expression analysis," J. Biomed. Inform., vol. 37, pp. 240-248, 2004.

[30] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu, "An optimal algorithm for approximate nearest neighbor searching in fixed dimensions," Journal of the ACM, vol. 45, pp. 891-923, 1998.

[31] H. Jégou, M. Douze, and C. Schmid, "Searching with quantization: approximate nearest neighbor search using short codes and distance estimators," Tech. Rep. RR-7020, INRIA, 2009.

[32] D. Qiu, S. May, and A. Nüchter, "GPU-accelerated nearest neighbor search for 3-D registration," in International Conference on Computer Vision Systems, Liège, Belgium, 2009.

[33] Y. Zhuge, Y. Cao, and R. W. Miller, "GPU accelerated fuzzy connected image segmentation by using CUDA," in Engineering in Medicine and Biology Conference, Minneapolis, MN, USA, 2009.