This is to certify that the thesis entitled COMPLEXITY DISTORTION OPTIMIZATION FOR VIDEO OVER WIRELESS DEVICES presented by Alan Paul Ray has been accepted towards fulfillment of the requirements for the M.S. degree in Electrical and Computer Engineering.

Major Professor's Signature

Date

MSU is an Affirmative Action/Equal Opportunity Institution


COMPLEXITY DISTORTION OPTIMIZATION FOR VIDEO OVER WIRELESS DEVICES

By

Alan Paul Ray

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

Department of Electrical and Computer Engineering

2004


ABSTRACT

COMPLEXITY DISTORTION OPTIMIZATION FOR VIDEO OVER WIRELESS DEVICES

By

Alan Paul Ray

The introduction of powerful compact wireless devices (such as handheld computers) raises new and unexplored aspects of complexity-distortion optimization for video. Historically, the computational power of powerful platforms (e.g., computers) has exceeded the available communication bandwidth over emerging networks such as the Internet and wireless networks. Hence, improved performance was achieved by reducing the required transmission bitrate of video (through higher compression) and by increasing the computational complexity. Handheld devices do not follow the traditional balance of bandwidth and power; computational power is at a premium while bandwidth is comparatively more available. This thesis investigates the complexity-distortion options for a state-of-the-art video compression standard, H.264/JVT, which is an estimated three times more complex than the previous standards. The research focuses on analyzing the video standard's complexity for one of the most popular handheld devices, the iPAQ. Operational complexity-distortion and rate-complexity curves are presented for a number of Group of Picture (GoP) structures and entropy coding modes, and compared against Pentium IV performance. The experiments include some of the earliest comparisons between JVT's entropy coding modes, context-adaptive binary arithmetic coding and context-adaptive variable-length coding.


TABLE OF CONTENTS

LIST OF TABLES ............................................................................................................. v
LIST OF FIGURES ........................................................................................................... vi

CHAPTER 1
INTRODUCTION .............................................................................................................. 1
Problem Background ........................................................................................................ 1
Prior Work and Motivation ............................................................................................... 2
Objectives ......................................................................................................................... 4
Approach ........................................................................................................................... 5
Contributions .................................................................................................................... 5
Thesis Overview ............................................................................................................... 7

CHAPTER 2
BACKGROUND ................................................................................................................ 8
Standard Encoder Model ................................................................................................... 8
    Input Frame ................................................................................................................... 9
    Motion Estimation & Prediction Error ........................................................................ 10
    Frequency Transform & Quantization ........................................................................ 12
    Decode Frame Prediction ............................................................................................ 13
    Entropy Coding ........................................................................................................... 13
JVT Decoder Implementation ......................................................................................... 14
    Input Decoding ............................................................................................................ 14
    Dequantization and Hadamard Transform .................................................................. 17
    Motion Prediction ........................................................................................................ 18
    Deblocking (Loop Filter) ............................................................................................. 21

CHAPTER 3
IMPACT OF GROUP OF PICTURE TYPES ON REAL TIME PERFORMANCE ...... 23
Hardware Selection ......................................................................................................... 24
    Handheld device - iPAQ: ............................................................................................ 25
    PC Platform: ................................................................................................................ 25
Software Selection .......................................................................................................... 25
    Operating System ........................................................................................................ 26
    Compiler and Software Development ......................................................................... 27
    JVT Decoder ............................................................................................................... 28
    JVT Encoder ............................................................................................................... 31
    JVT Player .................................................................................................................. 32
    Experimental Platform ................................................................................................ 33
Experimental JVT Video Sequence Settings .................................................................. 34
    Experiment Variables: ................................................................................................ 34
    Experiment Constants: ................................................................................................ 38
Primary (I,P,B,T) GoP Results by Complexity-Distortion: ............................................ 39
    Plot Specifications ...................................................................................................... 41
    Akiyo ........................................................................................................................... 42
    Foreman ...................................................................................................................... 47
    Mobile ......................................................................................................................... 52
Trends and Observations ................................................................................................. 56

CHAPTER 4
BITRATE COMPLEXITY ANALYSIS .......................................................................... 59
Timing Probes ................................................................................................................. 59
    Actual Timing Probes ................................................................................................. 60
    Calculated Timing Probes ........................................................................................... 63
Akiyo ............................................................................................................................... 64
    Summary ..................................................................................................................... 66
    Akiyo Tables and Figures ........................................................................................... 67
Foreman .......................................................................................................................... 74
    Summary ..................................................................................................................... 76
    Foreman Tables and Figures ....................................................................................... 77
Mobile ............................................................................................................................. 84
    Summary ..................................................................................................................... 85
    Mobile Tables and Figures .......................................................................................... 86
Conclusions ..................................................................................................................... 93

CHAPTER 5
CONCLUSION ................................................................................................................. 96
Summary ......................................................................................................................... 96
Future Directions ............................................................................................................ 97

BIBLIOGRAPHY ........................................................................................................... 100
General References ....................................................................................................... 101


LIST OF TABLES

Table 1: Complexity Summary over 25-45 dB range ......................................................... 7
Table 2: Summary of Coding Variables ........................................................................... 34
Table 3: Akiyo Complexity Deltas over 25-45 dB ............................................................ 44
Table 4: Foreman Complexity Deltas over 25-45 dB ....................................................... 49
Table 5: Mobile Complexity Deltas over 25-45 dB .......................................................... 54
Table 6: Akiyo Rate Invariant Probes (15fps, CE) ........................................................... 67
Table 7: Foreman Rate Invariant Probes (15fps, CE) ....................................................... 77
Table 8: Mobile Rate Invariant Probes (15fps, CE) .......................................................... 86
Table 9: Complexity Summary over 25-45 dB range ....................................................... 94


LIST OF FIGURES

Figure 1: Basic Video Encoder Model ................................................................................ 9
Figure 2: Group of Picture Illustration .............................................................................. 20
Figure 3: First Frame of Video Sequences ........................................................................ 35
Figure 4: Akiyo Complexity-Distortion (15 FPS, CE) ...................................................... 42
Figure 5: Akiyo Complexity-Distortion (15 FPS, PC) ...................................................... 43
Figure 6: Foreman Complexity-Distortion (15 FPS, CE) .................................................. 47
Figure 7: Foreman Complexity-Distortion (15 FPS, PC) .................................................. 48
Figure 8: Mobile Complexity-Distortion (15 FPS, CE) ..................................................... 52
Figure 9: Mobile Complexity-Distortion (15 FPS, PC) ..................................................... 53
Figure 10: Akiyo Rate-Complexity (15 FPS, CE) ............................................................. 68
Figure 11: Akiyo Deblock Rate-Complexity (15 FPS, CE) ............................................... 69
Figure 12: Akiyo Deblock Rate-Complexity by Picture Type (15 FPS, CE) .................... 70
Figure 13: Akiyo ReadMB Rate-Complexity (15 FPS, CE, CABAC) .............................. 71
Figure 14: Akiyo ReadMB Rate-Complexity (15 FPS, CE, CAVLC) ............................... 72
Figure 15: Akiyo ProcessMB exl. ReadMB Rate-Complexity (15 FPS, CE, CABAC) .... 73
Figure 16: Foreman Rate-Complexity (15 FPS, CE) ......................................................... 78
Figure 17: Foreman Deblock Rate-Complexity (15 FPS, CE) .......................................... 79
Figure 18: Foreman Deblock Rate-Complexity by Picture Type (15 FPS, CE) ................ 80
Figure 19: Foreman ReadMB Rate-Complexity (15 FPS, CE, CABAC) .......................... 81
Figure 20: Foreman ReadMB Rate-Complexity (15 FPS, CE, CAVLC) .......................... 82
Figure 21: Foreman ProcessMB exl. ReadMB Rate-Complexity (15 FPS, CE, CABAC) 83
Figure 22: Mobile Rate-Complexity (15 FPS, CE) ............................................................ 87
Figure 23: Mobile Deblock Rate-Complexity (15 FPS, CE) ............................................. 88
Figure 24: Mobile Deblock Rate-Complexity by Picture Type (15 FPS, CE) .................. 89
Figure 25: Mobile ReadMB Rate-Complexity (15 FPS, CE, CABAC) ............................. 90
Figure 26: Mobile ReadMB Rate-Complexity (15 FPS, CE, CAVLC) ............................. 91
Figure 27: Mobile ProcessMB exl. ReadMB Rate-Complexity (15 FPS, CE, CABAC) .. 92


CHAPTER 1

INTRODUCTION

Problem Background

Compression is a crucial component of video coding.
Uncompressed video is extremely space intensive, with an average 320x240-pixel image taking up over 300 kilobytes, or over 27 megabits per second at the standard 30 frames per second. As video streaming grows and spreads, today's bandwidth and storage simply cannot support any significant level of uncompressed video. Previous standards such as MPEG-2, MPEG-4, and H.263 have successively introduced newer and improved approaches to video compression. Meanwhile, computer processing power has continued to grow rapidly, allowing the introduction of better compression standards to meet the rising use of video. High compression standards attempt to deal with difficulties such as increased network traffic, limited bandwidth, and storage space.

The current state-of-the-art standard is the JVT¹/H.264 standard², developed by the Moving Picture Experts Group (under ISO, the International Organization for Standardization) and the Video Coding Experts Group (under ITU, the International Telecommunication Union) [1]. This standard has been developed as a response to "the need for an industry standard for compressed video representation with substantially increased coding efficiency and enhanced robustness to network environments." [1]

¹ JVT is a Joint Video Team from the two aforementioned experts groups.
² Throughout this thesis, JVT, H.264, JVT/H.264, and H.264/JVT are used interchangeably.

Prior Work and Motivation

As discussed later, JVT provides significant advantages and improvements over previous standards in terms of higher compression ratios. One estimate put H.264 bit savings at 39% compared to MPEG-4, 59% over H.263 High Latency Profile, and 64% better than MPEG-2 [2]. Another study suggested around 50% coding gain over MPEG-2 and the baseline H.263, and 24% over the high profile H.263 model [3]. However, to obtain the improved compression rates, the JVT standard significantly increases the computational complexity. One study suggests that the baseline H.264 decoder increases complexity an average of 2.4 times compared to the previous H.263 baseline implementation [4]. Although much research has gone into reduced-complexity improvements to H.264, it remains a computationally complex standard.

While the increased complexity of the H.264 standard poses some problems for today's traditional desktop and laptop computers, these computers have grown rapidly in their processing power and typically have specialized video cards and instruction sets. In addition, significant amounts of research are focused on reducing complexity for traditional computers. However, the spread of handheld computers and wireless networks introduces a new set of requirements to video decoding.

Handheld computers, with high-speed Internet and quality graphics, are clearly viable candidates for video decoding. Their wireless network access has the same bandwidth as any other wireless implementation, currently around one to five megabits per second, depending upon the technology being used, environmental interference, error handling techniques, and the network implementation and usage. The typical handheld graphics, while adequate, are far less capable than a typical desktop. The handheld uses a 240x360 pixel screen, at least four times smaller than a desktop screen. A desktop's color range is in the millions of colors; a handheld device supports sixty-five thousand or fewer (a maximum of 16 bits per pixel).
More importantly, handhelds typically lack specialized video cards, powerful processors, or extra instruction sets. The handheld processors are slower, both in terms of clock rate and optimizations. Because of power, space, and heat issues, architectural speedups are typically not implemented to the same degree as for standard systems. Thus a 400 MHz processor for a handheld is not as powerful as a 400 MHz processor for a desktop. Additionally, handheld devices do not run the same operating systems as desktops.

Thus, the needs and limitations for a video decoder on a handheld are likely to be significantly different from those of a video decoder optimized for a desktop. Specifically, bandwidth may be much less of an issue compared to the lack of computing power. Furthermore, due to the different operating system and architectural designs, handheld optimizations may not particularly correspond to desktop optimizations.

However, designing a new video standard for handheld devices is undesirable for several reasons. First, video will often be generated for multiple destinations, possibly without knowing the platform of the end-user. In addition, video is often encoded for multiple uses (handheld, storage, desktop), and encoding video multiple times for different applications is less than desirable. Furthermore, the development of additional standards defeats one of the major purposes of having a standard: to have a universally accepted format for video encoding.

JVT, designed and backed by both the MPEG and ITU standards organizations, is likely to become the next generation standard in many applications. However, little research has addressed the questions involved in how JVT can best be implemented on handheld devices. Since handheld devices, due to their low storage space, are better suited for real-time video than for storing video, this thesis focuses on the real-time performance of a handheld decoder with a variety of approaches to decoding JVT video sequences. Special attention is paid to the time involved in the subparts of video decoding.

One previous study [5] breaks the JVT decoding process into six main sections³: the deblocking filter, interpolation, inverse transform and quantization, variable length decoding (VLD), intra prediction, and miscellaneous. Based on using a Pentium III processor, this previous study estimated that the deblocking filter takes 20-35% of the time, interpolation (including motion prediction) another 35-50%, intra prediction less than 1%, inverse transform and quantization 7-11%, variable length decoding 4-9%, and miscellaneous around 6-21%. The wide range of results reflects the use of different video sequences.

³ The following chapter elaborates on and provides some background on these parts of the video decoder.

Objectives

The purpose of this research was three-fold: to understand the impact of high-level encoder options such as entropy coding and Group of Picture (GoP) structure on complexity-distortion; to develop a handheld-specific complexity model to guide further handheld decoder optimizations; and to generate a real-time JVT decoder for a handheld device.

Approach

Since prior research, such as the study by Lappalainen et al., suggests interpolation accounts for a large portion of the decoding time, an initial research focus is the impact of the frame type composition upon the computational complexity.
The frame type composition, or Group of Picture (GoP) type, is a significant factor in controlling the amount of motion prediction required by the video decoder.

Contributions

The major contributions of this research are broken down into four sections. The first contribution, developing the available reference software into a viable real time H.264 decoder, provided the underlying software used for performance analysis throughout the rest of this research. In addition, the real time H.264 decoder ensures the practical application of the research to actual decoding.

The second contribution, discussed in chapter three, provides a baseline complexity for the H.264 decoder under different parameters, especially the impact of motion prediction through controlling the picture frame types. The exploration of these parameters gives insight into the complexity control available at a high level through varying the encoder options. This contribution focused on the impact of I, P, and B-frames on the complexity-distortion curves generated by different levels of quantization in the encoder. This contribution highlights the generally optimum performance of P-frames, even when no rate limit is imposed on the sequence.

The third contribution, examined in chapter 4, breaks down the decoder into sub-components and presents the rate-complexity behavior of each aspect. It identifies the portions of the decoder that are complexity invariant, regardless of rate, and the components that are rate-dependent. Additionally, this section breaks out the time that goes into parsing, which is highly linear in rate, and also breaks out the decoding (non-parsing) complexity for I, P, and B frames. This section highlights the significance of parsing, especially with a sub-optimal implementation, as well as the large decoding time, especially with B frames.

The fourth contribution compares the iPAQ results to the same experiments run on a Pentium IV processor. The contribution highlights that the performance differences between the two processors are not simply scaled versions of each other. The Pentium speedup varies significantly based upon the component of the decoder. This variance strongly suggests that algorithms and optimizations from one processor will not necessarily be equally beneficial on the other processor and that separate handheld research is needed. This contribution is discussed in both chapters 3 and 4 after the relevant iPAQ discussion. Table 1 provides an overview of the complexity and parsing rates for a range of sequences.⁴

⁴ This table is discussed in more depth at the conclusion of chapter 4.

Frame Type | iPAQ 400 MHz | Pentium IV 1.5 GHz | PC Speedup (Times Faster)
           |              |                    | Low End    | High End
Decoding, exl. Parsing (ms/frame)
I          | 11.5-13.5    | 2.75-3.10          | 4.18       | 4.35
P          | 21.5-22.5    | 3.95-4.45          | 5.44       | 5.06
B          | 33.0-48.5    | 5.40-7.10          | 6.11       | 6.83
Parsing (bytes/ms), CABAC Coding
I          | 250-280      | 1100-1300          | 4.40       | 4.60
P          | 160-250      | 720-1100           | 4.50       | 4.40
B          | 80-210       | 450-1000           | 5.63       | 4.76
Parsing (bytes/ms), CAVLC Coding
I          | 135-245      | 600-1130           | 4.44       | 4.61
P          | 50-100       | 380-520            | 7.60       | 5.20
B          | 30-80        | 360-500            | 12.00      | 6.25

Table 1: Complexity Summary over 25-45 dB range

Thesis Overview

The remaining chapters are organized as follows: Chapter 2 discusses video coding standards in general and the important features of the JVT standard. Chapter 3 looks at the complexity-distortion of different GoP sequences as well as contrasting the PC and iPAQ performance. Chapter 4 breaks down iPAQ decoder performance into its components and looks at the component performance based on compression.
Chapter 5 concludes by summarizing the contributions in more detail and looking at future work.

CHAPTER 2

BACKGROUND

This chapter discusses some key basics and well-established concepts for video decoders, the critical portions of the H.264/JVT standard, and the previous studies into H.264 and wireless complexity. The first section of this chapter provides a general summary of how a video sequence is broken down for coding and provides a very brief synopsis of the generic encoding process. The next section steps through the JVT standard decoding process of a correctly coded video stream, starting with the incoming stream, then looking at coefficient dequantization, motion prediction, and finally image deblocking. These sections include exploration of previous research on JVT complexity issues as well as the few previous studies on implementing video decoding on handheld platforms.

Standard Encoder Model

The typical video compression model is based on an encoder, either real-time or offline, that encodes a series of pictures and transmits the encoded video sequence over some channel, possibly with additional channel coding. The sequence is received by a decoder, which reverses the encoding process and generates a sequence similar to the original. The similarity to the uncorrupted original sequence, often referred to as the signal-to-noise ratio, depends upon the amount of information discarded by the encoder while compressing the sequence and any optimization in the decoder that neglected transmitted information.

Many standards, including the H.264 standard, specify only the correct ("standard compliant") input and output of the decoder; for the encoder, any algorithm that generates a correctly encoded (compliant) stream is acceptable. Likewise, any decoding process that takes a valid input stream and produces the correct decoded video sequence is acceptable. In practice, there is a standard encoder model as shown in Figure 1. The rest of this section examines each of these seven components (input frame, motion estimation, prediction error, frequency transform, quantization, decode frame prediction, and entropy coding).

Figure 1: Basic Video Encoder Model

Input Frame

Video frames, typically in YUV format, are sequentially fed into the encoder. YUV is a color representation scheme where Y represents the intensity, or luminance (luma), of the color. The U (or Cb) and V (or Cr) components contain the chrominance, or color component. (Cb and Cr primarily contain the blue and red chrominance components of the original Red-Green-Blue (RGB) pixel, respectively, whereas the luminance Y primarily contains the green component.) Because the human eye is much more sensitive to intensity than to color, the chrominance is often specified with fewer bits. One of the significant advantages of the YUV scheme is the ease of sending more detail regarding the luminance by including more bits for the Y component. The second way of exploiting the YUV scheme, used by popular video compression standards such as JVT, is to use a single sample of the UV components for multiple samples of the Y component. One format that is used by the MPEG and ITU series of video standards (including JVT) is referred to as YUV 4:2:0, which implies that for each pixel, the Y component is encoded. However, a Cb and Cr sample are only taken every other column and every other row, effectively one UV sample per four Y samples. Various schemes are used to properly average the UV components over all four Y samples. Currently the JVT standard supports 8 bits per Y, U, or V sample, although support for additional bits is expected [6].
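As an illustration of the 4:2:0 arrangement just described, the short C sketch below computes the size of one QCIF frame and downsamples a chroma plane by plain 2x2 block averaging. It is only a sketch: the averaging filter is one of the several possible schemes mentioned above, and the buffer names are illustrative rather than taken from the reference software.

    #include <stdio.h>
    #include <string.h>

    #define WIDTH  176   /* QCIF luma width  */
    #define HEIGHT 144   /* QCIF luma height */

    /* Average each 2x2 block of a full-resolution chroma plane down to one
     * sample, producing the quarter-size plane used by 4:2:0 sampling.
     * (Plain block averaging; other downsampling filters are possible.) */
    static void subsample_chroma(const unsigned char *full, unsigned char *half)
    {
        for (int y = 0; y < HEIGHT; y += 2)
            for (int x = 0; x < WIDTH; x += 2) {
                int sum = full[y * WIDTH + x]       + full[y * WIDTH + x + 1]
                        + full[(y + 1) * WIDTH + x] + full[(y + 1) * WIDTH + x + 1];
                half[(y / 2) * (WIDTH / 2) + x / 2] = (unsigned char)((sum + 2) / 4);
            }
    }

    int main(void)
    {
        static unsigned char cb_full[WIDTH * HEIGHT];            /* full-resolution chroma */
        static unsigned char cb_half[(WIDTH / 2) * (HEIGHT / 2)];
        memset(cb_full, 128, sizeof cb_full);                    /* neutral chroma value */
        subsample_chroma(cb_full, cb_half);

        /* One 8-bit Y sample per pixel, one Cb and one Cr sample per 2x2 block. */
        int frame_bytes = WIDTH * HEIGHT + 2 * (WIDTH / 2) * (HEIGHT / 2);
        printf("QCIF 4:2:0 frame: %d bytes\n", frame_bytes);     /* prints 38016 */
        return 0;
    }

For QCIF, the 4:2:0 layout works out to 1.5 bytes per pixel, which is where the 38,016-byte figure printed above comes from.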
Motion Estimation & Prediction Error

Motion estimation plays a major role in enabling modern video compression standards to achieve high coding efficiency. Motion estimation is traditionally achieved by dividing the image frame (under consideration by the encoder) of the video sequence (being compressed) into square (or rectangular) regions. The classic subdivision of a frame is into sixteen by sixteen (16x16) pixel blocks known as macroblocks. Each macroblock, therefore, contains a 16x16 array of the Y luminance information, and two 8x8 arrays, one for the Cb component and the other for the Cr component. (The 4:2:0 YUV format has four times the Y information compared to the U and V.) The macroblock is the fundamental block which is encoded, transmitted, and then decoded. Note that the standard video formats, such as QCIF (176x144 pixels), are multiples of 16x16. QCIF, for example, contains 99 macroblocks (11x9).

When the encoder receives a frame, it breaks the frame (current frame) down into macroblocks and, for each macroblock in the current frame, searches for the most similar 16x16 area within the previously coded frame (reference frame). Once the most similar 16x16 area ("best match") in the reference frame is identified by the motion search process, the encoder then takes the difference between the two: the macroblock in the current frame and the 16x16 best-match area in the reference frame. The 16x16 area ("best match") in the previous frame is the same size as the macroblock of the current frame that is being coded; however, this 16x16 area might consist of multiple macroblock subsections of the previous frame.

The range of the search depends upon the type of picture being encoded and the encoder options. Some picture types only allow searching within the same frame for similar macroblocks; others allow looking at one or more previous reference frames. The number of reference frames depends upon the encoder options; more frames increase the encoder complexity significantly. The definition of the "most similar" macroblock depends upon the encoder implementation. Numerous schemes for increasing the search efficiency and performance have been published. For example, Andy Chang et al. propose a fast estimation procedure that reduces computational complexity by forty percent compared to a standard full search [7].

The difference between the current block and the reference block is taken because usually the difference contains smaller values than the original macroblock and thus requires fewer bits to encode. In addition to the difference, a motion vector is sent specifying the reference 16x16 area (in the reference frame) for the current macroblock.

As discussed later, the encoder, after encoding each frame, immediately (internally) decodes the frame and stores it locally to serve as a reference decoded frame for the next (current) frame. This allows the encoder to be "in synch" with the decoder at the receiver (as explained further below). In other words, this enables the encoder to use the correct "reference" frame that will be available at the true decoder when computing the difference signal.
(Recall that the receiver will not have access to the original lossless frames; it will only have access to the decoded frames, distorted due to compression.)

Frequency Transform & Quantization

Once the Y, U, and V coefficients have been determined along with any motion vector, the coefficients of the macroblock undergo a frequency transform. This transform is usually a discrete cosine transform (DCT). Here, since the data is in arrays of sizes such as 4x4, 8x8, or 16x16, the transforms can easily be applied. The result is a set of coefficients that describe the samples and motion vectors in the frequency domain rather than pixel by pixel. Many of these components tend to be zero or very small, further reducing the amount of information needing to be transmitted. The JVT standard uses a special Hadamard transform, discussed later, to avoid multiplications or floating point discrepancies by the decoder.

Once the frequency coefficients are determined, they are quantized to further reduce the amount of information. The simplest form of quantization is to simply divide each coefficient by some number and send the integer result, discarding any remainder. The decoder simply multiplies the received integer by the same number, which is also signaled, and its final coefficients are close to the original. More complicated schemes with scaled quantization parameters are actually used to improve quantization performance. Careful selection of the quantization equations in the standard enables multiplication and division to be avoided and replaced with shifts and additions. This quantization step is the primary lossy step where information is discarded by the encoder that cannot be recovered by the decoder. The level of quantization is determined by the encoder options.

Decode Frame Prediction

Because of the loss of information from quantization, and possibly from the frequency transform (mainly due to minor loss in the precision of computation), a decoded frame is not exactly the same frame that was earlier encoded. Since the decoder is using motion prediction based on past decoded frames, not past original frames, the encoder needs to select the motion prediction based on the actual decoded frames. Otherwise, the motion vectors for the macroblocks will point to different coefficients than expected, and the newly decoded frame will not be correct. Since future frames will be based on past frames, the error will propagate and grow, adding significant distortion. To avoid this problem, the encoder includes a decoder which decodes each encoded frame. These decoded frames are then used for motion estimation, and so the motion estimation is correct regardless of what loss the encoder introduces.

Entropy Coding

The quantized macroblock coefficients, motion vectors, and other overhead information are now ready for lossless entropy encoding to minimize the amount of transmitted data. Channel coding and transmission protocols, which compensate for lossy channels, are irrelevant to the video standard although they clearly impact the overall video performance. Many standards, including JVT, include advanced features for handling lost data at the cost of additional complexity and bandwidth. The JVT standard supports two entropy coding modes, discussed shortly in more depth as part of the JVT decoder. Entropy coding represents the final encoding step, and the output is the encoded video, which is then channel coded and sent to the decoder.
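As a concrete (and deliberately simplified) view of the quantization step described above, the fragment below divides each transform coefficient by a step size and has the decoder multiply it back. The real JVT quantizer derives its scaling from the quantization parameter through lookup tables and replaces the division with multiplications and shifts; the step size here is a hypothetical stand-in.

    /* Illustrative uniform quantization of a block of transform coefficients.
     * qstep is a hypothetical step size; the actual JVT/H.264 quantizer derives
     * its scaling from the quantization parameter (QP) via lookup tables and
     * avoids the division entirely. */
    void quantize(const int *coeff, int *level, int n, int qstep)
    {
        for (int i = 0; i < n; i++)
            level[i] = coeff[i] / qstep;      /* remainder is discarded (lossy) */
    }

    /* The decoder reverses the step; the discarded remainder is lost for good. */
    void dequantize(const int *level, int *coeff, int n, int qstep)
    {
        for (int i = 0; i < n; i++)
            coeff[i] = level[i] * qstep;      /* approximation of the original  */
    }

A larger step size discards more of each coefficient, lowering the bitrate at the cost of distortion, which is exactly the trade-off the encoder's quantization options control.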
JVT Decoder Implementation

This discussion of the key aspects of the JVT standard is subdivided into several parts. The first portion deals with the decoder input and entropy decoding. The next focuses on the coefficient dequantization and inverse transform. The final parts examine motion prediction and the deblocking filter. Schwarz and Wiegand provide a more in-depth discussion of many of these aspects in [2].

Input Decoding

The first segment of the decoder processes the incoming video stream and then reverses the entropy coding. The video stream processing is based upon breaking the stream down into pre-defined packets. The packets then specify the protocol and data within each packet, and the data is extracted in the appropriate manner.

Network Abstraction Layer Unit (NALU)

Since the JVT standard is designed to work independently of any network or channel technology, JVT uses its own packet model, called a network abstraction layer unit (NALU), which is very similar to an Internet Protocol (IP) packet. The NALU encapsulates all the frame data, whether motion vectors, frequency coefficients, or flow control data. The incoming stream is searched for the unique NALU startcodes, then the decoder processes one NALU at a time. NALUs may contain either flow control data or picture data. (Flow control data involves information such as the reference frame buffer handling, slices, and macroblock ordering schemes.) Different NALU types contain different data packed in according to that NALU type.

Entropy Coding: CAVLC & CABAC

Depending upon the encoder options and the NALU type, the data is unpacked appropriately. Two major types of entropy coding are used to pack the data. The first is a combination of universal variable length codes (UVLC) and context adaptive variable length codes (CAVLC). Throughout this thesis, this combination is generally referred to by the CAVLC abbreviation, although this method of entropy encoding uses both. However, previous coding standards generally relied on a pure UVLC entropy approach. The context adaptive VLC codes use a combination of different lookup tables, where the context is the nearby macroblocks. In addition to the lookup tables, additional syntax is used to specify the location of zeroes in between the non-zero coefficients. A more in-depth explanation can be found in [8], as well as the original standard proposal [9]. Horowitz et al. [4] briefly note in their decoder complexity analysis that their empirical tests indicate CAVLC is about two times more complicated than the previous pure UVLC defined by JVT. However, they do not examine CAVLC complexity beyond that brief footnote.

The other entropy method that is supported by the JVT standard is context-based adaptive binary arithmetic coding (CABAC). CABAC is an extension of arithmetic coding. Arithmetic coding works by encoding a series of coefficients as a floating point number. The more bits, the more specific the floating point number and the more coefficients can be encoded in a single number. Arithmetic coding can nearly achieve the source entropy because it uses the probability distribution to determine which floating-point numbers correspond to which series of coefficients. Since the probability distributions are variable for different video sequences, and even within the same sequence, the JVT model recalculates the probabilities so that the distributions are representative of the recent past.
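CABAC's real engine couples a table-driven probability state machine with a binary arithmetic (range) coder. The sketch below is only a conceptual stand-in for the adaptation idea described above: a context keeps counts of recently decoded binary symbols and periodically rescales them so its probability estimate tracks the recent past. None of these names or constants come from the standard.

    /* Simplified adaptive context: tracks how many 0 and 1 symbols it has seen
     * and exposes a probability estimate that follows the recent statistics.
     * CABAC itself uses a 64-state table-driven estimator plus a binary
     * arithmetic coder rather than explicit counts; this is a conceptual sketch. */
    typedef struct {
        unsigned count[2];
    } Context;

    static void context_init(Context *c)
    {
        c->count[0] = c->count[1] = 1;             /* avoid zero probabilities */
    }

    /* Estimated probability that the next symbol is 1, scaled to 0..65535. */
    static unsigned context_prob_one(const Context *c)
    {
        return (65536u * c->count[1]) / (c->count[0] + c->count[1]);
    }

    static void context_update(Context *c, int symbol)
    {
        c->count[symbol & 1]++;
        if (c->count[0] + c->count[1] > 1024) {    /* periodically halve the counts  */
            c->count[0] = (c->count[0] + 1) / 2;   /* so the estimate weights the    */
            c->count[1] = (c->count[1] + 1) / 2;   /* recent past, as described above */
        }
    }

Each syntax element selects its own context (often based on neighboring macroblocks), so frequently repeated patterns in one part of the stream do not dilute the statistics of another.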
Studies such as [10, 11] suggest that the context-based arithmetic coding represents a 10-25% bit improvement over the UVLC implementation, depending upon the quantization parameter and video sequence. A study by Saponara et al. suggested that 16% was the upper limit [12]. In terms of prior research into coding complexity, both CABAC and CAVLC are relatively new coding schemes, and more focus has been given to the bitrate improvement of CABAC than to its complexity [10, 11, 13]. This research does, though, clearly note the bitrate advantages of context-based coding. A 1987 study by Witten et al. [14] comparing arithmetic coding and Huffman coding (which is traditionally the core of the VLC type of coding) suggests that arithmetic coding was generally superior to Huffman coding using the computers of that time. However, the study did not examine context-based coding, nor did it attempt to fully optimize the Huffman implementation. (Huffman codes, like VLC codes, generally rely on table-based decoding.) The H.263 Annex E first introduced CABAC to video standards. The JVT standard used an improved low-complexity table-based implementation, and a multiplication-free coder was recently proposed in [15]. However, the study by Saponara et al., using version 2 of the JVT reference software (pre-CAVLC), reports that CABAC had only a minor complexity increase but provided no experimental numbers for entropy coding [12]. It is important to remember that the context-based usage of VLC tables differs from previous video standards, which relied upon a single universal table.

Dequantization and Hadamard Transform

W = \frac{1}{2}
\begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \\ 1 & -1 & 1 & -1 \end{pmatrix}
Z
\begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \\ 1 & -1 & 1 & -1 \end{pmatrix}

Z = Frequency Coefficients
W = Unscaled Spatial Coefficients

Equation 1: Luma DC Hadamard Inverse Transform

As mentioned earlier, many previous standards have used discrete cosine transforms (DCT) for the space-frequency domain transform. Because the DCT relies on floating point approximations of irrational numbers, precision is platform and implementation dependent. This imprecision leads to "drift," the same problem that occurred when motion vectors were based off of the original frames rather than the decoded frames and led to a decoder instance in the encoder. To prevent this drift problem, as well as to improve performance, the JVT standard implements a 4x4 integer variant of the DCT. All of the transforms, one of which is shown in Equation 1, can be implemented as a series of one-dimensional transforms, which can be implemented using only addition, subtraction, and bit shifts for the multiplications and divisions. The JVT standard uses a 4x4 transform, where previous standards use a larger 8x8 transform. The smaller transform allows JVT to use smaller macroblock divisions, as discussed below. A fuller treatment of the JVT Hadamard transform progression and implementation is given by Richardson [16]. The transform specifics vary by the type of sub-macroblock being decoded; luma DC blocks and chroma DC blocks have their own variants of this process. Once the unscaled spatial coefficients are reconstructed by the inverse transform, they are rescaled by multiplying by a factor determined using lookup tables referenced by the quantization parameter.
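Because the matrices in Equation 1 contain only +1 and -1 entries, the inverse transform needs nothing more than additions and subtractions. The sketch below evaluates the product of Equation 1 as a row pass followed by a column pass; the 1/2 factor and the lookup-table rescaling described above are left to the caller, and the function names are illustrative rather than taken from the reference decoder.

    /* One-dimensional 4-point Hadamard butterfly used by Equation 1.
     * Only additions and subtractions are required. */
    static void hadamard4(const int in[4], int out[4])
    {
        int p = in[0] + in[1];
        int q = in[2] + in[3];
        int r = in[0] - in[1];
        int s = in[2] - in[3];
        out[0] = p + q;
        out[1] = p - q;
        out[2] = r - s;
        out[3] = r + s;
    }

    /* Inverse Hadamard transform of the 4x4 luma DC block (Equation 1 without
     * the 1/2 factor): a row pass followed by a column pass.  The QP-dependent
     * rescaling mentioned in the text is applied afterwards by the caller. */
    void inverse_hadamard_4x4(int Z[4][4], int W[4][4])
    {
        int tmp[4][4];
        int vec[4], res[4];

        for (int i = 0; i < 4; i++)            /* transform each row    */
            hadamard4(Z[i], tmp[i]);

        for (int j = 0; j < 4; j++) {          /* transform each column */
            for (int i = 0; i < 4; i++) vec[i] = tmp[i][j];
            hadamard4(vec, res);
            for (int i = 0; i < 4; i++) W[i][j] = res[i];
        }
    }

Because the 4x4 Hadamard matrix is symmetric, the same one-dimensional butterfly serves for both passes, which is what makes the transform so cheap on processors without fast multipliers.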
Motion Prediction

Each image is encoded and decoded one macroblock at a time. The encoded information for each macroblock includes the motion vectors and the differential coefficients. For images with any significant amount of motion, simply using the difference with the previous image provides poor correlation compared to what is available elsewhere, both within the same picture and in previous pictures. In order to best exploit the sequence correlation, two types of correlation are exploited. The first, spatial correlation, tries to use portions of the same frame as a base for encoding the remainder of the frame. The other, temporal correlation, uses previously encoded frames as the basis for encoding. The displacement between the current block and the selected match is the motion vector, which is generally encoded as the difference between that motion vector and previous motion vectors.

The encoder decides how best to encode the macroblock. It can encode the macroblock as one large 16x16 chunk, or subdivide it into 8x8, 16x8, or 8x16 blocks. Those blocks can further be divided into 4x4, 8x4, or 4x8 sub-blocks. The more blocks that are used, the better the motion compensation can match that particular block. Chang et al.'s simulations suggest as much as a 15% bitrate improvement by using the subblock sizes in addition to the whole macroblock [7]. However, more blocks increase the information needed by adding more motion vectors. Blocks may be intra-coded, taking advantage of spatial correlation, or inter-coded, pointing to previously coded frames. Inter-coded blocks may be predictive, using one motion vector, or bi-predictive, using two motion vectors. The bi-predictive mode allows a block to combine information from two previous blocks, saving on coefficient information but increasing the computation required on the decoding end.

Frame Types & Picture Buffering

The frame type, determined by the encoder options, determines which coding modes are used in coding that frame. An intra-frame, or I-frame, only uses intra-coded blocks. Thus it is completely self-decoding, as it has no motion vectors pointing to previous information. However, I-frames tend to be the largest type of frame because they do not take advantage of any temporal correlation. P-frames are predictive frames, which can use either intra coding (intra-coded macroblocks) or a single motion vector per macroblock. This motion vector points to the "best match" area (normally of size 16x16 pixels) within the previously coded I-frame or P-frame. The bi-predictive frame (B-frame) uses two motion vectors, which can point to different reference pictures.

The pattern in which an encoder uses these frame types is known as the Group of Picture (GoP) type. The encoder uses this pattern to decide which picture type to encode next, resulting in the pattern repeating over and over. Figure 2 illustrates two classic GoPs. The top one, IPPP, consists of a single I-frame followed by P-frames. Each P-frame uses the previous P-frame as a reference frame. The second, IBBPBBP, consists of an I-frame, then a repeating pattern of two B-frames followed by a P-frame. Each P-frame uses the previous I- or P-frame as a reference; each B-frame uses the previous and next P-frames as references. (The first two B-frames use the I-frame as a reference since there is no previous P-frame.)

Figure 2: Group of Picture Illustration
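The repeating structures in Figure 2 can be expressed as a simple frame-index rule. The sketch below assigns a picture type to each frame of the IPPP and IBBPBBP patterns in display order; the I-frame spacing parameter is a hypothetical illustration, and a real encoder additionally manages reference lists and the coded (transmission) order of B-frames.

    typedef enum { I_FRAME, P_FRAME, B_FRAME } FrameType;

    /* Picture type for frame n of an IPPP... sequence with an I-frame every
     * gop_len frames (gop_len is an illustrative parameter, e.g. 16). */
    FrameType gop_ippp(int n, int gop_len)
    {
        return (n % gop_len == 0) ? I_FRAME : P_FRAME;
    }

    /* Picture type for frame n of the IBBPBBP... pattern of Figure 2 in display
     * order: the first frame is intra coded, every third frame after that is a
     * P-frame, and the two frames in between are bi-predictive. */
    FrameType gop_ibbp(int n)
    {
        if (n == 0)     return I_FRAME;
        if (n % 3 == 0) return P_FRAME;
        return B_FRAME;
    }

Running gop_ibbp over n = 0, 1, 2, ... yields I, B, B, P, B, B, P, matching the second pattern in the figure.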
One-quarter pixel prediction

The goal of the motion compensation is to find a vector that reduces the macroblock coefficients to as close to zero as possible. In theory, motion may not happen in pixel steps, and a higher degree of precision may find better reference matches. Previous video standards, such as MPEG-2 [17], have frequently used half-pixel, or half-pel, motion vector precision. Increased precision on the motion vector increases the search complexity for the encoder. Research into motion search techniques demonstrates that improved methods are possible. For example, Panusopone and Baylon present a half-pel 8x8 search method in [18] that can be generalized to different block sizes and resolutions. Historically, however, precision beyond half-pel did not reduce bit usage for previous standards. Werner [19] proposed a filtering method to overcome signal noise in motion estimation. For JVT, Wedi and Musmann extended Werner's proposal into a functional method for using one-quarter and one-eighth pel precision [20]. The specific implementation in JVT, discussed briefly by Wiegand et al. in [21], uses a finite impulse response filter for the half-pel estimation, and then averages the integer and half-pel values for the quarter-pel prediction. This process is clearly more complex than full-pel decoding.
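To make the interpolation step concrete, the sketch below produces a horizontal half-pel luma sample with the six-tap (1, -5, 20, 20, -5, 1) filter commonly associated with H.264 and then averages it with the integer sample to obtain a quarter-pel value. Edge handling, the vertical and diagonal cases, and the standard's exact rounding rules are omitted; this is an illustration of the idea, not the normative process.

    /* Clip a filtered value back into the 8-bit sample range. */
    static int clip255(int v) { return v < 0 ? 0 : (v > 255 ? 255 : v); }

    /* Horizontal half-pel sample between pixels x and x+1 of a luma row, using
     * the six-tap (1, -5, 20, 20, -5, 1)/32 FIR filter.  The caller must keep
     * x-2 .. x+3 inside the row (no edge padding is done here). */
    static int half_pel(const unsigned char *row, int x)
    {
        int v = row[x - 2] - 5 * row[x - 1] + 20 * row[x]
              + 20 * row[x + 1] - 5 * row[x + 2] + row[x + 3];
        return clip255((v + 16) >> 5);        /* round and divide by 32 */
    }

    /* Quarter-pel sample a quarter of the way from pixel x toward x+1:
     * the average of the integer sample and the neighbouring half-pel sample. */
    static int quarter_pel(const unsigned char *row, int x)
    {
        return (row[x] + half_pel(row, x) + 1) >> 1;
    }

Every sub-pel motion vector forces the decoder to run this kind of filtering over the referenced block, which is why interpolation dominates the decoding-time breakdown cited in Chapter 1.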
Deblocking (Loop Filter)

Because the encoding and decoding are done in blocks, the edges of the blocks tend to have seams, which create visual artifacts and blockiness along the block boundaries. To help clear up this distortion, decoded macroblocks are run through a deblocking filter. The deblocking filter runs first on the vertical edges of the block, then on the horizontal edges. The filter serves to average out values along the edges, thus giving a smoother look to the finished frame. This filter is an integer-based filter, so it can be quickly implemented.

CHAPTER 3

IMPACT OF GROUP OF PICTURE TYPES ON REAL TIME PERFORMANCE

Research shows that the JVT standard provides a computationally complex algorithm for video decoding. While significant research has been done into complexity reduction methods for the JVT standard [6, 7, 22], little research has related the problem to mobile devices. An exception is Yu et al., who propose a real-time mobile encoder in [23]. Lappalainen et al. did conclude that real-time decoding for the H.263 standard should be possible for low-frequency mobile processors and that 12-38 MHz of a Pentium III was required for 10fps QCIF decoding [5]. Horowitz et al. analyzed the JVT complexity by looking at the average number of operations to decode a macroblock and then extrapolating that into complexity based on processor specifications [4]. That previous study, however, largely ignored the entropy decoding complexity.

As highlighted in the previous chapter, the JVT standard provides a wide range of video coding tools and related options. By varying these options, the JVT codec can provide a wide range of video quality; at the same time, these coding options imply a wide range of complexity that is encountered by the JVT decoder. Hence, a key objective of this thesis is to identify high-level JVT coding options that provide the best quality (minimum distortion) while adhering to a maximum complexity constraint. Our work thus establishes an initial baseline for the complexity-distortion performance of JVT decoding over handheld devices. Consequently, a variety of video sequences were encoded to explore the impact of frame rate, entropy coding, group of picture type, and video sequence upon the decoding performance. Equivalent tests were performed using a Pentium IV to provide a comparative-performance baseline.

The baseline tests (for both handhelds and standard computers) used a lossless decoder and rejected any optimizations which would represent a lossy decoder⁵. The purpose of these experiments was three-fold: first, to answer the question "Given that a video sequence is encoded for a handheld device using the JVT standard, how should one proceed to encode that sequence to achieve the best possible quality under a given complexity constraint?" Second, to identify the broad over-arching complexity of the JVT decoder implementation chosen. Finally, to establish the performance of this decoding software compared to that used in previous research.

Our experiments investigated the complete range of quantization choices, four different frame rates, both entropy encoding methods, four common Group of Picture (GoP) types, and intra-coded (I-frame) frequency variance for two common GoPs.

This chapter is organized as follows. First, the hardware and software components of the experiments are described. Then the video encoding configurations used in the experiments are described. Subsequently, the complexity-distortion results are presented in graph form with corresponding analysis.

Hardware Selection

The significant hardware choices were first the type of handheld, and then the standard computer used as a reference comparison for the handheld performance.

⁵ Here, a lossy decoder implies a decoder that discards some information/data that is already sent by the encoder for the purpose of reducing complexity.

Handheld device - iPAQ:

The iPAQ series, built by Compaq and later by HP after the Compaq/HP merger, is a series of handheld devices that is representative of the capabilities of generic handheld devices. They are designed as multipurpose consumer devices. They possess simple graphics, using 16 bits per pixel (6 bits for green and 5 each for red and blue). (Many of the early mobile devices used this 5-6-5 RGB format but did not support 65K colors.) The processors are simple but fast 32-bit processors, capable of speeds between 200 and 600 MHz. The iPAQ selected for this research is the h5550 model. Designed by Intel, the 5550 model employs a 32-bit processor targeted for mobile environments. The iPAQ represents a commercial integrated product typical of a generic mobile computing device.

PC Platform:

To provide the PC baseline, a system running Windows 2000 with a Pentium 4 1500 MHz chip and 256 megabytes of RAM was used. Aside from the normal system services and background tasks (e.g. virus detection, firewall), no other loads were running during simulation testing.

Software Selection

In order to prepare the experiments, the following software was used. The general principles behind the selection were to use the commercially available state-of-the-art, to use pre-developed solutions as a baseline and then improve the bottlenecks, and to initially avoid any degradation in video quality or decoder speed due to software changes. Improvements which sacrificed video quality for improved complexity were not considered at this point (i.e., we focused on lossless decoding).

Operating System

The handheld devices are developed and designed with Windows Compact Edition (WinCE) in mind. Thus hardware interface and control problems are minimized. WinCE is a light version of the traditional Windows system. It supports most of the basic Windows features, but it uses relatively few of the advanced features.
Graphically, for example, WinCE uses its own high-speed graphics library (GAPI) for direct access to the screen rather than the standard DirectX used by other systems. The WinCE software development kit (SDK) supports few of the libraries that are normally available. The operating system, moreover, supports only minimal configuration of its memory management, thread priorities, or file handling. Nevertheless, because of WinCE's simplicity, availability on a wide variety of handhelds, and support by Microsoft, it makes an excellent platform for examining the actual issues involved in real-time video decoding over low-complexity devices. The software development kit supports all standard C/C++ syntax as well as wireless internet, file I/O, display writing, basic timing, and 32-bit arithmetic, giving the basic building blocks for video decoding. Additionally, WinCE is designed to interface with the desktop, facilitating desktop-controlled program execution, video transfer, and data collection for experiments. Given that gigabytes of video sequences would eventually be generated, this automation is a valuable component. For those interested, others have written Linux versions to replace the default WinCE operating system. The open source aspects of Linux would facilitate further research into operating system improvements to speed up decoder performance. However, significant work in terms of a reasonable coding environment (see below) would be needed for such a change.

Compiler and Software Development

Since the current iPAQs are based upon the PXA255 or related processors, they use the ARM instruction set. Microsoft's freely released Embedded Visual C++ 3.0 (EVC3) and 4.0 (EVC4) development kits were used to facilitate development of handheld executables. Version 3.0 represents an early version of WinCE; version 4.0 represents the current version. Early design and development was done with version 3.0, although the results presented throughout this thesis are based on a more updated iPAQ which used the Visual C++ 4.0 environment. While EVC allows a handful of Microsoft/ARM-specific optimizations, and some of these were used at specific bottlenecks in the program, the majority of the code remains standard C/C++ and can be ported to any platform. All linking was done with the included Microsoft linker, while compilation was done with Intel's Windows Compiler for Embedded Visual C++ (later combined into Intel's Windows Compiler program). Initial trials used the Microsoft compiler, but the Microsoft compiler incorrectly compiled code for a single array, resulting in incorrect array values in the decoder. The code compiles correctly with both Visual Studio .NET (discussed later) and the Intel compiler. Speed comparisons between the two compilers showed no statistically significant variance in performance based upon the compiler.

A WinCE software development kit called Port SDK [24] was used to enable the use of standard console I/O in the WinCE environment. Port SDK also provided a simpler method to execute the compiled decoder since the decoder did not need to create its own windows environment. A generic console application [25] actually served as a DOS console under WinCE. (WinCE has no native console.) Preliminary tests showed no significant overhead to using the PortSDK/console when compared to the standard windows interface. All timings ignore program initialization and cleanup, including thread and window initialization.
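The timings reported in this thesis come from probes inserted directly into the decoder (see "Addition of timers" below). As a generic illustration, a probe of this kind can be built on the Win32/WinCE GetTickCount() call, with initialization and cleanup kept outside the measured region; the structure and function names below are illustrative, not those of the actual instrumented decoder.

    #include <windows.h>
    #include <stdio.h>

    /* Millisecond timing probe built on GetTickCount(), which is available on
     * both desktop Windows and WinCE.  Program initialization and cleanup are
     * performed outside the probed region so they never count toward decode time. */
    typedef struct {
        const char *name;
        DWORD start;
        DWORD total;
    } Probe;

    static void probe_init(Probe *p, const char *name) { p->name = name; p->total = 0; }
    static void probe_start(Probe *p)                  { p->start = GetTickCount(); }
    static void probe_stop(Probe *p)                   { p->total += GetTickCount() - p->start; }
    static void probe_report(const Probe *p)
    {
        printf("%s: %lu ms\n", p->name, (unsigned long)p->total);
    }

Wrapping only the decoding calls (for example, macroblock reading or deblocking) in probe_start/probe_stop pairs gives per-section totals without any external profiler.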
The console interface allowed standard I/O methods to be used instead of specializing for WinCE, increasing portability and reducing complexity. Microsoft Visual Studio .NET 2003 was used for compiling and coding the Pentium executable version of the code, along with the necessary data-gathering environment. Two different profilers, Intel's VTune and Compuware's DevPartner Profiler, were used to profile the video decoding complexity on the Pentium platform.

In the handful of places where the operating system calls differ significantly, separate coding sections are included to properly interact with the code. Methods for accessing the system time were the most common difference. Threads also showed the potential to require different interfaces. These areas in the code are the most likely to cause difficulty in the event of porting to a separate system but should be easily identifiable.

JVT Decoder

The JVT team, as part of their standard, is developing algorithmically correct implementations of portions of the H.264 standard. One of these is version 6.1e [26]. Version 6.1e contains a fully functional encoder and decoder for the standard features, including all of those used in this research. Some of the advanced features, such as slices, flexible macroblock ordering, and error correction, were not implemented as of this release. (Later versions, in an attempt to improve the encoder and decoder and add additional features, leave the encoder and/or decoder inoperable.)

As mentioned previously, the JVT technical specification for the decoder is an ideal decoder. Numerous ideas have been proposed for reducing complexity in exchange for a degraded video signal. For the purpose of this complexity overview, I implemented generic software improvements which did not reduce the video signal quality. Some additional changes were made to make monitoring software performance easier, since the available profiling tools for WinCE are impractical on non-custom hardware. The significant changes were:

Removal of system calls: All extraneous file inputs and outputs, such as output storage and verification, were removed. Thus the decoded frames were not written to a file, nor compared against the "ideal" video. Selected sequences were viewed to ensure that the decoder still functioned properly after code changes. Additionally, the majority of the memory allocation and de-allocation calls were moved to the program initialization or clean-up stage rather than the timed portion of the decoding. My experiments showed that repeated memory cleanup and reallocation was using as much as 30-35% of the decoding time in the JVT software version that was initially considered. As mentioned elsewhere, this heavy system load is one of the significant differences from the desktop, where the same memory tasks were taking an estimated two or three percent of the run time.
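A hypothetical before-and-after fragment of the allocation change described above: the per-frame malloc/free pair is replaced by a buffer allocated once at start-up and reused, so the allocator never runs inside the timed decoding loop. The function and variable names are made up for illustration.

    #include <stdlib.h>

    /* Before: every frame allocates and frees its own pixel buffer, so the
     * allocator runs inside the timed decoding loop.                        */
    void decode_frame_slow(int frame_bytes)
    {
        unsigned char *buf = (unsigned char *)malloc(frame_bytes);
        /* ... decode into buf ... */
        free(buf);                          /* allocator churn on every frame */
    }

    /* After: the buffer is created once during program initialization and
     * simply reused, so no allocator calls fall inside the timed region.    */
    static unsigned char *g_frame_buf;

    void decoder_init(int frame_bytes) { g_frame_buf = (unsigned char *)malloc(frame_bytes); }
    void decode_frame_fast(void)       { /* ... decode into g_frame_buf ... */ }
    void decoder_cleanup(void)         { free(g_frame_buf); }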
This change also facilitated video playback by allowing the decoder to pass a memory pointer to the video player, by naturally reordering out-of-sequence frames, and by adding a decoded-frame buffer to smooth out decoding irregularities. The addition of the buffer provided a nominal improvement of 5% or so. (Footnote: Out-of-sequence frames occur because a B frame, due to prediction, may depend on a frame that occurs after it in display order. When this is the case, the frames that the B frame requires for prediction are sent first, which means that the frames do not arrive in display order.)

Addition of timers: Due to the severe lack of profiling software for the WinCE environment, a number of timers were inserted into the software to obtain timing information for key sections of the code. As of this research, the author is unaware of any quality profiling approaches for complicated WinCE tasks. A handful of private non-commercial solutions are available, but these carry considerable overhead, seem susceptible to misprofiling functions, and interface poorly with other software. Function counts can be found by profiling the decoder on the Pentium platform. However, as noted by the difference in memory allocation complexity, the handheld environment's response does not necessarily correlate to the PC's. These inserted timers provide an overview of how the decoding time is being spent within the decoder by partitioning the decoding time into roughly six different areas, such as overhead, frame decoding, macroblock decoding, and macroblock reading. Tests indicated that the inserted timers were sufficiently few in number that they had negligible impact on the average run time.

Modification of frequently called functions: Frequently called functions, such as the NALU locator, the start code search routine, and portions of the symbol decoding routines, were rewritten to handle the most frequent case more quickly. These changes involved techniques such as reading larger chunks of the file at a time, storing decoded bits rather than recalculating them each time, and an improved search approach. An estimated 15-25% improvement was seen as a result of these changes.

JVT Encoder

The original 30 fps YUV video test sequences were encoded using version 6.1e of the JVT encoder [26]. This encoder, provided by the JVT team, correctly handled the options involving entropy coding, frame rate, and GoP structure. Advanced options, such as those involving error control, flexible macroblock ordering, and search restrictions, were not fully implemented as of this release. This release is one of the earlier implementations of CAVLC entropy coding in the JVT source code (the CABAC and CAVLC entropy coding modes are discussed more fully in Chapter 2). Later releases began implementing these features but also broke other components of the encoder. The encoder was used "as is" with the exception of changes to its input and output routines to provide information in a format more conducive to interaction with the experimental platform (described below).

JVT Player

A separate Windows video player was co-developed with another colleague in the WAVES lab to display the decoded frames from the JVT decoder on the screen. Separate CE and PC versions were developed to take advantage of the different display technologies. The CE version, the portion that I worked on and developed, ran as a separate thread from the decoder.
The player thread calculated an interval of i milliseconds that it should sleep, then woke up, asked the decoder for a pointer to a frame to display, displayed the frame, and went back to sleep for i milliseconds. Mutexes, a thread synchronization mechanism, prevented the threads from interfering with each other, and delay conditions handled the buffer underflow and overflow cases. Experimentation suggested that the best-case display time for a QCIF image using the GAPI interface was about six (6) milliseconds, and a fullscreen image required about seventeen (17) milliseconds. This additional overhead means that displaying a 5 fps sequence in real time requires a decoding speed of 5.2 fps, a 10 fps sequence requires a 10.6 fps rate, and a 15 fps sequence about a 16.5 fps rate. (For example, at 15 fps each frame has a 66.7 ms budget; subtracting roughly 6 ms of display time leaves about 60 ms for decoding, i.e., roughly 16.5 frames per second.) The actual required rate will vary depending on the actual performance of the video driver.

Experimental Platform

In order to efficiently encode, decode, and record experiment information, an automatic experiment executor was built. Its functionality included several main components:

Automatic Encoder: Given a list of experiments, the executor generated the appropriate encoder configuration file, ran the JVT encoder, waited for the encoder to finish, and recorded the bit counts and PSNR values. It then stored the results in the appropriate result file and moved the necessary input files for the decoder into the appropriate directory.

Automatic Decoder: Given a list of experiments and the platform, the executor copied the decoder input and configuration files into the proper location, ran the appropriate decoder, and read in the resulting timings from the decoder run. The results were stored within the result files. The executor was capable of tracking different configurations and comparing configurations against a specified optimum run.

Plotter: Given a list of experiments and the desired data calculation, the executor parsed the result data and generated a file specifying how to plot the requested information. This file was specially formatted to be fed into a Matlab function, which then interpreted the text into a plot.

Experimental JVT Video Sequence Settings

The first subsection discusses the encoding and decoding settings that were varied for the experimental results. The second subsection explains settings that were fixed for this research.

Table 2: Summary of Coding Variables
  Video Sequences: Akiyo, Foreman, Mobile
  GoP Structures: I, P, B, T
  I-Frame Frequencies: 4, 8, 12, 16, 20
  QP Values: 0-51
  Frame Rates: 5, 10, 15, 30
  Entropy Coding: CABAC & CAVLC
  Video Format: QCIF (176x144)
  Hadamard Transform: On
  Max. Search Range: 16
  Num. Reference Frames: 1
  Forced Intra-Macroblocks: None
  Slices: Unused
  SP Frames: Unused
  Loop Filter Parameters: Default

Experiment Variables:

Video Sequence: Three different video sequences were used; all three are standard 10-second video coding test sequences. Figure 3 shows a sample frame from each of the sequences: on the left, Akiyo, with a talking anchorwoman in a news studio; in the center, Foreman, with a foreman talking about the construction site shown in the final frames; and on the right, Mobile, with a moving train and calendar. The first, Akiyo, is a shot of a talking news anchorwoman. It features minimal motion, high temporal and spatial correlation, and is highly compressible. Akiyo represents an ideal video sequence.
The second, Foreman, is a higher-motion sequence of a construction foreman talking and then showing the surrounding landscape. It features more motion and lower temporal correlation than Akiyo and is more representative of a typical video sequence. The third sequence, Mobile, is a high-motion video with a great deal of detail. It is very difficult to compress, and it loses rather significant quality with compression.

Figure 3: First Frame of Video Sequences Akiyo (Left), Foreman (Center), Mobile (Right)

Group of Picture (GoP) structure: Four main GoP structures were used, and sub-variants of two of those were further investigated. The four main GoP structures are referred to here as I, P, B, and T. The I-GoP pattern is an all-I-frame sequence. This GoP ignores temporal correlation and is a sequence of separate images, like stringing together a series of JPEG images (i.e., similar to motion JPEG). The I-GoP has a high bitrate due to minimal compression but also has no motion prediction (i.e., low complexity, in general). The P-GoP is a mostly P-frame sequence with an I frame every 12 frames: an I frame, 11 P frames, then a second I frame and 11 more P frames, and so on. The repeated I frame gives the decoder a periodic non-dependent frame, which prevents long-term signal degradation as well as providing a moderate level of error control. Even if a frame is corrupted in transmission, a complete picture will arrive within a second or two, depending upon the frame rate. The B-GoP replaces every other frame with a B frame, so the pattern becomes I-B-P-B-P-...-I-B-P-B-P-... The frequent use of B frames takes advantage of temporal correlation in both directions and thus increases the compression. However, bi-directional motion prediction increases the complexity during decoding. The final GoP, T, uses two B frames instead of one: I-B-B-P-B-B-P. This pattern further exploits any correlation between frames. In both the B and T GoPs, an I frame repeats every 12 frames, regardless of whether the intermediate frames are P or B frames.

The two sub-variants profiled involved changing the frequency of the I frames for the P and B sequences. The P-GoP variants are I4X, I8X, I16X, and I20X, where the number is the I-frame frequency. Thus the P-GoP is equivalent to I12X and the I-GoP to I1X. The B-GoP variants are B4X, B8X, B16X, and B20X, again with the number indicating I-frame frequency and the B-GoP being B12X. B frames repeat every other frame, so B4X has the pattern I-B-P-B-I-B-P-... The complexity response to changing the I-frame frequency gives insight into how frequent image refreshing impacts performance compared against longer runs of motion prediction. These results were remarkably similar to the primary four GoPs examined (I, P, B, T), and specific analysis is not presented for them. In general, their complexity variance is directly related to I-frame frequency and follows the data shown in Table 9.

Frame Rate: The original 10-second video sequences were captured at thirty frames per second for a total of 300 frames. The sequences were then encoded at four different frame rates: five, ten, fifteen, and thirty frames per second (FPS). 15 FPS corresponds to skipping every other frame, 10 FPS to skipping two of every three frames, and 5 FPS to skipping five of every six frames. In theory, skipping fewer frames should increase the temporal correlation between frames, enabling higher compression and a lower bitrate.
However, since the tradeoff between the complexity increase and the bitrate reduction is unknown, the increased correlation may not help computational complexity.

Quantization Parameter (QP): Video sequences were generated for all valid QP values, which range from 0 to 51. A QP of 0 is almost uncompressed video, with an extremely high bitrate and excellent picture quality; 51 gives the lowest quality, with minimum bitrate and typically unacceptable picture quality. For most of the analysis, the working subrange of 25-45 dB PSNR was selected: 25 dB is at the low end of what is considered acceptable video, and 45 dB constitutes an excellent picture. Thus 25-45 dB represents a working range within which to evaluate performance. Different QP values were selected for this range for each video sequence, as the QP that generated the range varied with frame rate, GoP structure, and video sequence.

Entropy Coding: The JVT standard allows two types of entropy coding, context-adaptive binary arithmetic coding (CABAC) and context-adaptive variable-length coding (CAVLC). Previous research indicates that CABAC decreases the bitrate by 10-15% compared to the equivalently coded UVLC sequence [11]. However, CABAC is traditionally more computationally complex, and the bit reduction is less likely to be useful given the complexity limitations of the iPAQ. When comparing CAVLC with CABAC, Wiegand et al. estimate that CABAC gains 5-15% in bitrate savings [21], with the highest savings in interlaced television signals (which are not investigated in this thesis). Within the test sequences, our experiments suggested that for a fixed rate, CABAC coding offered a distortion improvement ranging between zero and one decibel.

Experiment Constants:

Video Format: Due to the screen size of most handheld devices (320x240 pixels or smaller), the Quarter Common Intermediate Format (QCIF, 176x144 pixels) was used. The other common standard is the Common Intermediate Format (CIF), which is four times larger (352x288 pixels) and larger than most handheld devices can accommodate without additional scaling. The standard video test clips come in these formats, and the choice of QCIF avoided any degradation from issues not related to the encoding process.

Hadamard Transform: The encoder used the standard Hadamard transform to provide better encoding results. A less accurate approximation that does not use the Hadamard transform is also available.

Forced Intra-Macroblocks: Forced intra-macroblocks, blocks that are compressed without motion prediction, are normally used to refresh the signal quality and prevent error propagation. Since the encoded video sequences included I frames at regular intervals, no additional intra-macroblocks were added.

Maximum Search Range: The search range determines how far from the block center the encoder will search when looking for the most similar block. Here a standard value of 16 pixels, the size of one full macroblock, was used.

Number of Reference Frames: The number of reference frames determines how many previous (or future) frames are searched when looking for the most similar block. In my research, multiple reference frames provided only minimal performance improvement when compared with a single reference frame. However, multiple reference frames lead to a significant increase in encoder complexity. Several recent JVT studies have used five reference frames [4, 5], but provide no rationale for this choice.
We believe that any level of prediction higher than the one used in B frames (i.e., the use of more than two reference frames) corresponds to an unreasonable level of complexity that should not (and cannot) be used on handheld devices.

Block Search Restrictions: No block search restrictions were enabled. The encoder was allowed to choose 16x16, 16x8, 8x8, 8x4, or 4x4 blocks as it found the best matches.

Slices, SP Frames, Loop Filter Parameters: None of these special features were used. Only a single slice was used, no switching frames were used, and the default loop filter parameters were used.

Primary (I, P, B, T) GoP Results by Complexity-Distortion:

For each of the three sequences (Akiyo, Foreman, and Mobile), the results and analysis are broken into sections. The first section contains the results and an overview of the important points. The next two sections are a performance comparison between the CABAC and CAVLC modes and then a comparison between the CE and PC results. For Foreman and Mobile, a brief synopsis is then presented comparing the current sequence against the previously discussed sequences.

The results shown and discussed below are the decoding complexity in terms of frames per second (FPS) plotted against the peak signal-to-noise ratio (PSNR) in decibels. The higher the signal-to-noise ratio, the better the quality. At QP 0, the PSNR is around 60-70 dB, depending on the sequence. Anything around 45 dB or higher is considered excellent signal quality. The distortion range shown, 25-45 dB, represents the span between a rather highly distorted signal (25 dB) and a high-quality signal (45 dB). The run times are measured in milliseconds or clock cycles, depending on which was most appropriate for the measurement scale, and then converted to FPS. FPS provides normalization, since the sequences differ slightly in their coded length based on their GoP pattern. The data is shown for a single test, as repeated tests on the same sequence indicated slight (less than 2%) deviation between runs. Furthermore, since each test differs only slightly from the tests before and after it, the entire run from QP 0 to 51 serves as verification of the accuracy of individual runs. Certain tests, even when repeated, seem to be unusually quick or slow, and this presumably is related to a combination of encoding choices and architectural caches. Because of its significantly faster microprocessor, the PC is more vulnerable to these fluctuations than the handheld.

It is important to note that the data and analysis presented in this chapter assume that adequate bandwidth is available for receiving any of the available sequences. All bitrate constraints are ignored. Also, the time to read the data quickly from a file is assumed to be the same as the time needed to receive the video stream over a wireless connection. However, the actual complexity of file reading and network reception may differ significantly.

Plot Specifications

In the plots presented in this thesis, the constant variables for a plot are specified in the title. The video sequence is abbreviated to its first three letters. The GoP structure has an additional specification, '-x', where x specifies the exact frame type from a sequence that is being plotted: '-A' is the average over all frames, '-i' the I frames, '-p' the P frames, and '-b' the B frames. While only an average SNR number is available, frame-type-specific data is available for time and bitrate results.
Clearly, frame-specific availability depends on the sequence GoP: a P-GoP has no data for the B frames, for example. Experiment variables that are constant for all experiments in a plot are printed in the title; the plot variables are labeled in the legend. The title also specifies the limits of the data shown in the plot. There are three limits used: an SNR limit, a kilobytes-per-frame (KBPF) limit, and a milliseconds-per-frame (MSPF) limit. The default ("no limit") setting is a PSNR between 25 and 45 dB, with any complexity and bitrate. All three limits are applied to the overall performance of a sequence, not to particular frame types. For example, the average size of all frames in a T-GoP will be much higher than the average B-frame size; however, the limit filter uses the average of all frames even when only plotting the B frames.

Figure 4: Akiyo Complexity-Distortion (15 FPS, CE). [PSNR (dB) versus decoder frames per second for the I, P, B, and T GoPs; one panel per entropy coding mode.]

Figure 5: Akiyo Complexity-Distortion (15 FPS, PC). [Same axes and GoP legend for the PC platform.]

Overview

Table 3: Akiyo Complexity Deltas over 25-45 dB
(Time delta: slowest minus fastest time per frame over the 25-45 dB range.)
          Time delta (ms)      Percent slowdown      % entropy diff.
          CABAC    CAVLC       CABAC    CAVLC        25 dB    45 dB
CE   I    13.9     28.5        37.6     65.5         17.8     41.7
     P     5.2     12.7        14.7     35.8          0.0     18.4
     B     4.1      9.7         8.4     19.9          1.0     11.7
     T     3.2      9.9         6.2     18.8          0.0     11.9
PC   I     3.8      6.3        60.0     94.8          6.7     29.9
     P     1.3      2.1        21.7     36.0         -1.2     10.4
     B     0.9      1.4        11.9     18.8         -0.8      5.4
     T     0.9      0.8        10.6      9.1          4.2      2.7

Figure 4 and Figure 5 show the handheld and PC performance for the primary GoP types (I, P, B, T) when run at 15 frames per second for both entropy types. The 15 fps sequence is shown for two reasons. First, it is the fastest frame rate that the decoder can play back in real time on the tested iPAQ platform. As suggested by Figure 4 and confirmed by tests on 30 fps sequences, the decoder stops short of reaching 30 fps even for a very low-quality Akiyo sequence. Second, among frame rates lower than 30 fps, 15 fps maximizes the temporal correlation since only every other frame is removed. (Lower frame rates, such as 10 or 5 fps, discard more frames within the sequence.) While this similarity matters less for Akiyo, since every frame is extremely similar, it does mean that the P and B frames achieve better compression for the Foreman and Mobile sequences discussed later.

Clearly, for the handheld, the P-GoP is the best compression approach for the easily compressed Akiyo sequence: for any quality level, it has the lowest complexity (highest frame rate). Table 3 summarizes the overall complexity change over the 25-45 dB range. The time is the number of milliseconds longer that decoding a frame at 45 dB takes than decoding at 25 dB. The percent slowdown is that time delta as a percentage of the fastest time to decode a frame, that is, the time at 25 dB. The percent difference is the percentage by which a CAVLC sequence is slower than the CABAC signal at the same decibel level.
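To make these three metrics concrete, the following is a minimal sketch of how the columns of Table 3 could be computed from the measured per-frame decode times. The struct, function names, and the sample values in main() are hypothetical illustrations, not measurements or code from the thesis.

    #include <cstdio>

    // Per-frame decode times (ms) measured at the two ends of the 25-45 dB range
    // for one GoP/entropy/platform combination.
    struct RangeTimes {
        double ms25dB;   // fastest: time per frame at 25 dB
        double ms45dB;   // slowest: time per frame at 45 dB
    };

    // "Time delta" column: slowest minus fastest over the range.
    double timeDelta(const RangeTimes& t) { return t.ms45dB - t.ms25dB; }

    // "Percent slowdown" column: the delta as a percentage of the 25 dB time.
    double percentSlowdown(const RangeTimes& t) {
        return 100.0 * (t.ms45dB - t.ms25dB) / t.ms25dB;
    }

    // "% entropy diff." column: how much slower CAVLC is than CABAC at the same dB.
    double percentEntropyDiff(double msCavlc, double msCabac) {
        return 100.0 * (msCavlc - msCabac) / msCabac;
    }

    int main() {
        // Hypothetical illustrative numbers only.
        RangeTimes cabac = {35.0, 40.2};
        RangeTimes cavlc = {35.0, 47.5};
        std::printf("delta=%.1f ms, slowdown=%.1f%%, entropy diff at 45 dB=%.1f%%\n",
                    timeDelta(cabac), percentSlowdown(cabac),
                    percentEntropyDiff(cavlc.ms45dB, cabac.ms45dB));
        return 0;
    }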
Thus for the handheld P-GoP, moving from a low-quality signal to a high-quality signal is only a 15% (5.2 ms) complexity increase using CABAC, while a 45 dB CAVLC signal is 18% slower to decode than a 45 dB CABAC signal. Note that the PC and CE percentages can be compared against one another because the percentages are taken against an individual baseline performance for each platform.

Several key conclusions stand out from these experiments. First, there can be significant differences in how handheld and desktop computers respond in complexity to distortion changes. Between architectural, operating system, and compilation differences, the same performance gains tested and developed for personal computers cannot be assumed to have equivalent efficiency on a handheld device. Second, the B and T-GoPs, with their heavy motion prediction and higher compression, are relatively stable in complexity regardless of distortion, entropy coding, and platform for highly correlated sequences. Finally, the I-GoP complexity varies drastically depending upon distortion, entropy coding, and platform.

Handheld: CABAC & CAVLC

Surprisingly, CAVLC entropy coding is not a significant improvement over the CABAC-coded sequences. The decoder source notes mention that the CAVLC decoding procedures have not been fully optimized, and even the additional optimization described earlier has not fully caught up to the CABAC implementation. It is worth noting that at high distortion, and excluding the I-GoP, there is relatively little difference between the two entropy coding methods. Since the I-GoP uses the least compression (and thus a much higher bitrate than the other GoPs) and has the highest complexity difference at high distortion levels, these results suggest that the amount of input data that must be parsed may be as significant as the computation it takes to parse the data. This suggests that memory and file I/O bottlenecks may be an important factor in addition to available processor time. (This aspect will be further discussed and analyzed in the next chapter.)

Handheld vs. PC

Interestingly, the PC has a much harder time with the I-GoP as distortion decreases compared to the iPAQ: the standard computer takes ninety-five percent longer to decode a 45 dB CAVLC sequence than a 25 dB one, and sixty percent longer to decode a CABAC sequence at 45 dB than at 25 dB. The iPAQ, comparatively, takes only 66% longer for CAVLC and 38% longer for CABAC. While the iPAQ certainly slows down with lower distortion, its I-GoP performance is significantly less hindered: in Figure 4, the CABAC I-GoP is always faster than the B or T-GoPs; not so for the PC (see Figure 5). Correspondingly, the CE CAVLC I-GoP stays much closer to the T-GoP performance than the PC I-GoP stays to its corresponding T-GoP.

Foreman

Figure 6: Foreman Complexity-Distortion (15 FPS, CE). [PSNR (dB) versus decoder frames per second for the I, P, B, and T GoPs; one panel per entropy coding mode.]

Figure 7: Foreman Complexity-Distortion (15 FPS, PC). [Same axes and GoP legend for the PC platform.]

Overview

Table 4: Foreman Complexity Deltas over 25-45 dB
(Time delta: slowest minus fastest time per frame over the 25-45 dB range.)
          Time delta (ms)      Percent slowdown      % entropy diff.
          CABAC    CAVLC       CABAC    CAVLC        25 dB    45 dB
CE   I    24.0     45.9        64.4     101.8        20.7     48.2
     P    18.2     44.9        49.2     112.7         7.6     53.4
     B    13.1     45.6        26.7      90.4         3.0     54.8
     T    12.9     45.8        25.8      86.1         6.4     57.4
PC   I     6.0     11.4        92.5     176.8        -0.6     42.9
     P     4.1      7.7        63.8     117.1         1.3     34.3
     B     3.2      6.5        41.3      82.9         1.6     31.4
     T     3.2      6.2        40.0      77.1         1.6     28.6

The Foreman sequence results for 15 frames per second are shown in Figure 6, Figure 7, and Table 4. The Foreman sequence was included in the test sequences because of its similarity to a "typical" (more or less average) sequence, with reasonable temporal correlation but also a moderate-to-significant level of motion. Overall, the Foreman sequence shows greater complexity than the Akiyo sequence, especially as the distortion decreases. Again, as with Akiyo, the P-GoP provides the lowest complexity for a given distortion.

The handheld performance reflects the general knowledge that Foreman is a more complicated sequence than Akiyo. While both sequences have roughly the same complexity at 25 dB, the Foreman complexity increases more rapidly. In addition, the B and T-GoP complexity changes, whereas it remained almost constant for the Akiyo sequence. With reduced compression, the difference between the GoP structures shrinks, although a clear difference is still visible. Additionally, at low distortion, real-time performance can barely be achieved in some instances: the CABAC low-distortion I, B, and T GoPs are near the cutoff point for real-time functionality, and all four GoPs are well below real-time levels using CAVLC.

Handheld: CABAC & CAVLC

Overall, when comparing the Akiyo and Foreman GoPs, all four GoPs are similar in speed at the 25 dB level. All four are also significantly slower than their Akiyo counterparts by the 45 dB level. By comparing Figure 4 with Figure 6, it is clear that the starting complexities for a high-distortion Akiyo sequence are similar to those of the high-distortion Foreman sequences, although Foreman is slightly more complicated even at high distortion. For the Foreman sequences at lower distortion, the different GoP patterns come much closer to merging.

Handheld vs. PC

As with Akiyo, the I-GoP once again performs much better relative to the other GoPs on the iPAQ than on the PC. (The PC response is shown in Figure 5 and Figure 7.) Under CE, the I-GoP outperformed the B and T-GoPs with CABAC and matched them with CAVLC. However, the PC I-GoP has significantly greater complexity than the B and T sequences. With Foreman being a less compressible sequence than Akiyo, the complexity increases more sharply for the B and T-GoPs as distortion drops. However, the 25-35 dB Foreman range shows the same frozen trend that characterized these GoPs in Akiyo: as the distortion drops, complexity remains almost constant. In fact, for the 25-31 dB portions of the CABAC complexity-distortion graph, reducing distortion also reduces complexity.
It is not immediately clear what is causing this complexity response, and the question will be investigated more deeply with some of the detailed timing data.

Handheld: Akiyo vs. Foreman

Looking at the handheld device, reducing distortion on the Foreman sequence produces a very different result than for Akiyo. The Akiyo sequences begin with substantial time gaps between the P and B GoPs. Under CABAC, the difference is about 13 ms at 25 dB and is still around 12-13 ms at 45 dB. For CAVLC, the gap is similar, starting at about 14 ms and ending at 12-13 ms. For Foreman, however, the gap starts around 10 ms and shrinks to around 7 ms for CABAC; with CAVLC it also shrinks from about 10 ms to 7 ms. In both sequences, the B and T GoPs mirror each other, with the T-GoP being about 1 ms slower with CABAC and about 5 ms slower with CAVLC coding. Additionally, the complexity increase is significantly greater with the Foreman sequence. For the Akiyo P-GoP, the complexity increases 5 ms per frame for CABAC and 13 ms per frame for CAVLC. For the Foreman sequence, those numbers are 18 ms and 47 ms respectively. Thus the Foreman sequence, while having performance similar to Akiyo at 25 dB, grows in complexity much more rapidly. This greater overall complexity seems to minimize the difference between I, P, B, and T frames, especially for the CAVLC coding approach.

Mobile

Figure 8: Mobile Complexity-Distortion (15 FPS, CE). [PSNR (dB) versus decoder frames per second for the I, P, B, and T GoPs; one panel per entropy coding mode.]

Figure 9: Mobile Complexity-Distortion (15 FPS, PC). [Same axes and GoP legend for the PC platform.]

Overview

Table 5: Mobile Complexity Deltas over 25-45 dB
(Time delta: slowest minus fastest time per frame over the 25-45 dB range.)
          Time delta (ms)      Percent slowdown      % entropy diff.
          CABAC    CAVLC       CABAC    CAVLC        25 dB    45 dB
CE   I    51.2     51.5        129.1     74.7        73.8     32.5
     P    30.0     85.0         63.8    176.0         2.9     73.3
     B    37.1     87.0         78.0    155.7        17.3     68.6
     T    34.6     98.6         70.8    178.5        13.3     84.6
PC   I     9.8     11.6        107.5     90.2        41.0     29.3
     P     8.7     16.4        122.2    204.9        12.0     53.7
     B     8.3     16.0        103.3    190.2         4.2     48.8
     T     8.7     15.7        110.0    180.5         9.6     46.3

The plots and a tabular summary of results for the Mobile sequence are presented in Figure 8, Figure 9, and Table 5. Again, Mobile is a highly detailed sequence that is difficult to compress and contains a high degree of motion. It is by far the most complicated of the three test sequences used. In fact, real-time performance is possible only in select cases: below about 30 dB for CAVLC and below 36 dB for CABAC. This represents a significant complexity increase over Akiyo, where all sequences were real time, and Foreman, where most sequences were real time.

Handheld: CABAC & CAVLC

A key observation is that, for the first time, the I-GoP provides the lowest-complexity solution for 40-45 dB sequences using CAVLC entropy coding.
Although the CAVLC sequences are still slower than their CABAC counterparts, there is no clear explanation for the stronger performance of the I-GoP. This question will be revisited later when the detailed profiling and rate aspects of these tests are examined.

The 25 dB CABAC baseline behavior is also unexpected. The high motion prediction GoPs (B and T) are actually less complicated than their Akiyo and Foreman counterparts. In fact, Akiyo, the most compressible sequence, has the highest complexity of the three at 25 dB. However, the I and P GoPs are more complex for Mobile than for the previous sequences. This suggests that with high distortion the B and T GoPs are fairly insensitive to correlation, while the P and especially the I GoP, without temporal correlation, are more sensitive to sequence complexity.

Handheld vs. PC

Surprisingly, the overall pattern is that as distortion decreases, the handheld complexity increases more slowly than the PC complexity. For example, the net CABAC handheld complexity increase for the P-GoP is 64%, but a shocking 122% for the PC. The exception to this pattern is the CABAC I-GoP, where the PC increase is 108% and the handheld 129%. As noted before, the I-GoP is the least predictable GoP structure. The overall lower complexity increase for the handheld, however, suggests that the iPAQ is less sensitive than the PC to some algorithmic aspect related to distortion changes.

Mobile Compared Against Previous Sequences

The difficult Mobile sequence also produces a more linear response from the CABAC sequences than has been seen before: the relative performance of the four GoPs is similar regardless of the distortion, although the T and I GoPs gradually slip further behind the B-GoP's performance. (This linearity is not seen as strongly on the desktop, where the I-GoP is significantly slower than the others.) The CAVLC sequences have similar responses for the P, B, and T GoPs, while the I-GoP starts out much more complex but grows in complexity more slowly than the other GoPs. This is a reversal of the Foreman pattern, where the I, B, and T GoPs clustered and the P-GoP was less complex.

The Mobile sequence also reflects a much greater complexity increase as the distortion decreases. Focusing on the handheld as the primary experiments of interest, the 45 dB sequences grow in complexity roughly twice as much as the Foreman sequences did. For Foreman, the CABAC slowdown percentages for the I, P, B, and T GoPs were 64%, 50%, 27%, and 26% respectively, and 102%, 112%, 90%, and 86% for CAVLC. For the Mobile sequences, the equivalent CABAC numbers are 130%, 64%, 78%, and 81%, and for CAVLC 75%, 176%, 156%, and 179%. While the CAVLC I-GoP and the CABAC P-GoP do not reflect the same slowdown rates, the overall pattern is that Mobile slows down significantly more relative to its baseline than Foreman (as well as Akiyo, which experienced minimal complexity increase).

Trends and Observations

Several distinct patterns show themselves through all three sequences.

1. First, the different GoPs do show repeatable patterns. The P-GoP consistently provides the lowest complexity for almost any level of distortion. The B and T GoPs mirror each other's performance, with the double-B-frame pattern always being a bit slower than the single-B-frame pattern. Presumably, if motion prediction and thus B frames were significantly faster, the increase in T frames would result in slightly faster performance compared to the B-GoP.
The I-GoP is the most unpredictable of the four GoPs, ranging from much slower than any other GoP to being the least complex choice for high-quality CAVLC-coded Mobile sequences. However, significant improvement in the performance of B frames is clearly needed before the B or T-GoPs become a viable option. The I-GoP, with special attention to parsing, may begin to outperform the P-GoP for specific sequence choices.

2. Secondly, the results repeatedly show that Pentium simulations do not provide accurate predictions of handheld behavior. At times the results are very similar (e.g., CAVLC Akiyo), while at other points the slowest PC option is the best handheld option (e.g., CAVLC Mobile). In particular, this unpredictability is demonstrated by the I-GoP, which varies drastically in size and complexity based on the sequence and distortion level. From the high-level results, the B and T-GoPs are very similar, with the T-GoP being just a bit slower than the B-GoP on both platforms. Beyond that, while similarities exist, such as the strong P-GoP performance, the similarities differ for each sequence and entropy coding selection.

3. These differences suggest that algorithmic optimization at a language level may be inadequate to fully address the complexity issues. The lack of correlation between experiments on the PC and handheld platforms, despite compiler optimization for both, suggests that non-algorithmic aspects such as chip architecture, operating system, and compiler techniques play a significant role in the overall complexity of a given sequence. This disconnect emphasizes the need for multiple-platform complexity studies as well as better analysis tools for non-traditional platforms. (Footnote 8: As of this writing (April 2003), Intel has just released a profiling tool for the PXA2xx processors, the processor series used in the iPAQ; however, the tool requires that the processor be used with Intel's development hardware and does not work with other hardware.)

4. There is a strong correlation between the compressibility of a sequence and its general complexity, both on the handheld and on the standard computer. Akiyo is the fastest sequence, followed by Foreman and finally by Mobile. This correlation holds regardless of platform, entropy technique, or Group of Picture structure. The complexity-compression correlation is most visible in the lower-distortion sequences. Overall, the high-distortion sequences of the same GoP structure have similar complexities; as distortion drops, the correlation becomes more distinct.

5. The complexity increase (as distortion drops) is significantly higher with CAVLC entropy coding than with CABAC. This suggests that the parsing/entropy decoding method significantly influences the overall complexity. This theme will be explored in more depth in Chapter 4.

CHAPTER 4
BITRATE COMPLEXITY ANALYSIS

In analyzing the distortion-complexity plots shown in Chapter 3, the complexity variation is strikingly different based upon sequence choice as well as GoP type. In order to better understand the factors controlling the complexity results, a number of high-level timing probes were inserted into the code. While profiling programs are available for code built for traditional computers, equivalent software is not available for most mobile platforms, including the iPAQ. Thus a creative and non-intrusive approach to timing is necessary for understanding the program's performance.
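One lightweight approach of this kind is to accumulate elapsed high-resolution counter ticks per code section and convert to milliseconds only when reporting. The sketch below assumes the Win32/WinCE QueryPerformanceCounter interface and hypothetical section names; the thesis does not specify which timing API its probes used, so this is an illustration of the general idea rather than the actual implementation.

    #include <windows.h>

    // Hypothetical section identifiers; the thesis partitions decoding into roughly
    // six areas (overhead, frame decoding, macroblock decoding, macroblock reading, ...).
    enum ProbeSection { PROBE_READ_MB, PROBE_DECODE_MB, PROBE_DEBLOCK, PROBE_COUNT };

    static LONGLONG g_accum[PROBE_COUNT];   // accumulated ticks per section
    static LONGLONG g_start[PROBE_COUNT];   // tick value when the section was entered

    void ProbeStart(ProbeSection s)
    {
        LARGE_INTEGER now;
        QueryPerformanceCounter(&now);      // read the high-resolution tick counter
        g_start[s] = now.QuadPart;
    }

    void ProbeStop(ProbeSection s)
    {
        LARGE_INTEGER now;
        QueryPerformanceCounter(&now);
        g_accum[s] += now.QuadPart - g_start[s];
    }

    double ProbeMilliseconds(ProbeSection s)
    {
        LARGE_INTEGER freq;
        QueryPerformanceFrequency(&freq);   // counter ticks per second
                                            // (the thesis reports about 3.69 million on its iPAQ)
        return 1000.0 * (double)g_accum[s] / (double)freq.QuadPart;
    }

Wrapping, say, the macroblock-reading routine between ProbeStart(PROBE_READ_MB) and ProbeStop(PROBE_READ_MB) costs only a pair of counter reads per call, which is why the probes described next were restricted to infrequently called functions.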
As mentioned earlier, the iPAQ timings do not behave like miniature PC results, so profiling the PC is not an entirely accurate substitute. The remainder of the chapter is organized as follows. First, the timing probes used to measure the program structure are discussed. Then the timing results for the primary four GoP structures are presented by sequence. Finally, the chapter concludes with an overall summary. For normalization purposes, results are generally presented in milliseconds per frame. For the fifteen-frame-per-second sequences, 60 ms per frame or better is required to achieve real time; the 60 ms to decode leaves the video player with roughly 7 ms to display each frame, which is the approximate display rate.

Timing Probes

A total of seven high-precision timing probes were inserted into the decoding software. These probes were intentionally placed in infrequently called functions in order to minimize the complexity impact of the probes themselves. Initial tests indicated that any complexity change was well within the decoder's standard variance. Special attention has been given to the measurement descriptions to note "what is" and "what is not" included in each probe. Probes are called a total of roughly 200 times per frame.

Three estimates were derived for the cost of a probe. The first, measuring the time before and after a single probe, generated a worst-case number. The second, timing many probes and averaging to obtain a per-probe time, generated a more likely estimate but ignored any advantage gained from cache coherency. Neither estimate included the benefit that interleaving non-probe instructions could have on the CPU pipeline and memory. The final estimate was obtained by comparing decoder runs without timers against runs with the timers implemented. The estimated worst-case time for a probe in this iPAQ environment is 40 ticks, with 3.69 million ticks occurring per second. For the fifteen-frame-per-second sequences, that translates into an addition of roughly one-third of a second to the total time (roughly 200 probes per frame over about 150 frames, at 40 ticks each, is 1.2 million ticks). The averaged estimate, closer to 13 ticks per probe, adds about one-tenth of a second. Actual experimentation noticed no significant difference between the timing implementations beyond the standard variance. This observation suggests that the actual time change was under one-tenth of a second for most runs.

Actual Timing Probes

Equation 2 shows the relationship between each of the probes used, as well as the four additional calculated times. Each of the specific probes is discussed at greater depth below.

T_{Tot} = T_{Over} + \sum_{\text{I,P,B frames}} T_{DecFr}

T_{DecFr} = T_{ReadNewFr} + T_{FrInit} + T_{ProcMB} + T_{PostFrDec} + T_{DeblockFr} + T_{Unk}

T_{ProcMB} = \sum_{\text{all MBs}} \left( T_{ReadMB} + T_{DecodeMB} \right) + T_{MBOth}

T_{Non\text{-}ReadMB} = T_{ProcMB} - \sum_{\text{all MBs}} T_{ReadMB}

Calculated, not timed, probes: T_{Unk}, T_{Over}, T_{MBOth}, T_{Non-ReadMB}

Equation 2: Timing Probe Interactions

Total Time (Tot): This is the total time spent decoding a complete sequence (e.g., approximately 150 frames for 15 fps), excluding decoder initialization. (All original sequences have 300 frames, so a perfect 15 fps sequence is 150 frames. However, due to the GoP structure, some sequences end at 148 or 149 frames, because the next frame would be a B frame with no future I or P frame to use as a reference. The data presented is normalized by frame count to compensate.) Any comparison to the total decoder time is against this number. There is one call to this timer for the whole program.

Decode Frame (DecFr): The time to decode a frame, ignoring a slight amount of memory manipulation and generic cleanup (calculated below as Overhead). The frame times are totaled by frame type (I, P, or B) and are used for any calculation involving picture-type total times. The sum of all three picture-type totals is slightly less than the total time, due to the overhead.

Process Macroblocks (ProcMB): This is the total time it takes to decode a frame, including reading the pre-macroblock information, reading macroblocks, and decoding them. It ignores deblocking time (measured elsewhere) as well as some frame cleanup. The probe is called once per frame, and this section is expected to occupy a significant portion of the total run time. (This probe section corresponds to the "slice" breakdown within the reference code; thus some of the graphs reference this as a slice rather than ProcessMB. The discussion uses ProcessMB for clarity.)

Pre-Decode Frame Init (FrInit): This probe measures the time to prepare for decoding a frame. It includes initializing frame flags, entropy settings, and error correction, as well as ordering the frame buffers. This timing section, called once per frame, is expected to take a very small percentage of the program.

Read New Frame (ReadNewFr): This probe measures the time to read and process the pre-macroblock flow control information for each frame. This time is expected to be minimal, and the probe is called once per frame.

Read Macroblock (ReadMB): This probe, called once per macroblock (ninety-nine times per QCIF frame), measures the amount of time it takes to read a macroblock, primarily entropy decoding. However, the decoder also uses an inverse transform on intra-coded ("I") 16x16 macroblocks to extrapolate the DC coefficients for the new intra prediction modes. On selected sequences that were profiled for function counts, this rare transform was called fewer than a thousand times (compared to over 300,000 calls to the transform in DecodeMB).

Decode Macroblock (DecodeMB): The time spent taking the differential coefficients and motion coefficients read during the Read Macroblock phase, finding the motion estimate, and calculating the new macroblock coefficients using the motion vectors. Like Read Macroblock, the probe is called once per macroblock.

Post-Decode Frame (PostFrDec): This probe measures the clean-up time after successfully reading and decoding the macroblocks in a frame. The time here includes resetting various flags, motion information copying, and motion vector storage. This time is expected to be negligible but is included for completeness. The probe is called once per frame.

Deblock Frame (DeblockFr): This probe measures the time taken to deblock the frame (apply the deblocking filters to the decoded image and remove some of the artifacts and blockiness). This probe is called once per frame, and deblocking is expected to be a significant task.

Calculated Timing Probes

The following additional measurements were calculated from the previous probes to provide additional information about the timing within the decoder. All of the timings below use the previous probes as their foundation.

ProcessMB excluding Read Macroblock (Non-ReadMB): This is the time it took to decode the macroblocks excluding the time spent reading macroblocks. This time includes ProcessMB Other (see below) as well as Decode Macroblock.

Process Macroblocks Other (MBOth): The difference between the total time to decode all macroblocks and the measured subparts of macroblock decoding. The two subparts, read macroblock and decode macroblock, were added together and then subtracted from the total macroblock decoding time.

Overhead (Over): The difference between the total time spent decoding all of the frames and the total decoding time, ignoring decoder initialization. Unlike the other measures, overhead is calculated as picture-type independent, because only a single total time is available; thus there is no measure of overhead for different picture types.

Unknown (Unk): The time spent decoding a frame but not spent in any known subsection. This time is calculated by summing the following parts: frame deblocking, process macroblocks, read new frame, post-frame decoding, and pre-decode frame init; the sum is then subtracted from the total time spent decoding frames. (Thus overhead is not included as unknown time.)

Akiyo

This section presents the timing figures for the timing sections that show significant variation in complexity based upon bitrate for the Akiyo sequence. The overall rate-complexity plots are shown in Figure 10; the bottom plot is a magnified section containing the non-I GoPs from the top plot. (As mentioned before, the rate limit in a plot title is for the average size of all frames, not for the average size of a specific frame type. The rate axis corresponds to either the average size of all frames ('-A') or of a specific frame type ('-i', '-p', '-b') as determined by the plot type.) Many of the probes, shown in Table 6, indicate that bitrate has an insignificant impact upon their performance. The slight variances shown are neither linear nor particularly related to rate. In all cases except for post-frame decode, they include the timings for all four GoP structures (I, P, B, and T), and the variance for an individual GoP type is much smaller. Over a range of 0 to 35 kilobits per frame, the total variance is only 2.3 ms for CABAC and 3.4 ms for CAVLC (due to read new frame having a higher variance).

In comparison, the deblocking function is clearly rate related. (The relative complexity change with rate is rather small, and most likely is related to how many special cases the deblocker is dealing with, which allow the deblocker to do fewer calculations.) The deblocking rate-complexity plots are shown in Figure 11 and then broken down by picture type in Figure 12. The bottom section of Figure 11 shows the magnified view of the non-I GoPs. The top half of Figure 12 looks at the I frames from all four GoPs; the bottom half looks at the P frames from the non-I GoPs. Figure 11 suggests the linear behavior clearly illustrated by Figure 12. The key point for the Figure 12 I frames is how similar the results are for all four GoPs: regardless of how many I frames were used in a sequence, the deblocking time per kilobit remains constant. The B frames, although not shown, also follow the P-frame pattern of increasing complexity with rate. The CAVLC plots, also not shown, mirror the figures shown. This result is expected, since entropy coding should not have any effect on the decoder outside of entropy reading.

Presented in Figure 13 are the complexity-rate results for read macroblock. Comparing Figure 13 with Figure 10, it is clear that much of the decoding complexity gain comes from the ReadMB, or parsing, portion of the decoder. The decoder increase is 14 ms while the parsing increase is 15 ms; for the P-GoP the figures are 6 ms vs. 4 ms, and for the B-GoP 3.5 ms vs. 4 ms. The difference is even more noticeable in Figure 14, which shows the CAVLC ReadMB timings over the 0-7 kilobits-per-frame range. For the four GoPs (I, P, B, T), the total time increases are 12 ms, 6 ms, 7 ms, and 7 ms, and the corresponding ReadMB increases are 6 ms, 10 ms, 11 ms, and 11 ms. Since the deblock frame probe decreases with increased rate for the I-GoP, the I-GoP total difference is somewhat less than the ReadMB increase. Likewise, since Deblock Frame increases with rate for P, B, and T, the difference is somewhat larger than purely the parsing increase. In each case, however, parsing accounts for over half of the net increase in complexity.

Summary

The timing analysis of the Akiyo decoding rate-complexity suggests several trends. First, many sections of the decoding process are rate invariant: frame reading, initialization, other MB decoding, post-frame decode, overhead, and unknown time. (Post-frame decoding does, however, have a very specific GoP-based complexity.) Second, parsing and deblocking complexity are both linked to rate, but parsing has an almost linear correlation to bitrate. The parsing complexity increase corresponds strongly to the net complexity increase in the decoder. Finally, the analysis suggests, as expected, that entropy coding has no effect on the decoding complexity outside of the parsing process.

Akiyo Tables and Figures

Table 6: Akiyo Rate Invariant Probes (15 fps, CE)
                       Time range (ms)         Variance (ms)   Average (ms)
                       Fastest    Slowest
CABAC:
  Read New Frame       1.1        2.0           0.9             1.6
  Overhead             0.25       0.63          0.38            0.45
  Unknown              0.01       0.18          0.17            0.1
  Decode Frame Init    3.87       4.1           0.23            3.97
  Process MB Other     2.9        3.4           0.5             3.1
  Post Frame Decode    within +/- 0.1 ms of average times I: 2.4, P: 2.5, B: 1.3, T: 0.8
CAVLC:
  Read New Frame       1.0        2.7           1.7             1.5
  Overhead             0.2        0.6           0.4             0.4
  Unknown              0.01       0.21          0.2             0.125
  Decode Frame Init    3.67       3.87          0.2             3.75
  Process MB Other     2.88       3.55          0.77            3.30
  Post Frame Decode    within +/- 0.1 ms of average times I: 2.4, P: 2.5, B: 1.3, T: 0.8

Figure 10: Akiyo Rate-Complexity (15 FPS, CE). [Time per frame (ms) versus kilobytes per frame for the I, P, B, and T GoPs; full range and 0-1 Kbytes/frame detail.]

Figure 11: Akiyo Deblock Rate-Complexity (15 FPS, CE).

Figure 12: Akiyo Deblock Rate-Complexity by Picture Type (15 FPS, CE).

Figure 13: Akiyo ReadMB Rate-Complexity (15 FPS, CE, CABAC).

Figure 14: Akiyo ReadMB Rate-Complexity (15 FPS, CE, CAVLC).

Figure 15: Akiyo ProcessMB exl. ReadMB Rate-Complexity (15 FPS, CE, CABAC).

Foreman

Foreman, despite being a more complex sequence, exhibits the same probe invariants. Table 7 shows the same rate-invariant probes noted for the Akiyo sequence. The range, variance, and average times are almost unchanged when compared to the Akiyo table. This further suggests that deblocking, decoding, and parsing contribute most of the complexity change in the decoder.

Figure 16 presents the rate-complexity plots for the Foreman sequence, once again with a magnified look at the P, B, and T GoPs. For Akiyo with these three GoPs, the 45 dB distortion point was at about 7 Kb per frame; with Foreman it is between 25 and 30 Kb. The figure also emphasizes the sharp drop in complexity for the B and T GoPs at low bitrates. This drop will be discussed further when the ProcessMB decoding probe is examined.

Figure 17 shows the deblocking rate-complexity for the CABAC sequences. The CAVLC plots are almost identical, as was true for the Akiyo sequence. Deblocking remains a fairly rate-independent task, although it is clearly rate dependent to a small degree. Compared to the Akiyo sequence, the I-GoP rates are within a quarter of a second of each other. However, for the same rate, the other three GoPs are roughly 1 ms more complex for the Foreman sequence. This difference suggests that deblocking is slightly sequence dependent. Figure 18 presents the deblocking complexity by picture type for the I and P frames. As with Akiyo, the I frames take the same time to deblock regardless of the sequence type they are used in. The P frames, however, vary in deblocking time by GoP, especially at lower rates. In addition, while the Akiyo I-frame deblock rate is roughly equivalent to the Foreman one, the Foreman P frames are more complex to deblock, regardless of the GoP.

Figure 19 presents the parsing complexities for the Foreman sequence. Again, the parsing complexity is roughly linear, although the slope depends on the GoP structure.
The I-frame parsing rates are exactly the same as the Akiyo rates: all of the points from both sequences fall on a single line for each GoP. The P and B frames reflect the same linear trend, although the Foreman sequence is about 0.5 ms (10%) more complex for both frame types. As with Akiyo, Figure 19 clearly shows the linear relationship between rate and complexity. Figure 20 shows the same linear pattern for the CAVLC parsing speed.

As with Akiyo, the parsing complexity increase represents a significant contribution to the overall complexity increase. For example, the CABAC I-GoP increases in complexity by 24 ms, with the parsing increase contributing 28 ms to that time; other decreases contribute the remaining -4 ms. For the CAVLC P-GoP, the complexity increase is 35 ms, and the parsing contribution is over 30 ms. With the more complicated Foreman sequence, less compression, and longer frames, the proportion of time spent parsing grows. With Akiyo, the largest parsing change was on the order of 10 ms; with Foreman it is four times that number.

Figure 21 shows that the non-parsing complexity is steady, with the exception of a drastic complexity shift for the B and T GoPs between 0 and 5 Kb per frame. This shift explains the unusual drop in overall complexity in the Foreman results, although the coding mechanism change responsible for the drop is unknown. It is worth noting that the actual decoding times, excluding parsing, are the same between the Foreman and Akiyo I and P-GoPs. The Foreman B and T-GoPs actually have lower non-parsing decode time than Akiyo. One possibility for this difference is that Akiyo, with higher compressibility, can make better use of motion vectors, which then take longer to decode; the weaker Foreman compression may simply decode more quickly. However, even excluding parsing, the I-GoP still decodes twice as fast as the P-GoP, and 2.5 times faster than the B and T GoPs.

Summary

The Foreman sequence, with a broader bitrate range than the Akiyo sequence, maintains very similar baseline decoding values aside from parsing. Despite being a more motion-intensive sequence, the Foreman B and T GoPs actually decoded slightly faster than Akiyo when parsing time was ignored. The rate-invariant sections carried over from the Akiyo sequence. Parsing contributed most of the complexity gain, and this is clearer than for the Akiyo sequence because of the wider bitrate range.
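The thesis characterizes the near-linear relationship between parsing time and bitrate visually from the plots. If one wanted to put a number on the per-GoP slope (ms of parsing per kilobyte), a simple least-squares fit over the per-QP measurements would suffice; the sketch below is such an illustration under an assumed data layout and is not part of the thesis methodology.

    #include <vector>
    #include <utility>

    // Least-squares fit of parse time (ms/frame) against rate (kilobytes/frame),
    // one sample per QP setting of a given sequence/GoP/entropy combination.
    // Returns {slope in ms per kilobyte, intercept in ms}.
    std::pair<double, double> fitParseTimeVsRate(
            const std::vector<std::pair<double, double>>& samples)  // {kbPerFrame, msPerFrame}
    {
        double n = static_cast<double>(samples.size());
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (const auto& s : samples) {
            sx  += s.first;
            sy  += s.second;
            sxx += s.first * s.first;
            sxy += s.first * s.second;
        }
        double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        double intercept = (sy - slope * sx) / n;
        return {slope, intercept};
    }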
Foreman Tables and Figures

Foreman CE            Time Range (ms)         Time Variance   Average Time
                      Fastest     Slowest     (ms)            (ms)
CABAC:
  Read New Frame      1.1         2.1         1.0             1.6
  Overhead            0.24        0.62        0.38            0.45
  Unknown             0.01        0.23        0.22            0.1
  Decode Frame Init   3.87        4.15        0.28            3.97
  Process MB Other    2.9         3.75        0.85            3.2
  Post Frame Decode   +/- 0.1 ms from average times I: 2.4; P: 2.5; B: 1.3; T: 0.8
CAVLC:
  Read New Frame      1.2         2.1         0.9             1.7
  Overhead            0.2         0.63        0.43            0.45
  Unknown             0.02        0.2         0.18            0.12
  Decode Frame Init   3.67        3.85        0.18            3.75
  Process MB Other    3.1         4.9         1.8             3.80
  Post Frame Decode   +/- 0.1 ms from average times I: 2.3; P: 2.4; B: 1.2; T: 0.8

Table 7: Foreman Rate Invariant Probes (15 fps, CE)

Figure 16: Foreman Rate-Complexity (15 FPS, CE)

Figure 17: Foreman Deblock Rate-Complexity (15 FPS, CE)

Figure 18: Foreman Deblock Rate-Complexity by Picture Type (15 FPS, CE)

Figure 19: Foreman ReadMB Rate-Complexity (15 FPS, CE, CABAC)

Figure 20: Foreman ReadMB Rate-Complexity (15 FPS, CE, CAVLC)

Figure 21: Foreman ProcessMB exl. ReadMB Rate-Complexity (15 FPS, CE, CABAC)

Mobile

This section breaks down the rate-complexity response of the decoder for the Mobile sequence. It deals first with the time-invariant probes, then examines deblocking, parsing, and decoding excluding parsing. As before, the overall rate-complexity is shown in Figure 22.
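Each of the timing numbers in this chapter comes from a probe wrapped around a single decoder stage and reported per frame. As a rough sketch of what such a probe amounts to, the fragment below times a hypothetical stage with the standard C clock() call and averages it over the frames of a run; the stage name, frame count, and the use of clock() are illustrative assumptions rather than the instrumentation actually added to the ported decoder.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for one decoder stage (e.g., deblocking a frame). */
static void deblock_frame(void)
{
    /* ... decoder work would happen here ... */
}

int main(void)
{
    const int num_frames = 150;           /* e.g., 10 s of video at 15 fps */
    double total_ms = 0.0;
    int i;

    for (i = 0; i < num_frames; i++) {
        clock_t start = clock();           /* probe start */
        deblock_frame();                   /* stage being measured */
        clock_t end = clock();             /* probe end */
        total_ms += 1000.0 * (double)(end - start) / CLOCKS_PER_SEC;
    }

    /* Per-frame figure comparable to the ms/frame axes in the figures. */
    printf("deblock: %.2f ms/frame over %d frames\n",
           total_ms / num_frames, num_frames);
    return 0;
}
```

On a real device, a higher-resolution timer would likely be needed, since clock() granularity can be coarse relative to per-stage times of a few milliseconds.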
Table 8 presents the same time-invariant probes shown for the Akiyo and Foreman sequences. Examining this table against the previous tables (Table 6 and Table 7), it is clear that even with a highly complex sequence these measures remain highly stable. One exception, the read-new-frame probe, slows down for Mobile at high rates. Excluding the Mobile I-GoP, the highest value for read new frame is 2.3 ms (for CABAC). Even this increase over the Foreman results occurs at the higher bitrates of the Mobile sequences. This again suggests that reading the new frame information is linked to the bitrate, but not in a significant way. The other notable increase, CAVLC ProcessMB Other, is caused by higher times for the B and T GoPs at high bitrates. Even here, however, the total increase is slight (0.6 ms over Foreman) despite a doubling of the bitrate. (The 45 dB B and T Foreman sequences took around 30 Kbytes per frame; the Mobile sequences around 70 Kbytes.)

The deblocking rate-complexities are presented in Figure 23, and the specific complexity for the I and P frames is shown in Figure 24. As before, the deblocking complexity for the I and P frames is GoP independent, although individual experiment variation can be seen. Figure 23 shows the deblocking complexities following a similar pattern to the Foreman series. The P, B, and T GoPs start out with low complexity and gradually increase until the rate reaches about 30 Kbytes per frame. For higher rates (beyond the highest Foreman rate), the deblocking complexities drop slightly. The I-GoP starts at a high complexity, roughly the same complexity as the rate-equivalent Foreman, and continues dropping as the rate increases. This complexity response is similar to the previous two sequences and seems to be more closely related to bitrate than to sequence complexity.

The parsing complexity graphs are presented in Figure 25 for CABAC (compare against Figure 22) and in Figure 26 for CAVLC, which includes the overall time as well as the parsing time. Here the relation between parsing and overall complexity can be clearly seen. For example, the net increase in complexity for the CABAC I-GoP is roughly 45 ms; the parsing increase is also roughly 45 ms. For the CAVLC P-GoP, the net increase is 90 ms, with a parsing increase of over 75 ms. The contribution of parsing to the overall complexity increase is further highlighted by Figure 27, which shows a very steady complexity for non-parsing decoding. The exception is a slight complexity increase over the 0-10 Kbytes per frame portion of the B and T GoPs. Interestingly, the B and T non-parsing complexity is slightly less than the Akiyo complexity while slightly more than the Foreman GoPs. All three sequences show extremely close decoding complexities for the I and P GoPs once parsing time is excluded.

Summary

Overall, the Mobile sequence continues the pattern of parsing being a significant factor in the decoding complexity, especially with this implementation of the decoder. Because the Mobile sequence achieves poorer compression than the Akiyo or Foreman sequences, it spans a broader set of bitrates, which highlights the complexity dependence on rate. Deblocking has a small correlation to rate, as do a handful of other portions of the decoder. Aside from parsing, the frame decoding complexities are very similar to the Akiyo and Foreman frames, suggesting that sequence complexity does not significantly impact the non-parsing complexity.

Mobile Tables and Figures
Mobile CE             Time Range (ms)         Time Variance   Average Time
                      Fastest     Slowest     (ms)            (ms)
CABAC:
  Read New Frame      1.5         3.5         2.0             2.0
  Overhead            0.21        0.60        0.39            0.45
  Unknown             0.01        0.23        0.22            0.1
  Decode Frame Init   3.88        4.13        0.25            3.97
  Process MB Other    2.93        3.90        0.97            3.3
  Post Frame Decode   +/- 0.1 ms from average times I: 2.4; P: 2.5; B: 1.3; T: 0.8
CAVLC:
  Read New Frame      1.5         3.5         2.0             2.3
  Overhead            0.25        0.63        0.38            0.43
  Unknown             0.02        0.22        0.20            0.11
  Decode Frame Init   3.66        3.90        0.24            3.76
  Process MB Other    3.4         5.5         2.1             4.0
  Post Frame Decode   +/- 0.1 ms from average times I: 2.3; P: 2.4; B: 1.2; T: 0.8

Table 8: Mobile Rate Invariant Probes (15 fps, CE)

Figure 22: Mobile Rate-Complexity (15 FPS, CE)

Figure 23: Mobile Deblock Rate-Complexity (15 FPS, CE)

Figure 24: Mobile Deblock Rate-Complexity by Picture Type (15 FPS, CE)

Figure 25: Mobile ReadMB Rate-Complexity (15 FPS, CE, CABAC)

Figure 26: Mobile ReadMB Rate-Complexity (15 FPS, CE, CAVLC)

Figure 27: Mobile ProcessMB exl. ReadMB Rate-Complexity (15 FPS, CE, CABAC)

Conclusions

An overview of the relative complexities of decoding macroblocks compared to parsing rates for this implementation can be seen in Table 9 (the same table shown in the introduction). The rates were found from the slope of the line between the 25 dB and 45 dB complexity points for each category.
The range represents the different sequences, and numbers are given for each picture type, not group of picture structure. That is, for the Mobile P frames, P frames were combined from the Mobile P, B, and T GoP structures. As can be seen from the figures presented earlier in this chapter, the linear assumption is not entirely accurate. For example, the complexity per bit tends to decrease at high bitrates. However, linearity is a close approximation and provides a practical summary of the data.

The table shows very consistent rates for decoding I and P-frames, with more variance in the time for B-frame decoding. The I-frames clearly decode the fastest, followed by P frames; B-frames follow a distant third. The same relationship holds true for the parsing rates as well. The table also highlights the poor performance of CAVLC parsing in this implementation.

The comparative performance boost from the Pentium IV is shown as well. The clock rate of the Pentium is roughly three times greater than that of the iPAQ, and the speed multiplier gives a relative sense of performance increase, not a direct comparison of capability. If the complexity relationship were linear, the multiplier would be roughly the same for each category. Instead, I-frames decode about four times faster, P-frames around five times faster, and B-frames a bit above six times faster. The CABAC parsing, in contrast, is fairly consistent, running four and a half to five times faster on the PC. The CAVLC parsing, on the other hand, ranges from the roughly four times faster seen for CABAC to something like ten times faster for B frames. This range in speedup highlights that the Pentium's optimizations are not applied uniformly across the decoding process, but boost specific sections of the decoder much more than others.

Frame Type          iPAQ 400 MHz      Pentium IV PC, 1.5 GHz    Speedup (Times Faster)
                                                                Low End     High End
Decoding, exl. Parsing (ms/frame)
  I                 11.5-13.5         2.75-3.10                 4.18        4.35
  P                 21.5-22.5         3.95-4.45                 5.44        5.06
  B                 33.0-48.5         5.40-7.10                 6.11        6.83
Parsing (bytes/ms), CABAC Coding
  I                 250-280           1100-1300                 4.40        4.60
  P                 160-250           720-1100                  4.50        4.40
  B                 80-210            450-1000                  5.63        4.76
Parsing (bytes/ms), CAVLC Coding
  I                 135-245           600-1130                  4.44        4.61
  P                 50-100            380-520                   7.60        5.20
  B                 30-80             360-500                   12.00       6.25

Table 9: Complexity Summary over 25-45 dB range

The timing analysis of the three separate sequences based upon complexity and bitrate suggests that much of the decoding process is both rate and sequence independent. Improvements to these sections are likely to improve any video sequence equally. Two sections, deblocking and parsing, show a correlation between complexity and bitrate. The deblocking correlation with rate appears extremely limited, with the I-frames deblocking faster as more bits per frame are used. The other frames experience a slight complexity increase of roughly 2 ms over the 0 to 60 Kbytes per frame range. However, this increase is relatively insignificant compared to the parsing complexity increases. I-frames are three to four times slower to deblock than other frames at high distortion; at low distortion, all the GoPs take about the same amount of time to deblock. The parsing complexity, while it varies by frame type and, to a smaller degree, by sequence, overall shows a strong linear correlation to bitrate. For poorly designed parsing code, this portion of the complexity easily dominates the decoding complexity.
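Two of the calculations behind Table 9 are simple enough to state directly: each parsing rate is the slope of a line through the two operating points, and each speedup is the ratio of the slower platform's figure to the faster one's. The sketch below illustrates both; the operating-point numbers passed to parsing_rate_bytes_per_ms are made up for illustration, while the 11.5 ms and 2.75 ms decode times are taken from the I-frame decoding row of Table 9. The function name is an illustrative assumption, not code from the decoder or analysis scripts.

```c
#include <stdio.h>

/* Rate found as the slope between two operating points, as described in the
 * text: bytes parsed per millisecond between the 25 dB and 45 dB settings. */
static double parsing_rate_bytes_per_ms(double bytes_low, double ms_low,
                                         double bytes_high, double ms_high)
{
    return (bytes_high - bytes_low) / (ms_high - ms_low);
}

int main(void)
{
    /* Illustrative 25 dB and 45 dB operating points for one picture type. */
    double rate = parsing_rate_bytes_per_ms(500.0, 4.0, 7000.0, 30.0);

    /* Speedup columns of Table 9: ratio of iPAQ to Pentium IV decode time,
     * using the I-frame low-end figures from the table (11.5 vs 2.75 ms). */
    double speedup_low = 11.5 / 2.75;   /* = 4.18x */

    printf("parsing rate: %.1f bytes/ms\n", rate);
    printf("I-frame decode speedup (low end): %.2fx\n", speedup_low);
    return 0;
}
```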
Comparing the three sequences strongly suggests that bitrate, not sequence complexity, is the primary factor in decoding time. Aside from parsing, the baseline decoding complexity is extremely level, with the P-GoP being about twice as complex as the I-GoP, and the B and T GoPs being about three times more complex than the I-GoP. This complexity difference suggests that, in addition to parsing improvements, significant room for improvement exists in the motion prediction implementation.

CHAPTER 5

CONCLUSION

Summary

The major contributions of this research include highlighting the performance difference between computers and mobile devices; emphasizing that parsing, especially for poorly compressed sequences, can be a controlling complexity factor; and providing a framework against which to compare other handheld performance.

The complexity response between PCs and handhelds, as discussed in Chapter 3, is significant, with complexity ratios varying widely with both the entropy coding mode and the picture type. The implication is that the complexity models researched and published for desktop computers, especially Pentiums, are not entirely transferable to handheld devices. The handheld device architecture is sufficiently different that, while algorithm optimizations are unlikely to hurt performance, equivalent improvement should not be assumed. Additionally, separate handheld algorithm improvements may be possible which significantly impact the handheld devices without corresponding improvement for the PCs. One example of this type of change is the memory management change, which removed approximately one-third of the handheld decoding time but only an estimated two percent of the PC's. Additional professional tools are needed to profile decoder performance on handheld devices.

The primary complexity-controlling factor for both entropy coding methods was the read macroblock, or parsing, portion of the code. Despite the lack of research into entropy coding complexity constraints, practical, near-optimal implementations for a variety of platforms, aside from hardware additions, are critical to prevent parsing from becoming the controlling aspect of the sequence complexity. Despite the focus on traditionally complex aspects such as transforms and deblocking, which indeed remain problematic, parsing requires attention. This attention is especially needed if handhelds are to take advantage of possessing a higher bandwidth-to-computation-power ratio.

In general, the P-GoP provides a balance between the high bitrate of a mostly I-frame sequence and the high complexity of B-frames due to the double motion vector. The P-GoP provides a high level of compression, significantly reducing the bitrate without the added overhead of a second motion vector. When parsing is ignored, the decoding complexity of the P-frame is significantly higher than that of the I-frame, and a more tightly optimized entropy decoder might reduce the I-frame parsing cost to the point where I-frames become the less complex option overall. Significant work on reducing the complexity of motion compensation would be needed, however, before B frames become a viable option.

Future Directions

The research into complexity-distortion suggests that additional research on several fronts would provide further valuable insight into video complexity optimization for handheld devices.

One specific research direction is the entropy symbol parsing complexity of non-hardware-specific solutions.
While hardware solutions capable of processing on the order of 50 to 100 million symbols per second have been introduced, equipping handheld devices with dedicated entropy decoding processors may not be a practical solution.

There is a clear need for better profilers to do architectural analysis of handheld devices. While professional solutions such as Intel's VTune exist for PCs, equivalent-grade tools do not exist for handheld devices. Despite the problems in implementing such tools, the ability to easily probe specific functions and optimizations in terms of complexity would be a useful step toward real-time video optimization.

A third research aspect is additional investigation into architecture-specific behavior and ways of utilizing the operating system and architectural features to further reduce complexity. While this research ignored operating system interaction, that interaction may not be negligible. Additionally, while neither of the two compilers tested showed superior performance, the optimization quality of handheld compilers is likely to be lower than that of their PC counterparts. The lack of a complexity difference between the Intel and Microsoft compilers further suggests such improvements may be possible.