This is to certify that the thesis entitled COMPLEXITY DISTORTION OPTIMIZATION FOR VIDEO OVER WIRELESS DEVICES presented by Alan Paul Ray has been accepted towards fulfillment of the requirements for the M.S. degree in Electrical and Computer Engineering.

Major Professor's Signature

Date

MSU is an Affirmative Action/Equal Opportunity Institution


COMPLEXITY DISTORTION OPTIMIZATION FOR VIDEO OVER WIRELESS DEVICES

By

Alan Paul Ray

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

Department of Electrical and Computer Engineering

2004


ABSTRACT

COMPLEXITY DISTORTION OPTIMIZATION FOR VIDEO OVER WIRELESS DEVICES

By

Alan Paul Ray

The introduction of powerful compact wireless devices (such as handheld computers) raises new and unexplored aspects of complexity-distortion optimization for video. Historically, the computational power of powerful platforms (e.g., computers) has exceeded the available communication bandwidth over emerging networks such as the Internet and wireless networks. Hence, improved performance was achieved by reducing the required transmission bitrate of video (through higher compression) and by increasing the computational complexity. Handheld devices do not follow the traditional balance of bandwidth and power; computational power is at a premium while bandwidth is comparatively more available. This thesis investigates the complexity-distortion options for a state-of-the-art video compression standard, H.264/JVT, which is an estimated three times more complex than the previous standards. The research focuses on analyzing the video standard's complexity for one of the most popular handheld devices, the iPAQ. Operational complexity-distortion and rate-complexity curves are presented for a number of Group of Picture (GoP) structures and entropy coding modes, and compared against Pentium IV performance. The experiments include some of the earliest comparisons between JVT's entropy coding modes, context-adaptive binary arithmetic coding and context-adaptive variable-length coding.


TABLE OF CONTENTS

LIST OF TABLES ............................................................................................................. v
LIST OF FIGURES ........................................................................................................... vi

CHAPTER 1
INTRODUCTION .............................................................................................................. 1
Problem Background ........................................................................................................ 1
Prior Work and Motivation ............................................................................................... 2
Objectives ......................................................................................................................... 4
Approach ........................................................................................................................... 5
Contributions .................................................................................................................... 5
Thesis Overview ............................................................................................................... 7

CHAPTER 2
BACKGROUND ................................................................................................................ 8
Standard Encoder Model ................................................................................................... 8
    Input Frame ................................................................................................................... 9
    Motion Estimation & Prediction Error ........................................................................ 10
    Frequency Transform & Quantization ........................................................................ 12
    Decode Frame Prediction ............................................................................................ 13
    Entropy Coding ........................................................................................................... 13
JVT Decoder Implementation ......................................................................................... 14
    Input Decoding ............................................................................................................ 14
    Dequantization and Hadamard Transform .................................................................. 17
    Motion Prediction ........................................................................................................ 18
    Deblocking (Loop Filter) ............................................................................................. 21

CHAPTER 3
IMPACT OF GROUP OF PICTURE TYPES ON REAL TIME PERFORMANCE ...... 23
Hardware Selection ......................................................................................................... 24
    Handheld device - iPAQ: ............................................................................................ 25
    PC Platform: ................................................................................................................ 25
Software Selection .......................................................................................................... 25
    Operating System ........................................................................................................ 26
    Compiler and Software Development ......................................................................... 27
    JVT Decoder ............................................................................................................... 28
    JVT Encoder ............................................................................................................... 31
    JVT Player .................................................................................................................. 32
    Experimental Platform ................................................................................................ 33
Experimental JVT Video Sequence Settings .................................................................. 34
    Experiment Variables: ................................................................................................ 34
    Experiment Constants: ................................................................................................ 38
Primary (I,P,B,T) GoP Results by Complexity-Distortion: ............................................ 39
    Plot Specifications ...................................................................................................... 41
    Akiyo ........................................................................................................................... 42
    Foreman ...................................................................................................................... 47
    Mobile ......................................................................................................................... 52
Trends and Observations ................................................................................................. 56

CHAPTER 4
BITRATE COMPLEXITY ANALYSIS .......................................................................... 59
Timing Probes ................................................................................................................. 59
    Actual Timing Probes ................................................................................................. 60
    Calculated Timing Probes ........................................................................................... 63
Akiyo ............................................................................................................................... 64
    Summary ..................................................................................................................... 66
    Akiyo Tables and Figures ........................................................................................... 67
Foreman .......................................................................................................................... 74
    Summary ..................................................................................................................... 76
    Foreman Tables and Figures ....................................................................................... 77
Mobile ............................................................................................................................. 84
    Summary ..................................................................................................................... 85
    Mobile Tables and Figures .......................................................................................... 86
Conclusions ..................................................................................................................... 93

CHAPTER 5
CONCLUSION ................................................................................................................. 96
Summary ......................................................................................................................... 96
Future Directions ............................................................................................................ 97

BIBLIOGRAPHY ........................................................................................................... 100
General References ....................................................................................................... 101


LIST OF TABLES

Table 1: Complexity Summary over 25-45 dB range ......................................................... 7
Table 2: Summary of Coding Variables ........................................................................... 34
Table 3: Akiyo Complexity Deltas over 25-45 dB ............................................................ 44
Table 4: Foreman Complexity Deltas over 25-45 dB ....................................................... 49
Table 5: Mobile Complexity Deltas over 25-45 dB .......................................................... 54
Table 6: Akiyo Rate Invariant Probes (15fps, CE) ........................................................... 67
Table 7: Foreman Rate Invariant Probes (15fps, CE) ....................................................... 77
Table 8: Mobile Rate Invariant Probes (15fps, CE) .......................................................... 86
Table 9: Complexity Summary over 25-45 dB range ....................................................... 94


LIST OF FIGURES

Figure 1: Basic Video Encoder Model ................................................................................ 9
Figure 2: Group of Picture Illustration .............................................................................. 20
Figure 3: First Frame of Video Sequences ........................................................................ 35
Figure 4: Akiyo Complexity-Distortion (15 FPS, CE) ...................................................... 42
Figure 5: Akiyo Complexity-Distortion (15 FPS, PC) ...................................................... 43
Figure 6: Foreman Complexity-Distortion (15 FPS, CE) .................................................. 47
Figure 7: Foreman Complexity-Distortion (15 FPS, PC) .................................................. 48
Figure 8: Mobile Complexity-Distortion (15 FPS, CE) ..................................................... 52
Figure 9: Mobile Complexity-Distortion (15 FPS, PC) ..................................................... 53
Figure 10: Akiyo Rate-Complexity (15 FPS, CE) ............................................................. 68
Figure 11: Akiyo Deblock Rate-Complexity (15 FPS, CE) ............................................... 69
Figure 12: Akiyo Deblock Rate-Complexity by Picture Type (15 FPS, CE) .................... 70
Figure 13: Akiyo ReadMB Rate-Complexity (15 FPS, CE, CABAC) .............................. 71
Figure 14: Akiyo ReadMB Rate-Complexity (15 FPS, CE, CAVLC) ............................... 72
Figure 15: Akiyo ProcessMB exl. ReadMB Rate-Complexity (15 FPS, CE, CABAC) .... 73
Figure 16: Foreman Rate-Complexity (15 FPS, CE) ......................................................... 78
Figure 17: Foreman Deblock Rate-Complexity (15 FPS, CE) .......................................... 79
Figure 18: Foreman Deblock Rate-Complexity by Picture Type (15 FPS, CE) ................ 80
Figure 19: Foreman ReadMB Rate-Complexity (15 FPS, CE, CABAC) .......................... 81
Figure 20: Foreman ReadMB Rate-Complexity (15 FPS, CE, CAVLC) .......................... 82
Figure 21: Foreman ProcessMB exl. ReadMB Rate-Complexity (15 FPS, CE, CABAC) 83
Figure 22: Mobile Rate-Complexity (15 FPS, CE) ............................................................ 87
Figure 23: Mobile Deblock Rate-Complexity (15 FPS, CE) ............................................. 88
Figure 24: Mobile Deblock Rate-Complexity by Picture Type (15 FPS, CE) .................. 89
Figure 25: Mobile ReadMB Rate-Complexity (15 FPS, CE, CABAC) ............................. 90
Figure 26: Mobile ReadMB Rate-Complexity (15 FPS, CE, CAVLC) ............................. 91
Figure 27: Mobile ProcessMB exl. ReadMB Rate-Complexity (15 FPS, CE, CABAC) .. 92


CHAPTER 1

INTRODUCTION

Problem Background

Compression is a crucial component of video coding.
Uncompressed video is extremely space intensive, with an average 320x240-pixel image taking up over 300 kilobytes, or over 27 megabits per second at the standard 30 frames per second. As video streaming grows and spreads, today's bandwidth and storage simply cannot support any significant level of uncompressed video. Previous standards such as MPEG-2, MPEG-4, and H.263 have successively introduced newer and improved approaches to video compression. Meanwhile, computer processing power has continued to grow rapidly, allowing the introduction of better compression standards to meet the rising use of video. High compression standards attempt to deal with difficulties such as increased network traffic, limited bandwidth, and storage space.

The current state-of-the-art standard is the JVT¹/H.264 standard², developed by the Moving Picture Experts Group (under ISO, the International Organization for Standardization) and the Video Coding Experts Group (under ITU, the International Telecommunication Union) [1]. This standard has been developed as a response to "the need for an industry standard for compressed video representation with substantially increased coding efficiency and enhanced robustness to network environments." [1]

¹ JVT is a Joint Video Team from the two aforementioned experts groups.
² Throughout this thesis, JVT, H.264, JVT/H.264, and H.264/JVT are used interchangeably.

Prior Work and Motivation

As discussed later, JVT provides significant advantages and improvements over previous standards in terms of higher compression ratios. One estimate put H.264 bit savings at 39% compared to MPEG-4, 59% over H.263 High Latency Profile, and 64% better than MPEG-2 [2]. Another study suggested around 50% coding gain over MPEG-2 and the baseline H.263, and 24% over the high profile H.263 model [3]. However, to obtain the improved compression rates, the JVT standard significantly increases the computational complexity. One study suggests that the baseline H.264 decoder increases complexity an average of 2.4 times compared to the previous H.263 baseline implementation [4]. Although much research has gone into reduced-complexity improvements to H.264, it remains a computationally complex standard.

While the increased complexity of the H.264 standard poses some problems for today's traditional desktop and laptop computers, these computers have grown rapidly in their processing power and typically have specialized video cards and instruction sets. In addition, significant amounts of research are focused on reducing complexity for traditional computers. However, the spread of handheld computers and wireless networks introduces a new set of requirements to video decoding.

Handheld computers, with high-speed Internet and quality graphics, are clearly viable candidates for video decoding. Their wireless network access has the same bandwidth as any other wireless implementation, currently around one to five megabits per second, depending upon the technology being used, environmental interference, error handling techniques, and the network implementation and usage. The typical handheld graphics, while adequate, are far less capable than a typical desktop. The handheld uses a 240x360 pixel screen, at least four times smaller than a desktop screen. A desktop's color range is in the millions of colors; a handheld device supports sixty-five thousand or fewer (a maximum of 16 bits per pixel).
More importantly, handhelds typically lack specialized video cards, powerful processors, or extra instruction sets. The handheld processors are slower, both in terms of clock rate and optimizations. Because of power, space, and heat issues, architectural speedups are typically not implemented to the same degree as for standard systems. Thus a 400 MHz processor for a handheld is not as powerful as a 400 MHz processor for a desktop. Additionally, handheld devices do not run the same operating systems as desktops.

Thus, the needs and limitations for a video decoder on a handheld are likely to be significantly different from those of a video decoder optimized for a desktop. Specifically, bandwidth may be much less of an issue compared to the lack of computing power. Furthermore, due to the different operating system and architectural designs, handheld optimizations may not particularly correspond to desktop optimizations.

However, designing a new video standard for handheld devices is undesirable for several reasons. First, video will often be generated for multiple destinations, possibly without knowing the platform of the end-user. In addition, video is often encoded for multiple uses (handheld, storage, desktop), and encoding video multiple times for different applications is less than desirable. Furthermore, the development of additional standards defeats one of the major purposes of having a standard: to have a universally accepted format for video encoding.

JVT, designed and backed by both the MPEG and ITU standards organizations, is likely to become the next generation standard in many applications. However, little research has addressed the questions involved in how JVT can best be implemented on handheld devices. Since handheld devices, due to their low storage space, are better suited for real-time video than for storing video, this thesis focuses on the real-time performance of a handheld decoder with a variety of approaches to decoding JVT video sequences. Special attention is paid to the time involved in the subparts of video decoding.

One previous study [5] breaks the JVT decoding process into six main sections³: the deblocking filter, interpolation, inverse transform and quantization, variable length decoding (VLD), intra prediction, and miscellaneous. Based on using a Pentium III processor, this previous study estimated that the deblocking filter takes 20-35% of the time, interpolation (including motion prediction) another 35-50%, intra prediction less than 1%, inverse transform and quantization 7-11%, variable length decoding 4-9%, and miscellaneous around 6-21%. The wide range of results reflects the use of different video sequences.

³ The following chapter elaborates on and provides some background on these parts of the video decoder.

Objectives

The purpose of this research was three-fold: to understand the impact of high-level encoder options such as entropy coding and Group of Picture (GoP) structure on complexity-distortion; to develop a handheld-specific complexity model to guide further handheld decoder optimizations; and to generate a real-time JVT decoder for a handheld device.

Approach

Since prior research, such as the study by Lappalainen et al., suggests interpolation accounts for a large portion of the decoding time, an initial research focus is the impact of the frame type composition upon the computational complexity.
The frame type composition, or Group of Picture (GoP) type, is a significant factor in controlling the amount of motion prediction required by the video decoder.

Contributions

The major contributions of this research are broken down into four sections. The first contribution, developing the available reference software into a viable real time H.264 decoder, provided the underlying software used for performance analysis throughout the rest of this research. In addition, the real time H.264 decoder ensures the practical application of the research to actual decoding.

The second contribution, discussed in chapter three, provides a baseline complexity for the H.264 decoder under different parameters, especially the impact of motion prediction through controlling the picture frame types. The exploration of these parameters gives insight into the complexity control available at a high level through varying the encoder options. This contribution focused on the impact of I, P, and B-frames on the complexity-distortion curves generated by different levels of quantization in the encoder. This contribution highlights the generally optimum performance of P-frames, even when no rate limit is imposed on the sequence.

The third contribution, examined in chapter 4, breaks down the decoder into sub-components and presents the rate-complexity behavior of each aspect. It identifies the portions of the decoder that are complexity invariant, regardless of rate, and the components that are rate-dependent. Additionally, this section breaks out the time that goes into parsing, which is highly linear in rate, and also breaks out the decoding (non-parsing) complexity for I, P, and B frames. This section highlights the significance of parsing, especially with a sub-optimal implementation, as well as the large decoding time, especially with B frames.

The fourth contribution compares the iPAQ results to the same experiments run on a Pentium IV processor. The contribution highlights that the performance differences between the two processors are not simply scaled versions of each other. The Pentium speedup varies significantly based upon the component of the decoder. This variance strongly suggests that algorithms and optimizations from one processor will not necessarily be equally beneficial on the other processor and that separate handheld research is needed. This contribution is discussed in both chapters 3 and 4 after the relevant iPAQ discussion. Table 1 provides an overview of the complexity and parsing rates for a range of sequences.⁴

⁴ This table is discussed in more depth at the conclusion of chapter 4.

Frame Type | iPAQ 400 MHz | Pentium IV 1.5 GHz | PC Speedup (Times Faster)
           |              |                    | Low End    | High End
Decoding, exl. Parsing (ms/frame)
I          | 11.5-13.5    | 2.75-3.10          | 4.18       | 4.35
P          | 21.5-22.5    | 3.95-4.45          | 5.44       | 5.06
B          | 33.0-48.5    | 5.40-7.10          | 6.11       | 6.83
Parsing (bytes/ms), CABAC Coding
I          | 250-280      | 1100-1300          | 4.40       | 4.60
P          | 160-250      | 720-1100           | 4.50       | 4.40
B          | 80-210       | 450-1000           | 5.63       | 4.76
Parsing (bytes/ms), CAVLC Coding
I          | 135-245      | 600-1130           | 4.44       | 4.61
P          | 50-100       | 380-520            | 7.60       | 5.20
B          | 30-80        | 360-500            | 12.00      | 6.25

Table 1: Complexity Summary over 25-45 dB range

Thesis Overview

The remaining chapters are organized as follows: Chapter 2 discusses video coding standards in general and the important features of the JVT standard. Chapter 3 looks at the complexity-distortion of different GoP sequences as well as contrasting the PC and iPAQ performance. Chapter 4 breaks down iPAQ decoder performance into its components and looks at the component performance based on compression.
Chapter 5 concludes by summarizing the contributions in more detail and looking at future work.

CHAPTER 2

BACKGROUND

This chapter discusses some key basics and well-established concepts for video decoders, the critical portions of the H.264/JVT standard, and the previous studies into H.264 and wireless complexity. The first section of this chapter provides a general summary of how a video sequence is broken down for coding and provides a very brief synopsis of the generic encoding process. The next section steps through the JVT standard decoding process of a correctly coded video stream, starting with the incoming stream, then looking at coefficient dequantization, motion prediction, and finally image deblocking. These sections include exploration of previous research on JVT complexity issues as well as the few previous studies on implementing video decoding on handheld platforms.

Standard Encoder Model

The typical video compression model is based on an encoder, either real-time or offline, that encodes a series of pictures and transmits the encoded video sequence over some channel, possibly with additional channel coding. The sequence is received by a decoder, which reverses the encoding process and generates a sequence similar to the original. The similarity to the uncorrupted original sequence, often referred to as the signal-to-noise ratio, depends upon the amount of information discarded by the encoder while compressing the sequence and any optimization in the decoder that neglected transmitted information.

Many standards, including the H.264 standard, specify only the correct ("standard compliant") input and output of the decoder; for the encoder, any algorithm that generates a correctly encoded (compliant) stream is acceptable. Likewise, any decoding process that takes a valid input stream and produces the correct decoded video sequence is acceptable. In practice, there is a standard encoder model as shown in Figure 1. The rest of this section examines each of these seven components (input frame, motion estimation, prediction error, frequency transform, quantization, decode frame prediction, and entropy coding).

Figure 1: Basic Video Encoder Model

Input Frame

Video frames, typically in YUV format, are sequentially fed into the encoder. YUV is a color representation scheme where Y represents the intensity, or luminance (luma), of the color. The U (or Cb) and V (or Cr) components contain the chrominance, or color component. (Cb and Cr primarily contain the blue and red chrominance components of the original Red-Green-Blue (RGB) pixel, respectively, whereas the luminance Y primarily contains the green component.) Because the human eye is much more sensitive to intensity than to color, the chrominance is often specified with fewer bits. One of the significant advantages of the YUV scheme is the ease of sending more detail regarding the luminance by including more bits for the Y component. The second way of exploiting the YUV scheme, used by popular video compression standards such as JVT, is to use a single sample of the UV components for multiple samples of the Y component. One format that is used by the MPEG and ITU series of video standards (including JVT) is referred to as YUV 4:2:0, which implies that for each pixel, the Y component is encoded. However, a Cb and Cr sample are only taken every other column and every other row, effectively one UV sample per four Y samples. Various schemes are used to properly average the UV components over all four Y samples. Currently the JVT standard supports 8 bits per Y, U, or V sample, although support for additional bits is expected [6].
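As an illustration of the 4:2:0 arrangement just described, the short C sketch below computes the size of one QCIF frame and downsamples a chroma plane by plain 2x2 block averaging. It is only a sketch: the averaging filter is one of the several possible schemes mentioned above, and the buffer names are illustrative rather than taken from the reference software.

    #include <stdio.h>
    #include <string.h>

    #define WIDTH  176   /* QCIF luma width  */
    #define HEIGHT 144   /* QCIF luma height */

    /* Average each 2x2 block of a full-resolution chroma plane down to one
     * sample, producing the quarter-size plane used by 4:2:0 sampling.
     * (Plain block averaging; other downsampling filters are possible.) */
    static void subsample_chroma(const unsigned char *full, unsigned char *half)
    {
        for (int y = 0; y < HEIGHT; y += 2)
            for (int x = 0; x < WIDTH; x += 2) {
                int sum = full[y * WIDTH + x]       + full[y * WIDTH + x + 1]
                        + full[(y + 1) * WIDTH + x] + full[(y + 1) * WIDTH + x + 1];
                half[(y / 2) * (WIDTH / 2) + x / 2] = (unsigned char)((sum + 2) / 4);
            }
    }

    int main(void)
    {
        static unsigned char cb_full[WIDTH * HEIGHT];            /* full-resolution chroma */
        static unsigned char cb_half[(WIDTH / 2) * (HEIGHT / 2)];
        memset(cb_full, 128, sizeof cb_full);                    /* neutral chroma value */
        subsample_chroma(cb_full, cb_half);

        /* One 8-bit Y sample per pixel, one Cb and one Cr sample per 2x2 block. */
        int frame_bytes = WIDTH * HEIGHT + 2 * (WIDTH / 2) * (HEIGHT / 2);
        printf("QCIF 4:2:0 frame: %d bytes\n", frame_bytes);     /* prints 38016 */
        return 0;
    }

For QCIF, the 4:2:0 layout works out to 1.5 bytes per pixel, which is where the 38,016-byte figure printed above comes from.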
Motion Estimation & Prediction Error

Motion estimation plays a major role in enabling modern video compression standards to achieve high coding efficiency. Motion estimation is traditionally achieved by dividing the image frame (under consideration by the encoder) of the video sequence (being compressed) into square (or rectangular) regions. The classic subdivision of a frame is into sixteen by sixteen (16x16) pixel blocks known as macroblocks. Each macroblock, therefore, contains a 16x16 array of the Y luminance information, and two 8x8 arrays, one for the Cb component and the other for the Cr component. (The 4:2:0 YUV format has four times the Y information compared to the U and V.) The macroblock is the fundamental block which is encoded, transmitted, and then decoded. Note that the standard video formats, such as QCIF (176x144 pixels), are multiples of 16x16. QCIF, for example, contains 99 macroblocks (11x9).

When the encoder receives a frame, it breaks the frame (current frame) down into macroblocks and, for each macroblock in the current frame, searches for the most similar 16x16 area within the previously coded frame (reference frame). Once the most similar 16x16 area ("best match") in the reference frame is identified by the motion search process, the encoder then takes the difference between the two: the macroblock in the current frame and the 16x16 best-match area in the reference frame. The 16x16 area ("best match") in the previous frame is the same size as the macroblock of the current frame that is being coded; however, this 16x16 area might consist of multiple macroblock subsections of the previous frame.

The range of the search depends upon the type of picture being encoded and the encoder options. Some picture types only allow searching within the same frame for similar macroblocks; others allow looking at one or more previous reference frames. The number of reference frames depends upon the encoder options; more frames increase the encoder complexity significantly. The definition of the "most similar" macroblock depends upon the encoder implementation. Numerous schemes for increasing the search efficiency and performance have been published. For example, Andy Chang et al. propose a fast estimation procedure that reduces computational complexity by forty percent compared to a standard full search [7].

The difference between the current block and the reference block is taken because usually the difference contains smaller values than the original macroblock and thus requires fewer bits to encode. In addition to the difference, a motion vector is sent specifying the reference 16x16 area (in the reference frame) for the current macroblock.

As discussed later, the encoder, after encoding each frame, immediately (internally) decodes the frame and stores it locally to serve as a reference decoded frame for the next (current) frame. This allows the encoder to be "in synch" with the decoder at the receiver (as explained further below). In other words, this enables the encoder to use the correct "reference" frame that will be available at the true decoder when computing the difference signal.
(Recall that the receiver will not have access to the original lossless frames; it will only have access to the decoded frames, distorted due to compression.)

Frequency Transform & Quantization

Once the Y, U, and V coefficients have been determined along with any motion vector, the coefficients of the macroblock undergo a frequency transform. This transform is usually a discrete cosine transform (DCT). Here, since the data is in arrays of sizes such as 4x4, 8x8, or 16x16, the transforms can easily be applied. The result is a set of coefficients that describe the samples and motion vectors in the frequency domain rather than pixel by pixel. Many of these components tend to be zero or very small, further reducing the amount of information needing to be transmitted. The JVT standard uses a special Hadamard transform, discussed later, to avoid multiplications or floating point discrepancies by the decoder.

Once the frequency coefficients are determined, they are quantized to further reduce the amount of information. The simplest form of quantization is to simply divide each coefficient by some number and send the integer result, discarding any remainder. The decoder simply multiplies the received integer by the same number, which is also signaled, and its final coefficients are close to the original. More complicated schemes with scaled quantization parameters are actually used to improve quantization performance. Careful selection of the quantization equations in the standard enables multiplication and division to be avoided and replaced with shifts and additions. This quantization step is the primary lossy step where information is discarded by the encoder that cannot be recovered by the decoder. The level of quantization is determined by the encoder options.

Decode Frame Prediction

Because of the loss of information from quantization, and possibly from the frequency transform (mainly due to minor loss in the precision of computation), a decoded frame is not exactly the same frame that was earlier encoded. Since the decoder is using motion prediction based on past decoded frames, not past original frames, the encoder needs to select the motion prediction based on the actual decoded frames. Otherwise, the motion vectors for the macroblocks will point to different coefficients than expected, and the newly decoded frame will not be correct. Since future frames will be based on past frames, the error will propagate and grow, adding significant distortion. To avoid this problem, the encoder includes a decoder which decodes each encoded frame. These decoded frames are then used for motion estimation, and so the motion estimation is correct regardless of what loss the encoder introduces.

Entropy Coding

The quantized macroblock coefficients, motion vectors, and other overhead information are now ready for lossless entropy encoding to minimize the amount of transmitted data. Channel coding and transmission protocols, which compensate for lossy channels, are irrelevant to the video standard although they clearly impact the overall video performance. Many standards, including JVT, include advanced features for handling lost data at the cost of additional complexity and bandwidth. The JVT standard supports two entropy coding modes, discussed shortly in more depth as part of the JVT decoder. Entropy coding represents the final encoding step, and the output is the encoded video, which is then channel coded and sent to the decoder.
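As a concrete (and deliberately simplified) view of the quantization step described above, the fragment below divides each transform coefficient by a step size and has the decoder multiply it back. The real JVT quantizer derives its scaling from the quantization parameter through lookup tables and replaces the division with multiplications and shifts; the step size here is a hypothetical stand-in.

    /* Illustrative uniform quantization of a block of transform coefficients.
     * qstep is a hypothetical step size; the actual JVT/H.264 quantizer derives
     * its scaling from the quantization parameter (QP) via lookup tables and
     * avoids the division entirely. */
    void quantize(const int *coeff, int *level, int n, int qstep)
    {
        for (int i = 0; i < n; i++)
            level[i] = coeff[i] / qstep;      /* remainder is discarded (lossy) */
    }

    /* The decoder reverses the step; the discarded remainder is lost for good. */
    void dequantize(const int *level, int *coeff, int n, int qstep)
    {
        for (int i = 0; i < n; i++)
            coeff[i] = level[i] * qstep;      /* approximation of the original  */
    }

A larger step size discards more of each coefficient, lowering the bitrate at the cost of distortion, which is exactly the trade-off the encoder's quantization options control.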
JVT Decoder Implementation

This discussion of the key aspects of the JVT standard is subdivided into several parts. The first portion deals with the decoder input and entropy decoding. The next focuses on the coefficient dequantization and inverse transform. The final parts examine motion prediction and the deblocking filter. Schwarz and Wiegand provide a more in-depth discussion of many of these aspects in [2].

Input Decoding

The first segment of the decoder processes the incoming video stream and then reverses the entropy coding. The video stream processing is based upon breaking the stream down into pre-defined packets. The packets then specify the protocol and data within each packet, and the data is extracted in the appropriate manner.

Network Abstraction Layer Unit (NALU)

Since the JVT standard is designed to work independently of any network or channel technology, JVT uses its own packet model, called a network abstraction layer unit (NALU), which is very similar to an Internet Protocol (IP) packet. The NALU encapsulates all the frame data, whether motion vectors, frequency coefficients, or flow control data. The incoming stream is searched for the unique NALU startcodes, then the decoder processes one NALU at a time. NALUs may contain either flow control data or picture data. (Flow control data involves information such as the reference frame buffer handling, slices, and macroblock ordering schemes.) Different NALU types contain different data packed in according to that NALU type.

Entropy Coding: CAVLC & CABAC

Depending upon the encoder options and the NALU type, the data is unpacked appropriately. Two major types of entropy coding are used to pack the data. The first is a combination of universal variable length codes (UVLC) and context adaptive variable length codes (CAVLC). Throughout this thesis, this combination is generally referred to by the CAVLC abbreviation, although this method of entropy encoding uses both. However, previous coding standards generally relied on a pure UVLC entropy approach. The context adaptive VLC codes use a combination of different lookup tables, where the context is the nearby macroblocks. In addition to the lookup tables, additional syntax is used to specify the location of zeroes in between the non-zero coefficients. A more in-depth explanation can be found in [8], as well as the original standard proposal [9]. Horowitz et al. [4] briefly note in their decoder complexity analysis that their empirical tests indicate CAVLC is about two times more complicated than the previous pure UVLC defined by JVT. However, they do not examine CAVLC complexity beyond that brief footnote.

The other entropy method that is supported by the JVT standard is context-based adaptive binary arithmetic coding (CABAC). CABAC is an extension of arithmetic coding. Arithmetic coding works by encoding a series of coefficients as a floating point number. The more bits, the more specific the floating point number and the more coefficients can be encoded in a single number. Arithmetic coding can nearly achieve the source entropy because it uses the probability distribution to determine which floating-point numbers correspond to which series of coefficients. Since the probability distributions are variable for different video sequences, and even within the same sequence, the JVT model recalculates the probabilities so that the distributions are representative of the recent past.
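CABAC's real engine couples a table-driven probability state machine with a binary arithmetic (range) coder. The sketch below is only a conceptual stand-in for the adaptation idea described above: a context keeps counts of recently decoded binary symbols and periodically rescales them so its probability estimate tracks the recent past. None of these names or constants come from the standard.

    /* Simplified adaptive context: tracks how many 0 and 1 symbols it has seen
     * and exposes a probability estimate that follows the recent statistics.
     * CABAC itself uses a 64-state table-driven estimator plus a binary
     * arithmetic coder rather than explicit counts; this is a conceptual sketch. */
    typedef struct {
        unsigned count[2];
    } Context;

    static void context_init(Context *c)
    {
        c->count[0] = c->count[1] = 1;             /* avoid zero probabilities */
    }

    /* Estimated probability that the next symbol is 1, scaled to 0..65535. */
    static unsigned context_prob_one(const Context *c)
    {
        return (65536u * c->count[1]) / (c->count[0] + c->count[1]);
    }

    static void context_update(Context *c, int symbol)
    {
        c->count[symbol & 1]++;
        if (c->count[0] + c->count[1] > 1024) {    /* periodically halve the counts  */
            c->count[0] = (c->count[0] + 1) / 2;   /* so the estimate weights the    */
            c->count[1] = (c->count[1] + 1) / 2;   /* recent past, as described above */
        }
    }

Each syntax element selects its own context (often based on neighboring macroblocks), so frequently repeated patterns in one part of the stream do not dilute the statistics of another.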
Studies such as [10, 11] suggest that the context-based arithmetic coding represents a 10-25% bit improvement over the UVLC implementation, depending upon the quantization parameter and video sequence. A study by Saponara et al. suggested that 16% was the upper limit [12]. In terms of prior research into coding complexity, both CABAC and CAVLC are relatively new coding schemes, and more focus has been given to the bitrate improvement of CABAC than to its complexity [10, 11, 13]. This research does, though, clearly note the bitrate advantages of context-based coding. A 1987 study by Witten et al. [14] comparing arithmetic coding and Huffman coding (which is traditionally the core of the VLC type of coding) suggests that arithmetic coding was generally superior to Huffman coding using the computers of that time. However, the study did not examine context-based coding, nor did it attempt to fully optimize the Huffman implementation. (Huffman codes, like VLC codes, generally rely on table-based decoding.) The H.263 Annex E first introduced CABAC to video standards. The JVT standard used an improved low-complexity table-based implementation, and a multiplication-free coder was recently proposed in [15]. However, the study by Saponara et al., using version 2 of the JVT reference software (pre-CAVLC), reports that CABAC had only a minor complexity increase but provided no experimental numbers for entropy coding [12]. It is important to remember that the context-based usage of VLC tables differs from previous video standards, which relied upon a single universal table.

Dequantization and Hadamard Transform

W = \frac{1}{2}
\begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \\ 1 & -1 & 1 & -1 \end{pmatrix}
Z
\begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \\ 1 & -1 & 1 & -1 \end{pmatrix}

Z = Frequency Coefficients
W = Unscaled Spatial Coefficients

Equation 1: Luma DC Hadamard Inverse Transform

As mentioned earlier, many previous standards have used discrete cosine transforms (DCT) for the space-frequency domain transform. Because the DCT relies on floating point approximations of irrational numbers, precision is platform and implementation dependent. This imprecision leads to "drift," the same problem that occurred when motion vectors were based off of the original frames rather than the decoded frames and led to a decoder instance in the encoder. To prevent this drift problem, as well as to improve performance, the JVT standard implements a 4x4 integer variant of the DCT. All of the transforms, one of which is shown in Equation 1, can be implemented as a series of one-dimensional transforms, which can be implemented using only addition, subtraction, and bit shifts for the multiplications and divisions. The JVT standard uses a 4x4 transform, where previous standards use a larger 8x8 transform. The smaller transform allows JVT to use smaller macroblock divisions, as discussed below. A fuller treatment of the JVT Hadamard transform progression and implementation is given by Richardson [16]. The transform specifics vary by the type of sub-macroblock being decoded; luma DC blocks and chroma DC blocks have their own variants of this process. Once the unscaled spatial coefficients are reconstructed by the inverse transform, they are rescaled by multiplying by a factor determined using lookup tables referenced by the quantization parameter.
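Because the matrices in Equation 1 contain only +1 and -1 entries, the inverse transform needs nothing more than additions and subtractions. The sketch below evaluates the product of Equation 1 as a row pass followed by a column pass; the 1/2 factor and the lookup-table rescaling described above are left to the caller, and the function names are illustrative rather than taken from the reference decoder.

    /* One-dimensional 4-point Hadamard butterfly used by Equation 1.
     * Only additions and subtractions are required. */
    static void hadamard4(const int in[4], int out[4])
    {
        int p = in[0] + in[1];
        int q = in[2] + in[3];
        int r = in[0] - in[1];
        int s = in[2] - in[3];
        out[0] = p + q;
        out[1] = p - q;
        out[2] = r - s;
        out[3] = r + s;
    }

    /* Inverse Hadamard transform of the 4x4 luma DC block (Equation 1 without
     * the 1/2 factor): a row pass followed by a column pass.  The QP-dependent
     * rescaling mentioned in the text is applied afterwards by the caller. */
    void inverse_hadamard_4x4(int Z[4][4], int W[4][4])
    {
        int tmp[4][4];
        int vec[4], res[4];

        for (int i = 0; i < 4; i++)            /* transform each row    */
            hadamard4(Z[i], tmp[i]);

        for (int j = 0; j < 4; j++) {          /* transform each column */
            for (int i = 0; i < 4; i++) vec[i] = tmp[i][j];
            hadamard4(vec, res);
            for (int i = 0; i < 4; i++) W[i][j] = res[i];
        }
    }

Because the 4x4 Hadamard matrix is symmetric, the same one-dimensional butterfly serves for both passes, which is what makes the transform so cheap on processors without fast multipliers.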
Motion Prediction

Each image is encoded and decoded one macroblock at a time. The encoded information for each macroblock includes the motion vectors and the differential coefficients. For images with any significant amount of motion, simply using the difference with the previous image provides poor correlation compared to what is available elsewhere, both within the same picture and in previous pictures. In order to best exploit the sequence correlation, two types of correlation are exploited. The first, spatial correlation, tries to use portions of the same frame as a base for encoding the remainder of the frame. The other, temporal correlation, uses previously encoded frames as the basis for encoding. The displacement between the current block and the selected match is the motion vector, which is generally encoded as the difference between that motion vector and previous motion vectors.

The encoder decides how best to encode the macroblock. It can encode the macroblock as one large 16x16 chunk, or subdivide it into 8x8, 16x8, or 8x16 blocks. Those blocks can further be divided into 4x4, 8x4, or 4x8 sub-blocks. The more blocks that are used, the better the motion compensation can match that particular block. Chang et al.'s simulations suggest as much as a 15% bitrate improvement by using the subblock sizes in addition to the whole macroblock [7]. However, more blocks increase the information needed by adding more motion vectors. Blocks may be intra-coded, taking advantage of spatial correlation, or inter-coded, pointing to previously coded frames. Inter-coded blocks may be predictive, using one motion vector, or bi-predictive, using two motion vectors. The bi-predictive mode allows a block to combine information from two previous blocks, saving on coefficient information but increasing the computation required on the decoding end.

Frame Types & Picture Buffering

The frame type, determined by the encoder options, determines which coding modes are used in coding that frame. An intra-frame, or I-frame, only uses intra-coded blocks. Thus it is completely self-decoding, as it has no motion vectors pointing to previous information. However, I-frames tend to be the largest type of frame because they do not take advantage of any temporal correlation. P-frames are predictive frames, which can use either intra coding (intra-coded macroblocks) or a single motion vector per macroblock. This motion vector points to the "best match" area (normally of size 16x16 pixels) within the previously coded I-frame or P-frame. The bi-predictive frame (B-frame) uses two motion vectors, which can point to different reference pictures.

The pattern in which an encoder uses these frame types is known as the Group of Picture (GoP) type. The encoder uses this pattern to decide which picture type to encode next, resulting in the pattern repeating over and over. Figure 2 illustrates two classic GoPs. The top one, IPPP, consists of a single I-frame followed by P-frames. Each P-frame uses the previous P-frame as a reference frame. The second, IBBPBBP, consists of an I-frame, then a repeating pattern of two B-frames followed by a P-frame. Each P-frame uses the previous I- or P-frame as a reference; each B-frame uses the previous and next P-frames as references. (The first two B-frames use the I-frame as a reference since there is no previous P-frame.)

Figure 2: Group of Picture Illustration
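The repeating structures in Figure 2 can be expressed as a simple frame-index rule. The sketch below assigns a picture type to each frame of the IPPP and IBBPBBP patterns in display order; the I-frame spacing parameter is a hypothetical illustration, and a real encoder additionally manages reference lists and the coded (transmission) order of B-frames.

    typedef enum { I_FRAME, P_FRAME, B_FRAME } FrameType;

    /* Picture type for frame n of an IPPP... sequence with an I-frame every
     * gop_len frames (gop_len is an illustrative parameter, e.g. 16). */
    FrameType gop_ippp(int n, int gop_len)
    {
        return (n % gop_len == 0) ? I_FRAME : P_FRAME;
    }

    /* Picture type for frame n of the IBBPBBP... pattern of Figure 2 in display
     * order: the first frame is intra coded, every third frame after that is a
     * P-frame, and the two frames in between are bi-predictive. */
    FrameType gop_ibbp(int n)
    {
        if (n == 0)     return I_FRAME;
        if (n % 3 == 0) return P_FRAME;
        return B_FRAME;
    }

Running gop_ibbp over n = 0, 1, 2, ... yields I, B, B, P, B, B, P, matching the second pattern in the figure.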
One-quarter pixel prediction

The goal of the motion compensation is to find a vector that reduces the macroblock coefficients to as close to zero as possible. In theory, motion may not happen in pixel steps, and a higher degree of precision may find better reference matches. Previous video standards, such as MPEG-2 [17], have frequently used half-pixel, or half-pel, motion vector precision. Increased precision on the motion vector increases the search complexity for the encoder. Research into motion search techniques demonstrates that improved methods are possible. For example, Panusopone and Baylon present a half-pel 8x8 search method in [18] that can be generalized to different block sizes and resolutions. Historically, however, precision beyond half-pel did not reduce bit usage for previous standards. Werner [19] proposed a filtering method to overcome signal noise in motion estimation. For JVT, Wedi and Musmann extended Werner's proposal into a functional method for using one-quarter and one-eighth pel precision [20]. The specific implementation in JVT, discussed briefly by Wiegand et al. in [21], uses a finite impulse response filter for the half-pel estimation, and then averages the integer and half-pel values for the quarter-pel prediction. This process is clearly more complex than full-pel decoding.
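To make the interpolation step concrete, the sketch below produces a horizontal half-pel luma sample with the six-tap (1, -5, 20, 20, -5, 1) filter commonly associated with H.264 and then averages it with the integer sample to obtain a quarter-pel value. Edge handling, the vertical and diagonal cases, and the standard's exact rounding rules are omitted; this is an illustration of the idea, not the normative process.

    /* Clip a filtered value back into the 8-bit sample range. */
    static int clip255(int v) { return v < 0 ? 0 : (v > 255 ? 255 : v); }

    /* Horizontal half-pel sample between pixels x and x+1 of a luma row, using
     * the six-tap (1, -5, 20, 20, -5, 1)/32 FIR filter.  The caller must keep
     * x-2 .. x+3 inside the row (no edge padding is done here). */
    static int half_pel(const unsigned char *row, int x)
    {
        int v = row[x - 2] - 5 * row[x - 1] + 20 * row[x]
              + 20 * row[x + 1] - 5 * row[x + 2] + row[x + 3];
        return clip255((v + 16) >> 5);        /* round and divide by 32 */
    }

    /* Quarter-pel sample a quarter of the way from pixel x toward x+1:
     * the average of the integer sample and the neighbouring half-pel sample. */
    static int quarter_pel(const unsigned char *row, int x)
    {
        return (row[x] + half_pel(row, x) + 1) >> 1;
    }

Every sub-pel motion vector forces the decoder to run this kind of filtering over the referenced block, which is why interpolation dominates the decoding-time breakdown cited in Chapter 1.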
Deblocking (Loop Filter)

Because the encoding and decoding are done in blocks, the edges of the blocks tend to have seams, which create visual artifacts and blockiness along the block boundaries. To help clear up this distortion, decoded macroblocks are run through a deblocking filter. The deblocking filter runs first on the vertical edges of the block, then on the horizontal edges. The filter serves to average out values along the edges, thus giving a smoother look to the finished frame. This filter is an integer-based filter, so it can be quickly implemented.

CHAPTER 3

IMPACT OF GROUP OF PICTURE TYPES ON REAL TIME PERFORMANCE

Research shows that the JVT standard provides a computationally complex algorithm for video decoding. While significant research has been done into complexity reduction methods for the JVT standard [6, 7, 22], little research has related the problem to mobile devices. An exception is Yu et al., who propose a real-time mobile encoder in [23]. Lappalainen et al. did conclude that real-time decoding for the H.263 standard should be possible for low-frequency mobile processors and that 12-38 MHz of a Pentium III was required for 10fps QCIF decoding [5]. Horowitz et al. analyzed the JVT complexity by looking at the average number of operations to decode a macroblock and then extrapolating that into complexity based on processor specifications [4]. That previous study, however, largely ignored the entropy decoding complexity.

As highlighted in the previous chapter, the JVT standard provides a wide range of video coding tools and related options. By varying these options, the JVT codec can provide a wide range of video quality; at the same time, these coding options imply a wide range of complexity that is encountered by the JVT decoder. Hence, a key objective of this thesis is to identify high-level JVT coding options that provide the best quality (minimum distortion) while adhering to a maximum complexity constraint. Our work thus establishes an initial baseline for the complexity-distortion performance of JVT decoding over handheld devices. Consequently, a variety of video sequences were encoded to explore the impact of frame rate, entropy coding, group of picture type, and video sequence upon the decoding performance. Equivalent tests were performed using a Pentium IV to provide a comparative-performance baseline.

The baseline tests (for both handhelds and standard computers) used a lossless decoder and rejected any optimizations which would represent a lossy decoder⁵. The purpose of these experiments was three-fold: first, to answer the question "Given that a video sequence is encoded for a handheld device using the JVT standard, how should one proceed to encode that sequence to achieve the best possible quality under a given complexity constraint?" Second, to identify the broad over-arching complexity of the JVT decoder implementation chosen. Finally, to establish the performance of this decoding software compared to that used in previous research.

Our experiments investigated the complete range of quantization choices, four different frame rates, both entropy encoding methods, four common Group of Picture (GoP) types, and intra-coded (I-frame) frequency variance for two common GoPs.

This chapter is organized as follows. First, the hardware and software components of the experiments are described. Then the video encoding configurations used in the experiments are described. Subsequently, the complexity-distortion results are presented in graph form with corresponding analysis.

Hardware Selection

The significant hardware choices were first the type of handheld, and then the standard computer used as a reference comparison for the handheld performance.

⁵ Here, a lossy decoder implies a decoder that discards some information/data that is already sent by the encoder for the purpose of reducing complexity.

Handheld device - iPAQ:

The iPAQ series, built by Compaq and later by HP after the Compaq/HP merger, is a series of handheld devices that is representative of the capabilities of generic handheld devices. They are designed as multipurpose consumer devices. They possess simple graphics, using 16 bits per pixel (6 bits for green and 5 each for red and blue). (Many of the early mobile devices used this 5-6-5 RGB format but did not support 65K colors.) The processors are simple but fast 32-bit processors, capable of speeds between 200 and 600 MHz. The iPAQ selected for this research is the h5550 model. Designed by Intel, the 5550 model employs a 32-bit processor targeted for mobile environments. The iPAQ represents a commercial integrated product typical of a generic mobile computing device.

PC Platform:

To provide the PC baseline, a system running Windows 2000 with a Pentium 4 1500 MHz chip and 256 megabytes of RAM was used. Aside from the normal system services and background tasks (e.g. virus detection, firewall), no other loads were running during simulation testing.

Software Selection

In order to prepare the experiments, the following software was used. The general principles behind the selection were to use the commercially available state-of-the-art, to use pre-developed solutions as a baseline and then improve the bottlenecks, and to initially avoid any degradation in video quality or decoder speed due to software changes. Improvements which sacrificed video quality for improved complexity were not considered at this point (i.e., we focused on lossless decoding).

Operating System

The handheld devices are developed and designed with Windows Compact Edition (WinCE) in mind. Thus hardware interface and control problems are minimized. WinCE is a light version of the traditional Windows system. It supports most of the basic Windows features, but it uses relatively few of the advanced features.
Graphically, for example, WinCE uses its own high-speed graphics library (GAPI) for direct access to the screen rather than the standard DirectX used by other systems. The WinCE software development kit (SDK) supports few of the libraries that are normally available. The operating system, moreover, supports only minimal configuration of its memory management, thread priorities, or file handling. Nevertheless, because of WinCE's simplicity, availability on a wide variety of handhelds, and support by Microsoft, it makes an excellent platform for examining the actual issues involved in real-time video decoding over low-complexity devices. The software development kit supports all standard C/C++ syntax as well as wireless internet, file I/O, display writing, basic timing, and 32-bit arithmetic, giving the basic building blocks for video decoding. Additionally, WinCE is designed to interface with the desktop, facilitating desktop-controlled program execution, video transfer, and data collection for experiments. Given that gigabytes of video sequences would eventually be generated, this automation is a valuable component. For those interested, others have written Linux versions to replace the default WinCE operating system. The open source aspects of Linux would facilitate further research into operating system improvements to speed up decoder performance. However, significant work in terms of a reasonable coding environment (see below) would be needed for such a change.

Compiler and Software Development

Since the current iPAQs are based upon the PXA255 or related processors, they use the ARM instruction set. Microsoft's freely released Embedded Visual C++ 3.0 (EVC3) and 4.0 (EVC4) development kits were used to facilitate development of handheld executables. Version 3.0 represents an early version of WinCE; version 4.0 represents the current version. Early design and development was done with version 3.0, although the results presented throughout this thesis are based on a more updated iPAQ which used the Visual C++ 4.0 environment. While EVC allows a handful of Microsoft/ARM-specific optimizations, and some of these were used at specific bottlenecks in the program, the majority of the code remains standard C/C++ and can be ported to any platform. All linking was done with the included Microsoft linker, while compilation was done with Intel's Windows Compiler for Embedded Visual C++ (later combined into Intel's Windows Compiler program). Initial trials used the Microsoft compiler, but the Microsoft compiler incorrectly compiled code for a single array, resulting in incorrect array values in the decoder. The code compiles correctly with both Visual Studio .NET (discussed later) and the Intel compiler. Speed comparisons between the two compilers showed no statistically significant variance in performance based upon the compiler.

A WinCE software development kit called Port SDK [24] was used to enable the use of standard console I/O in the WinCE environment. Port SDK also provided a simpler method to execute the compiled decoder since the decoder did not need to create its own windows environment. A generic console application [25] actually served as a DOS console under WinCE. (WinCE has no native console.) Preliminary tests showed no significant overhead to using the PortSDK/console when compared to the standard windows interface. All timings ignore program initialization and cleanup, including thread and window initialization.
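The timings reported in this thesis come from probes inserted directly into the decoder (see "Addition of timers" below). As a generic illustration, a probe of this kind can be built on the Win32/WinCE GetTickCount() call, with initialization and cleanup kept outside the measured region; the structure and function names below are illustrative, not those of the actual instrumented decoder.

    #include <windows.h>
    #include <stdio.h>

    /* Millisecond timing probe built on GetTickCount(), which is available on
     * both desktop Windows and WinCE.  Program initialization and cleanup are
     * performed outside the probed region so they never count toward decode time. */
    typedef struct {
        const char *name;
        DWORD start;
        DWORD total;
    } Probe;

    static void probe_init(Probe *p, const char *name) { p->name = name; p->total = 0; }
    static void probe_start(Probe *p)                  { p->start = GetTickCount(); }
    static void probe_stop(Probe *p)                   { p->total += GetTickCount() - p->start; }
    static void probe_report(const Probe *p)
    {
        printf("%s: %lu ms\n", p->name, (unsigned long)p->total);
    }

Wrapping only the decoding calls (for example, macroblock reading or deblocking) in probe_start/probe_stop pairs gives per-section totals without any external profiler.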
The console interface allowed standard I/O methods to be used instead of specializing for WinCE, increasing portability and reducing complexity. Microsoft Visual Studio .NET 2003 was used for compiling and coding the Pentium executable version of the code, along with the necessary data-gathering environment. Two different profilers, Intel's VTune and Compuware's DevPartner Profiler, were used to profile the video decoding complexity on the Pentium platform.

In the handful of places where the operating system calls differ significantly, separate coding sections are included to properly interact with the code. Methods for accessing the system time were the most common difference. Threads also showed the potential to require different interfaces. These areas in the code are the most likely to cause difficulty in the event of porting to a separate system but should be easily identifiable.

JVT Decoder

The JVT team, as part of their standard, is developing algorithmically correct implementations of portions of the H.264 standard. One of these is version 6.1e [26]. Version 6.1e contains a fully functional encoder and decoder for the standard features, including all of those used in this research. Some of the advanced features, such as slices, flexible macroblock ordering, and error correction, were not implemented as of this release. (Later versions, in an attempt to improve the encoder and decoder and add additional features, leave the encoder and/or decoder inoperable.)

As mentioned previously, the JVT technical specification for the decoder is an ideal decoder. Numerous ideas have been proposed for reducing complexity in exchange for a degraded video signal. For the purpose of this complexity overview, I implemented generic software improvements which did not reduce the video signal quality. Some additional changes were made to make monitoring software performance easier, since the available profiling tools for WinCE are impractical on non-custom hardware. The significant changes were:

Removal of system calls: All extraneous file inputs and outputs, such as output storage and verification, were removed. Thus the decoded frames were not written to a file, nor compared against the "ideal" video. Selected sequences were viewed to ensure that the decoder still functioned properly after code changes. Additionally, the majority of the memory allocation and de-allocation calls were moved to the program initialization or clean-up stage rather than the timed portion of the decoding. My experiments showed that repeated memory cleanup and reallocation was using as much as 30-35% of the decoding time in the JVT software version that was initially considered. As mentioned elsewhere, this heavy system load is one of the significant differences from the desktop, where the same memory tasks were taking an estimated two or three percent of the run time.
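A hypothetical before-and-after fragment of the allocation change described above: the per-frame malloc/free pair is replaced by a buffer allocated once at start-up and reused, so the allocator never runs inside the timed decoding loop. The function and variable names are made up for illustration.

    #include <stdlib.h>

    /* Before: every frame allocates and frees its own pixel buffer, so the
     * allocator runs inside the timed decoding loop.                        */
    void decode_frame_slow(int frame_bytes)
    {
        unsigned char *buf = (unsigned char *)malloc(frame_bytes);
        /* ... decode into buf ... */
        free(buf);                          /* allocator churn on every frame */
    }

    /* After: the buffer is created once during program initialization and
     * simply reused, so no allocator calls fall inside the timed region.    */
    static unsigned char *g_frame_buf;

    void decoder_init(int frame_bytes) { g_frame_buf = (unsigned char *)malloc(frame_bytes); }
    void decode_frame_fast(void)       { /* ... decode into g_frame_buf ... */ }
    void decoder_cleanup(void)         { free(g_frame_buf); }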
This change also facilitated video playback by allowing the decoder to pass a memory pointer to the video player, by naturally reordering out-of-sequence frames, and by adding a decoded-frame buffer to smooth out decoding irregularities. The addition of the buffer provided a nominal improvement of 5% or so. (Footnote: Out-of-sequence frames occur because a B frame, due to prediction, may depend on a frame that occurs after it in display order. When this is the case, the frames that the B frame requires for prediction are sent first, which means that the frames do not arrive in display order.)

Addition of timers: Due to the severe lack of profiling software for the WinCE environment, a number of timers were inserted into the software to obtain timing information for key sections of the code. As of this research, the author is unaware of any quality profiling approaches for complicated WinCE tasks. A handful of private non-commercial solutions are available, but these carry considerable overhead, seem susceptible to misprofiling functions, and interface poorly with other software. Function counts can be found by profiling the decoder on the Pentium platform. However, as noted by the difference in memory allocation complexity, the handheld environment's response does not necessarily correlate to the PC's. These inserted timers provide an overview of how the decoding time is being spent within the decoder by partitioning the decoding time into roughly six different areas, such as overhead, frame decoding, macroblock decoding, and macroblock reading. Tests indicated that the inserted timers were sufficiently few in number that they had negligible impact on the average run time.

Modification of frequently called functions: Frequently called functions, such as the NALU locator, the start code search routine, and portions of the symbol decoding routines, were rewritten to handle the most frequent case more quickly. These changes involved techniques such as reading larger chunks of the file at a time, storing decoded bits rather than recalculating them each time, and an improved search approach. An estimated 15-25% improvement was seen as a result of these changes.

JVT Encoder

The original 30 fps YUV video test sequences were encoded using version 6.1e of the JVT encoder [26]. This encoder, provided by the JVT team, correctly handled the options involving entropy coding, frame rate, and GoP structure. Advanced options, such as those involving error control, flexible macroblock ordering, and search restrictions, were not fully implemented as of this release. This release is one of the earlier implementations of CAVLC entropy coding in the JVT source code (the CABAC and CAVLC entropy coding modes are discussed more fully in Chapter 2). Later releases began implementing these features but also broke other components of the encoder. The encoder was used "as is" with the exception of changes to its input and output routines to provide information in a format more conducive to interaction with the experimental platform (described below).

JVT Player

A separate Windows video player was co-developed with another colleague in the WAVES lab to display the decoded frames from the JVT decoder on the screen. Separate CE and PC versions were developed to take advantage of the different display technologies. The CE version, the portion that I worked on and developed, ran as a separate thread from the decoder.
The player thread calculated an interval of i milliseconds that it should sleep, then woke up, asked the decoder for a pointer to a frame to display, displayed the frame, and went back to sleep for i milliseconds. Mutexes, a thread synchronization mechanism, prevented the threads from interfering with each other, and delay conditions handled the buffer underflow and overflow cases. Experimentation suggested that the best-case display time for a QCIF image using the GAPI interface was about six (6) milliseconds, and a fullscreen image required about seventeen (17) milliseconds. This additional overhead means that displaying a 5 fps sequence in real time requires a decoding speed of 5.2 fps, a 10 fps sequence requires a 10.6 fps rate, and a 15 fps sequence about a 16.5 fps rate. (For example, at 15 fps each frame has a 66.7 ms budget; subtracting roughly 6 ms of display time leaves about 60 ms for decoding, i.e., roughly 16.5 frames per second.) The actual required rate will vary depending on the actual performance of the video driver.

Experimental Platform

In order to efficiently encode, decode, and record experiment information, an automatic experiment executor was built. Its functionality included several main components:

Automatic Encoder: Given a list of experiments, the executor generated the appropriate encoder configuration file, ran the JVT encoder, waited for the encoder to finish, and recorded the bit counts and PSNR values. It then stored the results in the appropriate result file and moved the necessary input files for the decoder into the appropriate directory.

Automatic Decoder: Given a list of experiments and the platform, the executor copied the decoder input and configuration files into the proper location, ran the appropriate decoder, and read in the resulting timings from the decoder run. The results were stored within the result files. The executor was capable of tracking different configurations and comparing configurations against a specified optimum run.

Plotter: Given a list of experiments and the desired data calculation, the executor parsed the result data and generated a file specifying how to plot the requested information. This file was specially formatted to be fed into a Matlab function, which then interpreted the text into a plot.

Experimental JVT Video Sequence Settings

The first subsection discusses the encoding and decoding settings that were varied for the experimental results. The second subsection explains settings that were fixed for this research.

Table 2: Summary of Coding Variables
  Video Sequences: Akiyo, Foreman, Mobile
  GoP Structures: I, P, B, T
  I-Frame Frequencies: 4, 8, 12, 16, 20
  QP Values: 0-51
  Frame Rates: 5, 10, 15, 30
  Entropy Coding: CABAC & CAVLC
  Video Format: QCIF (176x144)
  Hadamard Transform: On
  Max. Search Range: 16
  Num. Reference Frames: 1
  Forced Intra-Macroblocks: None
  Slices: Unused
  SP Frames: Unused
  Loop Filter Parameters: Default

Experiment Variables:

Video Sequence: Three different video sequences were used; all three are standard 10-second video coding test sequences. Figure 3 shows a sample frame from each of the sequences: on the left, Akiyo, with a talking anchorwoman in a news studio; in the center, Foreman, with a foreman talking about the construction site shown in the final frames; and on the right, Mobile, with a moving train and calendar. The first, Akiyo, is a shot of a talking news anchorwoman. It features minimal motion, high temporal and spatial correlation, and is highly compressible. Akiyo represents an ideal video sequence.
The second, Foreman, is a higher-motion sequence of a construction foreman talking and then showing the surrounding landscape. It features more motion and lower temporal correlation than Akiyo and is more representative of a typical video sequence. The third sequence, Mobile, is a high-motion video with a great deal of detail. It is very difficult to compress, and it loses rather significant quality with compression.

Figure 3: First Frame of Video Sequences Akiyo (Left), Foreman (Center), Mobile (Right)

Group of Picture (GoP) structure: Four main GoP structures were used, and sub-variants of two of those were further investigated. The four main GoP structures are referred to here as I, P, B, and T. The I-GoP pattern is an all-I-frame sequence. This GoP ignores temporal correlation and is a sequence of separate images, like stringing together a series of JPEG images (i.e., similar to motion JPEG). The I-GoP has a high bitrate due to minimal compression but also has no motion prediction (i.e., low complexity, in general). The P-GoP is a mostly P-frame sequence with an I frame every 12 frames: an I frame, 11 P frames, then a second I frame and 11 more P frames, and so on. The repeated I frame gives the decoder a periodic non-dependent frame, which prevents long-term signal degradation as well as providing a moderate level of error control. Even if a frame is corrupted in transmission, a complete picture will arrive within a second or two, depending upon the frame rate. The B-GoP replaces every other frame with a B frame, so the pattern becomes I-B-P-B-P-...-I-B-P-B-P-... The frequent use of B frames takes advantage of temporal correlation in both directions and thus increases the compression. However, bi-directional motion prediction increases the complexity during decoding. The final GoP, T, uses two B frames instead of one: I-B-B-P-B-B-P. This pattern further exploits any correlation between frames. In both the B and T GoPs, an I frame repeats every 12 frames, regardless of whether the intermediate frames are P or B frames.

The two sub-variants profiled involved changing the frequency of the I frames for the P and B sequences. The P-GoP variants are I4X, I8X, I16X, and I20X, where the number is the I-frame frequency. Thus the P-GoP is equivalent to I12X and the I-GoP to I1X. The B-GoP variants are B4X, B8X, B16X, and B20X, again with the number indicating I-frame frequency and the B-GoP being B12X. B frames repeat every other frame, so B4X has the pattern I-B-P-B-I-B-P-... The complexity response to changing the I-frame frequency gives insight into how frequent image refreshing impacts performance compared against longer runs of motion prediction. These results were remarkably similar to the primary four GoPs examined (I, P, B, T), and specific analysis is not presented for them. In general, their complexity variance is directly related to I-frame frequency and follows the data shown in Table 9.

Frame Rate: The original 10-second video sequences were captured at thirty frames per second for a total of 300 frames. The sequences were then encoded at four different frame rates: five, ten, fifteen, and thirty frames per second (FPS). 15 FPS corresponds to skipping every other frame, 10 FPS to skipping two of every three frames, and 5 FPS to skipping five of every six frames. In theory, skipping fewer frames should increase the temporal correlation between frames, enabling higher compression and a lower bitrate.
However, since the tradeoff between the complexity increase and the bitrate reduction is unknown, the increased correlation may not help computational complexity.

Quantization Parameter (QP): Video sequences were generated for all valid QP values, which range from 0 to 51. A QP of 0 is almost uncompressed video, with an extremely high bitrate and excellent picture quality; 51 gives the lowest quality, with minimum bitrate and typically unacceptable picture quality. For most of the analysis, the working subrange of 25-45 dB PSNR was selected: 25 dB is at the low end of what is considered acceptable video, and 45 dB constitutes an excellent picture. Thus 25-45 dB represents a working range within which to evaluate performance. Different QP values were selected for this range for each video sequence, as the QP that generated the range varied with frame rate, GoP structure, and video sequence.

Entropy Coding: The JVT standard allows two types of entropy coding, context-adaptive binary arithmetic coding (CABAC) and context-adaptive variable-length coding (CAVLC). Previous research indicates that CABAC decreases the bitrate by 10-15% compared to the equivalently coded UVLC sequence [11]. However, CABAC is traditionally more computationally complex, and the bit reduction is less likely to be useful given the complexity limitations of the iPAQ. When comparing CAVLC with CABAC, Wiegand et al. estimate that CABAC gains 5-15% in bitrate savings [21], with the highest savings in interlaced television signals (which are not investigated in this thesis). Within the test sequences, our experiments suggested that for a fixed rate, CABAC coding offered a distortion improvement ranging between zero and one decibel.

Experiment Constants:

Video Format: Due to the screen size of most handheld devices (320x240 pixels or smaller), the Quarter Common Intermediate Format (QCIF, 176x144 pixels) was used. The other common standard is the Common Intermediate Format (CIF), which is four times larger (352x288 pixels) and larger than most handheld devices can accommodate without additional scaling. The standard video test clips come in these formats, and the choice of QCIF avoided any degradation from issues not related to the encoding process.

Hadamard Transform: The encoder used the standard Hadamard transform to provide better encoding results. A less accurate approximation that does not use the Hadamard transform is also available.

Forced Intra-Macroblocks: Forced intra-macroblocks, blocks that are compressed without motion prediction, are normally used to refresh the signal quality and prevent error propagation. Since the encoded video sequences included I frames at regular intervals, no additional intra-macroblocks were added.

Maximum Search Range: The search range determines how far from the block center the encoder will search when looking for the most similar block. Here a standard value of 16 pixels, the size of one full macroblock, was used.

Number of Reference Frames: The number of reference frames determines how many previous (or future) frames are searched when looking for the most similar block. In my research, multiple reference frames provided only minimal performance improvement when compared with a single reference frame. However, multiple reference frames lead to a significant increase in encoder complexity. Several recent JVT studies have used five reference frames [4, 5], but provide no rationale for this choice.
We believe that any level of prediction higher than the one used in B frames (i.e., the use of more than two reference frames) corresponds to an unreasonable level of complexity that should not (and cannot) be used on handheld devices.

Block Search Restrictions: No block search restrictions were enabled. The encoder was allowed to choose 16x16, 16x8, 8x8, 8x4, or 4x4 blocks as it found the best matches.

Slices, SP Frames, Loop Filter Parameters: None of these special features were used. Only a single slice was used, no switching frames were used, and the default loop filter parameters were used.

Primary (I, P, B, T) GoP Results by Complexity-Distortion:

For each of the three sequences (Akiyo, Foreman, and Mobile), the results and analysis are broken into sections. The first section contains the results and an overview of the important points. The next two sections are a performance comparison between the CABAC and CAVLC modes and then a comparison between the CE and PC results. For Foreman and Mobile, a brief synopsis is then presented comparing the current sequence against the previously discussed sequences.

The results shown and discussed below are the decoding complexity in terms of frames per second (FPS) plotted against the peak signal-to-noise ratio (PSNR) in decibels. The higher the signal-to-noise ratio, the better the quality. At QP 0, the PSNR is around 60-70 dB, depending on the sequence. Anything around 45 dB or higher is considered excellent signal quality. The distortion range shown, 25-45 dB, represents the span between a rather highly distorted signal (25 dB) and a high-quality signal (45 dB). The run times are measured in milliseconds or clock cycles, depending on which was most appropriate for the measurement scale, and then converted to FPS. FPS provides normalization, since the sequences differ slightly in their coded length based on their GoP pattern. The data is shown for a single test, as repeated tests on the same sequence indicated slight (less than 2%) deviation between runs. Furthermore, since each test differs only slightly from the tests before and after it, the entire run from QP 0 to 51 serves as verification of the accuracy of individual runs. Certain tests, even when repeated, seem to be unusually quick or slow, and this presumably is related to a combination of encoding choices and architectural caches. Because of its significantly faster microprocessor, the PC is more vulnerable to these fluctuations than the handheld.

It is important to note that the data and analysis presented in this chapter assume that adequate bandwidth is available for receiving any of the available sequences. All bitrate constraints are ignored. Also, the time to read the data quickly from a file is assumed to be the same as the time needed to receive the video stream over a wireless connection. However, the actual complexity of file reading and network reception may differ significantly.

Plot Specifications

In the plots presented in this thesis, the constant variables for a plot are specified in the title. The video sequence is abbreviated to its first three letters. The GoP structure has an additional specification, '-x', where x specifies the exact frame type from a sequence that is being plotted: '-A' is the average over all frames, '-i' the I frames, '-p' the P frames, and '-b' the B frames. While only an average SNR number is available, frame-type-specific data is available for time and bitrate results.
Clearly, frame-specific availability depends on the sequence GoP: a P-GoP has no data for the B frames, for example. Experiment variables that are constant for all experiments in a plot are printed in the title; the plot variables are labeled in the legend. The title also specifies the limits of the data shown in the plot. There are three limits used: an SNR limit, a kilobytes-per-frame (KBPF) limit, and a milliseconds-per-frame (MSPF) limit. The default ("no limit") setting is a PSNR between 25 and 45 dB, with any complexity and bitrate. All three limits are applied to the overall performance of a sequence, not to particular frame types. For example, the average size of all frames in a T-GoP will be much higher than the average B-frame size; however, the limit filter uses the average of all frames even when only plotting the B frames.

Figure 4: Akiyo Complexity-Distortion (15 FPS, CE). [PSNR (dB) versus decoder frames per second for the I, P, B, and T GoPs; one panel per entropy coding mode.]

Figure 5: Akiyo Complexity-Distortion (15 FPS, PC). [Same axes and GoP legend for the PC platform.]

Overview

Table 3: Akiyo Complexity Deltas over 25-45 dB
(Time delta: slowest minus fastest time per frame over the 25-45 dB range.)
          Time delta (ms)      Percent slowdown      % entropy diff.
          CABAC    CAVLC       CABAC    CAVLC        25 dB    45 dB
CE   I    13.9     28.5        37.6     65.5         17.8     41.7
     P     5.2     12.7        14.7     35.8          0.0     18.4
     B     4.1      9.7         8.4     19.9          1.0     11.7
     T     3.2      9.9         6.2     18.8          0.0     11.9
PC   I     3.8      6.3        60.0     94.8          6.7     29.9
     P     1.3      2.1        21.7     36.0         -1.2     10.4
     B     0.9      1.4        11.9     18.8         -0.8      5.4
     T     0.9      0.8        10.6      9.1          4.2      2.7

Figure 4 and Figure 5 show the handheld and PC performance for the primary GoP types (I, P, B, T) when run at 15 frames per second for both entropy types. The 15 fps sequence is shown for two reasons. First, it is the fastest frame rate that the decoder can play back in real time on the tested iPAQ platform. As suggested by Figure 4 and confirmed by tests on 30 fps sequences, the decoder stops short of reaching 30 fps even for a very low-quality Akiyo sequence. Second, among frame rates lower than 30 fps, 15 fps maximizes the temporal correlation since only every other frame is removed. (Lower frame rates, such as 10 or 5 fps, discard more frames within the sequence.) While this similarity matters less for Akiyo, since every frame is extremely similar, it does mean that the P and B frames achieve better compression for the Foreman and Mobile sequences discussed later.

Clearly, for the handheld, the P-GoP is the best compression approach for the easily compressed Akiyo sequence: for any quality level, it has the lowest complexity (highest frame rate). Table 3 summarizes the overall complexity change over the 25-45 dB range. The time is the number of milliseconds longer that decoding a frame at 45 dB takes than decoding at 25 dB. The percent slowdown is that time delta as a percentage of the fastest time to decode a frame, that is, the time at 25 dB. The percent difference is the percentage by which a CAVLC sequence is slower than the CABAC signal at the same decibel level.
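To make these three metrics concrete, the following is a minimal sketch of how the columns of Table 3 could be computed from the measured per-frame decode times. The struct, function names, and the sample values in main() are hypothetical illustrations, not measurements or code from the thesis.

    #include <cstdio>

    // Per-frame decode times (ms) measured at the two ends of the 25-45 dB range
    // for one GoP/entropy/platform combination.
    struct RangeTimes {
        double ms25dB;   // fastest: time per frame at 25 dB
        double ms45dB;   // slowest: time per frame at 45 dB
    };

    // "Time delta" column: slowest minus fastest over the range.
    double timeDelta(const RangeTimes& t) { return t.ms45dB - t.ms25dB; }

    // "Percent slowdown" column: the delta as a percentage of the 25 dB time.
    double percentSlowdown(const RangeTimes& t) {
        return 100.0 * (t.ms45dB - t.ms25dB) / t.ms25dB;
    }

    // "% entropy diff." column: how much slower CAVLC is than CABAC at the same dB.
    double percentEntropyDiff(double msCavlc, double msCabac) {
        return 100.0 * (msCavlc - msCabac) / msCabac;
    }

    int main() {
        // Hypothetical illustrative numbers only.
        RangeTimes cabac = {35.0, 40.2};
        RangeTimes cavlc = {35.0, 47.5};
        std::printf("delta=%.1f ms, slowdown=%.1f%%, entropy diff at 45 dB=%.1f%%\n",
                    timeDelta(cabac), percentSlowdown(cabac),
                    percentEntropyDiff(cavlc.ms45dB, cabac.ms45dB));
        return 0;
    }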
Thus for the handheld P-GoP, moving from a low-quality signal to a high-quality signal is only a 15% (5.2 ms) complexity increase using CABAC, while a 45 dB CAVLC signal is 18% slower to decode than a 45 dB CABAC signal. Note that the PC and CE percentages can be compared against one another because the percentages are taken against an individual baseline performance for each platform.

Several key conclusions stand out from these experiments. First, there can be significant differences in how handheld and desktop computers respond in complexity to distortion changes. Between architectural, operating system, and compilation differences, the same performance gains tested and developed for personal computers cannot be assumed to have equivalent efficiency on a handheld device. Second, the B and T-GoPs, with their heavy motion prediction and higher compression, are relatively stable in complexity regardless of distortion, entropy coding, and platform for highly correlated sequences. Finally, the I-GoP complexity varies drastically depending upon distortion, entropy coding, and platform.

Handheld: CABAC & CAVLC

Surprisingly, CAVLC entropy coding is not a significant improvement over the CABAC-coded sequences. The decoder source notes mention that the CAVLC decoding procedures have not been fully optimized, and even the additional optimization described earlier has not fully caught up to the CABAC implementation. It is worth noting that at high distortion, and excluding the I-GoP, there is relatively little difference between the two entropy coding methods. Since the I-GoP uses the least compression (and thus a much higher bitrate than the other GoPs) and has the highest complexity difference at high distortion levels, these results suggest that the amount of input data that must be parsed may be as significant as the computation it takes to parse the data. This suggests that memory and file I/O bottlenecks may be an important factor in addition to available processor time. (This aspect will be further discussed and analyzed in the next chapter.)

Handheld vs. PC

Interestingly, the PC has a much harder time with the I-GoP as distortion decreases compared to the iPAQ: the standard computer takes ninety-five percent longer to decode a 45 dB CAVLC sequence than a 25 dB one, and sixty percent longer to decode a CABAC sequence at 45 dB than at 25 dB. The iPAQ, comparatively, takes only 66% longer for CAVLC and 38% longer for CABAC. While the iPAQ certainly slows down with lower distortion, its I-GoP performance is significantly less hindered: in Figure 4, the CABAC I-GoP is always faster than the B or T-GoPs; not so for the PC (see Figure 5). Correspondingly, the CE CAVLC I-GoP stays much closer to the T-GoP performance than the PC I-GoP stays to its corresponding T-GoP.

Foreman

Figure 6: Foreman Complexity-Distortion (15 FPS, CE). [PSNR (dB) versus decoder frames per second for the I, P, B, and T GoPs; one panel per entropy coding mode.]

Figure 7: Foreman Complexity-Distortion (15 FPS, PC). [Same axes and GoP legend for the PC platform.]

Overview

Table 4: Foreman Complexity Deltas over 25-45 dB
(Time delta: slowest minus fastest time per frame over the 25-45 dB range.)
          Time delta (ms)      Percent slowdown      % entropy diff.
          CABAC    CAVLC       CABAC    CAVLC        25 dB    45 dB
CE   I    24.0     45.9        64.4     101.8        20.7     48.2
     P    18.2     44.9        49.2     112.7         7.6     53.4
     B    13.1     45.6        26.7      90.4         3.0     54.8
     T    12.9     45.8        25.8      86.1         6.4     57.4
PC   I     6.0     11.4        92.5     176.8        -0.6     42.9
     P     4.1      7.7        63.8     117.1         1.3     34.3
     B     3.2      6.5        41.3      82.9         1.6     31.4
     T     3.2      6.2        40.0      77.1         1.6     28.6

The Foreman sequence results for 15 frames per second are shown in Figure 6, Figure 7, and Table 4. The Foreman sequence was included in the test sequences because of its similarity to a "typical" (more or less average) sequence, with reasonable temporal correlation but also a moderate-to-significant level of motion. Overall, the Foreman sequence shows greater complexity than the Akiyo sequence, especially as the distortion decreases. Again, as with Akiyo, the P-GoP provides the lowest complexity for a given distortion.

The handheld performance reflects the general knowledge that Foreman is a more complicated sequence than Akiyo. While both sequences have roughly the same complexity at 25 dB, the Foreman complexity increases more rapidly. In addition, the B and T-GoP complexity changes, whereas it remained almost constant for the Akiyo sequence. With reduced compression, the difference between the GoP structures shrinks, although a clear difference is still visible. Additionally, at low distortion, real-time performance can barely be achieved in some instances: the CABAC low-distortion I, B, and T GoPs are near the cutoff point for real-time functionality, and all four GoPs are well below real-time levels using CAVLC.

Handheld: CABAC & CAVLC

Overall, when comparing the Akiyo and Foreman GoPs, all four GoPs are similar in speed at the 25 dB level. All four are also significantly slower than their Akiyo counterparts by the 45 dB level. By comparing Figure 4 with Figure 6, it is clear that the starting complexities for a high-distortion Akiyo sequence are similar to those of the high-distortion Foreman sequences, although Foreman is slightly more complicated even at high distortion. For the Foreman sequences at lower distortion, the different GoP patterns come much closer to merging.

Handheld vs. PC

As with Akiyo, the I-GoP once again performs much better relative to the other GoPs on the iPAQ than on the PC. (The PC response is shown in Figure 5 and Figure 7.) Under CE, the I-GoP outperformed the B and T-GoPs with CABAC and matched them with CAVLC. However, the PC I-GoP has significantly greater complexity than the B and T sequences. With Foreman being a less compressible sequence than Akiyo, the complexity increases more sharply for the B and T-GoPs as distortion drops. However, the 25-35 dB Foreman range shows the same frozen trend that characterized these GoPs in Akiyo: as the distortion drops, complexity remains almost constant. In fact, for the 25-31 dB portions of the CABAC complexity-distortion graph, reducing distortion also reduces complexity.
It is not immediately clear what is causing this complexity response, and the question will be investigated more deeply with some of the detailed timing data.

Handheld: Akiyo vs. Foreman

Looking at the handheld device, reducing distortion on the Foreman sequence produces a very different result than for Akiyo. The Akiyo sequences begin with substantial time gaps between the P and B GoPs. Under CABAC, the difference is about 13 ms at 25 dB and is still around 12-13 ms at 45 dB. For CAVLC, the gap is similar, starting at about 14 ms and ending at 12-13 ms. For Foreman, however, the gap starts around 10 ms and shrinks to around 7 ms for CABAC; with CAVLC it also shrinks from about 10 ms to 7 ms. In both sequences, the B and T GoPs mirror each other, with the T-GoP being about 1 ms slower with CABAC and about 5 ms slower with CAVLC coding. Additionally, the complexity increase is significantly greater with the Foreman sequence. For the Akiyo P-GoP, the complexity increases 5 ms per frame for CABAC and 13 ms per frame for CAVLC. For the Foreman sequence, those numbers are 18 ms and 47 ms respectively. Thus the Foreman sequence, while having performance similar to Akiyo at 25 dB, grows in complexity much more rapidly. This greater overall complexity seems to minimize the difference between I, P, B, and T frames, especially for the CAVLC coding approach.

Mobile

Figure 8: Mobile Complexity-Distortion (15 FPS, CE). [PSNR (dB) versus decoder frames per second for the I, P, B, and T GoPs; one panel per entropy coding mode.]

Figure 9: Mobile Complexity-Distortion (15 FPS, PC). [Same axes and GoP legend for the PC platform.]

Overview

Table 5: Mobile Complexity Deltas over 25-45 dB
(Time delta: slowest minus fastest time per frame over the 25-45 dB range.)
          Time delta (ms)      Percent slowdown      % entropy diff.
          CABAC    CAVLC       CABAC    CAVLC        25 dB    45 dB
CE   I    51.2     51.5        129.1     74.7        73.8     32.5
     P    30.0     85.0         63.8    176.0         2.9     73.3
     B    37.1     87.0         78.0    155.7        17.3     68.6
     T    34.6     98.6         70.8    178.5        13.3     84.6
PC   I     9.8     11.6        107.5     90.2        41.0     29.3
     P     8.7     16.4        122.2    204.9        12.0     53.7
     B     8.3     16.0        103.3    190.2         4.2     48.8
     T     8.7     15.7        110.0    180.5         9.6     46.3

The plots and a tabular summary of results for the Mobile sequence are presented in Figure 8, Figure 9, and Table 5. Again, Mobile is a highly detailed sequence that is difficult to compress and contains a high degree of motion. It is by far the most complicated of the three test sequences used. In fact, real-time performance is possible only in select cases: below about 30 dB for CAVLC and below 36 dB for CABAC. This represents a significant complexity increase over Akiyo, where all sequences were real time, and Foreman, where most sequences were real time.

Handheld: CABAC & CAVLC

A key observation is that, for the first time, the I-GoP provides the lowest-complexity solution for 40-45 dB sequences using CAVLC entropy coding.
Although the CAVLC sequences are still slower than their CABAC counterparts, there is no clear explanation for the stronger performance of the I-GoP. This question will be revisited later when the detailed profiling and rate aspects of these tests are examined.

The 25 dB CABAC baseline behavior is also unexpected. The high motion prediction GoPs (B and T) are actually less complicated than their Akiyo and Foreman counterparts. In fact, Akiyo, the most compressible sequence, has the highest complexity of the three at 25 dB. However, the I and P GoPs are more complex for Mobile than for the previous sequences. This suggests that with high distortion the B and T GoPs are fairly insensitive to correlation, while the P and especially the I GoP, without temporal correlation, are more sensitive to sequence complexity.

Handheld vs. PC

Surprisingly, the overall pattern is that as distortion decreases, the handheld complexity increases more slowly than the PC complexity. For example, the net CABAC handheld complexity increase for the P-GoP is 64%, but a shocking 122% for the PC. The exception to this pattern is the CABAC I-GoP, where the PC increase is 108% and the handheld 129%. As noted before, the I-GoP is the least predictable GoP structure. The overall lower complexity increase for the handheld, however, suggests that the iPAQ is less sensitive than the PC to some algorithmic aspect related to distortion changes.

Mobile Compared Against Previous Sequences

The difficult Mobile sequence also produces a more linear response from the CABAC sequences than has been seen before: the relative performance of the four GoPs is similar regardless of the distortion, although the T and I GoPs gradually slip further behind the B-GoP's performance. (This linearity is not seen as strongly on the desktop, where the I-GoP is significantly slower than the others.) The CAVLC sequences have similar responses for the P, B, and T GoPs, while the I-GoP starts out much more complex but grows in complexity more slowly than the other GoPs. This is a reversal of the Foreman pattern, where the I, B, and T GoPs clustered and the P-GoP was less complex.

The Mobile sequence also reflects a much greater complexity increase as the distortion decreases. Focusing on the handheld as the primary experiments of interest, the 45 dB sequences grow in complexity roughly twice as much as the Foreman sequences did. For Foreman, the CABAC slowdown percentages for the I, P, B, and T GoPs were 64%, 50%, 27%, and 26% respectively, and 102%, 112%, 90%, and 86% for CAVLC. For the Mobile sequences, the equivalent CABAC numbers are 130%, 64%, 78%, and 81%, and for CAVLC 75%, 176%, 156%, and 179%. While the CAVLC I-GoP and the CABAC P-GoP do not reflect the same slowdown rates, the overall pattern is that Mobile slows down significantly more relative to its baseline than Foreman (as well as Akiyo, which experienced minimal complexity increase).

Trends and Observations

Several distinct patterns show themselves through all three sequences.

1. First, the different GoPs do show repeatable patterns. The P-GoP consistently provides the lowest complexity for almost any level of distortion. The B and T GoPs mirror each other's performance, with the double-B-frame pattern always being a bit slower than the single-B-frame pattern. Presumably, if motion prediction and thus B frames were significantly faster, the increase in T frames would result in slightly faster performance compared to the B-GoP.
The I-GoP is the most unpredictable of the four GoPs, ranging from much slower than any other GoP to being the least complex choice for high-quality CAVLC-coded Mobile sequences. However, significant improvement in the performance of B frames is clearly needed before the B or T-GoPs become a viable option. The I-GoP, with special attention to parsing, may begin to outperform the P-GoP for specific sequence choices.

2. Secondly, the results repeatedly show that Pentium simulations do not provide accurate predictions of handheld behavior. At times the results are very similar (e.g., CAVLC Akiyo), while at other points the slowest PC option is the best handheld option (e.g., CAVLC Mobile). In particular, this unpredictability is demonstrated by the I-GoP, which varies drastically in size and complexity based on the sequence and distortion level. From the high-level results, the B and T-GoPs are very similar, with the T-GoP being just a bit slower than the B-GoP on both platforms. Beyond that, while similarities exist, such as the strong P-GoP performance, the similarities differ for each sequence and entropy coding selection.

3. These differences suggest that algorithmic optimization at a language level may be inadequate to fully address the complexity issues. The lack of correlation between experiments on the PC and handheld platforms, despite compiler optimization for both, suggests that non-algorithmic aspects such as chip architecture, operating system, and compiler techniques play a significant role in the overall complexity of a given sequence. This disconnect emphasizes the need for multiple-platform complexity studies as well as better analysis tools for non-traditional platforms. (Footnote 8: As of this writing (April 2003), Intel has just released a profiling tool for the PXA2xx processors, the processor series used in the iPAQ; however, the tool requires that the processor be used with Intel's development hardware and does not work with other hardware.)

4. There is a strong correlation between the compressibility of a sequence and its general complexity, both on the handheld and on the standard computer. Akiyo is the fastest sequence, followed by Foreman and finally by Mobile. This correlation holds regardless of platform, entropy technique, or Group of Picture structure. The complexity-compression correlation is most visible in the lower-distortion sequences. Overall, the high-distortion sequences of the same GoP structure have similar complexities; as distortion drops, the correlation becomes more distinct.

5. The complexity increase (as distortion drops) is significantly higher with CAVLC entropy coding than with CABAC. This suggests that the parsing/entropy decoding method significantly influences the overall complexity. This theme will be explored in more depth in Chapter 4.

CHAPTER 4
BITRATE COMPLEXITY ANALYSIS

In analyzing the distortion-complexity plots shown in Chapter 3, the complexity variation is strikingly different based upon sequence choice as well as GoP type. In order to better understand the factors controlling the complexity results, a number of high-level timing probes were inserted into the code. While profiling programs are available for code built for traditional computers, equivalent software is not available for most mobile platforms, including the iPAQ. Thus a creative and non-intrusive approach to timing is necessary for understanding the program's performance.
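One lightweight approach of this kind is to accumulate elapsed high-resolution counter ticks per code section and convert to milliseconds only when reporting. The sketch below assumes the Win32/WinCE QueryPerformanceCounter interface and hypothetical section names; the thesis does not specify which timing API its probes used, so this is an illustration of the general idea rather than the actual implementation.

    #include <windows.h>

    // Hypothetical section identifiers; the thesis partitions decoding into roughly
    // six areas (overhead, frame decoding, macroblock decoding, macroblock reading, ...).
    enum ProbeSection { PROBE_READ_MB, PROBE_DECODE_MB, PROBE_DEBLOCK, PROBE_COUNT };

    static LONGLONG g_accum[PROBE_COUNT];   // accumulated ticks per section
    static LONGLONG g_start[PROBE_COUNT];   // tick value when the section was entered

    void ProbeStart(ProbeSection s)
    {
        LARGE_INTEGER now;
        QueryPerformanceCounter(&now);      // read the high-resolution tick counter
        g_start[s] = now.QuadPart;
    }

    void ProbeStop(ProbeSection s)
    {
        LARGE_INTEGER now;
        QueryPerformanceCounter(&now);
        g_accum[s] += now.QuadPart - g_start[s];
    }

    double ProbeMilliseconds(ProbeSection s)
    {
        LARGE_INTEGER freq;
        QueryPerformanceFrequency(&freq);   // counter ticks per second
                                            // (the thesis reports about 3.69 million on its iPAQ)
        return 1000.0 * (double)g_accum[s] / (double)freq.QuadPart;
    }

Wrapping, say, the macroblock-reading routine between ProbeStart(PROBE_READ_MB) and ProbeStop(PROBE_READ_MB) costs only a pair of counter reads per call, which is why the probes described next were restricted to infrequently called functions.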
As mentioned earlier, the iPAQ timings do not behave like miniature PC results, so profiling the PC is not an entirely accurate substitute. The remainder of the chapter is organized as follows. First, the timing probes used to measure the program structure are discussed. Then the timing results for the primary four GoP structures are presented by sequence. Finally, the chapter concludes with an overall summary. For normalization purposes, results are generally presented in milliseconds per frame. For the fifteen-frame-per-second sequences, 60 ms per frame or better is required to achieve real time; the 60 ms to decode leaves the video player with roughly 7 ms to display each frame, which is the approximate display rate.

Timing Probes

A total of seven high-precision timing probes were inserted into the decoding software. These probes were intentionally placed in infrequently called functions in order to minimize the complexity impact of the probes themselves. Initial tests indicated that any complexity change was well within the decoder's standard variance. Special attention has been given to the measurement descriptions to note "what is" and "what is not" included in each probe. Probes are called a total of roughly 200 times per frame.

Three estimates were derived for the cost of a probe. The first, measuring the time before and after a single probe, generated a worst-case number. The second, timing many probes and averaging to obtain a per-probe time, generated a more likely estimate but ignored any advantage gained from cache coherency. Neither estimate included the benefit that interleaving non-probe instructions could have on the CPU pipeline and memory. The final estimate was obtained by comparing decoder runs without timers against runs with the timers implemented. The estimated worst-case time for a probe in this iPAQ environment is 40 ticks, with 3.69 million ticks occurring per second. For the fifteen-frame-per-second sequences, that translates into an addition of roughly one-third of a second to the total time (roughly 200 probes per frame over about 150 frames, at 40 ticks each, is 1.2 million ticks). The averaged estimate, closer to 13 ticks per probe, adds about one-tenth of a second. Actual experimentation noticed no significant difference between the timing implementations beyond the standard variance. This observation suggests that the actual time change was under one-tenth of a second for most runs.

Actual Timing Probes

Equation 2 shows the relationship between each of the probes used, as well as the four additional calculated times. Each of the specific probes is discussed at greater depth below.

T_{Tot} = T_{Over} + \sum_{\text{I,P,B frames}} T_{DecFr}

T_{DecFr} = T_{ReadNewFr} + T_{FrInit} + T_{ProcMB} + T_{PostFrDec} + T_{DeblockFr} + T_{Unk}

T_{ProcMB} = \sum_{\text{all MBs}} \left( T_{ReadMB} + T_{DecodeMB} \right) + T_{MBOth}

T_{Non\text{-}ReadMB} = T_{ProcMB} - \sum_{\text{all MBs}} T_{ReadMB}

Calculated, not timed, probes: T_{Unk}, T_{Over}, T_{MBOth}, T_{Non-ReadMB}

Equation 2: Timing Probe Interactions

Total Time (Tot): This is the total time spent decoding a complete sequence (e.g., approximately 150 frames for 15 fps), excluding decoder initialization. (All original sequences have 300 frames, so a perfect 15 fps sequence is 150 frames. However, due to the GoP structure, some sequences end at 148 or 149 frames, because the next frame would be a B frame with no future I or P frame to use as a reference. The data presented is normalized by frame count to compensate.) Any comparison to the total decoder time is against this number. There is one call to this timer for the whole program.

Decode Frame (DecFr): The time to decode a frame, ignoring a slight amount of memory manipulation and generic cleanup (calculated below as Overhead). The frame times are totaled by frame type (I, P, or B) and are used for any calculation involving picture-type total times. The sum of all three picture-type totals is slightly less than the total time, due to the overhead.

Process Macroblocks (ProcMB): This is the total time it takes to decode a frame, including reading the pre-macroblock information, reading macroblocks, and decoding them. It ignores deblocking time (measured elsewhere) as well as some frame cleanup. The probe is called once per frame, and this section is expected to occupy a significant portion of the total run time. (This probe section corresponds to the "slice" breakdown within the reference code; thus some of the graphs reference this as a slice rather than ProcessMB. The discussion uses ProcessMB for clarity.)

Pre-Decode Frame Init (FrInit): This probe measures the time to prepare for decoding a frame. It includes initializing frame flags, entropy settings, and error correction, as well as ordering the frame buffers. This timing section, called once per frame, is expected to take a very small percentage of the program.

Read New Frame (ReadNewFr): This probe measures the time to read and process the pre-macroblock flow control information for each frame. This time is expected to be minimal, and the probe is called once per frame.

Read Macroblock (ReadMB): This probe, called once per macroblock (ninety-nine times per QCIF frame), measures the amount of time it takes to read a macroblock, primarily entropy decoding. However, the decoder also uses an inverse transform on intra-coded ("I") 16x16 macroblocks to extrapolate the DC coefficients for the new intra prediction modes. On selected sequences that were profiled for function counts, this rare transform was called fewer than a thousand times (compared to over 300,000 calls to the transform in DecodeMB).

Decode Macroblock (DecodeMB): The time spent taking the differential coefficients and motion coefficients read during the Read Macroblock phase, finding the motion estimate, and calculating the new macroblock coefficients using the motion vectors. Like Read Macroblock, the probe is called once per macroblock.

Post-Decode Frame (PostFrDec): This probe measures the clean-up time after successfully reading and decoding the macroblocks in a frame. The time here includes resetting various flags, motion information copying, and motion vector storage. This time is expected to be negligible but is included for completeness. The probe is called once per frame.

Deblock Frame (DeblockFr): This probe measures the time taken to deblock the frame (apply the deblocking filters to the decoded image and remove some of the artifacts and blockiness). This probe is called once per frame, and deblocking is expected to be a significant task.

Calculated Timing Probes

The following additional measurements were calculated from the previous probes to provide additional information about the timing within the decoder. All of the timings below use the previous probes as their foundation.

ProcessMB excluding Read Macroblock (Non-ReadMB): This is the time it took to decode the macroblocks excluding the time spent reading macroblocks. This time includes ProcessMB Other (see below) as well as Decode Macroblock.

Process Macroblocks Other (MBOth): The difference between the total time to decode all macroblocks and the measured subparts of macroblock decoding. The two subparts, read macroblock and decode macroblock, were added together and then subtracted from the total macroblock decoding time.

Overhead (Over): The difference between the total time spent decoding all of the frames and the total decoding time, ignoring decoder initialization. Unlike the other measures, overhead is calculated as picture-type independent, because only a single total time is available; thus there is no measure of overhead for different picture types.

Unknown (Unk): The time spent decoding a frame but not spent in any known subsection. This time is calculated by summing the following parts: frame deblocking, process macroblocks, read new frame, post-frame decoding, and pre-decode frame init; the sum is then subtracted from the total time spent decoding frames. (Thus overhead is not included as unknown time.)

Akiyo

This section presents the timing figures for the timing sections that show significant variation in complexity based upon bitrate for the Akiyo sequence. The overall rate-complexity plots are shown in Figure 10; the bottom plot is a magnified section containing the non-I GoPs from the top plot. (As mentioned before, the rate limit in a plot title is for the average size of all frames, not for the average size of a specific frame type. The rate axis corresponds to either the average size of all frames ('-A') or of a specific frame type ('-i', '-p', '-b') as determined by the plot type.) Many of the probes, shown in Table 6, indicate that bitrate has an insignificant impact upon their performance. The slight variances shown are neither linear nor particularly related to rate. In all cases except for post-frame decode, they include the timings for all four GoP structures (I, P, B, and T), and the variance for an individual GoP type is much smaller. Over a range of 0 to 35 kilobits per frame, the total variance is only 2.3 ms for CABAC and 3.4 ms for CAVLC (due to read new frame having a higher variance).

In comparison, the deblocking function is clearly rate related. (The relative complexity change with rate is rather small, and most likely is related to how many special cases the deblocker is dealing with, which allow the deblocker to do fewer calculations.) The deblocking rate-complexity plots are shown in Figure 11 and then broken down by picture type in Figure 12. The bottom section of Figure 11 shows the magnified view of the non-I GoPs. The top half of Figure 12 looks at the I frames from all four GoPs; the bottom half looks at the P frames from the non-I GoPs. Figure 11 suggests the linear behavior clearly illustrated by Figure 12. The key point for the Figure 12 I frames is how similar the results are for all four GoPs: regardless of how many I frames were used in a sequence, the deblocking time per kilobit remains constant. The B frames, although not shown, also follow the P-frame pattern of increasing complexity with rate. The CAVLC plots, also not shown, mirror the figures shown. This result is expected, since entropy coding should not have any effect on the decoder outside of entropy reading.

Presented in Figure 13 are the complexity-rate results for read macroblock. Comparing Figure 13 with Figure 10, it is clear that much of the decoding complexity gain comes from the ReadMB, or parsing, portion of the decoder. The decoder increase is 14 ms while the parsing increase is 15 ms; for the P-GoP the figures are 6 ms vs. 4 ms, and for the B-GoP 3.5 ms vs. 4 ms. The difference is even more noticeable in Figure 14, which shows the CAVLC ReadMB timings over the 0-7 kilobits-per-frame range. For the four GoPs (I, P, B, T), the total time increases are 12 ms, 6 ms, 7 ms, and 7 ms, and the corresponding ReadMB increases are 6 ms, 10 ms, 11 ms, and 11 ms. Since the deblock frame probe decreases with increased rate for the I-GoP, the I-GoP total difference is somewhat less than the ReadMB increase. Likewise, since Deblock Frame increases with rate for P, B, and T, the difference is somewhat larger than purely the parsing increase. In each case, however, parsing accounts for over half of the net increase in complexity.

Summary

The timing analysis of the Akiyo decoding rate-complexity suggests several trends. First, many sections of the decoding process are rate invariant: frame reading, initialization, other MB decoding, post-frame decode, overhead, and unknown time. (Post-frame decoding does, however, have a very specific GoP-based complexity.) Second, parsing and deblocking complexity are both linked to rate, but parsing has an almost linear correlation to bitrate. The parsing complexity increase corresponds strongly to the net complexity increase in the decoder. Finally, the analysis suggests, as expected, that entropy coding has no effect on the decoding complexity outside of the parsing process.

Akiyo Tables and Figures

Table 6: Akiyo Rate Invariant Probes (15 fps, CE)
                       Time range (ms)         Variance (ms)   Average (ms)
                       Fastest    Slowest
CABAC:
  Read New Frame       1.1        2.0           0.9             1.6
  Overhead             0.25       0.63          0.38            0.45
  Unknown              0.01       0.18          0.17            0.1
  Decode Frame Init    3.87       4.1           0.23            3.97
  Process MB Other     2.9        3.4           0.5             3.1
  Post Frame Decode    within +/- 0.1 ms of average times I: 2.4, P: 2.5, B: 1.3, T: 0.8
CAVLC:
  Read New Frame       1.0        2.7           1.7             1.5
  Overhead             0.2        0.6           0.4             0.4
  Unknown              0.01       0.21          0.2             0.125
  Decode Frame Init    3.67       3.87          0.2             3.75
  Process MB Other     2.88       3.55          0.77            3.30
  Post Frame Decode    within +/- 0.1 ms of average times I: 2.4, P: 2.5, B: 1.3, T: 0.8

Figure 10: Akiyo Rate-Complexity (15 FPS, CE). [Time per frame (ms) versus kilobytes per frame for the I, P, B, and T GoPs; full range and 0-1 Kbytes/frame detail.]

Figure 11: Akiyo Deblock Rate-Complexity (15 FPS, CE).

Figure 12: Akiyo Deblock Rate-Complexity by Picture Type (15 FPS, CE).

Figure 13: Akiyo ReadMB Rate-Complexity (15 FPS, CE, CABAC).

Figure 14: Akiyo ReadMB Rate-Complexity (15 FPS, CE, CAVLC).

Figure 15: Akiyo ProcessMB exl. ReadMB Rate-Complexity (15 FPS, CE, CABAC).

Foreman

Foreman, despite being a more complex sequence, exhibits the same probe invariants. Table 7 shows the same rate-invariant probes noted for the Akiyo sequence. The range, variance, and average times are almost unchanged when compared to the Akiyo table. This further suggests that deblocking, decoding, and parsing contribute most of the complexity change in the decoder.

Figure 16 presents the rate-complexity plots for the Foreman sequence, once again with a magnified look at the P, B, and T GoPs. For Akiyo with these three GoPs, the 45 dB distortion point was at about 7 Kb per frame; with Foreman it is between 25 and 30 Kb. The figure also emphasizes the sharp drop in complexity for the B and T GoPs at low bitrates. This drop will be discussed further when the ProcessMB decoding probe is examined.

Figure 17 shows the deblocking rate-complexity for the CABAC sequences. The CAVLC plots are almost identical, as was true for the Akiyo sequence. Deblocking remains a fairly rate-independent task, although it is clearly rate dependent to a small degree. Compared to the Akiyo sequence, the I-GoP rates are within a quarter of a second of each other. However, for the same rate, the other three GoPs are roughly 1 ms more complex for the Foreman sequence. This difference suggests that deblocking is slightly sequence dependent. Figure 18 presents the deblocking complexity by picture type for the I and P frames. As with Akiyo, the I frames take the same time to deblock regardless of the sequence type they are used in. The P frames, however, vary in deblocking time by GoP, especially at lower rates. In addition, while the Akiyo I-frame deblock rate is roughly equivalent to the Foreman one, the Foreman P frames are more complex to deblock, regardless of the GoP.

Figure 19 presents the parsing complexities for the Foreman sequence. Again, the parsing complexity is roughly linear, although the slope depends on the GoP structure.
The I-frame parsing rates are exactly the same as the Akiyo rates: all of the points from both sequences fall on a single line for each GoP. The P and B frames reflect the same linear trend, although the Foreman sequence is about 0.5 ms (10%) more complex for both frame types. As with Akiyo, Figure 19 clearly shows the linear relationship between rate and complexity. Figure 20 shows the same linear pattern for the CAVLC parsing speed.

As with Akiyo, the parsing complexity increase represents a significant contribution to the overall complexity increase. For example, the CABAC I-GoP increases in complexity by 24 ms, with the parsing increase contributing 28 ms to that time; other decreases contribute the remaining -4 ms. For the CAVLC P-GoP, the complexity increase is 35 ms, and the parsing contribution is over 30 ms. With the more complicated Foreman sequence, less compression, and longer frames, the proportion of time spent parsing grows. With Akiyo, the largest parsing change was on the order of 10 ms; with Foreman it is four times that number.

Figure 21 shows that the non-parsing complexity is steady, with the exception of a drastic complexity shift for the B and T GoPs between 0 and 5 Kb per frame. This shift explains the unusual drop in overall complexity in the Foreman results, although the coding mechanism change responsible for the drop is unknown. It is worth noting that the actual decoding times, excluding parsing, are the same between the Foreman and Akiyo I and P-GoPs. The Foreman B and T-GoPs actually have lower non-parsing decode time than Akiyo. One possibility for this difference is that Akiyo, with higher compressibility, can make better use of motion vectors, which then take longer to decode; the weaker Foreman compression may simply decode more quickly. However, even excluding parsing, the I-GoP still decodes twice as fast as the P-GoP, and 2.5 times faster than the B and T GoPs.

Summary

The Foreman sequence, with a broader bitrate range than the Akiyo sequence, maintains very similar baseline decoding values aside from parsing. Despite being a more motion-intensive sequence, the Foreman B and T GoPs actually decoded slightly faster than Akiyo when parsing time was ignored. The rate-invariant sections carried over from the Akiyo sequence. Parsing contributed most of the complexity gain, and this is clearer than for the Akiyo sequence because of the wider bitrate range.
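The thesis characterizes the near-linear relationship between parsing time and bitrate visually from the plots. If one wanted to put a number on the per-GoP slope (ms of parsing per kilobyte), a simple least-squares fit over the per-QP measurements would suffice; the sketch below is such an illustration under an assumed data layout and is not part of the thesis methodology.

    #include <vector>
    #include <utility>

    // Least-squares fit of parse time (ms/frame) against rate (kilobytes/frame),
    // one sample per QP setting of a given sequence/GoP/entropy combination.
    // Returns {slope in ms per kilobyte, intercept in ms}.
    std::pair<double, double> fitParseTimeVsRate(
            const std::vector<std::pair<double, double>>& samples)  // {kbPerFrame, msPerFrame}
    {
        double n = static_cast<double>(samples.size());
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (const auto& s : samples) {
            sx  += s.first;
            sy  += s.second;
            sxx += s.first * s.first;
            sxy += s.first * s.second;
        }
        double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        double intercept = (sy - slope * sx) / n;
        return {slope, intercept};
    }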
Foreman Tables and Figures

Foreman CE            Time Range (ms)         Time Variance   Average Time
                      Fastest     Slowest     (ms)            (ms)
CABAC:
  Read New Frame      1.1         2.1         1.0             1.6
  Overhead            0.24        0.62        0.38            0.45
  Unknown             0.01        0.23        0.22            0.1
  Decode Frame Init   3.87        4.15        0.28            3.97
  Process MB Other    2.9         3.75        0.85            3.2
  Post Frame Decode   +/- 0.1 ms from average times I: 2.4; P: 2.5; B: 1.3; T: 0.8
CAVLC:
  Read New Frame      1.2         2.1         0.9             1.7
  Overhead            0.2         0.63        0.43            0.45
  Unknown             0.02        0.2         0.18            0.12
  Decode Frame Init   3.67        3.85        0.18            3.75
  Process MB Other    3.1         4.9         1.8             3.80
  Post Frame Decode   +/- 0.1 ms from average times I: 2.3; P: 2.4; B: 1.2; T: 0.8

Table 7: Foreman Rate Invariant Probes (15 fps, CE)

Figure 16: Foreman Rate-Complexity (15 FPS, CE)

Figure 17: Foreman Deblock Rate-Complexity (15 FPS, CE)

Figure 18: Foreman Deblock Rate-Complexity by Picture Type (15 FPS, CE)

Figure 19: Foreman ReadMB Rate-Complexity (15 FPS, CE, CABAC)

Figure 20: Foreman ReadMB Rate-Complexity (15 FPS, CE, CAVLC)

Figure 21: Foreman ProcessMB exl. ReadMB Rate-Complexity (15 FPS, CE, CABAC)

Mobile

This section breaks down the rate-complexity response of the decoder for the Mobile sequence. It deals first with the time-invariant probes, then examines deblocking, parsing, and decoding excluding parsing. As before, the overall rate-complexity is shown in Figure 22.
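Each of the timing numbers in this chapter comes from a probe wrapped around a single decoder stage and reported per frame. As a rough sketch of what such a probe amounts to, the fragment below times a hypothetical stage with the standard C clock() call and averages it over the frames of a run; the stage name, frame count, and the use of clock() are illustrative assumptions rather than the instrumentation actually added to the ported decoder.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for one decoder stage (e.g., deblocking a frame). */
static void deblock_frame(void)
{
    /* ... decoder work would happen here ... */
}

int main(void)
{
    const int num_frames = 150;           /* e.g., 10 s of video at 15 fps */
    double total_ms = 0.0;
    int i;

    for (i = 0; i < num_frames; i++) {
        clock_t start = clock();           /* probe start */
        deblock_frame();                   /* stage being measured */
        clock_t end = clock();             /* probe end */
        total_ms += 1000.0 * (double)(end - start) / CLOCKS_PER_SEC;
    }

    /* Per-frame figure comparable to the ms/frame axes in the figures. */
    printf("deblock: %.2f ms/frame over %d frames\n",
           total_ms / num_frames, num_frames);
    return 0;
}
```

On a real device, a higher-resolution timer would likely be needed, since clock() granularity can be coarse relative to per-stage times of a few milliseconds.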
Table 8 presents the same time-invariant probes shown for the Akiyo and Foreman sequences. Examining this table against the previous tables (Table 6 and Table 7), it is clear that even with a highly complex sequence these measures remain highly stable. One exception, the read-new-frame probe, slows down for Mobile at high rates. Excluding the Mobile I-GoP, the highest value for read new frame is 2.3 ms (for CABAC). Even this increase over the Foreman results occurs at the higher bitrates of the Mobile sequences. This again suggests that reading the new frame information is linked to the bitrate, but not in a significant way. The other notable increase, CAVLC ProcessMB Other, is caused by higher times for the B and T GoPs at high bitrates. Even here, however, the total increase is slight (0.6 ms over Foreman) despite a doubling of the bitrate. (The 45 dB B and T Foreman sequences took around 30 Kbytes per frame; the Mobile sequences around 70 Kbytes.)

The deblocking rate-complexities are presented in Figure 23, and the specific complexity for the I and P frames is shown in Figure 24. As before, the deblocking complexity for the I and P frames is GoP independent, although individual experiment variation can be seen. Figure 23 shows the deblocking complexities following a similar pattern to the Foreman series. The P, B, and T GoPs start out with low complexity and gradually increase until the rate reaches about 30 Kbytes per frame. For higher rates (beyond the highest Foreman rate), the deblocking complexities drop slightly. The I-GoP starts at a high complexity, roughly the same complexity as the rate-equivalent Foreman, and continues dropping as the rate increases. This complexity response is similar to the previous two sequences and seems to be more closely related to bitrate than to sequence complexity.

The parsing complexity graphs are presented in Figure 25 for CABAC (compare against Figure 22) and in Figure 26 for CAVLC, which includes the overall time as well as the parsing time. Here the relation between parsing and overall complexity can be clearly seen. For example, the net increase in complexity for the CABAC I-GoP is roughly 45 ms; the parsing increase is also roughly 45 ms. For the CAVLC P-GoP, the net increase is 90 ms, with a parsing increase of over 75 ms. The contribution of parsing to the overall complexity increase is further highlighted by Figure 27, which shows a very steady complexity for non-parsing decoding. The exception is a slight complexity increase over the 0-10 Kbytes per frame portion of the B and T GoPs. Interestingly, the B and T non-parsing complexity is slightly less than the Akiyo complexity while slightly more than the Foreman GoPs. All three sequences show extremely close decoding complexities for the I and P GoPs once parsing time is excluded.

Summary

Overall, the Mobile sequence continues the pattern of parsing being a significant factor in the decoding complexity, especially with this implementation of the decoder. Because the Mobile sequence achieves poorer compression than the Akiyo or Foreman sequences, it spans a broader set of bitrates, which highlights the complexity dependence on rate. Deblocking has a small correlation to rate, as do a handful of other portions of the decoder. Aside from parsing, the frame decoding complexities are very similar to the Akiyo and Foreman frames, suggesting that sequence complexity does not significantly impact the non-parsing complexity.

Mobile Tables and Figures
Mobile CE             Time Range (ms)         Time Variance   Average Time
                      Fastest     Slowest     (ms)            (ms)
CABAC:
  Read New Frame      1.5         3.5         2.0             2.0
  Overhead            0.21        0.60        0.39            0.45
  Unknown             0.01        0.23        0.22            0.1
  Decode Frame Init   3.88        4.13        0.25            3.97
  Process MB Other    2.93        3.90        0.97            3.3
  Post Frame Decode   +/- 0.1 ms from average times I: 2.4; P: 2.5; B: 1.3; T: 0.8
CAVLC:
  Read New Frame      1.5         3.5         2.0             2.3
  Overhead            0.25        0.63        0.38            0.43
  Unknown             0.02        0.22        0.20            0.11
  Decode Frame Init   3.66        3.90        0.24            3.76
  Process MB Other    3.4         5.5         2.1             4.0
  Post Frame Decode   +/- 0.1 ms from average times I: 2.3; P: 2.4; B: 1.2; T: 0.8

Table 8: Mobile Rate Invariant Probes (15 fps, CE)

Figure 22: Mobile Rate-Complexity (15 FPS, CE)

Figure 23: Mobile Deblock Rate-Complexity (15 FPS, CE)

Figure 24: Mobile Deblock Rate-Complexity by Picture Type (15 FPS, CE)

Figure 25: Mobile ReadMB Rate-Complexity (15 FPS, CE, CABAC)

Figure 26: Mobile ReadMB Rate-Complexity (15 FPS, CE, CAVLC)

Figure 27: Mobile ProcessMB exl. ReadMB Rate-Complexity (15 FPS, CE, CABAC)

Conclusions

An overview of the relative complexities of decoding macroblocks compared to parsing rates for this implementation can be seen in Table 9 (the same table shown in the introduction). The rates were found from the slope of the line between the 25 dB and 45 dB complexity points for each category.
The range represents the different sequences, and numbers are given for each picture type, not group of picture structure. That is, for the Mobile P frames, P frames were combined from the Mobile P, B, and T GoP structures. As can be seen from the figures presented earlier in this chapter, the linear assumption is not entirely accurate. For example, the complexity per bit tends to decrease at high bitrates. However, linearity is a close approximation and provides a practical summary of the data.

The table shows very consistent rates for decoding I and P-frames, with more variance in the time for B-frame decoding. The I-frames clearly decode the fastest, followed by P frames; B-frames follow a distant third. The same relationship holds true for the parsing rates as well. The table also highlights the poor performance of CAVLC parsing in this implementation.

The comparative performance boost from the Pentium IV is shown as well. The clock rate of the Pentium is roughly three times greater than that of the iPAQ, and the speed multiplier gives a relative sense of performance increase, not a direct comparison of capability. If the complexity relationship were linear, the multiplier would be roughly the same for each category. Instead, I-frames decode about four times faster, P-frames around five times faster, and B-frames a bit above six times faster. The CABAC parsing, in contrast, is fairly consistent, running four and a half to five times faster on the PC. The CAVLC parsing, on the other hand, ranges from the roughly four times faster seen for CABAC to something like ten times faster for B frames. This range in speedup highlights that the Pentium's optimizations are not applied uniformly across the decoding process, but boost specific sections of the decoder much more than others.

Frame Type          iPAQ 400 MHz      Pentium IV PC, 1.5 GHz    Speedup (Times Faster)
                                                                Low End     High End
Decoding, exl. Parsing (ms/frame)
  I                 11.5-13.5         2.75-3.10                 4.18        4.35
  P                 21.5-22.5         3.95-4.45                 5.44        5.06
  B                 33.0-48.5         5.40-7.10                 6.11        6.83
Parsing (bytes/ms), CABAC Coding
  I                 250-280           1100-1300                 4.40        4.60
  P                 160-250           720-1100                  4.50        4.40
  B                 80-210            450-1000                  5.63        4.76
Parsing (bytes/ms), CAVLC Coding
  I                 135-245           600-1130                  4.44        4.61
  P                 50-100            380-520                   7.60        5.20
  B                 30-80             360-500                   12.00       6.25

Table 9: Complexity Summary over 25-45 dB range

The timing analysis of the three separate sequences based upon complexity and bitrate suggests that much of the decoding process is both rate and sequence independent. Improvements to these sections are likely to improve any video sequence equally. Two sections, deblocking and parsing, show a correlation between complexity and bitrate. The deblocking correlation with rate appears extremely limited, with the I-frames deblocking faster as more bits per frame are used. The other frames experience a slight complexity increase of roughly 2 ms over the 0 to 60 Kbytes per frame range. However, this increase is relatively insignificant compared to the parsing complexity increases. I-frames are three to four times slower to deblock than other frames at high distortion; at low distortion, all the GoPs take about the same amount of time to deblock. The parsing complexity, while it varies by frame type and, to a smaller degree, by sequence, overall shows a strong linear correlation to bitrate. For poorly designed parsing code, this portion of the complexity easily dominates the decoding complexity.
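Two of the calculations behind Table 9 are simple enough to state directly: each parsing rate is the slope of a line through the two operating points, and each speedup is the ratio of the slower platform's figure to the faster one's. The sketch below illustrates both; the operating-point numbers passed to parsing_rate_bytes_per_ms are made up for illustration, while the 11.5 ms and 2.75 ms decode times are taken from the I-frame decoding row of Table 9. The function name is an illustrative assumption, not code from the decoder or analysis scripts.

```c
#include <stdio.h>

/* Rate found as the slope between two operating points, as described in the
 * text: bytes parsed per millisecond between the 25 dB and 45 dB settings. */
static double parsing_rate_bytes_per_ms(double bytes_low, double ms_low,
                                         double bytes_high, double ms_high)
{
    return (bytes_high - bytes_low) / (ms_high - ms_low);
}

int main(void)
{
    /* Illustrative 25 dB and 45 dB operating points for one picture type. */
    double rate = parsing_rate_bytes_per_ms(500.0, 4.0, 7000.0, 30.0);

    /* Speedup columns of Table 9: ratio of iPAQ to Pentium IV decode time,
     * using the I-frame low-end figures from the table (11.5 vs 2.75 ms). */
    double speedup_low = 11.5 / 2.75;   /* = 4.18x */

    printf("parsing rate: %.1f bytes/ms\n", rate);
    printf("I-frame decode speedup (low end): %.2fx\n", speedup_low);
    return 0;
}
```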
Comparing the three sequences strongly suggests that bitrate, not sequence complexity, is the primary factor in decoding time. Aside from parsing, the baseline decoding complexity is extremely level, with the P-GoP being about twice as complex as the I-GoP, and the B and T GoPs being about three times more complex than the I-GoP. This complexity difference suggests that, in addition to parsing improvements, significant room for improvement exists in the motion prediction implementation.

CHAPTER 5

CONCLUSION

Summary

The major contributions of this research include highlighting the performance difference between computers and mobile devices; emphasizing that parsing, especially for poorly compressed sequences, can be a controlling complexity factor; and providing a framework against which to compare other handheld performance.

The complexity response between PCs and handhelds, as discussed in Chapter 3, is significant, with complexity ratios varying widely with both the entropy coding mode and the picture type. The implication is that the complexity models researched and published for desktop computers, especially Pentiums, are not entirely transferable to handheld devices. The handheld device architecture is sufficiently different that, while algorithm optimizations are unlikely to hurt performance, equivalent improvement should not be assumed. Additionally, separate handheld algorithm improvements may be possible which significantly impact the handheld devices without corresponding improvement for the PCs. One example of this type of change is the memory management change, which removed approximately one-third of the handheld decoding time but only an estimated two percent of the PC's. Additional professional tools are needed to profile decoder performance on handheld devices.

The primary complexity-controlling factor for both entropy coding methods was the read macroblock, or parsing, portion of the code. Despite the lack of research into entropy coding complexity constraints, practical, near-optimal implementations for a variety of platforms, aside from hardware additions, are critical to prevent parsing from becoming the controlling aspect of the sequence complexity. Despite the focus on traditionally complex aspects such as transforms and deblocking, which indeed remain problematic, parsing requires attention. This attention is especially needed if handhelds are to take advantage of possessing a higher bandwidth-to-computation-power ratio.

In general, the P-GoP provides a balance between the high bitrate of a mostly I-frame sequence and the high complexity of B-frames due to the double motion vector. The P-GoP provides a high level of compression, significantly reducing the bitrate without the added overhead of a second motion vector. When parsing is ignored, the decoding complexity of the P-frame is significantly higher than that of the I-frame, and a more tightly optimized entropy decoder might reduce the I-frame parsing cost to the point where I-frames become the less complex option overall. Significant work on reducing the complexity of motion compensation would be needed, however, before B frames become a viable option.

Future Directions

The research into complexity-distortion suggests that additional research on several fronts would provide further valuable insight into video complexity optimization for handheld devices.

One specific research direction is the entropy symbol parsing complexity of non-hardware-specific solutions.
While hardware solutions capable of processing on the order of 50 to 100 million symbols per second have been introduced, equipping handheld devices with dedicated entropy decoding processors may not be a practical solution.

There is a clear need for better profilers to do architectural analysis of handheld devices. While professional solutions such as Intel's VTune exist for PCs, equivalent-grade tools do not exist for handheld devices. Despite the problems in implementing such tools, the ability to easily probe specific functions and optimizations in terms of complexity would be a useful step toward real-time video optimization.

A third research aspect is additional investigation into architecture-specific behavior and ways of utilizing the operating system and architectural features to further reduce complexity. While this research ignored operating system interaction, that interaction may not be negligible. Additionally, while neither of the two compilers tested showed superior performance, the optimization quality of handheld compilers is likely to be lower than that of their PC counterparts. The lack of a complexity difference between the Intel and Microsoft compilers further suggests such improvements may be possible.