This is to certify that the thesis entitled ROBUSTNESS OF TEC SPEECH WATERMARKING TO CROPPING AND ADDITIVE NOISE presented by Aparna Gurijala has been accepted towards fulfillment of the requirements for Master's degree in Electrical Eng

ROBUSTNESS OF TEC SPEECH WATERMARKING TO CROPPING AND ADDITIVE NOISE

By

Aparna Gurijala

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE

Department of Electrical and Computer Engineering

2001

ABSTRACT

The widespread use of the Internet has created a need for technologies for the protection of copyrighted digital information. Digital watermarking is one such technology in which a preferably imperceptible signal (watermark) is embedded into a copyrighted host signal. Digital watermarks are prone to a wide range of “attacks” and other forms of distortion. In this work, the robustness of a new watermarking method based on transform encryption coding (TEC) to cropping and additive noise is investigated. Experiments were conducted to test the robustness of TEC speech watermarking to additive noise under different conditions, including different SNRs and watermark masking algorithm parameters. Although a cropping attack is easy to implement, the resulting desynchronization severely hinders watermark detection and recovery.
A dynamic programming (DP) based algorithm for the detection of cropped speech samples and reconstruction of the cropped stego-signal to enable watermark recovery has been developed. Implementation details of the DP algorithm and performance under different environmental conditions are presented. Factors influencing the robustness of TEC speech watermarking are analyzed.

ACKNOWLEDGMENTS

I would like to acknowledge Dr. J.R. Deller, my advisor, for his invaluable guidance, encouragement and support. Special thanks to Dr. Deller for his very helpful remarks and suggestions that greatly contributed to my learning and understanding. Special thanks to Dr. Seadle and Dr. Radha for their consideration, patience and effort. The time spent by Dr. Deller, Dr. Seadle and Dr. Radha to ensure the completion of my thesis is truly appreciated. Personally, I would like to thank my parents for their love and encouragement. My thanks to all my friends for their kindness and help.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
CHAPTER 1 INTRODUCTION
    Watermarking for the National Gallery of the Spoken Word
    A typical watermarking system
    Properties of digital watermarks
    Classification of watermarking techniques
    Attacks on watermarking systems
    Document overview
CHAPTER 2 DIGITAL WATERMARKING OF SPEECH USING TEC
    Watermarking algorithm
    Correlation detector
    Security and robustness
CHAPTER 3 ROBUSTNESS STUDY
    Additive noise
    Cropping
    Algorithm for watermark recovery from cropped speech
    Memory and computational requirements
    Cropping in the presence of additive noise
    Counterfeit attacks
CHAPTER 4 IMPLEMENTATION DETAILS, RESULTS AND CONCLUSIONS
    Robustness testing engine
    Robustness to additive noise
    Robustness to cropping
    Implementation details of the modified DP algorithm
    Experimental results
    Robustness to cropping in the presence of noise
    Conclusions
CHAPTER 5 FUTURE WORK
REFERENCES

LIST OF TABLES

1. Quality rating
2. Robustness to Gaussian noise (constant gain factor)
3. Robustness to Gaussian noise (adaptive gain factor)
4. Robustness to uniformly distributed noise (constant gain factor)
5. Robustness to uniformly distributed noise (adaptive gain factor)
6. Robustness to cropping and additive noise (adaptive gain factor)

LIST OF FIGURES

1. A typical watermarking system
2. Watermarking process
3. Watermark recovery
4. Encryption using quasi m-arrays
5. Watermarking selectively to watermarking the entire record
6. Encryption and decryption processes
7. Noise amplitude distribution
8. Cropping in speech and images
9. Dynamic programming approach to recovering cropped speech samples
10. Robustness of TEC watermarking to Gaussian noise (constant gain factor)
11. Robustness of TEC watermarking to Gaussian noise (adaptive gain factor)
12. Robustness of TEC watermarking to uniformly distributed noise (constant gain factor)
13. Modified implementation of DP algorithm
14. DP algorithm for watermark recovery

Chapter 1 INTRODUCTION

1.1 Watermarking for the National Gallery of the Spoken Word

The National Gallery of the Spoken Word (NGSW) [1] project is creating an online database of spoken word collections, spanning the 20th century. These collections are mainly drawn from Michigan State University’s Vincent Voice Library, MSU Museum, Chicago Historical Society and Northwestern University. They range from Thomas Edison’s first cylinder recordings to the voices of Theodore Roosevelt, Florence Nightingale, and Babe Ruth. The aural resources for the NGSW are in digital form. Representation of information in digital form has many properties that make it preferable to analog forms. An unlimited number of digital copies can be made with ease and accuracy. This benefit, however, has been a cause of concern for intellectual property owners and content providers.
The widespread use of the Internet, coupled with developments in compression techniques, facilitates fast and efficient distribution of digital content. However, unauthorized distribution of copyrighted digital information, which is just as easy, threatens intellectual property rights. Copyright laws protecting analog information are inapplicable to digital information. As a result, there is a need to develop techniques for protecting the ownership of digital content and for tracking intellectual piracy. Digital watermarking is one such technique. Digital watermarking is the process of embedding a permanent and preferably imperceptible signal into a copyrighted host signal. The embedded signal may typically convey information about the owner, author or carrier. More information about the need for watermarking in the NGSW project is found in [7].

The concept of watermarking has its origins in the ancient Greek technique of steganography or “covered writing”, interpreted as hiding information in other information. Detailed information on the history of steganography and watermarking is found in [8]. Applications of digital watermarking include copyright protection, fingerprinting, authentication, copy control, owner identification, broadcast monitoring, security control and tamper proofing. Watermarking can be used to protect virtually any form of digital information including images, speech, music, and video. Most digital watermarking schemes have been developed for images. Audio watermarking schemes include the method due to Boney et al. [9] in which the watermark is generated by filtering a PN-sequence with a filter that approximates the frequency masking characteristics of the human auditory system, and then accounting for temporal masking. Bassia and Pitas [10] developed an audio watermarking method that modifies the temporal characteristics of the audio signal in accordance with a seed (watermark key) known only to the copyright owner.
In [11] an audio watermarking technique operating in the Fourier domain is presented. Bender et al. [12] use homomorphic signal processing techniques to place information imperceptibly into audio streams by the introduction of closely spaced echoes. Luy et al. [13] proposed a multi-purpose audio watermarking scheme that embeds two complementary watermarks: one for audio authentication and the other for the detection of tampered regions. The spread spectrum watermarking technique developed by Cox et al. [14] can be applied to audio, image, video and multimedia data. This paper is concerned with the robustness of the digital speech watermarking technique employing transform encryption coding (TEC) [2, 3].

1.2 A typical watermarking system

A typical watermarking system consists of a watermark generator, an embedder, a watermark detector and possibly a component that distorts the stego-signal (defined below).

Figure 1. A typical watermarking system. (Block diagram: a watermark generator with signal and key inputs; a watermark embedder with cover-signal and stego-key inputs, producing the stego-signal; a distortion introducer producing the distorted stego-signal; and a watermark detection or recovery component whose output is the recovered watermark or a yes/no decision. The key marked * at the detection or recovery component is the same as the one used for watermark generation.)

Due to the wide variations in watermarking techniques, it is difficult to generalize and characterize a “typical” watermarking scheme. To account for these variations, certain inputs are indicated by dotted lines in Figure 1, meaning that they may not be present in all techniques. A signal for which copyright protection must be provided is called a cover-signal. A watermark is a signal that is embedded into the cover-signal for this purpose in accordance with the stego-key (see footnote 1).
The stego-key ensures the imperceptibility of the watermark and thus introduces additional protection, by making the watermark location unknown. A watermark may take different forms: an encrypted or modulated speech sequence or image, least significant bit manipulations, or a pseudo-random sequence. As a result, the inputs to watermark generators are highly diverse. For example, in the audio watermarking technique proposed by Bassia and Pitas [10] the input signal is the cover-signal itself, and the key is a randomly generated constant. In the spread spectrum watermarking scheme of Cox et al. [14], the input signal is the same as the key and comprises a pseudo-random sequence. Watermark embedding techniques may be additive, multiplicative or quantization-based [15] and may operate in the space or time domain, or in some transform domain. The output of a watermark embedder is the stego-signal. The stego-signal should be perceptibly similar to the cover-signal, in spite of the presence of the watermark.

Watermark detectors are classified as type I or type II. Type I detectors require knowledge of the cover-signal to extract the watermark from the stego-signal. Type II detectors provide a yes or no answer to the question of whether the watermark is present in a distorted stego-signal. In the motivating application for this work, the TEC speech watermarking system employs a type I detector.

Typically the term “watermark” is used to refer to the processed (modulated, encrypted, etc.) form of the original signal to be embedded in the cover-signal as indicated in Figure 1. However, in this document, “watermark” will refer to the unprocessed watermark signal. The result of the processing step will be called the encrypted watermark.

1 In the TEC speech watermarking technique the stego-key is the constant or adaptive gain factor of the masking algorithm [2].
1.3 Properties of digital watermarks

Some essential properties of watermarks are as follows:

• Perceptual transparency: Inserting a watermark into the host or cover-signal will alter the cover-signal in some way. If the amount of alteration does not introduce any perceptual degradation then the watermark is said to be perceptually transparent [16, 17]. This ensures that the value of the original material is not reduced by the presence of the watermark.

• Robustness: Robustness refers to the degree to which a watermark can survive an “attack” or distortion. An attack is a deliberate attempt to remove the watermark or hinder its recovery. It should not be possible to destroy the watermark without simultaneously destroying the cover-signal. A successful attack is one that removes the watermark or obstructs the recovery process without causing perceptual degradation of the cover-signal [16].

• Unambiguity: A recovered watermark should unambiguously identify the owner of the watermarked material.

• Security: Encryption keys, if any, used in the watermarking process, and keys used for watermark generation, should be very difficult to predict, guess, or otherwise ascertain.

Another important property is the watermark bit rate [17]. This is determined by the amount of information contained in the watermark (watermark payload) and the amount of data needed to embed one unit of watermark information (watermark granularity) while ensuring perceptual transparency. For greater robustness it is desirable to have stronger components of the watermark in the stego-signal. This in turn will affect the perceptual transparency of the signal containing the watermark (stego-signal). Thus there are trade-offs among the various watermark properties that must be considered in light of the requirements of a particular watermarking application. Further, for an application like fragile watermarking [19] robustness is not desirable.
In such a case, fragile watermarks that get destroyed by some or all of the transformations are used. The degree of adherence to the ideal properties is dictated by the requirements of the particular application and the availability of resources. More information is found in [16]-[19].

1.4 Classification of watermarking techniques

Watermarking techniques are classified according to the domain in which the watermark is inserted, the requirements of the watermark detection process, or the availability of the keys.

Watermarking schemes are categorized as restricted- or unrestricted-key watermarking schemes based on the relative availability of the key(s) [20]. Schemes in which the keys are available to all the watermark detectors are called unrestricted-key schemes. In the case of restricted-key schemes, the knowledge of keys is confined to a small number of detectors. The TEC-based speech watermarking scheme is a restricted-key scheme. Though such a categorization appears to be mainly based on a difference in usage, the complexity and suitability of a watermarking algorithm differ between the two cases.

Schemes that require knowledge of the cover-signal to recover the watermark are said to be non-oblivious [21]-[23]. TEC-based watermarking is non-oblivious. Watermark recovery is effected by subtracting the cover-signal from the stego-signal. Non-oblivious techniques generally yield more robust watermarks. However, non-oblivious watermarking may be more prone to protocol attacks [20]-[22] due to the availability of greater freedom for creating fake cover-signals and hence fake watermarks. For example, a hacker may succeed in developing a suitable fake watermark (say, a pseudo-random pattern). On subtracting it from the stego-signal, to which he or she has access, a fake original can be created. The hacker now claims to be the owner of this original.
Of course, oblivious watermarking schemes are more prone to attacks based on neutralizing the detector readings (if there is access to a detector, as in the case of the DVD copy control problem [20]). The cover-signal is not required during the detection process in oblivious watermarking and may be treated as noise. Oblivious watermarking methods permit faster detection of the watermark and include bit-wise or noise-dependent methods. These methods are sensitive to even small variations of the stego-signal and are thus more fragile [22].

A watermarking strategy is designated as a spatial (time) or transform domain technique according to whether the watermark is embedded into the cover-signal in the signal or the transform domain. In the present application in which audio rather than image data are watermarked, the term “signal,” rather than “spatial,” domain is more appropriate.

If the same key is required in the watermark recovery or detection process as that used for watermark embedding, the scheme is said to be symmetric. The need for asymmetric or public key watermarking arises when the user of the copyrighted information [23] must perform watermark detection. In this case there is a set of two keys: a public key and a private key. The private key is required for watermark embedding and recovery, and is known only to the owner. The public key is given to the users solely for watermark detection. Knowledge of the public key should not provide any information about the private key, and should not compromise the security and robustness of the scheme. A variation on this idea occurs in the TEC strategy. A “public” key is made available to descramble the speech signal, but this process has the effect of further encrypting the watermark rather than detecting it.

1.5 Attacks on watermarking systems

Digital watermarks are prone to a wide range of attacks [15, 20] and other means of distortion.
As mentioned earlier, an attack is an attempt to remove the watermark or preclude its recovery, while ensuring tolerable or no apparent damage to the stego-signal. An attack can also be an attempt to create ambiguity of ownership. Attacks include those due to common signal processing operations like resampling, compression, filtering, D/A conversion, and requantization. Introduction of noise can also affect a watermark. Deliberate manipulations of the content like cropping, rescaling and rotation can severely hinder watermark recovery. By using secure keys, a cryptographic attack like brute-force key search [25] can be thwarted. In the attack by statistical averaging [18, 20, 25], a large number of differently watermarked copies of the same stego-signal may be averaged to get the attacked stego-signal. Collusion attack differs from an attack by statistical averaging in the sense that only portions of the stego-signal and not the entire stego-signal are used to create the attacked stego-signal [25, 26]. Counterfeit attacks [15, 24, 25], including inversion attack, multiple watermarks, and copy attack, attempt to undermine the concept of watermarking itself by producing fake originals or fake watermarked signals. Watermarking an already marked signal (the problem of multiple watermarks) can negate the utility of any watermarking scheme. To counteract these attacks, watermark registration with a trusted authority has been proposed by some parties [21, 23]. Distortion is normally the result of signal processing operations or the presence of noise. Attacks encompass the different types of distortion that may be unintentionally introduced into the stego-signal. Robustness against attacks is a very important aspect of a watermarking scheme. A particular watermarking scheme may not be robust to all forms of attack. 
An attack on the watermark may be directed at either removing the watermark or hindering its recovery while causing tolerable apparent damage to the stego-signal. When the attack hinders watermark recovery, the general remedy is to attempt to identify the attack and to undo the damage. Duric et al. [22] describe a method of recognizing distorted images and recovering watermarks using identification marks, salient features of the image invariant to transformations like cropping, scaling and rotation. In the present paper, watermark recovery from cropped speech is accomplished using a dynamic programming approach. Most watermarking techniques are susceptible to damage caused by cropping because it desynchronizes the watermark detection and recovery process.

1.6 Document overview

Research in the field of watermarking is progressing in different directions. New watermarking techniques are being devised [11, 13, 27], new attacks on watermarking schemes are being identified [15, 18, 20, 25], benchmarks to evaluate the different watermarking schemes are being developed [15, 28], and algorithms for watermark detection and recovery after attacks and other forms of distortion are being developed [22, 24]. This document describes work that was mainly directed at TEC-based watermark detection and recovery after the stego-signal has been subjected to certain attacks. Chapter 2 presents the TEC-based speech watermarking technique. Chapter 3 describes the robustness of this watermarking scheme to different attacks that include additive noise, cropping, or a combination. Algorithms for watermark recovery when subjected to these attacks are described. Chapter 4 is concerned with Matlab implementation, experimental results and performance evaluation under different conditions. Chapter 5 comprises a description of future work.

Chapter 2 DIGITAL WATERMARKING OF SPEECH USING TEC

2.1 Watermarking algorithm

Transform encryption coding (TEC) was originally developed by Kuo et al.
[3] as an algorithm for image compression and efficient and secure transmission. It can be applied to speech and other signals as well. TEC produces independent transform coefficients by passing the signal through an all-pass filter with unity gain. TEC derives its encryption properties, and hence security, from the use of highly random filter coefficients. Typically, quasi m-arrays and Gold code arrays [4] are used to obtain filter coefficients with the desired property of unpredictability. The phase spectrum of the signal to be transformed is scrambled in accordance with the phase spectrum of a quasi m-array or Gold code array [3].

The speech watermarking technique developed by Ruiz et al. [2] employs TEC in conjunction with a masking algorithm for encrypting and watermarking speech. The one-dimensional speech signal is arranged in the form of two-dimensional arrays, each having the same dimensions as the quasi m-arrays used for the TEC process.

Figure 2. Watermarking process: $s = c + T_1^{-1}\{k^{0.5} \times T_2\{w\}\}$, where $c$ is the cover-signal, $w$ the watermark, $s$ the stego-signal, and $T_1$, $T_2$ the TEC operations applied to the cover-signal and the watermark.

The watermarking process involves the application of TEC to both the cover-signal and the watermark. Different quasi m-arrays are used for encrypting the cover-signal and the watermark. The encrypted watermark is subjected to a masking algorithm to ensure perceptual transparency based on the cover to watermark ratio (CWR), defined as

$$\mathrm{CWR}_{dB} = 10 \log_{10} \frac{C[n]}{k[n] \times W[n]} \qquad (1)$$

where $C[n]$ and $W[n]$ are the respective short-term energy measures for the encrypted cover and watermark signals, and $k[n]$ is an adaptive gain factor (stego-key) at time $n$. Alternatively, a constant gain factor $k$ can be used instead of $k[n]$. Since the encryption process involves passing the cover and watermark signals through all-pass filters with unity gain (see Figure 4), the energy content of the encrypted and non-encrypted signals is similar in each case. The encrypted cover-signal and watermark are converted into one-dimensional arrays.
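The relation between the gain factor and the CWR in equation (1) can be illustrated with a short numerical sketch. It is only an illustration under stated assumptions: the short-term energy is taken here as the sum of squared samples over non-overlapping frames, and the function names (`short_term_energy`, `cwr_db`, `adaptive_gain`), the frame length and the toy signals are hypothetical choices, not the masking algorithm of [2].

```python
import math


def short_term_energy(x, frame_len):
    """Sum of squared samples over consecutive non-overlapping frames
    (an assumed energy measure, chosen for illustration)."""
    return [sum(v * v for v in x[i:i + frame_len])
            for i in range(0, len(x) - frame_len + 1, frame_len)]


def cwr_db(cover_energy, wm_energy, k):
    """Equation (1) for one frame; assumes non-zero energies and k > 0."""
    return 10.0 * math.log10(cover_energy / (k * wm_energy))


def adaptive_gain(cover_energy, wm_energy, target_cwr_db):
    """Equation (1) solved for k[n] at a fixed target CWR."""
    return cover_energy / (wm_energy * 10.0 ** (target_cwr_db / 10.0))


# Toy data: a loud frame followed by a quiet frame, and a flat watermark.
cover = [1.0] * 8 + [0.1] * 8
watermark = [0.5] * 16
cover_energy = short_term_energy(cover, 8)
wm_energy = short_term_energy(watermark, 8)

target = 20.0  # desired CWR in dB (hypothetical value)

# Adaptive gain: one k[n] per frame, so the CWR stays at the target.
k_adaptive = [adaptive_gain(c, w, target)
              for c, w in zip(cover_energy, wm_energy)]
cwr_adaptive = [cwr_db(c, w, k)
                for c, w, k in zip(cover_energy, wm_energy, k_adaptive)]

# Constant gain: the k of the first frame reused everywhere, so the CWR
# follows the energy of the cover-signal.
k_const = k_adaptive[0]
cwr_const = [cwr_db(c, w, k_const)
             for c, w in zip(cover_energy, wm_energy)]
```

With these toy values the adaptive gain holds the CWR at 20 dB in both frames, whereas the constant gain gives 20 dB in the loud frame and about 0 dB in the quiet one, which is where a constant-gain watermark would be most audible.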
The encrypted, masked watermark is then added to the encrypted cover-signal to obtain the encrypted stego-signal. Applying the inverse TEC operation to decrypt the cover-signal component of the encrypted stego-signal subjects the watermark to a second level of encryption (see Figure 6).

Figure 3. Watermark recovery: $\hat{w} = k^{-0.5} \times T_2^{-1}\{T_1\{s - c\}\}$.

For watermark recovery, an estimate of the doubly encrypted watermark is obtained by subtracting the cover-signal from the stego-signal,

$$\hat{w}_e = s - c \qquad (2)$$

Finally the inverse TEC operations and the gain factor are applied to the estimated twice-encrypted watermark (Figure 3):

$$\hat{w} = k^{-0.5} \times T_2^{-1}\{T_1\{\hat{w}_e\}\} = k^{-0.5} \times T_2^{-1}\{T_1\{s - c\}\} \qquad (3)$$

Figure 4. Encryption using quasi m-arrays. (a) The original “Lena” image. (b) The original “mandrill” image. (c) Lena encrypted using quasi m-array A (key A). (d) Mandrill encrypted using quasi m-array B (key B). (e) Amplitude distribution of the Lena encrypted using key A. (f) Amplitude distribution of the mandrill encrypted using key B. The amplitude distributions are different for the encrypted versions of the two images. However, they are similar to the amplitude distributions of the respective original images due to the all-pass nature of the encryption process.

The recovery of the watermark is only possible with the knowledge of the two quasi m-arrays (encryption keys) used in the process. The watermark may take many forms including speech samples or images. Part of the future work of this project will be concerned with researching properties that assure quality watermarks.

Figure 5. Watermarking selectively versus watermarking the entire speech record. Each panel shows the cover-signal and the encrypted stego-signal. (a) The entire speech record consisting of 48387 samples is watermarked. All the recovered watermarks are shown. (b) The watermark is embedded in the first 16129 speech samples.
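The embedding rule of Figure 2 and the recovery rule of Figure 3 can be sketched on a single short frame. This is a simplified one-dimensional analogue, not the implementation of [2] or [3]: the thesis arranges speech in two-dimensional arrays and takes the key phases from quasi m-arrays, whereas here the phases come from Python's `random` module and a naive DFT is used. All function names and parameter values are hypothetical.

```python
import cmath
import math
import random


def dft(x, sign=-1):
    """Naive O(N^2) DFT; sign=+1 gives the unnormalized inverse."""
    n = len(x)
    return [sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
                for t in range(n)) for k in range(n)]


def key_phases(n, seed):
    """Hypothetical key: odd-symmetric pseudo-random phases, so that a real
    input stays real. Stands in for the phase spectrum of a quasi m-array."""
    rng = random.Random(seed)
    phases = [0.0] * n
    for k in range(1, (n + 1) // 2):
        phases[k] = rng.uniform(-math.pi, math.pi)
        phases[n - k] = -phases[k]
    return phases


def tec(x, phases, inverse=False):
    """All-pass phase scrambling T{x}, or T^-1{x} when inverse=True.
    Spectral magnitudes are left untouched (unity gain)."""
    s = -1.0 if inverse else 1.0
    spectrum = dft(x)
    scrambled = [spectrum[k] * cmath.exp(1j * s * phases[k])
                 for k in range(len(x))]
    return [v.real / len(x) for v in dft(scrambled, sign=+1)]


def embed(c, w, key1, key2, k):
    """s = c + T1^-1{ k^0.5 * T2{w} } with a constant gain factor k."""
    masked = [math.sqrt(k) * v for v in tec(w, key2)]
    hidden = tec(masked, key1, inverse=True)
    return [ci + hi for ci, hi in zip(c, hidden)]


def recover(s, c, key1, key2, k):
    """w_hat = k^-0.5 * T2^-1{ T1{ s - c } }; needs the cover and both keys."""
    diff = [si - ci for si, ci in zip(s, c)]
    return [v / math.sqrt(k)
            for v in tec(tec(diff, key1), key2, inverse=True)]


n = 16
cover = [math.sin(0.4 * t) for t in range(n)]
watermark = [1.0 if t % 4 < 2 else -1.0 for t in range(n)]
key1, key2 = key_phases(n, seed=1), key_phases(n, seed=2)

stego = embed(cover, watermark, key1, key2, k=0.01)
w_hat = recover(stego, cover, key1, key2, k=0.01)
embedded_energy = sum((s - c) ** 2 for s, c in zip(stego, cover))
```

In the absence of distortion `w_hat` matches `watermark` to within rounding error, and `embedded_energy` equals k times the watermark energy because both transforms are all-pass. Recovery needs the cover-signal and both keys, which is the non-oblivious, restricted-key behaviour described in Chapter 1.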
Although the entire speech record is watermarked in Ruiz’s work [2], only selected frames of speech may be watermarked depending upon the requirements of the application (Figure 5). By watermarking selectively, a degree of unpredictability is introduced about the exact locations of the watermarks. It is preferable to watermark the higher intensity speech regions, since for a given CWR, the watermark intensity will be greater in these regions. As a result, the embedded watermarks will be more robust against certain attacks. The results supporting this are presented in Chapter 4. Watermarking only selected speech regions also reduces the computational complexity, and hence the time required for watermarking.

2.2 Correlation detector

A correlation detector (type II detector, see Section 1.2) can be used to detect the presence of the watermark in the stego-signal when subjected to linear distortion. The detector uses the normalized correlation between the original and possibly distorted watermark recovery signals. The latter are obtained by taking differences between the respective stego-signals and the cover-signal. Such a detector may not appear to be necessary for the TEC strategy as the watermark recovery signal (possibly distorted) may be fed to the recovery process in any case. However, the correlation detector is useful for acquiring quantified information about the presence or absence of the watermark. This information is crucial, for example, when an attack hinders the recovery process. The correct alignment of the two watermark recovery signals (original and possibly distorted) is an essential requirement for the correlation detector to perform correctly.

If $s'$ is a speech signal that differs from the original cover-signal by an added sequence $\eta$,

$$s' = c + \eta \qquad (4)$$

the correlation detector can be used to obtain information about the presence or absence of the watermark in $s'$ as follows. The normalized correlation between $\hat{w}_e$ and $(s' - c)$ is defined as

$$\rho = \frac{\hat{w}_e \cdot (s' - c)}{|\hat{w}_e|\,|s' - c|} \qquad (5)$$

A high value of $\rho$ indicates the presence of the watermark in $s'$. If the distortion has the effect of misaligning the stego-signal and the original, then the samples must be resynchronized before using the correlation detector. Because of this requirement, the correlation detector can be used for studying the effectiveness of the algorithm for watermark recovery from cropped speech as described in the next chapter.

2.3 Security and robustness

A good watermarking technique is one for which the security relies on the key and not on the secrecy of the algorithm. Public knowledge of the watermarking technology must not compromise security. This holds true for TEC-based speech watermarking. Security means that only authorized parties can decode the watermark [26]. It entails unpredictability and non-invertibility [23]. Non-invertibility of the watermarking technique means that, for a modulated or encrypted watermark signal, it is practically impossible to find a fake watermark that can be produced by the same process [23]. TEC speech watermarking derives its security from the quasi m-arrays used for cover-signal and watermark encryption. The recovery of the watermark is only possible with the knowledge of two quasi m-arrays (encryption keys) per frame. Further, encryption ensures secure transmission across the communication channel. It also helps in data access control, i.e., an unauthorized person cannot retrieve the information [23]. Mere encryption without watermarking cannot provide copyright protection, as the data are unprotected and open to content tampering and copyright violation [23]. Hence it is important to combine encryption and watermarking for secure copyright protection.

It is generally recommended [20] that the unmarked original not be publicly released.
Enhanced security is also achieved by embedding the watermarks in random locations of the stego-signal rather than predictably throughout the entire stego-signal. Further, watermarking different signals differently, watermarking copies of the same signal similarly (alternatively, keeping copies of the stego-signal rather than the cover-signal), embedding more than one watermark (preferably with different watermarks and keys) in a particular stego-signal, and using keys of different dimensions can all contribute to security. The use of different keys avoids the obsolescence of the watermark if a set of keys used for watermark recovery were by chance made public knowledge after intentional tampering or copyright violation. Using quasi m-arrays and Gold code arrays (keys) of higher dimension achieves greater encryption security. This is because the number of available quasi m-arrays or Gold code arrays increases with their dimension. Greater security also implies increased computational complexity, so there is a trade-off between increased security and computational burden. TEC’s masking algorithm provides additional protection by using different parameters (stego-keys) in a random fashion while ensuring the imperceptibility of the watermark.

The amount of data, measured in bits, needed to embed one unit of watermark information is termed the watermark granularity [17]. Finer granularity may result in greater robustness against certain attacks. In the case of cropping, for example, spreading the watermark across a large number of cover-signal samples implies greater risk of sample loss. However, finer granularity works against higher key dimension and hence security.

For the robustness of the watermarking scheme, the CWR plays a crucial role. Lower CWR contributes to increased robustness. However, the need for a perceptually transparent watermark places a practical lower bound on the CWR.

Figure 6. Encryption and decryption processes.
(a) Original “mandrill” image. (b) Mandrill encrypted using key B. (c) Decryption of the encrypted mandrill in (b) using key C. If a signal is encrypted using key A (key B), it can be decrypted only using key A (key B). By using a different key for decryption, the mandrill gets encrypted twice. (d) Decryption of the encrypted mandrill in (b) using key B.

The next chapter focuses on the robustness of TEC-based speech watermarking to additive noise, cropping and protocol attacks.

Chapter 3 ROBUSTNESS STUDY

The issue of watermark robustness is introduced in Chapter 1. Robustness is the ability of the watermark to survive attacks and other forms of distortion. A watermarking scheme is said to be robust against a particular attack if watermark detection and recovery are possible. This chapter deals with the robustness of TEC-based speech watermarking to additive noise, cropping and protocol attacks.

The encryption capabilities of TEC ensure secure transmission of the stego-signal across a communication channel and secure storage in an archive. If a hacker tries to intercept (or download) and then attempts to decrypt a transmitted (or stored) stego-signal without the knowledge of the encryption keys, the nominal stego-signal obtained on decryption will be unintelligible. Hence, it is sufficient and necessary to use the unencrypted stego-signal for the robustness study. According to [3], the TEC encrypted signal is insensitive and robust to channel noise. Due to the all-pass nature of the encryption and decryption processes, the noise strength in every sample of the decrypted signal will be small.

The main focus of this chapter is robustness to cropping and additive noise. Cropping results in irretrievable loss of information that causes desynchronization in the watermark detection and recovery processes. As a consequence, the watermark fails to be detected and recovered.
A number of watermarking techniques, especially signal (spatial or time) domain techniques, are vulnerable to the damage caused by cropping. The damage caused by cropping depends more on the watermark embedding rule (for example, an additive rule) or the detection strategy than on the nature of the watermark. In order to tackle cropping, an algorithm for watermark detection and recovery from cropped speech is presented. This algorithm can be applied to any cropped stego-signal, even if it was watermarked using a different watermarking technique.

3.1 Additive noise

Addition of uncorrelated, randomly generated noise is a common attack against a watermarked stego-signal. Techniques in which the watermark takes the form of LSB modifications are especially prone to such an attack or distortion. To study the robustness of TEC speech watermarking to additive noise, independent, uncorrelated and randomly generated noise was added to every sample of the stego-signal. The noise amplitude was either uniformly distributed or Gaussian distributed, as shown in Figure 7. If $s'$ is the noisy stego-signal, then

$$ s' = c + \tilde{w} + \eta \qquad (6) $$

The recovered watermark signal will now be

$$ \tilde{w}' = \tilde{w} + \eta \qquad (7) $$

As implied by equation (6), the robustness of the watermarking technique to additive noise depends upon the watermark to noise power ratio (WNR). The significance of a particular value of the WNR cannot be ascertained independently of the following factors:

i) Stego-signal to noise ratio (SNR), defined as

$$ \mathrm{SNR} = 10 \log_{10} \frac{P_s}{P_\eta} \qquad (8) $$

where $P_s$ and $P_\eta$ are the signal and noise energy, averaged over the entire duration of the speech sequence (that is, the signal and noise power),

$$ P_s = \frac{1}{L} \sum_{n=1}^{L} s^2[n] \qquad (9) $$

$$ P_\eta = \frac{1}{L} \sum_{n=1}^{L} \eta^2[n] \qquad (10) $$

Here $s[n]$ and $\eta[n]$ are samples of the stego-signal and noise at time $n$, and $L$ is the number of samples in the sequence.

ii) Cover-signal to watermark ratio (CWR, footnote 2). The CWR is influenced by whether a constant or an adaptive gain factor is used in the masking algorithm.
If a constant gain factor ($k$ factor) is used, then the CWR varies from sample to sample. On the other hand, when an adaptive gain factor ($k[n]$ factor) is used, the CWR is constant throughout. In this case, robustness will also depend on the temporal placement of the watermark in the cover-signal. Since the intensity of the embedded watermark is adapted to the intensity of the cover-signal, the strength of the watermark will be greater in the higher intensity regions of speech.

Footnote 2. The CWR is defined in (1) as

$$ \mathrm{CWR} = 10 \log_{10} \frac{c^2[n]}{(k[n]\, w[n])^2} $$

Experimental results for different CWRs, SNRs, and watermarks are presented in Chapter 4. To assess the significance of the experimental results, a group of individuals were asked to look at or listen to the results. The squared error ($E$) and normalized correlation ($\rho$) were used in conjunction to provide quantified information. It was inferred from the experimental results that the damage that the watermark survived was sufficient to lower the commercial value of the attacked stego-signal when the embedded watermark was mildly perceptible. A detailed discussion of the results is presented in Chapter 4.

Figure 7. Noise amplitude distribution. (a) Uniform distribution. (b) Gaussian distribution.

3.2 Cropping

Cropping is an attack on the content of the stego-signal wherein samples of the signal are deleted in a random or deterministic manner. About 1 in 50 speech samples may be cropped without introducing any perceptible difference. Cropping may be an intentional attack or an unintentionally introduced distortion. It is extremely easy to implement, but most digital watermarking schemes are vulnerable to the damage caused by it.
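To make the attack concrete, the following is a minimal sketch of random cropping, assuming the signal is held as a plain Python list. The function name `crop_samples`, the seed argument and the toy signal are illustrative choices only; they are not part of the robustness testing engine used in this work.

```python
import random


def crop_samples(signal, n_crop, seed=None):
    """Delete n_crop randomly chosen samples from signal.

    Returns the cropped list and the sorted indices that were deleted.
    """
    if not 0 <= n_crop <= len(signal):
        raise ValueError("n_crop must lie between 0 and len(signal)")
    rng = random.Random(seed)  # seeded so that an experiment can be repeated
    removed = sorted(rng.sample(range(len(signal)), n_crop))
    removed_set = set(removed)
    cropped = [x for i, x in enumerate(signal) if i not in removed_set]
    return cropped, removed


# Toy stand-in for a stego-signal (hypothetical data, not speech).
stego = [0.01 * ((7 * n) % 23 - 11) for n in range(1000)]

# Crop about 1 in 50 samples, the rate quoted in the text.
cropped, removed = crop_samples(stego, len(stego) // 50, seed=1)

# Samples before the first deletion are still aligned with the original;
# every later sample is shifted, which is the desynchronization at issue.
first = removed[0]
aligned_prefix = cropped[:first] == stego[:first]
```

Keeping the deleted indices is only a convenience for checking a recovery algorithm afterwards; an attacker would of course not supply them.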
One method of identifying the attack as cropping makes use of the cross-correlation between the original watermark recovery signal (obtained by taking the difference between an undistorted stego-signal and the cover-signal) and the attacked watermark recovery signal. If samples are indeed cropped from the stego-signal, the normalized cross-correlation continues to decrease sharply as more and more cropped samples are encountered. Cropping desynchronizes the recovery process, making watermark recovery difficult. Hence there is a need for an algorithm to identify the cropped samples and to undo the damage caused by cropping, in order to make watermark recovery possible.

Figure 8. Cropping in images and speech. (a) Original mandrill image. (b) Cropped mandrill image. (c) 1000 samples from the speech “Theodore Roosevelt talks about Wilson and Taft”. (d) A cropped version of the speech in (c). About 1 in 50 samples were cropped. It can be observed that there is greater predictability in the manner in which cropping manifests itself in images than in speech.

Duric et al. [22] make use of registration patterns (invariant features of an image) to recognize and restore images that are subjected to detection-disabling affine transformations. Typically, the registration patterns, also known as identification marks, might be groups of points that exhibit uniqueness. Watermarks are then recovered from the restored images. Such a methodology works for images because of the manner in which cropping manifests itself in images. On cropping, the aspect ratio, shape or resolution of the image is generally affected. Hence, the effects of cropping are more predictable in the case of images (see Figure 8). Due to the random nature of speech, the geometry does not facilitate the derivation of registration patterns that are unique and invariant to transformations.
Even in the case of images, the registration patterns can be exploited or attacked [20] to undermine their functionality. Hence, in order to deal with cropping in speech, a dynamic programming based approach to identify the cropped samples and undo the damage was favored.

3.2.1. Algorithm for watermark recovery from cropped speech

A recovery algorithm is presented which is based on the concept of dynamic programming [5]. An attempt to temporally align the samples of the cropped stego-signal with the original stego-signal using dynamic programming (and hence dynamic time warping (DTW) [5]) will inherently determine the (former) time locations of the cropped samples. Consider the i-j plane (as shown in Figure 9) with the cropped stego-signal (test string) along the i-axis and the stego-signal (reference string) along the j-axis. Determination of the cropped samples is treated as the problem of finding the minimum distance path through the grid. A path is a collection of nodes of the form $(t(i), s(j))$ connecting the original and terminal nodes. Distances or costs are assigned to paths in the form of nodal costs. The cost associated with the node $(t(i), s(j))$ is defined as

$$ d_n(i, j) = (t(i) - s(j))^2 \qquad (11) $$

Figure 9. Dynamic programming approach to recovering cropped speech samples.

The search for the optimal path is described as follows. Let $S$ be the length (total number of time samples) of the uncropped stego-signal, and $T$ be the length of the cropped stego-signal. Assuming that no additional or duplicate samples are added to the stego-signal, the number of samples cropped is

$$ N = S - T \qquad (12) $$

The following search constraints are imposed on the search region to limit the amount of computation and to ensure appropriate matching between the test and reference strings:

Monotonicity. For the path to be monotonic it must advance in the upward direction, i.e., it should not go “south” or “west” in the grid.
Further, movement of the path in the horizontal or the vertical direction is prohibited, as a single test sample cannot be associated with more than one reference sample and vice versa.

Global path constraints. Since $N$ samples are cropped and the path can only move in the upward direction, element $t(i)$ of the cropped stego-signal can be matched only with the $(N+1)$ elements $s(i)$ to $s(i+N)$ of the stego-signal. A similar constraint is applied at the endpoints. The result is a constrained search region in the form of a diagonal strip, as shown in Figure 9.

Local path constraints. As every sample of the cropped stego-signal is contained in the original stego-signal, the optimal path should include all the test string elements. That is, no skips are permitted along the i-axis. At most $N$ reference string samples may be skipped in the process of finding the optimal path, as $N$ samples were cropped. Thus, for node $(t(i), s(j))$ in the search region, the possible immediate predecessor nodes are $(t(i-1), s(k))$, where $k$ ranges from $(i-1)$ to $(j-1)$.

As a consequence of the Bellman optimality principle [5], the optimal path to the node $(t(i), s(j))$ can be found by considering the best paths associated with all the possible predecessor nodes and choosing the one with the minimum cost,

$$ D_{\min}(i, j) = \min_{(i-1,\,k)} \{ D_{\min}(i-1, k) + d_n(i, j) \}, \quad k = (i-1), \ldots, (j-1) \qquad (13) $$

After all the nodes in the search region are considered, a set of $N+1$ optimal paths is obtained. The first path, that is, the one that involves zero skips, is the same as the cropped stego-signal. The global optimal path is the one associated with the least cost among them. If the first path is associated with the least cost, then it implies that the last $N$ samples of the stego-signal were cropped. It can be observed that the number of optimal paths is one more than the number of cropped samples. This follows as a direct consequence of the search constraints and equation (13).
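The constrained search can be sketched in a few lines. The following is only an illustration under stated assumptions, not the Matlab implementation used in this work: the names `align_cropped` and `reconstruct` are hypothetical, signals are plain Python lists, nodes are indexed by the offset o = j - i (so the predecessor condition of equation (13) becomes "offset no larger than o"), and predecessors are kept in a T x (N+1) band rather than a full T x S matrix. A running minimum over the offsets replaces the explicit minimization over k.

```python
def align_cropped(t, s):
    """Return the indices of s that are missing from its cropped copy t."""
    T, S = len(t), len(s)
    N = S - T  # number of cropped samples, equation (12)
    if N < 0:
        raise ValueError("the cropped signal cannot be longer than the reference")
    if T == 0:
        return list(range(S))
    # prev[o] holds D_min(i-1, j) for j = (i-1) + o, with o = 0, ..., N
    prev = [(t[0] - s[o]) ** 2 for o in range(N + 1)]
    back = [[0] * (N + 1)]  # back[i][o]: offset of the predecessor node
    for i in range(1, T):
        cur = [0.0] * (N + 1)
        row = [0] * (N + 1)
        best, best_o = float("inf"), 0
        for o in range(N + 1):
            # admissible predecessors of offset o are the offsets 0, ..., o
            if prev[o] < best:
                best, best_o = prev[o], o
            cur[o] = best + (t[i] - s[i + o]) ** 2  # nodal cost, equation (11)
            row[o] = best_o
        back.append(row)
        prev = cur
    # termination: the least-cost path among the N+1 candidates
    o = min(range(N + 1), key=prev.__getitem__)
    kept = set()
    for i in range(T - 1, -1, -1):  # backtrack from the terminal node
        kept.add(i + o)
        o = back[i][o]
    return [j for j in range(S) if j not in kept]


def reconstruct(t, s, cropped):
    """Reinsert the reference samples at the cropped positions."""
    it = iter(t)
    cropped_set = set(cropped)
    return [s[j] if j in cropped_set else next(it) for j in range(len(s))]


# Toy reference with distinct sample values (hypothetical data).
s = [((37 * n) % 101) / 101 for n in range(60)]
gone = [3, 4, 20, 59]
t = [x for j, x in enumerate(s) if j not in gone]
found = align_cropped(t, s)
```

With distinct, noise-free samples the zero-cost path is unique, so `found` should equal `gone`. When neighbouring samples are equal, several paths tie and the reported indices may differ from the true ones even though the reconstructed signal is the same.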
Although the paths might have common nodes, they never traverse each other. At every node $(t(i), s(j))$ of a particular optimal path, it is necessary to record the immediate predecessor node from which the path was extended. This way the path may be reconstructed by backtracking, beginning at the terminal node.

The overall algorithm based on the principles above involves the following steps:

i) Initialization: The original node is (0,0) and the nodal cost associated with it is zero. (0,0) is the only predecessor associated with nodes $(t(1), s(j))$, $j = 1, \ldots, (1+N)$.
    $D_{\min}(1, j) = d_n(0, 0) + d_n(1, j)$, $j = 1, \ldots, (1+N)$
    $\psi(1, j) = (0, 0)$, $j = 1, \ldots, (1+N)$
    $\psi(i, j)$ = the index of the predecessor node to $(i, j)$.
    $\delta_1(j) = D_{\min}(1, j)$, $j = 1, \ldots, (1+N)$

ii) Recursion:
    For $i = 2, \ldots, T$
        For $j = i, \ldots, (i+N)$
            Compute $D_{\min}(i, j)$ using (13).
            ($D_{\min}(i-1, j)$ is held in $\delta_1(j)$.)
            ($\psi(i, j)$ is recorded for every $(i, j)$.)
            $\delta_1(j) = D_{\min}(i, j)$
        Next $j$
    Next $i$

iii) Termination: The best path is the one associated with the least cost,
    $\min_j \{ D_{\min}(T, j) \}$, $j = T, \ldots, (T+N)$

iv) Reconstruction: The best path accurately identifies samples of the cropped stego-signal that are present in the stego-signal. The cropped samples are the ones that are not present in the cropped stego-signal. The reconstructed stego-signal can be obtained easily by reinserting the cropped samples at the appropriate places of the cropped stego-signal.

v) Watermark recovery: The watermark recovery process is applied to the reconstructed stego-signal.

3.2.2. Memory and computational requirements

The algorithm requires about $(N+1)T$ nodal costs or distance measures to be computed and approximately $((N+1)(N+2)T)/2$ implementations of equation (13). Considering the memory requirements, a matrix of size $O(TS)$ must be allocated for backtracking. This requirement cannot be replaced by the use of $N+1$ arrays of size $O(T)$ each.
Such a replacement will require precise knowledge of the nodes comprising each path, and this information will not be available until the entire algorithm has been executed. To compute $D_{\min}(i, j)$ at every $(i, j)$ within the search region, it is necessary to have just the past $D_{\min}(i-1, j)$ values for $j = (i-1), \ldots, (i-1+N)$. Therefore, at most an array of dimension $1 \times (N+1)$ is required, assuming that the computation can be done in-place.

3.2.3 Cropping in the presence of additive noise

Though TEC speech watermarking is fairly robust to additive noise, the recovery process is severely affected even if one sample is cropped. However, it was found that the DTW algorithm for watermark detection and recovery functioned quite efficiently in the presence of independent, uncorrelated random noise. The experimental results for different SNR and CWR values are discussed in the next chapter.

3.3 Counterfeit attacks

Also known as protocol attacks [11, 18, 21, 24, 25], counterfeit attacks seek to undermine the concept of watermarking itself by producing fake originals or fake watermarked signals. Counterfeit attacks are not concerned with destroying the embedded watermark or disabling the recovery process. In the context of counterfeit attacks, robustness has a different meaning. A watermarking scheme is said to be robust against counterfeit attacks if the attack does not succeed in creating ambiguity in the resolution of ownership (or any other purpose for which watermarking is used). There are different types of counterfeit attacks, including inversion attacks, multiple watermarks, and copy attacks.

The basic idea behind the watermark copy attack is to copy a watermark from a stego-signal to another signal without knowledge of the watermarking algorithm and the key that were used to create the rightful stego-signal [15, 25]. This is achieved by estimating the embedded watermark either by direct prediction or denoising [25].
In the case of TEC speech watermarking, the coefficients of the embedded doubly encrypted watermark are outcomes of Gaussian random variables. For watermark recovery, a good estimate of these coefficients and knowledge of the encryption keys will be essential. Thus, the copy attack will be extremely difficult to implement in the case of TEC watermarking.

In an inversion attack, the attacker subtracts his or her watermark from the stego-signal. The attacker thus obtains a fake cover-signal (original) and claims to be the owner of the watermarked signal. This can create ambiguity in the resolution of the ownership of the stego-signal. Craver et al. [11] show that non-invertibility of the embedded watermarks is essential for robustness against the inversion attack. Non-invertibility of TEC speech watermarking is discussed in Section 2.3.

The problem of multiple watermarks arises when an attacker inserts another watermark into the already watermarked signal and claims ownership of the signal. As a consequence, this creates ambiguity in the resolution of ownership. The TEC speech watermarking technique can be made robust against such a problem, as discussed below.

Suppose person A is the real owner of a speech signal watermarked using TEC. A's stego-signal is

$$ s = c + \tilde{w}_A \qquad (14) $$

Person A releases only the watermarked speech, and not the cover-signal, to the public. Person B obtains a copy of $s$ and is interested in selling illegal copies. B embeds another watermark $w_B$ into $s$ and circulates illegal copies of $s_B$. It is assumed that $w_B$ is embedded in $s$ in accordance with an additive rule and also that $w_B$ is not correlated with $s$. This ensures that the distortion produced as a result of watermarking the already marked (using TEC) stego-signal is linear and uncorrelated. Robustness against distortion that is non-linear and correlated with the stego-signal is beyond the scope of this research.

$$ s_B = c + \tilde{w}_A + w_B \qquad (15) $$

Person A comes across one of the illegal copies and recovers the watermark $w_A$ from it.
When A tries to sue B, B claims ownership of $s_B$. However, this fails to create enough ambiguity, for the following reasons.

i) It will not be possible for B to show a copy of the speech that does not contain A's watermark.

ii) A has the cover-signal and the stego-signal that do not contain B's watermark, giving credence to the proposition that A is the true owner.

iii) Copies of the stego-signal (if any) not circulated by B contain A's and not B's watermark.

Chapter 4
IMPLEMENTATION DETAILS, RESULTS AND CONCLUSIONS

4.1 Robustness testing engine

In this chapter, the experimental results obtained by testing the robustness of TEC watermarking to additive noise and to cropping are presented. One main problem faced by current digital watermarking technology is the absence of common benchmarks for the evaluation of different watermarking schemes. Petitcolas [28] proposes the establishment of a public benchmarking service. The performance metrics to be used for evaluation are yet to be established. The software packages StirMark [29] and unZign [30] include robustness testing engines for image watermarks. Such services provide a common platform for the evaluation of different watermarking techniques. Public domain software for testing the robustness of audio watermarking techniques is not yet available.

To study the robustness of TEC speech watermarking, the robustness testing engine developed by Ruiz et al. [35] was used. The testing engine can be used to perform 17 tests on the stego-signal. The tests include addition of random noise, cropping, filtering, and μ-law compression and expansion. The robustness testing engine accommodates a high degree of flexibility for setting the parametric values characterizing the tests.

For the evaluation of TEC speech watermarking, an error measure was determined according to the following equation.

$$ E = \frac{\sum_n (\tilde{w}[n] - \tilde{w}'[n])^2}{\sum_n \tilde{w}^2[n]} \qquad (16) $$

In (16), $\tilde{w}$ indicates the original watermark recovery signal and $\tilde{w}'$ indicates the watermark recovery signal obtained from a distorted stego-signal. In addition to the error $E$, the normalized correlation $\rho$, defined in (5), is used to evaluate the performance.

A quality rating (see Table 1) on a scale from 1 to 5 was used to describe the perceived results quantitatively. Martin used a similar rating in [34] to rank the quality of the watermarked image. Two individuals were asked to rate the quality of the stego-signal, the distorted stego-signal and the watermarks recovered from the distorted stego-signal in accordance with Table 1. These ratings were obtained without providing the individuals with knowledge of the error and normalized correlation values. A quality rating of 3 for the recovered watermark is considered sufficient, and it implies that the watermark is identifiable.

Table 1 - Quality rating

Rating  Quality of the watermarked signal  Quality of the recovered watermark  Effect of distortion on the stego-signal
1       Watermark imperceptible            Excellent                           No perceptible damage
2       Perceptible, not annoying          Good                                Perceptible
3       Slightly annoying                  Fair                                Mildly degrading
4       Disturbing                         Poor                                Degrading
5       Very disturbing                    Bad                                 Destructive

4.2 Robustness to additive noise

Experiments were performed to study the robustness of the stego-signal against uncorrelated additive noise. Gaussian or uniformly distributed noise was used. For all the experimental results and simulations presented in this section, a record of 48387 samples (3 seconds) of the speech “Theodore Roosevelt talks about Wilson and Taft” [32] was used as source material. The signal is monaural, sampled at 16 kHz with 16-bit quantization. The “Lena” image was used as the watermark. Table 2 enumerates the experimental results obtained by adding randomly generated Gaussian noise to the stego-signal.
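The noise experiment can be mimicked on toy data as follows. This is a hedged sketch rather than the testing engine of [35]: the function names are hypothetical, the sinusoid merely stands in for a stego-signal, and `squared_error` assumes that the error measure is the squared difference normalized by the energy of the original signal, which is one reading of (16). The SNR follows the definition in (8) to (10).

```python
import math
import random


def mean_power(x):
    """Average of the squared samples, as in (9) and (10)."""
    return sum(v * v for v in x) / len(x)


def snr_db(stego, noise):
    """Stego-signal to noise ratio in dB, as in (8)."""
    return 10.0 * math.log10(mean_power(stego) / mean_power(noise))


def add_gaussian_noise(stego, sigma, mu=0.0, seed=None):
    """Add independent Gaussian noise to every sample."""
    rng = random.Random(seed)
    noise = [rng.gauss(mu, sigma) for _ in stego]
    noisy = [s + n for s, n in zip(stego, noise)]
    return noisy, noise


def squared_error(w, w_dist):
    """Assumed form of the error measure: squared difference over signal energy."""
    num = sum((a - b) ** 2 for a, b in zip(w, w_dist))
    return num / sum(a * a for a in w)


# Hypothetical toy signal; sigma is one of the values listed in Table 2.
stego = [0.1 * math.sin(0.05 * n) for n in range(5000)]
noisy, noise = add_gaussian_noise(stego, sigma=0.0046, seed=0)
snr = snr_db(stego, noise)
err = squared_error(stego, noisy)
```

Because the noise is additive, the error between the clean and noisy toy signals is simply the inverse of the power ratio, so `-10 * log10(err)` should agree with `snr` up to rounding.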
A set of three watermarks was embedded in the 48387-sample speech waveform by dividing it into three frames, each consisting of 16129 samples (Figure 10). For the last five entries in the table, a watermark was embedded selectively in the first frame, as it was associated with higher speech energy. For all results in Table 2, the masking algorithm used a constant gain factor. Since every sample of the encrypted watermark was scaled by a constant, the CWR as defined in (1) was not a constant, but a varying quantity. In Table 2 the average CWR values over every frame and across the entire speech segment are shown. When a constant gain factor is used, the mean CWR across the entire speech segment varies widely from the CWRs averaged across individual frames. The SNR and the normalized correlation between the distorted and original watermark recovery signals are tabulated.

It can be inferred from Table 2 that robustness against Gaussian additive noise depends on the CWR and the SNR. A lower CWR and a higher SNR contribute to increased robustness. Since the embedding process is independent of the speech intensity, watermarking selectively does not contribute to increased robustness, and all the recovered watermarks are of the same quality. In interpreting the normalized correlation value, it must be noted that it is dependent on both the SNR and the CWR. Even if the SNR is low, the normalized correlation between the original and distorted watermark recovery signals may be high if the CWR is low. When the embedded watermarks were very mildly perceptible, corresponding to a mean CWR of approximately 26 dB, the mean of the recovered watermarks was identifiable for an SNR of 42.63 dB. In this case, the noise had the effect of just mildly degrading the stego-signal. When the mean CWR was 21.4 dB, better robustness was exhibited. However, the watermark was perceptible in the speech.

Table 2 - Robustness to Gaussian noise (constant gain factor)

CWR (dB)            Mean CWR  Gaussian noise  SNR     Norm. correlation  Quality of    Effect of noise   Recovered watermarks
1      2      3     (dB)      μ    σ          (dB)    ρ                  stego-signal  on stego-signal   1    2    3
31.4   19.1   27.6  26.04     0    .0046      26.61   0.3491             2             4                 5    5    5
31.4   19.1   27.6  26.03     0    .00092     42.63   0.8784             2             2                 4    4    4
31.4   19.1   27.6  26.06     0    .00009     62.60   0.9985             2             2                 2    2    2
26.4   14.2   22.6  21.06     0    .0092      22.59   0.3161             2             4                 5    5    5
26.4   14.2   22.6  21.05     0    .0046      28.63   0.5479             2             3                 5    5    5
26.4   14.2   22.1  21.06     0    .0037      30.59   0.6302             2             3                 4    4    4
26.4   14.2   22.6  21.06     0    .0023      34.67   0.7925             2             3                 4    4    4
26.4   14.2   22.6  21.06     0    .00092     42.64   0.9562             2             2                 2    2    2
21.4   9.2    17.6  16.06     0    .0046      28.68   0.7605             3             4                 4    4    4
26.4   -      -     26.38     0    .0092      22.55   0.1865             2             4                 5    -    -
26.4   -      -     26.38     0    .0023      34.65   0.5988             2             3                 4    -    -
26.4   -      -     26.38     0    .00092     42.58   0.8805             2             2                 3    -    -
21.4   -      -     21.38     0    .0046      28.60   0.5531             3             4                 4    -    -
21.4   -      -     21.38     0    .00092     42.63   0.9581             3             2                 2    -    -

Figure 10. Robustness of TEC watermarking to Gaussian noise (constant gain factor). 48387 samples of the speech “Theodore Roosevelt talks about Wilson and Taft” [32] were used as the cover-signal. The “Lena” image was used as the watermark. (a) The cover-signal, encrypted stego-signal, the stego-signal distorted by the addition of Gaussian noise and the watermarks recovered from the distorted stego-signal. The recovered watermarks were associated with a quality rating of 4 (Table 2). (b) Mean recovered watermark (quality rating of 3). (c) Histogram of the watermark recovery signal before and after the addition of Gaussian noise. (d) One of the recovered watermarks.
Also shown are the histograms of the original and recovered watermarks, and the encrypted watermark reshaped into an array.

Table 3 - Robustness to Gaussian noise (adaptive gain factor)

CWR     Gaussian noise  SNR     Error (E)              Norm. correlation  Quality of    Effect of noise   Recovered watermarks
(dB)    μ    σ          (dB)    1      2      3        ρ                  stego-signal  on stego-signal   1    2
23.90        .0009      42.59   .071   .6202  .203     0.9599             3             2                 3    5
29.90   0    .0046      28.70   .784   1.076  .957     0.3275             1             4                 5    5
29.90   0    .0009      42.66   .203   .8418  .433     0.8657             1             2                 3    5
29.90   0    .0001      62.67   .002   .0754  .007     0.9983             1             1                 1    3
26.91   0    .0046      28.64   .703   1.087  .869     0.4396             2             4                 4    5
26.90   0    .0009      42.64   .115   .7401  .291     0.9253             2             2                 3    5
22.04   0    .0009      42.66   .058   -      -        0.9500             3             2                 3    -
26.04   0    .007       25.09   .849   -      -        0.2435             2             4                 5    -
26.91   0    .0069      25.13   .959   -      -        0.3071             2             4                 5    -
26.05   0    .0046      28.65   .709   -      -        0.3541             2             4                 5    -
26.05   0    .0009      42.65   .135   -      -        0.8873             2             2                 3    -
25.05   0    .0009      42.65   .119   -      -        0.9061             2             2                 3    -
28.05   0    .0009      42.63   .179   -      -        0.8364             1             3                 3    -
28.04   0    .0001      62.68   .002   -      -        0.9979             1             1                 2    -

Figure 11. Robustness of TEC watermarking to Gaussian noise (adaptive gain factor). 48387 samples of the speech “Theodore Roosevelt talks about Wilson and Taft” [32] were used as the cover-signal. The “Lena” image was used as the watermark. (a) The cover-signal, encrypted stego-signal, the stego-signal distorted by the addition of Gaussian noise and the watermarks recovered from the distorted stego-signal. The recovered watermarks were associated with quality ratings of 3, 5 and 4 respectively (Table 3). (b) Mean recovered watermark (quality rating of 3). (c) Histogram of the watermark recovery signal before and after the addition of Gaussian noise.
The shape of the histogram before attack indicates the use of an adaptive gain factor for watermark embedding. (d) One of the recovered watermarks. Also shown are the histograms of the original and recovered watermarks, and the encrypted watermark reshaped into an array.

Table 3 shows the results obtained by testing the robustness of TEC speech watermarking against Gaussian noise when the masking process uses an adaptive gain factor (also see Figure 11). Hence, the CWR is constant throughout the speech. The error, determined according to (16), is also tabulated. In addition to the CWR and the SNR, robustness (and the error) is influenced by the intensity of the speech. On comparing Tables 2 and 3, it is inferred that for a given quality of the watermarking process, better robustness is exhibited by speech watermarked using an adaptive gain factor. This fact suggests the use of masking algorithms that exploit the perceptual properties of the human auditory system for increased robustness.

The ultimate aim would be to achieve robustness such that the quality of the recovered watermarks is better than, or comparable to, the effect of noise on the stego-signal, even when the embedded watermarks are imperceptible. That is, the rating in the recovered watermarks column, or at least in the mean recovered watermark column (see Table 2 or 3), is less than or equal to the rating in the effect of noise on stego-signal column. At present, such results are achieved only for CWRs that cause watermarks to be at least mildly perceptible. Similar behavior was observed when experiments were conducted using non-zero mean Gaussian noise.

Experiments were conducted to study the robustness of TEC watermarking to uniformly distributed noise (see Figure 12). The results are tabulated in Tables 4 and 5. When the embedded watermarks were mildly perceptible, at least one of the recovered watermarks was identifiable for an SNR of approximately 30 dB for the adaptive gain factor case.
When the CWR was 31.4 dB, the mean recovered watermark was identifiable for an SNR of 28.34 dB. Taking into account the CWR, robustness of TEC watermarking tends to be better in the presence of uniformly distributed noise than in the presence of Gaussian noise.

Table 4 - Robustness to uniformly distributed noise (constant gain factor)

CWR (dB)            Mean CWR  Noise (uniform)  SNR     Norm. correlation  Quality of    Effect of noise   Recovered watermarks
1      2      3     (dB)      μ       max.     (dB)    ρ                  stego-signal  on stego-signal   M    1    2    3
31.4   19.1   27.6  26.0      .0046   .0092    27.40   0.8876             2             4                 5    5    5    5
31.4   19.1   27.6  26.0      .0023   .0046    33.41   0.9259             2             3                 4    5    5    5
31.4   19.1   27.6  26.1      .0012   .0023    39.43   0.9625             2             2                 3    3    3    3
26.4   14.1   22.6  21.1      .0041   .0083    28.34   0.9251             2             3                 3    4    4    4
26.4   14.1   22.6  21.0      .0023   .0046    33.39   0.9568             2             3                 3    3    3    3
21.4   9.16   17.6  16.1      .0046   .0093    27.38   0.9512             3             4                 3    4    4    4
21.4   9.15   17.6  16.1      .0035   .0070    29.09   0.9647             3             3                 3    3    3    3

Table 5 - Robustness to uniformly distributed noise (adaptive gain factor)

CWR     Noise (uniform)  SNR     Error (E)            Norm. correlation  Quality of    Effect of noise   Recovered watermarks
(dB)    μ       max.     (dB)    1      2      3      ρ                  stego-signal  on stego-signal   M    1    2    3
29.89   .0046   .0092    27.39   .178   .294   .233   0.8103             1             4                 5    5    5    5
29.91   .0035   .0069    29.90   .141   .286   .209   0.8365             1             3                 5    4    5    5
29.90   .0012   .0023    39.44   .046   .214   .099   0.9351             1             2                 3    3    5    3
26.90   .0035   .0069    29.92   .119   .271   .172   0.8682             2             4                 4    4    5    4
23.92   .0042   .0083    28.29   .111   .259   .157   0.8838             3             4                 4    3    5    5
23.90   .0035   .0069    29.90   .077   .248   .138   0.9006             3             3                 3    3    5    4
Figure 12. Robustness of TEC watermarking to uniformly distributed noise (constant gain factor). 48387 samples of the speech “Theodore Roosevelt talks about Wilson and Taft” [32] were used as the cover-signal. The “Lena” image was used as the watermark. (a) The cover-signal, encrypted stego-signal, the stego-signal distorted by the addition of noise and the watermarks recovered from the distorted stego-signal. The recovered watermarks were associated with a quality rating of 4 (Table 4). (b) Mean recovered watermark (quality rating of 3). (c) Histogram of the watermark recovery signal before and after the addition of uniformly distributed noise. (d) One of the recovered watermarks. Also shown are the histograms of the original and recovered watermarks, and the encrypted watermark reshaped into an array.

4.3 Robustness to cropping

The DP algorithm for the detection of cropped speech samples and watermark recovery is described in Section 3.2.1. The Matlab implementation differs slightly from the description found in Section 3.2.1. This deviation was necessary to account for the “out of memory” problems encountered in Matlab when the algorithm was used for a speech sequence consisting of more than approximately 7000 samples.

4.3.1. Implementation details of the modified DP algorithm

The algorithm described in Chapter 3 requires a matrix of size $O(TS)$, where $T$ and $S$ are the lengths of the cropped and original stego-signals, respectively. The values of $T$ and $S$ employed here result in out of memory problems when the unaltered DP algorithm is implemented in Matlab. One simple remedy would be to break down the long speech sequence (greater than 7000 samples) into shorter sequences and to apply the algorithm separately to each of them. However, this would necessitate the determination of the exact number of cropped samples in each of the shorter segments.
For this, the exact end-points may have to be determined by cross-correlation between the original stego-signal and the cropped speech segment in the appropriate region. Such an approach may not perform optimally in the presence of noise, as noise might hinder the accurate determination of the end-points.

Hence, the implementation of the algorithm was modified to avoid both the out of memory problems and the need to determine the exact number of cropped samples in each of the shorter speech segments. The modified form determines the global best path and requires $p$ matrices, each of which comprises $m$ rows and $m+N$ columns. Here, $m$ is a number less than 8000 and greater than $N$, the number of cropped samples. The modification involves dividing the cropped stego-signal into frames of $m$ samples each, except the last one, which may contain fewer than $m$ samples. $p$ is the total number of frames, excluding the last one. The search constraints described in Section 3.2.1 are applied here (Figure 13). The algorithm proceeds in the same way as the original version, by assigning costs to nodes and applying the Bellman optimality principle. During the transition from one frame to another, the costs associated with the last $N+1$ nodes of the previous frame [in the case of the first frame they include the nodes $(t(m), s(j))$, $m \le j \le m+N$] are taken as the initial costs associated with the next frame. Backtracking information for each of the $m$-sample frames is stored in $p$ matrices of dimension $(m, m+N)$. At the end of the last (that is, the $(p+1)$th) frame, the global best path is chosen from the $N+1$ optimal paths by selecting the one with the least cost. On backtracking across the various frames, the global best path is reconstructed.

Figure 13.
Modified implementation of the DP algorithm

The modified algorithm involves the following steps:

i) Initialization: The origin node is (0,0) and the nodal cost associated with it is zero. (0,0) is the only predecessor of the nodes $(t(1), s(j))$, $j = 1, \ldots, (1+N)$.
    $D_{\min}(0,0) = d_N(0,0)$, $i = j = 0$
    $\delta_1(j) = D_{\min}(0,0)$

ii) Recursion:
    For $k = 1, \ldots, p$
        For $i = (k-1)m+1, \ldots, km$
            For $j = i, \ldots, (i+N)$
                $D_{\min}(i,j) = \min_{(i-1,\,j')} \{ D_{\min}(i-1,j') + d_N(i,j) \}$, $j' = (i-1), \ldots, (j-1)$
                ($D_{\min}(i-1,j')$ is held in $\delta_1(j')$.)
                Record $\psi_k(i,j)$, the index of the predecessor node to $(i,j)$ in the kth frame.
                $\delta_1(j) = D_{\min}(i,j)$
            Next j
        Next i
    Next k

iii) Termination:
    For $i = pm+1, \ldots, T$
        For $j = i, \ldots, (i+N)$
            $D_{\min}(i,j) = \min_{(i-1,\,j')} \{ D_{\min}(i-1,j') + d_N(i,j) \}$, $j' = (i-1), \ldots, (j-1)$
            ($D_{\min}(i-1,j')$ is held in $\delta_1(j')$.)
            Record $\psi_{p+1}(i,j)$, the index of the predecessor node to $(i,j)$ in the last frame.
            $\delta_1(j) = D_{\min}(i,j)$
        Next j
    Next i
    The best path is the one associated with the least cost, $\min_j \{ D_{\min}(T,j) \}$, $j = T, \ldots, (T+N)$.

iv) Reconstruction: By backtracking through the (p+1) frames, the global best path is obtained. The best path identifies which samples of the stego-signal are present in the cropped stego-signal. The cropped samples are those that are not present in the cropped stego-signal. The reconstructed stego-signal is obtained by reinserting the cropped samples at the appropriate places in the cropped stego-signal.

v) Watermark recovery: The watermark recovery process is applied to the reconstructed stego-signal.

The computational requirements of this modified implementation are the same as those of the original algorithm (Section 3.2.2). Instead of a single matrix of size O(TS), the modified implementation requires p matrices of size O(m^2 + mN), where m is small compared to T. This modification solves the out-of-memory problems.

4.3.2. Experimental results

As an example, the DP algorithm was applied to a cropped stego-signal watermarked using TEC.
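The steps of Section 4.3.1 can be illustrated with a short sketch. The Python code below is not the Matlab implementation used in this work; it is a minimal illustration under stated assumptions. The nodal cost d_N(i, j) is taken to be the absolute difference |t(i) - s(j)| (the cost actually used belongs to the algorithm description of Chapter 3 and is not reproduced here), the number of cropped samples N = S - T is assumed known, indices are zero-based, and the backtracking information of each frame is stored by offset j - i (N+1 columns) rather than in an m by (m+N) matrix. The function names detect_cropped and reconstruct are hypothetical.

```python
def detect_cropped(s, t, m):
    """Return indices of s that are judged missing from the cropped signal t.

    s: original stego-signal, t: cropped stego-signal, m: frame length.
    The nodal cost |t[i] - s[j]| is an assumption made for this sketch.
    """
    T, S = len(t), len(s)
    N = S - T                        # number of cropped samples
    if N < 0 or m < 1:
        raise ValueError("need len(t) <= len(s) and m >= 1")

    # delta[o] holds D_min of the previous row at offset o = j - i.
    # All zeros initially: the origin (0,0), with zero cost, is the only
    # predecessor of every node in the first row.
    delta = [0.0] * (N + 1)
    frames = []                      # one backpointer table per frame

    for start in range(0, T, m):     # frames of m samples; last may be shorter
        psi = []
        for i in range(start, min(start + m, T)):
            new_delta, back = [], []
            best, best_o = float("inf"), 0
            for o in range(N + 1):   # node (i, j) with j = i + o
                # Allowed predecessors are offsets 0..o of the previous row,
                # so a running minimum replaces the inner search.
                if delta[o] < best:
                    best, best_o = delta[o], o
                new_delta.append(best + abs(t[i] - s[i + o]))
                back.append(best_o)
            delta = new_delta        # costs carried into the next row/frame
            psi.append(back)
        frames.append(psi)

    # Termination: pick the least-cost end node, then backtrack frame by frame.
    o = min(range(N + 1), key=lambda k: delta[k])
    kept = set()
    i = T - 1
    for psi in reversed(frames):
        for back in reversed(psi):
            kept.add(i + o)          # sample s[i + o] survives as t[i]
            o = back[o]
            i -= 1
    return [j for j in range(S) if j not in kept]


def reconstruct(s, t, missing):
    """Reinsert the missing samples (taken from s) into t."""
    gaps, rest = set(missing), iter(t)
    return [s[j] if j in gaps else next(rest) for j in range(len(s))]


# Toy data: 200 distinct values, five samples cropped, frames of 32 samples.
s = [((37 * k) % 211) / 211 for k in range(200)]
cropped = [3, 50, 51, 120, 199]
t = [x for j, x in enumerate(s) if j not in cropped]
found = detect_cropped(s, t, m=32)
print(found == cropped, reconstruct(s, t, found) == s)
```

In this noiseless toy case the sample values are distinct, so the zero-cost path is unique and the cropped indices are found exactly. With repeated sample values or additive noise the least-cost path need not be unique, which is consistent with the degradation reported for low SNRs.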
The cover-signal was obtained from the TIMIT speech database [33] and has a male voice saying: "She had your dark suit in greasy wash water all year." In Figure 14, 7968 speech samples were used. Quasi m-arrays of dimension 63x63 were used for encryption.

[Figure 14 panels: (a) cover-signal, encrypted stego-signal, cropped stego-signal; (b) cover-signal, encrypted stego-signal, reconstructed stego-signal.]

Figure 14. DP algorithm for watermark recovery. (a) Cropping and watermarks recovered from cropped speech. (b) Reconstructed stego-signal obtained after the application of the DP algorithm, and watermarks recovered from the reconstructed stego-signal.

After the cover-signal was TEC watermarked, 150 samples of the stego-signal were randomly cropped using the robustness testing engine. The watermarks recovered from the cropped stego-signal are shown in Figure 14(a). The DP algorithm was then applied to the cropped stego-signal to detect the cropped samples and to reconstruct the stego-signal. The cropped samples were accurately determined and the watermarks were recovered from the reconstructed stego-signal (Figure 14(b)). The robustness of the DP algorithm was tested with varying CWRs and varying numbers of cropped samples. In the absence of additive noise, the cropped samples were accurately determined under all tested conditions.

Table 6. Robustness to cropping and additive noise (adaptive gain factor)

CWR (dB)   Gaussian noise           SNR (dB)   Number of   Error (E)   Normalized      Cropped samples
           μ      σ                            cropped                 correlation ρ   accurately determined
                                               samples                                 (Yes/No)
32.2197    0      2.546x10^-4       44.9662    19          0.2663      0.9134          Yes
32.1181           0.0013            30.8328    19          0.8355      0.4365          Yes
32.2246           0.0023            25.7200    19          0.9234      0.3252          Yes
31.9115           0.0026            25.3634    19          0.9902      0.2323          Yes
32.0573           0.0025            25.2290    92          1.0290      0.1167          Yes
31.9341           0.0128            11.3201    3           1.2426      0.0391          Yes
32.2893           0.0115            12.464     19          1.1785      0.0296          Yes
31.8826           0.0115            12.4326    92          1.2323      0.0861          No
26.0974           0.0115            12.2692    92          1.2043      0.0960          No

4.4.
Robustness to cropping in the presence of noise

The previous section showed that, in the absence of noise, the cropped samples were accurately determined. The DP algorithm was also tested for watermark recovery from stego-signals distorted by additive noise as well as cropping. TEC speech watermarking was found to be fairly robust to additive noise alone (Section 4.2) when the embedded watermarks were not perfectly imperceptible. It is important for the DP algorithm to tolerate noise at least to the extent to which TEC speech watermarking is robust against additive noise. Using Ruiz's robustness testing engine, the stego-signal was randomly cropped and subjected to additive noise. In all the experiments (Table 6), 961 samples of the utterance "She had your dark suit in greasy wash water all year" [33] were used. The accuracy of the DP algorithm was verified by comparing the actual cropped samples with the missing samples detected by the algorithm.

The performance depends mainly on the SNR. The algorithm is robust for an SNR of 11.3 dB or above. It was observed that as the SNR approaches the 11.3 dB threshold, the performance degrades with an increase in the number of cropped samples (see Table 6). For example, at an SNR of about 12.4 dB the cropped samples were accurately determined when 19 samples were cropped, but not when 92 samples were cropped. The DP algorithm tolerates additive noise at levels well beyond the robustness threshold of TEC speech watermarking itself (an SNR of approximately 30 dB when the embedded watermarks were mildly perceptible). Experiments have confirmed that for the range of importance, that is, when the recovered watermarks are identifiable, the algorithm is reliable.

4.5. Conclusions

The salient points from the experiments described above are summarized as follows:

• The robustness of TEC speech watermarking to additive noise is mainly dependent on the SNR and CWR. Higher SNRs and lower CWRs contribute to increased robustness. The need to maintain the perceptual transparency of the embedded watermark imposes a lower limit on the CWR.
• When the watermark masking algorithm uses an adaptive gain factor, watermarks embedded in the higher-intensity regions of speech exhibit better robustness.

• The DP algorithm for the detection of cropped samples and subsequent watermark recovery performs with 100% accuracy in the absence of noise. In the absence of noise, the performance is independent of the number of samples cropped.

• In the presence of cropping and uncorrelated additive noise, the performance of the DP algorithm is mainly determined by the SNR and the number of cropped samples.

• Unlike most watermarking techniques [9,10], TEC watermarking admits watermark recovery, not just watermark detection. If the watermark contains information identifying the owner or title, its recovery lends greater credence to the claim of true ownership.

Chapter 5

FUTURE WORK

Digital watermarking is an emerging technology and it faces problems typical of many new signal-processing endeavors. The main problems include difficulty in dealing with many types of attacks, a lack of standard tools with which to assess and compare watermarking schemes, and the lack of clear definitions of watermarking requirements [42]. As the need for the technology increases, these problems will have to be resolved if the methods are to be effective and therefore accepted by those depending on the technology for copyright protection.

In addition to the challenges common to all watermarking techniques, TEC speech watermarking as described in this thesis requires further research in a number of areas. Some areas identified for future work are as follows:

• Robustness of TEC watermarking to other attacks [45] must be studied. In particular, the robustness to signal-processing transformations such as resampling, compression, filtering and quantization is important. These transformations may be the consequence of routine and unintentional operations on the stego-signal.
Some of the other deliberate attacks to be studied include collusion attacks, cryptographic attacks and time-scale modification. While studying the robustness, it is also essential to test TEC watermarking against a combination of two or more attacks. Such a robustness study will entail developing a more elaborate robustness testing engine.

• The watermark-masking algorithm as described in this document involves scaling the watermark by a gain factor in accordance with the CWR. Future work in this area comprises the application of masking algorithms that exploit the perceptual properties of the human ear. Perceptual model-based masking algorithms become necessary for watermarking because of the rigid requirements of imperceptibility and robustness. Such an application must be viewed in conjunction with robustness against perceptual model-based compression algorithms like MPEG.

• Little research has been done towards understanding the embedding capacity [31] offered by the different watermarking techniques. Research must be done to determine the best strategies for utilizing the embedding capacity so as to fulfill watermarking and channel bandwidth requirements.

• An appropriate audio transform coding strategy must be implemented to effect compression in conjunction with watermarking. In such a scenario, it will be important to use audio or speech rather than image watermarks.

• In the context of various attacks (although this did not matter for additive noise and cropping), some classes of watermarks may perform better than others. Future work will also pursue an understanding of the watermark characteristics that result in optimum watermarks in the presence of a particular attack.

• TEC was initially developed as an image compression algorithm [3]. Application of TEC watermarking to images and the appraisal of its performance are areas in need of further work.

REFERENCES

1. National Gallery of the Spoken Word, Michigan State U., http://www.ngsw.msu.edu
2. Fco. J. Ruiz and J.R.
Deller, Jr., "Digital watermarking of speech signals for the National Gallery of the Spoken Word," International Conference on Acoustics, Speech and Signal Processing 2000, Istanbul, May 2000.
3. C.J. Kuo, J.R. Deller, Jr. and A.K. Jain, "Pre/post-filter performance improvement of transform coding," Signal Processing: Image Communication, vol. 8, pp. 229-239, 1996.
4. C.J. Kuo and H.B. Rigas, "2-D quasi m-arrays and Gold code arrays," IEEE Transactions on Information Theory, vol. 37, pp. 385-388, March 1991.
5. J.R. Deller, Jr., J.H.L. Hansen, and J.G. Proakis, Discrete Time Processing of Speech Signals (2d ed.), New York: IEEE Press, 2000.
6. A. Gurijala and J.R. Deller, Jr., "Robust algorithm for watermark recovery from cropped speech," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing 2001, Salt Lake City, May 2001 (in press).
7. J.R. Deller, Jr., A. Gurijala, and M.S. Seadle, "Audio watermarking techniques for the National Gallery of the Spoken Word," Proc. 1st ACM-IEEE Joint Conference on Digital Libraries 2001, Roanoke, Virginia, June 2001 (in press).
8. F.A.P. Petitcolas, R.J. Anderson and M.G. Kuhn, "Information Hiding - A Survey," Proceedings of the IEEE, special issue on protection of multimedia content, pp. 1062-1078, July 1999.
9. L. Boney, A.H. Tewfik and K.N. Hamdy, "Digital watermarks for audio signals," Proc. IEEE International Conference on Multimedia Computing and Systems, Hiroshima, pp. 473-480, June 1996.
10. P. Bassia and I. Pitas, "Robust audio watermarking in the time domain," Proc. IX European Signal Processing Conference, Rhodes, Greece, vol. I, pp. 25-28, Sept. 1998.
11. M. Arnold, "Audio watermarking: features, applications and algorithms," Proc. IEEE International Conference on Multimedia and Expo (II), pp. 1013-1016, 2000.
12. W. Bender, D. Gruhl, N. Morimoto, and A. Lu, "Techniques for data hiding," IBM Systems Journal, vol. 35, pp. 313-336, 1996.
13. C.S. Lu, H.Y.M. Liao, and L.H.
Chen, "Multipurpose audio watermarking," Proc. 15th International Conference on Pattern Recognition, Barcelona, Spain, vol. III, pp. 286-289, Sept. 2000.
14. I.J. Cox, J. Kilian, T. Leighton, and T. Shamoon, "Secure spread spectrum watermarking for multimedia," IEEE Transactions on Image Processing, vol. 6, pp. 1673-1687, Dec. 1997.
15. S. Voloshynovskiy, S. Pereira, T. Pun, J.K. Su, and J.J. Eggers, "Attacks and benchmarking," submitted to IEEE Communications Magazine, 2001.
16. I.J. Cox, M.L. Miller, J.M.G. Linnartz, and T. Kalker, "A review of watermarking principles and practices," Digital Signal Processing for Multimedia Systems, K.K. Parhi and T. Nishitani (eds), New York: Marcel Dekker, Inc., pp. 461-485, 1999.
17. G.C. Langelaar, I. Setyawan, and R.L. Lagendijk, "Watermarking digital image and video data," IEEE Signal Processing Magazine, vol. 17, pp. 20-46, Sept. 2000.
18. I.J. Cox, M.L. Miller, and J.A. Bloom, "Watermarking applications and their properties," Proc. IEEE International Conference on Information Technology: Coding and Computing, pp. 6-10, Mar. 2000.
19. M. Wu and B. Liu, "Watermarking for image authentication," Proc. IEEE International Conference on Image Processing, Chicago, vol. 2, pp. 437-441, Oct. 1998.
20. I.J. Cox and J.P. Linnartz, "Some general methods for tampering with watermarks," IEEE Journal on Selected Areas in Communications, vol. 16, pp. 587-593, May 1998.
21. M. Ramkumar and A.N. Akansu, "Robust protocols for proving ownership of images," Proc. IEEE International Conference on Information Technology: Coding and Computing, Las Vegas, pp. 22-27, Mar. 2000.
22. Z. Duric, N.F. Johnson, and S. Jajodia, "Recovering watermarks from images," Information and Software Engineering Technical Report, ISE-TR-99-04, Apr. 1999.
23. G. Voyatzis and I. Pitas, "The use of watermarks in the protection of digital multimedia products," Proceedings of the IEEE, vol. 87, pp. 1197-1207, July 1999.
24. M. Ramkumar, A.N.
Akansu, "Image watermarks and counterfeit attacks: Some problems and solutions," Content Security and Data Hiding in Digital Media, Newark, NJ, May 1999.
25. M. Kutter, S. Voloshynovskiy and A. Herrigel, "Watermark copy attack," IS&T/SPIE's 12th Annual Symposium, Electronic Imaging 2000: Security and Watermarking of Multimedia Content II, San Jose, vol. 3971, Jan. 2000.
26. J.K. Su, J.J. Eggers, and B. Girod, "Capacity of digital watermarks subjected to an optimal collusion attack," Proc. European Signal Processing Conference, Tampere, Finland, Sept. 2000.
27. J.J. Eggers, J.K. Su, and B. Girod, "Asymmetric watermarking schemes," GI Jahrestagung Informatik 2000, Sicherheit in Mediendaten, Berlin, Sept. 2000.
28. F.A.P. Petitcolas, "Watermarking schemes evaluation," IEEE Signal Processing Magazine, vol. 17, Sept. 2000.
29. F.A.P. Petitcolas and R.J. Anderson, "Evaluation of copyright marking systems," IEEE Multimedia Systems, Florence, Italy, vol. 1, pp. 574-579, June 1999.
30. F.A.P. Petitcolas, "unZign: is your watermark secure?," In http://www.cl.cam.ac.uk/~fapp2/watermarking/image watermarkinglunzignz
31. C. Candan and N. Jayant, "A new interpretation of data hiding capacity," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing 2001, Salt Lake City, May 2001 (in press).
32. "Theodore Roosevelt talks about Wilson and Taft" audio file, Vincent Voice Library, Michigan State U. Libraries, http://www.lib.msu.edu/vincent/t roosevelt.m
33. W.M. Fisher, G.R. Doddington, and K.M. Goudie-Marshall, "The DARPA speech recognition research database: Specifications and status," Proc. DARPA Speech Recognition Workshop, pp. 93-99, 1986.
34. C.G. Martin, "Digital image watermarking techniques," In http://home.att.net/~steamedcrab/masterns.Ef
35. Fco. J. Ruiz, A.A. Gokhale, and J.Y. Lee, Unpublished report, Class research project, ECE 966A, Michigan State U., East Lansing, Fall 1999.