STATISTICAL LEARNING-BASED ADAPTIVE ATTACKS TOWARDS AUDIO
WATERMARKING

By

Weikang Ding

A THESIS

Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of

Computer Science—Master of Science

2025

ABSTRACT

The abuse of original audios has attracted widespread attention in the society. Audio

watermarking, which embeds imperceptible signals into audio content, has been proposed as

an effective way to assert user copyright of audios. Although recent deep learning-based audio

watermarking methods have enhanced robustness and capacity compared to traditional approaches,

they are vulnerable to adversarial attacks. Our findings reveal that the message probabilities

output by the watermark decoder follow a normal distribution for both clean and watermarked

audio. This observation can be leveraged to detect existing audio watermark attacks. In this thesis,

we introduce AWM, an adaptive audio watermark attack method designed to bypass existing

detection strategies. The attack has three different types: watermark replacement, watermark

creation, and watermark removal. AWM employs a two-step optimization process: the first step

ensures the success of the watermark attack and bypasses the detection by optimizing message

probabilities within an estimated normal range, while the second step focuses on enhancing

audio quality while maintaining a successful attack. The proposed attack iteratively estimates the

parameters of the normal distribution using a small set of feature-similar audio samples based on

the target audio and applies adaptive optimization to adjust the decoded message probabilities

toward the estimated normal range. We evaluate AWM on two watermarking methods across three

diverse voice datasets and compare the results with existing audio watermark attack techniques.

Our experiments demonstrate that the proposed attack achieves a high attack success rate while

effectively bypassing detection, with detection success rates remaining under 10% for watermark

replacement and watermark creation, and at 0% for watermark removal. Additionally, AWM

exhibits high robustness against various no-box perturbations,

including low-pass filtering,

amplitude scaling, and compression, while maintaining high perceptual audio quality. Our

experiments highlight a significant security gap in current watermark defenses and show that

statistical assumptions about the decoder output can be exploited by attackers. These findings also

provide a foundation for future research in audio watermark attack detection and the development

of more advanced attacks.

ACKNOWLEDGMENTS

I would like to thank my advisor, Dr. Qiben Yan. He not only offered careful guidance in

academic research, but also subtly shaped my thinking, research methodology, and understanding

of academic norms.

Special thanks to my collaborator Hanqing Guo, his deep expertise in audio watermarking

greatly helped me build a strong foundation in this field. I would also like to thank my friends

and collaborators Guangjing Wang, Juexing Wang, Ce Zhou, Yuanda Wang, Bocheng Chen, and

other members of Lab 1100. Their support has provided me with valuable experiences in both

academia and life, and I am truly grateful for the time spent with them.

I would also like to thank my thesis committee members, Dr. Li Xiao and Dr. Huacheng Zeng.

Their insightful comments and constructive suggestions not only enhanced the quality of my thesis

but also deepened my understanding of the research topic.

Finally, I would like to express my deepest gratitude to my parents. Thank you for your

unwavering love and support. It is your care that has given me the courage and strength to pursue

my dreams. In times of difficulty and discouragement, you have always been my greatest source

of strength. Your understanding and encouragement have been the driving force that has sustained

me to this day.

iii

TABLE OF CONTENTS

CHAPTER 1: INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

CHAPTER 2: RELATED WORK .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

CHAPTER 3: BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1

6

8

CHAPTER 4: METHODOLOGY .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

CHAPTER 5: EVALUATION . .

.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

CHAPTER 6: DISCUSSION .

.

.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

CHAPTER 7: CONCLUSION . .

.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

BIBLIOGRAPHY . .

.

.

.

.

.

.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

iv

CHAPTER 1: INTRODUCTION

In recent years, the rapid growth of social networking platforms has encouraged many users to

publicly share their audio content, including original works such as audiobooks and self-produced

music. These audio contents can bring them substantial income. However, many unauthorized

users steal the work of creators, make modifications, and re-upload it to mainstream platforms for

profit, which significantly diminishes the enthusiasm of audio creators. Besides, voice cloning

attacks can illegally synthesize the target’s voice for malicious purposes, potentially resulting in

severe consequences such as financial losses and reputation damage [28]. To address these issues,

audio watermarking has been proposed [34]. This technique embeds a noise-tolerant signal into the

target audio, which remains imperceptible to human hearing while being detectable by specialized

AI models.

Traditional watermarking methods use signal processing techniques to embed the watermark

into the time domain [4,17,21], frequency domain [46], or transform domain [20,39,43]. However,

these traditional watermarking methods struggle to defend against various complex attacks and

offer limited capacity, often relying on a single optimization objective tailored to specific attack

types. In contrast, deep-learning-based watermarking methods greatly address the limitations of

traditional watermarking methods and improve the robustness and generalization of watermarking.

Typically, a watermark is composed of multiple binary bits. Deep-learning-based watermarking

methods use Encoder-Decoder neural network architectures to embed and extract the watermark

from audio. Moreover, they introduce distortion mechanisms to simulate more complex potential

attack scenarios, such as audio re-recording [23], voice cloning [24, 36], and Codec [47].

However, the robustness of deep-learning-based audio watermark methods against adversarial

attacks is a concerning issue [41]. Recent studies [25,44] have revealed that attackers can remove or

forge watermarks by embedding adversarial perturbations. The basic principle involves adding and

optimizing a new perturbation to deceive the watermark decoder into generating incorrect outputs.

1

Figure 1.1: Overview of the watermark attack (left) and the watermark attack detection process
used to detect whether the audio has been tampered with (right).

There are three types of watermark attack scenarios: watermark replacement, watermark

creation, and watermark removal. Through watermark replacement and creation attacks, attackers

can forge fake watermarks in audio to falsely claim ownership and overwrite others’ original

creations with their own. Alternatively, attackers can transfer the copyright of some illegally

created audios to others, which can be used to evade responsibility or frame innocent parties.

Watermark removal attacks, on the other hand, aim to remove the watermark from watermarked

audio, which prevents the watermark decoder from outputting the original message. Through

watermark removal attacks, attackers can remove the original owner’s copyright and redistribute

the content on public platforms, resulting in financial losses for the creators.

Currently, no existing attack method effectively balances attack effectiveness with preserved

audio quality. First, existing attack approaches lack a well-designed method to balance audio

quality with attack effectiveness. The watermark model is typically trained through a joint

2

c. Watermark Removala. Watermark Replacementb. Watermark CreationWatermarked AudioPerturbedDifferent WatermarkClean AudioPerturbedNo WatermarkAttackerAttacked?YESNoDefenderWatermark Attack DetectionPerturbationDeployPerturbationAttackerPerturbedNew WatermarkPerturbationAttackerWatermarked Audioencoder-decoder framework: the encoder embeds the watermark into imperceptible regions of the

audio, while the decoder is trained to robustly extract the watermark under various perturbations.

However, attackers generally do not have access to the watermark encoder and can only query it

to obtain watermarked audio. In contrast, the watermark decoder is more accessible to the public,

making it the primary source of model knowledge in most attack scenarios. Since the decoder

is responsible for extracting watermark messages, effective attacks require carefully balancing

perturbation strength to modify the message while preserving perceptual quality.

Second, our findings indicate that simply altering the binary watermark message bits is

insufficient. Given an audio input, the decoder produces both a binary message and its associated

probability scores. These message probabilities tend to follow a normal distribution, and the

defenders can use this result to detect whether an audio sample has been manipulated. Therefore,

for an attack to remain undetected, the attacker must ensure that the decoded message probabilities

fall within the normal range predefined by the defender.

Third, attackers typically lack access to the training data used to construct the watermark

model. To estimate the distribution parameters required for a successful attack, they need to rely on

analyzing the output of the decoder. Since different data samples can generate different message

probability outputs, the significant difference between the attacker’s estimated parameters and the

defender’s true parameters may lead to attack failure.

In this thesis, we propose AWM, an Adaptive audio Watermark attack Method, which is capable

of bypassing the defender’s detection strategy. Figure 1.1 illustrates the application scenarios. The

attacker obtains the target audio and adds an adversarial perturbation to generate the perturbed

audio. The defender then receives the perturbed audio and uses the watermark decoder to extract the

message probabilities. A predefined set of distribution parameters is employed to detect outliers. If

any outliers are identified, the audio is classified as “attacked”; otherwise, it is considered “clean”.

To design AWM, we face the following challenges:

C1: How to design an attack method to balance audio quality and attack effectiveness?

Balancing audio quality and attack effectiveness can be viewed as a game-theoretic challenge.

3

An overly aggressive attack can significantly degrade audio quality, while prioritizing perceptual

quality may compromise attack success. Therefore, it is essential to strike a balance that ensures

sufficient attack effectiveness while preserving audio quality.

C2: How to design an attack to bypass the detection strategy? The goal of the perturbed audio is

to bypass the detection strategy. The defender detects perturbed audio by analyzing the distribution

of decoded message probabilities. After successfully altering the binary watermark message bits,

attackers must further optimize the audio to ensure that the decoded message probabilities fall

within the range classified by the defender as non-outliers.

C3: How to select the suitable audio samples to estimate the decoded message probability

distribution? Attackers do not have access to the training data used by the watermark model. To

estimate the distribution parameters of decoded message probabilities, they must rely on a limited

set of audio samples. This necessitates designing an effective estimation strategy and selecting

audio samples with features similar to the target to improve the accuracy of distribution fitting.

Our Idea: We propose three solutions to the three challenges above. To address C1, we design a

two-step approach for the watermark attack methods. The first step focuses on maximizing attack

effectiveness by ensuring the attack succeeds while evading the defender’s detection strategy. The

second step aims to improve the audio quality. Inspired by [26], we set a reasonable threshold to

enhance the audio quality while constraining the decoded message probabilities to remain within

the expected normal range. To address C2, we introduce an adaptive optimization strategy that

guides decoded message probabilities toward the estimated normal distribution range. The core

idea is to prioritize optimization efforts on the probabilities that fall outside the estimated normal

range. In other words, if a perturbed message probability already falls within the estimated normal

range, its optimization weight is reduced. This ensures that the optimization process remains

dynamic and focuses on the most optimization-needed binary message bits. To address C3, data

with similar feature distributions are more likely to produce similar decoded message probability

distributions [30, 31]. We collect a small set of audio samples and estimate the probability

distribution by selecting those whose features closely resemble those of the target audio.

4

Contribution: In this thesis, we make the following contributions:

• We observe that the decoded message probabilities output by the watermark decoder follow

normal distributions. This statistical property can be leveraged by defenders to design

detection strategies based on outlier detection.

• We propose AWM, an adaptive audio watermark attack method targeting three attack types.

Our approach successfully bypasses detection strategies through an adaptive two-step

optimization framework. The first step enhances the effectiveness of the watermark attack,

while the second step focuses on improving audio quality. To initialize the optimization,

we estimate the parameters of the normal distribution using a limited set of audio samples

selected based on feature similarity to the target audio. Our adaptive optimization

strategy prioritizes message probabilities that require adjustment, further improving attack

performance.

• We evaluate the effectiveness of AWM on three speech datasets and two state-of-the-art

watermarking models. Compared to the baseline, our method achieves superior performance

in both Attack Success Rate (ASR) and Detection Success Rate (DSR). For watermark

replacement and creation, AWM achieves the DSR scores below 10%, which is acceptable

given a False Acceptance Rate (FAR) of around 5%. For watermark removal, AWM

achieves the DSR scores of 0%. Furthermore, even after applying five no-box perturbations,

AWM consistently maintains a high ASR, with most scores nearing or reaching 100%.

5

CHAPTER 2: RELATED WORK

2.1: Deep Learning-Based Audio Watermarking Scheme

Unlike traditional schemes [18], which rely on predefined transformations, deep-learning-based

schemes can learn complex feature representations and optimize watermarking dynamically.

This scheme follows the architecture of the Encoder-Distortion-Decoder [7, 24, 36]. The encoder

embeds the message into the audio and generates the watermarked audio, the decoder receives the

watermarked audio and extracts the corresponding messages, and the distortion simulates a variety

of potential attack scenarios. This thesis builds upon the scheme and implements both the defense

mechanism and the corresponding attack strategy.

2.2: Audio Watermark Attack

Audio watermark attacks based on adversarial perturbations can be categorized as no-box,

black-box, gray-box, or white-box, depending on the attacker’s knowledge of the watermarking

model. In no-box perturbations, the attacker relies on general audio processing techniques that

distort the signal while attempting to preserve perceptual quality. These attacks aim to remove the

watermark from watermarked audio using techniques such as band-pass filter, amplitude scaling,

audio compression [12], and others. Additionally, methods like voice conversion [11, 22] and

text-to-speech [35, 38] have also proven effective in removing watermarks [41].

In black-box

perturbations, the attacker can query the watermarking detector but does not have access to its

internal architecture or parameters. AudioMarkBench [25] performs the watermark removal attack

by applying the thought of HSJA [9] and Square Attack [2]. In gray-box perturbations, the attacker

is assumed to know the architecture of the decoder but does not have access to its trained weights.

In white-box perturbations, the attacker has full access to the detector, including its architecture and

parameters, and introduces a perturbation to perform the gradient-based attack [6,27]. While some

6

existing studies have successfully demonstrated watermark removal attacks, our experimental

results show that none can effectively perform watermark replacement and watermark creation

attacks. Prior work in the image domain has explored averaging watermark patterns for attack

purposes [44], but this approach is ineffective in the audio domain. In this thesis, we focus on

performing watermark replacement, watermark creation, and watermark removal attacks.

2.3: Audio Watermark Attack Detection

Adversarial examples and watermarks both achieve the desired goal by adding imperceptible

noise. Different from adversarial examples, adding watermarks to objects has little impact on

the performance of the model inference (such as audio classification models [13] and speech

recognition models [33]). Therefore, some outlier detection methods based on the time series [5,42]

are generally ineffective for identifying whether an audio sample has been watermarked or if

the watermark has been removed or forged. Recent works have explored detection methods

for generated images [15, 40], watermark images [19, 30], audio deepfake [1, 45], and dataset

copyright protection [14]. However, to the best of our knowledge, these approaches are not

directly applicable to detecting audio watermark replacement, creation, or removal attacks.

7

CHAPTER 3: BACKGROUND

3.1: Preliminary

Adding perturbations to the audio is a common method to perform watermark attacks. The core

idea is to either destroy the original watermark or forge a new one by introducing perturbations

that deceive the watermark decoder. There are three types of attacks (as shown in Figure 3.1):

watermark replacement attacks, which aim to replace an existing watermark with a different one;

watermark creation attacks, which aim to embed a new watermark into clean audio; and watermark

removal attacks, which aim to eliminate the original watermark from a watermarked audio.

Audio Watermarking Framework.

The audio watermarking framework has three main

components: encoder, decoder or detector, and distortion layer. The encoder embeds the message

into the clean audio and then generates the watermarked audio. The decoder or detector receives

the watermarked audio and then outputs the extracted message. The distortion layer simulates

potential attack scenarios. Figure 3.2 shows the Encoder-Distortion-Decoder architecture.

Audio Watermark Decoder. The audio watermark decoder Dec(·) takes the encoded audio as

input and outputs the extracted message1. The extracted message can be represented in two forms:

a binary message and message probabilities. The binary message is a direct representation of the

decoded output as a sequence of 0s and 1s, denoted as m ∈ {0, 1}N . The message probability form

provides the decoder’s confidence values for each bit, indicating the likelihood of the bit being 1 or

0. A predefined threshold θ is typically used to convert probabilities into binary message values: if

a probability exceeds the threshold, the binary message bit is decoded as 1; otherwise, it is decoded

as 0. The form of message probabilities is:

p = Dec(s),

(3.1)

1We only consider multi-bit messages in this thesis.

8

Figure 3.1: Watermark attack types.

where s is clean or watermarked audio, Dec(·) outputs the message probabilities2.

Watermark Replacement. Let sc denote clean audio and sw denote watermarked audio. The

watermark decoder Dec(·) extracts the clean message probabilities pc from sc and the watermarked

message probabilities pw from sw, respectively. The attacker specifies a target message probability,

denoted as ptarget. A perturbation δ is introduced to perform the attack on the audio.

In watermark replacement (Figure 3.1-a), a perturbation δ is added to the watermarked audio sw,

deceiving the decoder into misclassifying the embedded watermark as a different one. Formally,

the goal of watermark replacement is:

δreplacement = arg min

δ

||Dec(sw + δ) − ptarget||.

(3.2)

Watermark Creation. Watermark creation (Figure 3.1-b) involves adding a perturbation δ to a

2The message probabilities p can be same as the binary message.

9

PerturbationTargeted MessagePerturbationPerturbationNo Message DetectedWatermarked AudioWatermarked AudioClean AudioWatermark DecoderPerturbed Audioa. Watermark Replacementb. Watermark Creationc. Watermark Removal110…011110…011-----------000…111000…111Targeted MessageWatermark DecoderWatermark DecoderPerturbed AudioPerturbed AudioFigure 3.2: The audio watermarking framework follows the Encoder-Distortion-Decoder
architecture.

clean audio sc to generate an perturbed watermarked audio, which deceives the watermark decoder

into recognizing it as containing a valid watermark. Formally, the goal of watermark creation is:

δcreation = arg min

δ

||Dec(sc + δ) − ptarget)||.

(3.3)

Watermark Removal. Watermark removal is an untargeted attack aimed at removing the

embedded watermark. In a watermark removal attack (Figure 3.1-c), a perturbation δ is added

to the watermarked audio sw to deceive the watermark decoder into outputting a binary message

that does not match the original watermark binary message. The perturbed audio, denoted as ˆsc,

is considered clean (i.e., unwatermarked).

In this context, we introduce a watermark detector,

Detector(·), which determines whether the audio contains a watermark. Formally, the goal of

watermark removal is:

δremoval = arg min

δ

(Detector(sw + δ) = None).

(3.4)

3.2: Message Probability Distribution

Benign Distribution.

The watermark decoder outputs the values of probabilities for each binary message bit. A bit

is decoded as 1 if its probability exceeds a predefined threshold; otherwise, it is decoded as 0. For

both clean and watermarked audio, we observe two distinct distributions, each following a normal

distribution pattern. Clean audio exhibits a unimodal normal distribution, with the peak density

10

Clean audioEncoderWatermarked audioMessageDistortion LayerDecoder/DetectorExtracted MessageDistorted Watermarked audio(a) Normal distribution with clean audio.

(b) Normal distribution with watermarked audio.

Figure 3.3: Benign distribution of message probabilities (Timbre).

(a) Normal Distribution with Clean Audio.

(b) Normal Distribution with Watermarked Audio.

Figure 3.4: Benign distribution of message probabilities (AudioSeal).

centered near the predefined threshold. As shown in Figure 3.3a, where the threshold is set to

0, the mean µ of the distribution is also close to 0. In contrast, the watermarked audio follows

a bimodal, approximately normal distribution (Figure 3.3b), with two peaks corresponding to the

decoded binary message values 0 and 1. Specifically, in the Timbre, message probabilities below

0 are decoded as 0, while message probabilities above 0 are decoded as 1. Figure 3.4 shows the

distribution findings of the AudioSeal.

Audio Watermark Attack Distribution. For existing watermark attack methods [25], we observe

that the distributions of message probabilities deviate significantly from the benign distributions.

Figure 3.5a illustrates the distribution of watermark removal attacks, which aim to remove the

watermark from watermarked audio. Compared to Figure 3.3a, the attacked distribution differs

11

0.40.20.00.20.40.6Message Probabilities0.00.51.01.52.02.53.03.54.0DensityDistribution of Message Probabilities1.00.50.00.51.0Message Probabilities0246810DensityDistribution of Message Probabilities0.440.460.480.500.520.540.560.58Message Probabilities0510152025DensityDistribution of Message Probabilities0.20.30.40.50.60.70.8Message Probabilities02468101214DensityDistribution of Message Probabilities(a) Message distribution under watermark removal.

(b) Message distribution under watermark creation.

Figure 3.5: Distribution of message probabilities under attacks (Timbre). We use the
AudioMarkBench to perform the removal and creation attacks.

in the range of message probabilities along the x-axis, even though it still exhibits a unimodal

normal distribution. Figure 3.5b presents the distribution of watermark creation attacks. Here,

most message probability values cluster around the threshold value of 0. In contrast to the clear

bimodal normal distribution shown in Figure 3.3b, the distribution resulting from the attack visually

diverges from the benign distribution.

Outlier Detection. The 3-sigma rule, also known as the 68–95–99.7 rule, is a widely adopted

method for identifying outliers in normally distributed data. It states that approximately 68% of

values fall within one standard deviation of the mean, 95% within two, and 99.7% within three.

Values that fall beyond three standard deviations from the mean are considered statistically rare

and are classified as outliers. Based on this principle, the defender can detect potential attacks by

estimating the distribution parameters of the message probabilities and identifying values that fall

outside the 3-sigma range.

3.3: Threat Model

Roles. In this thesis, we focus on two roles: defenders and adversaries. The defenders are the attack

detectors, who identify whether audio has been tampered with. The adversaries attempt to perform

watermark attacks to either claim new copyright ownership or remove the original copyright.

Attack Capabilities. For adversaries, we have the following assumptions: 1) Adversaries have

12

5051015Message Probabilities0.000.050.100.150.200.250.30DensityDistribution of Message Probabilities (Removal)0.750.500.250.000.250.500.751.00Message Probabilities0.00.20.40.60.81.01.21.41.6DensityDistribution of Message Probabilities (Creation)no access to the data used by defenders to fit the distribution, nor to the audio dataset used to train

the watermarking model. 2) They have access to the watermark architecture and parameters of

the watermark decoder, allowing them to perform gradient-based attacks.3 3) They do not possess

the complete watermarking model, but they can embed watermarks into a small set of clean audio

samples. 4) They are aware that the decoded message probabilities output by the watermark decoder

follow a normal distribution, but they do not know the corresponding mean and standard deviation.

For defenders, we have several assumptions: 1) Defenders have access to a large amount of

ground-truth audio samples, which are used to fit a normal distribution and estimate the mean and

variance via maximum likelihood estimation. 2) The watermarking model is publicly available

through an online platform for commercial use, with restrictions on the number of watermarked

audio generations allowed per user per day.

Attack Scenario. The adversaries aim to estimate the normal distribution parameters of the

decoded message probabilities. However, due to their lack of access to the training dataset, they

can only collect a limited amount of data independently. When targeting a specific audio sample

for attack, they identify a small set of audio samples with feature distributions similar to that of

the target and use these to estimate the mean and standard deviation. After estimating the mean

and standard deviation, adversaries deploy the watermark decoder and perform gradient-based

watermark attacks to constrain the decoded message probabilities within the estimated normal

range. On the defense side, defenders design a defense strategy based on the ground-truth

distribution. This strategy should have a reasonable false acceptance rate. When a potentially

compromised audio sample is received, the defender examines the decoded message probabilities

to determine whether it has been tampered with.

3The knowledge of watermark model’s parameter setting can be extend to black-box by adopting HSJA [9] and

Sign-Opt [10].

13

CHAPTER 4: METHODOLOGY

4.1: Audio Watermark Attack Detection

As observed in the previous section, the distribution of message probabilities in perturbed audio

differs significantly from that of benign audio. Based on this observation, we propose a detection

mechanism to distinguish between perturbed and benign audio. It is important to note that this task

is non-trivial, as the defender does not have access to the exact distribution of message probabilities

under normal (benign) conditions, making it challenging to achieve optimal detection performance.

To address this issue, we propose a two-step approach.

First, the defender possesses a large collection of ground truth audio samples, including both

clean audio and watermarked audio generated by the watermark encoder. These samples are used to

model the message probability distributions. The defender deploys a watermark decoder to extract

decoded message probabilities and categorizes them into two groups: probabilities from clean

audio and probabilities from watermarked audio. Next, the defender applies maximum likelihood

estimation to fit the predicted distributions. As illustrated in Figure 3.3, this process outputs three

normal distributions: one derived from message probabilities of clean audio and two (red and blue

boundary line on the right) are from those of watermarked audio. Given n watermarked samples,

each producing a probability vector of size 1×N message probabilities, the complete set of message

probabilities is denoted as p = {p1, p2, ..., pn}. The mean µ and standard deviation δ of the normal

distribution estimated via maximum likelihood are:

µ =

σ2 =

n∑

pi,

1
n

i=1
n∑
1
n

i=1

(pi − µ)2.

(4.1)

Finally, the defender obtains the corresponding means µ and standard deviations δ, which are used

14

Figure 4.1: Audio watermark attack detection based on the predicted distribution of ground truth
(GT) audio. The defender uses ground truth (GT) audio (watermarked and clean) to estimate the
predicted distribution (top) and applies outlier detection based on this distribution to detect
whether an audio sample is attacked (bottom).

to detect outliers in suspicious audio and determine whether it has been attacked.

Figure 4.1 illustrates the audio watermark attack detection scenario. The defender collects

two large datasets (one consisting of watermarked audio and the other of clean audio) and uses a

watermark decoder to extract message probabilities. These probabilities are then used to generate

two sets of predicted distributions via maximum likelihood estimation. When the defender receives

a potentially perturbed audio sample, the watermark decoder is applied to extract its message

probabilities distributions. Using the 3-sigma rule, the defender identifies any probability values

that fall outside the expected range. If such outliers are detected, the audio is classified as attacked;

otherwise, it is considered clean.

4.2: Adaptive Attack Design

Given the success of message-probability-distribution-based attack detection, we consider to

propose more dangerous attack:

the adaptive watermark attack. Assuming the attackers are

aware of a statistics-based detection mechanism, their goal is to bypass the detection. This is also

15

DetectPerturbedAudioOutlier DetectionAttacked?GT AudioEstimateDeployWatermark DecoderPredicted DistributionOutputMessage ProbabilitiesYESNoWatermark DecoderMessage ProbabilitiesDeployOutputa. Predicted Distribution Estimationb. Watermark Attack DetectionFigure 4.2: The distribution estimation by the attacker. (a) Watermark replacement and creation:
The attacker uses a small set of clean audio samples to generate watermarked audio samples,
which are then used to estimate the distribution. (b) Watermark removal: The attacker directly
uses the clean audio samples to estimate the distribution.

non-trivial, as the attacker does not know the detection distribution used by defender, nor aware

of the defender’s outlier detection approach, moreover, the attacker need to preserve the audio

quality, this limit the strength of the perturbation. We introduce the adaptive attack design as

follows.

4.2.1: Adaptive Attack Pipeline

First, the attacker prepares for the attack by estimating the defender’s distribution. Second, the

attacker performs the audio watermark attack, which consists of two stages: (1) modifying the

original audio to bypass the defense and achieve a successful attack, and (2) improving the quality

of the perturbed audio, while maintaining that the decoded message probabilities remain within an

acceptable threshold.

4.2.2: Attack Preparation: Estimate Defender’s Distribution

The goal of the adversaries is to estimate the mean uest and the standard deviation σest of the

decoded message probabilities. These parameters are categorized into two groups corresponding

to the types of watermark attacks: watermark replacement and creation, and watermark removal.

16

OutputEstimateEstimated DistributionDeployClean AudioWatermarkModelWatermarked AudioDeployGenerateMessage ProbabilitiesWatermark Decodera. Distribution Estimation for Message Probabilities in Watermark Replacement and Creationb. Distribution Estimation for Message Probabilities in Watermark RemovalWatermark DecoderOutputEstimateDeployEstimated DistributionMessage ProbabilitiesClean AudioAdaptive AttackObtainObtainFigure 4.3: The design of AWM generator. The audio watermark attack step (left) ensures the
success of the watermark attack, while the audio quality optimization step (right) focuses on
improving audio quality.

The overall process is illustrated in Figure 4.2.

Parameter Estimation for Watermark Replacement and Creation. The attackers select several

clean audio samples sc from a small dataset. These samples are then used to query the watermark

model Enc(·) and generate new watermarked audio samples sw. The watermark decoder Dec(·) is

deployed to extract message probabilities, which are subsequently used to estimate the parameters

of the normal distribution using either Bayesian inference [29] or maximum likelihood estimation.

Since the distribution of watermarked message probabilities is bimodal (as shown in 3.3b), the

attacker can obtain two distributions (-1’s decoded as bit 0 on the left and 1’s decoded as bit 1 on

the right). The estimated mean and standard deviation are as follows:

est, σ0
µ0

est = T 0(Dec(Enc(sc))),

µ1
est, σ1

est = T 1(Dec(Enc(sc))),

(4.2)

where µest and σest are the mean and standard deviation, respectively, with the superscript

identifying the associated target distribution for each parameter. µ0

est and σ0

est represent the

estimated mean and standard deviation of decoded message probabilities predicted as 0, and µ1
est

and σ1

est correspond to those predicted as 1. T (·) is the method used to estimate the parameters of

normal distribution, which depends on the prior knowledge of the attackers and the volume of data.

Commonly used techniques are based on Bayesian inference or maximum likelihood estimation.

17

Watermark DecoderPerturbationDeployMessage ProbabilitiesOutputEstimated DetectorNoUpdateYesWatermark DecoderDeployMessage ProbabilitiesExtended Estimated DetectorPerturbed AudioSuccessful Perturbed Audio (+opt)Watermark Attack (AWM)Optimization (AWM +opt)Attack goal achieved & Attack not detectedAWM perturbationOriginal AudioMessage LossSuccessful Perturbed AudioOutputPerturbed Audio (+opt)Original AudioSignal LossSpec LossUpdateNoAttack goal achieved & Attack not detectedYesT 0(·) and T 1(·) denote the estimated method for estimating the 0’s and 1’s distribution.

Parameter Estimation for Watermark Removal. The attackers estimate the distribution directly

using clean audio samples. They use the watermark decoder to output message probabilities, which

are used for distribution estimation. Since the estimated distribution is unimodal (as shown in 3.3a),

the parameters are defined as follows:

est, σc
µc

est = T c(Dec(sc)),

(4.3)

where µc

est and σc

est are the estimated mean and standard deviation of the clean message

probabilities, and T c represents the estimation approach applied to these values.

4.2.3: Audio Watermark Attack (AWM)

With the estimated defender’s distribution, the attacker’s next step is to add a subtle adversarial

perturbation to the original audio s, intentionally deceiving the decoder to produce incorrect binary

messages while bypassing the detection strategy. Because attackers possess the watermark decoder,

they can compute and adjust gradient directions to align with their attack goals. Figure 4.3-left

illustrates the watermark attack step. The original audio is clean audio in watermark creation attack,

and it can be watermarked audio in watermark replacement or watermark removal attack. At the

beginning, the original audio adds the perturbation. The perturbation is initialized by a fraction

of the original audio signal. Next, we add the perturbation to original audio to form a perturbed

audio. The perturbed audio is fed into a watermark decoder, where the attacker receives a message

probability and queries the estimated detector. If the attack is not detected and the attack goal

is achieved, the attacker obtains the successful perturbed audio. Otherwise, the attacker further

optimizes the perturbation by message loss. The message loss Lmsg modifies the perturbed message

probabilities to match the target message probabilities ptarget:

Lmsg = ||Dec(satt) − ptarget)||2
2.

(4.4)

18

Let satt denote the perturbed audio, this loss enforces the decoded message close to the target

message. Different from the previous attack, our loss optimization step follows a strict bit-

to-bit optimization design (detailed in Algorithm 1). This algorithm uses the estimated detector

knowledge to ensure that the perturbed audio exhibits a distribution similar to that of benign

watermarked audio, with high confidence.

It is worth noting that our algorithm constrains the

message probabilities within the normal value range. Through iterative updates and perturbation

optimization, the perturbed message probabilities are gradually optimized to fall within this range

[µest − σest, µest + σest], ensuring that they are classified as non-outliers, thereby getting the

updated adversarial perturbation. Finally, the perturbed audio is generated by adding the updated

perturbation to the original audio. Our findings suggest that if the perturbed audio’s message

probability falls into the interval of [µest − σest, µest + σest], it would improve the attack success

rate.

Besides the message loss, we also formulate the signal loss and mel loss to minimize the quality

degradation from the attack. Specifically, the signal loss controls the audio quality at the signal

level:

Lsignal =

1
n

n∑

i=1

|satt − s|.

The Mel-Spectrogram loss Lmel maintains the audio quality at the Mel-Spectrogram level:

Lmel = ||M el(satt) − M el(s)||2
2.

The total loss in the attack step is:

(4.5)

(4.6)

L = λ1Lsignal + λ2Lmel + λ3Lmsg + λ4Lother,

(4.7)

where the Lother depends on the specific watermarking method. For example, AudioSeal [36]

includes a localization loss used for indicating the probability of the audio being watermarked. In

the audio watermark attack process, the parameter λmsg is assigned a relatively high value.

19

4.2.4: Audio Quality Optimization (AWM +opt)

Since audio watermark attacks prioritize message loss optimization, audio quality may be adversely

impacted. Therefore, this step aims to improve audio quality while maintaining a successful attack.

To achieve this, we make three adaptions. (1) Extend the Estimated Detector Range; (2) Update

the Attack Goal; and (3) Modify the Optimization Loss.

Figure 4.3-right illustrates the step. First, attackers obtain the perturbation from the early step

as initial perturbation. Then, the attacker uses the watermark decoder to extract the corresponding

message probabilities. These message probabilities are input into the estimated distribution, with

the acceptable range extended to [µest − 2σest, µest + 2σest]. (1) This is a critical step where

the attacker makes a trade-off between audio quality and attack success rate. In previous

AWM setting, the attacker starts from strict constrain to enforce the attack success, however, this

enforcement limits the flexibility of perturbation, making it hard to retain the audio quality as well.

In our design, we expand the maximum allowable range for message probability optimization

to reserve more space for perturbation optimization, resulting in a balanced attack mode which

considers both quality and attack success rate.

Besides extending the acceptable range, we also (2) update the attack goal by enforcing the

optimization go through fixed number of optimization epochs. This ensures that the perturbation

is fully optimized, instead of ending up with a boundary case. At the end, we (3) modify the

optimization loss by reformulating the Mel-Spectrogram loss into a standard spectrogram loss and

apply a softmax function to the spectrogram. Inspired from Audioseal, the softmax adaption can

better keep the loudness and perceptual similarity between two audios. The resulting softmax-based

spectrogram loss, denoted as Lspec, is defined as:

Lspec =

1
n

n∑

i=1

|Sof tmax(Satt) − Sof tmax(Ss)|

(4.8)

20

Where Satt and Ss are the spectrogram of the attacked audio and original audio.

Lopt = λ1Lsignal + λ2Lspec + λ3Lmsg + λ4Lother,

(4.9)

In the audio quality optimization (+opt) process, the parameters λ1 and λ2 are assigned a relatively

high value.

4.3: Adaptive Attack in Replacement, Creation, and Removal

4.3.1: Adaptive Attack in Watermark Replacement

During the watermark attack, we prioritize refining the decoded message probabilities that fall

outside the estimated normal range. Algorithm 1 outlines the adaptive attack process used in

watermark replacement. For example, consider a watermark replacement scenario involving six

binary message bits. Suppose the original watermarked audio contains the bits “101100”, which

are modified to “111000” after the attack. We define a list, msgdif f , which contains the indices

of binary message bits differing between the original watermarked audio sw and the perturbed

watermarked audio ˆsw

4. In this case, msgdif f = [1, 3]. For the indices not included in msgdif f , we

assign the ground truth watermark message probabilities pw to the corresponding target watermark

message probabilities ptarget. That is, for indices [0, 2, 4, 5], the ptarget is equal to the pw.

This optimization has two advantages: (1) It directs the gradient to focus more on the indices

where the binary message bits require modification. In watermark replacement, only the differing

bits need to be changed, so bits that already align with the target message do not require further

optimization. (2) It ensures that the message probabilities in the non-change bits remain the same,

and within the acceptable distribution. Additionally, in certain audio watermarking methods,

message probabilities do not directly map to binary message values. For the AudioSeal as shown in

Figure 3.4b, probabilities corresponding to binary 1 typically lie in the range of 0.7–0.8, and those

4This notation specify for watermark replacement scenario, the general representation of attacked audio is satt,
it can represent both attacked clean audio (for watermark creation) or attacked watermarked audio (for watermark
replacement and removal)

21

Algorithm 1: Adaptive Attack in Watermark Replacement

Input: Watermarked audio sw, watermark message probabilities pw, perturbed watermark

probabilities ˆpw, target message probabilities ptarget, watermark decoder Dec(·),
thresholds τ 0

inf i, scale factor r, list of changed indices msgdif f

inf i, τ 1

sup, τ 1

sup, τ 0

Output: Perturbed watermarked audio ˆsw

1 δ ← sw × r ;
2 pw ← Dec(sw) ;
3 ˆsw ← sw + δ ;
4 for index ← 1 to len(ptarget) − 1 do
if index /∈ msgdif f then
5

ptarget[index] ← pw[index] ;

7 for i ← 1 to Iter do
8

δ ← Attack(sw, ˆsw, Dec( ˆsw), ptarget, δ)
ˆsw ← sw + δ ;
ˆpw ← Dec( ˆsw) ;
if acc == 1 then

if τ 1

foreach index ∈ msgdif f do
inf i < ˆpw[index] < τ 1
sup then
ptarget[index] ← ˆpw[index] ;
Remove index from msgdif f ;
inf i < ˆpw[index] < τ 0
sup then
ptarget[index] ← ˆpw[index] ;
Remove index from msgdif f ;

if τ 0

6

9

10

11

12

13

14

15

16

17

18

19

20

if meet the estimated detection Detection( ˆpw) then

return ˆsw ;

// Optimize δ;

21 return Failed ;

for binary 0 fall between 0.2–0.3. Therefore, it is not appropriate to optimize message probabilities

directly to 1 or 0. Instead, they should be adjusted to fall within the typical range of 0.7–0.8 for a

binary message bit of 1, and 0.2–0.3 for a bit of 0. This is why we set ptarget equal to pw for indices

not included in msgdif f . After that, we optimize the perturbation δ by Function Attack(·) (Line

8) to generate the perturbed watermarked audio ˆsw. The Attack(·) function calculates the loss

in Equation 4.7 and use the gradient to update the perturbation. The initial perturbation is scaled

based on the original watermarked audio sw to ensure that the decoded message probabilities of

the perturbed audio ˆsw closely resemble those of the original. In the attack justification step, we

define the supremum and infimum thresholds for the watermark decoder outputs corresponding to

22

Algorithm 2: Estimated Detection by Attackers

Input: Perturbed watermark probabilities ˆpw, copy of list of different message

probabilities msgcopy
dif f

Output: True (successful attack) or False (failed attack)

1 cnt ← len(msgcopy
2 if acc == 1 then
3

dif f ) ;

foreach index ∈ msgcopy
− σ1

dif f do
est) < ˆpw[index] < (µ1

if (µ1
est
cnt ← cnt − 1 ;

est + σ1

est) then

− σ0

if (µ0
est
cnt ← cnt − 1 ;

est) < ˆpw[index] < (µ0

est + σ0

est) then

if cnt == 0 then
return True ;

// Successful attack

4

5

6

7

8

9

binary message bits. Specifically, τ 0

sup and τ 0

inf i represent the thresholds for bit 0, while τ 1

sup and

τ 1
inf i correspond to bit 1:

(µ0
est

(µ1
est

− σ0

est) ≤ τ 0

inf i < τ 0
sup

− σ1

est) ≤ τ 1

inf i < τ 1
sup

≤ (µ0

est + σ0

est),

(4.10)

≤ (µ1

est + σ1

est).

Attackers can choose appropriate thresholds based on the desired range of message probabilities. If

the distance between the supremum and infimum thresholds is too small, the message probability

may fall outside the estimated range, making it difficult to optimize. When optimizing bit 1 of

the original binary message to the target bit 0, a greater distance between the supremum threshold

sup and µ0
τ 0

est + σ0

est may require more optimization iterations. Similarly, when modifying bit 0 to

the target bit 1, a greater distance between the infimum threshold τ 1

inf and µ1
est

− σ1

est may also

necessitate additional iterations.

Once the message probability of the perturbed watermarked audio at a given index falls within

the specified threshold range, it is assigned to the target message probability ptarget, and the

index is removed from the list msgdif f . Optimization then proceeds with the remaining message

probabilities in the list. To ensure that all perturbed message probabilities remain within the

estimated normal range, the attacker simulates the defender’s role by performing outlier detection.

23

Algorithm 3: Adaptive Attack in Watermark Creation

Input: Clean audio sc, clean message probabilities pc, perturbed watermark probabilities
ˆpw, target message probabilities ptarget, watermark decoder Dec, threshold bounds
sup, τ 0
τ 0
inf i, scale factor r, list of different message probabilities msgdif f

inf i, τ 1

sup, τ 1

Output: Perturbed watermarked audio ˆsw

// Optimize δ;

1 δ ← sc × r;
2 pc ← Dec(sc);
3 ˆsw ← sc + δ;
4 for i ← 1 to Iter do
5

δ ← Attack(sc, ˆsw, Dec( ˆsw), ptarget, δ)
ˆsw ← sc + δ;
ˆpw ← Dec( ˆsw);
if acc == 1 then

if τ 1

foreach index ∈ msgdif f do
inf i < ˆpw[index] < τ 1
sup then
ptarget[index] ← ˆpw[index];
Remove msgdif f [index];
inf i < ˆpw[index] < τ 0
sup then
ptarget[index] ← ˆpw[index];
Remove msgdif f [index];

if τ 0

6

7

8

9

10

11

12

13

14

15

16

17

if meet the estimated detection Detection( ˆpw) then

return ˆsw;

18 return Failed;

Since the watermark attack process uses the interval [µest − σest, µest + σest], this same range

is applied for detecting outliers. Algorithm 2 illustrates the simulated detection process. After

obtaining the perturbed watermark probabilities ˆpw, the attackers must determine whether all

probabilities fall within the estimated normal range by acting in the role of defenders.

In the

watermark attack step, we recommend defining this normal range as [µest − σest, µest + σest].

Additionally, in Algorithm 1, since the indices in the list of differing message probabilities,

msgdif f , will be progressively removed, the final msgdif f will eventually be empty. To preserve

the original reference, we define a copy of this list, denoted as msgcopy

dif f , which initially mirrors

msgdif f . We then check whether each index in ˆpw falls within the normal range. If all indices

satisfy this condition, the attack is considered successful, and the algorithm returns True to

Algorithm 1; otherwise, the perturbation δ must continue to be optimized.

24

Algorithm 4: Adaptive Attack in Watermark Removal

Input: Watermarked audio sw, watermark message probabilities pw, perturbed clean
probabilities ˆpc, target message probabilities ptarget, watermark decoder Dec,
threshold bounds τ c
msgdif f

inf i, scale factor r, list of different message probabilities

sup, τ c

Output: Perturbed clean audio ˆsc

1 δ ← sw × r;
2 pw ← Dec(sw);
3 ˆsc ← sw + δ;
4 cnt ← len(msgdif f );
5 for i ← 1 to Iter do
6

δ ← Attack(sw, ˆsc, Dec( ˆsc), ptarget, δ)
ˆsc ← sw + δ;
ˆpc ← Dec( ˆsc);
if acc ≤ th then

foreach index ∈ msgdif f do
inf i < ˆpc[index] < τ c
cnt ← cnt − 1;

if τ c

sup then

if cnt == 0 then
return ˆsc;

7

8

9

10

11

12

13

14

// Optimize δ;

15 return Failed;

4.3.2: Adaptive Attack in Watermark Creation

For watermark creation, the adaptive attack process is the same as that of watermark replacement.

The difference is that, in watermark creation, msgdif f includes all message indices, which is equal

to the full binary message length. Algorithm 3 describes the adaptive attack process for watermark

creation. The length of the list msgdif f is equal to the length of the binary message. For example,

the length of the binary message is 3, the msgdif f is [0,1,2]. The process of estimated detection is

shown in Algorithm 2.

4.3.3: Adaptive Attack in Watermark Removal

Watermark removal is an untargeted attack, it is not necessary to achieve an accuracy of exactly

0, any value below 1 is sufficient. Typically, using a lower accuracy threshold allows for more

iteration steps to further optimize the perturbation δ. We recommend using an accuracy threshold

25

around 0.5 [19]. Algorithm 4 introduces the adaptive attack process for watermark removal. In this

attack, we introduce a threshold th. If the accuracy acc falls below this threshold th, the attack is

considered successful. We change the optimization object to µc

est and σc

est. The target probability

vector ptarget is constructed such that each element is equal to the predefined threshold θ used for

converting probabilities into binary message values. For example, in AudioSeal, the θ = 0.5; in

Timbre, θ = 0. The threshold supremum and infimum of the decoder messages are τ c

sup and τ c

inf i:

(µc

est

− δest) ≤ τ c

inf i < µc

est < τ c

sup

≤ (µc

est + δest)

(4.11)

In addition, the definition of the list msgdif f is the same as that in the watermark creation, the

difference is that we do not need to remove the indices of the msgdif f .

26

CHAPTER 5: EVALUATION

5.1: Experiment Setup

Datasets. We use three public datasets for our experiments.

The first dataset

is the

LibriSpeech [32], released by OpenSLR. We select the small-sized subset, which has 6.3G

audios, and covers 100.6 hours of audio data spoken by 251 speakers. The second dataset is

obtained from AudioMarkData [25], which is built based on the Common Voice dataset [3].

It contains 20,000 audio samples, each with a duration of 5 seconds. The third dataset is

GigaSpeech [8], which includes audio from audiobooks, podcasts, and YouTube. We use the XS

subset, which contains a total of 10 hours of audio samples.

Audio Watermarking Methods. We select two state-of-the-art audio watermarking methods for

our experiments: Timbre [24] and AudioSeal [36]. These methods achieve watermark embedding

while maintaining perceptual audio quality at a high level. We fix the binary message length to 16

bits for all experiments.

Evaluation Metrics. To evaluate the performance of our watermark, we use the following

metrics. First, we introduce the Detection Success Rate (DSR), which measures the ability to

identify outliers in the decoded message probabilities. In this experiment, the defender defines

the acceptable value range using the 3-sigma rule; any message probability falling outside this

range is considered an outlier. A high DSR indicates a stronger defense capability, reflecting

effective detection of perturbed audio based on probability distributions. Second, we use the

False Acceptance Rate (FAR), which evaluates the rate that unattacked audio is mistakenly

classified as attacked. Ideally, the FAR should remain relatively low to ensure reliable detection.

In watermark replacement and creation, the defender estimates the distribution using watermarked

audio samples. The FAR, in this case, represents the proportion of unattacked watermarked

audio samples that are incorrectly classified as attacked, evaluated across the entire unattacked

27

watermarked dataset. In watermark removal, the defender estimates the distribution using clean

audio samples. The FAR reflects the rate at which unattacked clean audio samples are incorrectly

classified as attacked, evaluated over the entire set of unattacked clean audio. Third, we employ

the Attack Success Rate (ASR) to measure how successfully the adaptive attackers perform

the watermark attack, which also corresponds to the watermark decoder’s accuracy at the binary

message level. The high ASR indicates that the attacker is able to successfully alter the message,

either to the targeted message in watermark replacement and creation or to an untargeted message

in watermark removal.

For the audio quality metrics, we use the Signal-to-noise ratio (SNR) and ViSQOL [16]. SNR

measures the quality of the original watermarked audio or perturbed audio by comparing the level

of added perturbation (noise) to that of the original clean audio. A higher SNR indicates better

audio quality. ViSQOL evaluates audio quality through a simulation of human hearing perception.

The score range is from 1 (the worst) to 5 (the best). The higher ViSQOL indicates that the audio

has higher quality. In the experiment, we use clean audio samples as a baseline.

Audio Watermark Attack Methods. We compare our attack method with the AudioMarkBench [25].

AudioMarkBench is a benchmark designed to evaluate the robustness of audio watermarking

against watermark replacement, creation, and removal attacks. Our method consists of two attack

variants: Ours (AWM) (Section 4.2.3), which includes only the watermark attack step, and Ours

(+opt) (AWM +opt) (Section 4.2.4), which adds an additional audio quality optimization step.

5.2: Detection Result

In our experiments, we measure our defense method with AudioMarkBench [25] attacks.

Specifically, we assume the attacker uses the default AudioMarkBench attack, as well as five

alternative perturbations on the perturbed audio: (1) low-pass filtering (LP), (2) amplitude scaling

(AS), (3) Gaussian noise (GN), (4) MP3 compression (MP3), and (5) high-pass filtering (HP).

The attacker’s goal is to use the perturbation to alternate our defense success rate. Note that after

applying the no-box perturbations, some attack may fail. Therefore, we only consider defending

28

Table 5.1: Detection performance across different datasets, watermark methods, and attack
methods. LP: Low-pass Filter. AS: Amplitude Scaling. GN: Gaussian Noise. MP3: MP3
Compression. HP: High-pass Filter. ’-’ indicates that no successfully perturbed audio is available.

Attack Type Watermark Method

Attack Method

Watermark
Replacement

Watermark
Creation

Watermark
Removal

AudioSeal

Timbre

AudioSeal

Timbre

AudioSeal

Timbre

AudioMarkBench
AudioMarkBench (+LP)
AudioMarkBench (+AS)
AudioMarkBench (+GN)
AudioMarkBench (+MP3)
AudioMarkBench (+HP)
Ours
Ours (+opt)
AudioMarkBench
AudioMarkBench (+LP)
AudioMarkBench (+AS)
AudioMarkBench (+GN)
AudioMarkBench (+MP3)
AudioMarkBench (+HP)
Ours
Ours (+opt)
AudioMarkBench
AudioMarkBench (+LP)
AudioMarkBench (+AS)
AudioMarkBench (+GN)
AudioMarkBench (+MP3)
AudioMarkBench (+HP)
Ours
Ours (+opt)
AudioMarkBench
AudioMarkBench (+LP)
AudioMarkBench (+AS)
AudioMarkBench (+GN)
AudioMarkBench (+MP3)
AudioMarkBench (+HP)
Ours
Ours (+opt)
AudioMarkBench
AudioMarkBench (+LP)
AudioMarkBench (+AS)
AudioMarkBench (+GN)
AudioMarkBench (+MP3)
AudioMarkBench (+HP)
Ours
Ours (+opt)
AudioMarkBench
AudioMarkBench (+LP)
AudioMarkBench (+AS)
AudioMarkBench (+GN)
AudioMarkBench (+MP3)
AudioMarkBench (+HP)
Ours
Ours (+opt)

Gigaspeech

Librispeech

Audiomark
DSR (%) FAR (%) F1 (%) DSR (%) FAR (%) F1 (%) DSR (%) FAR (%) F1 (%)
97.24
5.50
94.19
4.82
97.53
5.23
98.85
10.81
96.93
7.23
78.26
5.36
17.80
5.50
27.24
5.50
96.15
6.50
100.00
9.09
96.82
4.48
-
0.00
-
0.00
97.83
3.85
11.62
6.50
13.79
6.50
97.24
5.50
95.96
2.82
97.18
3.97
93.33
0.00
94.74
4.76
98.11
2.94
0.62
5.50
20.40
5.50
96.15
6.50
97.30
5.31
96.80
6.40
100.00
33.33
97.87
3.70
97.46
5.71
0.00
6.50
0.00
6.50
97.07
3.50
98.03
2.75
98.02
2.20
97.97
2.37
97.58
2.21
97.10
2.82
0.00
3.50
0.00
3.50
96.00
6.00
96.62
6.00
96.62
6.00
88.06
7.29
95.66
6.00
96.62
6.00
0.00
6.00
0.00
6.00

100.00
96.05
99.61
98.11
96.93
66.86
10.33
16.67
100.00
100.00
100.00
-
-
100.00
6.67
8.00
100.00
100.00
100.00
100.00
100.00
100.00
0.33
12.00
100.00
100.00
100.00
100.00
100.00
100.00
0.00
0.00
99.33
99.28
99.63
99.25
98.50
98.17
0.00
0.00
100.00
100.00
100.00
84.67
98.67
100.00
0.00
0.00

96.79
95.75
96.97
98.03
94.85
90.82
6.40
9.75
97.08
100.00
98.96
-
100.00
99.25
2.22
2.95
97.94
97.61
97.79
97.46
97.35
99.36
1.45
2.89
98.87
99.49
99.08
100.00
100.00
98.97
0.00
0.00
93.46
93.62
92.80
91.85
78.89
78.72
0.00
0.00
97.27
97.22
97.22
81.54
97.22
97.22
0.00
0.00

97.71
93.94
97.96
98.03
97.18
86.41
3.44
5.34
100.00
100.00
100.00
-
100.00
100.00
1.15
1.53
100.00
100.00
100.00
100.00
100.00
100.00
0.76
1.53
100.00
100.00
100.00
100.00
100.00
100.00
0.00
0.00
92.75
91.41
91.54
89.53
83.53
81.96
0.00
0.00
100.00
100.00
100.00
84.13
100.00
100.00
0.00
0.00

100.00
93.98
100.00
100.00
100.00
96.43
7.50
8.00
100.00
100.00
100.00
100.00
100.00
100.00
7.00
6.50
100.00
100.00
100.00
100.00
100.00
100.00
0.50
2.00
100.00
100.00
100.00
100.00
100.00
100.00
0.00
0.00
100.00
100.00
100.00
100.00
100.00
100.00
0.00
0.00
100.00
100.00
100.00
74.17
100.00
100.00
0.00
0.00

97.32
89.66
97.45
94.87
96.51
95.58
13.28
14.10
96.85
95.65
97.81
100.00
100.00
98.11
12.33
11.50
97.32
98.61
98.05
100.00
97.67
98.55
0.94
3.68
96.85
97.41
96.90
85.71
98.18
97.22
0.00
0.00
98.28
98.65
98.91
98.83
98.91
98.51
0.00
0.00
97.09
97.09
97.09
81.75
97.09
97.09
0.00
0.00

5.67
8.45
5.85
7.14
3.40
4.76
5.67
5.67
8.00
0.00
6.94
-
-
5.73
8.00
8.00
5.67
9.00
6.76
14.89
11.76
4.00
5.67
5.67
8.00
6.09
7.77
0.00
4.55
6.03
8.00
8.00
5.33
4.11
4.67
4.31
4.33
5.12
5.33
5.33
8.33
8.57
8.57
9.29
8.57
8.57
8.33
8.33

4.20
2.27
4.08
1.97
4.20
3.88
4.20
4.20
2.29
0.00
2.11
-
0.00
1.51
2.29
2.29
4.20
4.90
4.50
5.22
5.45
1.28
4.20
4.20
2.29
1.03
1.85
0.00
0.00
2.08
2.29
2.29
5.73
5.86
5.77
5.43
5.88
5.88
5.73
5.73
5.73
5.73
5.73
5.56
5.73
5.73
5.73
5.73

29

the successful attack samples.

Table 5.1 shows the detection results. From the results, the FAR is maintained within an

acceptable range, with most values around 5%. Given that the 2-sigma range covers approximately

95.45% of the data, we consider this FAR to be reasonable.

Across all three attack types and eight attack methods, our method demonstrates superior

performance compared to the baseline. Additionally, the DSR for our watermark attack method

(Ours) is lower than that of our audio-quality-improvement optimization, Ours (+opt).

In the

watermark replacement and creation, most DSR values are below 10%. The best performance

is observed in watermark removal, where none of the perturbed audio samples are detected by the

defenders. These findings suggest that watermark removal is the most effective attack strategy,

while watermark replacement presents the greatest challenge among the three evaluated attack

types.

5.3: Distribution Analysis

We randomly select some audio samples from the Librispeech dataset, and use the AudioMarkBench

and our attack method to generate the distribution of message probabilities. Figure 5.1 illustrates

the normal distributions for both the AudioSeal and Timbre.

In AudioSeal, the predefined

threshold θ for converting probabilities into binary messsage values is 0.5, and in Timbre, it is

0. The results from AudioMarkBench show that many message probabilities cluster around the

predefined threshold, and the overall distribution tends to exhibit a unimodal shape. Therefore,

the detection method is capable of identifying that the audio has been attacked. For our method,

the resulting normal distribution is bimodal, with a shape similar to that shown in Figure 3.3b. As

a result, our attack method successfully bypasses the detection approach.

5.4: Perturbation Visualization

To analyze the attack visually, we generate spectrograms of the audio samples. Figure 5.2

presents the visualization using dB-scaled spectrograms of the watermark creation attack. In the

30

(a) Message probabilities distribution (AudioSeal).

(b) Message probabilities distribution (Timbre).

Figure 5.1: Message probabilities distribution comparisons between AudioMarkBench and Ours
for the watermark creation.

spectrograms produced by AudioMarkBench and Ours default attack method, we can clearly

observe some noticeable noise (horizontal lines), which is highlighted with red boxes. In Ours

(+opt) approach, since the encoder and decoder are jointly trained during watermark training,

this process helps minimize noticeable noise. We think that when watermark attacks target only

the decoder, some noise may not be optimized well. Additionally, the visibility of the noise

is influenced by the specific watermark model used.

In the Timbre watermarking method, the

watermark is embedded by the encoder, and some noticeable noise (horizontal lines) can also

be observed in the spectrogram, which is highlighted with the blue box. Comparing the attack

methods Ours and Ours (+opt), we observe that the audio quality improves and some of the

noticeable noise is effectively reduced through optimization. Besides, Ours (+opt) is more visually

similar to AudioMarkBench (green boxes).

31

0.20.30.40.50.60.70.8Message Probabilities0.00.51.01.52.02.53.03.5DensityAudioSeal Distribution (AudioMarkBench)0.20.30.40.50.60.70.8Message Probabilities024681012DensityAudioSeal Distribution (Ours)0.750.500.250.000.250.500.751.00Message Probabilities0.00.20.40.60.81.01.21.41.6DensityTimbre Distribution (AudioMarkBench)1.00.50.00.51.0Message Probabilities012345678DensityTimbre Distribution (Ours)Figure 5.2: The spectrograms of the watermark creation in AudioSeal and Timbre.

(a) SNR comparison in AudioSeal and Timbre.

(b) ViSQOL comparison in AudioSeal and Timbre.

Figure 5.3: Audio quality comparisons in AudioSeal and Timbre.

5.5: Audio Quality

We evaluate audio quality using SNR and ViSQOL. The results are shown in Figure 5.3. The

watermarked audio (shown in blue) serves as the baseline for comparison with the attack methods.

Based on the SNR results, we observe that the audio quality in watermark replacement and

watermark removal is nearly identical, whereas watermark creation shows a significant difference.

In the watermark creation, a similar trend is seen in the ViSQOL scores, where AudioMarkBench

achieves a score close to 5.0 in Timbre. As shown in our subsequent experiments in Figure 5.4b,

although AudioMarkBench successfully alters the watermark binary message, the attack lacks

robustness. The specificity of the watermark creation attack is to alter the clean binary messages

to the targeted attack binary messages. When the attack prioritizes audio quality, the features of

32

 2 U L J L Q D O : D W H U P D U N $ X G L R 0 D U N % H Q F K 2 X U V 2 X U V    R S W  2 U L J L Q D O : D W H U P D U N $ X G L R 0 D U N % H Q F K 2 X U V 2 X U V    R S W  $ X G L R 6 H D O 7 L P E U HReplacementCreationRemoval25.027.530.032.535.037.540.042.5SNRAudioSealReplacementCreationRemoval253035404550556065SNRTimbrewatermarkAudioMarkBenchoursours(+opt)ReplacementCreationRemoval4.34.44.54.64.74.84.9ViSQOLAudioSealReplacementCreationRemoval4.44.54.64.74.84.95.0ViSQOLTimbrewatermarkAudioMarkBenchoursours(+opt)(a) Robustness of Our Attack (Watermark Replacement).

(b) Robustness of Our Attack (Watermark Creation).

Figure 5.4: ASR after applying different no-box perturbations on the watermark replacement and
creation attacks. The watermark method is AudioSeal. (a) evaluates the ASR results of
watermark replacement against the no-box perturbations, it uses the Gigaspeech dataset for
validation; (b) evaluates the ASR results of watermark creation against the no-box perturbations,
it uses the Librispeech dataset for validation.

the perturbed audio closely resemble those of clean audio, which also means that the watermark

features are weak. As a result, applying a no-box perturbation may increase the likelihood of

removing these weakened watermark features. Therefore, balancing audio quality and attack

robustness is an important consideration. Our subsequent experiments for the watermark creation

attack demonstrate that our audio quality is relatively lower, but the robustness of our attack is

higher.

In the watermark replacement and watermark removal, the audio quality of our attack is

comparable to that of AudioMarkBench. The attack method of Ours has the worst audio quality.

However, after applying the optimization step, the audio quality significantly improves audio

quality, becoming nearly equivalent to that of AudioMarkBench.

33

 $ X G L R 6 H D O 7 L P E U H           $ 6 5     / R Z  3 D V V  ) L O W H U $ X G L R 6 H D O 7 L P E U H $ P S O L W X G H  6 F D O L Q J $ X G L R 6 H D O 7 L P E U H * D X V V D L Q  1 R L V H $ X G L R 6 H D O 7 L P E U H 0 3   & R P S U H V V L R Q $ X G L R 6 H D O 7 L P E U H + L J K  3 D V V  ) L O W H U $ X G L R 0 D U N % H Q F K 2 X U V 2 X U V    R S W  $ X G L R 6 H D O 7 L P E U H           $ 6 5     / R Z  3 D V V  ) L O W H U $ X G L R 6 H D O 7 L P E U H $ P S O L W X G H  6 F D O L Q J $ X G L R 6 H D O 7 L P E U H * D X V V D L Q  1 R L V H $ X G L R 6 H D O 7 L P E U H 0 3   & R P S U H V V L R Q $ X G L R 6 H D O 7 L P E U H + L J K  3 D V V  ) L O W H U $ X G L R 0 D U N % H Q F K 2 X U V 2 X U V    R S W 5.6: Robustness of AWM

Robustness of the attack in watermarking refers to whether the watermark remains detectable in

the audio after no-box perturbations have been applied to the watermarked audio. Watermark

replacement and creation attacks, which are targeted attacks that aim to modify all binary message

bits to the specific target binary message bits, require more sophisticated and robust consideration.

Therefore, we add no-box perturbations to the perturbed watermarked audio and further observe

if the forged watermark can be detected. Figure 5.4 evaluates the robustness of watermark

replacement and creation attacks against no-box perturbations. The results show that our attacks

can better defend against the no-box perturbations, especially with most ASR scores in the

watermark creation scenario achieving around 100% scores. For comparing the ASR between

Ours and Ours (+opt), we find that our attack without optimization overall demonstrates higher

robustness.

5.7: Further Analysis for Watermark Creation

The watermark creation attack is to transform clean audio into watermarked audio. In Sections 5.5

and 5.6, we provide some analysis related to the watermark creation attack. In this section, we

further analyze the watermark creation for the AudioSeal. In the AudioSeal, they propose a score

to indicate the probability of the audio being watermarked. This score evaluates each audio frame

and determines if the frame is watermarked or not. The final output is the ratio of watermarked

frames to the total number of frames. The score ranges from 0 to 1: 1 indicates that all frames are

watermarked and 0 indicates that all frames are clean. For watermarked audio, a score closer to 1

indicates a higher likelihood of containing a watermark.

Table 5.2 shows the results. We observe that the AudioMarkBench has low scores, which means

that the watermark creation attack is not successful. In our attack method, we enhance the joint

optimization of both this score loss and the message loss. We optimize the message probabilities

to fall within the estimated normal range while increasing the score toward 1. Through the joint

34

Table 5.2: AudioSeal watermark score comparison of watermark creation across different
datasets.

Attack Type

Watermark Creation

Attack Method Librispeech Audiomark Gigaspeech
GT Watermark
AudioMarkbench
Ours
Ours (+opt)

1.0000
0.2042
0.9516
0.9412

0.9998
0.1951
0.9827
0.9670

1.0000
0.2688
0.9780
0.9598

optimization, we improve the score to above 95% in Ours method and above 94% in Ours (+opt)

method.

5.8: Summary

In this chapter, we present a comprehensive evaluation of the proposed AWM across three datasets

and two state-of-the-art watermarking methods. Our experiments demonstrate that AWM achieves

high attack success rates while maintaining low detection success rates, effectively bypassing

existing detection strategies based on statistical message probability distributions. In particular,

AWM achieves detection success rates under 10% for watermark replacement and creation, and

0% for watermark removal, without compromising perceptual audio quality.

Distributional analysis and spectrogram visualizations show that AWM successfully aligns

perturbed message probabilities with benign distributions, making the attack difficult to detect.

Compared with AWM without audio quality optimization, the optimized attack (AWM +opt) further

improves audio quality while preserving attack effectiveness. We further evaluate the robustness of

AWM under various no-box perturbations such as low-pass filtering, amplitude scaling, Gaussian

noise, MP3 compression, and high-pass filtering. The attack remains highly effective and largely

undetectable under these perturbations, which demonstrates the strong robustness of AWM.

35

CHAPTER 6: DISCUSSION

6.1: Limitation

To further evaluate whether the defender can detect our attack, we apply the no-box perturbations

to our proposed attack methods. Figure 6.1 illustrates the results. We observe an increase in the

DSR scores after applying the no-box perturbations. Among the no-box perturbations, the low-pass

and high-pass filters have higher DSR scores, which indicates that our attack is more susceptible

to band-pass filter-based perturbations. For the watermark attack types, the watermark removal

attack can bypass the defense to some extent, the DSR scores show a slight increase. In contrast,

some DSR scores for watermark replacement and creation attacks increase significantly. In the

watermark replacement and creation attacks, although our attacks achieve higher ASR scores, the

defenders can still detect that the audio has been attacked by applying the no-box perturbations.

The results suggest that watermark replacement and creation attacks are more complex and require

greater attention in designing attacks against no-box perturbations.

6.2: Message Probabilities Output Analysis

We demonstrate the effectiveness of the detection mechanism, which uses the message probabilities

to find the outliers. The defenders can use the no-box perturbations on the perturbed audio.

Although attackers can alter the watermark binary message and achieve a high ASR score, some

probabilities may have changed to fall outside of the normal ranges. For example, suppose

the binary messages of watermarked audio are 0000001110101100, and the binary messages of

perturbed audio are 1111111100000000. This occurs because we add an attack perturbation to

perturb the original perturbation and then form a new perturbation. This process may (1) influence

the robustness of the original perturbation, and (2) after applying the no-box perturbations, some

modified message probabilities move more towards the direction of the pre-attack (original) audio.

36

Figure 6.1: DSR of our attack methods against the no-box attacks in the AudioSeal watermark
method.

Below is a watermarked audio example of the AudioSeal, the threshold is 0.5. If the message

probability is larger than 0.5, the binary message is 1; otherwise, the binary message is 0. The

watermarked audio sample is applied to three perturbation attacks. For the binary message at

position 12, the original binary message is 0 and the perturbed binary message is 0, but applying

the AWM +(LP) perturbation, the message probability is 0.3919, which is an outlier (12th value in

each probability vector). We think this occurs because of (1). For the binary message at position

4, the original binary message is 0 and the perturbed binary message is 1, but applying the AWM

+(LP) attack, the message probability is 0.5012, which is an outlier (4th value in each probability

vector). We think this occurs because of (2).

Original watermarked audio message probabilities (pre-attacked):

[0.2350, 0.2333, 0.2050, 0.2165, 0.1962, 0.2471, 0.7921, 0.7392, 0.7768, 0.2211, 0.7336, 0.2676,

0.7756, 0.7263, 0.2677, 0.2164]

Original watermarked audio message probabilities using low-pass filter (attacked):

[0.2321, 0.2365, 0.2110, 0.2250, 0.1999, 0.2386, 0.7952, 0.7387, 0.7612, 0.2289, 0.7253, 0.2661,

0.7827, 0.7363, 0.2637, 0.2199]

Perturbed audio message probabilities (attacked):

[0.7948, 0.7513, 0.7943, 0.7820, 0.7833, 0.7722, 0.7617, 0.7311, 0.2522, 0.2327, 0.2092, blue,

0.2165, 0.2424, 0.2731, 0.2206]

Perturbed audio message probabilities using low-pass filter (attacked):

[0.7805, 0.7546, 0.7882, 0.5012, 0.7702, 0.7505, 0.7748, 0.7214, 0.2474, 0.2467, 0.2139, 0.3919,

0.2304, 0.2280, 0.2989, 0.2552]

37

 5 H S O D F H P H Q W & U H D W L R Q 5 H P R Y D O           ' 6 5     2 X U V 2 X U V    / 3  5 H S O D F H P H Q W & U H D W L R Q 5 H P R Y D O 2 X U V 2 X U V    $ 6  5 H S O D F H P H Q W & U H D W L R Q 5 H P R Y D O 2 X U V 2 X U V    * 1  5 H S O D F H P H Q W & U H D W L R Q 5 H P R Y D O 2 X U V 2 X U V    0 3   5 H S O D F H P H Q W & U H D W L R Q 5 H P R Y D O 2 X U V 2 X U V    + 3 CHAPTER 7: CONCLUSION

In the master’s thesis, we propose a watermark attack detection approach, and a corresponding

adaptive attack framework. By observing the distribution differences between attacked and benign

audio samples, we design a detection strategy that effectively distinguishes existing attack methods.

To counter this defense, we introduce AWM, an adaptive audio watermark attack that dynamically

evades the detection mechanism. The approach uses a two-step optimization process: the first

step enhances attack robustness while ensuring the message probabilities remain within the normal

range; the second step focuses on improving audio quality, which strikes a balance between audio

quality and attack effectiveness. Experimental results demonstrate that the proposed detection

strategy reliably identifies prior attacks, while AWM achieves superior performance with a higher

attack success rate (ASR) and lower detection success rate (DSR).

In the future, we will continue to explore more advanced attack and defense strategies for audio

watermarking. First, to perform an effective audio watermark attack, it is essential to balance attack

strength and audio quality. A too strong attack can degrade audio quality, while prioritizing audio

quality too much may result in a weaker, less robust attack. When optimizing audio quality, the ideal

approach should align with the principles of the human auditory system [37], which enables the

noise to remain imperceptible to human listeners. Second, for the defender’s detection strategy, we

explore a statistical-based method using message probabilities. We observe that the perturbed audio

contains some noticeable noise that affects the audio distribution. Therefore, defenders may be able

to detect such attacks by directly analyzing the audio signal itself. Third, we provide extensible

insights based on the observed distribution of message probabilities. For attackers, we find that

simply changing the binary message bits to a specific target might not lead to a successful attack.

We hope our findings will serve as a foundation for future advancements in audio watermarking

methods, as well as in the development of more robust attack and defense strategies.

38

BIBLIOGRAPHY

[1] Darius Afchar, Gabriel Meseguer-Brocal, and Romain Hennequin. Detecting music deepfakes

is easy but actually hard. arXiv preprint arXiv:2405.04181, 2024.

[2] Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion, and Matthias Hein. Square
In European

attack: a query-efficient black-box adversarial attack via random search.
conference on computer vision, pages 484–501. Springer, 2020.

[3] Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer,
Reuben Morais, Lindsay Saunders, Francis M Tyers, and Gregor Weber. Common voice: A
massively-multilingual speech corpus. arXiv preprint arXiv:1912.06670, 2019.

[4] Paraskevi Bassia, Ioannis Pitas, and Nikos Nikolaidis. Robust audio watermarking in the time

domain. IEEE Transactions on multimedia, 3(2):232–241, 2001.

[5] Ane Blázquez-García, Angel Conde, Usue Mori, and Jose A Lozano. A review on
outlier/anomaly detection in time series data. ACM computing surveys (CSUR), 54(3):1–33,
2021.

[6] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks.

In 2017 ieee symposium on security and privacy (sp), pages 39–57. Ieee, 2017.

[7] Guangyu Chen, Yu Wu, Shujie Liu, Tao Liu, Xiaoyong Du, and Furu Wei. Wavmark:

Watermarking for audio generation. arXiv preprint arXiv:2308.12770, 2023.

[8] Guoguo Chen, Shuzhou Chai, Guanbo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan
Su, Daniel Povey, Jan Trmal, Junbo Zhang, et al. Gigaspeech: An evolving, multi-domain
asr corpus with 10,000 hours of transcribed audio. arXiv preprint arXiv:2106.06909, 2021.

[9]

Jianbo Chen, Michael I Jordan, and Martin J Wainwright. Hopskipjumpattack: A query-
efficient decision-based attack. In 2020 ieee symposium on security and privacy (sp), pages
1277–1294. IEEE, 2020.

[10] Minhao Cheng, Simranjit Singh, Patrick Chen, Pin-Yu Chen, Sijia Liu, and Cho-Jui Hsieh.
Sign-opt: A query-efficient hard-label adversarial attack. arXiv preprint arXiv:1909.10773,
2019.

[11] Ju-chieh Chou, Cheng-chieh Yeh, and Hung-yi Lee. One-shot voice conversion by
separating speaker and content representations with instance normalization. arXiv preprint
arXiv:1904.05742, 2019.

[12] Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. High fidelity neural audio

compression. arXiv preprint arXiv:2210.13438, 2022.

[13] Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, and Huaming Wang. Clap
learning audio concepts from natural language supervision.
In ICASSP 2023-2023 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–
5. IEEE, 2023.

39

[14] Hanqing Guo, Junfeng Guo, Bocheng Chen, Yuanda Wang, Xun Chen, Heng Huang, Qiben
Yan, and Li Xiao. Audio watermark: Dynamic and harmless watermark for black-box voice
dataset copyright protection.

[15] Anna Yoo Jeong Ha, Josephine Passananti, Ronik Bhaskar, Shawn Shan, Reid Southen,
Haitao Zheng, and Ben Y Zhao. Organic or diffused: Can we distinguish human art from
ai-generated images? In Proceedings of the 2024 on ACM SIGSAC Conference on Computer
and Communications Security, pages 4822–4836, 2024.

[16] Andrew Hines, Jan Skoglund, Anil C Kokaram, and Naomi Harte. Visqol: an objective speech
quality model. EURASIP Journal on Audio, Speech, and Music Processing, 2015:1–18, 2015.

[17] Guang Hua, Jonathan Goh, and Vrizlynn LL Thing.

Time-spread echo-based audio
watermarking with optimized imperceptibility and robustness. IEEE/ACM Transactions on
Audio, Speech, and Language Processing, 23(2):227–239, 2015.

[18] Guang Hua, Jiwu Huang, Yun Q Shi, Jonathan Goh, and Vrizlynn LL Thing. Twenty years
of digital audio watermarking—a comprehensive review. Signal processing, 128:222–242,
2016.

[19] Zhengyuan Jiang, Jinghuai Zhang, and Neil Zhenqiang Gong. Evading watermark based
detection of ai-generated content. In Proceedings of the 2023 ACM SIGSAC Conference on
Computer and Communications Security, pages 1168–1181, 2023.

[20] Xiangui Kang, Rui Yang, and Jiwu Huang. Geometric invariant audio watermarking based

on an lcm feature. IEEE Transactions on Multimedia, 13(2):181–190, 2010.

[21] Wen-Nung Lie and Li-Chun Chang.

watermarking based on low-frequency amplitude modification.
multimedia, 8(1):46–59, 2006.

Robust and high-quality time-domain audio
IEEE transactions on

[22] Yist Y Lin, Chung-Ming Chien, Jheng-Hao Lin, Hung-yi Lee, and Lin-shan Lee. Fragmentvc:
Any-to-any voice conversion by end-to-end extracting and fusing fine-grained voice
fragments with attention. In ICASSP 2021-2021 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pages 5939–5943. IEEE, 2021.

[23] Chang Liu, Jie Zhang, Han Fang, Zehua Ma, Weiming Zhang, and Nenghai Yu. Dear: A
deep-learning-based audio re-recording resilient watermarking. In Proceedings of the AAAI
Conference on Artificial Intelligence, volume 37, pages 13201–13209, 2023.

[24] Chang Liu, Jie Zhang, Tianwei Zhang, Xi Yang, Weiming Zhang, and Nenghai Yu. Detecting
voice cloning attacks via timbre watermarking. arXiv preprint arXiv:2312.03410, 2023.

[25] Hongbin Liu, Moyang Guo, Zhengyuan Jiang, Lun Wang, and Neil Gong. Audiomarkbench:
Advances in Neural Information

Benchmarking robustness of audio watermarking.
Processing Systems, 37:52241–52265, 2024.

40

[26] Nils Lukas, Edward Jiang, Xinda Li, and Florian Kerschbaum. Sok: How robust is image
classification deep neural network watermarking? In 2022 IEEE Symposium on Security and
Privacy (SP), pages 787–804. IEEE, 2022.

[27] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian
arXiv preprint

Vladu. Towards deep learning models resistant to adversarial attacks.
arXiv:1706.06083, 2017.

[28] McAfee. Beware the artificial impostor. a mcafee cybersecurity artificial intelligence report.,

May 2023.

[29] Kevin P Murphy. Conjugate bayesian analysis of the gaussian distribution. def, 1(2σ2):16,

2007.

[30] Minzhou Pan, Zhenting Wang, Xin Dong, Vikash Sehwag, Lingjuan Lyu, and Xue Lin.
Finding needles in a haystack: A black-box approach to invisible watermark detection. In
European Conference on Computer Vision, pages 253–270. Springer, 2024.

[31] Minzhou Pan, Yi Zeng, Lingjuan Lyu, Xue Lin, and Ruoxi Jia. {ASSET}: Robust backdoor
data detection across a multiplicity of deep learning paradigms. In 32nd USENIX Security
Symposium (USENIX Security 23), pages 2725–2742, 2023.

[32] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an
asr corpus based on public domain audio books. In 2015 IEEE international conference on
acoustics, speech and signal processing (ICASSP), pages 5206–5210. IEEE, 2015.

[33] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya
In International

Sutskever. Robust speech recognition via large-scale weak supervision.
conference on machine learning, pages 28492–28518. PMLR, 2023.

[34] Jie Ren, Han Xu, Pengfei He, Yingqian Cui, Shenglai Zeng, Jiankun Zhang, Hongzhi Wen,
Jiayuan Ding, Pei Huang, Lingjuan Lyu, et al. Copyright protection in generative ai: A
technical perspective. arXiv preprint arXiv:2402.02333, 2024.

[35] Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. Fastspeech
2: Fast and high-quality end-to-end text to speech. arXiv preprint arXiv:2006.04558, 2020.

[36] Robin San Roman, Pierre Fernandez, Alexandre Défossez, Teddy Furon, Tuan Tran, and Hady
Elsahar. Proactive detection of voice cloning with localized watermarking. arXiv preprint
arXiv:2401.17264, 2024.

[37] Jan Schnupp, Israel Nelken, and Andrew King. Auditory neuroscience: Making sense of

sound. MIT press, 2011.

[38] Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang,
Zhifeng Chen, Yu Zhang, Yuxuan Wang, Rj Skerrv-Ryan, et al. Natural tts synthesis by
conditioning wavenet on mel spectrogram predictions. In 2018 IEEE international conference
on acoustics, speech and signal processing (ICASSP), pages 4779–4783. IEEE, 2018.

41

[39] Xiang-Yang Wang and Hong Zhao. A novel synchronization invariant audio watermarking
scheme based on dwt and dct. IEEE Transactions on signal processing, 54(12):4835–4840,
2006.

[40] Zhendong Wang, Jianmin Bao, Wengang Zhou, Weilun Wang, Hezhen Hu, Hong Chen, and
Houqiang Li. Dire for diffusion-generated image detection. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, pages 22445–22455, 2023.

[41] Yizhu Wen, Ashwin Innuganti, Aaron Bien Ramos, Hanqing Guo, and Qiben Yan. Sok: How
robust is audio watermarking in generative ai models? arXiv preprint arXiv:2503.19176,
2025.

[42] Haixu Wu, Tengge Hu, Yong Liu, Hang Zhou, Jianmin Wang, and Mingsheng Long.
Timesnet: Temporal 2d-variation modeling for general time series analysis. arXiv preprint
arXiv:2210.02186, 2022.

[43] Yong Xiang, Iynkaran Natgunanathan, Song Guo, Wanlei Zhou, and Saeid Nahavandi.
Patchwork-based audio watermarking method robust
to de-synchronization attacks.
IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(9):1413–1423,
2014.

[44] Pei Yang, Hai Ci, Yiren Song, and Mike Zheng Shou. Can simple averaging defeat modern
watermarks? Advances in Neural Information Processing Systems, 37:56644–56673, 2024.

[45] Yongyi Zang, You Zhang, Mojtaba Heydari, and Zhiyao Duan. Singfake: Singing voice
In ICASSP 2024-2024 IEEE International Conference on Acoustics,

deepfake detection.
Speech and Signal Processing (ICASSP), pages 12156–12160. IEEE, 2024.

[46] Juan Zhao, Tianrui Zong, Yong Xiang, Longxiang Gao, Wanlei Zhou, and Gleb Beliakov.
Desynchronization attacks resilient watermarking method based on frequency singular
value coefficient modification. IEEE/ACM Transactions on Audio, Speech, and Language
Processing, 29:2282–2295, 2021.

[47] Junzuo Zhou, Jiangyan Yi, Yong Ren, Jianhua Tao, Tao Wang, and Chu Yuan Zhang.
Wmcodec: End-to-end neural speech codec with deep watermarking for authenticity
verification. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP), pages 1–5. IEEE, 2025.

42