PROACTIVE SCHEMES: ADVERSARIAL ATTACKS FOR SOCIAL GOOD

By

Vishal Asnani

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science—Doctor of Philosophy

2025

ABSTRACT

Adversarial attacks in computer vision typically exploit vulnerabilities in deep learning models, generating deceptive inputs that can lead AI systems to incorrect decisions. However, proactive schemes, approaches designed to embed purposeful signals into visual data, can serve as "adversarial attacks for social good," harnessing similar principles to enhance the robustness, security, and interpretability of AI systems. This research explores the application of proactive schemes in computer vision, diverging from conventional passive methods by embedding auxiliary signals known as "templates" into input data, fundamentally improving model performance, attribution capabilities, and detection accuracy across diverse tasks. This includes novel techniques for image manipulation detection and localization, which introduce learned templates to accurately identify and pinpoint alterations made by multiple, previously unseen Generative Models (GMs). The Manipulation Localization Proactive scheme (MaLP), for example, not only detects but also localizes specific pixel changes caused by manipulations, showing resilient performance across a broad range of GMs. Extending this approach, the Proactive Object Detection (PrObeD) scheme utilizes encoder-decoder architectures to embed task-specific templates within images, enhancing the efficacy of object detectors, even under challenging conditions like camouflaged environments.

This research further expands proactive schemes into generative models and video analysis, enabling attribution and action detection solutions. ProMark, for instance, introduces a novel attribution framework by embedding imperceptible watermarks within training data, allowing generated images to be traced back to specific training concepts—such as objects, motifs, or styles—while preserving image quality. Building on ProMark, CustomMark offers selective and efficient concept attribution, allowing artists to opt into watermarking specific styles and easily add new styles over time, without the need to retrain the entire model. Inspired by the proactive structure of PrObeD for 2D object detection, PiVoT introduces a video-based proactive wrapper that enhances action recognition and spatio-temporal action detection. By integrating action-specific templates through a template-enhanced Low-Rank Adaptation (LoRA) framework, PiVoT seamlessly augments various action detectors, preserving computational efficiency while significantly boosting detection performance. Lastly, the thesis presents a model parsing framework that estimates "fingerprints" for generative models, extracting unique characteristics from generated images to predict the architecture and loss functions of the underlying networks—a particularly valuable tool for deepfake detection and model attribution.

Collectively, these proactive schemes offer significant advancements over passive methods, establishing robust, accurate, and generalizable solutions for diverse computer vision challenges. By addressing key limitations that conventional passive approaches impose on different vision applications, this research lays the groundwork for a future where proactive frameworks can improve AI-driven applications.

Copyright by
VISHAL ASNANI
2025

This thesis is dedicated to my Father and Mother.
Thank you for always being there for me and believing in me.

ACKNOWLEDGMENTS

This PhD journey has been an incredible and transformative experience, one that would not have been possible without the support of many individuals. First and foremost, I extend my deepest gratitude to my advisor, Dr. Xiaoming Liu, for his mentorship, guidance, and patience throughout my PhD. His support and belief in me, especially during times when I doubted myself, have been invaluable. He took a chance on me and continuously pushed me to do better at every step, ensuring I stayed on track even when things got difficult. His insights and encouragement have helped shape my research and given me the confidence to take on challenging projects. Without his support, I would not have achieved the progress I have made in my PhD.

Among those who shaped my PhD, Dr. Xi Yin holds a special place. She entered my life when I was struggling to find direction, becoming not just a brilliant mentor but a guiding force. Beyond research, she taught me how to navigate the PhD journey itself, offering both intellectual guidance and emotional support during critical moments. Involved in most of my projects, she pushed me toward excellence while ensuring I never felt lost. Her patience, kindness, and belief in me have been invaluable, and I will always be grateful for her support.

I am also immensely grateful to my committee members, Dr. Arun Ross and Dr. Yu Kong, and to my collaborator Dr. Sijia Liu, whose expertise and thoughtful insights have been instrumental in shaping my research. Their guidance has gone far beyond formal meetings; they have continuously challenged me to think critically, refine my methodologies, and push the boundaries of my work. Their feedback has not only strengthened the technical aspects of my dissertation but has also encouraged me to explore new perspectives and approaches that I would not have considered on my own.

I never imagined that a single email with Dr. John Collomosse would shape my PhD journey, leading to two summer internships under him and Dr. Shruti Agarwal, and ultimately to my full-time position with the same team. Dr. Collomosse provided the perfect mix of guidance and independence, helping me bridge research with real-world applications. If he opened the door, Dr. Agarwal made sure I thrived. Her technical expertise, hands-on approach, and clear, practical advice kept our projects on track and made problem-solving seamless. Beyond work, she ensured our time was filled with memorable experiences, from Indian restaurant outings to movie nights and fun gatherings, making my internships truly special.

I am deeply grateful to Dr. Tal Hassner for his support, especially during a critical moment when a last-minute conflict with Meta's legal team jeopardized a key paper submission. With just a week left, he went above and beyond to push through approvals, and thanks to his and Dr. Xi Yin's efforts, we secured approval two days before the deadline, ensuring the paper's submission and eventual publication. Beyond this, both Dr. Hassner and Dr. Yin have been invaluable mentors, offering continuous guidance, feedback, and encouragement, shaping my growth as a researcher.
I also want to acknowledge my fellow members of CVLab—Andrew, Yiyang, Abhinav, Feng, Shengjie, Xiao, Zhiyuan, Girish, Jie, Zhizhong, Zhihao, Minchul, Dingqiang, Zhang, Masa, Yaojie, Garrick, Amin, Luan, and Morteza, whose stimulating discussions and unwavering support have made this journey all the more enjoyable and intellectually fulfilling. The lab has been more than just a workplace; it has been a community where I have grown both as a researcher and as a person.

Beyond academia, I owe everything to my family, who have been my pillars of strength. This PhD is dedicated to my father (Shyam Lal Asnani) and mother (Jaya Asnani), whose unconditional love, sacrifices, and encouragement have shaped the person I am today. My mother is not here to see me achieve this milestone, as we lost her to COVID, but I know she would be proud of me. Her belief in my dreams gave me the resilience to push forward, even in the hardest moments, and her absence is felt deeply in this achievement. I am equally grateful to my sisters, Neetu and Deepika Didi, and my nephew and nieces—Mannan, Dimple, Anushka, and Nyysa—for their constant support, love, and for always reminding me of the joys of life beyond research. Their presence has been a source of strength, and I carry my mother's love with me as I reach this milestone.

To my wife, Nikita, thank you for being my anchor through this journey. Your love, patience, and unwavering faith in me have been my greatest source of motivation. Our story has unfolded alongside this PhD journey—from the moment I first met you at Lansing airport to the day I proposed and eventually married you. Through the highs and lows, from late-night research struggles to the small moments of joy, you have been my greatest companion. I am beyond grateful to have you by my side, making every challenge easier and every milestone even more meaningful.

Beyond my family, my friends, my second family, have been an integral part of my PhD journey, bringing laughter and encouragement into my life. Everyone has played a role in some way, shaping this experience into something far more meaningful than just academic work. It all began with Ashish and Himanshu, my school friends who have been constants in my life, keeping me grounded no matter how much time passed. Then came my bachelor's years, where Manu, Yalaj, Mayank, Amartya, and Aman made every challenge easier with their support, whether through late-night conversations, shared struggles, or moments of pure fun. During my master's years, I found another incredible circle with Thanish, Ahamad, Saloni, Navya, Abhishek, and Snehal. Beyond academics, these years were filled with shared courses, endless hangouts, and game nights—especially during COVID, making some of my best memories despite the challenges. Then came the introduction of the Desi Boys group, where I met some of the most amazing people: Bharat, Hitesh, Abubakr, Sai, Ankit, Siddharth, and Abhiroop. Friday hangouts became a tradition, filled with game nights and what we called "restaurant exploration," which, in reality, was just revisiting the one and only chosen restaurant over and over again. These friendships made my PhD experience feel so much lighter, bringing moments of fun that balanced out the intensity of research.

Among all the people, meeting Nisha was truly special. She quickly became one of my best friends, someone I could talk to about anything and everything.
Our drives were never just about getting coffee—they turned into adventures where we'd end up two hours away at a beach, completely unplanned. She was always there, through the ups and downs, making life in EL (East Lansing) all the more exciting. Then there was Nidhi, one of the closest friends I found in East Lansing. Our drives, our shared love for coffee, and our endless conversations made for some of my best moments. From convincing Nikita about me to our many hangouts, she was always there. Nothing hit me harder than when she left EL, and her absence was deeply felt.

Through this journey, I built a bond that felt like home with Ishita and Aditya. Whether it was potluck nights, coffee hunts, or simply being there for each other, we always found time despite our packed schedules. Exploring countless coffee places together still feels like an achievement in itself. I was also fortunate to meet Konika and Mudita, whose warmth and friendship made everyday moments more enjoyable. Mudita's dad, with his kindness and wisdom, felt like family, always offering encouragement and support that meant a lot to me. Through Nikita, I got to know Gauri, Gaurav, Shruti, Raj, and Nihar, who started as her roommates but soon became close friends. From trips and hangouts to visiting them in Chicago and South Bend, every moment together strengthened our bond. Whether it was sharing meals, planning getaways, or simply catching up, their presence made my PhD years even more fulfilling.

Finally, in the later part of my PhD, I found a group of people who became close to my heart in no time—Nabasmita, Ritam, Devika, Soni, Deepak, and Ritwik. From celebrating birthdays to planning trips, this group made my last phase of PhD truly special (special shout-out to Nabasmita's therapy sessions and Devika's chai). The friendships I formed with them in such a short time feel just as deep and meaningful as the ones that have been with me for years. Each of these friendships has added something irreplaceable to my PhD journey. Beyond the research, papers, and long nights in the lab, it is these people who made this experience worthwhile.

In meeting all of these wonderful people along my journey, there was one constant that remained close to my heart: "VLHALA," my first car. She was more than just a vehicle; she was my companion through every phase of this PhD. She has seen me at my best and my worst, from moments of pure joy to times when I broke down in frustration. She was there for the late-night drives that helped clear my mind, for the spontaneous road trips that brought excitement, and for the quiet moments when I just needed to escape and reflect. I don't think I would have survived this journey without her; those drives were not just about getting from one place to another; they were my space to breathe, to think, and to keep pushing forward.

This research would not have been possible without the generous support of my PhD sponsors: Adobe, DARPA, Meta, and DEVCOM Army Research Laboratory—whose funding enabled me to pursue my work with the necessary resources and opportunities. Their support has been instrumental in allowing me to explore new ideas and contribute meaningfully to my field. Finally, I must acknowledge one of the most consistent companions throughout this PhD—coffee. While the exact origins of coffee remain a mystery, its contribution to this dissertation is undeniable.
It has powered countless late nights, early mornings, and moments of deep contemplation, ensuring that I stayed focused and driven. This journey has been filled with challenges, growth, and countless memories, and I am forever grateful to everyone who has been a part of it.

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION
CHAPTER 2 PROACTIVE IMAGE MANIPULATION DETECTION
CHAPTER 3 MALP: MANIPULATION LOCALIZATION USING A PROACTIVE SCHEME
CHAPTER 4 PROBED: PROACTIVE OBJECT DETECTION WRAPPER
CHAPTER 5 PROMARK: PROACTIVE DIFFUSION WATERMARKING FOR CAUSAL ATTRIBUTION
CHAPTER 6 CUSTOMMARK: CUSTOMIZATION OF DIFFUSION MODELS FOR PROACTIVE ATTRIBUTION
CHAPTER 7 PIVOT: PROACTIVE VIDEO TEMPLATES FOR ENHANCING VIDEO TASK PERFORMANCE
CHAPTER 8 REVERSE ENGINEERING OF GENERATIVE MODELS: INFERRING MODEL HYPERPARAMETERS FROM GENERATED IMAGES
BIBLIOGRAPHY
APPENDIX A PUBLICATIONS
APPENDIX B PROACTIVE IMAGE MANIPULATION DETECTION APPENDIX
APPENDIX C MALP APPENDIX
APPENDIX D PROBED APPENDIX
APPENDIX E PROMARK APPENDIX
APPENDIX F CUSTOMMARK APPENDIX
APPENDIX G PIVOT APPENDIX
APPENDIX H REVERSE ENGINEERING OF GENERATIVE MODELS APPENDIX

CHAPTER 1
INTRODUCTION

Traditional CV tasks have evolved significantly with the advent of deep learning models like CNNs and transformers [127, 281, 310]. These advancements have enhanced tasks such as real-time object detection, advanced image classification, vision and large language models, and facial recognition, leading to substantial improvements in accuracy and efficiency. All the methods which take the image as is for the input are treated as passive schemes [5, 6, 7, 4]. Adversarial attacks have also become more sophisticated, exploiting deep neural networks' vulnerabilities to create misleading inputs that appear normal to humans [34, 109, 206].

Adversarial attacks in computer vision underscore a significant societal problem, highlighting the vulnerabilities inherent in the deployment of machine learning technologies. The subtle manipulations used in these attacks can lead to misinterpretations by AI systems, potentially causing widespread harm in critical applications such as security surveillance, healthcare diagnostics, and autonomous transportation [138, 78]. Moreover, the exploitation of these vulnerabilities by malicious actors could undermine public trust in AI technologies, stalling progress and adoption. The challenge of adversarial attacks extends beyond technical hurdles, posing ethical, legal, and safety concerns that society must address to ensure the responsible and secure advancement of computer vision applications.
While adversarial attacks in computer vision are often viewed through the lens of their potential for harm, there exists a transformative perspective that leverages these techniques for social good [5, 6, 7]. By understanding and harnessing the principles behind adversarial perturbations, researchers have developed protective measures that enhance various computer vision applications using imperceptible signals added onto the original media, known as templates [5, 7, 6], as shown in Fig. 1.1. The methods that encrypt input data using templates, allowing the encrypted data to enhance the performance for an application, are referred to as proactive schemes. In contrast, all the methods which take the input data as is are treated as passive schemes [5, 7, 6, 4].

Figure 1.1 Passive vs. Proactive Schemes: Passive schemes take input as is for their method, while proactive schemes use templates to encrypt the input and then use the encrypted input for the particular method.

Proactive schemes have been used for a long time, using different methodologies. Previously, proactive schemes focused on simple enhancements in image processing, with applications like steganography, encryption, and security surveillance [155, 234]. Proactive schemes also share a similar idea with approaches using stochastic resonance in signal processing [95] and non-linear systems [274]. Stochastic resonance occurs when a weak signal that is too faint to be detected by a system is enhanced by the addition of noise, allowing the system to cross a detection threshold. This happens because the noise helps to push the weak signal above the threshold intermittently, making it detectable by the system. The interplay between the noise and the signal can amplify the signal's effects at certain points, leading to an overall improvement in the system's ability to process or detect the signal. The noise level is tuned to an optimal range—too little noise won't help the signal, and too much noise will overwhelm it. However, deep learning has opened up the door for utilizing stochastic resonance to improve performance via thresholding neural networks [47], noise-boosted activation functions [261], non-linear stochastic dynamics [277], the Fourier domain [253], etc. Similarly, many works inject noise into the data or labels as augmentations to improve the robustness of deep learning networks [231, 360, 358, 181]. Although the above methods resemble proactive schemes, the focus of this thesis is on the usage of these schemes for social good in the current deep learning era for a variety of applications in the realm of computer vision and natural language processing.

A general framework for proactive schemes is shown in Fig. 1.1. Each method has a specific encryption process and learning process associated with it, which depends on the application. Firstly, the encryption process is a critical component in the design of proactive schemes. This process involves the use of various innovative methods or operations to embed template information within digital media. The templates used for encryption can take the form of many different types of signals like bit sequences, 2D noises, texts, visual prompts, predefined tags, audio, etc. The templates are added onto different types of media, such as images, text, videos, audio, etc.
The goal of the encryption process is to create a secure framework that can withstand potential attacks while maintaining the quality of the encrypted media compared to the original. As technology evolves, so do the techniques used for encryption, making it an ever-growing area of research. Next, the learning process involves training models to recognize and incorporate these templates, whether they are bit sequences, 2D templates, text signals, or visual prompts, into various forms of digital content. This integration is achieved through specialized learning paradigms, e.g., encoder-decoder frameworks, learning via objective functions, adversarial learning, and specialized architectures like GANs and transformers, tailored to the unique characteristics of each template type. The effectiveness of the learning process is constrained, optimized, and evaluated using a range of objective functions and metrics. This encompasses the stage of learning objectives, which govern the efficacy of the proactive schemes for various applications. The learning objectives are heavily dependent on the application for which the method is being used.

These schemes are used for a plethora of applications, including encryption, GenAI and LLM defense, preservation of authorship rights, ownership verification, improving CV applications, and privacy protection. Based on each application, researchers have explored various combinations of the respective modules of proactive schemes, i.e., type of template, encryption process, and learning process. In this thesis we explore various applications for proactive schemes. The main innovation for proactive schemes comes in choosing the right kind of template to be added onto the media type. This step is crucial, as it guides the overall design of the different blocks of the proposed approach.

Figure 1.2 A general overview of the proactive framework. The method starts by encrypting the input data with some kind of template; this is known as the encryption process. The framework passes through some learning process, and is evaluated based on a certain decision process. Finally, every method is associated with some application.

We propose various works in this thesis that explore different application domains that benefit from the usage of proactive schemes compared to their passive counterparts. We show the effectiveness of proactive schemes across image manipulation detection, image manipulation localization, 2D generic and camouflaged object detection, concept attribution for media provenance, and action recognition.

Image manipulation detection algorithms are traditionally designed to differentiate between images altered by specific Generative Models (GMs) and authentic images, but they often struggle to generalize when encountering images manipulated by previously unseen GMs. Typically, these detection methods operate in a passive manner [69, 265, 340, 65], simply analyzing the input image as it is. In contrast, we introduce a proactive approach to image manipulation detection, which is based on the recovery of the template from encrypted real and manipulated images. The core innovation of our method lies in the estimation of templates that, when superimposed onto the original image, enhance the accuracy of detecting manipulations. Specifically, a real image protected by these templates, along with its manipulated counterpart, can be more effectively distinguished than a plain real image compared to its altered version.
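The shared recipe behind these proactive schemes (a template, an encryption step that adds it to the input, and a learning process driven by a task objective plus an imperceptibility constraint) can be made concrete with a small sketch. The snippet below is purely illustrative: the wrapper class, layer sizes, strength value, and loss weighting are assumptions for demonstration, not the architectures or hyperparameters used in this thesis.

```python
# A minimal, illustrative sketch of the generic proactive recipe: encrypt the input with a
# learnable template, then train the template jointly with a downstream task model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProactiveWrapper(nn.Module):
    def __init__(self, task_model: nn.Module, image_size: int = 128, strength: float = 0.3):
        super().__init__()
        # Learnable 2D template, broadcast over the RGB channels of the input.
        self.template = nn.Parameter(torch.randn(1, 1, image_size, image_size) * 0.01)
        self.strength = strength          # controls how visible the template is
        self.task_model = task_model      # any downstream model (classifier, detector, ...)

    def encrypt(self, x: torch.Tensor) -> torch.Tensor:
        # "Encryption" here is simply the addition of the template at a chosen strength.
        return torch.clamp(x + self.strength * self.template, 0.0, 1.0)

    def forward(self, x: torch.Tensor):
        x_enc = self.encrypt(x)
        return self.task_model(x_enc), x_enc

# Toy usage: a tiny classifier stands in for the task-specific learning process.
task_model = nn.Sequential(nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU(),
                           nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2))
wrapper = ProactiveWrapper(task_model)
opt = torch.optim.Adam(wrapper.parameters(), lr=1e-4)

images = torch.rand(4, 3, 128, 128)
labels = torch.randint(0, 2, (4,))
logits, encrypted = wrapper(images)
# Task loss plus an imperceptibility term keeping the encrypted image close to the original.
loss = F.cross_entropy(logits, labels) + 10.0 * F.mse_loss(encrypted, images)
loss.backward()
opt.step()
```

The chapters that follow specialize each of these modules; for the manipulation detection scheme introduced above, the key question is how the templates themselves are learned.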
These templates are crafted based on specific constraints designed to ensure their effectiveness. Unlike prior works, we use unsupervised learning to estimate this template set based on certain constraints. We define different loss functions to incorporate properties including small magnitude, more high-frequency content, orthogonality, and classification ability as constraints to learn the template set. In comparison, our approach differs from related proactive works [267, 356, 272, 325] in its purpose (detection vs. other tasks), template learning (learnable vs. predefined), the number of templates, and the generalization ability.

As the quality of images generated by various Generative Models (GMs) continues to improve, there is an increasing need not only to detect whether an image has been manipulated but also to pinpoint the specific pixels that have been altered. However, existing methods [194, 141, 65], often described as passive, show limited ability to generalize across unseen GMs and different types of modifications. To address this challenge, we propose a proactive manipulation localization strategy, named MaLP. In this approach, real images are encrypted with a specially learned template. If the image is later manipulated by a GM, this template not only aids in the binary detection of the manipulation but also assists in identifying the exact pixels that were modified. We design a two-branch architecture consisting of a shallow CNN network and a transformer to optimize the template during training. While the former leverages local-level features due to its shallow depth, the latter focuses on global-level features to better capture the affinity of far-apart regions. The joint training of both networks enables MaLP to learn a better template, having embedded the information of both levels. During inference, the CNN network alone is sufficient to estimate the fakeness map with a higher inference efficiency. Our results demonstrate that MaLP outperforms previous passive methods. We further validate the robustness of MaLP by testing it on 22 different GMs, establishing a new benchmark for future research in manipulation localization.

Traditional object detection research in 2D images has primarily focused on tasks such as detecting objects in both generic [260, 254, 43, 32, 117, 128] and camouflaged scenarios [82, 81, 149, 178, 120, 122, 121]. These approaches are typically considered passive, as they process the input images in their original form. However, since convergence to a global minimum is not necessarily optimal in neural networks, the resulting trained weights in object detectors may not be ideal. To address this issue, we propose a proactive wrapper scheme called PrObeD, designed to enhance the performance of existing object detectors by learning an auxiliary signal. PrObeD utilizes an encoder-decoder architecture where the encoder generates an image-specific signal, referred to as a template, which is used to encrypt the input images. The decoder is then responsible for recovering this template from the encrypted images. We posit that by learning an optimal template, the object detector's performance can be significantly improved. The template functions as a mask, emphasizing semantic features that are particularly useful for the object detector. Fine-tuning the object detector with these encrypted images results in enhanced detection performance for both generic and camouflaged objects.
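To make this wrapper idea concrete, the sketch below wires a toy encoder, decoder, and detector backbone together in the way described above. The specific layers, the use of the template as a multiplicative mask, and the recovery loss are illustrative assumptions; the actual architectures, objectives, and training schedule are those detailed in Chapter 4, not this snippet.

```python
# A hedged sketch of a PrObeD-style proactive wrapper around an off-the-shelf detector.
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU())

class TemplateEncoder(nn.Module):
    """Generates an image-specific template that acts like a soft semantic mask."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(conv_block(3, 16), conv_block(16, 16),
                                 nn.Conv2d(16, 1, 1), nn.Sigmoid())
    def forward(self, x):
        return self.net(x)

class TemplateDecoder(nn.Module):
    """Recovers the template back from the encrypted image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(conv_block(3, 16), conv_block(16, 16),
                                 nn.Conv2d(16, 1, 1), nn.Sigmoid())
    def forward(self, x_enc):
        return self.net(x_enc)

encoder, decoder = TemplateEncoder(), TemplateDecoder()
detector_backbone = conv_block(3, 8)   # stand-in for any generic/camouflaged object detector

images = torch.rand(2, 3, 256, 256)
template = encoder(images)                     # image-specific template
encrypted = images * template                  # template applied as a multiplicative mask (illustrative)
recovered = decoder(encrypted)

# The wrapper is trained so the decoder can recover the template, while the detector is
# fine-tuned on the encrypted images; the detection loss itself is task-specific.
recovery_loss = F.mse_loss(recovered, template.detach())
features = detector_backbone(encrypted)        # a downstream detection head would consume these
```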
Generative AI (GenAI) is revolutionizing creative workflows by enabling the synthesis and manipulation of images through high-level prompts. However, current systems fall short in adequately supporting creatives in receiving recognition or compensation when their content is used for training GenAI models [11, 269, 328]. To address this gap, we introduce ProMark, a causal attribution method designed to trace the origin of synthetically generated images back to specific training data concepts, such as objects, motifs, templates, artists, or styles. ProMark works by proactively embedding concept information into the input training images through imperceptible watermarks, which are then retained in the images generated by diffusion models—whether unconditional or conditional.

Building on top of ProMark, CustomMark is proposed for concept attribution, offering greater flexibility and efficiency in attribution within pre-trained generative AI models. Unlike ProMark, which requires embedding attribution markers across all training data concepts upfront, CustomMark enables selective, concept-specific watermarking, allowing artists to opt in only for specific styles or concepts without impacting the rest of the model. This approach is more scalable and computationally efficient, as it avoids the need for retraining the entire model with pre-defined attribution markers. Furthermore, CustomMark supports sequential learning, allowing the model to seamlessly add new attributions as additional styles emerge, achieving rapid customization with only a fraction of the retraining time. This means CustomMark can embed watermarks for new concepts in a streamlined way, maintaining image quality and ensuring resilience against modifications.

Using principles of proactive learning, we introduce PiVoT, a pioneering video-based proactive wrapper that enhances the functionality of video action detectors, specifically targeting Action Recognition (AR) and Spatio-Temporal Action Detection (STAD). AR and STAD are essential for interpreting dynamic scenes and human activities, and they benefit from advancements in deep learning architectures such as CNNs and Transformers. In line with proactive scheme applications, PiVoT is crafted to seamlessly integrate with existing detector architectures while minimizing training costs. It adopts a template-enhanced Low-Rank Adaptation (LoRA) strategy, leveraging a 3D U-Net to produce action-specific templates that effectively elevate detection capabilities. This targeted adaptation fine-tunes select elements of the detector, like the CNN backbone or transformer attention modules, preserving the core structure.

Further, we also discuss our work on reverse engineering the parameters of generative models. State-of-the-art (SOTA) Generative Models (GMs) have the capability to produce photo-realistic images that are nearly indistinguishable from real photographs by the human eye [158, 51, 156, 164, 29, 42, 74]. As these models become increasingly sophisticated, the need to identify and understand manipulated media is essential to address the societal concerns surrounding the potential misuse of GMs. In response to this challenge, we introduce a novel approach that involves reverse engineering GMs to infer their underlying hyperparameters based solely on the images they generate.
We define this new problem as "model parsing," which entails estimating the network architectures and training loss functions of GMs by analyzing their output images—an exceedingly difficult task for humans. To address this problem, we propose a two-component framework: a Fingerprint Estimation Network (FEN), which derives a unique GM fingerprint from a generated image by training with four specific constraints that guide the fingerprint to exhibit desirable properties; and a Parsing Network (PN), which uses the estimated fingerprints to predict the GM's network architecture and loss functions. Although this work is not strictly proactive in the sense of adding a template to the media, the fingerprint estimated for every generative model serves the role of the template. This fingerprint, left behind by every generative model, serves as an additional signal on the images that aids the task of model parsing.

1.1 Contributions of the Thesis

The thesis focuses on proactive solutions to problems in different application domains. These approaches not only outperform their passive counterparts, but they also enable better generalization in the respective fields. We discuss 7 different applications as mentioned below:

⋄ Image Manipulation Detection: This proactive approach estimates templates that, when added to real images, improve the precision of detecting manipulations by various Generative Models, offering superior generalization across multiple unseen models.

⋄ MaLP: MaLP encrypts real images with learned templates that not only aid in binary manipulation detection but also effectively localize altered pixels, demonstrating robust performance across 22 different Generative Models.

⋄ PrObeD: PrObeD introduces a proactive scheme that uses an encoder-decoder architecture to generate and embed image-specific templates, significantly enhancing object detection performance in both generic and camouflaged scenarios across various datasets.

⋄ ProMark: ProMark embeds imperceptible watermarks into training images, enabling the causal attribution of generated images to their original concepts, while maintaining high image quality and outperforming correlation-based methods.

⋄ CustomMark: CustomMark is a versatile and efficient approach for concept attribution in pre-trained generative AI models, allowing targeted and incremental watermarking of specific concepts without requiring full model retraining.

⋄ PiVoT: PiVoT is a proactive framework that boosts the accuracy of video-based action detectors by embedding action-specific templates through a LoRA-enhanced architecture, delivering consistent performance gains across multiple detectors and datasets with minimal computational costs.

⋄ Model Parsing: A framework that reverse engineers Generative Models by extracting fingerprints from generated images to predict the models' network architectures and loss functions, showing effectiveness in deepfake detection and image attribution tasks.

1.2 Dissertation Organization

We organize the remaining chapters of the dissertation as follows. Chapter 2 introduces the overall framework of the proposed proactive scheme for image manipulation detection. We propose to learn a set of templates with desired properties, achieving higher performance than a single template approach. Chapter 3 describes the proactive methodology, termed MaLP, for image manipulation localization, applicable to both face and generic images.
The framework uses a two-branch architecture capturing both local- and global-level features to learn a set of templates in an unsupervised manner. Chapter 4 proposes a novel proactive approach, PrObeD, for the object detection task. We mathematically demonstrate that, under certain assumptions, the proactive method leads to a more effectively converged model compared to the passive detector, thereby resulting in a superior object detector. Chapter 5 discusses ProMark, which performs causal attribution of synthetic images to the predefined concepts in the training images that influenced the generation. Chapter 6 discusses CustomMark, which offers flexible, efficient concept attribution in generative AI models by enabling selective, concept-specific watermarking without full model retraining. Chapter 7 introduces a proactive wrapper that enhances video action detection by integrating seamlessly with existing architectures and improving accuracy across a variety of video-based detectors and datasets. Chapter 8 discusses going beyond model classification by formulating a novel problem of model parsing for GMs, using a framework with fingerprint estimation and clustering of GMs to predict the network architecture and loss functions, given a single generated image.

CHAPTER 2
PROACTIVE IMAGE MANIPULATION DETECTION

Image manipulation detection algorithms are often trained to discriminate between images manipulated with particular Generative Models (GMs) and genuine/real images, yet generalize poorly to images manipulated with GMs unseen in the training. Conventional detection algorithms receive an input image passively. By contrast, we propose a proactive scheme for image manipulation detection. Our key enabling technique is to estimate a set of templates which, when added onto the real image, would lead to more accurate manipulation detection. That is, a template-protected real image and its manipulated version are better discriminated compared to the original real image vs. its manipulated one. These templates are estimated using certain constraints based on the desired properties of templates. For image manipulation detection, our proposed approach outperforms the prior work by an average precision of 16% for CycleGAN and 32% for GauGAN. Our approach is generalizable to a variety of GMs, showing an improvement over prior work by an average precision of 10% averaged across 12 GMs.¹

¹ Vishal Asnani, Xi Yin, Tal Hassner, Sijia Liu, and Xiaoming Liu. "Proactive image manipulation detection." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.

2.1 Introduction

It's common for people to share personal photos on social networks. Recent developments of image manipulation techniques via Generative Models (GMs) [107] result in serious concerns over the authenticity of the images. As these techniques are easily accessible [303, 194, 52, 238, 385, 53, 214], the shared images are at a greater risk for misuse after manipulation. Generation of fake images can be categorized into two types: entire image generation and partial image manipulation [325, 330]. While the former generates entirely new images by feeding a noise code to the GM, the latter involves the partial manipulation of a real image. Since the latter alters the semantics of real images, it is generally considered a greater risk, and thus partial image manipulation detection is the focus of this work.

Detecting such manipulation is an important step to alleviate societal concerns on the authenticity
of shared images. Prior works have been proposed to combat manipulated media [69]. They leverage properties that are prone to being manipulated, including mouth movement [265], steganalysis features [340], attention mechanisms [65, 197], etc. However, these methods are often overfitted to the image manipulation method and the dataset used in training, and suffer when tested on data with a different distribution.

Figure 2.1 Passive vs. proactive image manipulation detection. Classic passive schemes take an image as it is to discriminate a real image vs. its manipulated one created by a Generative Model (GM). In contrast, our proactive scheme performs encryption of the real image so that our detection module can better discriminate the encrypted real image vs. its manipulated counterpart.

Table 2.1 Comparison of our approach with prior works. The Generalizable column indicates whether the performance is reported on datasets unseen during training. [Keys: Img. man. det.: Image manipulation detection, Img. ind.: Image independent].
Method | Year | Detection scheme | Purpose | Manipulation type | Generalizable | Add perturbation | Recover perturbation | Template learning method | # of templates | Img. ind. templates
Cozzolino et al. [60] | 2018 | Passive | Img. man. det. | Entire/Partial | ✔ | ✖ | ✖ | - | - | -
Nataraj et al. [222] | 2019 | Passive | Img. man. det. | Entire/Partial | ✔ | ✖ | ✖ | - | - | -
Rossler et al. [265] | 2019 | Passive | Img. man. det. | Entire/Partial | ✖ | ✖ | ✖ | - | - | -
Zhang et al. [371] | 2019 | Passive | Img. man. det. | Partial | ✔ | ✖ | ✖ | - | - | -
Wang et al. [330] | 2020 | Passive | Img. man. det. | Entire/Partial | ✔ | ✖ | ✖ | - | - | -
Wu et al. [340] | 2020 | Passive | Img. man. det. | Entire/Partial | ✖ | ✖ | ✖ | - | - | -
Qian et al. [250] | 2020 | Passive | Img. man. det. | Entire/Partial | ✖ | ✖ | ✖ | - | - | -
Dang et al. [65] | 2020 | Passive | Img. man. det. | Partial | ✖ | ✖ | ✖ | - | - | -
Masi et al. [211] | 2020 | Passive | Img. man. det. | Partial | ✖ | ✖ | ✖ | - | - | -
Nirkin et al. [230] | 2021 | Passive | Img. man. det. | Partial | ✖ | ✖ | ✖ | - | - | -
Asnani et al. [8] | 2021 | Passive | Img. man. det. | Entire/Partial | ✔ | ✖ | ✖ | - | - | -
Segalis et al. [272] | 2020 | Proactive | Deepfake disruption | Partial | ✖ | ✔ | ✖ | Adversarial attack | 1 | ✔
Ruiz et al. [267] | 2020 | Proactive | Deepfake disruption | Partial | ✖ | ✔ | ✖ | Adversarial attack | 1 | ✔
Yeh et al. [356] | 2020 | Proactive | Deepfake disruption | Partial | ✖ | ✔ | ✖ | Adversarial attack | 1 | ✔
Wang et al. [325] | 2021 | Proactive | Deepfake tagging | Partial | ✖ | ✔ | ✔ | Fixed template | > 1 | ✖
Ours | - | Proactive | Img. man. det. | Partial | ✔ | ✔ | ✔ | Unsupervised learning | > 1 | ✔

All the aforementioned methods adopt a passive scheme since the input image, being real or manipulated, is accepted as is for detection. Alternatively, there is also a proactive scheme proposed for a few computer vision tasks, which involves adding signals to the original image. For example, prior works add a predefined template to real images which either disrupts the output of the GM [267, 356, 272] or tags images to real identities [325]. This template is either a one-hot encoding [325] or an adversarial perturbation [267, 356, 272].

Motivated by improving the generalization of manipulation detection, as well as the proactive scheme for other tasks, this paper proposes a proactive scheme for the purpose of image manipulation detection, which works as follows. When an image is captured, our algorithm adds an imperceptible signal (termed as template) to it, serving as an encryption. If this encrypted image is shared and manipulated through a GM, our algorithm accurately distinguishes between the encrypted image and its manipulated version by recovering the added template.
Ideally, this encryption process could be incorporated into the camera hardware to protect all images after being captured. In comparison, our approach differs from related proactive works [267, 356, 272, 325] in its purpose (detection vs. other tasks), template learning (learnable vs. predefined), the number of templates, and the generalization ability.

Our key enabling technique is to learn a template set, which is a non-trivial task. First, there is no ground truth template for supervision. Second, recovering the template from manipulated images is challenging. Third, using one template can be risky as the attackers may reverse engineer the template. Lastly, image editing operations such as blurring or compression could be applied to encrypted images, diminishing the efficacy of the added template. To overcome these challenges, we propose a template estimation framework to learn a set of orthogonal templates. We perform image manipulation detection based on the recovery of the template from encrypted real and manipulated images. Unlike prior works, we use unsupervised learning to estimate this template set based on certain constraints. We define different loss functions to incorporate properties including small magnitude, more high-frequency content, orthogonality, and classification ability as constraints to learn the template set. We show that our framework achieves superior manipulation detection compared to State-of-The-Art (SoTA) methods [325, 371, 60, 222]. We propose a novel evaluation protocol with 12 different GMs, where we train on images manipulated by one GM and test on unseen GMs. In summary, the contributions of this paper include:

• We propose a novel proactive scheme for image manipulation detection.

• We propose to learn a set of templates with desired properties, achieving higher performance than a single template approach.

• Our method substantially outperforms the prior works on image manipulation detection. Our method is more generalizable to different GMs showing an improvement of 10% average precision averaged across 12 GMs.

Figure 2.2 Our proposed framework includes two stages: 1) selection and addition of templates; and 2) the recovery of the estimated template from encrypted real images and manipulated images using an encoder network. The GM is used in the inference mode. Both stages are trained in an end-to-end manner to output a set of templates. For inference, the first stage is mandatory to encrypt the images. The second stage is used only when there is a need for image manipulation detection.

2.2 Related Works

Passive deepfake detection. Most deepfake detection methods are passive. Wang et al. [330] perform binary detection by exploring frequency domain patterns from images. Zhang et al. [371] propose to extract the median and high frequencies to detect the upsampling artifacts by GANs. Asnani et al. [8] propose to estimate fingerprints with certain desired properties for generative models which produce fake images. Others use autoencoders [60], hand-crafted features [222], face-context discrepancies [230], mouth and face motion [265], steganalysis features [340], xception-net [54], the frequency domain [211], and attention mechanisms [65]. These aforementioned passive deepfake detection methods suffer from generalization. We propose a novel proactive scheme for manipulation detection, aiming to improve the generalization.

Proactive schemes. Recently, some proactive methods have been proposed that add adversarial noise onto the real image. Ruiz et al.
[267] perform deepfake disruption by using adversarial attacks on image translation networks. Yeh et al. [356] disrupt deepfakes into low-quality images by performing adversarial attacks on real images. Segalis et al. [272] disrupt manipulations related to face-swapping by adding small perturbations. Wang et al. [325] propose a method to tag images by embedding messages and recovering them after manipulation. Wang et al. [325] use a one-hot encoding message instead of adversarial perturbations. Compared with these works, our method focuses on image manipulation detection rather than deepfake disruption or deepfake tagging. Our method learns a set of templates and recovers the added template for image manipulation detection. Our method also generalizes better to unseen GMs than prior works. Tab. 2.1 summarizes the comparison with prior works.

Watermarking and cryptography methods. Digital watermarking methods have been evolving from using classic image transformation techniques to deep learning techniques. Prior works have explored different ways to embed watermarks through pixel values [14] and the spatial domain [282]. Others [152, 161, 355] use frequency domains, including transformation coefficients obtained via SVD, discrete wavelet transform (DWT), discrete cosine transform (DCT), and discrete Fourier transform (DFT), to embed watermarks. Recently, deep learning techniques proposed by Zhu et al. [384], Baluja et al. [13] and Tancik et al. [297] use an encoder-decoder architecture to embed watermarks into an image. All of these methods aim to either hide sensitive information or protect the ownership of digital images. While our algorithm shares the high-level idea of image encryption, we develop a novel framework for an entirely different purpose, i.e., proactive image manipulation detection.

2.3 Proposed Approach

2.3.1 Problem Formulation

We only consider GMs which perform partial image manipulation, taking a real image as input for manipulation. Let $X_a$ be a set of real images which, when given as input to a GM $G$, would output $G(X_a)$, a set of manipulated images. Conventionally, passive image manipulation detection methods perform binary classification on $X_a$ vs. $G(X_a)$. Denoting $X = \{X_a, G(X_a)\} \in \mathbb{R}^{128 \times 128 \times 3}$ as the set of real and manipulated images, the objective function for passive detection is formulated as follows:

$$\min_{\theta} \Big\{ -\sum_{j} \Big( y_j \cdot \log\big(\mathcal{H}(X_j; \theta)\big) + (1 - y_j) \cdot \log\big(1 - \mathcal{H}(X_j; \theta)\big) \Big) \Big\}, \quad (2.1)$$

where $y$ is the class label and $\mathcal{H}$ refers to the classification network used with parameters $\theta$.

In contrast, for our proactive detection scheme, we apply a transformation $\mathcal{T}$ to a real image from set $X_a$ to formulate a set of encrypted real images represented as $\mathcal{T}(X_a)$. We perform image encryption by adding a learnable template to the image, which acts as a defender's signature. Further, the set of encrypted real images $\mathcal{T}(X_a)$ is given as input to the GM, which produces a set of manipulated images $G(\mathcal{T}(X_a))$. We propose to learn a set of templates rather than a single one to increase security, as it is difficult to reverse engineer all templates. Thus, for a real image $X^a_j \in X_a$, we define $\mathcal{T}$ via a set of $n$ orthogonal templates $\mathcal{S} = \{S_1, S_2, ..., S_n\}$, where $S_i \in \mathbb{R}^{128 \times 128}$, as follows:

$$\mathcal{T}(X^a_j) = X^a_j + S_i, \quad \text{where } i \in \{1, 2, ..., n\}. \quad (2.2)$$

After applying the transformation $\mathcal{T}$, the objective function defined in Eq. (2.1) can be re-written as:

$$\min_{\theta, S_i} \Big\{ -\sum_{j} \Big( y_j \cdot \log\big(\mathcal{H}(\mathcal{T}(X_j); \theta, S_i)\big) + (1 - y_j) \cdot \log\big(1 - \mathcal{H}(\mathcal{T}(X_j); \theta, S_i)\big) \Big) \Big\}. \quad (2.3)$$

The goal is to find $S_i$ for which corresponding images in $X_a$ and $\mathcal{T}(X_a)$ have no significant visual difference. More importantly, if $\mathcal{T}(X_a)$ is modified by any GM, this would improve the performance for image manipulation detection.
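The contrast between the two objectives can be written out in a few lines of code. The sketch below is a simplified illustration under assumed shapes and a toy classifier; it only shows where the learnable template set enters the computation, not the full training pipeline of this chapter (which also feeds encrypted images through a frozen GM to obtain manipulated samples).

```python
# A minimal PyTorch sketch of the passive objective in Eq. (2.1) versus the proactive
# objective in Eq. (2.3). The classifier H, the template set size, and the image sizes are
# illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

n_templates, H_img = 3, 128
templates = nn.Parameter(torch.randn(n_templates, H_img, H_img) * 0.01)   # learnable set S
classifier = nn.Sequential(nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU(),
                           nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 1))

def passive_loss(x, y):
    # Eq. (2.1): binary cross-entropy on the images taken as is.
    return F.binary_cross_entropy_with_logits(classifier(x).squeeze(1), y)

def encrypt(x):
    # Eq. (2.2): add a randomly selected template from the set S to each real image.
    idx = torch.randint(0, n_templates, (x.shape[0],))
    return x + templates[idx].unsqueeze(1)        # broadcast one template over the RGB channels

def proactive_loss(x, y):
    # Eq. (2.3): the same objective, but computed on template-encrypted images, so the
    # gradient also flows into the template set S_i.
    return F.binary_cross_entropy_with_logits(classifier(encrypt(x)).squeeze(1), y)

x_batch = torch.rand(4, 3, H_img, H_img)
y = torch.randint(0, 2, (4,)).float()             # 1 = encrypted real, 0 = manipulated
print(passive_loss(x_batch, y).item(), proactive_loss(x_batch, y).item())
```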
2.3.2 Proposed Framework

As shown in Fig. 2.2, our framework consists of two stages: image encryption and recovery of the template. The first stage is used for the selection and addition of templates, while the second stage involves the recovery of templates from images in $\mathcal{T}(X_a)$ and $G(\mathcal{T}(X_a))$. Both stages are trained in an end-to-end manner with GM parameters fixed. For inference, each stage is applied separately. The first stage is a mandatory step to encrypt the real images, while the second stage would only be used when image manipulation detection is needed.

Figure 2.3 Visualization of (a) a template set with the size of 3, (b) real images, (c) encrypted real images after adding a template, (d) manipulated images output by a GM, (e) recovered template from (c), and (f) recovered template from (d). Each row corresponds to image manipulation by a different GM (top: StarGAN, middle: CycleGAN, bottom: GauGAN). The template recovered from encrypted real images is more similar to the template set than the one from manipulated images. The addition of the template creates no visual difference between real and encrypted real images. We provide more examples of real images evaluated using our framework in the supplementary material.

2.3.2.1 Image Encryption

We initialize a set of $n$ templates as shown in Fig. 2.2, which is optimized during training using certain constraints. As formulated in Eq. (2.2), we randomly select and add a template from our template set to every real image. Our objective is to estimate an optimal template set from which any template is capable of protecting the real image in $X_a$.

Although we constrain the magnitude of the templates using an $L_2$ loss, the added template still degrades the quality of the real image. Therefore, when adding the template to real images, we control the strength of the added template using a hyperparameter $m$. We re-define $\mathcal{T}$ as follows:

$$\mathcal{T}(X^a_j) = X^a_j + m \times S_i, \quad \text{where } i \in \{1, 2, ..., n\}. \quad (2.4)$$

We perform an ablation study of varying $m$ in Sec. 2.4.3, and find that setting $m$ at 30% performs the best.

2.3.2.2 Recovery of Templates

To perform image manipulation detection as shown in Fig. 2.2, we attempt to recover our added template from images in $\mathcal{T}(X_a)$ using an encoder $\mathcal{E}$ with parameters $\theta_{\mathcal{E}}$. For any real image $X^a_j \in X_a$, we define the recovered template from the encrypted real image $\mathcal{T}(X^a_j)$ as $S_R = \mathcal{E}(\mathcal{T}(X^a_j))$ and from the manipulated image $G(\mathcal{T}(X^a_j))$ as $S_F = \mathcal{E}(G(\mathcal{T}(X^a_j)))$. As template selection from the template set is random, the encoder receives more training pairs to learn how to recover any template from an image, which contributes positively to the robustness of the recovery process.

We visualize our trained template set $\mathcal{S}$ and the recovered templates $S_R$ and $S_F$ in Fig. 2.3. The main intuition of our framework design is that $S_R$ should be much more similar to the added template, and vice-versa for $S_F$. Thus, to perform image manipulation detection, we calculate the cosine similarity between $S_{R/F}$ and all learned templates in the set $\mathcal{S}$ rather than merely using a classification objective. For every image, we select the maximum cosine similarity across all templates as the final score.
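The resulting detection rule, scoring an image by the maximum cosine similarity between its recovered template and the learned set, is straightforward to express. In the sketch below the encoder is a toy stand-in and the threshold is an assumed placeholder; only the scoring logic mirrors the description above.

```python
# A hedged sketch of detection by template recovery: recover a template with the encoder E
# and score the image by its maximum cosine similarity against the learned template set S.
import torch
import torch.nn as nn
import torch.nn.functional as F

n_templates, H_img = 3, 128
template_set = torch.randn(n_templates, H_img, H_img)        # learned set S (placeholder values)

encoder = nn.Sequential(                                      # E: image -> recovered template
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 1, 3, padding=1))

def detection_score(images: torch.Tensor) -> torch.Tensor:
    """Higher score -> more likely an encrypted real image; lower -> likely manipulated."""
    recovered = encoder(images).squeeze(1)                    # (B, H, W)
    rec = F.normalize(recovered.flatten(1), dim=1)            # (B, H*W)
    tmpl = F.normalize(template_set.flatten(1), dim=1)        # (n, H*W)
    cos = rec @ tmpl.t()                                      # pairwise cosine similarities
    return cos.max(dim=1).values                              # max over the template set

scores = detection_score(torch.rand(4, 3, H_img, H_img))
is_encrypted_real = scores > 0.5                              # threshold chosen per application
```

During training, this same similarity takes the place of the classifier logit, which is formalized next.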
Therefore, we update the logit scores in Eq. (2.3) by cosine similarity scores, as shown below:

$$\min_{\theta_{\mathcal{E}}, S_i} \Big\{ -\sum_{j} \Big( y_j \cdot \log\big(\max_{i=1...n} \text{Cos}(\mathcal{E}(\mathcal{T}(X_j); \theta_{\mathcal{E}}), S_i)\big) + (1 - y_j) \cdot \log\big(1 - \max_{i=1...n} \text{Cos}(\mathcal{E}(\mathcal{T}(X_j); \theta_{\mathcal{E}}), S_i)\big) \Big) \Big\}. \quad (2.5)$$

2.3.2.3 Unsupervised Training of Template Set

Since there is no ground truth for supervision, we define various constraints to guide the learning process. Let $S$ be the template selected from set $\mathcal{S}$ to be added onto a real image. We formulate five loss functions as shown below.

Magnitude loss. The real image and the encrypted image should be as similar as possible visually, as the user does not want the image quality to deteriorate after template addition. Therefore, we propose the first constraint to regularize the magnitude of the template:

$$J_m = \|S\|_2^2. \quad (2.6)$$

Recovery loss. We use an encoder network to recover the added template. Ideally, the encoder output, i.e., the recovered template $S_R$ of the encrypted real image, should be the same as the original added template $S$. Thus, we propose to maximize the cosine similarity between these two templates:

$$J_r = 1 - \text{Cos}(S, S_R). \quad (2.7)$$

Content independent template loss. Our main aim is to learn a set of universal templates which can be used for detecting manipulated images from unseen GMs. These templates, despite being trained on one dataset, can be applied to images from a different domain. Therefore, we encourage the high-frequency information in the template to be data independent. We propose a constraint to minimize low-frequency information:

$$J_c = \|\mathcal{L}(\mathbb{F}(S), k)\|_2^2, \quad (2.8)$$

where $\mathcal{L}$ is the low pass filter selecting the $k \times k$ region in the center of the 2D Fourier spectrum, while assigning the high-frequency region to zero, and $\mathbb{F}$ is the Fourier transform.

Separation loss. We want the recovered template $S_F$ from manipulated images $G(\mathcal{T}(X))$ to be different than all the templates in set $\mathcal{S}$. Thus, we optimize $S_F$ to be orthogonal to all the templates in the set $\mathcal{S}$. Therefore, we take the template for which the cosine similarity between $S_F$ and the template is maximum, and minimize its respective cosine similarity:

$$J_s = \max_{i=1...n}\big(\text{Cos}(\mathcal{N}(S_i), \mathcal{N}(S_F))\big), \quad (2.9)$$

where $\mathcal{N}(S)$ is the normalizing function defined as $\mathcal{N}(S) = (S - \min(S)) / (\max(S) - \min(S))$. Since this loss minimizes the cosine similarity to be 0, we normalize the templates before similarity calculation.

Pair-wise set distribution loss. A template set would ensure that if the attacker is somehow able to get access to some of the templates, it would still be difficult to reverse engineer other templates. Therefore, we propose a constraint to minimize the inter-template cosine similarity to promote the diversity of the templates in $\mathcal{S}$:

$$J_p = \sum_{i=1}^{n}\sum_{j=i+1}^{n} \text{Cos}(\mathcal{N}(S_i), \mathcal{N}(S_j)). \quad (2.10)$$

The overall loss function for template estimation is thus:

$$J = \lambda_1 J_m + \lambda_2 J_r + \lambda_3 J_c + \lambda_4 J_s + \lambda_5 J_p, \quad (2.11)$$

where $\lambda_1, \lambda_2, \lambda_3, \lambda_4, \lambda_5$ are the loss weights for each term.
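The five constraints map almost one-to-one onto code. The sketch below is a compact illustration of Eqs. (2.6)-(2.11) under assumed tensor shapes; the loss weights are the values listed in the implementation details, but the tensors themselves are random placeholders rather than outputs of the actual encoder.

```python
# A compact, illustrative implementation of the five template constraints and their
# weighted combination. Not the exact training code of this chapter.
import torch
import torch.nn.functional as F

def norm01(t):                         # N(S): min-max normalization used before cosine similarity
    return (t - t.min()) / (t.max() - t.min() + 1e-8)

def cos(a, b):                         # cosine similarity between two flattened templates
    return F.cosine_similarity(a.flatten(), b.flatten(), dim=0)

def template_losses(S_set, S, S_R, S_F, k=50, weights=(100, 30, 5, 0.003, 10)):
    n, H, W = S_set.shape
    J_m = (S ** 2).sum()                                            # Eq. (2.6): magnitude
    J_r = 1 - cos(S, S_R)                                           # Eq. (2.7): recovery
    spec = torch.fft.fftshift(torch.fft.fft2(S))                    # Eq. (2.8): content independence
    c = H // 2
    low = spec[c - k // 2:c + k // 2, c - k // 2:c + k // 2]        # k x k low-frequency region
    J_c = (low.abs() ** 2).sum()
    J_s = max(cos(norm01(Si), norm01(S_F)) for Si in S_set)         # Eq. (2.9): separation
    J_p = sum(cos(norm01(S_set[i]), norm01(S_set[j]))               # Eq. (2.10): pair-wise diversity
              for i in range(n) for j in range(i + 1, n))
    l1, l2, l3, l4, l5 = weights                                    # Eq. (2.11): weighted sum
    return l1 * J_m + l2 * J_r + l3 * J_c + l4 * J_s + l5 * J_p

S_set = torch.randn(3, 128, 128, requires_grad=True)                # template set S
S = S_set[0]                                                        # selected template
S_R, S_F = torch.randn(128, 128), torch.randn(128, 128)             # recovered templates (placeholders)
loss = template_losses(S_set, S, S_R, S_F)
loss.backward()
```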
2.4 Experiments

2.4.1 Settings

Experimental setup and dataset. We follow the experimental setting of Wang et al. [330], and compare with four baselines: [330], [371], [60] and [222]. For training, [330] uses 720K images from which the manipulated images are generated by ProGAN [157]. However, as our method requires a GM to perform partial manipulation, we choose STGAN [194] in training, as ProGAN synthesizes entire images. We use 24K images in CelebA-HQ [157] as the real images and pass them through STGAN to obtain manipulated images for training. For testing, we use 200 real images and pass them through unseen GMs such as StarGAN [52], GauGAN [238] and CycleGAN [385]. The real images for the testing GMs are chosen from the respective dataset they are trained on, i.e., CelebA-HQ for StarGAN, Facades [385] for CycleGAN, and COCO [30] for GauGAN.

To further evaluate the generalization ability of our approach, we use 12 additional unseen GMs that have diverse network architectures and loss functions, and are trained on different datasets. We manipulate each of the 200 real images with these 12 GMs, which gives 2,400 manipulated images. The real images are chosen from the dataset that the respective GM is trained on. The list of GMs and their training datasets are provided in the supplementary.

Implementation details. Our framework is trained end-to-end for 10 epochs via the Adam optimizer with a learning rate of $10^{-5}$ and a batch size of 4. The loss weights are set to ensure similar magnitudes at the beginning of training: $\lambda_1 = 100$, $\lambda_2 = 30$, $\lambda_3 = 5$, $\lambda_4 = 0.003$, $\lambda_5 = 10$. If not specified, we set the template set size $n = 3$. We set $k = 50$ in the content independent template loss. All experiments are conducted using one NVIDIA Tesla K80 GPU.

Evaluation metrics. We report average precision as adopted by [330]. To mimic real-world scenarios, we further report the true detection rate (TDR) at a low false alarm rate (FAR) of 0.5%.

Table 2.3 Performance comparison with Wang et al. [330]. TDR (%) at low FAR (0.5%).
Method | Train GM | CycleGAN | StarGAN | GauGAN
[330] | ProGAN | 55.98 | 93.88 | 37.14
Ours | STGAN | 88.50 | 100.00 | 43.00

Table 2.4 Average precision of 12 testing GMs when our method is trained on only STGAN. All the GMs have different architectures and are trained on diverse datasets. The average precision of almost all GMs is over 90%, showing the generalization ability of our method.
GM | [330] | Ours
UNIT [195] | 64.94 | 100
MUNIT [357] | 98.91 | 92.49
StarGAN2 [386] | 100 | 99.05
BicycleGAN [249] | 55.19 | 58.69
CONT_Enc. [245] | 92.73 | 93.10
SEAN [144] | 91.26 | 92.50
ALAE [232] | 74.13 | 89.71
Pix2Pix [333] | 57.04 | 87.30
DualGAN [240] | 98.18 | 98.75
CouncilGAN [139] | 95.33 | 100
ESRGAN [387] | 67.81 | 97.63
GANimation [53] | 100 | 100
Average | 82.97 | 92.43

Table 2.5 Performance comparison of our proposed method with Ruiz et al. [267]. The performance of our proposed method is better than [267] when the testing GM is unseen. Both methods use StarGAN as the training GM. Average precision (%).
Method | StarGAN | CycleGAN | GANimation | Pix2Pix
[267] | 100 | 51.50 | 52.43 | 49.08
Ours | 100 | 95.26 | 60.12 | 91.85

2.4.2 Image Manipulation Detection Results

Table 2.2 Performance comparison with prior works. Average precision (%).
Method | Train GM | Set size | CycleGAN | StarGAN | GauGAN
[222] | CycleGAN | - | 100 | 88.20 | 56.20
[60] | ProGAN | - | 77.20 | 91.70 | 83.30
[371] | AutoGAN | - | 100 | 100 | 61.00
[330] | ProGAN | - | 84.00 | 100 | 67.00
Ours | STGAN | 3 | 96.12 | 100 | 91.62
Ours | STGAN | 20 | 99.66 | 100 | 90.58
Ours | AutoGAN | 3 | 97.87 | 97.89 | 86.57
Ours | AutoGAN | 20 | 97.05 | 97.18 | 84.24
Ours | STGAN + AutoGAN | 3 | 100 | 100 | 99.69

As shown in Tab. 2.2, when our training GM is STGAN, we can outperform the baselines by a large margin on GauGAN-based test data, while the performance on StarGAN-based test data remains the same at 100%. When training on STGAN, our method achieves lower performance on CycleGAN. We hypothesize that it is because AutoGAN and CycleGAN share the same model architecture. To validate this, we change our training GM to AutoGAN and observe improvement when tested on CycleGAN. However, the performance drops on the other two GMs because the amount of training data is reduced (24K for STGAN and 1.5K for AutoGAN).
Increasing the number of templates can improve the performance for when trained on STGAN and test on CycleGAN, but degrades for others. The degradation is more when train on AutoGAN. It suggests that it is challenging to find a larger template set on a smaller training set. Finally, using both STGAN and AutoGAN training data can achieve the best performance. TDR at low FAR. We also evaluate using TDR at low FAR in Tab. 2.3. This is more indicative of the performance in the real world application where the number of real images are exponentially larger than manipulated images. For comparison, we evaluate the pretrained model of [330] on our test set. Our method performs consistently better for all three GMs, demonstrating the superiority of our approach. Generalization ability. To test our generalization ability, we perform extensive evaluations across a large set of GMs. We compare the performance of our method with [330] by evaluating its pretrained model on a test set of different GMs. Our framework performs quite well on almost all the GMs compared to [330] as shown in Tab. 2.4. This further demonstrates the generalization ability of our framework in the real world where an image can be manipulated by any unknown GM. Compared to [330], our framework achieves an improvement in the average precision of almost 10% averaged across all 12 GMs. Comparison with proactive scheme work. We compare our work with previous work in proactive scheme [267]. As [267] proposes to disrupt the GM’s output, they only provide the distortion results of the manipulated image. To enable binary classification, we take their adversarial real 21 Table 2.6 Performance comparison of our proposed method with steganography and adversarial attack methods. Method Type Baluja [13] PGD [207] FGSM [109] Ours Steganography Adversarial attack - Test GM Average precision (%) CycleGAN StarGAN GauGAN 88.06 98.22 98.29 100 85.64 90.28 89.21 99.95 81.26 57.71 63.81 98.23 and disrupted fake images to train a classifier with the similar network architecture as our encoder. Tab. 2.5 shows that [267] works perfectly when the testing GM is the same as the training GM. Yet if the testing GM is unseen, the performance drops substantially. Our method performs much better showing the high generalizability. Comparison with steganography works. Our method aligns with the high-level idea of digital steganographhy methods [14, 282, 355, 387, 13] which are used to hide an image onto other images. We compare our approach to the recent deep learning-based steganography method, Baluja et al. [13], with its publicly available code. We hide and retrieve the template using the pre-trained model provided by [13]. Our approach has far better average precision for each test GM compared to [13] as shown in Tab. 2.6. This validates the effectiveness of template learning and concludes that the digital steganography methods are less generalizable across unknown GMs than our approach. Comparison with benign adversarial attacks. Adversarial attacks are used to optimize a perturbation to change the class of the image. The learning of the template using our framework is similar to a benign usage of adversarial attacks. We conduct an ablation study to compare our method with common attacks such as benign PGD and FGSM. We remove the losses in Eq. (2.6), Eq. (2.8), and Eq. (2.10) responsible for learning the template and replace them with an adversarial noise constraint. 
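(Concretely, this baseline can be sketched as a PGD-style optimization of a bounded perturbation used in place of our learned template; the budget, step size, and iteration count below are assumptions rather than the exact attack settings.)

import torch
import torch.nn.functional as F

def benign_pgd_template(encoder, gm, real_batch, steps=10, eps=8/255, alpha=2/255):
    # Ablation baseline: the template is a bounded adversarial perturbation optimized only with
    # recovery/separation-style objectives (Eqs. (2.7), (2.9)); the L-infinity constraint replaces
    # the constraints of Eqs. (2.6), (2.8), and (2.10).
    S = torch.zeros(1, 1, *real_batch.shape[-2:], requires_grad=True)
    for _ in range(steps):
        enc = (real_batch + S).clamp(0, 1)                     # encrypted real images
        fake = gm(enc)                                         # manipulated by the training GM
        S_R, S_F = encoder(enc), encoder(fake)                 # recovered templates
        loss = (1 - F.cosine_similarity(S.flatten(1), S_R.flatten(1)).mean()   # recovery term
                + F.cosine_similarity(S.flatten(1), S_F.flatten(1)).mean())    # separation term
        loss.backward()
        with torch.no_grad():
            S -= alpha * S.grad.sign()                         # signed-gradient (PGD-style) step
            S.clamp_(-eps, eps)                                # project onto the L-infinity ball
            S.grad.zero_()
    return S.detach()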
Our approach has better average precision for each test GM than both adversarial attacks as shown in Tab. 2.6. We observe that adversarial noise performed similar to passive schemes offering poor generalization to unknown GMs. This shows the importance of using our proposed constraints to learn the universal template set. 22 Table 2.7 Average precision (%) with various augmentation techniques in training and testing for three GMs. We apply data augmentation to three scenarios: (1) in training only (2) in testing only and (3) in both training and testing. [Keys: aug.=augmentation, B.=blur, J.=JPEG compression, Gau. No.=Gaussian Noise]. Augmentation Augmentation Train Test type No augmentation ✖ ✖ ✔ ✖ ✖ ✔ ✔ ✔ Method [330] Ours [330] Ours [330] Ours [330] Ours [330] Ours Ours Ours Ours Test GMs CycleGAN StarGAN GauGAN 100 100 100 100 91.80 98.30 95.40 100 84.50 100 100 84.92 100 84.87 82.96 82.18 77.41 73.87 69.47 100 97.92 84.92 100 89.22 100 84.00 96.12 90.10 93.55 93.20 98.74 96.80 94.44 93.50 95.79 100 84.45 99.95 95.74 91.91 89.23 93.12 84.04 73.83 92.16 94.00 87.37 99.98 77.63 97.44 67.00 91.62 74.70 92.35 97.50 91.85 98.10 98.16 89.50 95.94 98.97 94.43 99.11 70.74 84.16 75.53 91.45 70.12 66.70 90.15 85.91 74.68 92.73 79.96 82.32 Blur JPEG B+J (0.5) B+J (0.1) Resizing Crop Gau. No. Blur JPEG B+J (0.5) Resizing Crop Gau. No. Blur JPEG B+J (0.5) Resizing Cropping Gau. No. Data augmentation. We apply various data augmentation schemes to evaluate the robustness of our method. We adopt some of the image editing techniques from Wang et al. [330], including (1) Gaussian blurring, (2) JPEG compression, (3) blur + JPEG (0.5), and (4) blur + JPEG (0.1), where 0.5 and 0.1 are the probabilities of applying these image editing operations. In addition, we add resizing, cropping, and Gaussian noise. The implementation details of these techniques are in the supplementary. These techniques are applied after addition of our template to the real images. We evaluate in three scenarios when augmentation is applied in (1) training, (2) testing, (3) both training and testing. As shown in Tab. 2.7, for the augmentation techniques adopted from [330], we outperform [330] in almost all techniques. We observe significant improvement when 23 Figure 2.4 Ablation study with varying template set sizes. The performance improves when the set size increases, while the inter-template cosine similarity also increases. blurring or JPEG compression is applied jointly but the improvement is less when they are applied separately. As for the different scenarios on when data augmentation is applied, scenario 2 performs the worst because the augmentation applied in testing has not been seen during training. Scenario 3 performs better than scenario 2 in most cases. There is a much larger performance drop when blurring and JPEG are applied together than separately. Cropping performs the worst for both Scenario 1 and 3. 2.4.3 Ablation Studies Template set size. We study the effects of the template set size. As shown in Fig. 2.4, the average precision increases as the set size is expanding from 1 and saturates around the set size 10. In the meantime, the average cosine similarity between templates within the set increases consistently, as it gets harder to find many orthogonal templates. We also test our framework’s run-time for different set sizes. On a Tesla K80 GPU, for the set size of 1, 3, 10, 20 and 50, the per-image run-time of our manipulation detection is 26.19, 27.16, 28.44, 34.26, and 43.76 ms respectively. 
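(Per-image run-times such as these can be measured with a standard CUDA-synchronized timing loop, sketched below; the warm-up and iteration counts are arbitrary choices.)

import time
import torch

@torch.no_grad()
def per_image_latency_ms(model, image, warmup=10, iters=100):
    # Average forward-pass latency on GPU; synchronization is required for correct timing.
    for _ in range(warmup):
        model(image)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(image)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) * 1000 / iters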
Thus, despite increasing the set size enhances our accuracy and security, there is a trade-off with the detection speed which is a important factor too. For comparison, we also test the pretrained model of [330] which gives a per-image run-time of 54.55 ms. Our framework is much faster even with a larger set size which is due to the shallow network in our proactive scheme compared to a deeper network in passive scheme. Template strength. We use a hyperparameter 𝑚 to control the strength of our added template. 24 Figure 2.5 Ablation with varying template strengths in the encrypted real images. The lower the template strength, the higher the PSNR is and the harder it is for our encoder to recover it, which leads to lower detection performance. Table 2.8 Ablation study to remove losses used in our training. Removing any one loss deteriorates the performance compared to our proposed method. Fixing the template or performing direct classification made the results worse. This shows the importance of a variable template and using an encoder for classification purposes. Loss removed Magnitude loss (𝐽𝑚) Pair-wise set distribution loss (𝐽 𝑝) Recovery loss (𝐽𝑟 ) Content independent template loss (𝐽𝑐) Separation loss (𝐽𝑠) 𝐽𝑚, 𝐽 𝑝 and 𝐽𝑐 (fixed template) 𝐽𝑟 and 𝐽𝑠 (removing encoder) None (ours) Test GM Average precision (%) CycleGAN StarGAN GauGAN 100 79.99 94.18 100 100 59.88 98.24 100 94.43 66.60 51.59 92.01 92.24 46.93 50.00 96.12 87.44 74.55 90.61 80.54 64.06 43.64 55.00 91.62 We ablate 𝑚 and show the results in Fig. 2.5. Intuitively, the lower the strength of the template added, the lower the detection performance since it would be harder for the encoder to recover the original template. Our results support this intuition. For all three GMs, the precision increases as we enlarge the template strength, and converges after 50% strength. We also show the PSNR between the encrypted real image and the original real image. The PSNR decreases as we enlarge the strength as expected. We choose 𝑚 = 30% for a trade-off between the detection precision and the visual quality. Loss functions. Our training process is guided by an objective function with five losses ( Eq. (2.11)). To demonstrate the necessity of each loss, we ablate by removing each loss and compare with our full model. As shown in Tab. 2.8, removing any one of the losses results in performance degradation. Specifically, removing the pair-wise set distribution loss, recovery loss 25 or separation loss causes a larger drop. To better understand the importance of the data-driven template set, we fix the template set during training, i.e., removing the three losses directly operating on the template and only considering recovery and separation losses for training. We observe a significant performance drop, which shows that the learnable template is indeed crucial for effective image manipulation detection. Finally, we remove the encoder from our framework and use a classification network with similar number of layers. Instead of recovering templates, the classification network is directly trained to perform binary image manipulation detection via cross-entropy loss. The performance drops significantly. This observation aligns with the previous works [326, 60, 371] stating that CNN networks trained on images from one GM show poor generalizability to unseen GMs. The performance drops for all three GMs but CycleGAN and GauGAN are affected the most, as the datasets are different. 
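(Returning briefly to the strength ablation above: encryption at strength m and the PSNR reported in Fig. 2.5 can be computed as in this short sketch; clamping to [0, 1] is an assumption about the image range.)

import torch

def encrypt(real, template, m=0.30):
    # Add the template at strength m; m = 30% is the operating point chosen from Fig. 2.5.
    return (real + m * template).clamp(0, 1)

def psnr(a, b, max_val=1.0):
    mse = torch.mean((a - b) ** 2)
    return 10 * torch.log10(max_val ** 2 / mse)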
For our proposed approach, when we are recovering the template, the encoder ignores all the low frequency information of the images which are data dependent. Thus, being more data (i.e., image content) independent, our encoder is able to achieve a higher generalizability. Template selection. Given a real image, we randomly select a template from the learnt template set to add to the image. Thus, every image has an equal chance of selecting any one template from the set, resulting in many combinations for the entire test set. This raises the question of finding a worst and best combination of templates for all images in the test set. To answer this, we experiment with a template set size of 50 as a large size may offer higher variation in performance. For each image in T ( 𝑿𝑎) and 𝐺 (T ( 𝑿𝑎)), we calculate the cosine similarity between added template 𝑺 and recovered template 𝑺𝑅/𝐹. For the worst/best case of every image, we select the template with the minimum/maximum difference between the real and manipulated image cosine similarities. As shown in Tab. 2.9, GauGAN gives much more variation in the performance compared to CycleGAN and StarGAN. This shows that the template selection is an important step for image manipulation detection. This brings up the idea of training a network to select the best template for a specific image, by using the best case described above as a pseudo ground truth to supervise the network. 26 Table 2.9 Ablation of template selection schemes at set size of 50. Selection scheme Test GM Average precision (%) CycleGAN StarGAN 99.90 ± 0.02 100 ± 0.00 93.56 ± 0.52 Random selection Biasing one template 99.05 ± 0.37 100 ± 0.00 91.21 ± 0.97 Network based Worst case Best case 90.47 80.55 98.23 95.46 94.85 99.95 100 100 100 GauGAN We hypothesis template selection could be important, but with experiments, the difference of performance among different templates is nearly zero and the network’s selection doesn’t help in the performance compared with selecting the template randomly as shown in Tab. 2.9. Therefore, we cannot have a pseudo ground truth to train another network for template selection. Another option for template selection is to select the same template for every test image which is equivalent to using one template compromising the security of our method. Nevertheless, we test this option to see the performance variation of biasing one template for all images. The performance variation is larger than our random selection scheme. This shows that each template has a similar contribution to image manipulation detection. 2.5 Conclusion In this paper, we propose a proactive scheme for image manipulation detection. The main objective is to estimate a set of templates, which when added to the real images improves the performance for image manipulation detection. This template set is estimated using certain constraints and any template can be added onto the image right after it is being captured by any camera. Our framework is able to achieve better image manipulation detection performance on different unseen GMs, compared to prior works. We also show the results on a diverse set of 12 additional GMs to demonstrate the generalizability of our proposed method. Limitations. First, although our work aims to protect real images in a proactive manner and can detect whether an image has been manipulated or not, it cannot perform general deepfake detection on entirely synthesized images. 
Second, we try our best to collect a diverse set of GMs to validate the generalization of our approach. However, there are many other GMs that do not have open-sourced code and therefore cannot be evaluated in our framework. Lastly, how to supervise the training of a network for template selection is still an unanswered question.
Potential societal impact. We propose a proactive scheme which uses encrypted real images and their manipulated versions to perform manipulation detection. While this offers more generalizable detection, the encrypted real images might be used for training GMs in the future, which could make the manipulated images more robust against our framework, and thus warrants more research.
CHAPTER 3 MALP: MANIPULATION LOCALIZATION USING A PROACTIVE SCHEME
Advancements in the generation quality of various Generative Models (GMs) have made it necessary to not only perform binary manipulation detection but also localize the modified pixels in an image. However, prior works for manipulation localization, termed passive, exhibit poor generalization performance over unseen GMs and attribute modifications. To combat this issue, we propose a proactive scheme for manipulation localization, termed MaLP. We encrypt the real images by adding a learned template. If the image is manipulated by any GM, this added protection from the template not only aids binary detection but also helps in identifying the pixels modified by the GM. The template is learned by leveraging local and global-level features estimated by a two-branch architecture. We show that MaLP performs better than prior passive works. We also show the generalizability of MaLP by testing on 22 different GMs, providing a benchmark for future research on manipulation localization. Finally, we show that MaLP can be used as a discriminator for improving the generation quality of GMs1.
3.1 Introduction
We witness numerous Generative Models (GMs) [107, 304, 194, 52, 238, 385, 53, 214, 159, 157, 327, 113, 264, 162] being proposed to generate realistic-looking images. These GMs can not only generate an entirely new image [159, 157], but also perform partial manipulation of an input image [53, 194, 53, 385]. The proliferation of these GMs has made it easier to manipulate personal media for malicious use. Prior methods to combat manipulated media focus on binary detection [197, 279, 97, 40, 345, 6, 265, 340, 65, 8], using mouth movement, model parsing, hand-crafted features, etc. Recent works go one step further than detection, i.e., manipulation localization, which is defined as follows: given an image partially manipulated by a GM (e.g., STGAN [194] modifying the hair color of a face image), the goal is to identify which pixels are modified by estimating a fakeness map [141].
1Vishal Asnani, Xi Yin, Tal Hassner, and Xiaoming Liu. "MaLP: Manipulation localization using a proactive scheme." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
Figure 3.1 (a) High-level idea of MaLP. We encrypt the image by adding a learnable template, which helps to estimate the fakeness map. (b) The cosine similarity (CS) between ground-truth and predicted fakeness maps for 22 unseen GMs. The performance is better for almost all GMs when using our proactive approach.
Identifying modified pixels helps to determine the severity of the fakeness in the image, and aids media forensics [141, 65].
Also, manipulation localization provides an understanding of the attacker's intent behind the modification, which may further help in identifying the attack toolchains used [80]. Recent methods for manipulation localization [180, 287, 225] focus on estimating the manipulation mask of face-swapped images. They localize modified facial attributes by leveraging attention mechanisms [65], patch-based classifiers [36], and face parsing [141]. The main drawback of these methods is that they do not generalize well to GMs unseen in training, i.e., when the test images and training images are modified by different GMs, which is likely given the vast number of existing GMs. Thus, our work aims for a localization method that generalizes to unseen GMs.
All aforementioned methods are based on a passive scheme, as the method receives an image as is for estimation. Recently, proactive methods have gained success for deepfake tasks such as detection [6], disruption [267, 356], and tagging [325]. These methods are considered proactive as they add different types of signals, known as templates, to encrypt the image before it is manipulated by a GM. This template can be a one-hot encoding [325], an adversarial perturbation [267], or a learnable noise [6], and is optimized to improve the performance of the defined tasks.
Motivated by [6], we propose a Proactive scheme for MAnipulation Localization, termed MaLP, in order to improve generalization. Specifically, MaLP learns an optimized template which, when added to real images, would improve manipulation localization, should they get manipulated. This manipulation can be done by an unseen GM trained on either in-domain or out-of-domain datasets. Furthermore, face manipulation may involve modifying facial attributes unseen in training (e.g., train on hair-color modification yet test on gender modification). MaLP incorporates three modules that focus on encryption, detection, and localization. The encryption module selects a template from the template set and adds it to the real images. These encrypted images are further processed by the localization and detection modules to perform the respective tasks.
Designing a proactive manipulation localization approach comes with several challenges. First, it is not straightforward to formulate constraints for learning the template in an unsupervised manner. Second, calculating a fakeness map at the same resolution as the input image is computationally expensive if a decision has to be made for each pixel. Prior works [36, 65] either down-sample the images or use a patch-wise approach, both of which result in inaccurate low-resolution fakeness maps. Lastly, the templates should be generalizable enough to localize modified regions from unseen GMs.
We design a two-branch architecture consisting of a shallow CNN network and a transformer to optimize the template during training. While the former leverages local-level features due to its shallow depth, the latter focuses on global-level features to better capture the affinity of far-apart regions. The joint training of both networks enables MaLP to learn a better template, embedding information from both levels. During inference, the CNN network alone is sufficient to estimate the fakeness map with higher inference efficiency. Compared to prior passive works [141, 65], MaLP improves the generalization performance on unseen GMs. We also demonstrate that MaLP can be used as a discriminator for fine-tuning conventional GMs to improve the quality of GM-generated images.
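(At a high level, one MaLP training step can be summarized by the sketch below; the handles E_C, E_T, and E_E anticipate the notation of Sec. 3.3, and the function signature, tensor layout, and gradient-handling details are illustrative assumptions.)

import torch

def malp_training_step(real_batch, templates, gm, E_C, E_T, E_E, classifier, m=0.30):
    # templates: (n, H, W) tensor of learned templates; m = 30% strength (Sec. 3.3.2.1).
    S = templates[torch.randint(len(templates), (1,))]
    enc = real_batch + m * S                      # encryption of the real images

    # Manipulation: the GM is kept frozen and used in inference mode; its weights are never updated.
    fake = gm(enc)

    # Localization: both branches predict fakeness maps during training (Sec. 3.3.2.2);
    # only the shallow CNN branch E_C is kept at inference.
    maps = {"enc": (E_C(enc), E_T(enc)), "fake": (E_C(fake), E_T(fake))}

    # Detection: recover the added template and classify the predicted fakeness maps (Sec. 3.3.2.3).
    recovered = {"enc": E_E(enc), "fake": E_E(fake)}
    logits = classifier(torch.cat([maps["enc"][0], maps["fake"][0]]))

    return maps, recovered, logits                # consumed by the losses combined in Eq. (3.11)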
In summary, we make the following contributions. • We are the first to propose a proactive scheme for image manipulation localization, applicable to both face and generic images. • Our novel two-branch architecture uses both local and global level features to learn a set 31 Table 3.1 Comparison of our approach with prior works on manipulation localization and proactive schemes. We show the generalization ability of all works across different facial attribute modifications, unseen GMs trained on datasets with the same domain (in-domain) and different domains (out-domain). [Keys: Attr.: Attributes, Imp.: Improving, L.: Localization, D.: Detection]. Work Scheme Task Template [325] [272] [267] [356] [6] [225] [287] [180] [65] [36] [141] MaLP Tag Proactive Proactive Disrupt Proactive Disrupt Proactive Disrupt D. Proactive L. + D. Passive L. + D. Passive L. + D. Passive L. + D. Passive L. + D. Passive L. + D. Passive Proactive L. + D. Fix Learn Learn Learn Learn - - - - - - Learn Attr. ✔ ✔ ✔ ✔ ✖ ✖ ✖ ✖ ✔ ✖ ✔ ✔ Generalization Imp. In-domain Out-domain GM ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✔ ✔ ✖ ✔ ✖ ✔ ✖ ✖ ✔ ✔ ✔ ✔ ✔ ✖ ✖ ✖ ✖ ✔ ✖ ✖ ✖ ✖ ✖ ✖ ✔ of templates in an unsupervised manner. The framework is guided by constraints based on template recovery, fakeness maps classification, and high cosine similarity between predicted and ground-truth fakeness maps. • MaLP can be used as a plug-and-play discriminator module to fine-tune the generative model to improve the quality of the generated images. • Our method outperforms State-of-The-Art (SoTA) methods in manipulation localization and detection. Furthermore, our method generalizes well to GMs and modified attributes unseen in training. To facilitate the research of localization, we develop a benchmark for evaluating the generalization of manipulation localization, on images where the train and test GMs are different. 3.2 Related Work Manipulation Localization. Prior works tackle manipulation localization by adopting a passive scheme. Some of them focus on forgery attacks like removal, copy-move, and splicing using multi-task learning [225]. Songsri-in et al. [287] leverage facial landmarks [54] for manipulation localization. Li et al. [180] estimate the blended boundary for forged face-swap images. [65] 32 uses an attention mechanism to leverage the relationship between pixels and [36] uses a patch- based classifier to estimate modified regions. Recently, Huang et al. [141] utilize gray-scale maps as ground truth for manipulation localization and leverage face parsing with an attention mechanism for prediction. The passive methods discussed above suffer from the generalization issue [225, 54, 287, 65, 36, 141] and estimate a low-resolution fakeness map [65] which is less accurate for the localization purpose. MaLP generalizes better to modified attributes and GMs unseen in training. Proactive Scheme. Recently, proactive schemes are developed for various tasks. Wang et al. [325] leverage the recovery of embedded one-hot encoding messages to perform deepfake tagging. A small perturbation is added onto the images by Segalis et al. [272] to disrupt the output of a GM. The same task is performed by Ruiz et al. [267] and Yeh et al. [356], both adding adversarial noise onto the input images. Asnani et al. [6] propose a framework based on adding a learnable template to input images for generalized manipulation detection. Unlike prior works, which focus on binary detection, deepfake disruption, or tagging, our work emphasizes on manipulation localization. 
We show the comparison of our approach with prior works in Tab. 3.1. Manipulation Detection. The advancement in manipulation detection keeps reaching new heights. Prior works propose to combat deepfakes by exploiting frequency domain patterns [330], up-sampling artifacts [371], model parsing [8, 353], hand-crafted features [222], lip motions [265], unified detector [70] and self-attention [65]. Recent methods use self-blended images [279], hierarchical localization features [116], real-time deviations [97], and self-supervised learning with adversarial training [40]. Finally, methods based on contrastive learning [345] and proactive scheme [6] have explicitly focused on generalized manipulation detection across unknown GMs. 3.3 Proposed Approach 3.3.1 Problem Formulation Passive Manipulation Localization Let 𝑰𝑅 be a set of real images that are manipulated by a GM 𝐺 to output the set of manipulated images 𝐺 ( 𝑰𝑅). Prior passive works perform manipulation 33 Figure 3.2 The overview of MaLP. It includes three modules: encryption, localization, and detection. We randomly select a template from the template set and add it to the real image as encryption. The GM is used in inference mode to manipulate the encrypted image. The detection module recovers the added template for binary detection. The localization module uses a two- branch architecture to estimate the fakeness map. Lastly, we apply the classifier to the fakeness map to better distinguish them from each other. Best viewed in color. localization by estimating the fakeness map 𝑴 𝑝𝑟𝑒𝑑 with the following objective: (cid:26) ∑︁ 𝑗 (cid:16)(cid:12) (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) min 𝜃E E (𝐺 ( 𝑰𝑅 𝑗 ); 𝜃 E) − 𝑴𝐺𝑇 (cid:17)(cid:27) , (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) (cid:12)2 (3.1) where E denotes the passive framework with parameters 𝜃E and 𝑴𝐺𝑇 is the ground-truth fakeness map. To represent the fakeness map, some prior methods [287, 65, 180] choose a binary map by applying a threshold on the difference between the real and manipulated images. This is undesirable as the threshold selection is highly subjective and sensitive, leading to inaccurate fakeness maps. Therefore, we adopt the continuous gray-scale map for calculating the ground-truth fakeness maps [141], formulated as: 𝑴𝐺𝑇 = 𝐺𝑟𝑎𝑦(| 𝑰𝑅 − 𝐺 ( 𝑰𝑅)|)/255, (3.2) where 𝐺𝑟𝑎𝑦(.) converts the image to gray-scale. Proactive Scheme Asnani et al. [6] define adding the template as a transformation T applied to images 𝑰𝑅, resulting in the encrypted images T ( 𝑰𝑅). The added template acts as a signature of the 34 defender and is learned during the training, aiming to improve the performance of the task at hand, e.g.detection, disruption, and tagging. Motivated by [6] that uses multiple templates, we have a 𝑗 ∈ 𝑰𝑅, set of 𝑛 orthogonal templates S = {𝑺1, 𝑺2, ...𝑺𝑛} where 𝑺𝑖 ∈ R128×128, for a real image 𝑰𝑅 transformation T is defined as: T ( 𝑰𝑅 𝑗 ; 𝑺𝑖) = 𝑰𝑅 𝑗 + 𝑺𝑖, where 𝑖 ∈ {1, 2, ..., 𝑛}. (3.3) The templates are optimized such that adding them to the real images wouldn’t result in a noticeable visual difference, yet helps manipulation localization. Proactive Manipulation Localization. Unlike the passive schemes [141, 65, 225, 180], we learn an optimal template set to help manipulation localization. For the encrypted images T ( 𝑰𝑅), we formulate the estimation of the fakeness map as: min ,𝑺𝑖 𝜃E𝑃 (cid:26) ∑︁ 𝑗 (cid:16)(cid:12) (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) E𝑃 (𝐺 (T ( 𝑰𝑅 𝑗 ; 𝑺𝑖)); 𝜃 E𝑃 ) − 𝑴𝐺𝑇 (cid:17)(cid:27) . 
(cid:12) (cid:12) (cid:12) (cid:12) (cid:12)2 (cid:12) (3.4) where E𝑃 is the proactive framework with parameters 𝜃E𝑃 . However, as the output of the GM has changed from images in set 𝐺 ( 𝑰𝑅) to images in set 𝐺 (T ( 𝑰𝑅)), in our proactive approach, the calculation of the ground-truth fakeness map shall be changed from Eq. (3.2) to the follows: 𝑴𝐺𝑇 = 𝐺𝑟𝑎𝑦(| 𝑰𝑅 − 𝐺 (T ( 𝑰𝑅))|)/255. (3.5) 3.3.2 Manipulation Localization MaLP consists of three modules: encryption, localization, and detection. The encryption module is used to encrypt the real images. The localization module estimates the fakeness map using a two-branch architecture. The detection module performs binary detection for the encrypted and manipulated images by recovering the template and using the classifier in the localization module. All three modules, as detailed next, are trained in an end-to-end manner. 3.3.2.1 Encryption Module Following the procedure in [6], we add a randomly selected learnable template from the template set to a real image. We control the strength of the added template using a hyperparameter 𝑚, which 35 Figure 3.3 Visualization of fakeness maps for faces and generic images showing generalization across unseen attribute modifications and GMs: (a) real image, (b) encrypted image, (c) manipulated image, (d) 𝑴𝐺𝑇 , (e) predicted fakeness map for encrypted images, and (f) predicted fakeness map for manipulated images. The first column shows the manipulation of (seen GM, seen attribute modification) i.e.(STGAN, bald). Following two columns show the manipulation of (seen GM, unseen attribute modification) i.e.(STGAN, [bangs, pale skin]. The fourth and fifth columns show manipulation of unseen GM, GauGAN for non-face images. The last column shows manipulation by unseen GM, DRIT. We see that the fakeness map of manipulated images is more bright and similar to 𝑴𝐺𝑇 , while the real fakeness map is more close to zero. We use the cmap as “pink" to better visualize the fakeness map. All face images come from SiWM-v2 data [115]. prevents the degradation of the image quality. The encryption process is summarised below: T ( 𝑰𝑅 𝑗 ) = 𝑰𝑅 𝑗 + 𝑚 × 𝑺𝑖 where 𝑖 = 𝑅𝑎𝑛𝑑 (1, 2, ..., 𝑛). (3.6) We select the value of 𝑚 as 30% for our framework. We optimize the template set by focusing on properties like low magnitude, orthogonality, and 36 high-frequency content [6]. The properties are applied as constraints as follows. 𝐽𝑇 = 𝜆1 × 𝑛 ∑︁ 𝑖=1 ||𝑺𝑖 ||2 + 𝜆2 × 𝑛 ∑︁ 𝑖, 𝑗=1 𝑖≠ 𝑗 CS(𝑺𝑖, 𝑺 𝑗) + 𝜆3 × ||L (𝔉(𝑺))||2, (3.7) where CS is the cosine similarity, L is the low-pass filter, 𝔉 is the fourier transform, 𝜆1, 𝜆2, 𝜆3 are weights for losses of low magnitude, orthogonality and high-frequency content, respectively. 3.3.2.2 Localization Module To design the localization module, we consider two desired properties: a larger receptive field for fakeness map estimation and high inference efficiency. A network with a large receptive field will consider far-apart regions in learning features for localization. Yet, large receptive fields normally come from deeper networks, implying slower inference. In light of these properties, we design a two-branch architecture consisting of a shallow CNN network E𝐶 and a ViT transformer [79] E𝑇 (see Fig. 3.2). The intuition is to have one shallow branch to capture local features, and one deeper branch to capture global features. While training with both branches helps to learn better templates, in inference we only use the shallow branch for a higher efficiency. 
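(A minimal sketch of this two-branch design is given below; the layer count, channel width, and the ViT placeholder are illustrative assumptions rather than the exact MaLP architecture.)

import torch
import torch.nn as nn

class TwoBranchLocalizer(nn.Module):
    # Shallow CNN branch (local features) plus a ViT branch (global features); the ViT is only
    # used during training to shape the template and is skipped at inference.
    def __init__(self, depth=10, width=32):
        super().__init__()
        layers, in_ch = [], 3
        for _ in range(depth - 1):
            layers += [nn.Conv2d(in_ch, width, 3, padding=1), nn.ReLU(inplace=True)]
            in_ch = width
        layers += [nn.Conv2d(in_ch, 1, 3, padding=1), nn.Sigmoid()]   # fakeness map in [0, 1]
        self.cnn = nn.Sequential(*layers)
        self.vit = None          # placeholder for a ViT-based map predictor attached in training

    def forward(self, x, training_branch=False):
        local_map = self.cnn(x)
        if training_branch and self.vit is not None:
            return local_map, self.vit(x)        # both maps supervise the template via Eq. (3.8)
        return local_map                          # inference: shallow branch only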
Specifically, the shallow CNN network has 10 layers which is efficient in inference but can only capture the local features due to small receptive fields. To capture global information, we adopt the ViT transformer. With the self-attention between the image patches, the transformer can estimate the fakeness map considering the far-apart regions. Both the CNN and transformer are trained jointly to estimate a better template set, resembling the concept of the ensemble of networks. We empirically show that training both networks simultaneously results in higher performance than training either network separately. As the shallow CNN network is much faster in inference than the transformer, we use the transformer only in training to optimize the templates and switch off the transformer branch in inference. To estimate the fakeness map, we leverage the supervision of the ground-truth fakeness map in Eq. (3.5). For fake images, we maximize the cosine similarity (𝐶𝑆) and structural similarity index measure (𝑆𝑆) between the predicted and ground-truth fakeness map. However, the fakeness map should be a zero image for encrypted images. Therefore, we apply an 𝐿2 loss [141] to minimize the predicted map to zero for encrypted images. To maximize the difference between the two fakeness 37 maps, we further minimize the cosine similarity between the predicted map from encrypted images and 𝑴𝐺𝑇 . The localization loss is defined as: (cid:110) 𝜆4 × ||E𝐶/𝑇 ( 𝑰)||2 2+ 𝜆5 × CS(E𝐶/𝑇 ( 𝑰), 𝑴𝐺𝑇 ) (cid:111) if 𝑰 ∈ T ( 𝑰𝑅) (cid:110) 𝜆6 × (1 − CS(E𝐶/𝑇 ( 𝑰), 𝑴𝐺𝑇 ))+ if 𝑰 ∈ 𝐺 (T ( 𝑰𝑅)) 𝜆7 × (1 − 𝑆𝑆(E𝐶/𝑇 ( 𝑰), 𝑴𝐺𝑇 )) (cid:111) . 𝐽𝐿 =    (3.8) Finally, we have a classifier to make a binary decision of real vs.fake using the fakeness maps. This classifier is included in the framework to aid the detection module for binary detection of the input images, which will be discussed in Sec. 3.3.2.3. Another reason to have the classifier is to make the fakeness maps from encrypted and fake images to be distinguishable. We find that this design allows our training to converge much faster. 3.3.2.3 Detection Module To leverage the added template for manipulation detection, we perform template recovery using encoder E𝐸 . We follow the procedure in [6] to recover the added template from the encrypted images by maximizing the cosine similarity between 𝑺 and 𝑺𝑅. However, for manipulated images, we minimize the cosine similarity between the recovered template (𝑺𝑅) and all the templates in the template set S. 𝐽𝑅 =    𝜆8 × (1 − CS(𝑺, 𝑺𝑅)) if 𝑥 ∈ T ( 𝑰𝑅) 𝜆9 × ((cid:205)𝑛 𝑖=1(CS(𝑺𝑖, 𝑺𝑅))) if 𝑥 ∈ 𝐺 (T ( 𝑰𝑅)). (3.9) Further, we leverage our estimated fakeness map to help manipulation detection. As discussed in the previous section, we apply a classifier C to perform binary classification of the predicted fakeness map for the encrypted and fake images. The logits of the classifier are further combined with the cosine similarity of the recovered template. The averaged logits are back-propagated using the binary cross-entropy constraint. This not only improves the performance of manipulation detection but also helps manipulation localization. Therefore, we apply the binary cross entropy 38 Table 3.2 Manipulation localization comparison with prior works. 
Method [65] [141] MaLP CS ↑ 0.6230 0.8831 0.9394 Localization Detection PSNR ↑ SSIM ↑ Accuracy ↑ EER ↓ AUC ↑ 0.9975 0.9975 6.214 0.9998 0.9945 22.890 1.0 0.9991 23.020 0.0050 0.0077 0.0072 0.2178 0.7876 0.7312 loss on the averaged logits as follows: 𝐽𝐶 =𝜆10 × − (cid:26) ∑︁ 𝑗 𝑦 𝑗 .log (cid:104) C( 𝑿 𝑗) + 𝐶𝑆(𝑺𝑅, 𝑺) 2 (cid:105) − (1 − 𝑦 𝑗).log (cid:104) 1 − C( 𝑿 𝑗) + 𝐶𝑆(𝑺𝑅, 𝑺) 2 (cid:105) (cid:27) , (3.10) where 𝑦 𝑗 is the class label, 𝑺 and 𝑺𝑅 are the added and recovered template respectively. Our framework is trained in an end-to-end manner with the overall loss function as follows: 𝐽 = 𝐽𝑇 + 𝐽𝑅 + 𝐽𝐶 + 𝐽𝐿. (3.11) 3.3.3 MaLP as A Discriminator One application of MaLP is to leverage our proposed localization module as a discriminator for improving the quality of the manipulated images. MaLP performs binary classification by estimating a fakeness map, which can be used as an objective. This results in output images being resilient to manipulation localization, thereby lowering the performance of our framework. We use MaLP as a plug-and-play discriminator to improve image generation quality through fine-tuning pretrained GMs. The generation quality and manipulation localization will compete head-to-head, resulting in a better quality of the manipulated images. We define the fine-tuning objective for the GM as follows: min 𝜃𝐺 max 𝜃𝑀𝑎𝐿 𝑃 ,𝑺𝑖 (cid:26) ∑︁ 𝑗 (cid:16)E(cid:2)𝑙𝑜𝑔(E𝑀 𝑎𝐿 𝑃 (T ( 𝑰𝑅 𝑗 )); 𝜃 𝑀 𝑎𝐿 𝑃)(cid:3)+ E(cid:2)1 − 𝑙𝑜𝑔(E𝑀 𝑎𝐿 𝑃 (𝐺 (T ( 𝑰𝑅 𝑗 ; 𝑺𝑖); 𝜃𝐺); 𝜃 𝑀 𝑎𝐿 𝑃))(cid:3) (cid:17)(cid:27) . (3.12) where E𝑀𝑎𝐿𝑃 is our framework with 𝜃 𝑀𝑎𝐿𝑃 parameters. 39 Table 3.3 Comparison of localization performance across unseen GMs and attribute modifications. We train on STGAN bald/smile attribute modification and test on AttGAN/StyleGAN. Cosine similarity ↑ (StyleGAN) Smile 0.6176 0.8159 Cosine similarity ↑(AttGAN) Gender 0.6470 0.8016 Black Hair Eyeglasses Age 0.3141 0.8255 Bald 0.8141 0.8201 0.6932 0.7940 0.6950 0.8557 [141] MaLP Method 3.4 Experiments 3.4.1 Experimental Setup Settings Following the settings in [141], we use STGAN [194] to manipulate images from CelebA [199] dataset and train on bald facial attribute modification. In order to evaluate the generalization of image manipulation localization, we construct a new benchmark that consists of 200 real images of 22 different GMs on various data domains. The real images are chosen from the dataset on which the GM is trained on. The list of GMs, datasets and implementation details are provided in the supplementary. Evaluation Metrics We use cosine similarity (CS), peak signal-to-noise ratio (PSNR), and structural similarity index measure (SSIM) as adopted by [141] to evaluate manipulation localization since the GT is a continuous map. For binary detection, we use the area under the curve (AUC), equal error rate (EER), and accuracy score [141]. 3.4.2 Comparison with Baselines We compare our results with [141] and [65] for manipulation localization. The results are shown in Tab. 3.2. MaLP has higher cosine similarity and similar PSNR for localization compared to [141]. However, we observe a dip in SSIM. This might be because of the degradation caused by adding our template to the real images and then performing the manipulation. The learned template helps localize the manipulated regions better, as demonstrated by cosine similarity, but the degradation affects SSIM and PSNR. We also compare the performance of real vs.fake binary detection. 
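(For this comparison, the real-vs-fake score averages the fakeness-map classifier's output with the template cosine similarity, following Eq. (3.10); a minimal sketch with illustrative tensor shapes is shown below.)

import torch
import torch.nn.functional as F

def detection_score(pred_map, S, S_R, classifier):
    # pred_map: predicted fakeness maps (B, 1, H, W); S: added template (H, W);
    # S_R: recovered templates (B, 1, H, W).
    p_cls = torch.sigmoid(classifier(pred_map)).view(-1)                 # map-classifier probability
    cs = F.cosine_similarity(S_R.flatten(1), S.reshape(1, -1), dim=1)    # CS(S_R, S)
    return (p_cls + cs) / 2                                              # averaged score of Eq. (3.10)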
As expected, our proposed proactive approach outperforms the passive methods with a perfect AUC and near-perfect accuracy. We also show visual examples of fakeness maps for images modified by unseen GMs in Fig. 3.3. MaLP is able to estimate the fakeness map for unseen 40 Table 3.4 Benchmark for manipulation localization across 22 different unseen GMs, showing cosine similarity between ground-truth and predicted fakeness maps. We compare our proactive vs.passive baselines [36, 127, 65] approach to highlight the generalization ability of our MaLP. We scale the images to 1282 for “sc." and keep the resolution as is for “no sc.". GM Resolution ResNet50 [127] [36] [65] MaLP (sc.) MaLP (no sc.) GM Resolution ResNet50 [127] [36] [65] MaLP (sc.) MaLP (no sc.) SEAN [387] 2562 0.8614 0.7514 0.7961 0.9376 0.9258 DRIT [177] 2562 0.7486 0.7871 0.8120 0.8867 0.9084 StarGAN [52] CycleGAN [385] GauGAN [238] Con_Enc. [240] StarGAN2 [53] ALAE [245] BiGAN [386] AuGAN [385] GANim [248] DRGAN [304] 1282 0.7513 0.7111 0.7887 0.8718 0.8718 2562 0.6715 0.7981 0.8014 0.9128 0.9245 2562 0.7615 0.8016 0.8256 0.9251 0.9125 1282 0.8639 0.7894 0.8541 0.8546 0.8546 Pix2Pix [144] CounGAN [232] DualGAN [357] ESRGAN [333] 2562 0.6719 0.7769 0.7781 0.8915 0.8714 1282 0.7293 0.8146 0.8559 0.9326 0.9326 2562 0.7365 0.7569 0.7721 0.8872 0.8432 10242 0.8703 0.8168 0.8241 0.8348 0.8743 2562 0.8196 0.7026 0.7034 0.8836 0.8785 UNIT [195] 512 × 931 0.7083 0.8064 0.8086 0.8214 0.8391 2562 0.6766 0.7156 0.7549 0.9192 0.9141 2562 0.6514 0.7217 0.7805 0.9181 0.9229 3402 0.6639 0.7516 0.7232 0.8894 0.9149 1282 0.6871 0.7612 0.8457 0.9625 0.9625 1282 0.8029 0.7115 0.7239 0.7512 0.7512 MUNIT [139] ColGAN [223] GDWCT [49] RePaint [201] 256 × 512 0.6601 0.6788 0.7097 0.7565 0.7860 1282 0.7596 0.7610 0.7874 0.8096 0.8096 1282 0.8350 0.8691 0.8879 0.9384 0.9384 2562 0.6512 0.7516 0.7696 0.8102 0.8290 Average - 0.7401 0.7645 0.7903 0.8725 0.8773 ILVR [50] 2562 0.7018 0.7851 0.7854 0.8003 0.8359 modifications and GMs across face/generic image datasets. 3.4.3 Generalization Across Attribute Modifications Following the settings in [141], we evaluate the performance of MaLP across unseen attribute modifications. Specifically, we train MaLP using STGAN with the bald/smile attribute modification and test it on unseen attribute modifications with unseen GMs: AttGAN/StyleGAN. As shown in Tab. 3.3, MaLP is more generalizable to all unseen attribute modifications. Furthermore, AttGAN shares the high-level architecture with STGAN but not with StyleGAN. We observe a significant increase in localization performance for StyleGAN compared to AttGAN. This shows that, unlike our MaLP, passive works perform much worse if the test GM doesn’t share any similarity with the training GM. Across GMs Although [141] tries to show generalization across unseen GMs; it is limited by the GMs within the same domain of the dataset used in training. We propose a benchmark to evaluate the generalization performance for future manipulation localization works that consists of 22 different GMs in various domains. We select GMs that are publicly released and can perform partial manipulation. As no open-source code base is available for [141], we train a passive approach using a ResNet50 [127] network to estimate the fakeness map as the baseline for comparison. Further, we compare our approach with [36, 65]. Although [36, 65] estimate a fakeness map, it has at least 5× lower resolution compared to input images due to their patch-based methodology. 
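(The map-level cosine similarity used throughout this benchmark, computed after resampling predictions to the ground-truth resolution as described next, can be sketched as follows; bilinear interpolation is an assumption.)

import torch.nn.functional as F

def map_cosine_similarity(pred_map, gt_map):
    # Resample the predicted fakeness map to the ground-truth resolution, then compare.
    if pred_map.shape[-2:] != gt_map.shape[-2:]:
        pred_map = F.interpolate(pred_map, size=gt_map.shape[-2:],
                                 mode='bilinear', align_corners=False)
    return F.cosine_similarity(pred_map.flatten(1), gt_map.flatten(1), dim=1).mean()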
For a fair comparison, we rescale their predicted fakeness maps to the resolution of 𝑴𝐺𝑇 . We compare the 41 Table 3.5 FID score comparison for the application of our approach as a discriminator for improving the generation quality of the GM. State Before Fine-tune − 𝐺 After 𝐺 + 𝑀𝑎𝐿𝑃 StarGAN FID ↓ 60.49 51.91 52.07 cosine similarity in Tab. 3.4. MaLP is able to outperform all the baselines for almost all GMs, which proves the effectiveness of the proactive scheme. We also evaluate the performance of E𝐶 for high-resolution images. For encryption, we upsample the 128 × 128 template to the original resolution of images and evaluate E𝐶 on these higher resolution encrypted images. We observe similar performance of E𝐶 for higher resolution images in Tab. 3.4, proving the versatility of E𝐶 to image sizes. 3.4.4 Improving Quality of GMs We fine-tune the GM into fooling our framework to generate a fakeness map as a zero image. This process results in better-quality images. Initially, we train MaLP with the pretrained GM so that it can perform manipulation localization. Next, to fine-tune the GM, we adopt two strategies. First, we freeze MaLP and fine-tune the GM only. Second, we fine-tune both the GM and the MaLP but update the MaLP with a lower learning rate. The result for fine-tuning StarGAN is shown in Tab. 3.5. We observe that for both strategies, MaLP reduces the FID score of StarGAN. We also show some visual examples in Fig. 3.4. We see that the images are of better quality after fine-tuning, and many artifacts in the images manipulated by the pretrained model are removed. 3.4.5 Other Comparisons Binary Detection We compare with prior proactive and passive approaches for binary manipulation detection [6, 330, 222, 371]. We adopt the evaluation protocol in [6] to test on images manipulated by CycleGAN, StarGAN, and GauGAN. We are able to perform similar to [6] as shown in Tab. 3.6. We have better average precision than passive schemes and generalize well to GMs unseen in training. We also conduct experiments to see whether localization can help binary detection to improve the performance, as mentioned in Sec. 3.3.2.3. The combined predictions’ results are better than just using the detection module as shown in Tab. 3.6. This is intuitive as the localization 42 Figure 3.4 Visualization of (a) encrypted images, (b) manipulated images before fine-tuning, and (c) manipulated images after fine-tuning. The generation quality has improved after we fine-tune the GM using our framework as a discriminator. The artifacts in the images have been reduced, and the face skin color is less pale and more realistic. We also specify the cosine similarity of the predicted fakeness map and 𝑴𝐺𝑇 . The GM is able to decrease the performance of our framework after fine-tuning. All face images come from SiWM-v2 data [115]. module provides extra information, thereby increasing the performance. Inference Speed We compare the inference speed of our MaLP against prior work. [141] uses Deeplabv3-ResNet101 model from PyTorch [239]. In our generalization benchmark shown in Sec. 3.4.3, we use the ResNet50 model for training the passive baseline. The inference speed per image on an NVIDIA K80 GPU for Deeplabv3-ResNet101, ResNet50, and MaLP are 75.61, 52.66, and 29.26 ms, respectively. MaLP takes less than half the inference time compared to [141] due to our shallow CNN network. Adversarial Attack Our framework can be considered as an adversarial attack on real images to aid manipulation localization. 
Therefore, it is vital to contrast the performance between our approach and classic adversarial attacks. For this purpose, we perform experiments that make use of 43 Table 3.6 Comparison with prior binary detection works. [Keys: D.M.: Detection module, L.M.: Localization module]. Method Nataraj et al. [222] Zhang et al. [371] Wang et al. [330] Asnani et al. [6] MaLP (D.M.) MaLP (D.M. + L.M.) Train GM CycleGAN AutoGAN ProGAN STGAN STGAN STGAN Test GM Average precision (%)↑ Set size CycleGAN StarGAN GauGAN - - - 1 1 1 100 100 84.00 94.00 94.10 94.30 88.20 100 100 100 100 100 56.20 61.00 67.00 69.50 69.61 72.16 Table 3.7 Comparison with adversarial attack methods. Method Scheme Huang et al. [141] PGD [207] FGSM [109] CW [34] MaLP Passive Proactive Proactive Proactive Proactive Bald 0.8141 0.8051 0.8111 0.8014 0.8201 Cosine similarity↑ Black Hair Eyeglasses 0.6932 0.7514 0.7882 0.8344 0.7940 0.6950 0.8358 0.8512 0.8405 0.8557 adversarial attacks, namely PGD [207], CW [34], and FGSM [109] to guide the learning of the added template. We evaluate on unseen GM AttGAN for unseen attribute modifications. We show the performance comparison in Tab. 3.7. MaLP has higher cosine similarity across some unseen facial attribute modifications compared to adversarial attacks. This can be explained as the adversarial attack methods being over-fitted to training parameters (data, target network etc.). Therefore, if the testing data is changed with unseen attribute modifications by GMs, the performance of adversarial attacks degrades. Further, these attacks are analogous to our MaLP as a proactive scheme which, in general, have better performance than passive works. Model Robustness Against Degradations It is necessary to test the robustness of our proposed approach against various types of real-world image editing degradations. We evaluate our method on degradations applied during testing as adopted by [141], which include JPEG compression, blurring, adding noise, and low resolution. The results are shown in Fig. 3.5. Our proposed MaLP is more robust to real-world degradations than passive schemes. 44 Figure 3.5 Comparison of our approach’s robustness against common image editing degradations. 3.4.6 Ablations Two-branch Architecture As described in Sec. 3.3.2.2, MaLP adopts a two-branch architecture to predict the fakeness map using the local-level and global-level features, which are estimated by a shallow CNN and a transformer. We ablate by training each branch separately to show the effectiveness of combining them. As shown in Tab. 3.8, if the individual network is trained separately, the performance is lower than the two-branch architecture. Next, to show the efficacy of the transformer, we use a ResNet50 network in place of the transformer to predict the fakeness map. We observe that the performance is even worse than using only the transformer. ResNet50 lacks the added advantage of self-attention in the transformer, which estimates the global-level features much better than a CNN network. Constraints MaLP leverages different constraints to estimate the fakeness map using an optimized template. We perform an ablation by removing each constraint separately, showing the importance of every constraint. Tab. 3.9 shows the cosine similarity for localization and accuracy for detection. Removing either the classifier or recovery constraint results in lower detection performance. 
This is expected as we leverage logits from both C and E𝐸 , and removing the constraint for one network will hurt the logits of the other network. Furthermore, removing the template constraint results in a decrease in performance. Although the gap is small, the template is not properly optimized to have lower magnitude and high-frequency content. Removing the localization constraint and just applying a 𝐿2 loss for supervising fakeness maps 45 Table 3.8 Ablation of two-branch architecture. CNN is a shallow network with 10 layers. Training each branch separately has worse localization results than combining them. Cosine similarity ↑ Accuracy ↑ Network trained CNN only Transformer only CNN + ResNet50 CNN + Transformer 0.8961 0.8848 0.8647 0.9394 0.9801 0.9856 0.9512 0.9981 Table 3.9 Ablation of constraints used in training our framework. Cosine similarity ↑ Accuracy ↑ Constraint removed Classifier constraint 𝐽𝐶 Template constraint 𝐽𝑇 Localization constraint 𝐽𝐿 Recovery constraint𝐽𝑅 Fixed template Nothing (MaLP) 0.9319 0.9143 0.8814 0.9206 0.8887 0.9394 0.9814 0.9803 0.9539 0.9780 0.9514 0.9991 Figure 3.6 Ablation study on hyperparameters used in our framework: set size and signal strength. result in a significant performance drop for both localization and detection, showing the necessity of this constraint. Finally, we show the importance of a learnable template by not optimizing it during the training of MaLP. This hurts the performance a lot, similar to removing the localization constraint. Both these observations prove that our localization constraint and learnable template are important components of MaLP. Template Set Size We perform an ablation to vary the size of the template set S. Having multiple templates will improve security if an attacker tries to reverse engineer the template from encrypted images. The results are shown in Fig. 3.6 (a). The cosine similarity takes a dip when the set size is increased. We also observe the inter-template cosine similarity, which remains constant at a high 46 value of around 0.74 for all templates. This is against the findings of [6]. Localization is a more challenging task than binary detection. Therefore, it is less likely to find different templates for our MaLP in the given feature space compared to [6]. Signal Strength We vary the template strength hyperparameter m to find its impact on the performance. As shown in Fig. 3.6 (b), the cosine similarity increases as we increase the strength of the added template. However, this comes with the lower visual quality of the encrypted images if the template strength is increased. The performance doesn’t vary much after 𝑚 = 30%, which we use for MaLP. 3.5 Conclusion This paper focuses on manipulation localization using a proactive scheme (MaLP). We propose to improve the generalization of manipulation localization across unseen GM and facial attribute modifications. We add an optimal template onto the real images and estimate the fakeness map via a two-branch architecture using local and global-level features. MaLP outperforms prior works with much stronger generalization capabilities, as demonstrated by our proposed evaluation benchmark with 22 different GMs in various domains. We show an application of MaLP in fine-tuning GMs to improve generation quality. Limitations First, the number of publicly available GMs is limited. More thorough testing on many different GMs might give more insights into the problem of generalizable manipulation localization. 
Second, we show that our MaLP can be used to fine-tune GMs to improve image generation quality. However, this relies on a pretrained GM; using our method to train a GM from scratch is an interesting direction to explore in the future.
CHAPTER 4 PROBED: PROACTIVE OBJECT DETECTION WRAPPER
Previous research in 2D object detection focuses on various tasks, including detecting objects in generic and camouflaged images. These works are regarded as passive works for object detection as they take the input image as is. However, neural network training is not guaranteed to converge to an optimal global minimum; therefore, we argue that the trained weights of an object detector are not optimal. To rectify this problem, we propose a wrapper based on proactive schemes, PrObeD, which enhances the performance of these object detectors by learning a signal. PrObeD consists of an encoder-decoder architecture, where the encoder network generates an image-dependent signal, termed a template, to encrypt the input images, and the decoder recovers this template from the encrypted images. We propose that learning the optimum template results in an object detector with improved detection performance. The template acts as a mask on the input images to highlight semantics useful for the object detector. Fine-tuning the object detector with these encrypted images enhances the detection performance for both generic and camouflaged object detection. Our experiments on MS-COCO, CAMO, COD10K, and NC4K datasets show improvement over different detectors after applying PrObeD1.
4.1 Introduction
Generic 2D object detection (GOD) has improved from earlier traditional detectors [312, 313, 64, 87] to deep-learning-based object detectors [260, 254, 43, 32, 117, 128]. Deep-learning-based methods have undergone many architectural changes over recent years, including one-stage [254, 256, 22, 255, 196, 189], two-stage [103, 102, 260], CNN-based [102, 254, 256, 22, 73, 91, 98, 63], transformer-based [32, 388], and diffusion-based [43] methods. All these methods aim to predict the 2D bounding boxes of the objects in an image and their category labels. Another emerging area related to generic object detection is camouflaged object detection (COD) [82, 81, 149, 178, 120, 122, 121]. COD aims to detect and segment objects blended with the background [82, 81] via object-level mask supervision. Applications of COD include medical imaging [83, 193], surveillance [46], and autonomous driving [346]. Early COD detectors exploit hand-crafted features [275, 236] and optical flow [135], while current methods are deep-learning-based. These methods utilize attention [292, 38], joint learning [178], image gradients [149], and transformers [208, 348].
1Vishal Asnani, Abhinav Kumar, Suya You, and Xiaoming Liu. "PrObeD: proactive object detection wrapper." Advances in Neural Information Processing Systems 36, 2024.
Figure 4.1 (a) Passive vs. Proactive object detection. A learnable template encrypts the input images, which are further used to train the object detector. (b) PrObeD serves as a wrapper on both generic and camouflaged object detectors, enhancing the detection performance. (c) For the linear regression model under additive noise and other assumptions, the converged weights of the proactive detector are closer to the optimal weights as compared to the converged weights of the passive detector. See Sec. 4.3.2 for details and proof.
All these methods take input images as is for the detection task and hence are called passive methods.
However, there is a line of research on proactive methods for a wide range of vision tasks such as disruption [267, 272], tagging [325], manipulation detection [6], and localization [7]. Proactive methods use signals, called templates, to encrypt the input images and pass the encrypted images as the input to the network. These are trained in an end-to-end manner by using either a fixed [325] or learnable template [267, 272, 7, 6] to improve the performance. A major advantage of proactive schemes is that such methods generalize better on unseen data/models [6, 7]. Motivated by this, we propose a plug-and-play Proactive Object Detection wrapper, PrObeD, to improve GOD and COD detectors. Designing PrObeD as a proactive scheme involves several challenges and key factors. First, the proactive wrapper needs to be a plug-and-play module that can be applied to both GOD and COD 49 detectors. Secondly, the encryption process should be intuitive to benefit the object detection task. e.g., an ideal template for detection should highlight the foreground objects in the input image. Lastly, the choice of supervision to estimate the template for encryption is hard to formulate. Previous proactive methods [6, 7] use learnable but image-independent templates for manipulation and localization tasks. However, the object detection task is scene-specific; therefore, the ideal template should be image-dependent. Based on this key insight, we propose a novel plug-and- play proactive wrapper in which we apply object detectors to enhance detection performance. The PrObeD wrapper utilizes an encoder network to learn an image-dependent template. The learned template encrypts the input images by applying a transformation, defined as an element- wise multiplication between the template and the input image. The decoder network recovers the templates from the encrypted images. We utilize regression losses for supervision and leverage the ground-truth object map to guide the learning process, thereby imparting valuable object semantics to be integrated into the template. We then fine-tune the proactive wrapper with the GOD and COD detectors to improve their detection performance. Extensive experiments on MS-COCO, CAMO, COD10K, and NC4K datasets show that PrObeD improves the detection performance for both GOD and COD detectors. In summary, the contributions of this work include: • We propose a novel proactive approach PrObeD for the object detection task. To the best of our knowledge, this is the first work to develop a proactive approach to 2𝐷 object detection. • We mathematically prove that the proactive method results in a better-converged model than the passive detector under assumptions and, consequently, a better object detector. • PrObeD wraps around both GOD and COD detectors and improves detection performance on MS-COCO, CAMO, COD10K, and NC4K datasets 4.2 Related works Proactive Schemes. Earlier works adopt to add signals like perturbation [272], adversarial noise [267], and one-hot encoding [325] messages while focusing on tasks like disruption [272, 267] 50 and deepfake tagging [325]. Asnani et al. [6] propose to learn an optimized template for binary detection by unseen generative models. Recently, MaLP [7] adds the learnable template to perform generalized manipulation localization for unknown generative models. Unlike these works, PrObeD uses image-dependent templates and is a plug-and-play wrapper for a different task of object detection. 
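(To make this distinction concrete, a minimal sketch of such an image-dependent, multiplicative wrapper is shown below; the sigmoid on the encoder output and the module names are illustrative assumptions, with the exact formulation given in Sec. 4.3.)

import torch
import torch.nn as nn

class ProactiveWrapper(nn.Module):
    # The encoder predicts an image-dependent template; the image is encrypted by element-wise
    # multiplication (Eq. (4.3)) and handed to any off-the-shelf object detector.
    def __init__(self, encoder, decoder, detector):
        super().__init__()
        self.encoder, self.decoder, self.detector = encoder, decoder, detector

    def forward(self, images):
        template = torch.sigmoid(self.encoder(images))   # per-pixel, mask-like template
        encrypted = images * template                    # T(I) = I ⊙ S
        recovered = self.decoder(encrypted)              # supervised to recover the template
        return self.detector(encrypted), template, recovered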
Generic Object Detection Detection of generic objects, instead of specific object categories such as pedestrians [25], apples [55], and others [10, 170, 169], has been a long-standing objective of computer vision. RCNN [103, 104] employs the extraction of object proposals. He et al. [125] propose a spatial pooling layer to extract a fixed-length representation of all the objects. Modifications of RCNN [102, 260, 185, 367] increase the inference speed. Feature pyramid network [188] detects objects with a wide variety of scales. The above methods are mostly two-stage, so inference is an issue. Single-stage detectors like YOLO [254, 256, 22, 255, 316], SSD [196], HRNet [321] and RetinaNet [189] increase the speed and simplicity of the framework compared to the two-stage detector. Recently, transformer-based methods [32, 388] use a global-scale receptive field. Chen et al. [43] use diffusion models to denoise noisy boxes at every forward step. PrObeD functions as a wrapper around the pre-existing object detector, facilitating its transformation into an enhanced object detector. The comparison of PrObeD with prior works is summarized in Tab. 4.1. Camouflaged Object Detection Early COD works rely on hand-crafted features like co-occurrence matrices [275], 3𝐷 convexity [236], optical flow [135], covariance matrix [150], and multivariate calibration components [259]. Later on, [292, 38] incorporate an attention-based cross-level fusion of multi-scale features to recover contextual information. Mei et al. [215] take motivation by predators to identify camouflaged objects using a position and focus ideology. SINet [82] uses a search and identification module to perform localization. SINET-v2[81] uses group-reversal attention to extract the camouflaged maps. [154] explores uncertainty maps and [389] utilizes cube- like architecture to integrate multi-layer features. ANet [176], LSR [204], and JCSOD [178] employ joint learning with different tasks to improve COD. Lately, [208, 348, 48] apply a transformer-based architecture for difficult-aware learning, uncertainty modeling, and temporal consistency. Zhai et 51 Table 4.1 Comparison of PrObeD with prior works. Method Faster R-CNN [260] YOLO [254] DeTR [32] DGNet [149] SINet-v2 [81] JCSOD [178] OGAN [272] Ruiz et al. [267] Yeh et al. [356] FakeTagger [325] Asnani et al. [6] MaLP [7] PrObeD (Ours) Template Proactive Task Object Detection Object Detection Object Detection Object Detection Object Detection Object Detection Disrupt Disrupt Disrupt Tagging Manipulation Detection Number ✕ - ✕ - ✕ - ✕ - ✕ - ✕ - ✓ 1 ✓ 1 ✓ 1 ✓ ≥ 1 ✓ ≥ 1 Learnable set, Image-independent ✓ Manipulation Localization ≥ 1 Learnable set, Image-independent ✓ ≥ 1 Type - - - - - - Learnable Learnable Learnable Fixed, Id-dependent Learnable, Image-dependent Object Detection COD GOD Plug-Play ✕ ✕ ✕ ✓ ✓ ✓ - - - - - - ✓ ✓ ✓ ✓ ✕ ✕ ✕ - - - - - - ✓ ✕ ✕ ✕ ✕ ✕ ✕ ✕ ✕ ✕ ✕ ✓ ✓ ✓ al. [368] use a graph learning model to disentangle input into different features for localization. DGNet [149] uses image gradients to exploit intensity changes in the camouflaged object from the background. Unlike these methods, PrObeD uses proactive methods to improve camouflaged object detection. 4.3 Proposed Approach Our method originates from understanding what makes proactive schemes effective. We first overview the two detection problems: GOD and COD in Sec. 4.3.1. 
We next derive Lemma 1, where we show that proactive schemes with the multiplicative transformation of images are better than passive schemes by comparing the deviation of the trained network weights from the optimal weights. Based on this result, we derive in Theorem 1 that the Average Precision (AP) of the proactive model is better than the AP of the passive model. At last, we present our proactive scheme-based wrapper, PrObeD, in Sec. 4.3.3, which builds upon Theorem 1 to improve generic and camouflaged 2D object detection.

4.3.1 Background
4.3.1.1 Passive Object Detection
Although generic 2D object detection and camouflaged object detection are similar problems, they have different objective functions. Therefore, we treat them as two different problems and define their objectives separately.

Generic 2D Object Detection. Let 𝑰𝑗 be the set of input images given to the generic 2D object detector O with trainable parameters 𝜃. Most of these detectors output two sets of predictions per image: (1) bounding box coordinates, O(𝑰𝑗)_1 = 𝑇̂ ∈ R^4, and (2) class logits, O(𝑰𝑗)_2 = 𝐶̂ ∈ R^N, where 𝑁 is the number of foreground object categories. If the ground-truth bounding box coordinates are 𝑇𝑗 and the ground-truth category label is 𝐶𝑗, the objective function of such a detector is:

\min_{\theta} \Big\{ \sum_{j} \big\| \mathcal{O}(\boldsymbol{I}_j; \theta)_1 - T_j \big\|_2 \; - \; \sum_{j} \sum_{i=1}^{N} C_{ij} \cdot \log\big( \mathcal{O}(\boldsymbol{I}_j; \theta)_2 \big) \Big\}.   (4.1)

Camouflaged Object Detection. Let 𝑰𝑗 be the input image set given to the camouflaged object detector O with trainable parameters 𝜃, and 𝑮𝑗 be the ground-truth segmentation map. Prior passive works predict a segmentation map with the following objective:

\min_{\theta} \Big\{ \sum_{j} \big\| \mathcal{O}(\boldsymbol{I}_j; \theta) - \boldsymbol{G}_j \big\|_2 \Big\}.   (4.2)

4.3.1.2 Proactive Object Detection
Proactive schemes [7, 6] encrypt the input images with a template to aid manipulation detection/localization. Such schemes take an input image 𝑰𝑗 ∈ R^{H×W×3} and learn a template 𝑺𝑗 ∈ R^{H×W}. PrObeD uses image-dependent templates to improve object detection. Given an input image 𝑰𝑗 ∈ R^{H×W×3}, PrObeD learns to output a template 𝑺𝑗 ∈ R^{H×W}, which is used by a transformation T to produce encrypted images T(𝑰𝑗). PrObeD uses element-wise multiplication as the transformation T, which is defined as:

\mathcal{T}(\boldsymbol{I}_j) = \mathcal{T}(\boldsymbol{I}_j; \boldsymbol{S}_j) = \boldsymbol{I}_j \odot \boldsymbol{S}_j.   (4.3)

4.3.2 Mathematical Analysis of Passive and Proactive Detectors
PrObeD optimizes the template to improve the performance of the object detector. We argue that this template helps the training arrive at a better minimum, closer to the optimal parameters 𝜃. We now state the following lemma to support our argument:

Lemma 1 (Converged weights of proactive and passive detectors). Consider a linear regression model that regresses an input image 𝑰𝑗 under an additive noise setup to obtain the 2D coordinates. Assume the noise under consideration 𝑒 is a normal random variable N(0, 𝜎²). Let 𝒘 and 𝒘∗ denote the trained weights of the pretrained linear regression model and the optimal weights of the linear regression model, respectively. Also, assume SGD optimizes the model parameters with a decreasing step size 𝑠 such that the steps are square summable, i.e., S = \lim_{t \to \infty} \sum_{k=1}^{t} s_k^2 exists, and that the noise is independent of the image.
Then, there exists a template 𝑺𝑗 ∈ [0, 1] for the image 𝑰𝑗 such that using the multiplicatively transformed image as the input results in a trained weight 𝒘′ closer to the optimal weight than the originally trained weight 𝒘. In other words,

\mathbb{E}\big( \| \boldsymbol{w}' - \boldsymbol{w}^* \|_2 \big) < \mathbb{E}\big( \| \boldsymbol{w} - \boldsymbol{w}^* \|_2 \big).   (4.4)

The proof of Lemma 1 is in the supplementary. We use the variance of the gradient of the encrypted images to arrive at this lemma. We next use Lemma 1 to derive the following theorem:

Theorem 1 (AP comparison of proactive and passive detectors). Consider a linear regression model that regresses an input image 𝑰𝑗 under an additive noise setup to obtain the 2D coordinates. Assume the noise under consideration 𝑒 is a normal random variable N(0, 𝜎²). Let 𝒘 and 𝒘∗ denote the trained weights of the pretrained linear regression model and the optimal weights of the linear regression model, respectively. Also, assume SGD optimizes the model parameters with a decreasing step size 𝑠 such that the steps are square summable, i.e., S = \lim_{t \to \infty} \sum_{k=1}^{t} s_k^2 exists, and that the noise is independent of the image. Then, the AP of the proactive detector is better than the AP of the passive detector.

The proof of Theorem 1 is in the supplementary. We use Lemma 1 and the non-decreasing nature of AP w.r.t. IoU to arrive at this theorem. Next, we adapt the objectives of Eqs. (4.1) and (4.2) to incorporate the proactive method as follows:

\min_{\theta, \boldsymbol{S}_j} \Big\{ \sum_{j} \big\| \mathcal{O}(\mathcal{T}(\boldsymbol{I}_j; \boldsymbol{S}_j); \theta)_1 - T_j \big\|_2 \; - \; \sum_{j} \sum_{i=1}^{N} C_{ij} \cdot \log\big( \mathcal{O}(\mathcal{T}(\boldsymbol{I}_j; \boldsymbol{S}_j); \theta)_2 \big) \Big\},   (4.5)

\min_{\theta, \boldsymbol{S}_j} \Big\{ \sum_{j} \big\| \mathcal{O}(\mathcal{T}(\boldsymbol{I}_j; \boldsymbol{S}_j); \theta) - \boldsymbol{G}_j \big\|_2 \Big\}.   (4.6)

4.3.3 PrObeD
Our proposed approach comprises three stages: template generation, template recovery, and detector fine-tuning. First, we use an encoder network to generate an image-dependent template for image encryption. This encrypted image is further used to recover the template through a decoder network. Finally, the object detector is fine-tuned using the encrypted images. All three stages are trained in an end-to-end fashion. While all the stages are used for training PrObeD, we specifically use only stages 1 and 3 for inference. We will now describe each stage in detail.

Figure 4.2 Overview of PrObeD. PrObeD consists of three stages: (1) template generation, (2) template recovery, and (3) detector fine-tuning. The templates are generated by encoder network E to encrypt the input images. The decoder network D is used to recover the template from the encrypted images. Finally, the encrypted images are used to fine-tune the object detector to perform detection. We train all the stages in an end-to-end manner. However, for inference, we only use stages 1 and 3. Best viewed in color.

4.3.3.1 Proactive Wrapper
Our proposed approach consists of three stages, as shown in Fig. 4.2. However, only the first two stages are part of our proposed proactive wrapper, which can be applied to an object detector to improve its performance.

Stage 1: Template Generation. Prior works learn a set of templates [7, 6] in their proactive schemes. This set of templates is enough to perform the respective downstream tasks, as the generative model manipulates the template, which is easy to capture with a set of learnable templates. However, for object detection, every image has unique object characteristics, such as size, appearance, and color, that can vary significantly.
This variability present in the images may exceed the descriptive capacity of a finite set of templates, thereby necessitating the use of image- 55 specific templates to accurately represent the range of object features present in each image. In other words, a fixed set of templates may not be sufficiently flexible to capture the diversity of visual features across the given set of input images, thus demanding more adaptable, image-dependent templates. Motivated by the above argument, we propose to generate the template 𝑺 𝑗 for every image using an encoder network. We hypothesize that highlighting the area of the key foreground objects would be beneficial for object detection. Therefore, for GOD, we use the ground-truth bounding boxes 𝑇 𝐺 to generate the pseudo ground-truth segmentation map. Specifically, for any image 𝑰 𝑗 , if the bounding box coordinates are 𝑇 𝐺 𝑗 = {𝑥1, 𝑥2, 𝑦1, 𝑦2}, we define the pseudo ground-truth segmentation map as: ∀𝑚 ∈ [0, 𝐻], 𝑛 ∈ [0, 𝑊], we have 𝑮 𝑗 (𝑚, 𝑛) = 1 if 𝑥1 ≤ 𝑚 ≤ 𝑥2 and 𝑦1 ≤ 𝑛 ≤ 𝑦2, otherwise 0 However, for COD, the dataset already has the ground-truth segmentation map 𝑮 𝑗 , which we use as the supervision for the encoder to output the templates with semantic information of the image to be restricted only in the region of interest for the detector. For both GOD and COD, we minimize the cosine similarity (Cos) between 𝑺 𝑗 and 𝑮 𝑗 as the supervision for the encoder network. The encoder loss 𝐽𝐸 is as follows: 𝐽𝐸 = 1 − Cos(𝑺 𝑗 , 𝑮 𝑗 ) = 1 − Cos(E ( 𝑰 𝑗 ), 𝑮 𝑗 ). (4.7) This generated template acts as a mask for the input image to highlight the object region of interest for the detector. We use this template with the transformation T to encrypt the input image as T ( 𝑰 𝑗 ; 𝑺 𝑗 ) = 𝑰 𝑗 ⊙ 𝑺 𝑗 . As we start from the pretrained model of object detector O, we initialize the bias of the last layer of the encoder as 0 so that for the first few iterations, 𝑺 𝑗 ≈ 1. This is to ensure that the distribution of 𝑰 𝑗 and T ( 𝑰 𝑗 ; 𝑺 𝑗 ) remains similar for the first few iterations, and O doesn’t encounter a sudden change in its input distribution. Stage 2: Template Recovery. So far, we have discussed the generation of template 𝑺 𝑗 using E, which will be used as a mask to encrypt the input image. The encrypted images are used for two 56 purposes: (1) recovery of templates and (2) fine-tuning of the object detector. The main intuition of recovering the templates is from the prior works on image steganalysis [258, 257] and proactive schemes [7, 6]. Motivated by these works, we draw the following insight: “To properly learn the optimal template and embed it onto the input images, it is beneficial to recover the template from encrypted images." To perform recovery, we exploit an encoder-decoder approach. Using this approach leverages the strengths of the encoder network E for feature extraction, capturing the most useful salient details, and the decoder network D for information recovery, allowing for efficient and effective encryption and decryption of the template. We also empirically show that not using the decoder to recover the templates harms the object detection performance. To supervise D in recovering 𝑺 𝑗 from T ( 𝑰 𝑗 ; 𝑺 𝑗 ), we propose to maximize the cosine similarity between the recovered template, 𝑺 ′ 𝑗 and 𝑺 𝑗 . The decoder loss is as follows: 𝐽𝐷 = 1 − Cos(𝑺 ′ 𝑗 , 𝑺 𝑗 ) = 1 − Cos(D (T ( 𝑰 𝑗 ; 𝑺 𝑗 )), 𝑺 𝑗 ). (4.8) Stage 3: Detector Fine-tuning. Due to our encryption, the distribution of the images input to the pretrained O changes. 
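The two wrapper losses just described can be summarized compactly. The sketch below is illustrative rather than the released implementation: the helper names are ours, and it assumes the templates and maps are PyTorch tensors with a batch dimension.

import torch
import torch.nn.functional as F

def pseudo_gt_map(box, height, width):
    """Pseudo ground-truth map of Stage 1: G_j(m, n) = 1 iff x1 <= m <= x2 and
    y1 <= n <= y2, following the notation T_j^G = {x1, x2, y1, y2} above."""
    x1, x2, y1, y2 = box
    gt = torch.zeros(height, width)
    gt[int(x1):int(x2) + 1, int(y1):int(y2) + 1] = 1.0
    return gt

def cosine_loss(a, b, eps=1e-8):
    """1 - Cos(a, b) over flattened maps, as in Eqs. (4.7) and (4.8).
    a, b: (B, H, W) or (B, 1, H, W) tensors."""
    a, b = a.flatten(1), b.flatten(1)
    return (1.0 - F.cosine_similarity(a, b, dim=1, eps=eps)).mean()

# Inside one training step (S: encoder output, S_rec: decoder output, G: (pseudo) GT map):
# J_E = cosine_loss(S, G)       # encoder supervision, Eq. (4.7)
# J_D = cosine_loss(S_rec, S)   # decoder supervision, Eq. (4.8)

These two terms are combined with the detector-specific loss during fine-tuning, as described next.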
Thus, we fine-tune O on the encrypted images T ( 𝑰 𝑗 ; 𝑺). As proposed in Theorem 1, given the encrypted images T ( 𝑰 𝑗 ; 𝑺), we use the pretrained detector O with parameters 𝜃 to arrive at a better local minima. Therefore, the general objective of GOD and COD in Eq. (4.5) and Eq. (4.6) change to as follows: min 𝜃 , 𝜃E , 𝜃D (cid:26) ∑︁ (cid:16) 𝑗 ||O (T ( 𝑰 𝑗 ; E ( 𝑰 𝑗 ; 𝜃 E)); 𝜃, 𝜃 D)1 − 𝑇𝑗 ||2 − (cid:0)𝐶𝑖 𝑗 .log(O (T ( 𝑰 𝑗 ; E ( 𝑰 𝑗 ; 𝜃 E)); 𝜃, 𝜃 D)2)(cid:1) (cid:17) (cid:27) , (4.9) 𝑁 ∑︁ 𝑖=1 min 𝜃 , 𝜃E , 𝜃D (cid:26) ∑︁ 𝑗 (cid:16)(cid:12) (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) O (T ( 𝑰 𝑗 ; E ( 𝑰 𝑗 ; 𝜃 E)); 𝜃, 𝜃 D) − 𝑮 𝑗 (cid:17) (cid:27) . (cid:12) (cid:12) (cid:12) (cid:12) (cid:12)2 (cid:12) (4.10) We use the detector-specific loss function 𝐽𝑂𝐵𝐽 of O along with the encoder and decoder loss in Eq. (4.7) and Eq. (4.8) to train all the three stages. The overall loss function 𝐽 to train PrObeD is as follows: 𝐽 = 𝜆𝑂𝐵𝐽 𝐽𝑂𝐵𝐽 + 𝜆𝐸 𝐽𝐸 + 𝜆𝐷 𝐽𝐷 . (4.11) 57 Table 4.2 GOD results on MS-COCO val split. PrObeD improves the performance of all GOD at all thresholds and across all categories. AP ↑ AP50 ↑ AP75 ↑ AP𝑆 ↑ AP𝑀 ↑ AP𝐿 ↑ Method 39.3 19.3 Faster R-CNN [260] 51.1 31.7 Faster R-CNN [260]+PrObeD 48.4 37.3 Faster R-CNN + FPN [188] 51.2 Faster R-CNN + FPN [188] + Seg. Mask [124] 38.2 49.8 38.5 Faster R-CNN + FPN [188] + PrObeD 52.9 37.6 Sparse R-CNN [291] 53.6 39.2 Sparse R-CNN [291]+ PrObeD 62.3 48.9 YOLOv5 [254] 62.6 49.4 YOLOv5 [254]+ PrObeD 61.0 41.9 DeTR [32] 61.3 42.1 DeTR [32]+ PrObeD 17.9 35.5 41.0 43.2 43.4 39.6 40.1 54.4 55.1 45.8 46.0 42.5 52.6 58.0 60.3 60.4 55.6 57.5 67.6 67.9 62.3 62.6 1.8 11.0 21.4 22.1 22.5 20.5 21.7 31.8 32.0 20.3 20.4 16.9 33.3 40.6 41.7 41.9 40.2 41.5 53.1 53.5 44.1 44.4 Table 4.3 COD results on CAMO, COD10K and NC4K datasets. PrObeD outperforms DGNet on all datasets and metrics. CAMO Method COD10K E𝑚 ↑ S𝑚 ↑ wF𝛽 ↑ MAE↓ E𝑚 ↑ S𝑚 ↑ wF𝛽 ↑ MAE↓ E𝑚 ↑ S𝑚 ↑ wF𝛽 ↑ MAE↓ DGNet[149] 0.859 0.791 0.681 0.079 0.833 0.776 0.603 0.046 0.876 0.815 0.710 0.059 + PrObeD 0.871 0.797 0.702 0.071 0.869 0.803 0.661 0.037 0.900 0.838 0.755 0.049 NC4K 4.4 Experiments We apply PrObeD for two categories of object detectors: GOD and COD. GOD Baselines. For GOD, we apply PrObeD on four detectors with varied architectures: two- stage, one-stage, and transformer-based detectors, namely, Faster R-CNN [260], YOLO [254], Sparse R-CNN, and DeTR [32]. We use these works as baselines for three reasons: (1) varied architecture types, (2) their increased prevalence in the community, and (3) varied timelines (from earlier to recent detectors). We use the PyTorch [239] code of the respective detectors for our GOD experiments and use the corresponding GODs as our baseline. For YOLOv5 and DeTR, we use the official repositories released by the authors; for Faster R-CNN, we use the public repository "Faster R-CNN.pytorch". For other GOD detectors, we use Detectron2 library as the pre-trained detector. We use the ResNet101 backbone for Faster R-CNN, Sparse R-CNN and DeTR, and CSPDarknet53 for YOLOv5. COD Baselines. For COD, we apply PrObeD on the current SoTA camouflage detector DGNet [149] and use DGNet as our baseline. For all object detectors, we use the pretrained model released by 58 the authors and fine-tune them with PrObeD. Please see the supplementary for more details. Datasets. Our experiments use the MS-COCO 2017 [190] dataset for GOD, while we use CAMO [176], COD10K [81], and NC4K [204] datasets for COD. 
We use the following splits of these datasets: • MS-COCO 2017 Val Split [190]: It includes 118,287 images for training and 5𝐾 for testing. • COD10K Val Split [81]: It includes 4,046 camouflaged images for training and 2,026 for testing. • CAMO Val Split [176]: It includes 1𝐾 camouflaged images for training and 250 for testing. • NC4K Val [204]: It includes 4,121 NC4K images. We use it for generalization testing as in [149]. Evaluation Metrics. We use mean average precision average at multiple thresholds in [0.5, 0.95] (AP) for GOD as in [190]. We also report results at threshold of 0.5 (AP50), threshold of 0.75 (AP75) and at different object sizes: small (AP𝑆), medium (AP𝑀), and large (AP𝐿). For COD, we use E-measure 𝐸𝑚, S-measure 𝑆𝑚, weighted F1 score 𝑤𝐹𝛽 and mean absolute error 𝑀 𝐴𝐸 as [149]. 4.4.1 GOD Results Quantitative Results. Tab. 4.2 shows the results of applying PrObeD on GOD networks. PrObeD improves the average precision of all three detectors. The performance gain is significant for Faster R-CNN. As Faster R-CNN is an older detector, it was at a worse minima to start with. PrObeD improves the convergence weight of Faster R-CNN by a significant margin, thereby improving the performance. We further experiment with two variations of Faster R-CNN, namely, Faster R-CNN + FPN and Sparse-RCNN. We observe an increase in the performance of both detectors. PrObeD also improves newer detectors like YOLOv5 and DeTR, although the gains are smaller compared to Faster R-CNN. We believe this happens because the newer detectors leave little room for improvement due to which PrObeD improves the performance slightly. We next compare PrObeD with a work that leverage segmentation map as a mask for object detection. We compare 59 Table 4.4 Performance comparison with proactive works. MaLP [7] has a significantly deteriorated performance than PrObeD. CAMO Method COD10K E𝑚 ↑ S𝑚 ↑ wF𝛽 ↑ MAE↓ E𝑚 ↑ S𝑚 ↑ wF𝛽 ↑ MAE↓ E𝑚 ↑ S𝑚 ↑ wF𝛽 ↑ MAE↓ MaLP [7] 0.474 0.514 0.218 0.254 0.491 0.520 0.150 0.202 0.503 0.548 0.228 0.222 PrObeD 0.871 0.797 0.702 0.071 0.869 0.803 0.661 0.037 0.900 0.838 0.755 0.049 NC4K Table 4.5 Ablation studies of PrObeD using Faster R-CNN GOD on MS-COCO 2017 dataset. Removing the encoder/decoder network or adding the template results in degraded performance. Changed Template From−⊲To AP ↑ AP50 ↑ AP75 ↑ AP𝑆 ↑ AP𝑀 ↑ AP𝐿 ↑ 39.5 Image Dependent−⊲Fixed 17.6 39.4 Image Dependent−⊲Universal 19.4 24.1 25.2 Yes−⊲No 39.1 19.2 51.1 31.7 15.1 17.1 26.2 20.1 33.3 15.4 18.0 26.6 17.9 35.5 1.3 1.9 5.3 1.7 11.0 37.9 42.6 46.1 42.3 52.6 Decoder Transformation Multiply−⊲Add - PrObeD our performance with Mask R-CNN [124], which uses an image segmentation branch to help with object detection. Tab. 4.2 shows that the gains using Mask R-CNN are lower than using our proactive wrapper. Qualitative Results. Fig. 4.3 shows qualitative results for the MS-COCO 2017 dataset. PrObeD clearly improves the performance of pretrained Faster R-CNN for three types of errors: Missed predictions, false negatives, and localization errors. PrObeD has a lower number of missed predictions, fewer false positives, and better bounding box localization. We also visualize the generated and recovered templates. We see that the template has object semantics of the input images. When the template is multiplied with the input image, it highlights the foreground objects, thereby making the task of object detector easier. Error Analysis. We show the error analysis [23] for GOD section 4 of the supplementary. 
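For reference, the GOD numbers in Tab. 4.2 follow the standard COCO protocol. Assuming the detections of a wrapped detector have been exported in COCO JSON format, they can be scored with pycocotools as sketched below; the file names are placeholders, not artifacts released with this work.

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder paths: MS-COCO 2017 val annotations and detector outputs in COCO JSON format.
coco_gt = COCO("annotations/instances_val2017.json")
coco_dt = coco_gt.loadRes("probed_detections.json")

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()

# evaluator.stats begins with [AP, AP50, AP75, AP_S, AP_M, AP_L], i.e., the
# columns reported in Tab. 4.2.
print("AP:", evaluator.stats[0], "AP50:", evaluator.stats[1], "AP75:", evaluator.stats[2])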
We observe that all GOD detectors make mistakes mainly due to five types of errors: classification, localization, duplicate detection, background detection, and missed detection. The main reason for the degraded performance is the errors in which the foreground-background boundary is missed. These errors include localization, background detection, and missed detection. Our proactive wrapper significantly corrects these errors, as the template has object semantics, which, when multiplied with the input image, highlights the foreground objects, consequently simplifying the task of object detection. 60 Figure 4.3 Qualitative GOD Results on MS-COCO 2017 dataset. (a) ground-truth annotations, (b) Faster R-CNN [260] predictions, (c) Faster R-CNN [260]+ PrObeD predictions, (d) generated template, and (e) recovered template. We highlight the objects responsible for improvement in (c) as compared to (b). The yellow box represents better localization, the blue box represents false positives, and the red box represents missed predictions. PrObeD improves on all these errors made by (b). Figure 4.4 Qualitative COD Results on CAMO, COD10K, and NC4K datasets from top to bottom, after applying PrObeD. (a) input images, (b) ground-truth camouflaged map, (c) DGNet[149] predictions, (d) DGNet[149]+ PrObeD predictions, (e) generated PrObeD template, and (f) recovered PrObeD template. PrObeD template has the semantics of the camouflaged object, which aids DGNet in detection. 4.4.2 COD Results Quantitative Results. Tab. 4.3 shows the result of applying PrObeD to DGNet [149] on three different datasets. PrObeD, when applied on top of DGNet, outperforms DGNet on all four metrics for all datasets. The biggest gain appears in COD10K and NC4K datasets. This is impressive as these datasets have more diverse testing images than CAMO. As NC4K is only a testing set, the higher performance of PrObeD demonstrates its superior generalizability as compared to DGNet [149]. This result agrees with the observation in [6, 7], where proactive-based approaches 61 Table 4.6 Ablation of training iterations on Faster R-CNN. YOLOv5, and DeTR for more iterations similar to after applying PrObeD. We also report the inference time for all the detectors before and after applying PrObeD. Training object detectors proactively with PrObeD results in more performance gain compared to training passively for more iterations. PrObeD adds an overhead cost on top of the inference cost of detectors. Method Faster R-CNN [260] Faster R-CNN [260] Faster R-CNN [260] + PrObeD YOLOv5 [254] YOLOv5 [254] YOLOv5 [254] + PrObeD DeTR [32] DeTR [32] DeTR [32] + PrObeD Iterations AP ↑ AP50 ↑ AP75 ↑ AP𝑆 ↑ AP𝑀 ↑ AP𝐿 ↑ 39.3 16.9 41.2 21.5 51.1 33.3 62.3 53.1 62.4 53.0 62.6 53.5 61.0 44.1 61.1 44.0 61.3 44.4 17.9 20.3 35.5 54.4 54.7 55.1 45.8 45.9 46.0 1.8 3.3 11.0 31.8 31.8 32.0 20.3 20.1 20.4 42.5 46.6 52.6 67.6 67.7 67.9 62.3 62.4 62.6 19.3 20.1 31.7 48.9 48.8 49.4 41.9 41.9 42.1 1× 2× 2× 1× 2× 2× 1× 2× 2× Time (𝑚𝑠) 161.1 175.3 (↑ 8.7%) 48.5 62.7 (↑ 29.1%) 194.2 208.4 (↑ 7.2%) exhibit improved generalization on manipulation detection and localization tasks. Qualitative Results. Fig. 4.4 visualizes the predicted camouflaged map for DGNet before and after applying PrObeD on testing samples of all three datasets. PrObeD improves the predicted camouflaged map, with less blurriness along the boundaries and better localization of the camouflaged object. 
As observed before for GOD, the generated and recovered template has the semantics of the camouflaged objects, which after multiplication intensifies the foreground object, resulting in better segmentation by DGNet. 4.4.3 Ablation Study Comparison with Proactive Works. The prior proactive works perform a different task of image manipulation detection and localization. Therefore, these works are not directly comparable to our proposed proactive wrapper, which performs a different task of object detection as described in Tab. 4.1. However, manipulation localization and COD both involve a prediction of a localization map, segmentation, and fakeness map, respectively. This inspires us to experiment with MaLP [7] for the task of COD. We train the localization module of MaLP supervised with the COD datasets. The results are shown in Tab. 4.4. We see that MaLP is not able to perform well for all three datasets. MaLP is designed for estimating universal templates rather than templates tailored to specific images. It shows the significance of image-specific templates in object detection. While MaLP’s design with image-independent templates is effective for localizing image manipulation, 62 applying it to object detection has a negative impact on performance. Framework Design. PrObeD consists of blocks to improve the object detector. Tab. 4.5 ablates different versions of PrObeD to highlight the importance of each block in our design. PrObeD utilizes an encoder network E to learn image-dependent templates aiding the detector. We remove the encoder E from our network, replacing it with a fixed template. We observe that the performance deteriorates by a large margin. Next, we make this template learnable as proposed in PrObeD, but only a single template would be used for all the input images. This choice also results in worse performance, highlighting that image-dependent templates are necessary for object detection. Finally, we remove the decoder network D, which is used to recover the template from the encrypted images. Although this results in a better performance than the pretrained Faster R-CNN, we observe a drop as compared to PrObeD. Therefore, as discussed in Sec. 4.3.3, the recovery of templates is indeed a necessary and beneficial step for boosting the performance of the proactive schemes. Encryption Process. PrObeD includes an encryption process as described in Eq. (4.3), which involves multiplying the template with the input image. This process makes the template act as a mask, highlighting the foreground for better detection. However, prior proactive works [7, 6] consider adding templates to achieve better results. Thus, we ablate by changing the encryption process to template addition. Tab. 4.5 shows that template addition degrades performance by a significant margin w.r.t. our multiplication scheme. This shows that encryption is a key step in formulating proactive schemes, and the same encryption process may not work for all tasks. More Training Time. We perform an ablation to show that the performance gain of the detector is due to our proactive wrapper instead of training for more iterations of the pretrained object detector. Results in Tab. 4.6 show that although more training iterations for the detector has a performance gain, it’s not enough to get the significant margin in performance as achieved by PrObeD. This shows that extra training can help, but only up to a certain extent. Inference Time. 
We evaluate the overhead computational cost after applying PrObeD on different object detectors are shown in Tab. 4.6, averaged across 1, 000 images, on a NVIDIA 𝑉100 GPU. 63 Our encoder network has 17 layers, which adds extra cost for inference. For detectors with bulky architectures like Faster R-CNN (ResNet101) and DeTR (transformer), the overhead computational cost is quite small, 8.7% and 7.2%, respectively. This additional cost is minor compared to the performance gain of detectors, especially Faster R-CNN. For a lighter detector like YOLOv5, our overhead computational cost increases to 29.1%. So, there is a trade-off of applying PrObeD to different detectors with varied architectures. PrObeD is more beneficial to bulky detectors like two-staged/transformer-based as compared to one-stage detectors. 4.5 Conclusion We mathematically prove that the proactive method results in a better-converged model than the passive detector under assumptions and, consequently, a better 2D object detector. Based on this finding, we propose a proactive scheme wrapper, PrObeD, which enhances the performance of camouflaged and generic object detectors. The wrapper outputs an image-dependent template using an encoder network, which encrypts the input images. These encrypted images are then used to fine-tune the object detector. Extensive experiments on MS-COCO, CAMO, COD10K, and NC4K datasets show that PrObeD improves the overall object detection performance for both GOD and COD detectors. Limitations. Our proposed scheme has the following limitations. First, PrObeD does not provide a significant gain for recent object detectors such as YOLO and DeTR. Second, the proactive wrapper should be thoroughly tested on other object detectors to show the generalizability of PrObeD. Finally, we only experiment with simple multiplication and addition as the encryption scheme. A more sophisticated encryption process might further improve the object detectors’ performance. We leave these for our future avenues. 64 CHAPTER 5 PROMARK: PROACTIVE DIFFUSION WATERMARKING FOR CAUSAL ATTRIBUTION Generative AI (GenAI) is transforming creative workflows through the capability to synthesize and manipulate images via high-level prompts. Yet creatives are not well supported to receive recognition or reward for the use of their content in GenAI training. To this end, we propose ProMark, a causal attribution technique to attribute a synthetically generated image to its training data concepts like objects, motifs, templates, artists, or styles. The concept information is proactively embedded into the input training images using imperceptible watermarks, and the diffusion models (unconditional or conditional) are trained to retain the corresponding watermarks in generated images. We show that we can embed as many as 216 unique watermarks into the training data, and each training image can contain more than one watermark. ProMark can maintain image quality whilst outperforming correlation-based attribution. Finally, several qualitative examples are presented, providing the confidence that the presence of the watermark conveys a causative relationship between training data and synthetic images1. 5.1 Introduction GenAI is able to create high-fidelity synthetic images spanning diverse concepts, largely due to advances in diffusion models, e.g. DDPM [132], DDIM [216], LDM [264]. 
GenAI models, particularly diffusion models, have been shown to closely adopt and sometimes directly memorize the style and the content of different training images – defined as “concepts” in the training data [33, 172]. This leads to concerns from creatives whose work has been used to train GenAI. Concerns focus upon the lack of a means for attribution, e.g.recognition or citation, of synthetic images to the training data used to create them and extend even to calls for a compensation mechanism (financial, reputational, or otherwise) for GenAI’s derivative use of concepts in training images contributed by creatives. 1Vishal Asnani, John Collomosse, Tu Bui, Xiaoming Liu, and Shruti Agarwal. "ProMark: Proactive Diffusion Watermarking for Causal Attribution." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024 65 Figure 5.1 Causative vs.correlation-based matching for concept attribution. ProMark identifies the training data most responsible for a synthetic image (‘attribution’). Correlation-based matching doesn’t always perform the data attribution properly. We propose ProMark, which is a proactive approach involving adding watermarks to training data and recovering them from the synthetic image to perform attribution in a causative way. We refer to this problem as concept attribution – the ability to attribute generated images to the training concept/s which have most directly influenced their creation. Several passive techniques have recently been proposed to solve the attribution problem [11, 269, 328]. These approaches use visual correlation between the generated image and the training images for attribution. Whilst they vary in their method and rationale for learning the similarity embedding – all use some forms of contrastive training to learn a metric space for visual correlation. We argue that although correlation can provide visually intuitive results, a measure of similarity is not a causative answer to whether certain training data is responsible for the generation of an image or not. Further, correlation-based techniques can identify close matches with images that were not even present in the training data. Keeping this in mind, we explore an intriguing field of research which is developing around proactive watermarking methodologies [356, 267, 325, 7], that employ signals, termed templates 66 to encrypt input images before feeding them into the network. These works have integrated and subsequently retrieved templates to bolster the performance of the problem at hand. Inspired by these works, we introduce ProMark, a proactive watermarking-based approach for GenAI models to perform concept attribution in a causative way. The technical contributions of ProMark are three-fold: 1. Causal vs. Correlation-based Attribution. ProMark performs causal attribution of synthetic images to the predefined concepts in the training images that influenced the generation. Unlike prior works that visually correlate synthetic images with training data, we make no assumption that visual similarity approximates causation. ProMark ties watermarks to training images and scans for the watermarks in the generated images, enabling us to demonstrate rather than approximate/imply causation. This provides confidence in grounding downstream decisions such as legal attribution or payments to creators. 2. Multiple Orthogonal Attributions. 
We propose to use orthogonal invisible watermarks to proactively embed attribution information into the input training data and add a BCE loss during the training of diffusion models to retain the corresponding watermarks in the generated images. We show that ProMark causatively attributes as many as 216 unique training-data concepts like objects, scenes, templates, motifs, and style, where the generated images can simultaneously express one or two orthogonal concepts. 3. Flexible Attributions. ProMark can be used for training conditional or unconditional diffusion models and even finetuning a pre-trained model for only a few iterations. We show that ProMark’s causative approach achieves higher accuracy than correlation-based attribution over five diverse datasets (Sec. 5.4.1): Adobe Stock, ImageNet, LSUN, Wikiart, and BAM while preserving synthetic image quality due to the imperceptibility of the watermarks. Fig. 5.1 presents our scenario, where synthetic image(s) are attributed back to the most influential GenAI training images. Correlation-based techniques [11, 328] try to match the high-level image structure or style. Here, the green-lizard synthetic image is matched to a generic green image without a lizard [11]. With ProMark’s causative approach, the presence of the green-lizard watermark in 67 Table 5.1 Comparison of ProMark with prior works. Uniquely, we perform causative attribution using proactive watermarking to attribute multiple concepts. [Keys: emb.: embedding, obj.: object, own.: ownership, sem.: semantic, sty.: style, wat.: watermark]. Method Scheme Task Match # Class Multiple Attribution type type attribution emb. passive attribution emb. passive attribution emb. passive wat. passive wat. passive wat. passive wat. proactive proactive wat. proactive localization wat. proactive obj. detect detect detect detect detect detect - [269] [11] [328] [90] [198] [62] [325] [6] [7] [5] ProMark proactive attribution wat. - - 693 2 2 2 2 2 2 90 216 attribution ✕ ✕ ✕ ✕ ✕ ✕ - - - - ✓ type sty. obj. sty., obj. - - - - - - - sty., obj. own., sem. the synthetic image will correctly indicate the influence of the similarly watermarked concept group of lizard training images. 5.2 Related Works Passive Concept Attribution. Concept attribution differs from model [28] or camera [37] attribution in that the task is to determine the responsible training data for a given generation. Existing concept attribution techniques are passive – they do not actively modify the GenAI model or training data but instead, measure the visual similarity (under some definition) of synthetic images and training data to quantify attribution for each training image. EKILA [11] proposes patch-based perceptual hashing (visual fingerprinting [224, 19]) to match the style of the query patches to the training data for attribution. Wang et al. [328] finetune semantic embeddings like CLIP, DINO, etc.for the attribution task. Both [11] and [328] explore ALADIN [269] for style attribution. ALADIN is a feature representation for fine-grained style similarity learned using a weakly supervised approach. All these works are regarded as passive approaches as they take the image as an attribute by correlating between generated and training image styles. Instead, our approach is a proactive scheme that adds a watermark to training images and performs attribution in a causal manner (Tab. 5.1). 68 Figure 5.2 Overview of ProMark. We show the training and inference procedure for our proposed method. 
Our training pipeline involves two stages, image encryption and generative model training. We convert the bit-sequences to spatial watermarks (𝑾), which are then added to the corresponding concept images (𝑿) to make them encrypted (𝑿𝑊 ). The generative model is then trained with the encrypted images using the LDM supervision. During training, we recover the added watermark using the secret decoder (D𝑆) and apply the BCE supervision to perform attribution. To sample newly generated images, we use a Gaussian noise and recover the bit-sequences using the secret decoder to attribute them to different concepts. Best viewed in color. Proactive Schemes. Proactive schemes involve adding a signal/perturbation onto the input images to benefit different tasks like deepfake tagging [325], deepfake detection [6], manipulation localization [7], object detection [5], etc.. Some works [356, 267] disrupt the output of the generative models by adding perturbations to the training data. Alexandre et al. [270] tackles the problem of training dataset attribution by using fixed signals for every data type. These prior works successfully demonstrate the use of watermarks to classify the content of the AI-generated images proactively. We extend the idea of proactive watermarking to perform the task of causal attribution of AI-generated images to influential training data concepts. Watermarking has not been used to trace attribution in GenAI before. Watermarking of GenAI Models. It is an active research to watermark AI-generated images for the purpose of privacy protection. Fernandez1 et al. [90] fine-tune the LDM’s decoder to condition on a bit sequence, embedding it in images for AI-generated image detection. Kirchenbauer et al. [165] propose a watermarking method for language models by pre-selecting random tokens 69 and subtly influencing their use during word generation. Zhao et al. [380] use a watermarking scheme for text-to-image diffusion models, while Liu et al. [198] verify watermarks by pre-defined prompts. [62, 241] add a watermark for detecting copyright infringement. Asnani et al. [8] reverse engineer a fingerprint left by various GenAI models to further use it for recovering the network and training parameters of these models [354, 8]. Finally, Cao et al. [31] adds an invisible watermark for protecting diffusion models which are used to generate audio modality. Most of these works have used watermarking for protecting diffusion models, which enables them to add just one watermark onto the data. In contrast, we propose to add multiple watermarks to the training data and to a single image, which is a more challenging task than embedding a universal watermark. 5.3 Method 5.3.1 Background Diffusion Models. Diffusion models learn a data distribution 𝑝( 𝑿), where 𝑿 ∈ Rℎ×𝑤×3 is the input image. They do this by iteratively reducing the noise in a variable that initially follows a normal distribution. This can be viewed as learning the reverse steps of a fixed Markov Chain with a length of 𝑇. Recently, LDM [264] is proposed to convert images to their latent representation for faster training in a lower dimensional space than the pixel space. The image is converted to and from the latent space by a pretrained autoencoder consisting of an encoder 𝒛 = E𝐿 ( 𝑿) and a decoder 𝑿 𝑅 = D𝐿 (𝒛), where 𝒛 is the latent code and 𝑿 𝑅 is the reconstructed image. The trainable denoising module of the LDM is 𝜖𝜃 (𝒛𝑡, 𝑡); 𝑡 = 1...𝑇, where 𝜖𝜃 is trained to predict the denoised latent code ˆ𝒛 from its noised version 𝒛𝑡. 
This objective function can be defined as follows:

L_{LDM} = \mathbb{E}_{\mathcal{E}_L(\boldsymbol{X}),\, \epsilon \sim \mathcal{N}(0,1),\, t} \big[ \| \epsilon - \epsilon_\theta(\boldsymbol{z}_t, t) \|_2^2 \big],   (5.1)

where 𝜖 is the noise added at step 𝑡.

Image Encryption. Proactive works [6, 7, 5] have shown performance gains on various tasks by proactively transforming the input training images 𝑿 with a watermark, resulting in an encrypted image. This watermark is either fixed or learned, depending on the task at hand. Similar to prior proactive works, our image encryption is of the form:

\boldsymbol{X}_W = \mathcal{T}(\boldsymbol{X}; \boldsymbol{W}) = \boldsymbol{X} + m \times R(\boldsymbol{W}, h, w),   (5.2)

where T is the transformation, 𝑾 is the spatial watermark, 𝑿_W is the encrypted image, 𝑚 is the watermark strength, and 𝑅(.) resizes 𝑾 to the input resolution (ℎ, 𝑤). We use the state-of-the-art watermarking technique RoSteALS [27] to compute the spatial watermarks for encryption due to its robustness to image transformations and its generalization (the watermark is independent of the content of the input image). RoSteALS is designed to embed a secret of 𝑏 bits into an image using robust and imperceptible watermarking. It comprises a secret encoder E_S(𝒔), which converts the bit-secret 𝒔 ∈ {0, 1}^b into a latent code offset 𝒛_o. This offset is then added to the latent code of an autoencoder, 𝒛_w = 𝒛 + 𝒛_o. The modified latent code 𝒛_w is then used to reconstruct a watermarked image via the autoencoder's decoder. Finally, a secret decoder, denoted by D_S(𝑿_W), takes the watermarked images as input and predicts the bit-sequence ˆ𝒔.

5.3.2 Problem Definition
Let C = {𝑐_1, 𝑐_2, . . . , 𝑐_N} be a set of 𝑁 distinct concepts within a dataset that is used for training a GenAI model for image synthesis. The problem of concept attribution can be formulated as follows: Given a synthetic image 𝑿_S generated by a GenAI model, the objective of concept attribution is to accurately associate 𝑿_S with a concept 𝑐_i ∈ C that significantly influenced the generation of 𝑿_S. We aim to find a mapping 𝑓 : 𝑿_S → 𝑐_i such that

c_i^* = \arg\max_{c_i \in \mathcal{C}} f(\boldsymbol{X}_S, c_i),   (5.3)

where 𝑐_i^* represents the concept most strongly attributed to image 𝑿_S.

5.3.3 Overview
The pipeline of ProMark is shown in Fig. 5.2. The principle is simple: if a specific watermark unique to a training concept can be detected in a generated image, it indicates that the generative model relied on that concept in the generation process. Thus, ProMark involves two steps: training data encryption via watermarks and generative model training with the watermarked images.

To watermark the training data, the dataset is first divided into 𝑁 groups, where each group corresponds to a unique concept that needs attribution. These concepts can be semantic (e.g., objects, scenes, motifs, or stock image templates) or abstract (e.g., stylistic or ownership information). Each training image in a group is encoded with a unique watermark without significantly altering the image's perceptibility. Once the training images are watermarked, they are used to train the generative model. As the model trains, it learns to generate images based on the encrypted training images. Ideally, the generated images would carry traces of the watermarks corresponding to the concepts they are derived from. During inference, ProMark confirms whether a generated image is derived from a particular training concept by identifying the unique watermark of that concept within the image. Through the careful use of unique watermarks, we can trace back and causally attribute generated images to their origin in the training dataset.
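The encryption step of Eq. (5.2) is a simple additive blend. A minimal sketch follows, assuming the image is a (B, 3, h, w) tensor, the concept watermark a (1, 1, H, W) tensor, and bilinear resizing standing in for R(.); the function name and defaults are ours.

import torch
import torch.nn.functional as F

def encrypt(image, watermark, strength=0.3):
    """Eq. (5.2): X_W = X + m * R(W, h, w). strength is the watermark strength m;
    0.3 is the value selected in the ablation of Sec. 5.4.5. The resize mode is an
    assumption of this sketch."""
    _, _, h, w = image.shape
    resized = F.interpolate(watermark, size=(h, w), mode="bilinear", align_corners=False)
    return image + strength * resized   # broadcast over batch and channels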
5.3.4 Training
During training, our algorithm is composed of two stages: image encryption and generative model training. We now describe each of these stages in detail.

Image Encryption. The training data is first divided into 𝑁 concepts, and the images in each partition are encrypted using a fixed spatial watermark 𝑾_j ∈ R^{h×w} (j ∈ {1, 2, ..., N}). Each watermark 𝑾_j is associated with a 𝑏-dim bit-sequence (secret) 𝒔_j = {p_{j1}, p_{j2}, ..., p_{jb}}, where p_{ji} ∈ {0, 1}. In order to compute the watermark 𝑾_j from the bit-sequence 𝒔_j, we encrypt 100 random images with 𝒔_j using the pretrained RoSteALS secret encoder E_S(.), which takes a secret of length 𝑏 = 160 as input. From these encrypted images, we obtain 100 noise residuals by subtracting the encrypted images from the originals, which are averaged to compute the watermark 𝑾_j as:

\boldsymbol{W}_j = \frac{1}{100} \sum_{i=1}^{100} \big( \boldsymbol{X}_i - \mathcal{E}_S(\boldsymbol{X}_i, \boldsymbol{s}_j) \big).   (5.4)

The above process is denoted as spatial noise conversion in Fig. 5.2. Averaging the noise residuals across different images reduces the image content in the watermark and makes the watermark independent of any specific image. Additionally, the generated watermarks are orthogonal because the bits of all 𝒔_j differ, ensuring that they are distinguishable from each other. With the generated watermarks, each training image is encrypted using Eq. (5.2) with the one of the 𝑁 watermarks that corresponds to the concept the image belongs to.

Generative Model Training. Using the encrypted data, we train the LDM's denoising module 𝜖_θ(.) using the objective function (Eq. (5.1)), where 𝒛_t is the noised version of:

\boldsymbol{z} = \mathcal{E}_L(\boldsymbol{X}_{W_j}) = \mathcal{E}_L(\mathcal{T}(\boldsymbol{X}; \boldsymbol{W}_j)),   (5.5)

i.e., the input latent codes 𝒛 are generated using the encrypted images 𝑿_{W_j} for j ∈ {1, 2, ..., N}. However, we found that using only the LDM loss is insufficient to successfully learn the connection between the conceptual content and its associated watermark. This gap in learning presents a significant hurdle, as the primary aim is to trace generated images back to their respective training concepts via the watermark. To tackle this, an auxiliary supervision is introduced to the LDM's training,

L_{BCE}(\boldsymbol{s}_j, \hat{\boldsymbol{s}}) = -\frac{1}{b} \sum_{i=1}^{b} \big[ p_{ji} \log(\hat{p}_i) + (1 - p_{ji}) \log(1 - \hat{p}_i) \big],   (5.6)

where L_{BCE}(.) is the binary cross-entropy (BCE) between the actual bit-sequence 𝒔_j associated with watermark 𝑾_j and the predicted bit-sequence ˆ𝒔. The denoised latent code ˆ𝒛 is decoded using the autoencoder D_L(.), and the embedded secret ˆ𝒔 is predicted by the secret decoder D_S(.) as:

\hat{\boldsymbol{s}} = \mathcal{D}_S(\mathcal{D}_L(\hat{\boldsymbol{z}})).   (5.7)

By employing BCE, the model is guided to minimize the difference between the predicted and the embedded watermark, hence improving the model's ability to recognize and associate watermarks with their respective concepts. Finally, our objective is to minimize the loss function L_{attr} = L_{LDM} + αL_{BCE} during training, where α is set to 2 for our experiments.

5.3.5 Inference
After the LDM learns to associate the watermarks with concepts, we use random Gaussian noise to sample newly generated images from the model. While the diffusion model creates these new images, it also embeds a watermark within them. Each watermark maps to a distinctive orthogonal bit-sequence associated with a specific training concept, serving as a covert signature for attribution. To attribute the generated images and ascertain which training concept influenced them, we predict the secret embedded by the LDM in the generated images (see Eq. (5.7)).
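The recovered bit-sequence is then matched against every concept's secret by counting agreeing bits, as formalized in Eqs. (5.8) and (5.9) just below. A minimal sketch, whose shapes and names are assumptions of this illustration:

import torch

def attribute(predicted_bits: torch.Tensor, concept_secrets: torch.Tensor) -> int:
    """Causal attribution by bit agreement: count matching bits between the recovered
    secret and every concept's secret, and return the index of the closest concept.
    predicted_bits: (b,) in {0, 1}; concept_secrets: (N, b) in {0, 1}."""
    agreement = (concept_secrets == predicted_bits.unsqueeze(0)).sum(dim=1)  # f(s_hat, s_j)
    return int(torch.argmax(agreement).item())                               # j*

# Toy example: b = 4 bits, N = 3 concepts.
secrets = torch.tensor([[0, 0, 1, 1], [1, 1, 0, 0], [1, 0, 1, 0]])
recovered = torch.tensor([1, 1, 0, 1])   # e.g., thresholded secret-decoder output
print(attribute(recovered, secrets))     # -> 1 (three of four bits agree)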
Given a predicted binary bit-sequence ˆ𝒔 = {p̂_1, p̂_2, ..., p̂_b} and all the input bit-sequences 𝒔_j for j ∈ {1, 2, ..., N}, we define the attribution function 𝑓 in Eq. (5.3) as:

f(\hat{\boldsymbol{s}}, \boldsymbol{s}_j) = \sum_{k=1}^{b} [\hat{p}_k = p_{jk}],   (5.8)

where [\hat{p}_k = p_{jk}] acts as an indicator function, returning 1 if the condition is true, i.e., the bits are identical, and 0 otherwise. Consequently, we assign the predicted bit-sequence to the concept whose bit-sequence it most closely mirrors, that is, the concept j* for which f(ˆ𝒔, 𝒔_{j*}) is maximized:

j^* = \arg\max_{j \in \{1, 2, ..., N\}} f(\hat{\boldsymbol{s}}, \boldsymbol{s}_j).   (5.9)

In other words, the concept whose watermark is most closely aligned with the generated image's watermark is deemed to be the influencing source behind the generated image.

5.3.6 Multiple Watermarks
In prior image attribution works, an image is usually attributed to a single concept (e.g., image content or image style). However, in real-world scenarios, an image may encapsulate multiple concepts. This observation brings forth a pertinent question: "Is it possible to use multiple watermarks for multi-concept attribution within a single image?" In this paper, we propose a novel approach to perform multi-concept attribution by embedding multiple watermarks into a single image. In our preliminary experiments, we restrict our focus to the addition of two watermarks. To achieve this, we divide the image into two halves and resize each watermark to fit the respective half. This ensures that each half of the image carries distinct watermark information pertaining to a specific concept. For an input RGB image 𝑿 and watermarks {𝑾_i, 𝑾_j} for the two secrets {𝒔_i, 𝒔_j}, we formulate the new transformation T as:

\mathcal{T}(\boldsymbol{X}; \boldsymbol{W}_i, \boldsymbol{W}_j) = \{ \boldsymbol{X}_{left}, \boldsymbol{X}_{right} \} = \Big\{ \boldsymbol{X}(:, 0\!:\!\tfrac{w}{2}, :) + R(\boldsymbol{W}_i, h, \tfrac{w}{2}), \; \boldsymbol{X}(:, \tfrac{w}{2}\!:\!w, :) + R(\boldsymbol{W}_j, h, \tfrac{w}{2}) \Big\},

where {.} denotes horizontal concatenation. The loss function uses the two predicted secrets (ˆ𝒔_1 and ˆ𝒔_2) recovered from the two halves of the generated image, and is defined as:

L_{attr} = L_{LDM} + \alpha \big( L_{BCE}(\boldsymbol{s}_i, \hat{\boldsymbol{s}}_1) + L_{BCE}(\boldsymbol{s}_j, \hat{\boldsymbol{s}}_2) \big).

5.4 Experiments
5.4.1 Unconditional Diffusion Model
In this section, we train multiple versions of unconditional diffusion models [264] to demonstrate that ProMark can be used to attribute a variety of concepts in the training data. In each case, the model is trained starting from a random initialization of the LDM weights. Described next are the details of the datasets and evaluation protocols.

Datasets. We use 5 datasets spanning attribution categories like image templates, scenes, objects, styles, and artists. For each dataset, we consider the dataset classes as our attribution categories. For each class in a dataset, we use 90% of the images for training and 10% for evaluation, unless specified otherwise.
1. Stock: We collect images from Adobe Stock, comprising near-duplicate image clusters like templates, symbols, icons, etc. An example image from some clusters is shown in the supplementary. We use 100 such clusters, each with 2𝐾 images.
2. LSUN: The LSUN dataset [363] comprises 10 scene categories, such as bedrooms and kitchens. It is commonly used for scene classification, training generative models like GANs, and anomaly detection. As with the Stock dataset, we use 2𝐾 images per class.
3. Wiki-S: The WikiArt dataset [296] is a collection of fine art images spanning various styles and artists. We use the 28 style classes, with 580 images per class on average.
75 Table 5.2 Comparison with prior works for unconditional diffusion model on various datasets. [Keys: str.: strength]. Method ALADIN [269] CLIP [251] F-CLIP [328] SSCD [246] EKILA [11] ProMark Str. (%) - - - - - 30 100 Attribution Accuracy (%) ↑ Stock LSUN Wiki-A Wiki-S 33.25 99.86 60.84 75.67 60.43 78.49 50.37 99.63 37.06 99.37 98.12 100 100 100 48.95 77.58 77.23 69.51 51.23 97.45 100 46.27 87.13 87.39 73.26 70.60 95.12 100 ImageNet 9.25 60.12 62.83 37.32 38.00 83.06 91.07 4. Wiki-A: From the WikiArt dataset [296] we also use the 23 artist classes with 2, 112 average images per class. 5. ImageNet: We use the ImageNet dataset [71] which comprises of 1 million images across 1𝐾 classes. For this dataset, we use the standard validation set with 50𝐾 for evaluation and the remaining images for training. Evaluation Protocol For all datasets, the concept attribution performance is tested on the held-out data as follows. For a held-out image, we first encrypt it with the concept’s watermark. Then using the latent code of the encrypted image, we noise it till a randomly assigned timestamp and apply our trained diffusion model to reverse back to the initial timestamp with the estimated noise. The denoised latent code is then decoded using the autoencoder D𝐿 (.), and the embedded secret is predicted using the secret decoder D𝑆 (.). Using Eq. (5.9), we compute the predicted concept and calculate the accuracy using the ground-truth concept. Results Shown in Tab. 5.2 is the attribution accuracy of ProMark at two watermark strengths i.e. 100% and 30% which is set by variable 𝑚 in Eq. (5.2). ProMark outperforms prior works, achieving near-perfect accuracy on all the datasets when the watermark strength is 100%. However, the watermark introduces visual artifacts [27] if the watermark strength is full. Therefore, we decrease the watermark strength to 30% before adding it to the training data (see Sec. 5.4.5 for ablation on watermark strength). Even though our performance drops at a lower watermark strength, we still outperform the prior works. This shows that our causal approach can be used to attribute a 76 Figure 5.3 Example training and newly sampled images of different datasets for the corresponding classes. We observe a similar content in the inference image compared with the training image of the predicted class. variety of concepts in the training data with an accuracy higher than the prior passive approaches. Fig. 5.3 (rows 1-5) shows the qualitative examples of the newly sampled images from each of the trained models. For each model, we sample the images using random Gaussian noise until we have images for every concept. The concept for each image is predicted using the secret embedded in the generated images. Shown in each row of Fig. 5.3 are three training images (columns 1-3) and three sampled images from the corresponding concepts (columns 4-6). This shows that ProMark makes the diffusion model embed the corresponding watermark for the class of the generated image, thereby demonstrating the usefulness of our approach. Shown in Fig. 5.4 are the nearest images retrieved using the embedding-based methods (row (2)-(6)) for the query images from the ImageNet (row (1)). For each image retrieval, we highlight 77 Figure 5.4 Visual results of prior embedding-based works. We show the image of the closest matched embedding for each method on ImageNet. We highlight images green for correct attribution, otherwise red. Embedding-based works do not always attribute to the correct concept. 
the correct/incorrect attribution using a green/red box. As we can see, the correlation-based prior techniques rely on visual similarity between the query and the retrieved images, ignoring the concept. However, for each query image, ProMark predicts the correct concept corresponding to the query image (Fig. 5.3). 78 Table 5.3 Multi-concept attribution comparison with baselines. Method ALADIN [269] CLIP [251] F-CLIP [328] SSCD [246] EKILA [11] ProMark (single) ProMark (multi) Strength (%) - - - - - 30 30 100 Attribution Accuracy (%) ↑ Media Content Combined 42.16 46.71 52.12 47.06 43.72 - 91.33 95.61 41.25 45.12 51.56 46.09 43.58 - 89.21 93.31 34.97 42.36 46.23 40.61 37.09 97.73 84.66 90.12 5.4.2 Multiple Watermarks We evaluate the effectiveness of ProMark for multi-concept attribution. As before, an unconditional diffusion model is trained starting from random initialization, and each image in the training data is encrypted with two watermarks as outlined in Sec. 5.3.6. Dataset For this experiment, we use the BAM dataset [337], comprising contemporary artwork sourced from Behance, a platform hosting millions of portfolios by professionals and artists. This dataset uniquely categorizes each image into two label types: media and content. It encompasses 7 distinct labels for media and 9 for content, culminating in a diverse set of 63 label pairs, with 4, 593 average images in these label pairs. For each class pair, we use 90% data for training and 10% for held-out evaluation. Results The same evaluation is performed as described in Sec. 5.4.1, except the accuracy is now computed for two concepts instead of one. Shown in Tab. 5.3 is the attribution accuracy for the two concepts individually and simultaneously. To benchmark the effectiveness of ProMark, we also compare against baselines, where ProMark outperforms baselines, achieving a combined attribution accuracy of 90.12% as compared to 46.61% for F-CLIP [328]. We believe our findings substantiate that ProMark can be extended to a scenario where the generated images are composed of several unique concepts from the training images. For ablation, we train ProMark with 7 × 8 classes, with each pair of media and content as an individual concept. ProMark is able to achieve 97.73% attribution accuracy for single-concept, higher than the performance achieved for multi-concept case i.e.90.12%. However, single concept approach is not scalable when the number of concepts 79 Table 5.4 Comparison with different baselines for the conditional model trained on ImageNet dataset. Strength Attribution Accuracy (%) ↑ Held-out data New images Method ALADIN [269] CLIP [251] F-CLIP [328] SSCD [246] EKILA [11] ProMark (%) - - - - - 30 100 9.25 60.12 62.83 37.32 38.00 91.24 95.60 0.18 41.01 50.19 30.10 29.06 87.30 90.13 in an image increases, as the number of watermarks would grow exponentially (7 × 8 vs.7 + 8). Therefore, transitioning to a multi-concept scenario is more appropriate for real-world scenarios, where scalability and practicality are crucial. In the final row of Fig. 5.3, we present qualitative examples of newly sampled images from the model trained on the BAM dataset. Observations indicate that these sampled images successfully adopt both media and content corresponding to training images of the same concept. This provides empirical evidence of ProMark’s effectiveness in facilitating multi-concept attribution. 5.4.3 Number of Concepts AI models leverage large-scale image datasets [264, 20, 132, 216], encompassing a broad spectrum of concepts. 
This diversity necessitates concept attribution methods that can maintain high performance across numerous concepts. In this context, we test ProMark with an exponentially increasing number of concepts. Our dataset comprises Adobe Stock images with near duplicate image templates (used as concepts). As we escalate the number of concepts, we concurrently reduce the per-concept image count, only 24 images per concept for 216 concepts, see the red curve of Fig. 5.5 (a) for image count. This is done to obtain balanced image distribution and also to challenge ProMark’s robustness. The outcomes, depicted in Fig. 5.5(a) red curve, indicate an anticipated decline in ProMark’s efficacy in line with the increase in the number of concepts, reducing from 100% attribution accuracy for 10 concepts (chance accuracy 10%) to 82% for 216 concepts (chance accuracy 1.5𝑒-3%). This reduction in attribution accuracy is correlated with the reduction in bit-secret 80 Figure 5.5 Ablation experiments: We show the results for ablating multiple parameters of ProMark. (a) Number of concepts, (b) watermark strength, and (c) number of images per concept. accuracy (green curve) for every predicted secret, indicating poor watermark recovery due to the increased confusion between the watermarks. Notwithstanding the increased difficulty, ProMark demonstrates commendable performance, underscoring its potential in real-world applications. 5.4.4 Conditional Diffusion Model As the diffusion models are usually trained with conditions to guide generation, we also evaluate using the conditional LDM model [264]. For this, we fine-tune a model pretrained of the ImageNet dataset (see Sec. 5.4.1), where the 1000 ImageNet classes are used as model conditions and also as the 1000 concepts. Evaluation Protocol In addition to the evaluation on the held-out data (see Sec. 5.4.1), we also perform the quantitative evaluation on the newly sampled images as follows. We use the labels of the ImageNet dataset as conditions to sample 10𝐾 images (10 images per label). Using these labels as the ground-truth concept for a newly sampled image, we compute the accuracy of the concept predicted by the embedded watermark in the generated images. Results The accuracies for held-out and newly sampled images are shown in Tab. 5.4. The performance on the held-out dataset for the conditional model improves compared to the unconditional models as the label conditions provide improved supervision for correct watermarks. ProMark also outperforms prior embedding-based works by a large margin on both held-out and newly sampled images. The attribution accuracy on the new images, however, is less than the held-out data. We hypothesize that it is because newly sampled images may contain more than one concept and can be more confusing to attribute. The high accuracy, even for newly sampled images, suggests that ProMark exhibits higher generalizability to unseen synthetic images. 81 5.4.5 Ablation Study For the ablation experiment, we use Stock dataset with a varying number of concepts, and we train unconditional LDM models from random initialization. Strength of Watermark. The hyperparameter 𝑚 in Eq. (5.2) modulates the intensity of the watermark applied to the training images, ensuring encrypted images retain high quality. We systematically alter 𝑚 to examine its impact on the LDM’s performance and the Peak Signal- to-Noise Ratio (PSNR) of the output images with reference to the held-out encrypted images. Fig. 
5.5(b) shows that attribution accuracy improves with increased 𝑚, plateauing beyond a threshold of 0.5. The discernible compromise in image quality, as evidenced by the inverse relationship between intensity and PSNR, can be attributed to the use of fixed watermarks obtained using RoSteALS [27], which is originally optimized for robustness. In light of this, we select an optimal watermark strength of 0.3, which balances between performance and PSNR. We measured the FID between original and newly sampled images from a pretrained ImageNet conditional model (trained without watermark) and ProMark model (trained with watermark), which is 13.28 and 17.63 respectively. This small increment shows negligible quality loss in the generated images due to ProMark. Number of Images Per Concept. To ascertain the optimal number of images required per concept for effective watermark learning, we ablate by fixing the number of concepts to 500 and varying the number of images used to train the LDM. Fig. 5.5(c) reveals that performance drops by 2.5% when image count per concept is reduced from 700 to 10. Remarkably, the general efficacy of ProMark remains consistently high, suggesting a low sensitivity to the image count per concept. These results demonstrate that ProMark can successfully learn watermarks with as few as 10 images per concept, highlighting its efficiency and potential for applications with limited data availability. Framework Design. ProMark employs BCE loss to instruct the LDM model in the accurate embedding of bit-sequence watermarks within generated images. The attribution performance degrades to 2% when BCE loss is not used as compared to 100% in Tab. 5.2. This shows that removing BCE loss significantly impairs the LDM’s performance, underscoring the necessity of 82 this supervision in helping LDM embed watermarks effectively. Also, ProMark incorporates a secret decoder to retrieve secret bit-sequence from synthesized images, rendering the process contingent upon the pretrained secret decoder. In contrast, prior works [6, 7, 5] recover watermarks by training a dedicated decoder with the main model in an end-to-end fashion. To ablate this alternative approach, we train a standard decoder along with LDM by optimizing for the cosine similarity between the embedded and extracted watermarks. We see a degradation in performance from 100% to 80.56%, indicating that the pretrained secret decoder is a better choice for our approach. This is due to the increased complexity of predicting watermarks of resolution 2562 as compared to 160-bit sequence from the encrypted images. 5.5 Conclusion We introduce a novel proactive watermarking-based approach, ProMark, for causal attribution. We use predefined training concepts like styles, scenes, objects, motifs, etc.. to attribute the influence of training data on generated images. We show ProMark’s is effective across various datasets and model types, maintaining image quality while providing more accurate attribution on a large number of concepts. Our approach can also be extended to multi-concept attribution by embedding multiple watermarks onto the image. Finally, for each experiment, our approach achieves a higher attribution accuracy than the prior passive approaches. Such attribution offers opportunities to recognize and reward creative contributions to generative AI, underpinning new models for value creation in the future creative economy [57]. Limitations. 
In evaluating ProMark, we note a trade-off between image quality and attribution accuracy, which may need us to learn watermarks for attribution task. Our model is currently trained with predefined concepts and further research is needed on training paradigm when new concepts are introduced. While we use orthogonal watermarks for varied concepts like motifs and styles, this may not accurately reflect the interrelated nature of some concepts, suggesting another opportunity for future research. Finally, our results are specific to the LDM, and extending this approach to other GenAI models could provide a better understanding of ProMark’s effectiveness. 83 CHAPTER 6 CUSTOMMARK: CUSTOMIZATION OF DIFFUSION MODELS FOR PROACTIVE ATTRIBUTION Generative AI (GenAI) presents challenges in attributing synthesized content to its original training data, particularly for artists whose styles are replicated by these models. We introduce CustomMark, a novel technique for customizing pre-trained text-to-image GenAI models to enable attribution. With CustomMark, text prompts can be modified to embed a watermark in generated images, linking them to training concepts such as an artist’s style, specific objects, or the GenAI model itself. Our approach supports sequential customization, allowing new concepts to be attributed efficiently and scalably without retraining from scratch. We demonstrate that CustomMark can robustly watermark hundreds of individual concepts and support multiple attributions within a single image while preserving high visual quality of the generation1. 6.1 Introduction Given GenAI’s potential to democratize creativity, ethical concerns have emerged among artists regarding the unauthorized use of their works. Many seek recognition or compensation for the derivative use of their styles in generated images [263]. In the past, such creative recognition has relied on collaborations between technology, legal frameworks, and artistic practices [24]. GenAI currently lacks such mechanisms, leading to artist discontent and prompting adversarial strategies like “Glaze” [276], “Anti-DreamBooth” [309], and others [381, 96, 88] to protect their works. To address this discontent, it is needed that GenAI models provide attribution when generated images are derived from artists’ works in training data. Such attribution could potentially unlock new revenue streams in the creator economy, rewarding creative opt-in to GenAI training [57]. A decentralized framework to compensate creators based on visual similarities between generated and training images was proposed in [11]. Several similarity embeddings have been explored [269, 11, 328] to determine the subset of training images that influenced the generation. While intuitive, these visual correlation-based attribution methods [269, 11, 328] often fail to provide definitive 1Vishal Asnani, John Collomosse, Xiaoming Liu, and Shruti Agarwal. "CustomMark: Customization of Diffusion Models for Proactive Attribution." In review, 2025. 84 Figure 6.1 Overview of concept attribution by GenAI models. (a) A user generates images of various artists’ style using artists’ tokens in the prompt (w/o attribution). (b) Artists request to the companies to provide attribution for their work. Using CustomMark, companies customize their models to enable attribution only for the artists that have requested the same. (c) A user generates the images using the improved GenAI model with artists’ specific watermark for attribution to the artists. 
explanations and can also incorrectly attribute works not present in the training set. Alternative approaches attempt to establish direct causal relationships using techniques like proactive watermarking [4] or influence estimation via data removal [329]. However, these methods require modifications to training data or inference paradigms, making them computationally heavy. In response, we propose CustomMark, an efficient technique for attribution in pre-trained GenAI models. Similar to [4], we use concept-specific watermarking but without requiring predefined concepts before training. CustomMark enables selective attribution of specific concepts in a pre- 85 trained model, supporting sequential learning for newly emerging seen or unseen concepts. This approach avoids exhaustive retraining and allows attribution only for relevant concepts. As shown in Fig. 6.1, we focus on attribution in text-to-image Latent Diffusion Models (LDMs), where attributable concepts appear in prompts, such as “A painting in the style of V*” or “An image of V*.” If the owner of concept V* requests attribution, CustomMark embeds a concept-specific watermark into generated images while preserving visual quality. Unlike [4], which attributes to a subset of training images, CustomMark directly attributes the concept itself. The watermark remains robust against non-editorial modifications, ensuring traceability to the original concept and the GenAI model as the image circulates online. Since CustomMark embeds watermarks in a concept-specific manner without requiring exhaustive retraining, it effectively functions as a form of model customization. Current customization methods [148, 171, 366, 219, 375, 268, 89, 351, 75, 278] struggle to scale across many distinct concepts, often compromising generation quality. To address this, we propose a novel architecture that customizes pretrained LDMs for large-scale watermarking. Building on [88], we use a concept encoder to map a bit-secret to token-embedding perturbations but find it insufficient for scalability. Thus, we introduce a mapper network that perturbs input Gaussian noise, we fine-tune the LDM’s attention layers, and leverage CSD [286] loss for faster training and improved image quality. CustomMark enables fine-tuned LDMs to generate watermarked images aligned with text prompts while embedding corresponding watermarks. Its sequential learning capability allows new attributions with just 10% additional finetuning, preserving visual quality while protecting artist styles. Our contributions are: 1. An efficient, scalable technique to customize LDMs for imperceptibly watermarking single or multiple seen/unseen concepts in a generated image, enabling robust concept attribution in pre-trained text-to-image LDMs. 2. Sequential attribution capability, allowing fine-tuning for new concepts dynamically without retraining the model, ensuring selective attribution of relevant seen and unseen concepts. 86 Figure 6.2 Overview of CustomMark. Illustrating the training workflow for CustomMark. A concept token 𝒑𝑐𝑖 is encoded through the Concept Encoder E𝐶 to generate a modified prompt ˆ𝒑𝑐𝑖 with embedded watermark information. The Secret Mapper M𝐶 maps a bit secret 𝒔𝑖 to perturb the concept token, producing 𝛿, which is added to the Gaussian noise 𝜖. The LDM using the prompt tokens and pertubed Gaussian noise, producing watermarked images ˆ𝑿 that carry the bit secret in visual form. 
During inference, the Secret Decoder D𝐶 extracts the bit secret from watermarked image ˆ𝑿 and the clean image 𝑿 to extract the bit secret. CustomMark is guided by various constraints, namely regularization loss 𝐽𝑅𝑒𝑔 to make the artist token embedding similar, style loss 𝐽𝑆𝑡𝑦 to maintain style consistency between clean and watermarked images, and the bit secret loss 𝐽𝐵𝐶𝐸 to predict the added bit secret. Best viewed in color. . 3. Demonstration that diffusion models can attribute 100s of artists’ styles and 1000 ImageNet classes while maintaining high visual quality of watermarked concepts. 6.2 Related Works Proactive Schemes. Proactive methods enhance various tasks by embedding perturbations into input images, providing benefits to deepfake tagging [325], detection of manipulated content [6], localization of manipulations [7], object detection [5, 88], and concept attribution [4]. Some approaches focus on altering the training data to disrupt the output of generative models [356, 267]. Meanwhile, Alexandre et al. [270] introduce a fixed signal method to enable attribution of training datasets. Recently, a survey by Asnani et al. [9] discusses various proactive approaches, encryption schemes, learning process, and their applications, such as vision model defense [298, 342], LLM defense [221, 341, 378], privacy protection [299, 343, 383, 252], improving GenAI models [285, 163, 205, 237, 173, 373], 3D domain [134, 151, 374, 305, 359, 146], etc.. In CustomMark, we use 87 proactive techniques to do concept attribution in an efficient and scalable manner, with a focus on practical application to the real-world scenarios. IP Protection and Concept Attribution. For IP protection of AI-generated models and content, watermarking techniques embed signals into outputs via model fine-tuning [90], prompt verification [380, 198], and token-level adjustments [165]. Tools like DiffusionShield [62] and detection watermarking [241] prevent misuse, while on the other hand latent fingerprinting [8] and audio watermarking [31] extend protection across media. Additional model security is provided by DeepSigns [67], DeepMarks [39], and network embedding [320], as well as deep spatial encryption [369], backdoor triggers [1], and dynamic defenses like DAWN [293]. Concept attribution identifies which training data influenced a generated output, distinct from model [28] or camera attribution [37]. Traditional methods passively assess visual similarities between generated and training images using predefined criteria. For instance, Wang et al. [328] propose Attribution by Customization (AbC), modifying embeddings like CLIP and DINO with customized diffusion models. Style-specific attribution methods such as ALADIN [269] and EKILA [11] employ perceptual hashing for patch-based matching. MONTRAGE [26] monitors weight updates to attribute pre-trained concepts, while Asnani et al. [4] embed concept-specific watermarks in training images for direct attribution. In contrast, we introduce a proactive watermarking technique that requires no training data modifications and enables selective, sequential attribution after training. GenAI Customization. Advances in GenAI customization leverage techniques like Video Motion Customization [148], Custom Diffusion [171], and CustomNet [366] to adapt models to specific concepts and motions, while approaches like Modular Customization [247] and CIDM [77] enhance scalability and prevent catastrophic forgetting. 
Efficiency-focused methods [75] and LoRA- Composer [351] optimize customization with minimal parameter adjustments, while AquaLoRA [89] provides watermarking for unauthorized use protection, and textual inversion [219, 375, 268] enables precise text-based editing. Privacy-oriented anti-customization [317] offers additional security by adapting adversarial strategies. We propose a proactive concept attribution technique 88 using model customization, which hasn’t been explored before. 6.3 Method 6.3.1 Background Prompts and Cross-Attention Mechanism in Diffusion Model. In text-to-image LDMs [264], prompts and cross-attention mechanisms work together to guide image generation. A prompt is processed by a text encoder, and converted into a text embedding. This embedding conditions the sampling process by capturing the prompt’s meaning. Instead of merely producing random images, the cross-attention mechanism allows the model to “attend” to specific parts of the text embedding, guiding the diffusion process to align the output with the input prompt. For key 𝑲, query 𝑸 and value 𝑽, the scaled dot-product attention is given by: Attention(𝑸, 𝑲, 𝑽) = softmax (cid:19) (cid:18) 𝑸𝑲𝑇 √ 𝑑𝑘 𝑽. (6.1) Further, multi-head cross attention with respective weight matrices 𝑾∗ 𝑖 s, is utilized to improve generation quality by processing the prompt with multiple attention heads: MultiHead(𝑸, 𝑲, 𝑽) = Concat(H1, . . . , Hℎ)𝑾, H𝑖 = Attention(𝑸𝑾 𝑄 𝑖 , 𝑲𝑾 𝐾 𝑖 , 𝑽𝑾𝑉 𝑖 ). (6.2) (6.3) As the multi-head cross-attention in Eq. (6.3) is the main component to establish a relationship between prompts and the generated image, in CustomMark, we only fine-tune 𝑾∗ 𝑖 s. This significantly reduces training time while enhancing critical associations between the concept and its watermarked image. Concept Attribution. ProMark [4] defines the concept attribution as finding the closest concept in the training dataset for a given generated image. For this purpose, ProMark divides the entire dataset into different concepts and trains with each concept being watermarked. However, this is impractical for the real world as it difficult to retrain the GenAI models on the entire watermarked data. Therefore, we re-define the problem of Concept Attribution as follows. Let C represent a set of 𝑁 distinct concepts within the training dataset of a GenAI model. Out of the 𝑁 concepts, let ˆC = {𝑐1, 𝑐2, . . . , 𝑐𝑀 } be the 𝑀 concepts that need attribution, whose token 89 embeddings are represented as 𝑷𝑐 = { 𝒑𝑐1 , 𝒑𝑐2 , . . . , 𝒑𝑐 𝑀 }. Given a synthetic image 𝑿 generated by a GenAI model using 𝒑𝑐𝑖 ∈ 𝑷𝑐, along with other prompt token embeddings, forming an input prompt 𝑷 = { 𝒑1 corresponding concept 𝑐𝑖. Specifically, we find a mapping function 𝑓 such that 𝑐𝑖 = 𝑓 ( 𝑿). , . . . , 𝒑𝑛}, the objective of concept attribution is to map 𝑿 to its , . . . , 𝒑𝑐𝑖 , 𝒑2 6.3.2 CustomMark Overview. To add attribution capabilities to a pre-trained LDM, CustomMark perturbs the inputs to the LDM and fine-tunes its attention weights. The input token embedding 𝒑𝑐𝑖 and the input Gaussian noise 𝝐 are perturbed by the concept encoder E𝐶 and the secret mapper M𝐶 networks, that encode a concept specific bit-secret into the respective inputs. This results in the perturbed embedding ˆ𝒑𝑐𝑖 and the perturbed Gaussian noise ˆ𝝐, which are fed into the LDM to sample new images. The synthesized images are then fed to the secret decoder D𝐶 that outputs the corresponding bit-secret. During training, only the attention weights in Eq. (6.2), and Eq. (6.3) of the LDM are fine-tuned. 
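As a concrete reference for Eq. (6.1) through Eq. (6.3) and for the choice of fine-tuning only the attention projection weights, a minimal sketch in PyTorch is given below; the class and attribute names are illustrative and do not mirror the LDM's actual module structure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttention(nn.Module):
    """Multi-head cross-attention between image features and prompt tokens."""
    def __init__(self, dim, ctx_dim, heads=8):
        super().__init__()
        self.heads, self.dk = heads, dim // heads
        self.Wq = nn.Linear(dim, dim, bias=False)       # W^Q
        self.Wk = nn.Linear(ctx_dim, dim, bias=False)   # W^K
        self.Wv = nn.Linear(ctx_dim, dim, bias=False)   # W^V
        self.Wo = nn.Linear(dim, dim)                   # output projection W

    def forward(self, x, ctx):
        # x: image features (B, N, dim); ctx: prompt token embeddings (B, T, ctx_dim)
        B, N, _ = x.shape
        q = self.Wq(x).view(B, N, self.heads, self.dk).transpose(1, 2)
        k = self.Wk(ctx).view(B, -1, self.heads, self.dk).transpose(1, 2)
        v = self.Wv(ctx).view(B, -1, self.heads, self.dk).transpose(1, 2)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.dk ** 0.5, dim=-1)  # Eq. (6.1)
        heads = (attn @ v).transpose(1, 2).reshape(B, N, -1)                # concat heads
        return self.Wo(heads)                                               # Eq. (6.2)-(6.3)

def freeze_all_but_cross_attention(ldm):
    """Fine-tune only the cross-attention projections; keep every other weight fixed."""
    for p in ldm.parameters():
        p.requires_grad = False
    for module in ldm.modules():
        if isinstance(module, CrossAttention):
            for p in module.parameters():
                p.requires_grad = True
```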
The framework is guided by several constraints which allows for the generation of images with embedded secrets and also maintain the original artistic style. We will now present our method in details. Embedding Encryption. encoder E𝐶. For 𝑖𝑡ℎ concept, the concept token embedding 𝒑𝑐𝑖 is encrypted using E𝐶 as: In CustomMark we perturb all the concepts in 𝑷𝑐 using a single concept ˆ𝒑𝑐𝑖 = E𝐶 ( 𝒑𝑐𝑖 , 𝒔𝑖), (6.4) where 𝒔𝑖 is the concept specific bit-secret of length 𝑙, i.e. 𝒔𝑖 = {𝑏𝑖1, 𝑏𝑖2, ..., 𝑏𝑖𝑙 } where 𝑏𝑖 𝑗 ∈ {0, 1}. After encryption, the original embedding is replaced by the encrypted text embedding, resulting in encrypted prompt token embeddings ˆ𝑷 = { 𝒑1 , 𝒑2 image, ˆ𝑷 is fed to the LDM in place of the original token embeddings 𝑷. Following the architecture , . . . , 𝒑𝑛}. To obtain the watermarked , . . . , ˆ𝒑𝑐𝑖 of [88], we apply a regularization mean squared error (MSE) loss between 𝑷 and ˆ𝑷 at initial iterations, so that the encoder E𝐶 has a good starting point to preserve the style, and support secret 90 learning. The regularization loss is: 𝐽𝑅𝑒𝑔 = || ˆ𝑷 − 𝑷||2 2 . (6.5) Secret Learning. We will now discuss the learning of LDM to generate watermarked images given the encrypted token embeddings ˆ𝑷. In addition to E𝐶, we use a mapper network M𝐶 to further accelerate the secret learning. Using 𝑖𝑡ℎ bit-secret 𝒔𝑖, we estimate a perturbation 𝜹 = M𝐶 (𝒔𝑖) which is added to the initially sampled Gaussian noise 𝜖 for image generation. Therefore, the perturbed 𝜖 is given by: ˆ𝝐 = 𝝐 + 𝛼 × M𝐶 (𝒔𝑖), (6.6) where 𝛼 controls the magnitude of 𝜹. The perturbed Gaussian noise ˆ𝝐 along with ˆ𝑷 is given as input to the LDM to sample an image. Finally, to avoid the complexity of LDM training, we only finetune the attention layers of the LDM, while fixing other layers. During training, we create both clean and watermarked images, 𝑿 and ˆ𝑿, using the inputs (𝑷, 𝝐) and ( ˆ𝑷, ˆ𝝐). The style descriptors 𝒅 and ˆ𝒅 from images 𝑿 and ˆ𝑿 are extracted using the pretrained Contrastive Style Descriptors (CSD) [286] model. CSD contain concise and effective style information, while being invariant to semantic content and capable of disentangling multiple styles. We maximize the cosine similarity between two descriptors, which ensures that the watermarked images matches the style of original concept. To further support style matching, we apply a MSE loss between the two images, in addition to the CSD loss. Therefore, our style loss is given by: 𝐽𝑆𝑡𝑦 = 1 − cos( ˆ𝒅, 𝒅) + || 𝑿 − ˆ𝑿 ||2 2 . (6.7) 𝑿 and ˆ𝑿 are further fed to a secret decoder D𝐶, which estimates the bit secret in given images. The decoder shall output a zeros secret for 𝑿, and the secret 𝒔𝑖 for ˆ𝑿. To train D𝐶, we use a binary cross-entropy (BCE) loss between the ground truth bit-sequence 𝒔𝑖 and the predicted one ˆ𝒔𝑖: 𝐽𝐵𝐶𝐸 (𝒔𝑖, ˆ𝒔𝑖) = − 1 𝑙 𝑙 ∑︁ 𝑗=1 [𝑏 𝑗 log( ˆ𝑏 𝑗) + (1 − 𝑏 𝑗) log(1 − ˆ𝑏 𝑗)]. (6.8) Therefore, CustomMark is trained in an end-to-end manner to minimize the objective 𝐿𝑎𝑡𝑡𝑟 = 𝐿𝑆𝑡𝑦 + 𝐿 𝐵𝐶𝐸 + 𝛽𝐿 𝑅𝑒𝑔 during training, where 𝛽 = 10 for our experiments. 91 During inference, if the random Gaussian noise and the input prompt are perturbed, the diffusion model embeds a watermark within the generated image. This watermark can be decoded using D𝐶 to the concept specific bit-secret, functioning as hidden signatures for attribution. Concept Attribution in Inference. To attribute the generated images, we extract the bit secret embedded by the LDM using D𝐶. 
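For reference, the training objective laid out in Eq. (6.4) through Eq. (6.8) can be summarized in the following sketch. The names concept_encoder (E_C), secret_mapper (M_C), secret_decoder (D_C), and csd (the pretrained CSD model) are placeholders, the secret decoder is assumed to output per-bit probabilities, and details such as backpropagating through the sampling loop are omitted; this is an illustration, not the exact training code.

```python
import torch
import torch.nn.functional as F

def custommark_train_step(p_tokens, c_idx, s, ldm, concept_encoder,
                          secret_mapper, secret_decoder, csd,
                          alpha=0.01, beta=10.0):
    """One training step for concept c_idx with bit-secret s of shape (B, l)."""
    # Eq. (6.4): encrypt the concept token embedding with its bit-secret.
    p_hat = p_tokens.clone()
    p_hat[:, c_idx] = concept_encoder(p_tokens[:, c_idx], s)

    # Eq. (6.6): perturb the sampled Gaussian noise with the mapped secret.
    eps = torch.randn(ldm.latent_shape, device=p_tokens.device)
    eps_hat = eps + alpha * secret_mapper(s)

    # Clean and watermarked generations from (P, eps) and (P_hat, eps_hat).
    x = ldm.generate(p_tokens, eps)
    x_hat = ldm.generate(p_hat, eps_hat)

    # Eq. (6.7): CSD cosine style loss plus an image-space MSE term.
    d, d_hat = csd(x), csd(x_hat)
    j_sty = (1 - F.cosine_similarity(d_hat, d, dim=-1)).mean() + F.mse_loss(x_hat, x)

    # Eq. (6.8): BCE secret loss; an all-zeros secret is expected for the clean image.
    j_bce = F.binary_cross_entropy(secret_decoder(x_hat), s) + \
            F.binary_cross_entropy(secret_decoder(x), torch.zeros_like(s))

    # Eq. (6.5): regularization on the perturbed prompt embeddings (early iterations only).
    j_reg = F.mse_loss(p_hat, p_tokens)

    return j_sty + j_bce + beta * j_reg
```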
Using this predicted bit-secret ˆ𝒔 = D𝐶 ( ˆ𝑿) and the bit-secret 𝒔𝑖 corresponding to the concept 𝑐𝑖, we define the attribution mapping function 𝑓 as: where, 𝑓 ( ˆ𝑿) = argmax 𝑖∈[1,𝑀] 𝑔(D𝐶 ( ˆ𝑿), 𝒔𝑖), 𝑔(D𝐶 ( ˆ𝑿), 𝒔𝑖) = 𝑔( ˆ𝒔, 𝒔𝑖) = 𝑙 ∑︁ 𝑘=1 [ ˆ𝑏𝑘 = 𝑏𝑖𝑘 ], (6.9) (6.10) and [ ˆ𝑏𝑘 = 𝑏 𝑗 𝑘 ] is an indicator function that returns 1 if the bits match, and 0 otherwise. Thus, using the predicted bit-sequence we assign the generated images to the concept whose bit-sequence matches the best, i.e., the 𝑖𝑡ℎ concept that maximizes 𝑔( ˆ𝒔, 𝒔𝑖). 6.3.3 Sequential Learning In real-world scenarios, the number of concepts requiring attribution is not always fixed. The set of concepts can change frequently, making it impractical to retrain the attribution model from scratch each time new concepts are introduced. To address this challenge, we propose the idea of sequential learning with CustomMark. For example, if CustomMark is initially trained on 𝑀 concepts, denoted as ˆC = {𝑐1, 𝑐2, . . . , 𝑐𝑀 }, and a new concept 𝑐𝑀+1 needs to be attributed, the model can be fine-tuned on the expanded set ˆC ∪ 𝑐𝑀+1, starting from the model pretrained on ˆC. This approach allows the model to adapt to new concepts without requiring a predefined set during initial training. Our experiments demonstrate that learning new concepts in this manner requires only about 10% additional iterations, making it significantly more efficient than retraining CustomMark from scratch. 6.3.4 Multi-Concept Learning In real-world text-to-image generation, multiple concepts are often combined within a single prompt, such as "a painting of a dog in the style of Van Gogh." To enable concept attribution in such 92 Figure 6.3 Comparison with ProMark [4] on ImageNet. ProMark produces low-quality images with bubble-like artifacts from its encryption, whereas CustomMark enables LDMs to generate high-quality images that closely match the original training concepts. cases, CustomMark extends its attribution mechanism to handle multiple concepts simultaneously. Given two concepts, 𝑐𝑖 and 𝑐 𝑗 , from the attributed set ˆC, their respective token embeddings 𝒑𝑐𝑖 and 𝒑𝑐 𝑗 are perturbed using the concept encoder E𝐶. This results in the perturbed embeddings: ˆ𝒑𝑐𝑖 = E𝐶 ( 𝒑𝑐𝑖 , 𝒔𝑖), ˆ𝒑𝑐 𝑗 = E𝐶 ( 𝒑𝑐 𝑗 , 𝒔 𝑗 ). (6.11) The perturbed prompt embeddings ˆ𝑷 = { 𝒑1 , . . . , ˆ𝒑𝑐𝑖 , . . . , ˆ𝒑𝑐 𝑗 , . . . , 𝒑𝑛} are then used in the LDM to generate a watermarked image ˆ𝑿. During decoding, the secret decoder D𝐶 is designed to recover the concatenated secret associated with both concepts: ˆ𝒔 = D𝐶 ( ˆ𝑿) = [𝒔𝑖; 𝒔 𝑗 ]. (6.12) The concatenation ensures that both concept-specific secrets are extracted from the generated image, thereby enabling attribution for multiple concepts simultaneously. The attribution function 93 𝑓 is then applied independently for each concept: 𝑓 ( ˆ𝑿) = argmax 𝑖, 𝑗 ∈[1,𝑀] 𝑔(D𝐶 ( ˆ𝑿), [𝒔𝑖; 𝒔 𝑗 ]), (6.13) This approach ensures that CustomMark can reliably attribute both concepts in a multi-concept image, allowing for effective auditing of GenAI models even when multiple stylistic or semantic elements are present in the generated content. 6.4 Experiments Implementation Details For training CustomMark, a predefined list of prompts are used per concept (see supp.). For concepts, we use 1, 000 ImageNet [71] classes, 23 WikiArt [296] artists, and a custom 200 list of artists (see supp.). For text-to-image LDM, Stable Diffusion 1.5 is used. Unless stated, we use a bit-sequence of size 16. 
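As a concrete illustration of the attribution mapping in Eq. (6.9) and Eq. (6.10), attribution at inference reduces to a nearest-secret lookup by bit agreement. In the sketch below, secret_decoder stands in for D_C and is assumed to output per-bit probabilities.

```python
import torch

@torch.no_grad()
def attribute(x_hat, secret_decoder, concept_secrets):
    """Map a generated image to the registered concept whose secret matches best.

    concept_secrets: (M, l) tensor holding the bit-secrets s_1, ..., s_M.
    """
    s_hat = (secret_decoder(x_hat).squeeze(0) > 0.5).float()  # predicted bits, shape (l,)
    g = (s_hat == concept_secrets).sum(dim=-1)                # Eq. (6.10): matched bits per concept
    # Bit accuracy (Sec. 6.4) corresponds to g.max() / l; attribution follows Eq. (6.9):
    return int(g.argmax())
```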
We evaluate CustomMark using four metrics: bit accuracy, attribution accuracy, CSD [286] score, and CLIP [328] score, defined as follows. For attribution assessment, bit accuracy is the maximum percentage of bits matched between the predicted bit-secret and any of the concept-specific secrets, and attribution accuracy is the percentage of times the predicted bit-secret matches the correct concept-specific secret. For quality assessment, the CSD score is the cosine similarity between CSD descriptors, which assesses the style match between two images, and the CLIP score is the cosine similarity between CLIP image embeddings. For all evaluations, we report average results on 100 generated and/or 100 clean images. For 10 concepts, CustomMark is trained for 20K iterations. All experiments are conducted on 8 A100 NVIDIA GPUs with a batch size of 8 per GPU.
6.4.1 Results
Comparison with Attribution Methods We evaluate various passive and proactive attribution methods on images generated by LDMs that are trained on the ImageNet and WikiArt datasets, containing 1000 and 23 classes, respectively. Here, each class is treated as a unique concept. For a fair comparison, we generate 100 images per class for both ProMark [4] and CustomMark. Since ProMark and CustomMark embed different watermarks, their accuracy is reported only on their respective 100 watermarked images, whereas for the passive methods ALADIN [269], CLIP [251], AbC [328], SSCD [246], and EKILA [11], which rely on embeddings, the evaluation is done on images generated by both proactive models, i.e., an average over a total of 200 generated images per concept.
Table 6.1 Comparison with passive and proactive methods on images generated by conditional models trained on the ImageNet and WikiArt datasets. CustomMark outperforms the passive methods on both datasets significantly. Both proactive methods have similar performance on ImageNet, but for WikiArt, CustomMark performs better than ProMark.
Method          Type        Attribution Accuracy (%) ↑
                            ImageNet      WikiArt
ALADIN [269]    Passive     5.55          18.58
CLIP [251]      Passive     42.61         52.60
AbC [328]       Passive     53.51         56.03
SSCD [246]      Passive     25.50         45.34
EKILA [11]      Passive     30.98         43.03
ProMark [4]     Proactive   87.30         87.19
CustomMark      Proactive   87.12         89.25
As shown in Tab. 6.1, the passive methods exhibit relatively low attribution accuracy. In contrast, the proactive methods, ProMark and CustomMark, significantly outperform the passive methods, with much higher accuracy on both datasets. Although ProMark trains on an entirely watermarked dataset with all LDM parameters learnable during training, its performance is still only comparable to CustomMark. Further, ProMark adversely impacts image quality, as shown in Fig. 6.3, where the generated ImageNet samples of ProMark are of lower quality and display visible artifacts. To quantify the quality, we calculate the FID score [131, 273] between the original ImageNet images (from a pretrained model without watermarks) and the watermarked images from each proactive model. The pretrained model achieves an FID score of 13.28. ProMark yields an FID score of 17.63, while CustomMark achieves an FID score of 14.73, indicating substantially better image quality. Thus, CustomMark not only maintains robust attribution performance but also generates higher-quality images than ProMark, making it a more effective solution for practical applications.
Comparison with Customization-Based Watermarking Methods We compare our method with [88], which also leverages textual token perturbations to guard personalized concepts.
However, in [88], authors train new concept encoder-decoder pair for each personalization – an impractical 95 Figure 6.4 Attribution results of three concept artists: VanGogh, Monet and Picasso, sampled from LDM before and after applying the attribution capability of customization-based method [88] and CustomMark. [88] makes the LDM sample images far apart from the original style of artists, while CustomMark-watermarked images are much closer to the original style. Table 6.2 Comparison with customization-based method by Feng et al. [88]. Acc.=Accuracy]. [KEYS: Method Bit Attribution Acc. (%)↑ Acc. (%) ↑ Feng et al. [88] CustomMark 90.87 99.29 74.14 94.29 CLIP Score ↑ 0.57 0.81 CSD Score ↑ 0.51 0.77 Figure 6.5 Sequential learning of new concepts. CustomMark starts with three initial concepts and incrementally learns new attributions without retraining from scratch. Each column displays clean and watermarked images, demonstrating CustomMark’s efficiency in adapting to new styles with only about 10% extra training iterations per concept while maintaining high stylistic fidelity. We only show the concept used to create the image. A list of all the prompts used is given in supp. solution in real world. For a fair comparison, we adapt [88] by training a single encoder- decoder pair for 3 artists’ styles as concepts, namely VanGogh, Monet and Picasso. As shown in Tab. 6.2, CustomMark surpasses this baseline in all metrics, achieving higher watermark detection 96 accuracy (99.29), attribution accuracy (94.29), and generation quality (CSD score 0.81 and CLIP score 0.77). These results demonstrate the effectiveness of CustomMark for concept watermarking in GenAI. Shown in Fig. 6.4 are some qualitative results for comparison. Unlike [88], which struggles to preserve individual artistic styles like brushstrokes and color palettes, CustomMark accurately captures each artist’s unique nuances. For example, for Picasso (second row, last col), [88] generates Van Gogh style brushstrokes. Sequential Learning In Fig. 6.5, we showcase CustomMark’s sequential learning capability, where the model begins attribution with three concepts and subsequently integrates additional concepts one at a time. This setup reflects a dynamic, real-world setting where the need for concept attribution evolves over time as new styles are added. Instead of retraining the model from scratch for each new concept, CustomMark employs sequential learning to incrementally learn attributions for new concepts without erasing previously learned styles. Starting with three initial concepts, CustomMark fine-tunes the model as new concepts are introduced, updating attribution while preserving distinct stylistic features. This is evident in the similarity between clean and watermarked images in each column, where CustomMark maintains high fidelity to the original style. With sequential learning, it attributes new concepts with only 10% additional iterations per concept, avoiding full retraining. These results demonstrate CustomMark’s scalability and efficiency in preserving style-consistent, high-quality outputs for GenAI models. Unseen Artists Watermarking We demonstrate CustomMark’s ability to attribute both seen and unseen concepts using textual inversion. As shown in Fig. 6.4, known concepts are watermarked by perturbing their token embeddings. However, in real-world scenarios, generative models often encounter novel concepts outside the initial training set, requiring adaptability beyond predefined attributions. 
To address this, we leverage textual inversion to derive token embeddings for unseen concepts. Once obtained, we apply watermark perturbations, enabling attribution without significant model retraining. Fig. 6.6 illustrates this by showing stylistic consistency between clean and watermarked 97 Figure 6.6 Attribution of Unseen Concepts with CustomMark. Shown is the CustomMark’s ability to handle attribution for unseen concepts. The consistent style between clean and watermarked images across new styles demonstrates CustomMark’s robustness in preserving artistic fidelity while achieving scalable attribution. We only show the concept used to create the image. A list of all the prompts used is given in supp. images, preserving unique attributes of each new style. This demonstrates CustomMark’s adaptability, allowing it to generalize to new styles while maintaining fidelity and stylistic integrity. Multi-Concept Watermarking For this scenario, we take 20 concepts into consideration (10 objects, and 10 artists). Each concept is associated with an 8-bit secret. The decoder extracts a 16- bit secret for the generated image. The qualitative results in Fig. 6.7 demonstrate that CustomMark successfully embeds attribution signatures for both object (e.g., "dog," "tree") and style (e.g., "Van Gogh," "Picasso") concepts within a single image while preserving visual quality. Quantitatively, the attribution and bit accuracy evaluated on 100 clean and generated images is 89.14% and 95.47%, respectively. 6.4.2 Ablations For all the ablation experiments, unless stated, we use a model trained for 10 concepts (see supp.). Nearby Concepts and Clean Images. CustomMark provides the flexibility to easily switch from the watermarked image generation to non-watermarked version, which we define as clean image generation. To do this, we use the non-perturbed original text tokens, while keeping the mapper network M𝐶 and the fine-tuned 98 Figure 6.7 Attribution for multiple Concepts present in a single prompt with CustomMark. Table 6.3 Ablation study for various style losses. [KEYS: Acc. Accuracy, Att. Attribution, Atte.: Attention]. Method Bit Attribution Acc. (%)↑ Acc. (%) ↑ CSD CSD + L2 (latent) CSD + L2 (image) CSD + L2 + LDM atte. 98.6 99.12 99.17 99.29 88.15 90.94 92.35 94.29 CLIP Score ↑ 0.65 0.70 0.73 0.81 CSD Score ↑ 0.73 0.67 0.74 0.77 attention weights of the model. An all-zero bit secret is used as an input to M𝐶 and the secret decoder is expected to output the same for these clean images. We evaluate CustomMark’s ability to generate clean images for 1) attributable concepts: that are fine-tuned with CustomMark and 2) nearby concepts: that are related to attributable concepts but not exactly the same. For example, if CustomMark can attribute paintings of Van Gogh, then paintings from other artists are considered nearby concepts. For this evaluation, we use three attributable artists (first three cols of Fig. 6.5) and seven random nearby artists (see supp.). For attributable concepts, the model achieves high bit accuracy (96.13%) and attribution accuracy (85.45%) with an all-zeros bit secret, indicating effective attribution of clean concepts. For nearby concepts, it maintains strong bit accuracy (92.36%) and attribution accuracy (81.90%), showcasing the adaptability of CustomMark for practical applications, allowing selective watermarking for certain styles while not watermarking concepts which don’t specifically request it. The 99 Figure 6.8 Ablation study for varying different parameters of CustomMark. 
We show the performance variation by varying bit secret length, number of concepts, and scaling factor. Figure 6.9 Robustness evaluation of decoder by applying distortion to generated images. generation quality with CustomMark is comparable to the pretrained LDM, with an FID score of 14.51 between original and clean images. Style Loss. Tab. 6.3 presents an ablation study on different style loss combinations and their impact on bit accuracy, attribution accuracy, and qualitative metrics. The baseline using only CSD performs well, but adding L2 loss in LDM’s latent space improves accuracy, with a slight drop in the CSD score. Further applying L2 loss in image space enhances overall performance, boosting attribution accuracy, CLIP, and CSD scores. The best results are achieved by CustomMark, which combines CSD, L2 loss, and attention layer training, yielding the highest gains across all metrics and validating our design choice. Robustness. Fig. 6.9 demonstrates CustomMark’s robustness against various post-processing 100 distortions, including JPEG compression, rotation, cropping, resizing, Gaussian blur, noise, color jitter, and sharpness (see supp.). CustomMark maintains high attribution and bit accuracy, with minimal impact from common distortions like JPEG compression and rotation, while stronger distortions (e.g., Gaussian blur, noise) cause slight accuracy drops. Against adversarial attacks [379], it retains 82.21% attribution accuracy, only slightly lower than the original 91.11%. These results highlight CustomMark’s resilience in real-world scenarios. Bit Secret Length Fig. 6.8(a) analyzes the effect of bit secret length on bit accuracy, attribution accuracy, and CLIP score. As the secret length increases, both accuracy metrics decline, suggesting that longer secrets are harder for the decoder to recover, impacting attribution performance. Additionally, the CLIP score drops, indicating stylistic deviations. This trade-off suggests that a moderate bit length, such as 16, balances attribution accuracy and stylistic fidelity. Number of Concepts. Fig. 6.8(b) examines how the number of unique artist concepts affects attribution and stylistic fidelity. As concepts increase, bit and attribution accuracy decline, likely due to the growing challenge of distinguishing among them. Similarly, the CLIP score drops, suggesting that maintaining stylistic consistency becomes harder with a broader range of styles in watermarked images. Scaling Factor. Fig. 6.8(c) shows the impact of the scaling factor in Eq. (6.6) on attribution and stylistic similarity. Increasing the scaling factor sharply reduces both bit and attribution accuracy, likely due to overpowering the sampled Gaussian noise. Conversely, decreasing it too much causes the LDM to diverge, generating noise images, as reflected in the declining CLIP score. This underscores the need for a low scaling factor to balance attribution accuracy and stylistic preservation, leading us to select 0.01 for our experiments. 6.5 Conclusion We propose CustomMark, an efficient and flexible technique for enabling concept attribution in pre-trained text-to-image LDMs. Addressing the growing demand for ethical content generation in GenAI models, CustomMark provides a customization-based approach to embed concept- specific watermarks, allowing artists to request attribution for their work. 
Unlike previous 101 methods, CustomMark allows selective attribution without requiring all concepts to be predefined before training, and entire watermarking of the training data. It supports sequential learning to add new concepts in a online-way. We demonstrate that CustomMark can handle hundreds of artist styles and diverse ImageNet classes while maintaining image quality and ensuring robust attribution. By fine-tuning the model for new concepts with minimal computational overhead, CustomMark streamlines the attribution process, helping bridge the gap between GenAI developers and the creative community, and so promoting responsible use of GenAI in content creation. 102 CHAPTER 7 PIVOT: PROACTIVE VIDEO TEMPLATES FOR ENHANCING VIDEO TASK PERFORMANCE In this paper, we introduce PiVoT, a video-based proactive wrapper that enhances Action Recognition (AR) and Spatio-Temporal Action Detection (STAD) systems. By leveraging a proactive template- enhanced Low-Rank Adaptation (LoRA) paradigm, PiVoT integrates seamlessly with detectors while maintaining an efficient training approach. A 3D U-Net generates action-specific templates, capturing temporal dynamics through shadow-like artifacts that help detectors better identify motion cues and refine frame distribution. Fine-tuning only the LoRA layers within the CNN backbone or transformer attention layers ensures minimal computational overhead while improving detection accuracy. Applied to TSN, TSM, MViTv2, and SlowFast across datasets like Kinetics-400, Something-Something-v2, and AVA2.1, PiVoT consistently boosts performance, demonstrating its adaptability and scalability for video-based detection tasks. Models and code will be released upon publication1. 7.1 Introduction Video-based tasks in computer vision, such as Action Recognition (AR) and Spatio-Temporal Action Detection (STAD), play a crucial role in enabling machines to understand dynamic scenes and human behavior. AR focuses on recognizing the action occurring in a video by analyzing temporal sequences and identifying the action category. AR methods have evolved from traditional hand-crafted features in RGB videos [21, 175, 318, 319] to sophisticated deep learning techniques, such as two-stream Convolutional Neural Networks (CNNs) [280, 324, 323, 322, 334, 243, 86], Recurrent Neural Networks (RNNs) [290, 76, 307, 123, 101, 186, 12, 290, 217], 3D CNNs [86, 35, 17, 119, 302, 301], and Transformer-based models [3, 349, 347, 310, 79], which capture spatial and temporal features. STAD, on the other hand, aims to both localize and recognize actions within videos by assigning bounding boxes to each instance and classifying the type of action. STAD include two-stage methods, separating bounding box detection and the classification of 1Vishal Asnani, Xiaoming Liu, and Shruti Agarwal. "PiVoT: Proactive Video Templates for Enhancing Video Task Performance." In review, 2025. 103 Figure 7.1 PiVoT as a wrapper. (a) We propose PiVoT which wraps around different video- based baseline detectors to improve the performance. (b) We show the effectiveness of incorporating PiVoT on various detectors for two different video-based tasks, across three different datasets. PiVoT is able to improve the performance for each detector. The plotted points represent performance improvements after applying PiVoT, with all points in the green region indicating enhanced performance compared to the baseline. (c) PiVoT uses 3D templates unlike prior proactive works which use 1D/2D templates. 
actions [271, 112, 242], and query-based one-stage methods [167, 376], which integrate these steps into a unified process. Many of these techniques use various type of modules and augmentation strategies [110, 352, 59, 344, 339, 142] resulting in performance gains by increasing diversity and model robustness. 104 Further for performance gain, a growing segment of deep learning involves the use of proactive learning schemes [9] which have demonstrated performance improvements across various computer vision tasks, including vision model and LLM defense [298, 342], privacy solutions [343, 299], attribution [4], manipulation detection [6], localization [7], and 2D object detection [5], among others. While these methods share similarities with traditional augmentation techniques, their distinct feature lies in learning an additional signal known as templates. Asnani et al. [5] propose a proactive wrapper for 2D object detectors which generate templates, when applied under certain conditions, can significantly enhance the performance of networks. Building on augmentation and proactive learning insights, we introduce PiVoT, a video-based proactive wrapper that enhances video detectors (Fig. 7.1(a)). Developing such a wrapper poses key challenges: it must be plug-and-play, parameter-efficient, and compatible across diverse datasets while preserving temporal dynamics. Additionally, it should integrate seamlessly with detectors without extensive modifications or performance trade-offs. Overcoming these challenges makes PiVoT a scalable solution for improving video-based detection systems. To address these challenges, PiVoT employs a proactive template-enhanced Low-Rank Adaptation (LoRA) training paradigm, using a 3D U-Net to generate action-specific 3D templates (Fig. 7.1(c)). These templates are applied to video frames before detector processing. The detector, initialized with pretrained weights, integrates LoRA layers into either the CNN backbone [45] or transformer attention layers [136], enabling efficient task adaptation. We show that the estimated templates capture temporal information, evident in shadow-like artifacts that provide motion cues, enhancing AR and STAD detectors’ ability to interpret dynamic movements. By incorporating this temporal context, the templates refine feature extraction, improving model performance. Training is limited to the 3D U-Net and LoRA layers, keeping the rest of the detector fixed to minimize computational overhead. We demonstrate PiVoT’s effectiveness on AR and STAD tasks, showing consistent performance gains across various detectors and datasets (Fig. 7.1(b)). Our key contributions are summarized below: • Proactive Wrapper for Enhanced Video-Based Detection: We introduce PiVoT, a novel 105 video-based proactive wrapper designed to enhance the performance of multiple video- based detectors, including those for AR and STAD. PiVoT functions as a wrapper, effectively integrating with existing detectors and augmenting their capabilities without requiring significant architectural changes. • Template-Enhanced LoRA Training Paradigm: PiVoT incorporates a 3D U-Net for generating action-specific templates, combined with a LoRA framework. This approach allows for targeted fine-tuning of specific components within the detector, such as the LoRA layers in the CNN backbone or transformer attention layers, resulting in improved performance with minimal training. 
• A plug and play architecture module: Experiments demonstrate that PiVoT can be used as a plug-and-play architecture module resulting in consistent gains in performance across various datasets (e.g., Kinetics-400, Something-Something-v2, and AVA2.1) and detectors (TSN, TSM, MViTv2 and SlowFast), validating the efficacy of our approach. 7.2 Related Works Proactive Schemes Proactive methods enhance various tasks by embedding signals or perturbations into input images, aiding in deepfake tagging [325], detection of manipulated content [6], and localization [7]. Asnani et al. [5] improve object detection with such techniques while approaches by Yeh et al. [356] and Ruiz et al. [267] modify training data to disrupt generative model outputs while [270] introduce fixed signals for dataset attribution. A survey by Asnani et al. [9] discusses perturbation methods, applications in vision model and LLM defense [298, 342, 221], and privacy solutions [343, 299, 88]. Proactive methods also boost generative AI [285, 163] and address challenges in 3D domains [134, 151]. Different from the above strategies, we propose to apply the proactive paradigm in improving the performance of video-based tasks. Action Recognition In recent years, significant advancements have been made in AR through the integration of various modalities and deep learning techniques. Early works focus on hand-crafted features using RGB data, such as the Temporal Template method by Bobick et al. [21] and STIP by 106 Laptev et al. [175]. The advent of deep learning sees the rise of two-stream CNNs like Simonyan et al. [280] and RNNs such as LRCN by Donahue et al. [76], which improves spatiotemporal feature extraction. Temporal Segment Networks (TSN) by Wang et al. [324] and Temporal Shift Module (TSM) by Lin et al. [187] further enhance temporal modeling capabilities. More recently, 3D CNNs and Transformer-based methods, including I3D by Carreira et al. [35] and ViViT by Arnab et al. [3], have achieved state-of-the-art (SoTA) results. Multimodal fusion techniques, such as the three-stream CNN for RGB and audio by Wang et al. [315] and the combination of depth and inertial sensors by Dawar et al. [68], are also explored to improve accuracy. MViTv2 [182] is a unified architecture for image and video classification, as well as object detection. In egocentric action recognition (EAR), deep learning frameworks like Ego-ConvNet by Singh et al. [283] and the Mutual Context Network by Huang et al. [140] address the unique challenges of first-person video data. We propose a wrapper which plugs with a pre-existing AR detector, resulting in performance enhancement. Spatio-temporal Action Detection (STAD) STAD has seen significant advancements in recent years, driven by the rapid development of deep learning techniques. Early works in STAD, such as those by Gkioxari et al. [105] and Weinzaepfel et al. [336], lay the foundation by introducing methods to link frame-level detections into action tubes. The introduction of region proposal networks (RPN) by Saha et al. [271] and the use of two-stream architectures to incorporate both appearance and motion cues [242] further improve detection accuracy. Recent approaches have leveraged 3D convolutional neural networks (3D CNNs) to capture motion information across multiple frames, as demonstrated by Gu et al. [112] and Girdhar et al. [99]. SlowFast [85] introduces two pathways to capture spatial and temporal dynamics. The integration of visual relation modeling, as seen in works by Sun et al. 
[289] and Girdhar et al. [100], enhances the understanding of interactions between actors and objects, leading to more accurate action detection. Additionally, the development of efficient and real-time models, such as YOWO [167] and WOO [44], has addressed the computational challenges associated with STAD. The use of transformer-based frameworks, like TubeR [376] and HIT [84], also shows promising results, highlighting the potential of transformers 107 Figure 7.2 Overview of PiVoT. This figure illustrates the PiVoT framework for video-based tasks. (a) Video frames are processed through a 3D U-Net model to generate templates, which are then perturbed by adding them to the original frames. The perturbed frames are passed to a detector enhanced with LoRA layers, producing final predictions. (b) Detailed visualization of LoRA (c) LoRA applied to CNN layers by injecting low-rank integration to the pretrained weights. adaptation matrices 𝑨 and 𝑩 into pretrained weights 𝑾, and LoRA applied to attention layers, modifying the query/key/value weights 𝑾𝑄/𝐾/𝑉 with additional low-rank matrices 𝑨𝑄/𝐾/𝑉 and 𝑩𝑄/𝐾/𝑉 . The framework is trained in an end-to-end manner using the respective baseline losses. Best viewed in color. in this domain. Unlike these works, PiVoT uses proactive learning to improve STAD detector performance. 7.3 Method 7.3.1 Preliminary In this paper we study two video-based tasks: Action Recognition (AR), and Spatio-Temporal Action Detection (STAD). AR aims to identify human actions from a video sequence. Let V = {𝑭1, 𝑭2, . . . , 𝑭𝑇 } represent a video sequence, where each frame 𝑭𝑡 ∈ R𝐻×𝑊×3 corresponds to an image of height 𝐻, width 𝑊, and three color channels (RGB). The sequence consists of 𝑇 frames. The goal is to classify the video V into one of the possible action classes 𝑦 ∈ A, where A = {𝑎1, 𝑎2, . . . , 𝑎𝐾 } represents the set of 𝐾 possible action labels. The task can be formulated as learning a function 𝑓 , parameterized by 𝜃, that maps the video 108 sequence V to the predicted action label ˆ𝑦: Thus, the predicted action label ˆ𝑦 is given by: 𝑓𝜃 : V → ˆ𝑦. ˆ𝑦 = argmax 𝑦∈A 𝑝(𝑦|V; 𝜃), (7.1) (7.2) where 𝑝(𝑦|V; 𝜃) is the probability that the action label 𝑦 corresponds to the video sequence V, given the model parameters 𝜃. STAD aims to identify action types in a video and localize them across frames. For the video V, the task is to detect the action label 𝑦 ∈ A = {𝑎1, 𝑎2, . . . , 𝑎𝐾 } and bounding box 𝑩𝑡 for each frame. Our goal is to learn a function 𝑓 with parameters 𝜃 that maps the video sequence V to detected actions and spatial locations: 𝑓𝜃 : V → {(𝑦, 𝑩𝑡) | 𝑭𝑡, 𝑡 = 1, . . . , 𝑇 }, (7.3) where 𝑦 is the action label and 𝑩𝑡 ∈ R4 represents bounding box coordinates for frame 𝑭𝑡. Thus, the predicted action detection across frames is given by: (𝑦∗, {𝑩∗ 𝑡 }𝑇 𝑡=1) = argmax 𝑝(𝑦, 𝑩𝑡 |V; 𝜃), 𝑇 (cid:214) (7.4) 𝑦∈A where 𝑝(𝑦, 𝑩𝑡 |V; 𝜃) denotes the likelihood that 𝑦 and 𝑩𝑡 describe the observed action in each 𝑡=1 frame given 𝜃. 7.3.2 PiVoT 7.3.2.1 Overview For video-based tasks, capturing spatial and temporal dynamics is crucial for accurately identifying actions in video sequences. As shown in Fig. 7.2(a), we propose a template-enhanced Low-Rank Adaptation (LoRA) approach, which utilizes a 3D U-Net to generate action-specific templates that are added to video frames before they are processed by a detector. 
The detector, initialized with pretrained weights, is modified by adding LoRA layers to specific components—either in the CNN backbone or transformer attention layers—allowing for efficient adaptation to the task. 109 This approach enables us to train only the 3D U-Net and the LoRA layers, while the rest of the detector remains fixed. Next we’ll discuss each component in more detail. 7.3.2.2 Template Generation Inspired by previous works [7, 6], proactive schemes are applied to enhance the performance of computer vision tasks by perturbing input images using a template. In the context of video-based tasks, we propose to use proactive schemes to enhance the performance of the baseline methods by perturbing each frame of a video sequence using a template. Proactive approaches offer the flexibility of using fixed or learnable templates. The template can either be universal [7, 6] across the entire dataset, or data-dependent template [325, 5]. In our scenario, video-based tasks poses unique challenges due to the substantial variability in video- content across different contexts, including differences in motion patterns, temporal dynamics, and viewpoint changes. These fluctuations introduce a level of complexity that may surpass the representational capacity of a fixed template set, potentially limiting the performance. To address this limitation, we propose an approach that dynamically generates a unique video template for each video sequence. This is achieved using an 3D U-Net based encoder network E designed to produce tailored templates based on the specific action and contextual cues within each sequence. In doing so, our method adapts more fluidly to the diverse and often nuanced visual patterns present across video datasets, ensuring enhanced representation and recognition of actions that vary significantly from one instance to another. Let each frame of the video be represented as 𝑭𝑡 ∈ R𝐻×𝑊×3. For each frame, the model learns a frame-specific template 𝑺𝑡 ∈ R𝐻×𝑊×3. In PiVoT, each frame 𝑭𝑡 is perturbed using a transformation T that applies the template 𝑺𝑡 to the frame. This is achieved through element-wise addition, resulting in the perturbed frame T (𝑭𝑡): T (𝑭𝑡) = T (𝑭𝑡; 𝑺𝑡) = 𝑭𝑡 + 𝑺𝑡 = 𝑭𝑡 + E (V)𝑡 . (7.5) Hence, the video V = {𝑭1, 𝑭2, . . . , 𝑭𝑇 } is transformed to a perturbed video T (V), which is then used for video-based tasks. Therefore, our proactive wrapper changes the formulation defined 110 in Eq. (7.2) and Eq. (7.4) as follows: ˆ𝑦 = argmax 𝑦∈A 𝑝(𝑦|T (V); 𝜃), (𝑦∗, {𝑩∗ 𝑡 }𝑇 𝑡=1) = argmax 𝑦∈A 𝑇 (cid:214) 𝑡=1 𝑝(𝑦, 𝑩𝑡 |T (V); 𝜃). (7.6) (7.7) 7.3.2.3 Detector with LoRA PiVoT enhances a detector using LoRA layers (Fig. 7.2(b)), which enable efficient fine-tuning by introducing adaptable components while keeping the core pretrained weights fixed. LoRA is applied to either the CNN backbone or the transformer attention layers, offering flexibility in adapting to action-specific features encoded by templates generated from E. This section details the integration of LoRA into both architectures, illustrating its roles in improving detectors. In models with CNN backbones, convolutional layers are key to extracting spatial features from video frames. Fine-tuning these layers requires updating a large number of parameters, which can be computationally expensive. LoRA addresses this challenge by introducing low-rank matrices into the convolutional layers [45]. As shown in Fig. 
7.2(c), each convolutional layer contains a weight matrix 𝑾 representing the filter kernels used to process input frames. LoRA modifies this weight matrix by decomposing it into two smaller, trainable matrices 𝑨 and 𝑩: 𝑾LoRA = 𝑾 + 𝛼 𝑨𝑩. (7.8) Here, 𝑾 represents the fixed pretrained weights, 𝑨 ∈ R𝑑×𝑟 and 𝑩 ∈ R𝑟×𝑑 are the low-rank matrices, with 𝑟 ≪ 𝑑, and 𝛼 is a scaling factor that controls the influence of the adaptation. During training, only the low-rank matrices 𝑨 and 𝑩 are updated, significantly reducing the number of trainable parameters. This adaptation allows the convolutional layers to better respond to template-enhanced inputs from E. The modified CNN backbone thus retains its pretrained strengths while adjusting its spatial feature extraction to the new patterns encoded by the templates, enabling efficient adaptation without the need for extensive retraining of the detector. In contrast, transformer-based detectors leverage self-attention mechanisms to capture temporal dependencies across frames, making them well-suited for modeling complex action sequences. The 111 challenge, however, lies in fine-tuning these attention layers, which typically involves adjusting large query, key, and value matrices used in each attention head. LoRA offers a solution by introducing trainable low-rank matrices into these components (Fig. 7.2(c)), allowing for efficient adaptation while fixing the main parameters [136]. Specifically, LoRA modifies the query matrix 𝑾𝑄 as: 𝑾𝑄,LoRA = 𝑾𝑄 + 𝛼 𝑨𝑄 𝑩𝑄, (7.9) where 𝑾𝑄 ∈ R𝑑𝑄×𝑑𝑄 is the original query weight matrix, 𝑨𝑄 ∈ R𝑑𝑄×𝑟 and 𝑩𝑄 ∈ R𝑟×𝑑𝑄 are the low-rank matrices specific to the query transformation. Similar adjustments are made to the key and value matrices, allowing the attention mechanism to adapt to action-specific dynamics encoded in the template-enhanced frames. By training only the low-rank matrices 𝑨 and 𝑩, the model efficiently adapts to the template-enhanced inputs. By focusing only on training the 3D U-Net and the LoRA layers, our approach allows the video detector to quickly adapt to the new information provided by template-enhanced inputs. This selective fine-tuning strategy results in a model that is both computationally efficient and capable of achieving high accuracy in recognizing actions across diverse video datasets. 7.4 Experiments 7.4.1 Implementation Details We consider two video-based tasks for demonstrating the effectiveness of PiVoT, namely, AR and STAD. Datasets For AR task, we conduct experiments on two datasets: Something-Something-v2 [111], and Kinetics-400 [160] dataset. For STAD task, we use AVA2.1 [112] dataset. Below are the details for each dataset. 1. Something-Something v2: A large-scale video dataset focused on fine-grained human-object interactions, containing over 220, 000 labeled video clips across 174 action classes. 2. Kinetics 400: A widely used video dataset with 400 action classes, featuring around 300, 000 high-quality clips sourced from YouTube to represent diverse human activities. 112 Table 7.1 Performance comparison on Something-Something-v2 dataset for TSN, TSM, and MViTv2 models, with and without PiVoT wrapper. Method Reported Reproduced PiVoT TSN [324](%)↑ TSM [187](%)↑ MViTv2 [182](%)↑ Top-1 Top-5 Top-1 62.72 67.09 35.51 59.19 67.39 35.69 63.41 78.71 51.37 Top-1 68.11 64.29 68.81 Top-5 87.70 85.14 88.11 Top-5 91.02 89.21 91.63 Table 7.2 Performance comparison on Kinetics-400 dataset for TSN, TSM, and MViTv2 models, with and without PiVoT wrapper. 
Method Reported Reproduced PiVoT TSN [324](%)↑ TSM [187](%)↑ MViTv2 [182](%)↑ Top-1 Top-5 Top-1 73.18 90.65 72.83 69.14 86.21 67.19 71.31 87.13 69.61 Top-5 90.56 87.03 89.56 Top-1 81.11 79.12 81.51 Top-5 94.73 94.21 94.91 Figure 7.3 Template visualization for TSN detector on Something-Something-v2. We show the (a) input frames, (b) estimated template, and the (c) perturbed frames. The estimated template captures temporal information, as indicated by the shadow-like artifacts, which aid the action recognition (AR) detector in identifying motion cues more effectively. The template after being added changes the distribution of the input frames spatially to improve the performance accordingly. 3. AVA 2.1: An action detection dataset providing spatio-temporal annotations for actions in 430, 15-minute movie clips, designed for detailed analysis of person-centered activities in complex scenes. Detectors and Evaluation For the AR task, we incorporate PiVoT into multiple detectors: TSN [324], TSM [187], and MViTv2 [182], and evaluate with and without the PiVoT wrapper. We report each detector’s Top-1 and Top-5 accuracy as metrics. For the STAD task, we use 113 SlowFast [85] detector reporting mean Average Precision (mAP) (%) as the metric. The performance for all the detectors is reported across three versions: the original reported values, the reproduced values (our baseline), and the results after incorporating the PiVoT wrapper. Starting with the pretrained models from our reproduced baselines, we observe that for some models, our reproduced performance differs slightly from the originally reported numbers. This variation could be due to differences in training setups or minor architectural adjustments in the models that were not disclosed in the original setup. Therefore, for a fair comparison, we apply our proposed wrapper on the reproduced pre-trained weights. We use the MMACTION2 toolbox [58] for each detector codebase and pretrained models. We select hyperparameter values as follows: LoRA rank 𝑟 = 4 and scaling factor 𝛼 = 0.01. All the experiments are done on 8 A100 NVIDIA GPUs. We use the default parameters for each detector as reported in the respective papers by the authors (see details in supplementary). 7.4.2 Results AR Task As shown in Tab. 7.1, PiVoT significantly improves both Top-1 and Top-5 accuracies across all detectors, particularly for TSN and TSM, where gains are more pronounced. This improvement is attributed to the addition of object-specific templates, which aid the models in capturing temporal consistency and semantic continuity in actions. LoRA layers enable PiVoT to adapt these templates without requiring substantial parameter updates, resulting in a parameter- efficient performance boost. We further show the input frames, templates, and the perturbed frames in Fig. 7.3. Perturbed frames are dominated by the templates, with their distribution modified to enhance AR accuracy. The estimated template aggregates temporal information, as evident from the shadow-like artifacts, which help the action recognition (AR) detector capture motion cues more effectively. This adaptation alters the distribution of input frames, enhancing model performance. The performance gains do not stem from merely increasing the number of trainable parameters but rather from the specially designed PiVoT wrapper, which effectively integrates proactive templates with LoRA-based adaptation. 
As demonstrated in our ablation studies, using only the 3D U-Net for template generation or solely incorporating LoRA layers does not consistently 114 Figure 7.4 Template visualization for a video in something-something-v2 dataset across various detectors (a) TSN, (b) TSM, (c) MViTv2, and (d) SlowFast. Table 7.3 Performance comparison on AVA2.1 dataset for SlowFast [85] detector, with and without PiVoT wrapper. Metric mAP (%)↑ Reported Reproduced PiVoT 26.36 24.11 24.32 guarantee performance improvements. Instead, PiVoT’s unique combination of template-enhanced input transformations and parameter-efficient fine-tuning enables robust performance gains across different video-based tasks while maintaining computational efficiency. Tab. 7.2 presents a similar comparison on the Kinetics-400 dataset for the three detectors. PiVoT 115 Table 7.4 Training iterations ablation. Analysis of training iterations on TSN and MViTv2 on Something-Something-v2, with extended iterations comparable to those after incorporating PiVoT. Inference times are shown before and after PiVoT application. Proactive training with PiVoT provides greater performance gains than merely increasing training iterations, with a slight increase in inference time. [KEYS: itr.: number of training iterations]. TSN [324] MViTv2 [182] Method Reproduced Reproduced PiVoT (ms)↓ Top-1 Top-5 Time Top-1 Top-5 Time (ms)↓ (%)↑ 35.69 35.71 51.37 (%)↑ 64.29 64.40 68.81 (%)↑ 89.21 90.13 91.63 (%)↑ 67.39 67.45 78.71 19.23 13.21 20.68 15.28 Itr. 1𝑋 2𝑋 2𝑋 demonstrates clear improvements in both Top-1 and Top-5 accuracy. The gains, while slightly less than those observed on Something-Something-v2, indicate that PiVoT ’s object-aware templates and LoRA-based adaptation are effective for enhancing action recognition on Kinetics-400. The Kinetics-400 dataset encompasses a wide variety of action types, featuring complex and dynamic object interactions that can be challenging for models to interpret consistently. This diversity makes performance enhancements particularly difficult, leading to less gains in performance. Fig. 7.4 shows template variations across different video action detectors for a single video. Each row corresponds to a specific detector: (a) TSN, (b) TSM, (c) MViTv2, and (d) SlowFast. The distinct visual patterns reflect each detector’s unique approach to capturing motion and spatial details to estimate the template. TSN emphasizes color variations, TSM appears more muted, MViTv2 focuses on high-contrast areas, and SlowFast highlights spatial outlines, demonstrating how each detector prioritizes different aspects of the video frames for video-based tasks. STAD Task As shown in Tab. 7.3, the reported mAP for the SlowFast model is 24.32%, while the reproduced implementation achieves a similar mAP of 24.11%. Notably, integrating the PiVoT wrapper into the detector significantly improves performance, boosting the mAP to 26.36%. This improvement underscores the efficacy of PiVoT in enhancing model performance by incorporating proactive learning techniques. The gain in mAP reflects the ability of PiVoT to better capture temporal and spatial features within video frames, demonstrating its benefit in STAD task as well. 116 7.4.3 Ablations The ablation study presented provides comprehensive insights into the impact of various components of PiVoT on the performance of video-based action detectors. This analysis evaluates the changes in performance when specific aspects of the PiVoT framework are modified. 
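The first ablation below compares per-clip inference time with and without the wrapper (Tab. 7.4). A minimal sketch of how such a latency comparison could be measured is given here; the warm-up and iteration protocol and the stub model are assumptions, not the exact benchmarking setup.

import time
import torch

@torch.no_grad()
def mean_inference_time_ms(model, inputs, warmup=10, iters=100):
    # Average forward-pass latency in milliseconds for one batch.
    model.eval()
    for _ in range(warmup):                   # warm up kernels and caches
        model(inputs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(inputs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return 1000.0 * (time.perf_counter() - start) / iters

# Example: time a stub baseline; the PiVoT-wrapped detector (template generator
# plus LoRA layers) would be timed the same way for the delta reported in Tab. 7.4.
baseline = torch.nn.Sequential(torch.nn.Flatten(),
                               torch.nn.Linear(3 * 8 * 64 * 64, 174))
clip = torch.randn(1, 3, 8, 64, 64)
print(f"baseline: {mean_inference_time_ms(baseline, clip):.2f} ms per clip")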
Computational Overhead, Inference Time, and Additional Detector Training. Tab. 7.4 shows that PiVoT introduces a minor increase in inference time, or extra time delta, on both TSN and MViTv2 detectors. Specifically, for TSN, the inference time rises from 19.23𝑚𝑠 to 20.68𝑚𝑠, representing a small delta of 1.45𝑚𝑠. For MViTv2, the inference time increases from 13.21𝑚𝑠 to 15.28𝑚𝑠, a delta of 2.07𝑚𝑠. Despite this slight overhead, the performance gains in Top-1 and Top-5 accuracy are substantial, making the extra time a worthwhile trade-off. PiVoT enables more efficient training with greater performance improvements compared to merely increasing training iterations for the baseline detector. PiVoT’s ability to deliver significant accuracy improvements with minimal increases in inference time underscores its efficacy as a proactive wrapper, enhancing detector performance without compromising computational efficiency. Perturbation Process. First, we study the perturbation process by replacing the 3D-UNeT with a 3D-CNN and then switching the transformation operation from addition to multiplication. As shown in Tab. 7.5, replacing the 3D-UNeT with 3D-CNN results in decreased Top-1 and Top-5 accuracies for both TSN and MViTv2 detectors. For instance, TSN’s Top-1 accuracy drops from 51.37% to 47.14% and MViTv2’s Top-1 accuracy declines from 68.81% to 63.23%. This drop is attributed to the nature of the templates generated by these models. Fig. 7.5(c) shows that templates generated using a 3D-UNeT retain frame-dependent semantic content that aligns well with the underlying video frames, enhancing the detector’s ability to capture temporal dynamics. Conversely, the templates generated by the 3D-CNN, shown in Fig. 7.5(b), lack this semantic coherence, which undermines the detector’s ability to leverage temporal relationships, leading to less performance. When the transformation operation is altered from addition to multiplication, the results in Tab. 7.5 indicate an even more pronounced drop in performance. For TSN, the Top-1 accuracy 117 Figure 7.5 Template visualization for the (a) video frames estimated using (b) 3D-CNN, and (c) 3D-UNeT for MViTv2 detector. The templates estimated using 3D-CNN doesn’t have semantic content, unlike estimation via 3D-UNeT which has frame-dependent semantic content useful for boosting the performance. falls to 42.23%, and for MViTv2, it decreases to 60.12%. This suggests that addition, as a transformation, maintains the semantic integrity of the template and preserves crucial visual information, while multiplication may introduce extreme distortions to the input image disrupting the detector’s learning process. This proves that the performance improvements achieved by PiVoT are not a result of simply increasing the number of trainable parameters but rather stem from its specialized design, which effectively combines proactive templates with LoRA-based adaptation. PiVoT ’s unique combination of template-driven input transformations and parameter-efficient fine-tuning ensures significant performance gains across diverse video-based tasks while maintaining computational efficiency. Detector Training. We ablate various ways of utilizing detector during our training. One 118 Table 7.5 Ablation study of various components of PiVoT wrapper. 
Changed PiVoT Perturbation Process LoRA 3D-UNeT Detector Frame selection From→To - 3D-UNeT→3D-CNN Add→Multiplication Yes→No LoRA→ST-Adapter [235] Yes→No Frozen→Finetune Pretrain→Scratch No→Yes TSN [324] MViTv2 [182] Top-1 Top-5 Top-1 Top-5 91.63 51.37 88.38 47.14 85.18 42.23 72.68 31.96 85.19 30.76 90.91 35.42 91.67 51.31 85.09 47.72 84.22 32.60 78.71 75.56 70.92 60.40 58.39 67.10 78.77 76.10 61.65 68.81 63.23 60.12 41.82 59.14 68.02 68.80 60.14 56.80 significant factor explored is the state of the detector during training. The results in Tab. 7.5 show that fine-tuning the detector while incorporating PiVoT yields similar performance as when using a frozen detector. For example, fine-tuning TSN results in a Top-1 accuracy of 51.31%, whereas using a frozen state has similar Top-1 accuracy i.e.51.37%. This suggests that using a frozen detector might be more suited for practical applications with just finetuning the LoRA layers with the templates while keeping the detector frozen, resulting in a fast and efficient training paradigm. Furthermore, training the detector from scratch instead of using pretrained weights significantly impacts performance. The Top-1 accuracy for TSN decreases from 51.31% to 47.72%, and for MViTv2, it drops from 68.80% to 60.14%. This underscores the importance of leveraging pretrained weights to provide a strong starting point, allowing PiVoT to effectively enhance detection performance through incremental learning. Frame Selection. We explored a Frame Selection strategy to enhance TSN and MViTv2 by prioritizing frames with high template norms, aiming to emphasize semantically rich content. Inspired by prior work [377, 133, 382] on selective frame sampling, we sampled four times more high-norm frames. However, this led to a performance drop, with TSN and MViTv2 Top- 1 accuracy declining to 32.60% and 56.80%, respectively. This suggests that template norm- based selection overemphasizes specific segments, disrupts temporal continuity, and reduces frame diversity, ultimately hindering action recognition. LoRA Ablation. We analyze the role of LoRA in PiVoT by first examining the impact of 119 Figure 7.6 Ablation for (a) LoRA rank, and (b) scaling factor. its removal. Eliminating LoRA significantly reduces performance, with TSN’s Top-1 accuracy dropping from 51.37% to 31.96% and MViTv2’s from 68.81% to 41.82%. This highlights LoRA’s critical role in fine-tuning specific model components to enhance proactive template utility while avoiding extensive parameter updates. Next, prior works have proposed parameter efficient modules to be integrated with the base network for transfer learning purposes. We compare LoRA with ST-Adapter [235] that trains an adapter at the network’s end while keeping the base frozen. Replacing LoRA with ST- Adapter and fine-tuning TSN and MViTv2 (while keeping base detectors frozen) leads to inferior AR performance (Tab. 7.5). Unlike adapters, LoRA’s lightweight design integrates directly into self-attention and convolution blocks, enabling better adaptation, lower compute cost, and faster convergence across both convolutional and transformer architectures. 120 In Fig. 7.6(a), we examine the effect of LoRA rank on TSN detector performance. Increasing the rank from 2 to 4 improves performance, indicating that additional capacity helps capture richer action recognition features. Beyond rank 4, performance stabilizes with minimal gains up to rank 128, suggesting diminishing returns. 
We select rank 4 for its near-optimal performance and lower parameter costs. In Fig. 7.6(b), we analyze the impact of the scaling factor (𝛼) from Eq. (7.9). At 𝛼 = 0.001, performance is low, indicating insufficient adaptation. Increasing to 𝛼 = 0.01 significantly improves performance, as LoRA effectively integrates template information while preserving pre-trained features. However, beyond 𝛼 = 0.01, performance degrades due to excessive modification of pre-trained weights. Thus, we use 𝛼 = 0.01, balancing adaptation and stability to optimize TSN detector performance. 7.5 Conclusion We propose PiVoT, a proactive video-based wrapper that enhances action recognition (AR) and spatio-temporal action detection (STAD) systems. Leveraging proactive learning and augmentation, PiVoT integrates as a plug-and-play module with various detectors, improving performance. It employs a 3D U-Net to generate action-specific templates, which are added to input frames, and a LoRA- based training paradigm for efficient fine-tuning while preserving detector stability. The estimated templates capture temporal information, evident in shadow-like artifacts that aid AR detectors in identifying motion cues, refining frame distribution, and boosting performance. Experiments across multiple datasets and detectors validate PiVoT ’s adaptability and scalability, demonstrating consistent improvements in video-based detection tasks. 121 CHAPTER 8 REVERSE ENGINEERING OF GENERATIVE MODELS: INFERRING MODEL HYPERPARAMETERS FROM GENERATED IMAGES State-of-the-art (SOTA) Generative Models (GMs) can synthesize photo-realistic images that are hard for humans to distinguish from genuine photos. Identifying and understanding manipulated media are crucial to mitigate the social concerns on the potential misuse of GMs. We propose to perform reverse engineering of GMs to infer model hyperparameters from the images generated by these models. We define a novel problem, “model parsing", as estimating GM network architectures and training loss functions by examining their generated images – a task seemingly impossible for human beings. To tackle this problem, we propose a framework with two components: a Fingerprint Estimation Network (FEN), which estimates a GM fingerprint from a generated image by training with four constraints to encourage the fingerprint to have desired properties, and a Parsing Network (PN), which predicts network architecture and loss functions from the estimated fingerprints. To evaluate our approach, we collect a fake image dataset with 100K images generated by 116 different GMs. Extensive experiments show encouraging results in parsing the hyperparameters of the unseen models. Finally, our fingerprint estimation can be leveraged for deepfake detection and image attribution, as we show by reporting SOTA results on both the deepfake detection (Celeb-DF) and image attribution benchmarks1. 8.1 Introduction Image generation techniques have improved significantly in recent years, especially after the breakthrough of Generative Adversarial Networks (GANs) [108]. Many Generative Models (GMs), including both GAN and Variational Autoencoder (VAE) [158, 51, 156, 164, 29, 42, 74], can generate photo-realistic images that are hard for humans to distinguish from genuine photos. This photo-realism, however, raises increasing concerns for the potential misuse of these models, e.g., by launching coordinated misinformation attack [314, 130]. As a result, deepfake detection [266, 1Vishal Asnani, Xi Yin, Tal Hassner, and Xiaoming Liu. 
"Reverse engineering of generative models: Inferring model hyperparameters from generated images." IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023. 122 Figure 8.1 Top: Three increasingly difficult tasks: (a) deepfake detection classifies an image as genuine or fake; (b) image attribution predicts which of a closed set of GMs generated a fake image; and (c) model parsing, proposed here, infers hyperparameters of the GM used to generate an image, for those models unseen during training. Bottom: We present a framework for model parsing, which can also be applied to simpler tasks of deepfake detection and image attribution. 213, 114, 210, 66, 229] has recently attracted growing attention. Going beyond the binary genuine vs.fake classification as in deepfake detection, Yu et al. [365] proposed source model classification given a generated image. This image attribution problem assumes a closed set of GMs, used in both training and testing. It is desirable to generalize image attribution to open-set recognition, i.e., classify an image generated by GMs which were not seen during training. However, one may wonder what else we can do beyond recognizing a GM as an unseen or new model. Can we know more about how this new GM was designed? How its architecture differs from known GMs in the training set? Answering these questions is valuable when we, as defenders, strive to understand the source of images generated by malicious attackers or identify coordinated misinformation attacks which use the same GM. We view this as the grand challenge of reverse engineering of GMs. While image attribution of GMs is both exciting and challenging, our work aims to take one step further with the following observation. When different GMs are designed, they mainly differ in 123 Table 8.1 Comparison of our approach with prior works on reverse engineering of models, fingerprint estimation and deepfake detection. We compare on the basis of input and output of methods, whether the testing is done on multiple unseen GMs and whether the testing is done on multiple datasets. [KEYS: R.E.: reverse engineering, I.A.: image attribution, D.D.: deepfake detection, Fing. est.: fingerprint estimation, mul.: multiple, un.: unknown, N.A.: network architecture, L.F.: Loss function, para.: parameters, sup.: supervised, unsup.: unsupervised]. Output Method (Year) Training data [300] (2016) N.A. para. [233] (2018) Model weights [137] (2018) N.A. para. [15] (2018) ✖ [209] (2019) ✖ [365] (2019) ✖ [331] (2020) ✖ [372] (2019) ✖ [266] (2019) ✖ [114] (2020) ✖ [213] (2019) ✖ [210] (2019) ✖ [66] (2020) ✖ [229] (2020) ✖ [211] (2020) ✖ [192] (2021) N.A. & L.F. para. Ours (2022) Input Attack on models Input-output images Memory access patterns Electromagnetic emanations Image Image Image Image Image Image Image Image Image Image Image Image Image Purpose R.E. R.E. R.E. R.E. I.A. I.A. I.A. I.A. D.D. D.D. D.D. D.D. D.D. D.D. D.D. D.D. R.E., I.A.,D.D. ✖ ✖ ✖ ✖ Sup. Sup. Sup. Sup. ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ Unsup. Fing. est. Test on mul. GMs Test on un. GMs Test on mul. data ✖ ✖ ✖ ✖ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✖ ✖ ✖ ✖ ✔ ✔ ✔ ✔ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✔ ✖ ✖ ✖ ✖ ✖ ✔ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✔ their model hyperparameters, including the network architectures (e.g., the number of layers/blocks, the type of normalization) and training loss functions. 
If we could map the generated images to the embedding space of the model hyperparameters used to generate them, there is a potential to tackle a new problem we termed as model parsing, i.e., estimating hyperparameters of an unseen GM from only its generated image ( Fig. 8.1). Reverse engineering machine learning models has been done before by relying on a model’s input and output [300, 233], or accessing the hardware usage during inference [137, 15]. To the best of our knowledge, however, reverse engineering has not been explored for GMs, especially with only generated images as input. There are many publicly available GMs that generate images of diverse contents, including faces, digits, and generic scenes. To improve the generalization of model parsing, we collect a large-scale fake image dataset with various contents so that our framework is not specific to a particular content. It consists of images generated from 116 CNN-based GMs, including 81 GANs, 13 VAEs, 6 Adversarial Attack models (AAs), 11 Auto-Regressive models (ARs) and 5 Normalizing Flow models (NFs). While GANs or VAEs generate an image by feeding a genuine image or latent code to the network, AAs modify a genuine image based on its objectives via back-propagation. ARs generate each pixel of a fake image sequentially, and NFs generate images via a flow-based 124 function. Despite such differences, we call all these models as GMs for simplicity. For each GM, our dataset includes 1, 000 generated images. We use each model’s hyperparameters, including network architecture parameters and training loss types, as the ground-truth for model parsing training. We propose a framework to peek inside the black boxes of these GMs by estimating their hyperparameters from the generated images. Unlike the closed-set setting in [365], we venture into quantifying the generalization ability of our method in parsing unseen GMs. Our framework consists of two components ( Fig. 8.1, bottom). A Fingerprint Estimation Network (FEN) infers the subtle yet unique patterns left by GMs on their generated images. Image fingerprint was first applied to images captured by camera sensors [203, 106, 174, 92, 308, 202, 41] and then extended to GMs [209, 365]. We estimate fingerprints using different constraints which are based on the general properties of fingerprint, including the fingerprint magnitude, repetitive nature, frequency range and symmetrical frequency response. Different loss functions are defined to apply these constraints so that the estimated fingerprints manifest these desired properties. These constraints enable us to estimate fingerprints of GMs without ground truth. The estimated fingerprints are discriminative and can serve as the cornerstone for subsequent tasks. The second part of our framework is a Parsing Network (PN), which takes the fingerprint as input and predicts the model hyperparameters. We consider parameters representing network architectures and loss function types. For the former, we form 15 parameters and categorize them into discrete and continuous types. For the latter, we form a 10-dimensional vector where each parameter represents the usage of a particular loss function type. Classification is used for estimating discrete parameters such as the normalization type, and regression is used for continuous parameters such as the number of layers. To leverage the similarity between different GMs, we group the GMs into several clusters based on their ground-truth hyperparameters. 
The mean and deviation are calculated for each GM. We use two different parsers: cluster parser and instance parser to predict the mean and deviation of these parameters, which are then combined as the final predictions. Among the 116 GMs in our collected dataset, there are 47 models for face generation and 69 125 for non-face image generation. We partition all GMs into two categories: face vs.non-face. We carefully curate four evaluation sets for face and non-face categories respectively, where every set well represents the GM population. Cross-validation is used in our experiments. In addition to model parsing, our FEN can be used for deepfake detection and image attribution. For both tasks, we add a shallow network that inputs the estimated fingerprint and performs binary (deepfake detection) or multi-class classification (image attribution). Although our FEN is not tailored for these tasks, we still achieve state-of-the-art (SOTA) performance, indicating the superior generalization ability of our fingerprint estimation. Finally, in coordinated misinformation attack, attackers may use the same GM to generate multiple fake images. To detect such attacks, we also define a new task to evaluate how well our model parsing results can be used to determine if two fake images are generated from the same GM. In summary, this paper makes the following contributions. • We are the first to go beyond model classification by formulating a novel problem of model parsing for GMs. • We propose a novel framework with fingerprint estimation and clustering of GMs to predict the network architecture and loss functions, given a single generated image. • We assemble a dataset of generated images from 116 GMs, including ground-truth labels on the network architectures and loss function types. • We show promising results for model parsing and our fingerprint estimation generalizes well to deepfake detection on the Celeb-DF benchmark [184] and image attribution [365], in both cases reporting results comparable or better than existing SOTA [66, 365]. The parsed model parameters can also be used in detecting coordinated misinformation attacks. 8.2 Related work Reverse engineering of models There is a growing area of interest in reverse engineering the hyperparameters of machine learning models, with two types of approaches. First, some methods 126 treat a model as a black box API by examining its input and output pairs. For example, Tramer et al. [300] developed an avatar method to estimate training data and model architectures, while Oh et al. [233] trained a set of while-box models to estimate model hyperparameters. The second type of approach assumes that the intermediate hardware information is available during model inference. Hua et al. [137] estimated both the structure and the weights of a CNN model running on a hardware accelerator, by using information leaks of memory access patterns. Batina et al. [15] estimated the network architecture by using side-channel information such as timing and electromagnetic emanations. Unlike prior methods which require access to the models or their inputs, our approach can reverse engineer GMs by examining only the images generated by these models, making it more suitable for real-world applications. We summarize our approach with previous works in Tab. 8.1. Fingerprint estimation Every acquisition device leaves a subtle but unique pattern on its captured image, due to manufacturing imperfections. Such patterns are referred to as device fingerprints. 
Device fingerprint estimation [203, 61] was extended to fingerprint estimation of GMs by Marra et al. [209], who showed that hand-crafted fingerprints are unique to each GM and can be used to identify an image’s source. Ning et al. [365] extended this idea to learning-based fingerprint estimation. Both methods rely on the noise signals in the image. Others explored frequency domain information. For example, Wang et al. [331] showed that CNN generated images have unique patterns in their frequency domain, regarded as model fingerprints. Zhang et al. [372] showed that features extracted from the middle and high frequencies of the spectrum domain were useful in detecting upsampling artifacts produced by GANs. Unlike prior methods which derive fingerprints directly from noise signals or the frequency domain, we propose several novel loss functions to learn GM fingerprints in an unsupervised manner ( Tab. 8.1). We further show that our fingerprint estimation can generalize well to other related tasks. Deepfake detection Deepfake detection is a new and active field with many recent developments. Rossler et al. [266] evaluated different methods for detecting face and mouth replacement manipulation. 127 Figure 8.2 Example images generated by all 116 GMs in our collected dataset (one image per model). Others proposed SVM classifiers on colour difference features [213]. Guarnera et al. [114] used Expectation Maximization [220] algorithm to extract features and convolution traces for classification. Marra et al. [210] proposed a multi-task incremental learning to classify new GAN generated images. Chai et al. [36] introduced a patch-based classifier to exaggerate regions that are more easily detectable. An attention mechanism [311] was proposed by Hao et al. [66] to improve the performance of deepfake detection. Masi et al. [211] amplifies the artifacts produced by deepfake methods to perform the detection. Nirkin et al. [229] seek discrepancies between face regions and their context [228] as telltale signs of manipulation. Finally, Liu [192] uses the spatial information as an additional channel for the classifier. In our work, the estimated fingerprint is fed into a classifier for genuine vs. fake classification. 128 Figure 8.3 t-SNE visualization for ground-truth vectors for (a) network architecture, (b) loss function and (c) network architecture and loss function combined. The ground-truth vectors are fairly distributed across the embedding space regardless of the face/non-face data. Figure 8.4 Our framework includes two components: 1) the FEN is trained with four objectives for fingerprint estimation; and 2) the PN consists of a shared network, two parsers to estimate mean and deviation for each parameter, an encoder to estimate fusion parameter, fully connected layers (FCs) for continuous type parameters and separate classifiers (CLs) for discrete type parameters in network architecture and loss function prediction. Blue boxes denote trainable components; green boxes denote feature vectors; orange boxes denote loss functions; red boxes denote other tasks our framework can handle; black arrows denote data flow; orange arrows denote loss supervisions. Best viewed in color. 8.3 Proposed approach In this section, we first introduce our collected dataset in Sec. 8.3.1. We then present the fingerprint estimation method in Sec. 8.3.2 and model parsing in Sec. 8.3.3. 
Finally, we apply our estimated fingerprints to deepfake detection, image attribution, and detecting coordinated misinformation attacks, as described in Sec. 8.3.4.

Table 8.2 Hyper-parameters representing the network architectures of GMs. (KEYS: cont. int.: continuous integer).
Parameter | Type | Range
# layers | cont. int. | [5, 95]
# convolutional layers | cont. int. | [0, 92]
# fully connected layers | cont. int. | [0, 40]
# pooling layers | cont. int. | [0, 4]
# normalization layers | cont. int. | [0, 57]
# filter | cont. int. | [0, 8365]
# parameters | cont. int. | [0.36𝑀, 267𝑀]
# blocks | cont. int. | [0, 16]
# layers per block | cont. int. | [0, 9]
normalization type | multi-class | 0, 1, 2, 3
non-linearity type in blocks | multi-class | 0, 1, 2, 3
non-linearity type in last layer | multi-class | 0, 1, 2, 3
up-sampling type | binary | 0, 1
skip connection | binary | 0, 1
down-sampling | binary | 0, 1

Table 8.3 Loss function types used by all GMs. We group the 10 loss functions into three categories (pixel-level, discriminator, and classification). We use the binary representation to indicate the presence of each loss type in training the respective GM. The loss functions are: 𝐿1, 𝐿2, Mean squared error (MSE), Maximum mean discrepancy (MMD), Least squares (LS), Wasserstein loss for GAN (WGAN), Kullback–Leibler (KL) divergence, Adversarial, Hinge, and Cross-entropy (CE).

8.3.1 Data collection

We make the first attempt to study the model parsing problem. Since data drives research, it is essential to collect a dataset for our new research problem. Given the large number of GMs published in recent years [335, 145], we consider a few factors while deciding which GMs to include in our dataset. First, since it is desirable to study whether model parsing is content-dependent, we hope to collect GMs with as diverse content as possible, such as faces, digits, and generic scenes. Second, we give preference to GMs where the authors have publicly released pre-trained models, generated images, or the training script. Third, the network architecture of the GM should be clearly described in the respective paper. To this end, we assemble a list of 116 publicly available GMs, including ProGan [156], StyleGAN [158], and others. A complete list is provided in the supplementary material. For each GM, we collect 1,000 generated images. Therefore, our dataset D comprises 116,000 images. We show example images in Fig. 8.2. These GMs were trained on datasets with various contents, such as CelebA [200], MNIST [72], CIFAR10 [168], ImageNet [71], facades [385], edges2shoes [385], and apple2oranges [385]. The dataset is available here. We further document the model hyperparameters for each GM as reported in their papers. Specifically, we investigate two aspects: network architecture and training loss functions. We form a super-set of 15 network architecture parameters (e.g., number of layers, normalization type) and 10 different loss function types. We thus obtain a large-scale fake image dataset D = {X_𝑖, y^𝑛_𝑖, y^𝑙_𝑖}_{𝑖=1}^{𝑁}, where X_𝑖 is a fake image, and y^𝑛_𝑖 ∈ R^{15} and y^𝑙_𝑖 ∈ R^{10} represent the ground-truth network architecture and loss functions, respectively. We also show the t-SNE distribution of both network architecture and loss functions in Fig. 8.3 for different types of models and datasets. We observe that the ground-truth vectors for both network architecture and loss function are evenly distributed across the embedding space for both types of data: face and non-face.
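To make the label construction concrete, the following sketch assembles the ground-truth vectors y^n (Tab. 8.2) and y^l (Tab. 8.3) for a single GM; the example architecture values and key names are hypothetical and purely illustrative, not taken from any model in the dataset.

import numpy as np

# 15 network-architecture entries (Tab. 8.2): 9 continuous + 6 discrete.
ARCH_KEYS = ["n_layers", "n_conv", "n_fc", "n_pool", "n_norm",
             "n_filters", "n_params", "n_blocks", "layers_per_block",   # continuous
             "norm_type", "nonlin_blocks", "nonlin_last",
             "upsample", "skip_connection", "downsample"]               # discrete

# 10 loss-function flags (Tab. 8.3), 1 if the GM was trained with that loss.
LOSS_KEYS = ["L1", "L2", "MSE", "MMD", "LS", "WGAN", "KL",
             "Adversarial", "Hinge", "CE"]

def ground_truth_vectors(arch: dict, losses: set):
    # Build y_n (15-D) and y_l (10-D) for a single generative model.
    y_n = np.array([float(arch[k]) for k in ARCH_KEYS])
    y_l = np.array([1.0 if k in losses else 0.0 for k in LOSS_KEYS])
    return y_n, y_l

# Hypothetical GM description (values are illustrative only).
arch = {"n_layers": 30, "n_conv": 26, "n_fc": 1, "n_pool": 0, "n_norm": 12,
        "n_filters": 512, "n_params": 28e6, "n_blocks": 8, "layers_per_block": 3,
        "norm_type": 1, "nonlin_blocks": 2, "nonlin_last": 3,
        "upsample": 1, "skip_connection": 0, "downsample": 1}
y_n, y_l = ground_truth_vectors(arch, losses={"Adversarial", "L1"})
y_nl = np.concatenate([y_n, y_l])   # 25-D vector later used for clustering (Sec. 8.3.3)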
8.3.2 Fingerprint estimation We adopt a network structure similar to the DnCNN model used in [370]. As shown in Fig. 8.4, the input to FEN is a generated image X, and the output is a fingerprint image F of the same size. Motivated by prior works on physical fingerprint estimation [153, 372, 331, 365, 209], we define the following four constraints to guide our estimated fingerprints to have the desirable properties. Magnitude loss Fingerprints can be considered as image noise patterns with small magnitudes. Similar assumptions were made by others when estimating spoof noise for spoofed face images [153] and sensor noise for genuine images [203]. The first constraint is thus proposed to regularize the fingerprint image to have a low magnitude with an 𝐿2 loss: 𝐽𝑚 = ||F||2 2 . (8.1) Spectrum loss Previous work observed that fingerprints primarily lie in the middle and high- frequency bands of an image [372]. We thus propose to minimize the low-frequency content in a fingerprint image by applying a low pass filter to its frequency domain: 𝐽𝑠 = ||L (F (F), 𝑓 )||2 2 , (8.2) where F is the Fourier transform, L is the low pass filter selecting the 𝑓 × 𝑓 region in the center of the 2D Fourier spectrum and making everything else zero. Repetitive loss Amin et al. [153] noted that the noise characteristics of an image are repetitive and exist everywhere in its spatial domain. Such repetitive patterns will result in a large magnitude in 131 the high-frequency band of the fingerprint. Therefore, we propose to maximize the high-frequency information to encourage this repetitive pattern: 𝐽𝑟 = −max{H (F (F), 𝑓 )}, (8.3) where H is a high pass filter assigning the 𝑓 × 𝑓 region in the center of the 2D Fourier spectrum to zero. Energy loss. Wang et al. [331] showed that unique patterns exist in the Fourier spectrum of the image generated by CNN networks. These patterns have similar energy in the vertical and horizontal directions of the Fourier spectrum. Our final constraint is proposed to incorporate this observation: 𝐽𝑒 = ||F (F) − F (F)𝑇 ||2 2 , (8.4) where F (F)𝑇 is the transpose of F (F). These constraints guide the training of our fingerprint estimation. As shown in Fig. 8.4, the fingerprint constraint is given by: 𝐽 𝑓 = 𝜆1𝐽𝑚 + 𝜆2𝐽𝑠 + 𝜆3𝐽𝑟 + 𝜆4𝐽𝑒, (8.5) where 𝜆1, 𝜆2, 𝜆3, 𝜆4 are the loss weights for each term. 8.3.3 Model parsing The estimated fingerprint is expected to capture unique patterns generated from a GM. Prior works adopted fingerprints for deepfake detection [213, 114] and image attribution [365]. However, we go beyond those efforts by parsing the hyperparameters of GMs. As shown in Fig. 8.4, we perform prediction using two parsers, namely, cluster parser and instance parser. We combine both outputs for network architecture and loss function prediction. We will now discuss the ground truth calculation and our framework in detail. 8.3.3.1 Ground truth hyperparamters Network architecture In this work, we do not aim to recover the network parameters. The reason is that a typical deep network has millions of network parameters, which reside in a very high dimensional space and is thus hard to predict. Instead, we propose to infer the hyperparameters that 132 define the network architecture, which are much fewer than the network parameters. Motivated by prior works in neural architecture search [294, 244, 191], we form a set of 15 network architecture parameters covering various aspects of architectures. As shown in Tab. 8.2, these parameters fall into different data types and have different ranges. 
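Returning briefly to the fingerprint constraints of Eqs. (8.1)-(8.4), the sketch below shows one possible PyTorch realization of the four terms and their weighted combination in Eq. (8.5); the exact masking of the f × f central spectrum region and the use of means rather than sums are assumptions about the implementation.

import torch

def fingerprint_losses(F_img: torch.Tensor, f: int = 50):
    # F_img: estimated fingerprint, shape (B, C, H, W), assumed square (H == W).
    B, C, H, W = F_img.shape
    spec = torch.fft.fftshift(torch.fft.fft2(F_img), dim=(-2, -1))  # centered spectrum
    mag = spec.abs()
    cy, cx = H // 2, W // 2

    # Central f x f low-frequency region of the shifted spectrum.
    low = mag[..., cy - f // 2: cy + f // 2, cx - f // 2: cx + f // 2]
    J_m = (F_img ** 2).mean()                         # Eq. (8.1): small magnitude
    J_s = (low ** 2).mean()                           # Eq. (8.2): suppress low frequencies

    high = mag.clone()                                # zero out the low-frequency center
    high[..., cy - f // 2: cy + f // 2, cx - f // 2: cx + f // 2] = 0
    J_r = -high.amax(dim=(-2, -1)).mean()             # Eq. (8.3): encourage repetitive peak
    J_e = ((spec - spec.transpose(-2, -1)).abs() ** 2).mean()  # Eq. (8.4): symmetric energy
    return J_m, J_s, J_r, J_e

# Weighted combination as in Eq. (8.5); weights follow the values reported in Sec. 8.4.1.
F_img = torch.randn(4, 3, 128, 128)
J_m, J_s, J_r, J_e = fingerprint_losses(F_img)
J_f = 0.05 * J_m + 0.001 * J_s + 0.1 * J_r + 1.0 * J_e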
We further split the network architecture parameters y𝑛 into two parts: y𝑛𝑐 ∈ R9 for continuous data type and y𝑛𝑑 ∈ R6 for discrete data type. Loss function In addition to the network architectures, the learned network parameters of trained GM can also impact the fingerprints left on the generated images. These network parameters are determined mainly by the training data and the loss functions used to train these models. We, therefore, explore the possibility of also predicting the training loss functions from the estimated fingerprints. The 116 GMs were trained with 10 types of loss functions as shown in Tab. 8.3. For each model, we compose a ground-truth vector y𝑙 ∈ R10, where each element is a binary value indicating whether the corresponding loss is used or not in training this model. Our framework parses two types of hyperparameters: continuous and discrete. The former includes the continuous network architecture parameters. The latter includes discrete network architecture parameters and loss function parameters. For clarity, we group these parameters into continuous and discrete types in the remaining of this section to describe the model parsing objectives. We use y𝑐 and y𝑑 to denote continuous and discrete parameters respectively. 8.3.3.2 Cluster parser prediction We have observed that directly estimating the hyperparameters independently for each GM yields inferior results. In fact, some of the GMs in our dataset have similar network architectures and/or loss functions. It is intuitive to leverage the similarities among different GMs for better hyperparameter estimation. To do this, we perform k-means clustering to group all GMs into different clusters, as shown in Fig. 8.5. Then we propose to perform cluster-level coarse prediction and GM-level fine prediction, which are subsequently combined to obtain the final prediction results. As we aim to estimate the parameters for network architecture and loss function, it is intuitive to combine them to perform grouping. Thus, we concatenate the ground truth network architecture 133 Figure 8.5 The idea of grouping various GMs into different clusters. For the test GM, we estimate its cluster mean and the deviation from that mean to predict network architecture and loss function type. parameters y𝑛 and loss function parameters y𝑙, denoted as y𝑛𝑙. We use these ground truth vectors to perform k-means clustering to find the optimal k-clusters in the dataset D = {𝑪1, 𝑪2, ...𝑪 𝑘 }. Our clustering objective can be written as: argmin D 𝑘 ∑︁ ∑︁ 𝑖=1 y𝑛𝑙 𝑗 ∈𝑪𝑖 ||y𝑛𝑙 𝑗 − 𝜇𝑖 ||2, (8.6) where 𝜇𝑖 is the mean of the ground truth of the GMs in 𝑪𝑖. Our dataset comprises different kinds of GMs, namely GANs, VAEs, AAs, ARs, and NFs. We perform clustering after separating the training data into different kinds of GMs. This is done to ensure that each cluster would belong to one particular kind of GM. Next, we select the value of k i.e., the number of clusters, using the elbow method adopted by previous works [18, 166]. After determining the clusters comprising of similar GMs, we estimate the ground truth y𝑢 to represent the respective cluster. We estimate this cluster ground truth using different ways for continuous and discrete parameters. For the former, we take the average of each parameter using the ground truth for all GMs in the respective cluster. For the latter, we perform majority voting for every parameter to find the most common class across all GMs in the cluster. We use different loss functions to perform cluster-level prediction. 
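The grouping step of Eq. (8.6) and the construction of the cluster-level ground truth described above (averaging continuous entries and majority-voting discrete ones) can be sketched as follows, assuming scikit-learn's KMeans; the per-GM-type separation and the elbow-based choice of k are omitted for brevity, and the toy data are synthetic.

import numpy as np
from sklearn.cluster import KMeans

def cluster_ground_truth(y_nl, n_cont, k):
    # y_nl: (N_gm, D) concatenated hyperparameter vectors; the first n_cont
    # entries of each vector are continuous, the rest are discrete class indices.
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(y_nl)   # Eq. (8.6)
    cluster_gt = np.zeros((k, y_nl.shape[1]))
    for c in range(k):
        members = y_nl[km.labels_ == c]
        cluster_gt[c, :n_cont] = members[:, :n_cont].mean(axis=0)    # average
        for j in range(n_cont, y_nl.shape[1]):                       # majority vote
            vals, counts = np.unique(members[:, j], return_counts=True)
            cluster_gt[c, j] = vals[counts.argmax()]
    return km, cluster_gt

# Toy example: 20 hypothetical GMs, 9 continuous + 16 discrete entries, 4 clusters.
rng = np.random.default_rng(0)
y_nl = np.hstack([rng.random((20, 9)), rng.integers(0, 4, (20, 16))])
km, cluster_gt = cluster_ground_truth(y_nl, n_cont=9, k=4)
deviation_cont = y_nl[:, :9] - cluster_gt[km.labels_][:, :9]   # instance-parser target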
For continuous parameters, we perform regression for parameter estimation. As these parameters have different ranges, we further perform a min-max normalization to bring all parameters to the range of [0, 1]. An 𝐿2 loss is used to estimate the prediction error:

𝐽^𝑐_𝑢 = ||ŷ^𝑐_𝑢 − y^𝑐_𝑢||^2_2, (8.7)

where ŷ^𝑐_𝑢 is the cluster mean prediction and y^𝑐_𝑢 is the normalized ground-truth cluster mean. For discrete parameters, the prediction is made via individual classifiers. Specifically, we train 𝑀 = 16 classifiers (6 for network architecture and 10 for loss function parameters), one for each discrete parameter. The loss term for the discrete-parameter cluster prediction is defined as:

𝐽^𝑑_𝑢 = − Σ_{𝑚=1}^{𝑀} sum(y^𝑑_{𝑢𝑚} ⊙ log(S(ŷ^𝑑_{𝑢𝑚}))), (8.8)

where y^𝑑_{𝑢𝑚} is the ground-truth one-hot vector for the respective class of the 𝑚-th discrete parameter, ŷ^𝑑_{𝑢𝑚} are the class logits, S is the Softmax function that maps the class logits into the range of [0, 1], ⊙ is the element-wise multiplication, and sum(·) computes the summation of a vector's elements. As shown in Fig. 8.4, the clustering constraint is given by:

𝐽_𝑢 = 𝛾_1 𝐽^𝑐_𝑢 + 𝛾_2 𝐽^𝑑_𝑢, (8.9)

where 𝛾_1 and 𝛾_2 are the loss weights for each term.

8.3.3.3 Instance parser prediction

The cluster parser performs coarse-level prediction. To obtain a finer prediction, we use an instance parser to make a GM-level prediction, which ignores any similarity among GMs. This parser aims to predict the deviation of every parameter from the coarse-level prediction. The ground-truth deviation vector y_𝑣 is constructed differently for the two types of parameters. For continuous parameters, the deviation is the difference between the ground truth of the GM and the ground truth of the cluster to which the GM was assigned. For discrete parameters, the actual ground-truth class of the parameter acts as the deviation from the most common class in the cluster ground truth. We use different loss functions to perform deviation-level prediction. Specifically, we use an 𝐿2 loss to estimate the prediction error for continuous parameters:

𝐽^𝑐_𝑣 = ||ŷ^𝑐_𝑣 − y^𝑐_𝑣||^2_2, (8.10)

where ŷ^𝑐_𝑣 is the deviation prediction and y^𝑐_𝑣 is the deviation ground truth of continuous type. We have noticed that the class distribution of some discrete parameters is imbalanced. Therefore, we apply a weighted cross-entropy loss for every parameter to handle this challenge. We train 𝑀 = 16 classifiers, one for each discrete parameter. For the 𝑚-th classifier with 𝑁_𝑚 classes (𝑁_𝑚 = 2 or 4 in our case), we calculate a loss weight for each class as 𝑤^𝑖_𝑚 = 𝑁 / 𝑁^𝑖_𝑚, where 𝑁^𝑖_𝑚 is the number of training examples for the 𝑖-th class of the 𝑚-th classifier, and 𝑁 is the total number of training examples. As a result, the class with more examples is down-weighted, and the class with fewer examples is up-weighted to overcome the imbalance issue, which will be empirically demonstrated in Fig. 8.9. The loss term for the discrete-parameter deviation prediction is defined as:

𝐽^𝑑_𝑣 = − Σ_{𝑚=1}^{𝑀} sum(w_𝑚 ⊙ y^𝑑_{𝑣𝑚} ⊙ log(S(ŷ^𝑑_{𝑣𝑚}))), (8.11)

where y^𝑑_{𝑣𝑚} is the ground-truth one-hot deviation vector for the 𝑚-th classifier, w_𝑚 is a weight vector over all classes of the 𝑚-th classifier, and ŷ^𝑑_{𝑣𝑚} are the class logits. As shown in Fig. 8.4, the deviation constraint is given by:

𝐽_𝑣 = 𝛾_3 𝐽^𝑐_𝑣 + 𝛾_4 𝐽^𝑑_𝑣, (8.12)

where 𝛾_3 and 𝛾_4 are the loss weights for each term.
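As a concrete illustration of the re-weighting in Eq. (8.11), with w^i_m = N/N^i_m computed from the training label counts, a minimal PyTorch sketch for a single discrete parameter follows; relying on the weight argument of the built-in cross-entropy is an assumption about one convenient realization, and the label counts are synthetic.

import torch
import torch.nn.functional as F

def class_weights(labels: torch.Tensor, num_classes: int) -> torch.Tensor:
    # w_i = N / N_i for one discrete hyperparameter (classifier m).
    counts = torch.bincount(labels, minlength=num_classes).clamp(min=1)
    return labels.numel() / counts.float()

# Toy example: an imbalanced binary parameter (e.g., skip connection yes/no).
labels = torch.tensor([0] * 90 + [1] * 10)           # 90 vs. 10 training examples
w = class_weights(labels, num_classes=2)             # tensor([ 1.11, 10.00])

logits = torch.randn(100, 2)                         # classifier outputs for this parameter
loss_m = F.cross_entropy(logits, labels, weight=w)   # one of the M terms of Eq. (8.11)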
8.3.3.4 Combining predictions We use a cluster parser to perform a coarse-level mean prediction and an instance parser to predict a deviation prediction for each GM. The final prediction of our framework, i.e., the prediction at the fine-level is the combination of the outputs of these two parsers. For continuous parameters, we perform the element-wise addition of the coarse-level mean and deviation prediction: ˆy𝑐 = ˆy𝑐 𝑢 + ˆy𝑐 𝑣, (8.13) For discrete parameters, we have observed that element-wise addition of the logits for every classifier in both parsers didn’t perform well. Therefore, to integrate the outputs, we train an encoder 136 network to predict a fusion parameter ˆ𝑝𝑑 ∈ [0, 1] for each classifier. For any parameter, the value of the fusion parameter is 1 if the cluster class is the same as the GM class, encouraging the parsing network to give importance to the cluster parser output. The value of the fusion parameter is 0 if the GM class is different from the cluster class. Therefore, for 𝑚-th classifier, the training of the model is supervised by the ground truth 𝑝𝑑 𝑚 as defined below: 𝑝𝑑 𝑚 = 1, 0, 𝑢𝑚 = y𝑑 y𝑑 𝑣𝑚 𝑢𝑚 ≠ y𝑑 y𝑑 𝑣𝑚 . (8.14)    To train our encoder, we use the ground truth fusion parameter p𝑑 which is the concatenation for all parameters. The training is done via cross-entropy loss as shown below: 𝐽𝑝 = − 𝑀 ∑︁ 𝑚=1 ( 𝑝𝑑 𝑚log(G( ˆ𝑝𝑑 𝑚)) + (1 − 𝑝𝑑 𝑚)log(1 − G( ˆ𝑝𝑑 𝑚))). where G is the Sigmoid function that maps the class logits into the range of [0, 1]. As shown in Fig. 8.4 for discrete parameters, the final prediction is given by: ˆy𝑑 = ˆp𝑑 ⊙ ˆy𝑑 𝑢 + (1 − ˆp𝑑) ⊙ ˆy𝑑 𝑣 . The overall loss function for model parsing is given by: 𝐽 = 𝐽 𝑓 + 𝐽𝑢 + 𝐽𝑣 + 𝛾5𝐽𝑝. (8.15) (8.16) (8.17) where 𝛾5 is the loss weight for fusion constraint. Our framework is trained end-to-end with fingerprint estimation ( Eq. (8.5)) and model parsing ( Eq. (8.17)). 8.3.4 Other applications In addition to model parsing, our fingerprint estimation can be easily leveraged for other applications such as detecting coordinated misinformation attacks, deepfake detection and image attribution. Coordinated misinformation attack In coordinated misinformation attacks, the attackers often use the same model to generate multiple fake images. One way to detect such attacks is to classify whether two fake images are generated from the same GM, despite that this GM might be unseen to the classifier. This task is not straightforward to perform by prior works. However, given the 137 ability of our model parsing, this is the ideal task that we can contribute. To perform this binary classification task, we use the parsed network architecture and loss function parameters to calculate the similarity score between two test images. We calculate the cosine similarity for continuous type parameters and fraction of the number of parameters having same class for discrete type. Both cosine similarity and fraction of parameters are averaged to get the similarity score. Comparing the cosine similarity with a threshold will lead to the binary classification decision of whether two images come from the same GM or not. Deepfake detection We consider the binary classification of an image as either genuine or fake. We add a shallow network on the generated fingerprint to predict the probabilities of being genuine or fake. The shallow network consists of five convolution layers and two fully connected layers. Both genuine and fake face images are used for training. 
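Referring back to the parser fusion of Eqs. (8.13)-(8.16), the short sketch below illustrates how the two parsers' outputs could be combined; the tensor shapes and per-classifier class counts are illustrative assumptions rather than the exact configuration.

import torch

def combine_predictions(yc_u, yc_v, yd_u_logits, yd_v_logits, p_logits):
    # Fuse cluster-parser (u) and instance-parser (v) outputs.
    # yc_u, yc_v: (B, 9) continuous mean and deviation predictions.
    # yd_u_logits, yd_v_logits: lists of M per-classifier logits, each (B, n_classes_m).
    # p_logits: (B, M) fusion-parameter logits from the encoder.
    y_c = yc_u + yc_v                                    # Eq. (8.13)
    p = torch.sigmoid(p_logits)                          # fusion parameter in [0, 1]
    y_d = [p[:, m:m + 1] * u + (1 - p[:, m:m + 1]) * v   # Eq. (8.16), per classifier
           for m, (u, v) in enumerate(zip(yd_u_logits, yd_v_logits))]
    return y_c, y_d

# Toy shapes: batch of 4, M = 16 discrete classifiers with 2 or 4 classes each.
B, M = 4, 16
n_classes = [2] * 12 + [4] * 4                           # illustrative class counts
yd_u = [torch.randn(B, n) for n in n_classes]
yd_v = [torch.randn(B, n) for n in n_classes]
y_c, y_d = combine_predictions(torch.randn(B, 9), torch.randn(B, 9),
                               yd_u, yd_v, torch.randn(B, M))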
Both FEN and the shallow network are trained end-to-end with the proposed fingerprint constraints ( Eq. (8.5)) and a cross-entropy loss for genuine vs.fake classification. Note that the fingerprint constraints ( Eq. (8.5)) are not applied to the genuine input face images. Image attribution We aim to learn a mapping from a given image to the model that generated it if it is fake or classified as genuine otherwise. All models are known during training. We solve image attribution as a closed-set classification problem. Similar to deepfake detection, we add a shallow network on the generated fingerprint for model classification with the cross-entropy loss. The shallow network consists of two convolutional layers and two fully connected layers. 8.4 Experiments 8.4.1 Settings Dataset As described in Sec. 8.3.1, we have collected a fake image dataset consisting of 116𝐾 images from 116 GMs (1𝐾 images per model) for model parsing experiments. These models can be split into two parts: 47 face models and 69 non-face models. Instead of performing one split of training and testing sets, we carefully construct four different splits with a focus on curating well-represented test sets. Specifically, each testing set includes six GANs, two VAEs, two ARs, one AA and one NF model. We perform cross-validation to train on 104 models and evaluate on 138 the remaining 12 models in testing sets. The performance is averaged across four testing sets. For deepfake detection experiments, we conduct experiments on the recently released Celeb-DF dataset [184], consisting of 590 real and 5, 639 fake videos. For image attribution experiments, a source database with genuine images needs to be selected, from which the fake images can be generated by various GAN models. We select two source datasets: CelebA [184] and LSUN [364], for two experiments. From each source dataset, we construct a training set of 100𝐾 genuine and 100𝐾 fake face images produced by each of the same four GAN models used in Yu et al. [365], and a testing set with 10𝐾 genuine and 10𝐾 fake images per model. Implementation details Our framework is trained end-to-end with the loss functions of Eq. (8.5) and Eq. (8.17). The loss weights are set to make the magnitudes of all loss terms comparable: 𝜆1 = 0.05, 𝜆2 = 0.001, 𝜆3 = 0.1, 𝜆4 = 1, 𝛾1 = 5, 𝛾2 = 5, 𝛾3 = 5, 𝛾4 = 5, 𝛾5 = 5, 𝛾6 = 5, 𝛾7 = 1, 𝛾8 = 1. The value of 𝑓 for spectrum loss and repetitive loss in the fingerprint estimation is set to 50. For each of the four test sets, we calculate the number of clusters k using the elbow method. We divide the data into different GM types and perform k-means clustering separately for each type. According to the sets defined in the supplementary, we obtain the value of k as 11, 11, 15, and 13. We use Adam optimizer with a learning rate of 0.0001. Our framework is trained with a batch size of 32 for 10 epochs. All the experiments are conducted using NVIDIA Tesla K80 GPUs. Evaluation metrics For continuous type parameters, we report the 𝐿1 error for the regression estimation of continuous type parameters. We also report the p-value of t-test, correlation coefficient, coefficient of determination [288] and slope of the RANSAC regression line [93] to show the effectiveness of regression in our approach. For discrete type parameters, as there is imbalance in the dataset for different parameters, we compute the F1 score [94, 147] for classification performance. We also report classification accuracy for discrete-type parameters. 
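Assuming standard scipy and scikit-learn utilities, the evaluation metrics listed above could be computed roughly as follows; the macro averaging for F1 and the construction of the paired t-test are assumptions about one reasonable realization, not the exact evaluation code, and the data here are synthetic.

import numpy as np
from scipy import stats
from sklearn.linear_model import RANSACRegressor
from sklearn.metrics import f1_score, r2_score

def continuous_metrics(y_true, y_pred):
    # L1 error, Pearson correlation, coefficient of determination, RANSAC slope.
    l1 = np.mean(np.abs(y_true - y_pred))
    corr, _ = stats.pearsonr(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    ransac = RANSACRegressor(random_state=0).fit(y_pred.reshape(-1, 1), y_true)
    slope = float(ransac.estimator_.coef_[0])
    return l1, corr, r2, slope

def discrete_metrics(y_true, y_pred):
    # F1 score (macro averaging, one way to account for imbalance) and accuracy.
    return f1_score(y_true, y_pred, average="macro"), np.mean(y_true == y_pred)

# Toy usage with synthetic predictions.
rng = np.random.default_rng(0)
y_true = rng.random(200)
y_pred = y_true + 0.1 * rng.standard_normal(200)
print(continuous_metrics(y_true, y_pred))
print(discrete_metrics(rng.integers(0, 2, 200), rng.integers(0, 2, 200)))

# t-test on sample-wise L1-error differences against a baseline (zero-mean null).
diff = np.abs(y_true - y_pred) - np.abs(y_true - rng.random(200))
print(stats.ttest_1samp(diff, 0.0).pvalue)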
For all cross- validation experiments, we report the averaged results across all images and all GMs. 8.4.2 Model parsing results As we are the first to attempt GM parsing, there are no prior works for comparison. To provide a baseline, we, therefore, draw an analogy with the image attribution task, where each model is 139 represented as a one-hot vector and different models have equal inter-model distances in the high- dimensional space defined by these one-hot vectors. In model parsing, we represent each model as a 25-D vector consisting of network architectures (15-D) and training loss functions (10-D). Thus, these models are not of equal distance in the 25-D space. Based on the aforementioned observation, we define a baseline, referred to here as random ground-truth. Specifically, for each parameter, we shuffle the values/classes across all 116 GMs to ensure that the assigned ground-truth is different from the actual ground-truth but also preserves the actual distribution of each parameter, which means that the random ground-truth baseline is not based on random chance. These random ground-truth vectors have the same properties as our ground-truth vectors in terms of non-equal distances. But the shuffled ground truths are meaningless and are not corresponding to their true model hyperparameters. We train and test our proposed approach on this randomly shuffled ground-truth. Due to the random nature of this baseline, we perform three random shuffling and then report the average performance. We also evaluate a baseline of always predicting the mean for continuous hyperparameters, and always predicting the mode for discrete hyperparameters across the four sets. These mean/mode values of the hyperparameters are both measures of central tendency to represent the data, and they might result in a good enough performance for model parsing. To validate the effects of our proposed fingerprint estimation constraints, we conduct an ablation study and train our framework end-to-end with only the model parsing objective in Eq. (8.17). This results in the no fingerprint baseline. Finally, to show the importance of our clustering and deviation parser, we estimate the network architecture and loss functions using just one parser, which estimates the parameters directly instead of a mean and deviation. We refer to this as using one parser baseline. Network architecture prediction We report the results of network architecture prediction in Tab. 8.4 for the 4 testing sets, as defined in Sec. 8.4.1. Our method achieves a much lower 𝐿1 error compared to the random ground-truth baseline for continuous type parameters and higher classification accuracy and F1 score for discrete type parameters. This result indicates that there is indeed a 140 Figure 8.6 𝐿1 error and F1 score for continuous and discrete parameters respectively of network architecture averaged across all images of all models in the 4 test sets. Not only we have better average performance, but also our standard deviations are smaller. much stronger and generalized correlation between generated images and the embedding space of meaningful architecture hyper-parameters and loss function types, compared to a random vector of the same length and distribution. This correlation is the foundation of why model parsing of GMs can be a valid and feasible task. Our approach also outperforms the mean/mode baseline, proving that always predicting the mean of the data for continuous parameters is not good enough. 
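The random ground-truth baseline described above amounts to independently permuting each hyperparameter across the GMs, which preserves every parameter's marginal distribution while breaking its true correspondence; a minimal sketch follows (array shapes and the shuffling helper are illustrative).

import numpy as np

def random_ground_truth(y_nl: np.ndarray, seed: int = 0) -> np.ndarray:
    # Shuffle each hyperparameter column independently across the N GMs,
    # preserving its marginal distribution but breaking the true correspondence.
    rng = np.random.default_rng(seed)
    shuffled = y_nl.copy()
    for j in range(shuffled.shape[1]):
        shuffled[:, j] = rng.permutation(shuffled[:, j])
    return shuffled

# Toy usage: 116 GMs x 25 hyperparameters; three shuffles are averaged in the text.
y_nl = np.random.default_rng(1).random((116, 25))
baselines = [random_ground_truth(y_nl, seed=s) for s in range(3)]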
Removing the fingerprint estimation objectives leads to worse results, showing the importance of fingerprint estimation in model parsing. We demonstrate the effectiveness of estimating a mean and a deviation by evaluating the performance of using just one parser; our method clearly outperforms this one-parser approach.

Table 8.4 Performance of network architecture prediction. We use 𝐿1 error, p-value, correlation coefficient, coefficient of determination and slope of the RANSAC regression line for continuous type parameters. For discrete parameters, we use the F1 score and classification accuracy. We also show the standard deviation over all the test samples for 𝐿1 error: the first value is the standard deviation across sets, while the second is across samples. The p-value is estimated for every ours-baseline pair. Our method performs better for both types of variables compared to all baselines. [KEYS: corr.: correlation, coef.: coefficient, det.: determination]

| Method | 𝐿1 error ↓ (continuous) | P-value ↓ | Slope ↑ | Corr. coef. ↑ | Coef. of det. ↑ | F1 score ↑ (discrete) | Accuracy ↑ (discrete) |
|---|---|---|---|---|---|---|---|
| Random ground-truth | 0.184 ± 0.019/0.036 | 0.006 ± 0.001 | 0.592 ± 0.041 | 0.315 ± 0.095 | 0.261 ± 0.181 | 0.529 ± 0.078 | 0.575 ± 0.097 |
| Mean/mode | 0.164 ± 0.011/0.016 | 0.035 ± 0.005 | 0.632 ± 0.024 | 0.467 ± 0.015 | 0.326 ± 0.112 | 0.612 ± 0.048 | 0.604 ± 0.046 |
| No fingerprint | 0.170 ± 0.035/0.012 | 0.017 ± 0.004 | 0.605 ± 0.152 | 0.738 ± 0.014 | 0.892 ± 0.021 | 0.700 ± 0.032 | 0.663 ± 0.104 |
| Using one parser | 0.161 ± 0.028/0.035 | 0.032 ± 0.002 | 0.512 ± 0.116 | −0.529 ± 0.075 | 0.226 ± 0.030 | 0.607 ± 0.034 | 0.593 ± 0.104 |
| Ours | 0.149 ± 0.019/0.014 | – | 0.921 ± 0.021 | 0.612 ± 0.161 | 0.744 ± 0.098 | 0.718 ± 0.036 | 0.706 ± 0.040 |

Figure 8.7 F1 score for each loss function type at coarse and fine levels, averaged across all images of all models in the 4 test sets. We also show the standard deviation of performance across different sets.

Table 8.5 F1 score and classification accuracy for loss type prediction. Our method performs better than all baselines.

| Method | F1 score ↑ | Classification accuracy ↑ |
|---|---|---|
| Random ground-truth | 0.636 ± 0.017 | 0.716 ± 0.028 |
| Mean/mode | 0.751 ± 0.027 | 0.736 ± 0.056 |
| No fingerprint | 0.800 ± 0.116 | 0.763 ± 0.079 |
| Using one parser | 0.687 ± 0.036 | 0.633 ± 0.052 |
| Ours | 0.813 ± 0.019 | 0.792 ± 0.021 |

Figure 8.6 𝐿1 error and F1 score for the continuous and discrete parameters, respectively, of the network architecture, averaged across all images of all models in the 4 test sets. Not only do we have better average performance, but our standard deviations are also smaller.

Fig. 8.6 shows the detailed 𝐿1 error and F1 score for all network architecture parameters. We observe that our method performs substantially better than the random ground-truth baseline for almost all parameters. As for the no fingerprint and using one parser baselines, our method is still better in most cases, with a few parameters showing similar results. We also show the standard deviation of every estimated parameter for all the methods.

Figure 8.8 Performance of all GMs in our 4 testing sets. Similar performance trends are observed for the network architecture and loss functions, i.e., if the 𝐿1 error is small for continuous type parameters in the network architecture, a high F1 score is also observed for discrete type parameters in the network architecture and loss function. In other words, the abilities to reverse engineer the network architecture and loss function types of one GM are reasonably consistent.

Table 8.6 Performance comparison by varying the training and testing data for face and non-face GMs. Testing performance on non-face GMs is better than on face GMs, and training and testing on the same content produces better results than on different contents. We also show the standard deviation over all the test samples for 𝐿1 error: the first value is the standard deviation across sets, while the second is across samples.
| Test GMs (# GMs) | Train GMs (# GMs) | Network arch. 𝐿1 error ↓ (continuous) | Network arch. F1 score ↑ (discrete) | Loss function F1 score ↑ |
|---|---|---|---|---|
| Face (6) | Face (41) | 0.139 ± 0.042/0.015 | 0.729 ± 0.106 | 0.788 ± 0.146 |
| Face (6) | Non-face (69) | 0.213 ± 0.066/0.136 | 0.688 ± 0.125 | 0.759 ± 0.100 |
| Face (6) | Full (110) | 0.118 ± 0.046/0.040 | 0.712 ± 0.129 | 0.833 ± 0.136 |
| Non-face (6) | Non-face (63) | 0.118 ± 0.021/0.049 | 0.794 ± 0.110 | 0.864 ± 0.094 |
| Non-face (6) | Face (47) | 0.125 ± 0.031/0.028 | 0.667 ± 0.099 | 0.858 ± 0.115 |
| Non-face (6) | Full (110) | 0.082 ± 0.045/0.049 | 0.832 ± 0.046 | 0.886 ± 0.061 |
| Random guess | – | 0.393 | 0.500 | 0.500 |

Our proposed approach in general has smaller standard deviations than the baselines. For continuous type parameters, we further show the effectiveness of the regression prediction by evaluating three metrics, namely the correlation coefficient, the coefficient of determination and the slope of the RANSAC regression line, all evaluated between the predictions and the ground truth. Further, we also estimate the p-value of a t-test whose null hypothesis is that the sequence of sample-wise 𝐿1 error differences between our method and a baseline method is sampled from a zero-mean Gaussian. This p-value is estimated for every ours-baseline pair, and we report the mean and the standard deviation across all four sets.

Figure 8.9 Confusion matrices for the estimation of four parameters in the network architecture and loss function. (a)-(d): standard cross-entropy; (e)-(h): weighted cross-entropy. The weighted cross-entropy handles imbalanced data much better than the standard cross-entropy, which usually predicts a single class.

The p-value of our approach compared to each of the baselines is less than 0.05, thereby rejecting the null hypothesis and showing that our improvement is statistically significant. For the other three metrics, values closer to 1 indicate effective regression. Our method achieves a slope of 0.921, a correlation coefficient of 0.744 and a coefficient of determination of 0.612, which shows the effectiveness of our regression. Further, our approach outperforms all the baselines on these three metrics.

Loss function prediction We calculate the F1 score and classification accuracy for the loss function parameters. The performance is shown in Tab. 8.5. For the random ground-truth baseline, the performance is close to a random guess, and our approach performs much better than all the baselines. Fig. 8.7 shows the detailed F1 score for all loss function parameters; our method works better than all the baselines for almost all parameters. We also show the standard deviation of every estimated parameter for all the methods, and a similar behaviour of the standard deviation across methods is observed as for the network architecture.

Figure 8.10 Estimated fingerprints (left) and corresponding frequency spectra (right) from one generated image of each of the 116 GMs. Many frequency spectra show distinct high-frequency signals, while some appear similar to each other.

Table 8.7 Ablation study of the 4 loss terms in fingerprint estimation. Removing any one loss for fingerprint estimation deteriorates the performance, with the worst results when removing all losses. We also show the standard deviation over all the test samples for 𝐿1 error: the first value is the standard deviation across sets, while the second is across samples. [KEYS: fing.: fingerprint]
| Loss removed | Network arch. 𝐿1 error ↓ (continuous) | Network arch. F1 score ↑ (discrete) | Loss function F1 score ↑ |
|---|---|---|---|
| Magnitude loss | 0.156 ± 0.007/0.009 | 0.674 ± 0.012 | 0.755 ± 0.046 |
| Spectrum loss | 0.149 ± 0.022/0.016 | 0.676 ± 0.034 | 0.786 ± 0.042 |
| Repetitive loss | 0.150 ± 0.018/0.026 | 0.708 ± 0.031 | 0.794 ± 0.031 |
| Energy loss | 0.162 ± 0.032/0.038 | 0.703 ± 0.045 | 0.785 ± 0.028 |
| All (no fing.) | 0.170 ± 0.035/0.037 | 0.700 ± 0.032 | 0.800 ± 0.016 |
| Nothing (ours) | 0.149 ± 0.019/0.014 | 0.718 ± 0.036 | 0.813 ± 0.019 |

Fig. 8.8 provides another perspective on model parsing by showing the performance of the 48 unique GMs across our 4 testing sets.

Practical usage of model parsing As our work is the first to propose the task of model parsing, it is worth asking what performance is required for model parsing to be practically useful in the real world. We argue that an error of less than 10% can be considered useful for practical applications. The rationale is the following. We consider two of the most similar generative models in our dataset, RSGAN_HALF and RSGAN_QUAR, and observe that they differ in only 2 out of 15 parameters. An error rate below 10% is therefore reasonable for practical purposes, as it is smaller than the difference between the two most similar generative models. Consequently, for the task of model parsing, we target an 𝐿1 error of less than 0.1 and an F1 score of over 90% for practical usage. Our proposed approach achieves an 𝐿1 error slightly above 0.1 (0.14) and an F1 score of around 80%, both within a reasonable margin of the above thresholds.

Figure 8.11 Cosine similarity matrix for pairs of the 116 GMs' fingerprints. Each element of this matrix is the average cosine similarity of 50 pairs of fingerprints from two GMs. We observe higher intra-GM and lower inter-GM similarities. GMs with a similar network architecture or loss function are also clustered together, as shown in the red boxes on the left.

8.4.3 Ablation study

Face vs. non-face GMs Our dataset consists of 47 GMs trained on face datasets and 69 GMs trained on non-face datasets; we denote these as face GMs and non-face GMs, respectively. All aforementioned experiments are conducted by training on 104 GMs and evaluating on 12 GMs.

Table 8.8 Network architecture estimation and loss function prediction when given multiple images of one GM. Performance increases when enlarging the number of images for evaluation from 1 to 10 and becomes stable beyond 10 images. We also show the standard deviation over all the test samples for 𝐿1 error: the first value is the standard deviation across sets, while the second is across samples.

| # images | Network arch. 𝐿1 error ↓ (continuous) | Network arch. F1 score ↑ (discrete) | Loss function F1 score ↑ |
|---|---|---|---|
| 1 | 0.215 ± 0.054/0.067 | 0.696 ± 0.089 | 0.798 ± 0.010 |
| 10 | 0.151 ± 0.033/0.039 | 0.726 ± 0.075 | 0.793 ± 0.070 |
| 100 | 0.145 ± 0.032/0.036 | 0.721 ± 0.073 | 0.789 ± 0.071 |
| 500 | 0.146 ± 0.033/0.031 | 0.720 ± 0.070 | 0.808 ± 0.007 |

Here we conduct an ablation study in which we train and evaluate on different types of GMs. We study the performance on face and non-face testing GMs when training on three different training sets: only face GMs, only non-face GMs, and all GMs. Note that all testing GMs are excluded from training each time.
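For concreteness, the snippet below is a small illustrative sketch, under assumed bookkeeping structures rather than the thesis code, of how the three training configurations just described could be assembled while always excluding the held-out test GMs; gm_info and test_gms are hypothetical names.

```python
# Sketch of building the face-only, non-face-only and full training sets
# while excluding the held-out test GMs of the current split.
def build_training_sets(gm_info, test_gms):
    # gm_info: dict mapping GM name -> content type ("face" or "non-face").
    # test_gms: the 6 face or 6 non-face GMs held out for testing.
    available = {gm: kind for gm, kind in gm_info.items() if gm not in test_gms}
    face_only = [gm for gm, kind in available.items() if kind == "face"]
    nonface_only = [gm for gm, kind in available.items() if kind == "non-face"]
    full = face_only + nonface_only   # e.g. 41 + 69 = 110 GMs for a face test set
    return {"face": face_only, "non-face": nonface_only, "full": full}
```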
We also add a baseline in which both the regression and the classification make a random guess. The results are shown in Tab. 8.6, from which we make three observations. First, model parsing for non-face GMs is easier than for face GMs. This might be partially due to the generally lower quality of images generated by non-face GMs compared to those by face GMs, so that more traces remain for model parsing. Second, training and testing on the same content generates better results than training and testing on different contents. Third, training on the full dataset improves the estimation of some parameters but may hurt others slightly.

Weighted cross-entropy loss As mentioned before, the ground-truth values of many network hyperparameters have biased distributions. For example, the “normalization type” parameter in Tab. 8.2 is unevenly distributed among its 4 possible types. With such a biased distribution, our classifier might make a constant prediction of the most frequent type in the ground truth, as this can minimize the loss, especially when the bias is severe. Such a degenerate classifier clearly has no value for model parsing. To address this issue, we use a weighted cross-entropy loss with a different loss weight for each class; these weights are calculated from the ground-truth distribution of every parameter in the full dataset. To validate whether this remedies the issue, we compare it with the standard cross-entropy loss. Fig. 8.9 shows the confusion matrices for discrete type parameters in the network architecture prediction and for coarse/fine level parameters in the loss function prediction, where rows correspond to predicted classes and columns to ground-truth classes. We clearly see that the classifier is mostly biased towards the more frequent classes in all 4 examples when the standard cross-entropy loss is used. This problem is remedied by the weighted cross-entropy loss, with which the classifiers make meaningful predictions.

Fingerprint losses We proposed four loss terms in Sec. 8.3.2 to guide the training of the fingerprint estimation: the magnitude loss, spectrum loss, repetitive loss and energy loss. We conduct an ablation study to demonstrate the importance of these four losses in our proposed method. This includes four experiments, each removing one of the loss terms, and we compare the performance with our proposed method (remove nothing) and the no fingerprint baseline (remove all). As shown in Tab. 8.7, removing any loss for fingerprint estimation hurts the performance, and our “no fingerprint” baseline, for which we remove all losses, performs worst of all. Therefore, each loss clearly has a positive effect on the fingerprint estimation and model parsing.

Model parsing with multiple images We evaluate model parsing when varying the number of test images. For each GM, we randomly select 1, 10, 100 and 500 images per GM from the different face GM sets for evaluation. With multiple images per GM, we average the predictions for continuous type parameters and take a majority vote for discrete type and loss function parameters. We compute the 𝐿1 error and F1 score for the continuous and discrete type parameters, respectively, and average the results across different sets. We repeat the above experiment multiple times, each time randomly selecting the images. Tab. 8.8 shows noticeable gains with 10 images and minor gains with 100 images; there is not much difference between evaluating on 100 or 500 images, which suggests that our framework produces consistent results when tested on different numbers of images generated by the same GM.
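The aggregation rule described above can be written compactly; the following is a minimal sketch under an assumed prediction format (the array shapes and the function name aggregate_predictions are illustrative, not taken from the thesis code).

```python
# Sketch of multi-image aggregation: average continuous predictions and take a
# majority vote over discrete predictions for images from the same GM.
import numpy as np

def aggregate_predictions(cont_preds, disc_preds):
    # cont_preds: (N, C) array of continuous hyperparameter predictions for N images.
    # disc_preds: (N, D) integer array of predicted classes for D discrete parameters.
    cont = np.mean(cont_preds, axis=0)
    disc = []
    for d in range(disc_preds.shape[1]):
        vals, counts = np.unique(disc_preds[:, d], return_counts=True)
        disc.append(vals[np.argmax(counts)])   # majority vote per parameter
    return cont, np.array(disc)
```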
Content-independent fingerprint Ideally, our estimated fingerprint should be independent of the content of the image; that is, the fingerprint should only capture the trace left by the GM without indicating the content in any way. To validate this, we partition all GMs into four classes based on their content: FACES (47 GMs), MNIST (25), CIFAR10 (31) and OTHER (13). Each class contains the images generated by the GMs belonging to it. We feed these images to a pre-trained FEN, obtain their fingerprints, and then train a shallow network consisting of five convolutional layers and two fully connected layers for 4-way content classification. However, we observe that this training does not converge, which means that the fingerprint estimated by FEN does not carry content-specific properties usable for content classification. As a result, the model parsing of the hyperparameters does not leverage content information across different GMs, which is a desirable property.

Table 8.9 Evaluation on diffusion models. We also show the standard deviation over all the test samples for 𝐿1 error: the first value is the standard deviation across sets, while the second is across samples.

| Method | Network arch. 𝐿1 error ↓ (continuous) | Network arch. F1 score ↑ (discrete) | Loss function F1 score ↑ |
|---|---|---|---|
| Random ground-truth | 0.240 ± 0.065/0.069 | 0.664 ± 0.105 | 0.619 ± 0.083 |
| No fingerprint | 0.211 ± 0.080/0.078 | 0.764 ± 0.112 | 0.711 ± 0.085 |
| Using one parser | 0.201 ± 0.045/0.041 | 0.564 ± 0.101 | 0.654 ± 0.054 |
| Ours | 0.189 ± 0.051/0.049 | 0.787 ± 0.099 | 0.724 ± 0.076 |

Evaluation on diffusion models Due to the recent advancement of diffusion models for fake media generation, we also evaluate our approach on these generative models. Specifically, we collect 7 diffusion models with 1K images each and create 4 different test splits, each containing 3 randomly selected diffusion models. The remaining diffusion models, along with the full dataset, are used for training. The results of our approach and all the baselines are shown in Tab. 8.9. Our method clearly outperforms all the baselines, indicating the effectiveness of our approach for unseen models proposed in the future.

Table 8.10 Binary classification performance for the coordinated misinformation attack.

| Method | AUC (%) | Classification accuracy (%) |
|---|---|---|
| FEN | 83.5 | 76.85 |
| FEN + PN | 87.3 | 80.6 |

8.4.4 Visualization

Fig. 8.10 shows an estimated fingerprint image and its frequency spectrum averaged over 25 randomly selected images per GM. We observe that the estimated fingerprints have the desired properties defined by our loss terms, including low magnitude and highlights in the middle and high frequencies. We also find that the fingerprints estimated from different generated images of the same GM are similar. To quantify this, we compute a cosine similarity matrix C ∈ R^{116×116}, where C(𝑖, 𝑗) is the averaged cosine similarity of 25 randomly sampled fingerprint pairs from GMs 𝑖 and 𝑗. The matrix C in Fig. 8.11 clearly illustrates the higher intra-GM and lower inter-GM fingerprint similarities.
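A minimal sketch of computing such a pairwise similarity matrix is given below; it assumes the fingerprints are available as flattened vectors grouped per GM, and the names fingerprints and similarity_matrix are illustrative rather than the thesis code.

```python
# Sketch of the GM fingerprint similarity matrix: C[i, j] averages the cosine
# similarity of n_pairs randomly sampled fingerprint pairs from GM i and GM j.
import numpy as np

def similarity_matrix(fingerprints, n_pairs=25, seed=0):
    # fingerprints: list of (num_images, D) arrays, one per GM, each row a flattened fingerprint.
    rng = np.random.default_rng(seed)
    G = len(fingerprints)
    C = np.zeros((G, G))
    for i in range(G):
        for j in range(G):
            sims = []
            for _ in range(n_pairs):
                a = fingerprints[i][rng.integers(len(fingerprints[i]))]
                b = fingerprints[j][rng.integers(len(fingerprints[j]))]
                sims.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
            C[i, j] = np.mean(sims)
    return C
```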
8.4.5 Applications

Coordinated misinformation attack Our model parsing framework can be leveraged to estimate whether a coordinated misinformation attack exists, i.e., given two fake images, to classify whether or not they were generated by the same GM. We do so by computing the cosine similarity between the hyperparameters parsed from the two images. We train our framework on 101 GMs and test on 15 seen and 15 unseen GMs; the list of GMs is given in the supplementary. To evaluate this task, we report the Area Under the Curve (AUC) and the classification accuracy at the optimal threshold. Tab. 8.10 compares two methods: using only the FEN network, and using both FEN and PN. Our framework using FEN and PN identifies whether two images came from the same source with around 80% accuracy, whereas using only the FEN network to compare fingerprint similarities performs worse. This justifies the benefit of using the parsed parameters for detecting coordinated misinformation attacks.

Table 8.11 AUC for deepfake detection on the Celeb-DF dataset [184].

Methods trained with pixel-level supervision:
| Method | Training data | AUC (%) |
|---|---|---|
| Xception+Reg [66] | DFFD | 64.4 |
| Xception+Reg [66] | DFFD, UADFV | 71.2 |

Methods trained with image-level supervision:
| Method | Training data | AUC (%) |
|---|---|---|
| Two-stream [118] | – | 53.8 |
| Meso4 [2] | – | 54.8 |
| VA-LogReg [212] | – | 55.1 |
| DSP-FWA [126] | – | 64.6 |
| Multi-task [226] | – | 54.3 |
| Capsule [227] | – | 57.5 |
| Xception-c40 [266] | – | 65.5 |
| Two-branch [211] | – | 73.4 |
| SPSL [192] | FF++ | 76.8 |
| SPSL [192] (reproduced) | FF++ | 73.2 |
| Ours (fingerprint) | FF++ | 69.6 |
| Ours (image+fingerprint) | FF++ | 71.1 |
| Ours (image+fingerprint+phase) | FF++ | 74.6 |
| Ours (model parsing) | FF++ | 64.3 |
| HeadPose [350] | UADFV | 54.6 |
| FWA [183] | UADFV | 56.9 |
| Xception [66] | UADFV | 52.2 |
| Xception+Reg [66] | UADFV | 57.1 |
| Ours | UADFV | 64.7 |
| Xception [66] | DFFD | 63.9 |
| Ours | DFFD | 65.3 |
| Xception [66] | DFFD, UADFV | 67.6 |
| Ours | DFFD, UADFV | 70.2 |

Table 8.12 Classification rates of image attribution. The baseline results are cited from [365].

| Method | CelebA | LSUN |
|---|---|---|
| kNN | 28.00 | 36.30 |
| Eigenface [284] | 53.28 | – |
| PRNU [209] | 86.61 | 67.84 |
| Yu et al. [365] | 99.43 | 98.58 |
| Ours | 99.84 | 99.66 |

In fact, due to the nature of our test set, each pair of test samples can come from five different categories: 1. the same seen GM, 2. the same unseen GM, 3. different seen GMs, 4. different unseen GMs, and 5. one seen and one unseen GM. Fig. 8.12 analyzes the wrongly classified samples with respect to the total number of samples and the number of samples in each category. Around 70% of the wrongly classified samples belong to categories with at least one GM unseen during training, which is expected. However, if one of the test GMs was seen in training, the number of wrongly classified samples decreases. This can be advantageous for detecting a manipulated image from an unknown GM.

Figure 8.12 Percentage of wrongly classified samples for the five categories of test sample pairs. A larger number of sample pairs is wrongly classified when both images come from the same unseen GM.
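As a concrete illustration of the evaluation protocol for this same-GM test, the following is a minimal sketch under stated assumptions, not the thesis implementation: scikit-learn is used for the ROC computation, Youden's J statistic stands in for the unspecified "optimum threshold" rule, and the array names are hypothetical.

```python
# Sketch: score an image pair by the cosine similarity of their parsed
# hyperparameter vectors, then report AUC and accuracy at the best ROC threshold.
import numpy as np
from sklearn.metrics import roc_curve, auc

def same_source_scores(parsed_a, parsed_b):
    # parsed_a, parsed_b: (N, 25) arrays of parsed hyperparameters for N image pairs.
    num = np.sum(parsed_a * parsed_b, axis=1)
    den = np.linalg.norm(parsed_a, axis=1) * np.linalg.norm(parsed_b, axis=1) + 1e-8
    return num / den

def evaluate(scores, labels):
    # labels: 1 if the two images come from the same GM, else 0.
    labels = np.asarray(labels)
    fpr, tpr, thr = roc_curve(labels, scores)
    best = np.argmax(tpr - fpr)                  # Youden's J as the "optimal" threshold
    acc = np.mean((scores >= thr[best]).astype(int) == labels)
    return auc(fpr, tpr), acc
```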
Deepfake detection Our FEN can be adopted for deepfake detection by adding a shallow network for binary classification. We evaluate our method on the recently introduced Celeb-DF dataset [184]. We experiment with three training sets, UADFV, DFFD and FF++, in order to compare with previous results, following the training protocols of [66] for UADFV and DFFD and of [192] for FF++. We report the AUC in Tab. 8.11. Compared with methods trained on UADFV, our approach achieves a significantly better result, despite the more advanced backbones used by others. Our results when trained on DFFD and UADFV fall only slightly behind the best performance reported by Xception+Reg [66]. Importantly, however, that method is trained with pixel-level supervision, which is typically unavailable; those results are provided for completeness and are not directly comparable to the other methods trained with only image-level supervision for binary classification. Compared to all other methods, our method achieves the highest deepfake detection AUC. Finally, we compare the performance of our method when trained on the FF++ dataset. [192] performs best by using the phase information as an additional channel to the Xception classifier. However, as the pre-trained models of [192] were not released, we reproduce their method and report the performance in Tab. 8.11. We observe a performance gap between the reproduced and reported performance, which should be investigated further in the future. Following [192], we concatenate the fingerprint information with the RGB image and phase channels, which are passed through an Xception classifier. Our method outperforms the reproduced performance of [192], showing the additional benefit of our fingerprint. Finally, we also perform the classification based on the pre-trained model parsing network, fine-tuning it with the classification loss. The performance deteriorates compared to using the fingerprint, which shows that although the model parsing network has some deepfake detection ability, it is less informative for performing deepfake detection well.

Image attribution Similar to deepfake detection, we use a shallow network for image attribution. The only difference is that image attribution is a multi-class task whose number of classes depends on the number of GMs seen during training. Following [365], we train our model on 100K genuine and 100K fake face images from each of four GMs, SNGAN [218], MMDGAN [179], CRAMERGAN [16] and ProGAN [156], for five-class classification. Tab. 8.12 reports the performance. Our results on CelebA [184] and LSUN [364] outperform those of [365]. This again validates the generalization ability of the proposed fingerprint estimation.

8.5 Conclusion

In this chapter, we define the model parsing problem as inferring the network architecture and training loss functions of a GM from its generated images, and we make the first attempt to tackle this challenging problem. The main idea is to estimate a fingerprint for each image and use it for model parsing. Four constraints are developed for the fingerprint estimation, and we propose hierarchical learning to parse the hyperparameters at coarse and fine levels, which leverages the similarities between different GMs. Our fingerprint estimation framework can not only perform model parsing, but also extends to detecting coordinated misinformation attacks, deepfake detection and image attribution. We have collected a large-scale fake image dataset from 116 different GMs, and various experiments have validated the effects of the different components of our approach.

BIBLIOGRAPHY

[1] Yossi Adi, Carsten Baum, Moustapha Cisse, Benny Pinkas, and Joseph Keshet. Turning your weakness into a strength: Watermarking deep neural networks by backdooring. In USENIX-S, 2018.

[2] Darius Afchar, Vincent Nozick, Junichi Yamagishi, and Isao Echizen. MesoNet: a compact facial video forgery detection network. In WIFS, 2018.
[3] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. ViViT: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6836–6846, 2021.

[4] Vishal Asnani, John Collomosse, Tu Bui, Xiaoming Liu, and Shruti Agarwal. ProMark: Proactive diffusion watermarking for causal attribution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10802–10811, 2024.

[5] Vishal Asnani, Abhinav Kumar, Suya You, and Xiaoming Liu. PrObeD: Proactive object detection wrapper. In NeurIPS, 2023.

[6] Vishal Asnani, Xi Yin, Tal Hassner, Sijia Liu, and Xiaoming Liu. Proactive image manipulation detection. In CVPR, 2022.

[7] Vishal Asnani, Xi Yin, Tal Hassner, and Xiaoming Liu. MaLP: Manipulation localization using a proactive scheme. In CVPR, 2023.

[8] Vishal Asnani, Xi Yin, Tal Hassner, and Xiaoming Liu. Reverse engineering of generative models: Inferring model hyperparameters from generated images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(12):15477–15493, 2023.

[9] Vishal Asnani, Xi Yin, and Xiaoming Liu. Proactive schemes: A survey of adversarial attacks for social good. arXiv preprint arXiv:2409.16491, 2024.

[10] Yousef Atoum, Joseph Roth, Michael Bliss, Wende Zhang, and Xiaoming Liu. Monocular video-based trailer coupler detection using multiplexer convolutional neural network. In ICCV, 2017.

[11] Kar Balan, Shruti Agarwal, Simon Jenni, Andy Parsons, Andrew Gilbert, and John Collomosse. EKILA: Synthetic media provenance and attribution for generative art. In CVPR, 2023.

[12] Nicolas Ballas, Li Yao, Chris Pal, and Aaron Courville. Delving deeper into convolutional networks for learning video representations. arXiv preprint arXiv:1511.06432, 2015.

[13] Shumeet Baluja. Hiding images in plain sight: Deep steganography. 2017.

[14] Abdullah Bamatraf, Rosziati Ibrahim, and Mohd Najib B Mohd Salleh. Digital watermarking algorithm using LSB. In ICCAIE, 2010.

[15] Lejla Batina, Shivam Bhasin, Dirmanto Jap, and Stjepan Picek. CSI NN: Reverse engineering of neural network architectures through electromagnetic side channel. In USENIXSS, 2019.

[16] Marc G. Bellemare, Ivo Danihelka, Will Dabney, Shakir Mohamed, Balaji Lakshminarayanan, Stephan Hoyer, and Rémi Munos. The Cramer distance as a solution to biased Wasserstein gradients. arXiv preprint arXiv:1705.10743, 2017.

[17] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? In ICML, volume 2, page 4, 2021.

[18] Purnima Bholowalia and Arvind Kumar. EBK-means: A clustering technique based on elbow method and k-means in WSN. International Journal of Computer Applications, 105(9), 2014.

[19] Alex Black, Tu Bui, Hailin Jin, Vishy Swaminathan, and John Collomosse. Deep image comparator: Learning to visualize editorial change. In CVPR WMF, 2021.

[20] Andreas Blattmann, Robin Rombach, Kaan Oktay, Jonas Müller, and Björn Ommer. Retrieval-augmented diffusion models. NeurIPS, 2022.

[21] Aaron F. Bobick and James W. Davis. The recognition of human movement using temporal templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(3):257–267, 2001.

[22] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. YOLOv4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020.

[23] Daniel Bolya, Sean Foley, James Hays, and Judy Hoffman. TIDE: A general toolbox for identifying object detection errors. In ECCV, 2020.

[24] Oliver Bown.
AI doesn’t like to credit its sources. for artists, that’s a problem. Tatlor, 2024. [25] Garrick Brazil and Xiaoming Liu. Pedestrian detection with autoregressive network phases. In CVPR, 2019. [26] Jonathan Brokman, Omer Hofman, Roman Vainshtein, Amit Giloni, Toshiya Shimizu, Inderjeet Singh, Oren Rachmil, Alon Zolfi, Asaf Shabtai, Yuki Unno, and Hisashi Kojima. In Proc. Montrage: Monitoring training for attribution of generative diffusion models. ECCV, 2024. [27] Tu Bui, Shruti Agarwal, Ning Yu, and John Collomosse. RoSteALS: Robust steganography using autoencoder latent space. In CVPR, 2023. 155 [28] Tu Bui, Ning Yu, and John Collomosse. RepMix: Representation mixing for robust attribution of synthesized images. In ECCV, 2022. [29] Christopher P. Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in 𝛽-VAE. In NeurIPS, 2017. [30] Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Coco-stuff: Thing and stuff classes in context. In CVPR, 2018. [31] Xirong Cao, Xiang Li, Divyesh Jadav, Yanzhao Wu, Zhehui Chen, Chen Zeng, and Wenqi Wei. Invisible watermarking for audio generation diffusion models. In TPS-ISA, 2023. [32] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020. [33] Nicholas Carlini, Chang Liu, Úlfar Erlingsson, Jernej Kos, and Dawn Song. The secret sharer: Evaluating and testing unintended memorization in neural networks. In USENIX, 2019. [34] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In SSP, 2017. [35] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017. [36] Lucy Chai, David Bau, Ser-Nam Lim, and Phillip Isola. What makes fake images detectable? understanding properties that generalize. In ECCV, 2020. [37] Chang Chen, Zhiwei Xiong, Xiaoming Liu, and Feng Wu. Camera trace erasing. In CVPR, 2020. [38] Geng Chen, Si-Jie Liu, Yu-Jia Sun, Ge-Peng Ji, Ya-Feng Wu, and Tao Zhou. Camouflaged object detection via context-aware cross-level fusion. IEEE Transactions on Circuits and Systems for Video Technology, 32(10), 2022. [39] Huili Chen, Bita Darvish Rouhani, Cheng Fu, Jishen Zhao, and Farinaz Koushanfar. Deepmarks: A secure fingerprinting framework for digital rights management of deep learning models. In ICMR, 2019. [40] Liang Chen, Yong Zhang, Yibing Song, Lingqiao Liu, and Jue Wang. Self-supervised learning of adversarial example: Towards good generalizations for deepfake detection. In CVPR, 2022. 156 [41] Mo Chen, Jessica Fridrich, Miroslav Goljan, and Jan Lukás. Determining image origin and IEEE Transactions on Information Forensics and Security, integrity using sensor noise. 3(1):74–90, 2008. [42] Ricky T. Q. Chen, Xuechen Li, Roger Grosse, and David Duvenaud. Isolating sources of disentanglement in variational autoencoders. In NeurIPS, 2018. [43] Shoufa Chen, Peize Sun, Yibing Song, and Ping Luo. DiffusionDet: Diffusion model for object detection. In CVPR, 2023. [44] Shoufa Chen, Peize Sun, Enze Xie, Chongjian Ge, Jiannan Wu, Lan Ma, Jiajun Shen, and Ping Luo. Watch only once: An end-to-end video action detection framework. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8178–8187, 2021. [45] Wei Chen, Zichen Miao, and Qiang Qiu. 
Parameter-efficient tuning of large convolutional models. arXiv preprint arXiv:2403.00269, 2024. [46] Wei-Chen Chen, Xin-Yi Yu, and Lin-Lin Ou. Pedestrian attribute recognition in video surveillance scenarios based on view-attribute attention localization. Machine Intelligence Research, 2022. [47] Zejia Chen, Fabing Duan, Francois Chapeau-Blondeau, and Derek Abbott. Training threshold neural networks by extreme learning machine and adaptive stochastic resonance. Physics Letters A, 2022. [48] Xuelian Cheng, Huan Xiong, Deng-Ping Fan, Yiran Zhong, Mehrtash Harandi, Tom Implicit motion handling for video camouflaged object Drummond, and Zongyuan Ge. detection. In CVPR, 2022. [49] Wonwoong Cho, Sungha Choi, David Keetae Park, Inkyu Shin, and Jaegul Choo. Image- to-image translation via group-wise deep whitening-and-coloring transformation. In CVPR, 2019. [50] Jooyoung Choi, Sungwon Kim, Yonghyun Jeong, Youngjune Gwon, and Sungroh Yoon. ILVR: Conditioning method for denoising diffusion probabilistic models. In ICCV, 2021. [51] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, 2018. [52] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, 2018. [53] Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. StarGAN v2: Diverse image 157 synthesis for multiple domains. In CVPR, 2020. [54] François Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, 2017. [55] Pengyu Chu, Zhaojian Li, Kyle Lammers, Renfu Lu, and Xiaoming Liu. Deepapple: Deep learning-based apple detection using a suppression mask R-CNN. PRL, 2021. [56] Özgün Çiçek, Ahmed Abdulkadir, Soeren S Lienkamp, Thomas Brox, and Olaf Ronneberger. In Medical 3d u-net: learning dense volumetric segmentation from sparse annotation. Image Computing and Computer-Assisted Intervention–MICCAI 2016: 19th International Conference, Athens, Greece, October 17-21, 2016, Proceedings, Part II 19, pages 424–432. Springer, 2016. [57] John Collomosse and Andy Parsons. To Authenticity, and Beyond! Building safe and IEEE Computer Graphics and fair generative AI upon the three pillars of provenance. Applications, May 2024. [58] MMAction2 Contributors. Openmmlab’s next generation video understanding toolbox and benchmark, 2020. [59] Mickael Cormier, Yannik Schmid, and Jürgen Beyerer. Enhancing skeleton-based action In Proceedings recognition in real-world scenarios through realistic data augmentation. of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 290–299, 2024. [60] Davide Cozzolino, Justus Thies, Andreas Rössler, Christian Riess, Matthias Nießner, and Luisa Verdoliva. Forensictransfer: Weakly-supervised domain adaptation for forgery detection. arXiv preprint arXiv:1812.02510, 2018. [61] Davide Cozzolino and Luisa Verdoliva. Noiseprint: a CNN-based camera model fingerprint. IEEE Transactions on Information Forensics and Security, 15:144–159, 2019. [62] Yingqian Cui, Jie Ren, Han Xu, Pengfei He, Hui Liu, Lichao Sun, and Jiliang Tang. DiffusionShield: A watermark for copyright protection against generative diffusion models. arXiv preprint arXiv:2306.04642, 2023. [63] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-FCN: Object detection via region-based fully convolutional networks. NeurIPS, 2016. [64] Navneet Dalal and Bill Triggs. 
Histograms of oriented gradients for human detection. In CVPR, 2005. [65] Hao Dang, Feng Liu, Joel Stehouwer, Xiaoming Liu, and Anil K Jain. On the detection of digital face manipulation. In CVPR, 2020. 158 [66] Hao Dang, Feng Liu, Joel Stehouwer, Xiaoming Liu, and Anil K Jain. On the detection of digital face manipulation. In CVPR, 2020. [67] Bita Darvish Rouhani, Huili Chen, and Farinaz Koushanfar. Deepsigns: An end-to-end watermarking framework for ownership protection of deep neural networks. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 485–497, 2019. [68] Neha Dawar and Nasser Kehtarnavaz. Action detection and recognition in continuous action streams by deep learning-based sensing fusion. IEEE Sensors Journal, 18(23):9660–9668, 2018. [69] Debayan Deb, Xiaoming Liu, and Anil Jain. Unified detection of digital and physical face attacks. In arXiv preprint arXiv:2104.02156, 2021. [70] Debayan Deb, Xiaoming Liu, and Anil K Jain. Unified detection of digital and physical face attacks. In 2023 IEEE 17th International Conference on Automatic Face and Gesture Recognition (FG), pages 1–8. IEEE, 2023. [71] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. large-scale hierarchical image database. In CVPR, 2009. Imagenet: A [72] Li Deng. The MNIST database of handwritten digit images for machine learning research [best of the web]. Signal Processing Magazine, 29(6):141–142, 2012. [73] Mohammad Derakhshani, Saeed Masoudnia, Amir Shaker, Omid Mersa, Mohammad Sadeghi, Mohammad Rastegari, and Babak Araabi. Assisted excitation of activations: A learning technique to improve object detectors. In CVPR, 2019. [74] Prafulla Dhariwal and Alexander Quinn Nichol. Diffusion models beat GANs on image In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, synthesis. Advances in Neural Information Processing Systems, 2021. [75] Ning Ding, Yujia Qin, Guang Yang, Fuchao Wei, Zonghan Yang, Yusheng Su, Shengding Hu, Yulin Chen, Chi-Min Chan, Weize Chen, et al. Parameter-efficient fine-tuning of large-scale pre-trained language models. Nature Machine Intelligence, 5(3):220–235, 2023. [76] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2625–2634, 2015. [77] Jiahua Dong, Wenqi Liang, Hongliu Li, Duzhen Zhang, Meng Cao, Henghui Ding, Salman Khan, and Fahad Shahbaz Khan. How to continually adapt text-to-image diffusion models for flexible customization? arXiv preprint arXiv:2410.17594, 2024. 159 [78] Yinpeng Dong, Fangzhou Liao, Tianyu Pang, Hang Su, Jun Zhu, Xiaolin Hu, and Jianguo Li. Boosting adversarial attacks with momentum. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9185–9193, 2018. [79] Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. [80] Bruce Draper. Reverse engineering of deceptions (red). https://www.darpa.mil/ program/reverse-engineering-of-deceptions. [81] Deng-Ping Fan, Ge-Peng Ji, Ming-Ming Cheng, and Ling Shao. Concealed object detection. TPAMI, 2021. [82] Deng-Ping Fan, Ge-Peng Ji, Guolei Sun, Ming-Ming Cheng, Jianbing Shen, and Ling Shao. Camouflaged object detection. In CVPR, 2020. 
[83] Deng-Ping Fan, Ge-Peng Ji, Tao Zhou, Geng Chen, Huazhu Fu, Jianbing Shen, and Ling Shao. Pranet: Parallel reverse attention network for polyp segmentation. In MICCAI, 2020. [84] Gueter Josmy Faure, Min-Hung Chen, and Shang-Hong Lai. Holistic interaction transformer In Proceedings of the IEEE/CVF Winter Conference on network for action detection. Applications of Computer Vision, pages 3340–3350, 2023. [85] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019. [86] Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Convolutional two-stream In Proceedings of the IEEE conference on network fusion for video action recognition. computer vision and pattern recognition, pages 1933–1941, 2016. [87] Pedro Felzenszwalb, David McAllester, and Deva Ramanan. A discriminatively trained, multiscale, deformable part model. In CVPR, 2008. [88] Weitao Feng, Jiyan He, Jie Zhang, Tianwei Zhang, Wenbo Zhou, Weiming Zhang, and Nenghai Yu. Catch you everything everywhere: Guarding textual inversion via concept watermarking. arXiv preprint arXiv:2309.05940, 2023. [89] Weitao Feng, Wenbo Zhou, Jiyan He, Jie Zhang, Tianyi Wei, Guanlin Li, Tianwei Zhang, Weiming Zhang, and Nenghai Yu. Aqualora: Toward white-box protection for customized stable diffusion models via watermark lora. arXiv preprint arXiv:2405.11135, 2024. [90] Pierre Fernandez, Guillaume Couairon, Hervé Jégou, Matthijs Douze, and Teddy Furon. The stable signature: Rooting watermarks in latent diffusion models. In ICCV, 2023. 160 [91] Sanja Fidler, Roozbeh Mottaghi, Alan Yuille, and Raquel Urtasun. Bottom-up segmentation for top-down detection. In CVPR, 2013. [92] Tomás Filler, Jessica Fridrich, and Miroslav Goljan. Using sensor pattern noise for camera model identification. In ICIP, 2008. [93] Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981. [94] George Forman and Martin Scholz. Apples-to-apples in cross-validation studies: pitfalls in classifier performance measurement. Association for Computing Machinery SIGKDD Explorations Newsletter, 12(1):49–57, 2010. [95] Luca Gammaitoni, Peter Hänggi, Peter Jung, and Fabio Marchesoni. Stochastic resonance. Reviews of modern physics, 1998. [96] Rohit Gandikota, Joanna Materzynska, Jaden Fiotto-Kaufman, and David Bau. Erasing concepts from diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2426–2436, 2023. [97] Candice R Gerstner and Hany Farid. Detecting real-time deep-fake videos using active illumination. In CVPR, 2022. [98] Spyros Gidaris and Nikos Komodakis. Object detection via a multi-region and semantic segmentation-aware cnn model. In ICCV, 2015. [99] Rohit Girdhar, Joao Carreira, Carl Doersch, and Andrew Zisserman. A better baseline for ava. arXiv preprint arXiv:1807.10066, 2018. [100] Rohit Girdhar, Joao Carreira, Carl Doersch, and Andrew Zisserman. Video action In Proceedings of the IEEE/CVF conference on computer vision transformer network. and pattern recognition, pages 244–253, 2019. [101] Rohit Girdhar and Deva Ramanan. Attentional pooling for action recognition. Advances in neural information processing systems, 30, 2017. [102] Ross Girshick. Fast R-CNN. In ICCV, 2015. [103] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 
Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014. [104] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Region-based convolutional networks for accurate object detection and segmentation. TPAMI, 2015. 161 [105] Georgia Gkioxari and Jitendra Malik. Finding action tubes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 759–768, 2015. [106] Miroslav Goljan, Jessica Fridrich, and Tomáš Filler. Large scale test of sensor fingerprint camera identification. Media forensics and security, 7254:72540I, 2009. [107] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014. [108] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014. [109] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014. [110] Shreyank N Gowda, Marcus Rohrbach, Frank Keller, and Laura Sevilla-Lara. Learn2augment: learning to composite videos for data augmentation in action recognition. In European conference on computer vision, pages 242–259. Springer, 2022. [111] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller- Freitag, et al. The" something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision, pages 5842–5850, 2017. [112] Chunhui Gu, Chen Sun, David A Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, et al. Ava: A video dataset of spatio-temporally localized atomic visual actions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6047–6056, 2018. [113] Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis. In CVPR, 2022. [114] Luca Guarnera, Oliver Giudice, and Sebastiano Battiato. Deepfake detection by analyzing convolutional traces. In CVPRW, 2020. [115] Xiao Guo, Yaojie Liu, Anil Jain, and Xiaoming Liu. Multi-domain learning for updating face anti-spoofing models. In ECCV, 2022. [116] Xiao Guo, Iacopo Masi, Xiaohong Liu, Zhiyuan Ren, Steven Grosz, and Xiaoming Liu. Hierarchical fine-grained image forgery detection and localization. In CVPR, 2023. [117] Agrim Gupta, Piotr Dollár, and Ross Girshick. LVIS: A dataset for large vocabulary instance segmentation. In CVPR, 2019. 162 [118] Xintong Han, Vlad Morariu, Peng IS Larry Davis, et al. Two-stream neural networks for tampered face detection. In CVPRW, 2017. [119] Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet? In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 6546–6555, 2018. [120] Chunming He, Kai Li, Yachao Zhang, Longxiang Tang, Yulun Zhang, Zhenhua Guo, and Xiu Li. Camouflaged object detection with feature decomposition and edge reconstruction. In CVPR, 2023. [121] Chunming He, Kai Li, Yachao Zhang, Guoxia Xu, Longxiang Tang, Yulun Zhang, Zhenhua Guo, and Xiu Li. 
Weakly-supervised concealed object segmentation with SAM-based pseudo labeling and multi-scale feature grouping. arXiv preprint arXiv:2305.11003, 2023. [122] Chunming He, Kai Li, Yachao Zhang, Yulun Zhang, Zhenhua Guo, Xiu Li, Martin Danelljan, and Fisher Yu. Strategic preys make acute predators: Enhancing camouflaged object detectors by generating camouflaged objects. arXiv preprint arXiv:2308.03166, 2023. [123] Jun-Yan He, Xiao Wu, Zhi-Qi Cheng, Zhaoquan Yuan, and Yu-Gang Jiang. Db-lstm: Densely-connected bi-directional lstm for human action recognition. Neurocomputing, 444:319–331, 2021. [124] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017. [125] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. TPAMI, 2015. [126] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9):1904–1916, 2015. [127] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. [128] Yihui He, Chenchen Zhu, Jianren Wang, Marios Savvides, and Xiangyu Zhang. Bounding box regression with uncertainty for accurate object detection. In CVPR, 2019. [129] Zhenliang He, Wangmeng Zuo, Meina Kan, Shiguang Shan, and Xilin Chen. Attgan: Facial attribute editing by only changing what you want. IEEE transactions on image processing, 28:5464–5478, 2019. [130] Victoria Heath. From a sleazy Reddit post to a national security threat: A closer look at the 163 deepfake discourse. In Disinformation and Digital Democracies in the 21st Century. The NATO Association of Canada, 2019. [131] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017. [132] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 2020. [133] Younggi Hong, Min Ju Kim, Isack Lee, and Seok Bong Yoo. Fluxformer: Flow-guided IEEE duplex attention transformer via spatio-temporal clustering for action recognition. Robotics and Automation Letters, 2023. [134] Gangyang Hou, Bo Ou, Min Long, and Fei Peng. Separable reversible data hiding for encrypted 3d mesh models based on octree subdivision and multi-msb prediction. IEEE Transactions on Multimedia, 2023. [135] Jianqin Yin Yanbin Han Wendi Hou and Jinping Li. Detection of the mobile object with camouflage color under dynamic background based on optical flow. Procedia Engineering, 2011. [136] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021. [137] Weizhe Hua, Zhiru Zhang, and G Edward Suh. Reverse engineering convolutional neural networks through side-channel information leaks. In DAC, 2018. [138] Sandy Huang, Nicolas Papernot, Ian Goodfellow, Yan Duan, and Pieter Abbeel. Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284, 2017. [139] Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image- to-image translation. In ECCV, 2018. 
[140] Yifei Huang, Minjie Cai, Zhenqiang Li, Feng Lu, and Yoichi Sato. Mutual context network for jointly estimating egocentric gaze and action. IEEE Transactions on Image Processing, 29:7795–7806, 2020. [141] Yihao Huang, Felix Juefei-Xu, Qing Guo, Yang Liu, and Geguang Pu. FakeLocator: Robust localization of gan-based face manipulations. IEEE Transactions on Information Forensics and Security, 17:2657–2672, 2022. [142] Thien Huynh-The, Cam-Hao Hua, and Dong-Seong Kim. Encoding pose features to images with data augmentation for 3-d action recognition. IEEE Transactions on Industrial 164 Informatics, 16(5):3100–3111, 2019. [143] Mobarakol Islam, VS Vibashan, V Jeya Maria Jose, Navodini Wijethilake, Uppal Utkarsh, and Hongliang Ren. Brain tumor segmentation and survival prediction using 3d attention unet. In Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries: 5th International Workshop, BrainLes 2019, Held in Conjunction with MICCAI 2019, Shenzhen, China, October 17, 2019, Revised Selected Papers, Part I 5, pages 262–272. Springer, 2020. [144] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017. [145] Abdul Jabbar, Xi Li, and Bourahla Omar. A survey on generative adversarial networks: Variants, applications, and training. arXiv preprint arXiv:2006.05132, 2020. [146] Youngdong Jang, Dong In Lee, MinHyuk Jang, Jong Wook Kim, Feng Yang, and Sangpil Kim. Waterf: Robust watermarks in radiance fields for protection of copyrights. In CVPR, 2024. [147] László A Jeni, Jeffrey F Cohn, and Fernando De La Torre. Facing imbalanced data– recommendations for the use of performance metrics. In ACII, 2013. [148] Hyeonho Jeong, Geon Yeong Park, and Jong Chul Ye. Vmc: Video motion customization using temporal attention adaption for text-to-video diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9212–9221, 2024. [149] Ge-Peng Ji, Deng-Ping Fan, Yu-Cheng Chou, Dengxin Dai, Alexander Liniger, and Luc Van Gool. Deep gradient learning for efficient camouflaged object detection. Machine Intelligence Research, 2023. [150] Ge-Peng Ji, Lei Zhu, Mingchen Zhuge, and Keren Fu. Fast camouflaged object detection via edge-based reversible re-calibration network. Pattern Recognition, 123, 2022. [151] Ruiqi Jiang, Hang Zhou, Weiming Zhang, and Nenghai Yu. Reversible data hiding in encrypted three-dimensional mesh models. IEEE Transactions on Multimedia, 2017. [152] Mei Jiansheng, Li Sukang, and Tan Xiaomei. A digital watermarking algorithm based on DCT and DWT. In WISA, 2009. [153] Amin Jourabloo, Yaojie Liu, and Xiaoming Liu. Face de-spoofing: Anti-spoofing via noise modeling. In ECCV, 2018. [154] Nobukatsu Kajiura, Hong Liu, and Shin’ichi Satoh. Improving camouflaged object detection with the uncertainty of pseudo-edge labels. In ACM Multimedia Asia, 2021. 165 [155] Satoshi Kanai, Hiroaki Date, Takeshi Kishinami, et al. Digital watermarking for 3d polygons using multiresolution wavelet decomposition. In Proc. Sixth IFIP WG, volume 5, pages 296– 307, 1998. [156] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In ICLR, 2018. [157] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In ICLR, 2018. [158] Tero Karras, Samuli Laine, and Timo Aila. 
A style-based generator architecture for generative adversarial networks. In CVPR, 2019. [159] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In CVPR, 2019. [160] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017. [161] Mohammad Ibrahim Khan, Md Maklachur Rahman, and Md Iqbal Hasan Sarker. Digital watermarking for image authentication based on combined DCT, DWT and SVD transformation. International Journal of Computer Science Issues, 10:223, 2013. [162] Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. DiffusionCLIP: Text-guided diffusion models for robust image manipulation. In CVPR, 2022. [163] Youngeun Kim, Yuhang Li, Abhishek Moitra, Ruokai Yin, and Priyadarshini Panda. Do we really need a large number of visual prompts? Neural Networks, 2024. [164] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014. [165] John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. A watermark for large language models. In ICML, 2023. [166] Trupti M Kodinariya and Prashant R Makwana. Review on determining number of cluster in k-means clustering. International Journal, 1(6):90–95, 2013. [167] Okan Köpüklü, Xiangyu Wei, and Gerhard Rigoll. You only watch once: A unified cnn architecture for real-time spatiotemporal action localization. arXiv preprint arXiv:1911.06644, 2019. [168] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009. 166 [169] Abhinav Kumar, Garrick Brazil, Enrique Corona, Armin Parchami, and Xiaoming Liu. DEVIANT: Depth EquiVarIAnt NeTwork for Monocular 3D Object Detection. In ECCV, 2022. [170] Abhinav Kumar, Garrick Brazil, and Xiaoming Liu. GrooMeD-NMS: Grouped mathematically differentiable nms for monocular 3D object detection. In CVPR, 2021. [171] Nupur Kumari, Binazeiang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1931–1941, 2023. [172] Nupur Kumari, Bingliang Zhang, Sheng-Yu Wang, Eli Shechtman, Richard Zhang, and Jun-Yan Zhu. Ablating concepts in text-to-image diffusion models. In ICCV, 2023. [173] Nilakshan Kunananthaseelan, Jing Zhang, and Mehrtash Harandi. Lavip: Language- grounded visual prompting. In AAAI, 2024. [174] Kenji Kurosawa, Kenro Kuroki, and Naoki Saitoh. CCD fingerprint method-identification of a video camera from videotaped images. In ICIP, 1999. [175] Ivan Laptev. On space-time interest points. International journal of computer vision, 64:107–123, 2005. [176] Trung-Nghia Le, Tam Nguyen, Zhongliang Nie, Minh-Triet Tran, and Akihiro Sugimoto. Anabranch network for camouflaged object segmentation. CVIU, 2019. [177] Hsin-Ying Lee, Hung-Yu Tseng, Jia-Bin Huang, Maneesh Kumar Singh, and Ming-Hsuan Yang. Diverse image-to-image translation via disentangled representations. In ECCV, 2018. [178] Aixuan Li, Jing Zhang, Yunqiu Lv, Bowen Liu, Tong Zhang, and Yuchao Dai. Uncertainty- aware joint salient object and camouflaged object detection. In CVPR, 2021. [179] Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabás Póczos. MMD GAN: Towards deeper understanding of moment matching network. In NeurIPS, 2017. 
[180] Lingzhi Li, Jianmin Bao, Ting Zhang, Hao Yang, Dong Chen, Fang Wen, and Baining Guo. Face x-ray for more general face forgery detection. In CVPR, 2020. [181] Pan Li, Da Li, Wei Li, Shaogang Gong, Yanwei Fu, and Timothy M Hospedales. A simple feature augmentation for domain generalization. In ICCV, 2021. [182] Yanghao Li, Chao-Yuan Wu, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, and Christoph Feichtenhofer. Mvitv2: Improved multiscale vision transformers for classification and detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4804–4814, 2022. 167 [183] Yuezun Li and Siwei Lyu. Exposing DeepFake videos by detecting face warping artifacts. In CVPRW, 2019. [184] Yuezun Li, Xin Yang, Pu Sun, Honggang Qi, and Siwei Lyu. Celeb-DF: A large-scale challenging dataset for deepfake forensics. In CVPR, 2020. [185] Zeming Li, Chao Peng, Gang Yu, Xiangyu Zhang, Yangdong Deng, and Jian Sun. Light-head R-CNN: In defense of two-stage object detector. arXiv preprint arXiv:1711.07264, 2017. [186] Zhenyang Li, Kirill Gavrilyuk, Efstratios Gavves, Mihir Jain, and Cees GM Snoek. Videolstm convolves, attends and flows for action recognition. Computer Vision and Image Understanding, 166:41–50, 2018. [187] Ji Lin, Chuang Gan, and Song Han. Tsm: Temporal shift module for efficient video In Proceedings of the IEEE/CVF international conference on computer understanding. vision, pages 7083–7093, 2019. [188] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017. [189] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, 2017. [190] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, In Piotr Dollár, and Lawrence Zitnick. Microsoft COCO: Common objects in context. ECCV, 2014. [191] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In ECCV, 2018. [192] Honggu Liu, Xiaodan Li, Wenbo Zhou, Yuefeng Chen, Yuan He, Hui Xue, Weiming Zhang, and Nenghai Yu. Spatial-phase shallow learning: rethinking face forgery detection in frequency domain. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 772–781, 2021. [193] Jiannan Liu, Bo Dong, Shuai Wang, Hui Cui, Deng-Ping Fan, Jiquan Ma, and Geng Chen. Covid-19 lung infection segmentation with a novel two-stage cross-domain transfer learning framework. Medical image analysis, 2021. [194] Ming Liu, Yukang Ding, Min Xia, Xiao Liu, Errui Ding, Wangmeng Zuo, and Shilei Wen. STGAN: A unified selective transfer network for arbitrary image attribute editing. In CVPR, 2019. [195] Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation 168 networks. In NeurIPS, 2017. [196] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander Berg. SSD: Single shot multibox detector. In ECCV, 2016. [197] Xiaohong Liu, Yaojie Liu, Jun Chen, and Xiaoming Liu. PSCC-Net: Progressive spatio- channel correlation network for image manipulation detection and localization. In arXiv preprint arXiv:2103.10596, 2021. [198] Yugeng Liu, Zheng Li, Michael Backes, Yun Shen, and Yang Zhang. Watermarking diffusion model. arXiv preprint arXiv:2305.12502, 2023. [199] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. 
Deep learning face attributes in the wild. In ICCV, 2015. [200] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In ICCV, 2015. [201] Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting using denoising diffusion probabilistic models. In CVPR, 2022. [202] Jan Lukáš, Jessica Fridrich, and Miroslav Goljan. Detecting digital image forgeries using sensor pattern noise. Security, Steganography, and Watermarking of Multimedia Contents VIII, 6072:60720Y, 2006. [203] Jan Lukas, Jessica Fridrich, and Miroslav Goljan. Digital camera identification from sensor IEEE Transactions on Information Forensics and Security, 1(2):205–214, pattern noise. 2006. [204] Yunqiu Lv, Jing Zhang, Yuchao Dai, Aixuan Li, Bowen Liu, Nick Barnes, and Deng-Ping Fan. Simultaneously localize, segment and rank the camouflaged objects. In CVPR, 2021. [205] Kang Ma, Ying Fu, Chunshui Cao, Saihui Hou, Yongzhen Huang, and Dezhi Zheng. Learning visual prompt for gait recognition. In CVPR, 2024. [206] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017. [207] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In ICLR, 2018. [208] Yuxin Mao, Jing Zhang, Zhexiong Wan, Yuchao Dai, Aixuan Li, Yunqiu Lv, Xinyu Tian, Deng-Ping Fan, and Nick Barnes. Transformer transforms salient object detection and 169 camouflaged object detection. arXiv preprint arXiv:2104.10127, 2021. [209] Francesco Marra, Diego Gragnaniello, Luisa Verdoliva, and Giovanni Poggi. Do GANs leave artificial fingerprints? In MIPR, 2019. [210] Francesco Marra, Cristiano Saltori, Giulia Boato, and Luisa Verdoliva. Incremental learning for the detection and classification of GAN-generated images. In WIFS, 2019. [211] Iacopo Masi, Aditya Killekar, Royston Marian Mascarenhas, Shenoy Pratik Gurudatt, and Wael AbdAlmageed. Two-branch recurrent network for isolating deepfakes in videos. In ECCV. Springer, 2020. [212] Falko Matern, Christian Riess, and Marc Stamminger. Exploiting visual artifacts to expose deepfakes and face manipulations. In WACVW, 2019. [213] Scott McCloskey and Michael Albright. Detecting GAN-generated imagery using saturation cues. In ICIP, 2019. [214] Safa C. Medin, Bernhard Egger, Anoop Cherian, Ye Wang, Joshua B. Tenenbaum, Xiaoming Liu, and Tim K. Marks. MOST-GAN: 3D morphable StyleGAN for disentangled face image manipulation. In AAAI, 2022. [215] Haiyang Mei, Ge-Peng Ji, Ziqi Wei, Xin Yang, Xiaopeng Wei, and Deng-Ping Fan. Camouflaged object segmentation with distraction mining. In CVPR, 2021. [216] Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans. On distillation of guided diffusion models. In CVPR, 2023. [217] Lili Meng, Bo Zhao, Bo Chang, Gao Huang, Wei Sun, Frederick Tung, and Leonid Sigal. Interpretable spatio-temporal attention for video action recognition. In Proceedings of the IEEE/CVF international conference on computer vision workshops, pages 0–0, 2019. [218] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In ICLR, 2018. [219] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. 
Null-text In Proceedings of the inversion for editing real images using guided diffusion models. IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6038–6047, 2023. [220] Todd K Moon. The expectation-maximization algorithm. Signal processing magazine, 13(6):47–60, 1996. [221] Travis Munyer and Xin Zhong. Deeptextmark: Deep learning based text watermarking for detection of large language model generated text. arXiv preprint, 2023. 170 [222] Lakshmanan Nataraj, Tajuddin Manhar Mohammed, BS Manjunath, Shivkumar Chandrasekaran, Arjuna Flenner, Jawadul H Bappy, and Amit K Roy-Chowdhury. Detecting GAN generated fake images using co-occurrence matrices. Electronic Imaging, 2019:532–1, 2019. [223] Kamyar Nazeri, Eric Ng, and Mehran Ebrahimi. Image colorization using generative adversarial networks. In AMDO, 2018. [224] Eric Nguyen, Tu Bui, Vishy Swaminathan, and John Collomosse. OSCAR-Net: Object- centric scene graph attention for image attribution. In ICCV, 2021. [225] Huy H Nguyen, Fuming Fang, Junichi Yamagishi, and Isao Echizen. Multi-task learning for detecting and segmenting manipulated facial images and videos. In BTAS, 2019. [226] Huy H Nguyen, Fuming Fang, Junichi Yamagishi, and Isao Echizen. Multi-task learning for detecting and segmenting manipulated facial images and videos. In BTAS, 2019. [227] Huy H Nguyen, Junichi Yamagishi, and Isao Echizen. Capsule-forensics: Using capsule networks to detect forged images and videos. In ICASSP, 2019. [228] Yuval Nirkin, Iacopo Masi, Anh Tran Tuan, Tal Hassner, and Gerard Medioni. On face segmentation, face swapping, and face perception. In FGR, pages 98–105. IEEE, 2018. [229] Yuval Nirkin, Lior Wolf, Yosi Keller, and Tal Hassner. Deepfake detection based on the discrepancy between the face and its context. arXiv preprint arXiv:2008.12262, 2020. [230] Yuval Nirkin, Lior Wolf, Yosi Keller, and Tal Hassner. Deepfake detection based on IEEE Transactions on Pattern Analysis discrepancies between faces and their context. and Machine Intelligence, PP:1–1, 2021. [231] Kento Nishi, Yi Ding, Alex Rich, and Tobias Hollerer. Augmentation strategies for learning with noisy labels. In CVPR, 2021. [232] Ori Nizan and Ayellet Tal. Breaking the cycle - colleagues are all you need. In CVPR. [233] Seong Joon Oh, Max Augustin, Mario Fritz, and Bernt Schiele. Towards reverse-engineering black-box neural networks. In ICLR, 2018. [234] Ryutarou Ohbuchi, Hiroshi Masuda, and Masaki Aono. Watermarking three-dimensional IEEE Journal on polygonal models through geometric and topological modifications. selected areas in communications, 16(4):551–560, 1998. [235] Junting Pan, Ziyi Lin, Xiatian Zhu, Jing Shao, and Hongsheng Li. St-adapter: Parameter- efficient image-to-video transfer learning. Advances in Neural Information Processing Systems, 35:26462–26477, 2022. 171 [236] Yuxin Pan, Yiwang Chen, Qiang Fu, Ping Zhang, and Xin Xu. Study on the camouflaged target detection method based on 3D convexity. Modern Applied Science, 2011. [237] Sungho Park and Hyeran Byun. Fair-vpt: Fair visual prompt tuning for image classification. In CVPR, 2024. [238] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. GauGAN: semantic image synthesis with spatially adaptive normalization. In ACM, 2019. 
[239] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019. [240] Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, and Alexei Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016. [241] Sen Peng, Yufei Chen, Cong Wang, and Xiaohua Jia. Protecting the intellectual property of diffusion models by the watermark diffusion process. arXiv preprint arXiv:2306.03436, 2023. [242] Xiaojiang Peng and Cordelia Schmid. Multi-region two-stream r-cnn for action detection. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, pages 744–759. Springer, 2016. [243] Yuxin Peng, Yunzhen Zhao, and Junchao Zhang. Two-stream collaborative learning with spatial-temporal attention for video classification. IEEE Transactions on Circuits and Systems for Video Technology, 29(3):773–786, 2018. [244] Hieu Pham, Melody Guan, Barret Zoph, Quoc Le, and Jeff Dean. Efficient neural architecture search via parameters sharing. In ICML, 2018. [245] Stanislav Pidhorskyi, Donald A Adjeroh, and Gianfranco Doretto. Adversarial latent autoencoders. In CVPR, 2020. [246] Ed Pizzi, Sreya Dutta Roy, Sugosh Nagavara Ravindra, Priya Goyal, and Matthijs Douze. A self-supervised descriptor for image copy detection. In CVPR, 2022. [247] Ryan Po, Guandao Yang, Kfir Aberman, and Gordon Wetzstein. Orthogonal adaptation for modular customization of diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7964–7973, 2024. [248] Albert Pumarola, Antonio Agudo, Aleix M Martinez, Alberto Sanfeliu, and Francesc Moreno-Noguer. Ganimation: Anatomically-aware facial animation from a single image. In 172 ECCV, 2018. [249] Albert Pumarola, Antonio Agudo, Aleix M Martinez, Alberto Sanfeliu, and Francesc Moreno-Noguer. GANimation: One-shot anatomically consistent facial animation. International Journal of Computer Vision, 128:698–713, 2020. [250] Yuyang Qian, Guojun Yin, Lu Sheng, Zixuan Chen, and Jing Shao. Thinking in frequency: Face forgery detection by mining frequency-aware clues. In ECCV, 2020. [251] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021. [252] Arezoo Rajabi, Rakesh B Bobba, Mike Rosulek, Charles Wright, and Wu-chi Feng. On the (im) practicality of adversarial perturbation for image privacy. PETS, 2021. [253] VP Subramanyam Rallabandi and Prasun Kumar Roy. Magnetic resonance image enhancement using stochastic resonance in fourier domain. Magnetic resonance imaging, 2010. [254] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016. [255] Joseph Redmon and Ali Farhadi. YOLO9000: better, faster, stronger. In CVPR, 2017. [256] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018. [257] Atique Rehman, Rafia Rahim, Shahroz Nadeem, and Sibt Hussain. End-to-end trained CNN encoder-decoder networks for image steganography. 
In ECCVW, 2018. [258] Atique-ur Rehman, Rafia Rahim, Shahroz Nadeem, and Sibt-ul Hussain. End-to-end trained CNN encoder-decoder networks for image steganography. In ECCVW, 2019. [259] Jingjing Ren, Xiaowei Hu, Lei Zhu, Xuemiao Xu, Yangyang Xu, Weiming Wang, Zijun Deng, and Pheng-Ann Heng. Deep texture-aware features for camouflaged object detection. IEEE Transactions on Circuits and Systems for Video Technology, 2023. [260] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. NeurIPS, 2015. [261] Yuhao Ren, Fabing Duan, François Chapeau-Blondeau, and Derek Abbott. Self-gating stochastic-resonance-based autoencoder for unsupervised learning. Physical Review E, 2024. [262] Stephan R. Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun. Playing for data: 173 Ground truth from computer games. In ECCV, 2016. [263] Anna Rogers. The attribution problem with generative ai. Hacking Semantics, 2022. [264] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022. [265] Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. Faceforensics++: Learning to detect manipulated facial images. In CVPR, 2019. [266] Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and In Matthias Nießner. FaceForensics++: Learning to detect manipulated facial images. ICCV, 2019. [267] Nataniel Ruiz, Sarah Adel Bargal, and Stan Sclaroff. Disrupting deepfakes: Adversarial attacks against conditional image translation networks and facial manipulation systems. In ECCV, 2020. [268] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22500–22510, 2023. [269] Dan Ruta, Saeid Motiian, Baldo Faieta, Zhe Lin, Hailin Jin, Alex Filipkowski, Andrew Gilbert, and John Collomosse. ALADIN: All layer adaptive instance normalization for fine-grained style similarity. In ICCV, 2021. [270] Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, and Hervé Jégou. Radioactive data: tracing through training. In ICML, 2020. [271] Suman Saha, Gurkirt Singh, Michael Sapienza, Philip HS Torr, and Fabio Cuzzolin. Deep learning for detecting multiple space-time action tubes in videos. arXiv preprint arXiv:1608.01529, 2016. [272] Eran Segalis and Eran Galili. OGAN: Disrupting deepfakes with an adversarial attack that survives training. arXiv preprint arXiv:2006.12247, 2020. [273] Maximilian Seitzer. pytorch-fid: FID Score for PyTorch, August 2020. Version 0.3.0. [274] Vladimir V Semenov and Anna Zakharova. Multiplexing-based control of stochastic resonance. Chaos: An Interdisciplinary Journal of Nonlinear Science, 2022. [275] P Sengottuvelan, Amitabh Wahi, and A Shanmugam. Performance of decamouflaging through exploratory image analysis. In ICETET, 2008. 174 [276] Shawn Shan, Jenna Cryan, Emily Wenger, Haitao Zheng, Rana Hanocka, and Ben Y Zhao. Glaze: Protecting artists from style mimicry by {Text-to-Image} models. In 32nd USENIX Security Symposium (USENIX Security 23), pages 2187–2204, 2023. [277] Mengen Shen, Jianhua Yang, Wenbo Jiang, Miguel AF Sanjuan, and Yuqiao Zheng. Stochastic resonance in image denoising as an alternative to traditional methods and deep learning. 
Nonlinear Dynamics, 2022. [278] Jing Shi, Wei Xiong, Zhe Lin, and Hyun Joon Jung. Instantbooth: Personalized text-to- image generation without test-time finetuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8543–8552, 2024. [279] Kaede Shiohara and Toshihiko Yamasaki. Detecting deepfakes with self-blended images. In CVPR, 2022. [280] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. Advances in neural information processing systems, 27, 2014. [281] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. [282] Amit Kumar Singh, Nomit Sharma, Mayank Dave, and Anand Mohan. A novel technique for digital image watermarking in spatial domain. In PDGC, 2012. [283] Suriya Singh, Chetan Arora, and CV Jawahar. First person action recognition using deep learned descriptors. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2620–2628, 2016. [284] Lawrence Sirovich and Michael Kirby. Low-dimensional procedure for the characterization of human faces. Journal of the Optical Society of America, 4(3):519–524, 1987. [285] Kihyuk Sohn, Huiwen Chang, José Lezama, Luisa Polania, Han Zhang, Yuan Hao, Irfan Essa, and Lu Jiang. Visual prompt tuning for generative transfer learning. In CVPR, 2023. [286] Gowthami Somepalli, Anubhav Gupta, Kamal Gupta, Shramay Palta, Micah Goldblum, Jonas Geiping, Abhinav Shrivastava, and Tom Goldstein. Measuring style similarity in diffusion models. arXiv preprint arXiv:2404.01292, 2024. [287] Kritaphat Songsri-in and Stefanos Zafeiriou. Complement face forensic detection and localization with facial landmarks. arXiv preprint arXiv:1910.05455, 2019. [288] Anil K Srivastava, Virendra K Srivastava, and Aman Ullah. The coefficient of determination and its adjusted version in linear regression models. Econometric reviews, 14(2):229–240, 1995. 175 [289] Chen Sun, Abhinav Shrivastava, Carl Vondrick, Kevin Murphy, Rahul Sukthankar, and Cordelia Schmid. Actor-centric relation network. In Proceedings of the European Conference on Computer Vision (ECCV), pages 318–334, 2018. [290] Lin Sun, Kui Jia, Kevin Chen, Dit-Yan Yeung, Bertram E Shi, and Silvio Savarese. Lattice long short-term memory for human action recognition. In Proceedings of the IEEE international conference on computer vision, pages 2147–2156, 2017. [291] Peize Sun, Rufeng Zhang, Yi Jiang, Tao Kong, Chenfeng Xu, Wei Zhan, Masayoshi Tomizuka, Lei Li, Zehuan Yuan, Changhu Wang, and Ping Luo. Sparse R-CNN: End- to-end object detection with learnable proposals. In CVPR, 2021. [292] Yujia Sun, Geng Chen, Tao Zhou, Yi Zhang, and Nian Liu. Context-aware cross-level fusion network for camouflaged object detection. arXiv preprint arXiv:2105.12555, 2021. [293] Sebastian Szyller, Buse Gul Atli, Samuel Marchal, and N Asokan. Dawn: Dynamic adversarial watermarking of neural networks. In ACM-MM, 2021. [294] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V Le. MnasNet: Platform-aware neural architecture search for mobile. In CVPR, 2019. [295] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, pages 6105–6114. PMLR, 2019. [296] Wei Ren Tan, Chee Seng Chan, Hernan Aguirre, and Kiyoshi Tanaka. Improved ArtGAN for conditional synthesis of natural image and artwork. 
IEEE Transactions on Image Processing, 28(1):394–409, 2019. [297] Matthew Tancik, Ben Mildenhall, and Ren Ng. StegaStamp: Invisible hyperlinks in physical photographs. In CVPR, 2020. [298] Li Tang, Qingqing Ye, Haibo Hu, Qiao Xue, Yaxin Xiao, and Jin Li. Deepmark: A scalable and robust framework for deepfake video detection. ACM Transactions on Privacy and Security, 2024. [299] Long Tang, Dengpan Ye, Yunna Lv, Chuanxi Chen, and Yunming Zhang. Once and for all: Universal transferable adversarial perturbation against deep hashing-based facial image retrieval. In AAAI, 2024. [300] Florian Tramèr, Fan Zhang, Ari Juels, Michael K Reiter, and Thomas Ristenpart. Stealing machine learning models via prediction APIs. In USENIXSS, 2016. [301] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning In Proceedings of the IEEE spatiotemporal features with 3d convolutional networks. 176 international conference on computer vision, pages 4489–4497, 2015. [302] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 6450–6459, 2018. [303] Luan Tran, Xi Yin, and Xiaoming Liu. Disentangled representation learning GAN for pose-invariant face recognition. In CVPR, 2017. [304] Luan Tran, Xi Yin, and Xiaoming Liu. Disentangled representation learning gan for pose- invariant face recognition. In CVPR, 2017. [305] Yuan-Yu Tsai and Hong-Lin Liu. Integrating coordinate transformation and random sampling into high-capacity reversible data hiding in encrypted polygonal models. IEEE Transactions on Dependable and Secure Computing, 2022. [306] Radim Tyleček and Radim Šára. Spatial pattern templates for recognition of objects with regular structure. In GCPR, 2013. [307] Amin Ullah, Jamil Ahmad, Khan Muhammad, Muhammad Sajjad, and Sung Wook Baik. Action recognition in video sequences using deep bi-directional lstm with cnn features. IEEE access, 6:1155–1166, 2017. [308] Diego Valsesia, Giulio Coluccia, Tiziano Bianchi, and Enrico Magli. Compressed fingerprint matching and camera identification via random projections. IEEE Transactions on Information Forensics and Security, 10(7):1472–1485, 2015. [309] Thanh Van Le, Hao Phung, Thuan Hoang Nguyen, Quan Dao, Ngoc N Tran, and Anh In Tran. Anti-dreambooth: Protecting users from personalized text-to-image synthesis. Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2116– 2127, 2023. [310] Ashish Vaswani. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017. [311] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017. [312] Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple features. In CVPR, 2001. [313] Paul Viola and Michael Jones. Robust real-time face detection. IJCV, 2004. [314] Christoffer Waldemarsson. Disinformation, Deepfakes & Democracy; The European response to election interference in the digital age. The Alliance of Democracies Foundation, 2020. 177 [315] Cheng Wang, Haojin Yang, and Christoph Meinel. Exploring multimodal video In 2016 International Joint Conference on Neural representation for action recognition. Networks (IJCNN), pages 1924–1931. IEEE, 2016. [316] Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Liao. 
YOLOv7: Trainable bag-of- freebies sets new state-of-the-art for real-time object detectors. In CVPR, 2023. [317] Feifei Wang, Zhentao Tan, Tianyi Wei, Yue Wu, and Qidong Huang. Simac: A simple anti-customization method for protecting face privacy against text-to-image synthesis of diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12047–12056, 2024. [318] Heng Wang and A Kl. aser, c. schmid, and c.-l. liu,“action recognition by dense trajectories,”. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit, pages 3169–3176, 2011. [319] Heng Wang and Cordelia Schmid. Action recognition with improved trajectories. In Proceedings of the IEEE international conference on computer vision, pages 3551–3558, 2013. [320] Jiangfeng Wang, Hanzhou Wu, Xinpeng Zhang, and Yuwei Yao. Watermarking in deep neural networks via error back-propagation. Electronic Imaging, 2020. [321] Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, Wenyu Liu, and Bin Xiao. Deep high- resolution representation learning for visual recognition. TPAMI, 2020. [322] Limin Wang, Yu Qiao, and Xiaoou Tang. Action recognition with trajectory-pooled deep- convolutional descriptors. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4305–4314, 2015. [323] Limin Wang, Yuanjun Xiong, Zhe Wang, and Yu Qiao. Towards good practices for very deep two-stream convnets. arXiv preprint arXiv:1507.02159, 2015. [324] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European conference on computer vision, pages 20–36. Springer, 2016. [325] Run Wang, Felix Juefei-Xu, Meng Luo, Yang Liu, and Lina Wang. FakeTagger: Robust safeguards against deepfake dissemination via provenance tracking. In ACMM, 2021. [326] Run Wang, Felix Juefei-Xu, Lei Ma, Xiaofei Xie, Yihao Huang, Jian Wang, and Yang Liu. FakeSpotter: A simple yet robust baseline for spotting ai-synthesized fake faces. In IJCAI, 2020. [327] Sheng-Yu Wang, David Bau, and Jun-Yan Zhu. Sketch your own gan. In ICCV, 2021. 178 [328] Sheng-Yu Wang, Alexei A Efros, Jun-Yan Zhu, and Richard Zhang. Evaluating data attribution for text-to-image models. In ICCV, 2023. [329] Sheng-Yu Wang, Aaron Hertzmann, Alexei A Efros, Jun-Yan Zhu, and Richard Zhang. Data attribution for text-to-image models by unlearning synthesized images. arXiv preprint arXiv:2406.09408, 2024. [330] Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A Efros. CNN- generated images are surprisingly easy to spot... for now. In CVPR, 2020. [331] Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A Efros. CNN- generated images are surprisingly easy to spot... for now. In CVPR, 2020. [332] Xiaogang Wang and Xiaoou Tang. Face photo-sketch synthesis and recognition. IEEE transactions on pattern analysis and machine intelligence, 31:1955–1967, 2008. [333] Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-ESRGAN: Training real- world blind super-resolution with pure synthetic data. In CVPR, 2021. [334] Yunbo Wang, Mingsheng Long, Jianmin Wang, and Philip S Yu. Spatiotemporal pyramid network for video action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 1529–1538, 2017. [335] Zhengwei Wang, Qi She, and Tomás E. Ward. 
Generative adversarial networks in computer vision: A survey and taxonomy. ACM Computing Surveys, 54(2), 2021. [336] Philippe Weinzaepfel, Zaid Harchaoui, and Cordelia Schmid. Learning to track for spatio- In Proceedings of the IEEE international conference on temporal action localization. computer vision, pages 3164–3172, 2015. [337] Michael J Wilber, Chen Fang, Hailin Jin, Aaron Hertzmann, John Collomosse, and Serge Belongie. BAM! the behance artistic media dataset for recognition beyond photography. In ICCV, 2017. [338] Adrian Wolny, Lorenzo Cerrone, Athul Vijayan, Rachele Tofanelli, Amaya Vilches Barro, Marion Louveaux, Christian Wenzl, Sören Strauss, David Wilson-Sánchez, Rena Lymbouridou, Susanne S Steigleder, Constantin Pape, Alberto Bailoni, Salva Duran- Nebreda, George W Bassel, Jan U Lohmann, Miltos Tsiantis, Fred A Hamprecht, Kay Schneitz, Alexis Maizel, and Anna Kreshuk. Accurate and versatile 3d segmentation of plant tissues at cellular resolution. eLife, 9:e57613, jul 2020. [339] Di Wu, Junjun Chen, Nabin Sharma, Shirui Pan, Guodong Long, and Michael Blumenstein. In 2019 Adversarial action data augmentation for similar gesture action recognition. International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2019. 179 [340] Xi Wu, Zhen Xie, YuTao Gao, and Yu Xiao. SSTNET: Detecting manipulated faces through spatial, steganalysis and temporal features. In ICASSP, 2020. [341] Xiaoshuai Wu, Xin Liao, and Bo Ou. Sepmark: Deep separable watermarking for unified source tracing and deepfake detection. arXiv preprint, 2023. [342] Xiaoshuai Wu, Xin Liao, Bo Ou, Yuling Liu, and Zheng Qin. Are watermarks bugs for deepfake detectors? rethinking proactive forensics. arXiv preprint, 2024. [343] Zihao Xiao, Xianfeng Gao, Chilin Fu, Yinpeng Dong, Wei Gao, Xiaolu Zhang, Jun Zhou, Improving transferability of adversarial patches on face recognition with and Jun Zhu. generative models. In CVPR, 2021. [344] Chu Xin, Seokhwan Kim, and Kyoung Shin Park. A comparison of machine learning models with data augmentation techniques for skeleton-based human action recognition. In Proceedings of the 14th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, pages 1–6, 2023. [345] Ying Xu, Kiran Raja, and Marius Pedersen. Supervised contrastive learning for generalizable and explainable deepfakes detection. In WACV, 2022. [346] Jian-Ru Xue, Jian-Wu Fang, and Pu Zhang. A survey of scene understanding by event International Journal of Automation and Computing, reasoning in autonomous driving. 2018. [347] Shen Yan, Xuehan Xiong, Anurag Arnab, Zhichao Lu, Mi Zhang, Chen Sun, and Cordelia Schmid. Multiview transformers for video recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3333–3343, 2022. [348] Fan Yang, Qiang Zhai, Xin Li, Rui Huang, Ao Luo, Hong Cheng, and Deng-Ping Fan. Uncertainty-guided transformer reasoning for camouflaged object detection. In ICCV, 2021. [349] Jiewen Yang, Xingbo Dong, Liujun Liu, Chao Zhang, Jiajun Shen, and Dahai Yu. Recurring the transformer for video action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14063–14073, 2022. [350] Xin Yang, Yuezun Li, and Siwei Lyu. Exposing deep fakes using inconsistent head poses. In ICASSP, 2019. [351] Yang Yang, Wen Wang, Liang Peng, Chaotian Song, Yao Chen, Hengjia Li, Xiaolong Yang, Qinglin Lu, Deng Cai, Boxi Wu, et al. 
Lora-composer: Leveraging low-rank adaptation for multi-concept customization in training-free diffusion models. arXiv preprint arXiv:2403.11627, 2024. [352] Leiyue Yao, Wei Yang, and Wei Huang. A data augmentation method for human action 180 recognition using dense joint motion images. Applied Soft Computing, 97:106713, 2020. [353] Yuguang Yao, Yifan Gong, Yize Li, Yimeng Zhang, Xue Lin, and Sijia Liu. Reverse engineering of imperceptible adversarial image perturbations. In ICLR, 2022. [354] Yuguang Yao, Xiao Guo, Vishal Asnani, Yifan Gong, Jiancheng Liu, Xue Lin, Xiaoming Liu, and Sijia Liu. Reverse engineering of deceptions on machine- and human-centric attacks. Foundations and Trends in Privacy and Security, 2024. [355] Erkan Yavuz and Ziya Telatar. Improved SVD-DWT based digital image watermarking against watermark ambiguity. In SAC, 2007. [356] Chin-Yuan Yeh, Hsi-Wen Chen, Shang-Lun Tsai, and Sheng-De Wang. Disrupting image- translation-based deepfake algorithms with adversarial attacks. In WACVW, 2020. [357] Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong. DualGAN: Unsupervised dual learning for image-to-image translation. In CVPR, 2017. [358] Dong Yin, Raphael Gontijo Lopes, Jon Shlens, Ekin Dogus Cubuk, and Justin Gilmer. A fourier perspective on model robustness in computer vision. NeurIPS, 2019. [359] Innfarn Yoo, Huiwen Chang, Xiyang Luo, Ondrej Stava, Ce Liu, Peyman Milanfar, and Feng Yang. Deep 3d-to-2d watermarking: Embedding messages in 3d meshes and extracting them from 2d renderings. In CVPR, 2022. [360] Masakazu Yoshimura, Junji Otsuka, Atsushi Irie, and Takeshi Ohashi. Rawgment: Noise- In accounted raw augmentation enables recognition in a wide variety of environments. CVPR, 2023. [361] A. Yu and K. Grauman. Fine-grained visual comparisons with local learning. In CVPR, 2014. [362] A. Yu and K. Grauman. Semantic jitter: Dense supervision for visual comparisons via synthetic images. In ICCV, 2017. [363] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015. [364] Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015. [365] Ning Yu, Larry S Davis, and Mario Fritz. Attributing fake images to GANs: Learning and analyzing GAN fingerprints. In ICCV, 2019. 181 [366] Ziyang Yuan, Mingdeng Cao, Xintao Wang, Zhongang Qi, Chun Yuan, and Ying Shan. Customnet: Zero-shot object customization with variable-viewpoints in text-to-image diffusion models. arXiv preprint arXiv:2310.19784, 2023. [367] Matthew Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014. [368] Qiang Zhai, Xin Li, Fan Yang, Chenglizhao Chen, Hong Cheng, and Deng-Ping Fan. Mutual graph learning for camouflaged object detection. In CVPR, 2021. [369] Jie Zhang, Dongdong Chen, Jing Liao, Weiming Zhang, Huamin Feng, Gang Hua, and Nenghai Yu. Deep model intellectual property protection via deep watermarking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021. [370] Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang. Beyond a gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Transactions on Image Processing, 26(7):3142–3155, 2017. [371] Xu Zhang, Svebor Karaman, and Shih-Fu Chang. 
Detecting and simulating artifacts in GAN fake images. In WIFS, 2019. [372] Xu Zhang, Svebor Karaman, and Shih-Fu Chang. Detecting and simulating artifacts in GAN fake images. In WIFS, 2019. [373] Yimeng Zhang, Xin Chen, Jinghan Jia, Sijia Liu, and Ke Ding. Text-visual prompting for efficient 2d temporal video grounding. In CVPR, 2023. [374] Yushu Zhang, Jiahao Zhu, Mingfu Xue, Xinpeng Zhang, and Xiaochun Cao. Adaptive IEEE Transactions on 3d mesh steganography based on feature-preserving distortion. Visualization and Computer Graphics, 2023. [375] Yuxin Zhang, Nisha Huang, Fan Tang, Haibin Huang, Chongyang Ma, Weiming Dong, and Changsheng Xu. Inversion-based style transfer with diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10146–10156, 2023. [376] Jiaojiao Zhao, Yanyi Zhang, Xinyu Li, Hao Chen, Bing Shuai, Mingze Xu, Chunhui Liu, Kaustav Kundu, Yuanjun Xiong, Davide Modolo, et al. Tuber: Tubelet transformer for video In Proceedings of the IEEE/CVF Conference on Computer Vision and action detection. Pattern Recognition, pages 13598–13607, 2022. [377] Mingjun Zhao, Yakun Yu, Xiaoli Wang, Lei Yang, and Di Niu. Search-map-search: a frame selection paradigm for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10627–10636, 2023. 182 [378] Xuandong Zhao, Yu-Xiang Wang, and Lei Li. Protecting language generation models via invisible watermarking. arXiv preprint, 2023. [379] Xuandong Zhao, Kexun Zhang, Zihao Su, Saastha Vasan, Ilya Grishchenko, Christopher Kruegel, Giovanni Vigna, Yu-Xiang Wang, and Lei Li. Invisible image watermarks are provably removable using generative AI. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. [380] Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Ngai-Man Cheung, and Min Lin. A recipe for watermarking diffusion models. arXiv preprint arXiv:2303.10137, 2023. [381] Zhengyue Zhao, Jinhao Duan, Kaidi Xu, Chenan Wang, Rui Zhang, Zidong Du, Qi Guo, and Xing Hu. Can protective perturbation safeguard personal data from being exploited by In Proceedings of the IEEE/CVF Conference on Computer Vision and stable diffusion? Pattern Recognition, pages 24398–24407, 2024. [382] Yuan Zhi, Zhan Tong, Limin Wang, and Gangshan Wu. Mgsampler: An explainable the IEEE/CVF sampling strategy for video action recognition. International conference on Computer Vision, pages 1513–1522, 2021. In Proceedings of [383] Yaoyao Zhong and Weihong Deng. Towards transferable adversarial attack against deep face recognition. IEEE Transactions on Information Forensics and Security, 2020. [384] Jiren Zhu, Russell Kaplan, Justin Johnson, and Li Fei-Fei. HiDDeN: Hiding data with deep networks. In ECCV, 2018. [385] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017. [386] Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A Efros, Oliver Wang, and Eli Shechtman. Toward multimodal image-to-image translation. In NeurIPS, 2017. [387] Peihao Zhu, Rameen Abdal, Yipeng Qin, and Peter Wonka. SEAN: Image synthesis with semantic region-adaptive normalization. In CVPR, 2020. [388] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable transformers for end-to-end object detection. In ICLR, 2020. [389] Mingchen Zhuge, Xiankai Lu, Yiyou Guo, Zhihua Cai, and Shuhan Chen. 
CubeNet: X-shape connection for camouflaged object detection. Pattern Recognition, 2022.

APPENDIX A
PUBLICATIONS

A list of all peer-reviewed publications during the MSU Ph.D. program, listed chronologically.

• Vishal Asnani, Xi Yin, Tal Hassner, Sijia Liu, and Xiaoming Liu. "Proactive image manipulation detection." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
• Vishal Asnani, Xi Yin, Tal Hassner, and Xiaoming Liu. "MaLP: Manipulation localization using a proactive scheme." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
• Vishal Asnani, Xi Yin, Tal Hassner, and Xiaoming Liu. "Reverse engineering of generative models: Inferring model hyperparameters from generated images." IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
• Vishal Asnani, Abhinav Kumar, Suya You, and Xiaoming Liu. "PrObeD: Proactive object detection wrapper." Advances in Neural Information Processing Systems 36, 2024.
• Vishal Asnani, John Collomosse, Tu Bui, Xiaoming Liu, and Shruti Agarwal. "ProMark: Proactive Diffusion Watermarking for Causal Attribution." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
• Vishal Asnani, John Collomosse, Xiaoming Liu, and Shruti Agarwal. "CustomMark: Customization of Diffusion Models for Proactive Attribution." In Review, 2025.
• Vishal Asnani, Xiaoming Liu, and Shruti Agarwal. "PiVoT: Proactive Video Templates for Enhancing Video Task Performance." In Review, 2025.

APPENDIX B
PROACTIVE IMAGE MANIPULATION DETECTION APPENDIX

B.1 Cross Encoder-Template Set Evaluation

Our framework encrypts a real image using a template from the template set. This encryption aids image manipulation detection if the image is later corrupted by any unseen GM. The framework is divided into two stages, namely image encryption and template recovery, and each stage works independently at inference. We therefore provide an ablation that studies performance with mismatched encoder and template set, i.e., we evaluate the recovering ability of an encoder using a template set trained with a different initialization seed. The results are shown in Tab. B.1. We observe that even when the template set and the encoder are initialized with different seeds, the performance of our framework does not vary much. This shows the stability of our framework even though the initialization seeds of the two stages differ during training.

B.2 Template Strength

We provide the ablation for the hyperparameter m used to control the strength of the added template in Sec. 4.3. We observe that the performance improves as the template strength increases. However, this comes at a trade-off with PSNR, which declines as the template strength increases. This is also illustrated in Fig. B.1, which shows images with different strengths of the added template. The images become noisier as the template strength is increased. This is not desirable, as there should not be much distortion in the encrypted real image due to our added template. Therefore, for our experiments, we select 30% as the strength of the added template.

B.3 Implementation Details

Image editing techniques. We use various image editing techniques in Sec. 4.2. All the techniques are applied after the addition of our template. We provide the implementation details for all these techniques below; an illustrative sketch follows at the end of this subsection.
1. Blur: We apply Gaussian blur to the image with 50% probability, using σ sampled from [0, 3].
2. JPEG: We JPEG-compress the image with 50% probability using the Python Imaging Library (PIL), with quality sampled from Uniform{30, 31, ..., 100}.
3. Blur + JPEG (p): The image is possibly blurred and JPEG-compressed, each with probability p.
4. Resizing: We perform the training using 50% of the images at 256 × 256 × 3 resolution and the rest at 128 × 128 × 3 resolution on the CelebA-HQ dataset.
5. Crop: We randomly crop the images with 50% probability, removing a number of pixels sampled from [0, 30] on each side. The images are then resized to 128 × 128 × 3 resolution.
6. Gaussian noise: We add Gaussian noise with zero mean and unit variance to the images with 50% probability.

Table B.1 Cross encoder-template set evaluation with different initialization seeds.

Initialization seed              Average precision (%) per test GM
Encoder    Template set          StarGAN    CycleGAN    GauGAN
1          1                     96.12      91.62       100
1          2                     94.65      91.15       100
1          3                     94.83      91.46       100
2          1                     95.48      91.56       100
2          2                     95.54      90.85       100
2          3                     95.84      91.06       100
3          1                     95.56      91.32       100
3          2                     95.62      91.42       100
3          3                     96.14      90.41       100

Figure B.1 Visualization of input images with different template strength. As the template strength is increased, the images become noisier.

Figure B.2 Network architecture for our (a) encoder and (b) classifier network for image manipulation detection.

Table B.2 List of GMs with their datasets and input image resolution used for evaluating our framework's generalizability. The GMs listed include STGAN, StarGAN, CycleGAN, GauGAN, UNIT, MUNIT, StarGAN2, BicycleGAN, CONT_Encoder, SEAN, ALAE, Pix2Pix, DualGAN, CouncilGAN, ESRGAN, and GANimation, together with training datasets such as CelebA, CelebA-HQ, Facades, Edges2Shoes, GTA2City, COCO, Sketch-Photo, and Paris Street-View.

Network architecture. Fig. B.2 shows the network architecture used in the different experiments for our framework's evaluation. For our framework, the encoder has 2 stem convolution layers and 10 convolution blocks to recover the added template from encrypted real images. Each block comprises convolution, batch normalization, and ReLU activation. In the ablation experiments for Table 8, we use a classification network with a similar number of layers as our encoder. This is done to show the importance of recovering templates with the encoder. This classification network has 8 convolution blocks followed by three fully connected layers, with ReLU activations between the layers. The network outputs 2-dimensional logits used for image manipulation detection.
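The template addition described in B.2 and the editing operations listed above can be summarized in a short sketch. The following is a minimal PyTorch/PIL illustration, assuming CHW float images in [0, 1], that the template is added as image + m · template with m = 0.3, and an illustrative Gaussian-blur kernel size of 7; the function names and defaults are ours, not the released implementation.

```python
import io
import random

import torch
import torchvision.transforms.functional as TF
from PIL import Image


def add_template(image: torch.Tensor, template: torch.Tensor, m: float = 0.3) -> torch.Tensor:
    # Encrypt a real image by adding the learned template scaled by the strength m
    # (the thesis selects a template strength of 30%).
    return torch.clamp(image + m * template, 0.0, 1.0)


def random_degradations(image: torch.Tensor) -> torch.Tensor:
    # Editing techniques of B.3, each applied with 50% probability.
    if random.random() < 0.5:                      # blur: sigma ~ U[0, 3]
        sigma = random.uniform(0.0, 3.0)
        if sigma > 0:
            image = TF.gaussian_blur(image, kernel_size=7, sigma=sigma)
    if random.random() < 0.5:                      # JPEG via PIL, quality ~ U{30, ..., 100}
        pil = TF.to_pil_image(image)
        buf = io.BytesIO()
        pil.save(buf, format="JPEG", quality=random.randint(30, 100))
        buf.seek(0)
        image = TF.to_tensor(Image.open(buf))
    if random.random() < 0.5:                      # crop up to 30 pixels per side, then resize
        _, h, w = image.shape
        top, bottom = random.randint(0, 30), random.randint(0, 30)
        left, right = random.randint(0, 30), random.randint(0, 30)
        image = image[:, top:h - bottom, left:w - right]
        image = TF.resize(image, [128, 128], antialias=True)
    if random.random() < 0.5:                      # Gaussian noise, zero mean, unit variance
        image = torch.clamp(image + torch.randn_like(image), 0.0, 1.0)
    return image
```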
B.4 List of GMs

We use a variety of GMs to test the generalization ability of our framework. These GMs have varied network architectures, and many of them are trained on different datasets. We summarize all the GMs in Tab. B.2. We also provide visualizations of the real image samples used in evaluating the performance on all these GMs in Fig. B.3 - Fig. B.18. We show the added and recovered templates in the "gist_rainbow" colormap for better visualization and indicate the cosine similarity of the recovered template with the added template. As shown in Fig. B.3 for training with STGAN, the encrypted real images have a higher cosine similarity than their manipulated counterparts. However, during testing, the difference between the two cosine similarities decreases, as shown in Fig. B.4 - Fig. B.18 for different GMs.

B.5 Dataset License Information

We use diverse datasets for our experiments, which include face and non-face datasets. For face datasets, we use existing datasets including CelebA [200] and CelebA-HQ [157]. The CelebA dataset contains images entirely from the internet and has no associated IRB approval. The authors mention that the dataset is available for non-commercial research purposes only, which we strictly adhere to. We only use the database internally for our work and primarily for evaluation. CelebA-HQ consists of images collected from the internet. Although there is no associated IRB approval, the authors assert in the dataset agreement that the dataset is only to be used for non-commercial research purposes, which we strictly adhere to.

We also use some non-face datasets for our experiments. The Facades [306] dataset was collected at the Center for Machine Perception and is provided under an Attribution-ShareAlike license. Edges2Shoes [361, 362] is a large shoe dataset consisting of images collected from https://www.zappos.com. The authors mention that this dataset is for academic, non-commercial use only. The GTA2City [262] dataset consists of a large number of densely labelled frames extracted from computer games. The authors mention that the data is for research and educational use only. The Sketch-Photo [332] dataset refers to the CUHK Face Sketch FERET database. The authors assert in the dataset agreement that the dataset is only to be used for non-commercial research purposes, which we strictly adhere to. The Paris street-view [232] dataset contains images collected using Google Street View and is to be used for non-commercial research purposes.

Figure B.3 Visualization of samples used for GM STGAN; (a) added template, (b) real images, (c) encrypted real images after adding a template, (d) manipulated images output by a GM, (e) recovered template from (c), and (f) recovered template from (d). Top left corner in last two columns shows the cosine similarity of the recovered template with the added template.

Figure B.4 Visualization of samples used for GM StarGAN; (a) added template, (b) real images, (c) encrypted real images after adding a template, (d) manipulated images output by a GM, (e) recovered template from (c), and (f) recovered template from (d). Top left corner in last two columns shows the cosine similarity of the recovered template with the added template.

Figure B.5 Visualization of samples used for GM CycleGAN; (a) added template, (b) real images, (c) encrypted real images after adding a template, (d) manipulated images output by a GM, (e) recovered template from (c), and (f) recovered template from (d).
Top left corner in last two columns shows the cosine similarity of the recovered template with the added template.

Figure B.6 Visualization of samples used for GM GauGAN; (a) added template, (b) real images, (c) encrypted real images after adding a template, (d) manipulated images output by a GM, (e) recovered template from (c), and (f) recovered template from (d). Top left corner in last two columns shows the cosine similarity of the recovered template with the added template.

Figure B.7 Visualization of samples used for GM UNIT; (a) added template, (b) real images, (c) encrypted real images after adding a template, (d) manipulated images output by a GM, (e) recovered template from (c), and (f) recovered template from (d). Top left corner in last two columns shows the cosine similarity of the recovered template with the added template.

Figure B.8 Visualization of samples used for GM MUNIT; (a) added template, (b) real images, (c) encrypted real images after adding a template, (d) manipulated images output by a GM, (e) recovered template from (c), and (f) recovered template from (d). Top left corner in last two columns shows the cosine similarity of the recovered template with the added template.

Figure B.9 Visualization of samples used for GM StarGANv2; (a) added template, (b) real images, (c) encrypted real images after adding a template, (d) manipulated images output by a GM, (e) recovered template from (c), and (f) recovered template from (d). Top left corner in last two columns shows the cosine similarity of the recovered template with the added template.

Figure B.10 Visualization of samples used for GM BicycleGAN; (a) added template, (b) real images, (c) encrypted real images after adding a template, (d) manipulated images output by a GM, (e) recovered template from (c), and (f) recovered template from (d). Top left corner in last two columns shows the cosine similarity of the recovered template with the added template.

Figure B.11 Visualization of samples used for GM CONT_Encoder; (a) added template, (b) real images, (c) encrypted real images after adding a template, (d) manipulated images output by a GM, (e) recovered template from (c), and (f) recovered template from (d). Top left corner in last two columns shows the cosine similarity of the recovered template with the added template.

Figure B.12 Visualization of samples used for GM SEAN; (a) added template, (b) real images, (c) encrypted real images after adding a template, (d) manipulated images output by a GM, (e) recovered template from (c), and (f) recovered template from (d). Top left corner in last two columns shows the cosine similarity of the recovered template with the added template.

Figure B.13 Visualization of samples used for GM ALAE; (a) added template, (b) real images, (c) encrypted real images after adding a template, (d) manipulated images output by a GM, (e) recovered template from (c), and (f) recovered template from (d). Top left corner in last two columns shows the cosine similarity of the recovered template with the added template.

Figure B.14 Visualization of samples used for GM Pix2Pix; (a) added template, (b) real images, (c) encrypted real images after adding a template, (d) manipulated images output by a GM, (e) recovered template from (c), and (f) recovered template from (d). Top left corner in last two columns shows the cosine similarity of the recovered template with the added template.
Figure B.15 Visualization of samples used for GM DualGAN; (a) added template, (b) real images, (c) encrypted real images after adding a template, (d) manipulated images output by a GM, (e) recovered template from (c), and (f) recovered template from (d). Top left corner in last two columns shows the cosine similarity of the recovered template with the added template.

Figure B.16 Visualization of samples used for GM CouncilGAN; (a) added template, (b) real images, (c) encrypted real images after adding a template, (d) manipulated images output by a GM, (e) recovered template from (c), and (f) recovered template from (d). Top left corner in last two columns shows the cosine similarity of the recovered template with the added template.

Figure B.17 Visualization of samples used for GM ESRGAN; (a) added template, (b) real images, (c) encrypted real images after adding a template, (d) manipulated images output by a GM, (e) recovered template from (c), and (f) recovered template from (d). Top left corner in last two columns shows the cosine similarity of the recovered template with the added template.

Figure B.18 Visualization of samples used for GM GANimation; (a) added template, (b) real images, (c) encrypted real images after adding a template, (d) manipulated images output by a GM, (e) recovered template from (c), and (f) recovered template from (d). Top left corner in last two columns shows the cosine similarity of the recovered template with the added template.

APPENDIX C
MALP APPENDIX

C.1 Implementation Details

Experimental Setup and Hyperparameters. We train MaLP for 150,000 iterations with a batch size of 4. For all of the networks, we use the Adam optimizer, except for the transformer, which uses AdamW with β1 = 0.9, β2 = 0.999, weight decay 0.5e-5, and eps 1e-8. The learning rate is 1e-5 for all networks. The constraint weights are set as: λ1 = 100, λ2 = 5, λ3 = 4, λ4 = 25, λ5 = 25, λ6 = 25, λ7 = 50, λ8 = 15, λ9 = 20, λ10 = 50. We use a template set size of 1 and a template strength of 30% unless mentioned otherwise. All experiments are conducted on one NVIDIA K80 GPU.

Network Architecture. We show the network architecture of the various components of MaLP in Fig. C.1. The shared network consists of 1 stem convolutional layer and 4 convolution blocks. Each convolution block consists of convolutional and batch normalization layers followed by ReLU activation. The output of the shared network is given to E_E and E_C, both having the same architecture with 3 convolution blocks and 1 stem convolutional layer. We use the transformer E_T in the second branch of the framework, where the ViT [79] architecture is adopted. The transformer consists of 6 encoder blocks, and a dropout of 0.1 is used. The features of the transformer are reshaped to the shape of the fakeness map, i.e., 1 × 128 × 128. Finally, we use a classifier C on the predicted fakeness maps to perform real vs. fake binary classification. The classifier has 8 convolution blocks, 1 stem convolutional layer, and 3 fully connected layers. We apply ReLU activation between the layers.

GMs and dataset license information. We use a variety of face and generic GMs to show the effectiveness of MaLP. The information for all the GMs, along with their training datasets, is shown in Tab. C.1. For many GMs used by [6], we use the test images released by [6]. For the remaining GMs, we will release the test images so that future works can make a fair comparison on the generalization benchmark.
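A minimal sketch of the optimizer configuration described under Experimental Setup and Hyperparameters above, assuming the MaLP components are standard torch.nn.Module objects; the argument names are placeholders for the networks of Fig. C.1, not the released code.

```python
import torch


def build_optimizers(shared_net, encoder_e, cnn_c, transformer_t, classifier, lr=1e-5):
    # Adam for the convolutional components of MaLP.
    conv_params = (
        list(shared_net.parameters()) + list(encoder_e.parameters())
        + list(cnn_c.parameters()) + list(classifier.parameters())
    )
    opt_conv = torch.optim.Adam(conv_params, lr=lr)
    # AdamW only for the transformer branch, with the hyperparameters stated above.
    opt_transformer = torch.optim.AdamW(
        transformer_t.parameters(),
        lr=lr, betas=(0.9, 0.999), weight_decay=0.5e-5, eps=1e-8,
    )
    return opt_conv, opt_transformer
```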
We also show more visualization samples of the predicted fakeness maps by MaLP in Fig. C.2 - Fig. C.5. All the fakeness maps are shown in the "pink" colormap for better representation. We also indicate the cosine similarity between the predicted and ground-truth fakeness maps. We observe that the fakeness maps for encrypted images have minimal bright regions. However, for fake images, MaLP is able to localize the modified regions well, considering that the modified attributes/GMs are unseen in training.

Figure C.1 Network architecture for different components of MaLP. (a) Shared network, (b) Encoder E_E and CNN network E_C, (c) Classifier C, (d) Transformer E_T, and (e) Transformer encoder block.

The face datasets include CelebA [200] and CelebA-HQ [157], both of which do not have any associated Institutional Review Board (IRB) approval. The authors of both datasets mention the availability of the dataset for non-commercial research purposes, which we strictly adhere to. For generic image datasets, we use the Facades [306], COCO [30], Horse2Zebra [385], Summer2Winter [385], GTA2CITY [262], Edges2Shoes [144], Paris street-view [240], and Sketch-Photo [332] datasets. All the mentioned generic image datasets can be used for non-commercial research purposes, as mentioned by the authors, and we use the datasets for the same purposes.

Image Editing Degradations. We apply several image editing degradations to the test set to verify the robustness of MaLP. The details of these operations are listed below:
1. JPEG compression: We compress the image with a compression quality of 50%.
2. Blur: We apply Gaussian blur with a filter size of 7 × 7.
3. Noise: We add Gaussian noise with zero mean and unit variance.
4. Low-resolution: We resize the image to half the original resolution and restore it back to the original resolution using linear interpolation.

Table C.1 List of GMs along with their training datasets.

Dataset                    GMs
CelebA [200]               STGAN [194], AttGAN [129], StarGAN [52], GANimation [248], CouncilGAN [232], ESRGAN [333], GDWCT [49]
CelebA-HQ [157]            SEAN [387], StarGAN-v2 [53], ALAE [245], DRGAN [304], ColorGAN [223]
Facades [306]              CycleGAN [385], BicycleGAN [386], Pix2Pix [144]
COCO [30]                  GauGAN [238]
Horse2Zebra [385]          AutoGAN [371]
Summer2Winter [385]        DRIT [177]
GTA2CITY [262]             UNIT [195]
Edges2Shoes [144]          MUNIT [139]
Paris Street-view [232]    Cont_Enc [240]
Sketch-Photo [332]         DualGAN [357]

Table C.2 Ablation for localization loss.

Loss              CS ↑     PSNR ↑   SSIM ↑
CS                0.9356   22.16    0.7114
CS + L2           0.9230   18.98    0.6614
CS + SSIM + L2    0.9211   19.12    0.6816
CS + SSIM + L1    0.8777   14.01    0.3712
CS + SSIM         0.9394   23.02    0.7312

Potential Societal Impact. The problem of manipulation localization is crucial from the perspective of media forensics. Localizing the fake regions not only helps in the detection of these fake media but, in the future, can also help recover the original image that the GM has manipulated. We also show that MaLP can be used as a discriminator to improve the quality of GMs. While this is an interesting application of MaLP, it is possible that a GM trained from scratch against our framework becomes more robust to it, decreasing the localization performance.

C.2 Additional Experiments

Localization Loss. We show the importance of the manipulation loss (defined in Eq. 8) in Sec. 4.6. We perform an ablation to formulate the loss on the fakeness maps of manipulated images. As shown in Tab.
C.2, we try experimenting with various loss functions, i.e., cosine similarity (CS), L1, L2, and structural similarity index measure (SSIM). Using just the CS loss results in better performance compared to combining it with L1 or L2 loss. We observe a huge deterioration in performance when using the L1 loss. This can be explained as PSNR and SSIM are directly related to the mean squared error, which is optimized by either an L2 or SSIM loss. Finally, adopting an SSIM loss together with the CS loss results in better performance, as both are more closely related to the metrics, making it easier for MaLP to converge; a short sketch of this combined objective is given after the transformer ablation below.

Table C.3 Comparison with [141] using multiple GMs in training. MaLP is able to outperform [141] by training on images manipulated by only STGAN.

Method              Training GMs                                                         Cosine similarity ↑
                                                                                         AttGAN   StarGAN   StyleGAN
Huang et al. [141]  STGAN + ICGAN + PGGAN + StyleGAN + StyleGAN2 + StarGAN + AttGAN      0.6940   0.8494    0.7479
MaLP                STGAN                                                                0.8557   0.8718    0.8255

Table C.4 Performance of MaLP across different attribute modifications seen in training.

Method    Cosine similarity ↑
          Bald     Bangs    Black Hair   Eyeglasses   Mustache   Smile
[141]     0.9014   0.9152   0.8850       0.9093       0.8817     0.8634
MaLP      0.9478   0.9470   0.9329       0.9549       0.9367     0.9489

Comparison with Baseline. Due to the limited GPU memory, we conduct proactive training with one GM only, because the GM needs to be loaded into memory and used on the fly. On the other hand, passive methods can be trained on multiple GMs because the image generation processes are conducted offline. As shown in Tab. C.3, [141] trains on images manipulated by 7 different GMs, unlike MaLP, which is trained on images manipulated by only 1 GM. We show the performance on three GMs, which are seen for [141] but unseen for MaLP. MaLP performs better even though these GMs' images are not seen in training. Therefore, even though the training of MaLP is limited to 1 GM, it achieves better generalization to other GMs, proving the effectiveness of proactive schemes.

Multiple Attribute Modifications. Instead of training on the bald attribute modification by STGAN, we train and test MaLP on multiple attribute modifications. These include bald, bangs, black hair, eyeglasses, mustache, and smile manipulation. We show the results in Tab. C.4. MaLP performs better for all the attribute modifications compared to the passive method [141]. We also observe an increase in cosine similarity compared to when MaLP is trained on only the bald attribute modification. This is expected, as the more types of modifications MaLP sees in training, the better it learns to localize.

Table C.5 Ablation study for transformer architecture.

Optimizer   Depth   Dropout   Cosine similarity ↑   Accuracy ↑
Adam        6       0.1       0.8839                0.9514
AdamW       1       0.0       0.8825                0.9647
AdamW       1       0.0       0.8826                0.9680
AdamW       3       0.0       0.8830                0.9705
AdamW       6       0.1       0.8848                0.9856

Transformer Architecture Ablation. We ablate various parameters of the transformer to select the best architecture for manipulation localization. We experiment with parameters that include the optimizer, the depth, i.e., the number of blocks, and the dropout. We only use the transformer branch and switch off the CNN branch during training. The results are shown in Tab. C.5. We observe that the localization performance is almost the same across these variants when using the transformer to predict fakeness maps. However, the choice has a significant impact on the detection accuracy. Having dropout does increase the performance for both detection and localization. Further, using the weighted Adam optimizer (AdamW) is more beneficial than using the vanilla Adam optimizer. Therefore, we adopt the transformer architecture with 6 blocks and optimize it with the weighted Adam optimizer. Finally, we also include dropout to achieve the best performance for localization and detection.
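A minimal sketch of the combined CS + SSIM objective for the predicted fakeness maps referenced in the Localization Loss paragraph and Tab. C.2, assuming map tensors of shape (B, 1, 128, 128) in [0, 1]; the relative weighting of the two terms and the choice of SSIM implementation are left open here, as the text does not specify them.

```python
import torch
import torch.nn.functional as F


def localization_loss(pred_map: torch.Tensor, gt_map: torch.Tensor, ssim_fn=None) -> torch.Tensor:
    # Cosine-similarity term between predicted and ground-truth fakeness maps.
    b = pred_map.shape[0]
    cs = F.cosine_similarity(pred_map.reshape(b, -1), gt_map.reshape(b, -1), dim=1)
    loss = (1.0 - cs).mean()                           # maximize cosine similarity
    # Optional SSIM term (the CS + SSIM variant of Tab. C.2); ssim_fn can be any
    # differentiable SSIM returning a similarity in [0, 1], e.g. pytorch_msssim.ssim.
    if ssim_fn is not None:
        loss = loss + (1.0 - ssim_fn(pred_map, gt_map))
    return loss
```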
Therefore, we adopt the architecture of the transformer with 6 blocks and optimize it with a weighted Adam optimizer. Finally, we also include the dropout to achieve the best performance for localization and detection. 205 Figure C.2 Visualization of fakeness maps for different attribute modifications by STGAN. (a) Real image, (b) encrypted image, (c) manipulated image, (d) ground-truth 𝑴𝐺𝑇 , (e) predicted fakeness map for encrypted images, and (f) predicted fakeness map for manipulated images. We also show the cosine similarity between the predicted and ground-truth fakeness map below (f). All face images come from SiWM-v2 data [115]. 206 Figure C.3 Visualization of fakeness maps for different attribute modifications by STGAN. (a) Real image, (b) encrypted image, (c) manipulated image, (d) ground-truth 𝑴𝐺𝑇 , (e) predicted fakeness map for encrypted images, and (f) predicted fakeness map for manipulated images. We also show the cosine similarity between the predicted and ground-truth fakeness map below (f). All face images come from SiWM-v2 data [115]. 207 Figure C.4 Visualization of fakeness maps for manipulation by DRIT. (a) Real image, (b) encrypted image, (c) manipulated image, (d) ground-truth 𝑴𝐺𝑇 , (e) predicted fakeness map for encrypted images, and (f) predicted fakeness map for manipulated images. We also show the cosine similarity between the predicted and ground-truth fakeness map below (f). 208 Figure C.5 Visualization of fakeness maps for manipulation by GauGAN. (a) Real image, (b) encrypted image, (c) manipulated image, (d) ground-truth 𝑴𝐺𝑇 , (e) predicted fakeness map for encrypted images, and (f) predicted fakeness map for manipulated images. We also show the cosine similarity between the predicted and ground-truth fakeness map below (f). 209 APPENDIX D PROBED APPENDIX D.1 Proof of Lemma 1 We begin our proof by considering the image 𝒊 as a column vector and the model as a linear regression model with learnable weights 𝒘𝑡. The subscript of time 𝑡 denotes that the weights change as one performs SGD updates. SGD Steps. We first consider the gradient of weight (𝒘𝑡). The linear model uses SGD for training, therefore, 𝒘𝑡 after 𝑡 gradient steps is given by: 𝒘𝑡 = 𝒘0 − 𝑡 ∑︁ 𝑖=0 𝑠𝑖 𝒈𝑡 = 𝒘0 − 𝑡 ∑︁ 𝑖=0 𝑠𝑖 𝜕L 𝜕𝒘𝑡 , (D.1) where, for linear regression model with image 𝒊, L = 𝑓 (𝒘𝑡 𝒊 − 𝑧) = 𝑓 (𝜂). To estimate the gradient 𝒘𝑡, we have, 𝒈𝑡 = = = 𝜕L (𝒘𝑡 𝒊 − 𝑧) 𝜕𝒘𝑡 𝜕L (𝒘𝑡 𝒊 − 𝑧) 𝜕 (𝒘𝑡 𝒊 − 𝑧) 𝜕L (𝜂) 𝜕𝜂 𝒊 𝒈𝑡 = 𝒊𝜐, 𝜕 (𝒘𝑡 𝒊 − 𝑧) 𝜕𝒘𝑡 (D.2) where 𝜐 = 𝜕L (𝜂) 𝜕𝜂 is the gradient of the loss function wrt noise. Optimal Weights. First, we will find the bound of the converged value 𝒘∞ and the optimal value 𝒘∗. If 𝜇𝑤 is mean of the learned weight, we have, E (cid:16) ∥𝒘∞ − 𝒘∗∥2 2 (cid:17) = E (cid:16) ∥𝒘∞ − 𝜇𝑤 + 𝜇𝑤 − 𝒘∗∥2 2 (cid:17) , = E((𝒘∞ − 𝜇𝑤)𝑇 (𝒘∞ − 𝜇𝑤)) + E((𝜇𝑤 − 𝒘∗)𝑇 (𝜇𝑤 − 𝒘∗)) + 2E((𝒘∞ − 𝜇𝑤)𝑇 (𝜇𝑤 − 𝒘∗)), = E((𝒘∞ − 𝜇𝑤)𝑇 (𝒘∞ − 𝜇𝑤)) + E((𝜇𝑤 − 𝒘∗)𝑇 (𝜇𝑤 − 𝒘∗)) (D.3) 210 Using E(𝒘∞ − 𝜇𝑤) = E(𝒘∞) − 𝜇𝑤 = 𝜇𝑤 − 𝜇𝑤 = 0, we have =⇒ E (cid:16) ∥𝒘∞ − 𝒘∗∥2 2 (cid:17) = 𝑉 𝑎𝑟 (𝒘∞) + E((𝜇𝑤 − 𝒘∗)𝑇 (𝜇𝑤 − 𝒘∗)) (D.4) where 𝑉 𝑎𝑟 (𝒘) = (cid:205) 𝑗 𝑤2 𝑗 . Gradient of Weight. Given the image vector 𝒊, and noise 𝜂 are statistically independent, the image and noise gradient 𝜐 defined in Eq. (D.2) are also statistically independent. We also assume that the distribution of image is normal Gaussian (E(𝒊) = 0). Therefore, the expectation of the gradient 𝒈𝑡 is given by, E( 𝒈𝑡) = E(𝒊)E(𝜐) = 0, Next, the variance of 𝒈𝑡 is given as 𝑉 𝑎𝑟 ( 𝒈𝑡) = 𝑉 𝑎𝑟 (𝒊𝜐) = E(𝒊𝑇 𝒊) [𝑉 𝑎𝑟 (𝜐) + E2(𝜐)] − E(𝒊)E(𝜐). 
(D.5) (D.6) We assume that image pixels are normally distributed. This is common since the networks do a mean subtraction before inputting to the network. Thus, E(𝒊) = 0. Hence, we have 𝑉 𝑎𝑟 ( 𝒈𝑡) = E(𝒊𝑇 𝒊)𝑉 𝑎𝑟 (𝜐). (D.7) Converged Weight. From Eq. (D.1), the expectation of the weight at time 𝑡 is, Therefore, for converged weight, E(𝒘𝑡) = E(𝒘0) + 𝑡 ∑︁ 𝑖=0 𝑠𝑖E( 𝒈 𝑗 ) = 0 (Using Eq. (D.5)) E(𝒘∞) = lim 𝑡→∞ E(𝒘𝑡), E(𝒘∞) = E(𝜇𝑤) = 0. 211 (D.8) (D.9) For variance, using Eq. (D.1) we have, 𝑉 𝑎𝑟 (𝒘𝑡) = 𝑉 𝑎𝑟 (𝒘0) + ( 𝑡 ∑︁ 𝑖 𝑗 )𝑉 𝑎𝑟 ( 𝒈𝑡). 𝑠2 Therefore, we have, 𝑉 𝑎𝑟 (𝒘∞) = lim 𝑡→∞ (𝑉 𝑎𝑟 (𝒘𝑡)) = 𝑉 𝑎𝑟 (𝒘0) + (cid:16) lim 𝑡→∞ 𝑡 ∑︁ 𝑖= (cid:17) 𝑠2 𝑗 𝑉 𝑎𝑟 ( 𝒈𝑡) 𝑉 𝑎𝑟 (𝒘∞) = 𝑉 𝑎𝑟 (𝒘0) + S′𝑉 𝑎𝑟 ( 𝒈𝑡). Substituting Eq. (D.7) in the above equation, we have 𝑉 𝑎𝑟 (𝒘∞) = 𝑉 𝑎𝑟 (𝒘0) + S′E(𝒊𝑇 𝒊)𝑉 𝑎𝑟 (𝜐), Going back to Eq. (D.4), and substituting Eq. (D.8) and Eq. (D.10), we have, E (cid:16) ∥𝒘∞ − 𝒘∗∥2 2 =⇒ E (cid:16) ∥𝒘∞ − 𝒘∗∥2 2 (cid:17) (cid:17) = 𝑉 𝑎𝑟 (𝒘0) + S′E(𝒊𝑇 𝒊)𝑉 𝑎𝑟 (𝜐) + E(||𝒘∗||2) = 𝑐 + S𝑉 𝑎𝑟 (𝜐) where 𝑐 is independent of loss function L and S = S′E(𝒊𝑇 𝒊) is also another constant. (D.10) (D.11) (D.12) Lemma 1. We assume that the regression error term 𝑒 = 𝒘𝑇 𝒊 − ˆ𝑦, is drawn from zero mean Gaussian with variance 𝜎2 as in [128]. So, 𝑉 𝑎𝑟 ( ˆ𝑒) = 𝑉 𝑎𝑟 (𝒘𝑇 𝒊 − ˆ𝑦) = 𝜎2. (D.13) For a passive detector with converged weights 𝒘∞, we have, E (cid:16) ∥𝒘∞ − 𝒘∗∥2 2 =⇒ E (cid:16) ∥𝒘∞ − 𝒘∗∥2 2 (cid:17) (cid:17) = 𝑐 + S𝑉 𝑎𝑟 (𝜐) = 𝑐 + S𝑉 𝑎𝑟 (𝑒) = 𝑐 + S𝜎2 Similarly, for a proactive detector with converged weights 𝒘 ′ ∞, we have E (cid:18)(cid:13) (cid:13) (cid:13)𝒘 ′ ∞ − 𝒘∗ (cid:19) (cid:13) 2 (cid:13) (cid:13) 2 = 𝑐 + S𝑉 𝑎𝑟 (𝜐′ ) (D.14) (D.15) 212 Assume that a proactive detector multiplies the input image vector 𝒊 with a scalar template 𝑠. From Eq. (D.12), we write the loss term as, ′ L = 𝑠𝒘𝑇 𝒊 − ˆ𝑦 (cid:17) 2 (cid:16) 1 2 =⇒ 𝜕L′ 𝜕𝒘 = (𝑠𝒘𝑇 𝒊 − ˆ𝑦)𝑠𝒊 (D.16) Taking the variance, 𝑉 𝑎𝑟 (𝜐′ ) = 𝑉 𝑎𝑟 (cid:19) (cid:18) 𝜕L′ 𝜕𝒘 = 𝑉 𝑎𝑟 ((𝑠𝒘𝑇 𝒊 − ˆ𝑦)𝑠𝒊) = 𝑉 𝑎𝑟 (𝑠( ˆ𝑦 + 𝑒) − ˆ𝑦)𝑠2𝑉 𝑎𝑟 (𝒊) , assuming E(𝒊) = 0 = 𝑉 𝑎𝑟 (𝑠𝑒 + (𝑠 − 1) ˆ𝑦)𝑠2𝑉 𝑎𝑟 (𝒊) = (𝑉 𝑎𝑟 (𝑠𝑒) + 𝑉 𝑎𝑟 ((𝑠 − 1) ˆ𝑦))𝑠2𝑉 𝑎𝑟 (𝒊) = 𝑠2𝑉 𝑎𝑟 (𝑒)𝑠2𝑉 𝑎𝑟 (𝒊) , assuming 𝑉 𝑎𝑟 ( ˆ𝑦) = 0 ≤ 𝑠2𝑉 𝑎𝑟 (𝑒)𝑠2 , assuming 𝑉 𝑎𝑟 (𝒊) ≤ 0.5× (−1)2+0.5 × 12 = 1 (D.17) =⇒ 𝑉 𝑎𝑟 (𝜐′ ) ≤ 𝑠4𝜎2 If the magnitude of the scalar template is bounded by 1 i.e., 𝑠2 < 1, we have 𝑉 𝑎𝑟 (𝜐′ ) < 𝜎2. (D.18) (D.19) The above shows that the gradients in the proactive model has less noise than the passive model (a key for better convergence). Substituting above in Eq. (D.15), we have (cid:18)(cid:13) (cid:13) (cid:13)𝒘 = 𝑐 + S𝑉 𝑎𝑟 (𝜐′ ∞ − 𝒘∗ E (cid:19) ) ′ (cid:13) 2 (cid:13) (cid:13) 2 < 𝑐 + S𝜎2 < 𝑐 + S𝑉 𝑎𝑟 (𝜐) =⇒ E (cid:18)(cid:13) (cid:13) (cid:13)𝒘 ′ ∞ − 𝒘∗ (cid:19) (cid:13) 2 (cid:13) (cid:13) 2 < E (cid:16) ∥𝒘∞ − 𝒘∗∥2 2 (cid:17) . (D.20) The last inequality follows trivially from Eq. (D.14). 213 D.2 Proof of Theorem 1 From Lemma 1, we have, E (cid:18)(cid:13) (cid:13) (cid:13)𝒘 ′ ∞ − 𝒘∗ (cid:13) (cid:13) (cid:13) (cid:19) 2 2 < E (cid:16) ∥𝒘∞ − 𝒘∗∥2 2 (cid:17) =⇒ 𝑉 𝑎𝑟 (𝒘‘ ∞) < 𝑉 𝑎𝑟 (𝒘∞) =⇒ E(|𝒘‘𝑇 ∞ 𝒊 − 𝑦|) < E(|𝒘𝑇 ∞𝒊 − 𝑦|) =⇒ E( ˆ𝑦‘ − 𝑦) < E( ˆ𝑦 − 𝑦) Since the proactive detector has a better bounding box prediction, =⇒ E(𝐼𝑜𝑈 ′ 2𝐷) > E(𝐼𝑜𝑈2𝐷) Since 𝐴𝑃 is a non-decreasing function of 𝐼𝑜𝑈2𝐷, we have, 𝐴𝑃‘ ≥ 𝐴𝑃. (D.21) (D.22) (D.23) An important point to note is that the non-decreasing nature does not keep the inequality strict. In other words, we agree that the final AP from passive and pro-active schemes could be equal. 
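For readability, the chain of inequalities established above can be restated compactly (our transcription of the argument, using the same symbols: 𝑠 is the scalar template, 𝜎² the variance of the regression error, 𝜐 and 𝜐′ the loss-gradient terms of the passive and proactive detectors, and 𝑐, S constants independent of the loss):

```latex
\begin{align}
\mathrm{Var}(\upsilon') &\leq s^{4}\sigma^{2} < \sigma^{2} = \mathrm{Var}(\upsilon),
  \qquad \text{for } |s| < 1,\\
\mathbb{E}\!\left(\lVert \boldsymbol{w}'_{\infty} - \boldsymbol{w}^{*}\rVert_{2}^{2}\right)
  &= c + \mathcal{S}\,\mathrm{Var}(\upsilon')
   < c + \mathcal{S}\,\mathrm{Var}(\upsilon)
   = \mathbb{E}\!\left(\lVert \boldsymbol{w}_{\infty} - \boldsymbol{w}^{*}\rVert_{2}^{2}\right),\\
\mathbb{E}(\mathrm{IoU}'_{2D}) &> \mathbb{E}(\mathrm{IoU}_{2D})
  \;\Longrightarrow\; AP' \geq AP.
\end{align}
```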
However, our experience says that IoU improvements, especially close to 1, lead to significant AP improvements. Current SoTA detectors already achieve decent IoU; hence, even a slight improvement in IoU improves the AP score. 214 D.3 Implementation Details We now include more details of our method here. Network Architecture. The network architecture of encoder E and decoder D network used for PrObeD is shown in Fig. D.1. Both networks consist of 2 stem convolution layers and 13 blocks, each block containing convolutional, batch normalization, and ReLU activation layers. The images are given as input to the encoder network to output the template, which is multiplied by the input images to make them encrypted. The encrypted images are then passed to the decoder network to recover the template. Finally, we input encrypted images to different object detectors to perform detection. Dataset license information. We use benchmark datasets for GOD and COD. The authors for MS-COCO [190] dataset specify that the annotations in this dataset, along with this website, belong to the COCO Consortium and are licensed under a Creative Commons Attribution 4.0 License. The COD10K dataset is available for non-commercial purposes only [81]. The CAMO data is published under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License [176]. Finally, the NC4K dataset is available to use for non-commercial purposes. Experimental Setup and Hyperparameters. PrObeD is trained in an end-to-end manner for all the object detectors, with training iterations similar to the pretrained object detector. For both encoder and decoder networks, we use Adam optimizer with a learning rate of 1𝑒−5. We use different weights of [𝜆𝑂𝐵𝐽, 𝜆𝐸 , 𝜆𝐷] for different object detectors. We use [7,10,10] for Faster- RCNN, [50, 1.25, 4.25] for YOLOv5, [50, 7.5, 7.5] for DeTR and [10, 0.1, 0.1] for DGNet. All experiments are conducted on one NVIDIA A100 GPU. D.4 Additional Experiments Train COD detector DGNet more. Similar to the GOD detector, we train the COD detector DGNet for more iterations, similar to after applying PrObeD. The results are shown in Tab. D.1. We see a similar behavior as seen in GOD detectors; the performance improves after training for more iterations, but only up to a certain extent. PrObeD is able to improve performance by a larger 215 Figure D.1 Architecture for encoder and decoder network. margin, showing the effectiveness of the proactive schemes. COD loss. Our loss design is inspired by the prior proactive works [7, 6], which estimate the learnable template by applying a cosine similarity loss. The authors experiment with various loss types, showing the effectiveness of the cosine similarity loss design. However, COD is analogous to the segmentation task, which generally adopts a loss design of cross-entropy loss with dice loss, which might be beneficial for COD. We perform an ablation by applying cross-entropy loss with dice loss for COD. The results are shown in Tab. D.2. We see that our proactive wrapper is not benefiting by removing the cosine similarity loss, proving the study of the prior proactive works. 216 Figure D.2 Error analysis for (a) Faster-RCNN, (b) YOLOv5, and (c) DeTR. PrObeD is able to improve the number of correct predictions and reduce most errors. Error analysis. Following [23], there can be a number of errors that deteriorate the performance of the object detector. These are: 1. Classification error (Cls): Localized correctly but classified incorrectly. 
217 Table D.1 Ablation of training iterations on DGNet for more iterations similar to after applying PrObeD. Method Iter E𝑚 ↑ S𝑚 ↑ wF𝛽 ↑ MAE↓ E𝑚 ↑ S𝑚 ↑ wF𝛽 ↑ MAE↓ E𝑚 ↑ S𝑚 ↑ wF𝛽 ↑ MAE↓ COD10K CAMO NC4K DGNet[149] 1× 0.859 0.791 0.681 0.079 0.833 0.776 0.603 0.046 0.876 0.815 0.710 0.059 DGNet[149] 2× 0.861 0.791 0.682 0.080 0.832 0.778 0.606 0.045 0.875 0.814 0.711 0.059 + PrObeD 2× 0.871 0.797 0.702 0.071 0.869 0.803 0.661 0.037 0.900 0.838 0.755 0.049 Table D.2 Ablation of dice loss with cross-entropy (CE) loss vs. cosine similarity. CAMO Method COD10K E𝑚 ↑ S𝑚 ↑ wF𝛽 ↑ MAE↓ E𝑚 ↑ S𝑚 ↑ wF𝛽 ↑ MAE↓ E𝑚 ↑ S𝑚 ↑ wF𝛽 ↑ MAE↓ 0.831 0.782 0.688 0.084 0.810 0.795 0.646 0.045 0.874 0.817 0.721 0.060 Dice + CE loss Cosine similarity 0.871 0.797 0.702 0.071 0.869 0.803 0.661 0.037 0.900 0.838 0.755 0.049 NC4K 2. Localization error (Loc): Classified correctly but localized incorrectly. 3. Both Classification and Localization error (Cls & Loc): Classified and localized incorrectly. 4. Duplicate detection error (Duplicate): Would be correct if not for a higher scoring detection. 5. Background error (Background): Detected background as foreground. 6. Missed target error (Missed): All undetected targets i.e.false negatives, which are not already covered by classification or localization errors. Fig. D.2 shows the error analysis for three object detectors, namely, Faster-RCNN, YOLOv5, and DeTR. PrObeD improves the number of correct predictions of all three detectors, especially for Faster-RCNN, where the number of correct predictions increases by around 17%. For DeTR and YOLOv5, the improvement is less, which is evident from the less increase in correct predictions. The major improvement for all three detectors comes from classification and localization-related errors. All these errors decrease after PrObeD is applied to all the detectors. Further, Faster- RCNN, being an old detector, makes a lot of background errors, which are reduced by a significant margin after applying PrObeD. The gain is not much for DeTR and YOLOv5, which tend to make fewer background errors. Finally, one-stage detectors suffer mostly from the problem of duplicate detection, which is remedied by the PrObeD. 218 D.5 Potential Negative Societal Impact PrObeD utilizes a proactive scheme to benefit object detection. Our approach can be considered a benign adversarial attack on object detectors. However, with a change in the objective function, PrObeD could also be used as an adversarial attack to deteriorate the performance of different object detectors. This might pose a threat to object detectors, whether used for GOD or COD, and some forms of adversarial training might be required to prevent the threat of adversarial attacks. 219 APPENDIX E PROMARK APPENDIX E.1 Multiple Watermark Configurations We investigate the application of dual watermarks, each positioned on opposing sides of the image. This exploration raises a pivotal query: “Is the spatial positioning of watermarks critical to the performance?" To answer this, we ablate four distinct watermark configurations. As shown in Tab. E.1, there is a consistent performance across all watermark placements (left, right, top, bottom), thereby substantiating the spatial robustness of PrObeD in watermark positioning. E.2 Watermark Robustness We test our method against 14 different degradations (blur, various noises, fog, etc.), by adopting the evaluation protocol detailed in the RoSteALS [27]. 
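A minimal sketch of such a degradation sweep is shown below; it covers only an illustrative subset of the 14 operations, and the `secret_decoder`, `watermarked_images`, and `concept_labels` names are placeholders rather than parts of the actual ProMark evaluation code.

```python
import io
import numpy as np
from PIL import Image, ImageFilter

def jpeg_compress(img: Image.Image, quality: int = 50) -> Image.Image:
    """Round-trip the image through JPEG to simulate compression artifacts."""
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

def gaussian_noise(img: Image.Image, sigma: float = 0.05) -> Image.Image:
    """Add zero-mean Gaussian noise in [0, 1] intensity space."""
    arr = np.asarray(img).astype(np.float32) / 255.0
    noisy = np.clip(arr + np.random.normal(0.0, sigma, arr.shape), 0.0, 1.0)
    return Image.fromarray((noisy * 255).astype(np.uint8))

# Illustrative subset; the full list of 14 operations follows the RoSteALS protocol.
DEGRADATIONS = {
    "jpeg50": jpeg_compress,
    "blur": lambda im: im.filter(ImageFilter.GaussianBlur(radius=2)),
    "noise": gaussian_noise,
}

def attribution_accuracy(decoder, images, labels):
    """Placeholder: run a watermark decoder and compare predictions to labels."""
    preds = [decoder(im) for im in images]
    return float(np.mean([p == y for p, y in zip(preds, labels)]))

# for name, op in DEGRADATIONS.items():
#     degraded = [op(im) for im in watermarked_images]
#     print(name, attribution_accuracy(secret_decoder, degraded, concept_labels))
```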
We use 50 watermarked training images from LSUN dataset and use unconditional LDM with a strength of 30%. The average attribution accuracy for training and generated images across all 14 attacks is 90.21±7.63% and 89.51±8.18%, as compared to 95.12% without any degradation, showing the robustness of our approach to multiple forms of watermark attack. E.3 Possibility of Concept Leakage We present multiple results where we attribute the images generated using non-watermarked data, for example via random latent code and conditional generation. We detect no retention of the watermark after noising or in random latent codes, with watermark detection accuracy of 50.56% (chance 50%) after noising for ≥ 900 timestamps or in random latent codes. The LDM generates an image from noise through inversion, and the watermark is added during this GenAI model inference process. Our decoder is employed independently to identify the concept. To prove this, we evaluate our model in Table 4 (main paper) for two more baselines, using held-out images (1) with no watermark encryption, and (2) encrypted with a different concept’s watermark. ProMark is able to attain an attribution accuracy of 94.32% and 94.01% respectively when evaluated with ground-truth concept watermark for both baselines compared to 95.60% reported for watermarked held-out data. Therefore, when inverting generating images that encrypt no watermark, or encrypt 220 Table E.1 Multi-concept attribution performance across different configurations. Configuration Attribution Accuracy (%) ↑ Secret 1 Left Right Top Bottom Secret 2 Right Left Bottom Top Secret 1 95.61 95.52 95.66 95.02 Secret 2 Combined 93.31 93.35 93.70 93.46 90.12 90.19 90.01 90.73 incorrect watermark, the correct concept watermark is encrypted. E.4 Computational Efficiency We demonstrate the computation efficiency of ProMark during inference (running watermark decoder to perform causal attribution), which costs 5.6ms on one A100 GPU. Training with watermarked data adds negligible cost to generative model training. This is comparable to running inference on CLIP, or ALADIN to perform correlation based attribution (28.32 ms) but the additional cost of the embedding search is 87.91 ms for a dataset of 20K LSUN training images. ProMark therefore offers the advantage of both efficiency and causality for training data attribution. We will add this to the paper. E.5 Additional Watermark Strength Analysis Our research introduces a new paradigm in concept attribution for images classified under multiple concepts. We show the analysis of PSNR variation with watermark strength for the case of multi-concept attribution. The results are shown in Fig. E.1. Our findings indicate that, compared to single watermark cases, the PSNR for multi-concept images is marginally higher at equivalent watermark strengths. However, as expected, an increase in watermark strength generally leads to a decrease in PSNR. Furthermore, we have visualized images from different datasets to showcase the extent of degradation caused by varying watermark strengths. As discussed in Sec. 4.5, the performance of our method improves with increased watermark strength. Nevertheless, this increase in strength leads to a decline in image quality, evidenced by the emergence of bubble-like artifacts in the images, as shown in Fig. E.2 (the watermark strength ranges from 0.1 to 1.0). 221 Figure E.1 PSNR vs.watermark strength for single vs multi-concept attribution. 
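The PSNR-versus-strength trade-off discussed above can be reproduced with a simple sweep like the one below; the additive `embed_watermark` function and the random template are stand-ins for the learned watermark embedding, included only to show how the curve in Fig. E.1 could be computed.

```python
import numpy as np

def psnr(clean: np.ndarray, watermarked: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio between a clean and a watermarked image."""
    mse = np.mean((clean.astype(np.float64) - watermarked.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def embed_watermark(image: np.ndarray, template: np.ndarray, strength: float) -> np.ndarray:
    # Hypothetical additive embedding; the actual embedding is learned.
    return np.clip(image + strength * template, 0, 255)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(256, 256, 3)).astype(np.float64)
template = rng.normal(0.0, 8.0, size=image.shape)  # stand-in spatial watermark

for strength in np.arange(0.1, 1.01, 0.1):  # strengths 0.1 ... 1.0, as in Fig. E.2
    wm = embed_watermark(image, template, strength)
    print(f"strength={strength:.1f}  PSNR={psnr(image, wm):.2f} dB")
```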
Figure E.2 Noise Strength visualization for different watermark strength. 222 Spatial Domain Fourier Domain Correlation Matrix Figure E.3 Watermark Visualization: Spatial domain, Fourier domain and inter-watermark cosine similarity for 100 watermarks. E.6 Watermark Discussion We visualize some sample watermarks in both, spatial and frequency domain in Fig. E.3. These watermarks are converted from bit-sequences to spatial domain as described in Sec. 3.4. Visually, the watermarks appear indistinguishable from one another in both domains. Yet, their orthogonality is clearly demonstrated through the cosine similarity matrix, which we used to analyze 100 different watermarks. This matrix reveals that the inter-watermark cosine similarity is consistently close to zero, decisively indicating the orthogonal nature of these watermarks. E.7 Implementation Details We train PrObeD with LDM for 15𝐾 iterations with a batch size of 32, using 8 NVIDIA A100 GPUs for each experiment. We use the default parameters for optimizers as used in the official repository of [264]. The learning rate is set at 3.2𝑒−5 for training LDM. We further show the architecture for the generic decoder used for comparing against pretrained secret decoder shown in Fig. E.4. The generic decoder consists of 2 stem convolution layers and 10 convolution blocks. Each block consists of convolutional and batch normalization layers followed by ReLU activation. E.8 More Sampled Images We use multiple datasets for evaluating PrObeD. We sample images from the trained LDM for every class. We show some of the train and sampled images for the corresponding classes 223 Figure E.4 Generic decoder architecture. for different datasets in Figs. E.5 to E.8. We argue that PrObeD is able to perform attribution to different types of concepts, i.e.image templates (Fig. E.5), image style (Fig. E.8), style and content (Fig. E.6), and ownership (Fig. E.7). Therefore, proactive based causal methods perform attribution not only on the style or motif of the image as done by correlation based works, but also performs attribution to a variety of concepts proving it’s generalizability. 224 Figure E.5 Training and sampled images for stock dataset. 225 Figure E.6 Training and sampled images for BAM dataset. 226 Figure E.7 Training and sampled images for wiki-a dataset. 227 Figure E.8 Training and sampled images for wiki-s dataset. 228 APPENDIX F CUSTOMMARK APPENDIX F.1 Additional Experiments Components Ablation. Tab. F.1 presents a comprehensive ablation study to analyze the contribution of individual components in CustomMark to its overall performance. The complete implementation of CustomMark achieves the highest performance across all metrics, with bit accuracy at 96.10%, attribution accuracy at 91.83%, clip score at 0.80, and csd score at 0.77. These results highlight the framework’s ability to maintain robust attribution while preserving image quality. The performance drop observed when specific components are removed demonstrates the critical role each plays in the model’s functionality. The removal of the concept encoder results in a significant drop in performance, with bit accuracy and attribution accuracy reduced to 81.21% and 65.19%, respectively. This highlights the encoder’s essential role in embedding bi secret information effectively. Similarly, disabling the mapper reduces bit accuracy to 93.10% and attribution accuracy to 87.11%, indicating its importance in maintaining precise attribution. 
The absence of attention finetuning from LDM moderately impacts the bit accuracy and attribution accuracy. However, qualitative performance is greatly reduced with csd score falling to 0.65, showcasing its role in style matching of clean and watermarked generated images during training. The removal of regularization loss leads to minor performance degradation for attribution, but it impacts the qualitative metrics like the csd score, which drops to 0.71, demonstrating its role in ensuring consistency during watermark embedding, even tough it’s only for initial iterations. Notably, the exclusion of style loss has the most detrimental effect on attribution accuracy, which falls dramatically to 40.16%, emphasizing its importance in preserving stylistic fidelity during the watermarking process. These results collectively validate the carefully designed architecture of CustomMark, where each component contributes significantly in achieving both robust attribution and high-quality image generation. Sequential Learning Analysis. Fig. F.1 demonstrates the performance of individual concepts 229 Table F.1 Ablation study of various components of CustomMark for 10 concepts in training. [KEYS: att.:attention, reg. Regularization]. Changed CustomMark − Concept Encoder − Mapper − Att. Finetune − Reg. Loss − Style Loss Bit Attribution Acc. (%)↑ Acc. (%) ↑ 96.10 81.21 93.10 95.16 95.31 75.10 91.83 65.19 87.11 90.88 90.12 40.16 CLIP Score ↑ 0.80 0.65 0.79 0.71 0.77 0.66 CSD Score ↑ 0.77 0.61 0.78 0.65 0.71 0.62 Figure F.1 Performance variation of individual concepts during sequential learning. during sequential learning with CustomMark, evaluated through CSD score deviation and attribution accuracy as new concepts are added. The graphs illustrate how CustomMark maintains robust performance while adapting to an increasing number of concepts, showcasing its scalability and efficiency. In the CSD score deviation plot( Fig. F.1(a)), the deviation remains minimal across most concepts, even as the number of concepts increases from 3 to 10. For instance, Hopper and Raphael exhibit only slight increases in deviation (+0.08 and +0.10, respectively) when additional concepts are introduced. This indicates that CustomMark effectively preserves stylistic fidelity for previously learned concepts while integrating new ones. Further, the CSD score before and after attribution remains almost similar. It decreases a little bit in start when the concept is introduced, but it gradually recovers to the original score. Notably, the deviation remains consistently low for 230 Figure F.2 Generated clean (left) and watermarked (right) images pairs for artists as concepts sampled using big and complicated prompts. concepts like Picasso and Monet, further validating the robustness of the model. The attribution accuracy plot ( Fig. F.1(b)) highlights CustomMark’s strong adaptability, with consistent attribution for new concepts added to training while maintaining high performance for earlier ones. This demonstrates that CustomMark’s sequential learning approach effectively balances the retention of previously learned attributions with the incorporation of new ones, keeping in mind that CustomMark requires only about 10% additional training iterations per concept. These 231 Figure F.3 Analysis of original and perturbed tokens by (a) t-SNE plot, (b) norm distribution, and (c) distribution of cosine similarity between the two sets of embeddings. 
results underline the practical viability of CustomMark in dynamic, real-world scenarios where the set of concepts evolves over time. Complex Prompts. Fig. F.2 demonstrates the effectiveness of using complex and detailed prompts to generate images that accurately match the artistic styles of renowned painters. Each pair of images—one clean and one watermarked—illustrates that even though a long and complex prompt, CustomMark was able to insert the corresponding watermark onto the generated images as long as the concept token was perturbed. Despite the complexity of the prompts, the generated images successfully capture the signature style of artists such as Dali, Monet, Van Gogh, Picasso, and Warhol. The results showcase precise interpretations of surreal, impressionistic, cubist, and other artistic movements, reinforcing the ability of GenAI model to replicate stylistic nuances into the watermarked images. Analysis of Token Embedding. Fig. F.3 illustrates the analysis of original and perturbed tokens through t-SNE plots, norm distributions, and cosine similarity distributions. In the t-SNE plot ( Fig. F.3(a)), the original tokens (red) and perturbed tokens (blue) demonstrate a clear separation, signifying that the perturbed tokens effectively diverge from their original counterparts. This divergence is critical for embedding unique watermarks and facilitating robust attribution. The norm distributions ( Fig. F.3(b)) show that original tokens are centered very close to the norm 0 and exhibit a narrower range of vector norms, while perturbed tokens ahve high norms close to 100, and display a wider spread. This indicates that perturbations introduce divergence of the norm as compared to the original tokens and promotes controlled variability to the token space, contributing 232 Figure F.4 Comparison with ProMark on WikiArt dataset. to their distinctiveness. The cosine similarity distribution ( Fig. F.3(c)) reveals that the similarity between original and perturbed tokens clusters around zero, highlighting that the perturbations 233 Figure F.5 Generated clean and watermarked images for artists as concepts sample by model trained for attributing 200 artists. maintain minimal overlap with the original token directions—a necessary condition for ensuring effective attribution. In our proposed approach, we apply the regularization loss during the initial iterations of 234 training. The regularization ensures that the perturbed tokens start with a meaningful deviation from the original tokens, setting a strong foundation for subsequent learning. To analyze this further, we don’t switch off the regularization loss. We observe that continuing the regularization loss throughout the training process leads to the original and perturbed tokens becoming overly similar, undermining the ability to embed distinguishable watermarks and impairing attribution accuracy. With this approach, the model achieves a secret accuracy of 56.14% and an attribution accuracy of 1.54%, Therefore, we strategically switch off the regularization loss after the initial 200 iterations to allow the perturbed tokens to diverge as they want. This maintains the separation between original and perturbed tokens, ensuring that the model can generate robust watermarks while preserving the quality of attribution. F.2 More Watermarked Samples Fig. 
F.4 provides a comparative analysis between clean images, ProMark [4], and CustomMark on the WikiArt dataset, showcasing their performance in attribution while preserving artistic styles across a range of renowned artists from WikiArt dataset. CustomMark demonstrates superior style adaptation compared to ProMark, consistently maintaining the unique stylistic elements and visual fidelity of the original artworks. For artists such as Degas, Picasso, and Van Gogh, CustomMark effectively replicates the signature brushstrokes, color palettes, and composition techniques, resulting in outputs that remain faithful to their distinctive styles. In contrast, ProMark introduces noticeable bubble like artifacts and style distortions that detract from the visual coherence of the images. Similarly, for detailed and intricate works by artists like Sargent and Dore, CustomMark preserves the depth and intricacy, while ProMark struggles with fidelity, leading to degradation in fine details. Fig. F.5 illustrates examples of clean and watermarked images for artists used as concepts, sampled from a model trained on 200 artists. Unlike Fig. F.4, which focused on the WikiArt dataset and showcased the performance of CustomMark for 23 artists, this figure demonstrates the scalability of the method when extended to a much larger and diverse set of artistic concepts. Across a wide range of styles, from Bosch and Klimt’s classic depictions to Koons and Haring’s contemporary designs, the watermarked images retain the stylistic essence of the clean images 235 while embedding imperceptible watermarks. Notably, the approach performs consistently well across different styles, capturing subtle details in works by artists such as Dürer, Toulouse, and Vermeer without introducing artifacts. This comparison highlights CustomMark’s ability to adapt seamlessly to various artistic styles, ensuring high-quality outputs that respect the original artistic intent, even when dealing with hundreds of distinct artistic styles. Its flexibility and fidelity make it a reliable solution for scenarios requiring robust watermarking without compromising on artistic integrity. F.3 Limitations While CustomMark offers an efficient solution for concept attribution, it has some limitations. First, it relies on the explicit mention of concepts in prompts, making attribution challenging when an artist’s style is indirectly referenced or subtly embedded in the generated image. CustomMark finds it challenging to embed large bit sequences due to the mapper network being too simple for mapping bit sequence to noise perturbation. A sophisticated mapper network might address this issue. Additionally, CustomMark has not been tested on multi-concept scenarios, such as prompts combining multiple artists or blending diverse styles, leaving its robustness in such cases unexplored. Another limitation of CustomMark is its reliance on generated data for training. If the original GenAI model fails to adequately capture an artist’s unique style or nuances, the improved model with attribution capabilities may struggle to accurately reflect or attribute that style in the generated images. These limitations highlight areas for future improvement to enhance the system’s versatility and robustness. F.4 Potential Social Impact The potential social impact of CustomMark lies in its ability to foster a collaborative and transparent relationship between AI model developers and the artists. 
By introducing attribution capabilities, this algorithm empowers artists to gain recognition for the influence of their styles on AI-generated content, promoting a sense of agency and fairness. Unlike adversarial strategies that often pit creators against AI systems, CustomMark provides a constructive mechanism to bridge this divide, offering a signal for transparency without compromising creativity. By focusing on 236 attribution and transparency, CustomMark aims to support a harmonious integration of AI into the creative landscape, minimizing potential societal harm and building trust between artists and AI systems. F.5 Implementation Details Artist Lists. The list in Tab. F.2 presents a comprehensive compilation of 200 artists, which serves as the foundation for our attribution experiments. For experiments requiring a specific number of artists (top-𝑘), we systematically select the top-𝑘 artists based on their numerical ranking in the table. This approach ensures consistency and reproducibility across various experimental setups. An ablation study is conducted by varying 𝑘 as discussed in the main paper, with artists chosen accordingly. The scalability and robustness of the attribution methodology are assessed under a range of configurations, from smaller subsets of artists to the full set of 200 artists. Furthermore, we extend our evaluation beyond 200 artists by leveraging 1, 000 classes from ImageNet as additional concepts, demonstrating the scalability and adaptability of our approach. Table F.2 Comprehensive List of 200 Artists. Artist 1 Artist 2 Artist 3 Artist 4 Artist 5 1. Claude Monet 2. Pablo Picasso 3. Vincent van Gogh 4. Michelangelo 5. Raphael Sanzio Buonarroti 6. Rembrandt van Rijn 7. Salvador Dalí 8. Henri Matisse 9. Andy Warhol 10. Edward Hopper 11. Frida Kahlo 12. Edgar Degas 13. Paul Cézanne 14. Jackson Pollock 15. Edvard Munch 16. Gustav Klimt 17. Paul Gauguin 18. Pierre-Auguste 19. Johannes Vermeer 20. Caravaggio Renoir 21. Jan van Eyck 22. Édouard Manet 23. Georgia O’Keeffe 24. Francisco Goya 25. Albrecht Dürer 26. Sandro Botticelli 27. Titian 28. Diego Velázquez 29. Giotto di Bondone 30. El Greco 31. Peter Paul Rubens 32. Caspar David 33. Wassily Kandinsky 34. Marc Chagall 35. Eugène Delacroix Friedrich 36. Piet Mondrian 37. Roy Lichtenstein 38. Joan Miró 39. Hieronymus Bosch 40. Jean-Michel Basquiat 41. Gustave Courbet 42. Thomas 43. Jean-Auguste- 44. Élisabeth Vigée Le 45. Artemisia Gainsborough Dominique Ingres Brun Gentileschi 237 Table F.2 (cont’d) Artist 1 Artist 2 Artist 3 Artist 4 Artist 5 46. Camille Pissarro 47. Georges Seurat 48. Diego Rivera 49. Henri de Toulouse- 50. Édouard Vuillard Lautrec 51. Berthe Morisot 52. Mary Cassatt 53. James Abbott 54. John Singer Sargent 55. William Blake McNeill Whistler 56. David Hockney 57. Keith Haring 58. Jasper Johns 59. Alfred Sisley 60. Jean-Baptiste- Camille Corot 61. Winslow Homer 62. Grant Wood 63. Paul Klee 64. Yayoi Kusama 65. Egon Schiele 66. Amedeo 67. Fernand Léger 68. Giorgio de Chirico 69. Henri Rousseau 70. Max Ernst Modigliani 71. Kazimir Malevich 72. Mark Rothko 73. René Magritte 74. Alphonse Mucha 75. Francis Bacon 76. Marcel Duchamp 77. Leonardo da Vinci 78. Lucian Freud 79. Anselm Kiefer 80. Joseph Beuys 81. Bridget Riley 82. Anish Kapoor 83. Damien Hirst 84. Tracey Emin 85. Ai Weiwei 86. Gerhard Richter 87. Jeff Koons 88. Takashi Murakami 89. Zhang Xiaogang 90. Jenny Saville 91. Kara Walker 92. Yoko Ono 93. Cindy Sherman 94. Louise Bourgeois 95. Barbara Kruger 96. Richard Serra 97. Donald Judd 98. 
Sol LeWitt 99. Frank Stella 100. Ellsworth Kelly 101. Robert 102. Claes Oldenburg 103. Paolo Veronese 104. Pieter Bruegel 105. Anthony van Dyck Rauschenberg 106. J.M.W. Turner 107. John Constable 108. John Everett 109. Dante Gabriel 110. Edward Burne- Millais Rossetti Jones 111. David Alfaro 112. Rufino Tamayo 113. Victor Vasarely 114. Kurt Schwitters 115. Andy Siqueiros Goldsworthy 116. Richard Long 117. Robert Smithson 118. Christo Javacheff 119. Walter Gropius 120. Robert Venturi 121. Jean Nouvel 122. Daniel Libeskind 123. Richard Rogers 124. Renzo Piano 125. Norman Foster 126. Bjarke Ingels 127. Frank Gehry 128. Santiago 129. Toyo Ito 130. Frank Lloyd Calatrava Wright 131. Alvar Aalto 132. Dominique 133. Luis Barragán 134. James Stirling 135. Peter Zumthor Perrault 136. Kazuyo Sejima 137. Kengo Kuma 138. Jacques Herzog 139. Pierre de Meuron 140. César Pelli 141. Christian de 142. Stefano Boeri 143. Wang Shu 144. Olafur Eliasson 145. Thomas Portzamparc Hirschhorn 146. Felix Gonzalez- 147. Gilbert 148. Ugo Rondinone 149. Paul McCarthy 150. Cory Arcangel Torres 238 Table F.2 (cont’d) Artist 1 Artist 2 Artist 3 Artist 4 Artist 5 151. Elaine Sturtevant 152. Marcel 153. Maurizio Cattelan 154. Rirkrit Tiravanija 155. Allan McCollum Broodthaers 156. Glenn Ligon 157. Peter Fischli 158. David Weiss 159. Peter Doig 160. Thomas Schütte 161. Neo Rauch 162. Marlene Dumas 163. Felix Gonzalez- 164. Lorna Simpson 165. Byrne Morrison Torres 166. Glenn Martin 167. Dan Collins 168. Matthew Barney 169. Peter Hujar 170. Shirin Neshat 171. Thomas Demand 172. Alexander 173. Catherine Opie 174. Wolfgang 175. Martin Creed McQueen Tillmans 176. Olafur Eliasson 177. James Turrell 178. Bill Viola 179. Andreas Gursky 180. Lewis Baltz 181. Cindy Sherman 182. Man Ray 183. Bruce Nauman 184. Sol LeWitt 185. Richard Hamilton 186. James Rosenquist 187. Nam June Paik 188. Vito Acconci 189. Susan Rothenberg 190. Lawrence Weiner 191. Daniel Buren 192. Robert Gober 193. Adrian Piper 194. Katharina Fritsch 195. Christian Marclay 196. Richard Avedon 197. Jeff Wall 198. Edward 199. Julius Lange 200. Diane Arbus Burtynsky Distortion Applied for Robustness Evaluation. For robustness evaluation in Fig. 8 (main paper), we apply several post-processing distortions. These augmentations are applied by following [88]. Below are the details: 1. Color Jitter: For the color jitter augmentation, we modified several aspects of the images. The brightness factor, contrast factor, and saturation were adjusted to a value of 0.3, while the hue factor was set to 0.1 to introduce controlled variations in the image colors. 2. Crop and Resize: For the crop and resize augmentation, we randomly extracted 384 × 384 blocks from the original 512 × 512 images and resized these blocks to 256 × 256, simulating different framing conditions. 3. Gaussian Blur: We applied Gaussian blur with a kernel size of (3,3) and a sigma value of (2.0, 2.0) to simulate soft-focus effects in the images. 239 4. Gaussian Noise: To introduce random noise, Gaussian noise was added to the images with a standard deviation of 0.05, creating a more realistic representation of noisy environments. 5. JPEG compression: We used a quality setting of 50 to simulate compression artifacts often encountered in real-world image data. 6. Rotation: This augmentation was randomly applied to the images within a range of 0 to 180 degrees to account for changes in orientation during training. 7. 
Sharpness: For the sharpness augmentation, we set the intensity to 1, enhancing the clarity of certain features within the images. Architecture Details. We use several networks for designing CustomMark, which include concept encoder, secret mapper, and secret decoder. For concept encoder, a U-Net-inspired network designed for processing and transforming 1D sequential data is adopted. Initially, a fully-connected layer is maps the bit sequence to a feature vector which is concatenated with the token embedding. This is given as input to the encoder-decoder framework of U-Net to output the perturbed token embedding. The mapper network is a feature transformation module designed to encode input indices into high-dimensional representations using an embedding-based approach. It employs a learnable embedding layer that maps input indices ( e.g.16) to vectors in a higher-dimensional space ( e.g.64). The embeddings are initialized orthogonally and scaled to maintain a unit standard deviation. During the forward pass, the network retrieves embeddings for all possible input indices, weights them element-wise based on the input tensor, and sums these weighted embeddings along the input dimension. The result is normalized by the square root of batch size and biased by adding 1, producing a robust high-dimensional representation for each input bit sequence. Finally, we use the EfficientNet-B3 [295] architecture as its core backbone for secret decoder. The network is initialized with pre-trained weights from the ImageNet dataset for robust feature extraction. The final classifier layer of EfficientNet is replaced with a fully connected layer that outputs the predicted bit sequence. 240 Prompt Details. Following [88], we use various prompts for sampling clean and watermarked images which are used to train CustomMark. The collection of prompts is different, depending on the concept we attribute. We replace the “[name]” with the corresponding concept token. Below are the details: 1. 
Artists as concepts: – “a painting, art by [name]” – “a rendering, art by [name]” – “a cropped painting, art by [name]” – “the painting, art by [name]” – “a clean painting, art by [name]” – “a dirty painting, art by [name]” – “a dark painting, art by [name]” – “a picture, art by [name]” – “a cool painting, art by [name]” – “a close-up painting, art by [name]” – “a bright painting, art by [name]” – “a cropped painting, art by [name]” – “a good painting, art by [name]” – “a close-up painting, art by [name]” – “a rendition, art by [name]” – “a nice painting, art by [name]” – “a small painting, art by [name]” – “a weird painting, art by [name]” – “a large painting, art by [name]” 241 – “A serene landscape painting in the style of [name]” – “A bustling cityscape in the style of [name]” – “A painting of a cozy cottage in the woods in the style of [name]” – “A vibrant underwater scene in the style of [name]” – “A whimsical painting of a flying elephant in the style of [name]” – “A still life painting featuring fruit and flowers in the style of [name]” – “A portrait of a famous historical figure in the style of [name]” – “A painting of a dreamy night sky in the style of [name]” – “A colorful abstract painting in the style of [name]” – “A street scene from Paris in the style of [name]” – “A depiction of a beautiful sunset over the ocean in the style of [name]” – “A painting of a peaceful mountain village in the style of [name]” – “An energetic painting of dancers in motion in the style of [name]” – “A painting of a snow-covered winter scene in the style of [name]” – “A painting of a tropical paradise in the style of [name]” – “A painting of a magical forest filled with fantastical creatures in the style of [name]” – “A painting of a dramatic stormy seascape in the style of [name]” – “A portrait of a majestic lion in the style of [name]” – “A painting of a romantic scene between two lovers in the style of [name]” – “A painting of a serene Japanese garden in the style of [name]” – “A painting of a bustling marketplace in the style of [name]” – “A painting of a tranquil river scene in the style of [name]” – “A painting of a fiery volcano eruption in the style of [name]” – “A painting of a futuristic cityscape in the style of [name]” 242 – “A painting of a whimsical circus scene in the style of [name]” – “A painting of a mysterious moonlit forest in the style of [name]” – “A painting of a dramatic desert landscape in the style of [name]” – “A portrait of a regal peacock in the style of [name]” – “A painting of a mystical island in the style of [name]” – “A painting of a lively carnival scene in the style of [name]” 2. 
ImageNet classes as concepts: – “a photo of a [name]” – “a rendering of a [name]” – “a cropped photo of the [name]” – “the photo of a [name]” – “a photo of a clean [name]” – “a photo of a dirty [name]” – “a dark photo of the [name]” – “a photo of my [name]” – “a photo of the cool [name]” – “a close-up photo of a [name]” – “a bright photo of the [name]” – “a cropped photo of a [name]” – “a photo of the [name]” – “a good photo of the [name]” – “a photo of one [name]” – “a close-up photo of the [name]” – “a rendition of the [name]” 243 – “a photo of the clean [name]” – “a rendition of a [name]” – “a photo of a nice [name]” – “a good photo of a [name]” – “a photo of the nice [name]” – “a photo of the small [name]” – “a photo of the weird [name]” – “a photo of the large [name]” – “a photo of a cool [name]” – “a photo of a small [name]” – “a photo of a [name] playing sports” – “a rendering of a [name] at a concert” – “a cropped photo of the [name] cooking dinner” – “the photo of a [name] at the beach” – “a photo of a clean [name] participating in a marathon” – “a photo of a dirty [name] after a mud run” – “a dark photo of the [name] exploring a cave” – “a photo of my [name] at graduation” – “a photo of the cool [name] performing on stage” – “a close-up photo of a [name] reading a book” – “a bright photo of the [name] at a theme park” – “a cropped photo of a [name] hiking in the mountains” – “a photo of the [name] painting a mural” – “a good photo of the [name] at a party” 244 – “a photo of one [name] playing an instrument” – “a close-up photo of the [name] giving a speech” – “a rendition of the [name] during a workout” – “a photo of the clean [name] gardening” – “a rendition of a [name] dancing in the rain” – “a photo of a nice [name] volunteering at a charity event” – “a photo of a [name] surfing a giant wave” – “a rendering of a [name] skydiving over a scenic landscape” – “a cropped photo of the [name] riding a rollercoaster” – “the photo of a [name] rock climbing a steep cliff” – “a photo of a clean [name] practicing yoga in a peaceful garden” – “a photo of a dirty [name] participating in a paintball match” – “a dark photo of the [name] stargazing at a remote location” – “a photo of my [name] crossing the finish line at a race” – “a photo of the cool [name] breakdancing in a crowded street” – “a close-up photo of a [name] blowing out candles on a birthday cake” – “a bright photo of the [name] flying a kite on a sunny day” – “a cropped photo of a [name] ice-skating in a winter wonderland” – “a photo of the [name] directing a short film” – “a good photo of the [name] participating in a flash mob” – “a photo of one [name] skateboarding in an urban park” – “a close-up photo of the [name] solving a Rubik’s cube” – “a rendition of the [name] fire dancing at a beach party” – “a photo of the clean [name] planting a tree in a community park” 245 – “a rendition of a [name] performing a magic trick on stage” – “a photo of a nice [name] rescuing a kitten from a tree” 246 APPENDIX G PIVOT APPENDIX G.1 Additional Experiments. LoRa in backbone vs head. The ablation study shown in Tab. G.1 evaluates the effectiveness of LoRA when applied to different parts of the baseline detector. When the LoRA layers application is shifted from the backbone to the head, the performance of both detectors, TSN and MViTv2, significantly decreases. 
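Before turning to the exact numbers, the sketch below illustrates what "LoRA in the backbone" could look like: low-rank adapters attached to the 3×3 convolutions of ResNet layers 3 and 4 while the pretrained weights stay frozen. The `ConvLoRA` module, rank, and wrapping loop are illustrative assumptions, not the actual PiVoT implementation.

```python
import torch.nn as nn
from torchvision.models import resnet50

class ConvLoRA(nn.Module):
    """Hypothetical low-rank adapter for a frozen 2D convolution (illustrative only)."""
    def __init__(self, base_conv: nn.Conv2d, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base_conv
        for p in self.base.parameters():
            p.requires_grad = False  # keep the pretrained weights frozen
        # Down-project to rank channels, then up-project with the base conv's geometry
        # so the low-rank branch matches the base output shape.
        self.down = nn.Conv2d(base_conv.in_channels, rank, kernel_size=1, bias=False)
        self.up = nn.Conv2d(rank, base_conv.out_channels,
                            kernel_size=base_conv.kernel_size,
                            stride=base_conv.stride,
                            padding=base_conv.padding, bias=False)
        nn.init.zeros_(self.up.weight)  # start as an identity update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

# Attach adapters to the 3x3 convolutions in residual layers 3 and 4 only.
model = resnet50(weights=None)  # pretrained weights would be loaded in practice
for layer in [model.layer3, model.layer4]:
    for block in layer:
        block.conv2 = ConvLoRA(block.conv2, rank=4)
```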
Specifically, for TSN, the top-1 accuracy drops to 24.21% and the top-5 accuracy to 53.30%, while MViTv2 experiences a decline to 35.11% and 63.09% for top-1 and top-5 accuracy, respectively. This indicates that the backbone plays a critical role in extracting meaningful spatial-temporal features in video detection tasks, and adapting LoRA to the head limits its capacity to leverage these features effectively. These results highlight that LoRA’s effectiveness is highly dependent on its application to critical regions of the model, particularly the backbone in this case, where it can better capture the temporal dynamics and spatial features necessary for improving action recognition. Template Learning. Template learning plays a pivotal role in PiVoT by providing universal adaptability and efficiency in detector performance. When the framework transitions from frame- dependent templates to universal templates, a substantial degradation in accuracy is seen. For TSN, universal templates achieve a top-1 accuracy of 32.88% compared to 51.37% for frame-dependent templates, with a similar trend in top-5 accuracy (61.31% versus 78.71%). A similar degradation is seen in MViTv2, where universal templates result degraded performance. This demonstrates that while universal templates aim to generalize across frames, they cannot match the level of frame-specific optimization provided by PiVoT’s default template approach. Further, replacing learned templates with fixed templates also degrades performance. For TSN, top-1 accuracy falls to 26.19% and top-5 accuracy to 52.01%, while for MViTv2, top-1 accuracy decreases to 60.12% and top-5 accuracy to 86.11%. These results emphasize the necessity of dynamic, learned templates in PiVoT, as fixed templates fail to adapt to the variations in temporal dynamics and action-specific nuances inherent in video sequences. Overall, the original template 247 Table G.1 Ablation study of LoRA and template learning for PiVoT. Changed From→To TSN [324] (%)↑ MViTv2 [182] (%)↑ Top-1 Top-5 Top-1 68.81 78.71 51.37 - 35.11 53.30 24.21 Backbone→Head 57.70 61.31 Frame depend→Universal 32.88 60.12 52.01 26.19 Learn→Fixed Top-5 91.63 63.09 85.14 86.11 PiVoT LoRA Template Figure G.1 Backbone feature distribution with color intensity varied by detector head logits confidence. Lighter color means detector is less confident and vice -versa. learning mechanism in PiVoT proves to be critical for achieving superior performance compared to alternative template designs. G.2 Template Analysis. Fig. G.1 demonstrates the backbone feature distribution of input frames and perturbed frames when provided separately to the respective trained TSM detector. Perturbed frames, created by adding input frames with the generated template, exhibit a distinct separation in feature space 248 Figure G.2 tSNE plot for all four detectors for input frames, perturbed frames, and estimated templates. compared to the original input frames. The color intensity variation, corresponding to the detector logits, reveals a higher confidence for perturbed frames. This indicates that the template enhances the model’s ability to extract discriminative features, leading to more confident predictions by the detector. The addition of templates aligns features more effectively with the task requirements, showcasing the utility of template-based enhancement for video-based tasks. We further analyze the template enhancement in the t-SNE plots in Fig. 
G.2, which demonstrate that, at the frame level, the input and perturbed frames exhibit minimal differences in distribution, indicating that the addition of the template does not significantly alter the original frame content. However, when viewed at the feature level in Fig. G.1, there is a marked distinction between the input and perturbed frame feature distributions. This highlights that while the perturbation introduced by the template is subtle at the pixel level, it has a significant impact on the feature representations extracted by the detector. This distinction underscores the effectiveness of the templates in enhancing task-relevant features. By subtly modifying the input frames, the templates guide the detector’s feature space towards better alignment with the underlying action-specific semantics. This leads to enriched feature representations that improve the detector’s performance without compromising the natural temporal and spatial consistency of the input frames. In essence, the templates act as an implicit augmentation mechanism, creating a more expressive and discriminative feature space for accurate action recognition and detection. G.3 Limitations The proposed PiVoT wrapper, while effective in enhancing video-based detectors, has certain limitations that need discussion. First, the method requires training the wrapper specifically with each architecture, limiting its potential as a true plug-and-play solution. Developing a training-free implementation could significantly improve its ease of adoption across diverse models. Second, while performance gains are consistently observed across various tasks and datasets, the magnitude of these gains cannot be guaranteed. This variability stems from differing architectures and dataset characteristics, which may influence the effectiveness of the wrapper. Another limitation lies in the visibility and influence of the templates. Currently, the templates have substantial freedom to enhance task-specific performance, but this poses challenges when perturbed videos need to be publicly shared or used outside controlled environments. Making the templates imperceptible would allow broader adoption by detectors that do not have our wrapper installed: if invisible templates are already embedded in the video, the need to install PiVoT on every copy of the detector is eliminated. We leave these directions for future work. G.4 Potential Social Impact Video analysis tasks have diverse applications in the health industry, sports, entertainment, and surveillance, where accuracy is critical. For instance, in healthcare, fall detection systems in homes rely on accurate video-based monitoring to ensure timely assistance for patients. Similarly, in sports and entertainment, analyzing player movements with precision enhances performance evaluation and strategy development. Surveillance systems, which often operate in real time, require high accuracy to detect anomalies effectively. PiVoT, a template-based approach, offers a practical solution by significantly improving the accuracy of video-based detectors without substantially increasing the size or complexity of the system. This efficiency ensures that existing systems can be enhanced with minimal computational overhead, making the solution scalable for deployment in resource-constrained environments.
By enabling high performance without large architectural modifications, the technique broadens accessibility and applicability, addressing the growing demand for robust, efficient, and accurate 250 video analysis across diverse real-world scenarios. G.5 Implementation details We provide the implementation details of our methods focusing on detector details, where the LoRA layers ar applied in the backbone, and the architecture details of our framework. Detector Details. For all the detectors, we use the default config files as provided by the MMACTION2 toolbox [58]. Below are the names of the config files used for our experiments: 1. TSN: tsn_imagenet-pretrained-r50_8xb32-1x1x8-50e_sthv2-rgb and tsn_imagenet-pretrained- r50_8xb32-1x1x8-100e_kinetics400-rgb 2. TSM: tsm_imagenet-pretrained-r50_8xb16-1x1x8-50e_sthv2-rgb and tsm_imagenet-pretrained- r50_8xb16-1x1x8-50e_kinetics400-rgb 3. MViTv2: mvit-small-p244_k400-pre_16xb16-u16-100e_sthv2-rgb and mvit-small-p244_32xb16- 16x4x1-200e_kinetics400-rgb 4. SlowFast: slowfast_kinetics400-pretrained-r50_8xb16-4x16x1-20e_ava21-rgb The models are first trained on the respective dataset to reproduce the reported performance. This trained model is then further used with our PiVoT wrapper. LoRA Application We use different backbones for different detectors. TSN and TSM use ResNet- 50, SlowFast uses 3D ResNet-50, and MViTv2 uses multi-scale ViT backbone. In this ResNet-50 backbone, LoRA is selectively applied to the convolutional layers within residual layer 3 and layer 4 of the network. Specifically, LoRA is integrated into the BasicBlock and Bottleneck modules for these layers. For the BasicBlock, LoRA is applied at the first and second convolutional layers (conv1 and conv2), adapting the channel dimensions through low-rank matrices. Similarly, in the Bottleneck module, LoRA is applied at each of the three convolutional layers (conv1, conv2, and conv3), modifying the input or output channels to enhance feature adaptation. By limiting LoRA to these deeper layers, the model focuses on refining high-level feature representations without overburdening the earlier stages of the network. 251 In the 3D ResNet-50 SlowFast LoRA network, LoRA is applied selectively to specific 3D convolutional layers within both the slow and fast pathways. Specifically, LoRA is applied to the conv1 layer and multiple convolutional layers across all ResNet stages (layer1, layer2, layer3, and layer4). This includes the main convolutional layers within each block, such as conv1, conv2, and conv3 in the bottleneck layers. This selective application focuses on enhancing the representational capacity of key layers without modifying the entire model. Finally for MViTv2, LoRA is applied within the multi-scale attention mechanism to the query and key projections. Specifically, LoRA introduces two low-rank projection layers for the input features, reducing their dimensionality to a smaller rank r. These reduced representations are then projected back to the original dimensions using corresponding projection layers before being added to the standard query and key projections. This enables LoRA to enhance the adaptability of the attention mechanism while minimizing the additional parameter overhead. These LoRA adaptations are applied at each attention block across all transformer layers of the MViT model, making them integral to improving the model’s representational capacity and flexibility. Architecture Details. 
We employ a 3D attention-based U-Net network [338, 56, 143] to estimate templates from the input frames. LoRA layers are integrated into the detector’s backbone, as previously described. The entire framework is trained end-to-end, with the detector initialized using a pretrained model. 252 APPENDIX H REVERSE ENGINEERING OF GENERATIVE MODELS APPENDIX H.1 Test sets for evaluation The experiments described in the text were performed on four different test sets, each set containing twelve different GMs for the leave out testing. For test sets, we follow the distribution of GMs as follows: six GANs, two VAEs, two ARs, one NF and one AA model. We select this distribution because of the number of GMs of each type in our dataset which has 81 GANs, 13 VAEs, 11 ARs, 5 NFs and 6 AAs. The sets considered are shown in Tab. H.1. H.2 Ground truth for GMs We collected a fake face dataset of 116 GMs, each of them with 1, 000 generated images. We also collect the ground truth hyperparameters for network architecture and loss function types. Tab. H.2 shows the ground truth representation of the network architecture where different hyperparameters are of different data types. Therefore, we apply min-max normalization for the continuous type parameters to make all values in the range of [0, 1]. For multi-class and binary labels, we further show the feature value for different labels in Tab. H.3. Note that some parameters share the same values but with different meanings. For example, F14 and F15 represent skip connection and down-sampling respectively. Tab. H.4 shows the ground truth representation of the loss function types used to train each GM where all these values are binary indicating whether the particular loss type was used or not. H.3 Network architecture Fig. H.5 shows the network architecture used in different experiments. For GM parsing, our FEN has two stem convolution layers and 15 convolution blocks with each block having convolution, batch normalization and ReLU activation to estimate the fingerprint. The encoder in the PN has five convolution blocks with each block having convolution, pooling and ReLU activation. This is followed by two fully connected layers to output a 512 dimension feature vector which is further given as input to multiple branches to output different predictions. For continuous type parameters, 253 Table H.1 Test sets used for evaluation. Each set contains six GANs, two VAEs, two ARs, one AA and one NF. GM GM 1 GM 2 GM 3 GM 4 GM 5 GM 6 GM 7 GM 8 GM 9 GM 10 RSGAN_HALF FAST_PIXEL GM 11 GM 12 Set 3 BICYCLE_GAN BIGGAN_512 CRGAN_C FACTOR_VAE FGSM ICRGAN_C LOGAN MUNIT PIXEL_SNAIL STARGAN_2 SURVAE_FLOW_MAXPOOL VAE_FIELD Set 4 GFLM IMAGE_GPT LSGAN MADE PIX2PIX PROG_GAN RSGAN_REG SEAN STYLE_GAN SURVAE_FLOW_NONPOOL WGAN_DRA YLG Set 1 ADV_FACES BETA_B BETA_TCVAE BIGGAN_128 DAGAN_C DRGAN FGAN PIXEL_CNN PIXEL_CNN++ Set 2 AAE ADAGAN_C BEGAN BETA_H BIGGAN_256 COCOGAN CRAMERGAN DEEPFOOL DRIT STARGAN VAEGAN FVBN SRFLOW we use two fully connected layers to output a 9-D network architecture. For discrete type parameters and loss function parameters, we use separate classifiers with three fully connected layers for every parameter to perform multi-class or binary classification. For the deepfake detection task, we change the architecture of our FEN network as current deepfake manipulation detection requires much deeper networks. Thus, our FEN architecture has two stem convolution layers and 29 convolution blocks to estimate the fingerprint. 
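The repeated block pattern described in this section, two stem convolutions followed by a stack of convolution–batch-normalization–ReLU blocks, can be sketched as below. Only the block counts (15 for model parsing, 29 for deepfake detection) come from the text; the channel width, kernel size, and fingerprint output shape are illustrative assumptions.

```python
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    """One FEN-style block: convolution, batch normalization, ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

def build_fen(num_blocks: int, width: int = 64) -> nn.Sequential:
    """Two stem convolutions followed by `num_blocks` conv-BN-ReLU blocks.

    Widths and kernels are assumptions; the output is kept image-like so it
    can serve as an estimated fingerprint.
    """
    stem = [nn.Conv2d(3, width, kernel_size=3, padding=1),
            nn.Conv2d(width, width, kernel_size=3, padding=1)]
    blocks = [conv_block(width, width) for _ in range(num_blocks)]
    head = [nn.Conv2d(width, 3, kernel_size=3, padding=1)]
    return nn.Sequential(*stem, *blocks, *head)

fen_parsing = build_fen(15)   # FEN used for model parsing
fen_deepfake = build_fen(29)  # deeper FEN used for deepfake detection
```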
Table H.2 Ground truth feature vector used for prediction of network architecture for all GMs. F1: # layers, F2: # convolutional layers, F3: # fully connected layers, F4: # pooling layers, F5: # normalization layers, F6: # filters, F7: # blocks, F8: # layers per block, F9: # parameters, F10: normalization type, F11: non-linearity type in last layer, F12: non-linearity type in blocks, F13: up-sampling type, F14: skip connection, F15: down-sampling. The table lists these 15 values for each of the 116 GMs in our dataset.
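To illustrate how entries like those in Tab. H.2 are turned into prediction targets, the short sketch below min-max normalizes the continuous features F1-F9 column-wise over the dataset and concatenates the discrete labels F10-F15 encoded as in Tab. H.3. The function name and the example values are hypothetical, not rows copied from the table.

# Illustrative sketch of assembling the 15-D ground-truth vectors (hypothetical values).
import numpy as np


def normalize_continuous(raw):
    """raw: (num_gms, 9) array of F1-F9; column-wise min-max normalization to [0, 1]."""
    lo, hi = raw.min(axis=0), raw.max(axis=0)
    return (raw - lo) / (hi - lo + 1e-8)


# Two hypothetical GMs, continuous features F1-F9.
raw = np.array([[9., 4., 1., 0., 3., 451., 2., 2., 1049985.],
                [35., 14., 13., 0., 7., 4131., 3., 3., 9416196.]])
norm = normalize_continuous(raw)             # each column now lies in [0, 1]
discrete = np.array([[1, 1, 1, 1, 0, 0],     # F10-F15 labels, encoded per Tab. H.3
                     [0, 1, 1, 1, 1, 1]])
ground_truth = np.hstack([norm, discrete])   # one 15-D target vector per GM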
Table H.3 Feature value for different labels of multi-class and binary features.
Normalization type: 0 = Batch Normalization, 1 = Instance Normalization, 2 = Adaptive Instance Normalization, 3 = No Normalization.
Non-linearity type in last layer: 0 = ReLU, 1 = Tanh, 2 = Leaky_ReLU, 3 = Sigmoid.
Non-linearity type in blocks: 0 = ELU, 1 = ReLU, 2 = Leaky_ReLU, 3 = Sigmoid.
Up-sampling type: 0 = Nearest Neighbour, 1 = Deconvolution.
Skip connection and down-sampling: binary label indicating whether the feature is used or not.

Table H.4 Ground truth feature vector used for prediction of loss type for all GMs. The columns are the 𝐿1, 𝐿2, MSE, MMD, LS, WGAN, KL, Adversarial, Hinge, and CE losses; each entry is binary, indicating whether the corresponding loss type was used to train the GM. The table lists these values for each of the 116 GMs in our dataset.
Table H.5 Ground truth feature vector used for prediction of network architecture for evaluation on diffusion models. F1: # layers, F2: # convolutional layers, F3: # fully connected layers, F4: # pooling layers, F5: # normalization layers, F6: # filters, F7: # blocks, F8: # layers per block, F9: # parameters, F10: normalization type, F11: non-linearity type in last layer, F12: non-linearity type in blocks, F13: up-sampling type, F14: skip connection, F15: down-sampling. Values are listed per GM in the order F1-F15.
ADM: 134, 122, 12, 0, 0, 5000, 8, 12, 554000000, 1, 1, 1, 1, 1, 1
ADM-G: 134, 122, 12, 0, 0, 5000, 8, 12, 600000000, 1, 1, 1, 1, 1, 1
DDPM: 134, 122, 12, 0, 0, 5000, 8, 12, 554000000, 1, 1, 1, 1, 1, 1
DDIM: 134, 122, 12, 0, 0, 5000, 8, 12, 554000000, 1, 1, 1, 1, 1, 1
LDM: 134, 122, 12, 0, 0, 5000, 8, 12, 554000000, 1, 1, 1, 1, 1, 1
Stable-Diffusion: 84, 94, 10, 0, 0, 5000, 8, 12, 552000000, 1, 1, 1, 1, 1, 1
GLIDE-Diffusion: 80, 90, 10, 0, 0, 5000, 8, 12, 270000000, 1, 1, 1, 1, 1, 1

Table H.6 Ground truth feature vector used for prediction of loss type for evaluation on diffusion models. Values are listed per GM in the column order 𝐿1, 𝐿2, MSE, MMD, LS, WGAN, KL, Adversarial, Hinge, CE.
ADM: 0, 0, 0, 0, 1, 0, 1, 0, 0, 0
ADM-G: 0, 0, 0, 0, 1, 0, 0, 0, 0, 0
DDPM: 0, 0, 0, 0, 1, 0, 0, 0, 0, 0
DDIM: 0, 0, 0, 0, 1, 0, 0, 0, 0, 0
LDM: 1, 0, 0, 0, 1, 0, 0, 0, 0, 0
Stable-Diffusion: 0, 0, 0, 0, 1, 1, 0, 1, 0, 0
GLIDE-Diffusion: 0, 0, 0, 0, 1, 1, 1, 0, 0, 0

Table H.7 Test sets used for evaluation on diffusion models. Each of the four sets holds out three of the diffusion models listed above for testing.
Table H.8 Test sets used for coordinated misinformation attacks, listing the seen GMs (GM 1-GM 5) and the unseen GMs (GM 6-GM 15) of each test set.

Figure H.1 Feature heatmap for each feature in the network architecture and loss function predicted feature vectors for face data. Each heatmap shows the importance of each region for the estimation of the respective parameter.

Figure H.2 Feature heatmap for each feature in the network architecture and loss function predicted feature vectors for MNIST data.

Figure H.3 Feature heatmap for each feature in the network architecture and loss function predicted feature vectors for CIFAR data.

Figure H.4 Confusion matrices for the estimation of the remaining network architecture and loss function parameters not shown in the paper. (1)-(12): standard cross-entropy; (13)-(24): weighted cross-entropy. Weighted cross-entropy handles the data imbalance much better than standard cross-entropy, which usually predicts a single class.

Figure H.5 Network architecture for the various components of our method. (a) FEN, (b) mean and instance parser in the PN, (c) shallow network for deepfake detection, (d) shallow network for image attribution.

H.4 Feature heatmaps

Every hyperparameter defined for network architecture and loss function type prediction may depend on certain regions of the input image. To find out which region of the input image our model looks at to predict each hyperparameter, we mask out a 5 × 5 region of the input image. For the continuous-type parameters, we compute the 𝐿1 error between every predicted hyperparameter and its ground truth; this error indicates how important the masked 5 × 5 region is for predicting that hyperparameter. The higher the error, the more important that region is for the prediction of the corresponding hyperparameter. For discrete-type parameters in the network architecture and loss function, we estimate the probability of the ground-truth label for every parameter and subtract this probability from one to obtain the heatmap value for the respective feature. Masking an unimportant region leaves the probability of the ground-truth label largely unchanged, whereas masking an important region lowers it and therefore raises the heatmap value. To obtain a stable heatmap, we repeat this experiment on 100 randomly chosen images across the different GMs and average the resulting heatmaps.

Fig. H.1, Fig. H.2 and Fig. H.3 show the feature heatmaps for every hyperparameter of the network architecture and loss-type feature vectors for Face, MNIST, and CIFAR data, respectively. For each hyperparameter, certain regions of the input are more important than others. Each type of data exhibits a different pattern of heatmaps, indicating different regions of importance: for face and CIFAR data, these regions lie mostly in the central part of the image, whereas for MNIST, many of the features depend on regions closer to the edges. There are also similarities among the heatmaps for a particular type of data, which can indicate similarity between the corresponding hyperparameters.
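The masking procedure above can be summarized in a short sketch. Below is a minimal PyTorch example of the occlusion heatmap for one discrete parameter; the function names, the assumption that the model returns class logits for that parameter, and the patch handling at image borders are illustrative, not the exact implementation.

# Minimal sketch of the occlusion-based feature heatmap described above (illustrative only).
import torch


@torch.no_grad()
def occlusion_heatmap(model, image, gt_label, patch=5):
    """image: (3, H, W) tensor. Returns a heatmap in which higher values mark
    regions that are more important for predicting the ground-truth label."""
    _, h, w = image.shape
    ys = range(0, h - patch + 1, patch)
    xs = range(0, w - patch + 1, patch)
    heat = torch.zeros(len(ys), len(xs))
    for a, i in enumerate(ys):
        for b, j in enumerate(xs):
            masked = image.clone()
            masked[:, i:i + patch, j:j + patch] = 0.0          # mask out a 5x5 region
            probs = model(masked.unsqueeze(0)).softmax(dim=-1)[0]
            heat[a, b] = 1.0 - probs[gt_label]                 # 1 - p(ground-truth label)
    return heat


def averaged_heatmap(model, images, labels, patch=5):
    """Average per-image heatmaps (e.g., over 100 images) for a stable estimate."""
    maps = [occlusion_heatmap(model, im, lb, patch) for im, lb in zip(images, labels)]
    return torch.stack(maps).mean(dim=0)

For the continuous parameters, the same loop would record the 𝐿1 error between the prediction and the ground truth at each masked location instead of one minus the ground-truth probability.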