MACHINE INTELLIGENCE-ENABLED MULTIMODAL BIOMEDICAL IMAGING

By Aniwat Juhong

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Electrical and Computer Engineering – Doctor of Philosophy

2025

ABSTRACT

Due to the rapid development of computational technologies, deep-learning-based approaches have emerged as practical and promising remedies for a wide range of biomedical applications. This dissertation demonstrates the utilization of deep learning approaches across multiple modalities in the field of biomedical applications: histopathology image analysis, multispectral optoacoustic tomography (MSOT), computed tomography (CT), magnetic particle imaging (MPI), and Raman spectroscopy. The first deep learning application is convolutional neural networks (CNNs) for resolution enhancement and nuclei segmentation of hematoxylin and eosin (H&E) images. This deep-learning-based approach could facilitate cancer diagnosis using H&E images acquired in a low-resource setting. The second application is based on hybrid recurrent and convolutional neural networks to generate sequential cross-sectional MSOT images in order to reduce the acquisition time. Essentially, the proposed deep learning model can generate the missing sequential MSOT images in data acquired with a large step size, resulting in a resolution comparable to data acquired with a small step size. The third application is an efficient end-to-end deep learning model based on the U-Net architecture and a multi-head attention mechanism for MPI-CT image segmentation. This proposed model can directly segment the MPI signal from the co-registered MPI-CT image with promising performance. Lastly, a custom-made Raman spectrometer is combined with computer-vision-based positional tracking and deep-learning-based monocular depth estimation for the visualization of 2D and 3D surface-enhanced Raman scattering (SERS) nanoparticle (NP) imaging, respectively. The combination of Raman spectroscopy, image processing, deep learning, and SERS molecular imaging shows robust and feasible potential for clinical applications.

Copyright by ANIWAT JUHONG 2025

ACKNOWLEDGMENTS

I would like to express my gratitude for the support and collaboration provided by the numerous people who helped me refine my doctoral research and complete this dissertation. First and foremost, I wish to wholeheartedly thank Prof. Zhen Qiu for his unwavering faith in my ability to carry out this project, his financial support, and his spirit of adventure. He has been an exceptional mentor throughout my Ph.D. journey. This dissertation is a result of his vision to develop deep learning for numerous useful biomedical applications. I also appreciate the invaluable suggestions from my committee members, Prof. Nelson Sepúlveda, Prof. Wen Li, Prof. Ming Han, and Prof. Xuefei Huang, in completing this dissertation. Without their evaluation time, feedback, and guidance, I would not have been able to improve my research skills and broaden my horizons. Secondly, I am profoundly grateful to Prof. Christopher H. Contag and Prof. Wibool Piyawattanametha for their support and for providing me with an opportunity to work at the Institute for Quantitative Health Science and Engineering (IQ), Michigan State University. Moreover, I would like to thank my colleagues, including Dr. Bo Li, Dr. Cheng-You Yoa, Dr. Chia-Wei Yang, Dr. Kunli Liu, Dr. Brett Volmert, Yifan Liu, and A.K.M. Atique Ullah for their collaboration.
Last but not least, this acknowledgement would not be complete without mentioning my parents and family. Their substantial encouragement, affection, and support led me to rise above all the difficulties during my Ph.D. study.

TABLE OF CONTENTS

CHAPTER 1: Introduction ........ 1
CHAPTER 2: Super-resolution and Segmentation Deep Learning for Breast Cancer Histopathology Image Analysis ........ 3
CHAPTER 3: Recurrent and Convolution Neural Networks for Sequential Multispectral Optoacoustic Tomography (MSOT) Imaging ........ 32
CHAPTER 4: Multi-head Attention U-Net for MPI-CT Image Segmentation ........ 54
CHAPTER 5: Monocular Depth Estimation Based on Deep Learning for Intraoperative Guidance Using Surface-enhanced Raman Scattering (SERS) Imaging ........ 73
CHAPTER 6: Summary and future work ........ 95
BIBLIOGRAPHY ........ 97

CHAPTER 1: Introduction

1.1 Deep learning overview

Deep learning is a machine learning approach that utilizes multiple layers of data representation to effectively capture the unique features of the input data at different stages, and it has demonstrated exceptional performance in a wide range of applications such as image classification, image segmentation, natural language processing, data generation, etc. As a result, deep learning has been rapidly developed in recent years, encompassing both methodological construction and actual implementation. Indeed, deep learning employs computational models consisting of numerous layers of processing to acquire and represent data with higher levels of abstraction, and it can implicitly capture complex patterns in extensive datasets. The growing amount of data that can be gathered from biomedical and clinical sources requires advanced deep learning techniques, such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), attention mechanisms, and Transformer-based neural networks, to process and evaluate the data. Some examples of biomedical devices to which deep learning is commonly applied include computed tomography (CT), magnetic resonance imaging (MRI), magnetic particle imaging (MPI), ultrasound, photoacoustic tomography, optical microscopy and tomography, and so on. Specifically, this dissertation demonstrates deep learning for biophotonics and molecular imaging applications, which are multidisciplinary life sciences combining the principles of optics, photonics, and biology to investigate biological systems at the tissue, cellular, and molecular levels. The field of biophotonics is one of the essential parts of the development of unprecedented diagnostic and therapeutic approaches in the biomedical field; therefore, it has improved significantly over the decades, particularly through the use of deep learning techniques that empower biophotonics research by enabling advanced image analysis, improved image and signal processing, and the ability to comprehensively analyze biophotonics data.
1.2 Organization of the dissertation

This dissertation is divided into four main chapters covering four different modalities and applications, together with a final chapter addressing the summary and future research. Chapter 2 demonstrates approaches based on deep learning for super-resolution and segmentation of histology images. The two proposed deep learning models in this chapter were jointly trained to reach a joint optimization that performs both resolution enhancement and segmentation for breast cancer H&E images. In Chapter 3, a deep learning application for generating sequential Multispectral Optoacoustic Tomography (MSOT) images is presented. The aim of this work is to reduce the acquisition time without any hardware modifications. In this work, mice injected with ICG-conjugated superparamagnetic iron oxide nanoworm particles (NWs-ICG) were scanned under the MSOT system, providing three imaging modalities: photoacoustic, ultrasound, and NWs-ICG optoacoustic images. The proposed deep learning model can reduce the acquisition time of volumetric imaging for these three modalities. Chapter 4 shows the MPI signal segmentation of MPI-CT images, which is critically important for MPI quantification. This work proposes a novel architecture based on the U-Net architecture and attention mechanisms that can surpass other state-of-the-art models. Lastly, Chapter 5 shows the application of depth map estimation based on deep learning in tandem with surface-enhanced Raman scattering (SERS) for image-guided surgery applications. With depth information, SERS imaging is more practical for real clinical applications. The final chapter concludes the dissertation and discusses ongoing work related to biomedical applications as well as possible future work.

CHAPTER 2: Super-resolution and Segmentation Deep Learning for Breast Cancer Histopathology Image Analysis

Reprinted with permission from “A. Juhong, et al., “Super-resolution and Segmentation Deep learning for Breast Cancer Histopathology Image Analysis", Biomedical Optics Express, 14.1 (2023): 18-36” [1], © Optica Publishing Group.

Traditionally, a high-performance microscope with a large numerical aperture is required to acquire high-resolution images. However, the size of such images is typically tremendous. Therefore, they are not conveniently managed and transferred across a computer network or stored in a limited computer storage system. As a result, image compression is commonly used to reduce image size, resulting in poor image resolution. Here, we demonstrate custom convolution neural networks (CNNs) for both super-resolution image enhancement from low-resolution images and characterization of both cells and nuclei from hematoxylin and eosin (H&E) stained breast cancer histopathological images by using a combination of generator and discriminator networks, a so-called super-resolution generative adversarial network based on aggregated residual transformation (SRGAN-ResNeXt), to facilitate cancer diagnosis in low-resource settings. The results provide a high enhancement in image quality, where the peak signal-to-noise ratio and structural similarity of our network results are over 30 dB and 0.93, respectively. The derived performance is superior to the results obtained from both the bicubic interpolation and the well-known SRGAN deep-learning methods.
In addition, another custom CNN is used to perform image segmentation on the generated high-resolution breast cancer images derived with our model, with an average Intersection over Union of 0.869 and an average Dice Similarity Coefficient of 0.893 for the H&E image segmentation results. Finally, we propose the jointly trained SRGAN-ResNeXt and Inception U-net Models, which apply the weights from the individually trained SRGAN-ResNeXt and Inception U-net Models as the pre-trained weights for transfer learning. The jointly trained models' results are further improved and promising. We anticipate these custom CNNs can help resolve the inaccessibility of advanced microscopes or whole slide imaging (WSI) systems by acquiring high-resolution images from low-performance microscopes located in remote, resource-constrained settings.

2.1 Introduction

Pathology diagnosis is routine work usually performed by a skilled pathologist or cytologist. The diagnosis process begins with staining (typically hematoxylin and eosin or H&E) of a specimen on a glass slide and observing it under a high-resolution (HR) microscope. Typically, the diagnosis process could take up to 15-20 minutes per biopsy slide, which is very time-consuming. Pathologists must visually scan over a vast field of view to find any abnormalities on each slide. Therefore, whole slide imaging (WSI) has been introduced to solve this main problem [1]. WSI refers to scanning a complete microscope slide and creating a single high-resolution digital file. This is commonly achieved by capturing many small HR image tiles or strips and then montaging them to create a full image of a histological section. WSI equipped with pathological image diagnosis software is changing the workflow of many laboratories. Specimens on glass slides can now be transformed into HR digital files that can be efficiently stored, accessed, and analyzed. The latter is due to the advancement of computer vision and convolution neural network (CNN) algorithms in digital pathological image analysis [2, 3]. However, in resource-constrained settings, access to both HR microscopes and WSI is a crucial obstacle to delivering quality health care, frequently resulting in undertreatment and overtreatment of infectious diseases based on clinical assessment alone [4]. Laboratory infrastructure is typically clustered in urban settings and is relatively inaccessible in regions where significant portions of the affected population reside [5]. Many neglected diseases, in particular, are more prevalent in rural areas, far from these diagnostic centers [6]. Therefore, novel, simple, and inexpensive approaches to performing digital pathological diagnoses are needed in both clinical and public health environments. A potential solution is to provide software-based tools that help transform low-resolution (LR) images into either HR or super-resolution (SR) images. Due to the rapid development of computational technologies, deep-learning-based diagnosis has become a sought-after technique for digital pathology image analysis [2, 3]. Depending on the analysis, the techniques can be divided into supervised and unsupervised learning. Supervised learning aims to define a function that can map input images to their outputs or labels (normal cells, abnormal cells, cancer cells, and other parameters), as in classification or segmentation problems.
On the other hand, the purpose of unsupervised learning is to define a function that can extract latent features and structures from unlabeled data, as in clustering problems, dimensionality reduction, and super-resolution problems. Several studies use CNNs for nuclei segmentation [7-11]. Those methods can surpass traditional methods such as Otsu segmentation [12], the watershed method [13], and K-means clustering [14], since the traditional methods are sensitive to parameter settings and may only be effective for specific data types. CNN-based approaches have become practical tools for nuclei and cell segmentation tasks as they can achieve resounding success. HoverNet [15] is one of the effective CNNs for nuclei segmentation. The model predicts the horizontal and vertical distances between a nucleus centroid and its corresponding foreground pixels. Marker-controlled watershed is then applied as the post-processing method to obtain nucleus instances. However, the HoverNet results can be sensitive to noise in the distance maps because of the marker-controlled watershed. StarDIST [16] is another CNN for nuclei segmentation; it predicts centroid probability maps to localize the nuclei. The predicted centroids are applied to generate polygons to determine the boundary and the number of the cells. The downside of StarDIST is that polygons are only predicted using the centroid pixels' features. This results in a lack of contextual information for large nucleus instances and could affect prediction accuracy. CPP-Net [17] extends StarDIST by integrating rich contextual information from a sampled point set for each centroid pixel and applying a Shape-Aware Perceptual loss that constrains CPP-Net's predictions regarding the nucleus shape. The U-net architecture is a renowned convolution neural network architecture for image segmentation. It is widely used for biomedical image segmentation [18]. Its structure consists of simple convolution blocks, and skip connections are added from the encoder to the decoder. The U-net architecture allows global location and context to be used simultaneously, and it works with very few samples to improve the model performance. In addition, it is an end-to-end process for the entire image in the forward pass and directly generates the segmentation image. Its structure is also simple to modify or assemble with other models. Potentially, the performance of the U-net can be improved by using other effective convolution architectures to replace the simple convolution blocks. In recent years, CNNs have also been applied to super-resolution of biomedical images across a wide range of imaging modalities [19-25] such as fluorescence imaging, light-sheet imaging, and color imaging of pathological slides. However, those works employed the same concept as SRGAN [26], in which the generator is built using the ResNet architecture or residual structure [27]. Indeed, several architectures can surpass the residual structure, so exploring one of them and applying it to the generative adversarial network (GAN) is worthwhile. For instance, the DenseNet [28] network is applied as the backbone for SRGAN, namely ESRGAN [29], showing impressive results and surpassing the original SRGAN model. According to the Top-1 and Top-5 accuracy vs.
computational complexity testing reported in the benchmark analysis of representative deep learning neural network architectures [30], the ResNeXt CNN architecture can outperform state-of-the-art (SOTA) architectures such as ResNet, DenseNet, Inception, etc., even though the complexity of ResNeXt is somewhat lower than the others. Recently, deep learning techniques based on transformer architectures [31] have emerged as an alternative to CNN architectures since they can provide better results on large datasets. However, the transformer architectures are more complicated and require a high computational cost. If the model is excessively complicated, it will be challenging to build the jointly trained models to simultaneously update the weights of the joint models due to the restriction of computing resources (time, memory, speed, etc.). To overcome limitations in digital pathological diagnosis, we describe a novel method for transforming LR digital pathological images derived from low-cost microscopes to super-resolution (SR) images (equivalent to a 40x magnification) with a super-resolution generative adversarial convolution neural network technique based on the ResNeXt architecture [32] (SRGAN-ResNeXt) [22]. Most SRGAN deep learning works for biomedical image enhancement used a single residual network (ResNet) in each layer to capture and extract image features, while our deep learning used the ResNeXt architecture instead. Typically, the ResNet architecture can perform exceptionally well with very deep convolution layers since the skip connection in the ResNet adds the input information to the output of the convolution layers. Therefore, the output of ResNet contains the representative features from the convolution operation and the critical information from the original input. Moreover, the skip connection allows the gradient to effortlessly backpropagate and update the weights to minimize the loss value. However, the single residual block might be insufficient to capture all significant features. Therefore, to increase the model capability, we apply residual blocks in parallel (stacking blocks of the same topology) for each layer (ResNeXt architecture). Utilizing the ResNeXt architecture not only improves the feature capturing but also reduces the complexity of the model compared with making it deeper, since hyperparameters (width, filter sizes, etc.) are shared. This approach can provide considerable resolution enhancement for poor-quality images. Training the SRGAN-ResNeXt Model requires a dataset consisting of high-resolution images (ground truth) and corresponding low-resolution images. We used a commercial microscope (Nikon Eclipse Ci) to prepare a dataset for training this model. The peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM) were used to evaluate the generated images from our model, which are 32.92 dB and 0.93, respectively. These are promising results, as they are higher than the evaluation results of the original SRGAN Model trained on the same dataset (H&E images). Furthermore, we applied the Inception U-net Model [33], an improved U-net Model that uses the Inception architecture as a backbone in the U-net network, for H&E image segmentation. To train the Inception U-net Model, a large number of H&E images are required to be accurately masked on nuclei areas, which is very time-consuming. Thus, we used a dataset from a cancer imaging archive [34] to train our Inception U-net Model.
Our Inception U-net Model's Intersection over Union (IoU) and Dice Similarity Coefficient (DSC) are 0.869 and 0.893, respectively. Since the SRGAN-ResNeXt and Inception U-net Models were separately trained, the performance of both models could be improved by jointly training them, as the segmentation loss and the generator loss could be effectively backpropagated to update the weights of the generator model and the Inception U-net model with a joint optimization.

Figure 1 shows the overall workflow of the models. First, the breast tumor H&E slides were prepared on biopsy slides (Figure 1(a)-(b)) to be imaged with a 40x magnification (Figure 1(c)); then the quality of the acquired images was downgraded by downsampling and adding blurring noise. Therefore, the model has both the corresponding ground truth (high-resolution images) and low-resolution images for training the SRGAN-ResNeXt (Figure 1(d)-(f)). Eventually, the well-trained generator model from the SRGAN-ResNeXt (Figure 1(h)) was applied to an unseen low-resolution image (Figure 1(g)) to enhance its quality by generating the high-resolution image (Figure 1(i)). Furthermore, the generated high-resolution image could be characterized, as its resolution was substantially improved and it contained considerable detail, which was impossible before applying the model. In other words, our approach can tackle those low-resolution images by applying the Inception U-net Model (Figure 1(j)) to the generated high-resolution images (the output of the generator model from SRGAN-ResNeXt). As a result, the newly generated image can be segmented and quantified to characterize the nuclei's density, size, and morphology.

Figure 1. The workflow of super-resolution and segmentation deep learning. (a) Fresh breast tumor tissues. (b) The corresponding H&E stained tissue slides. (c) A commercial microscope (Nikon Eclipse Ci) for capturing the H&E stained tissue slide images. (d) High-resolution images acquired by the microscope. (e) Simulated low-resolution images. (f) The training SRGAN-ResNeXt network. (g) The unseen low-resolution image. (h) The generator model from SRGAN-ResNeXt. (i) The generated high-resolution image. (j) The Inception U-net Model for segmentation. (k) The segmented H&E image.

2.2 Methods

2.2.1 Proposed SRGAN-ResNeXt architecture

Here, we propose the SRGAN-ResNeXt architecture, built from scratch to synthesize super-resolution images from low-resolution images. The concept of the SRGAN-ResNeXt is similar to the traditional GAN, which consists of generator and discriminator models. The generator and discriminator models of our SRGAN-ResNeXt are depicted in Figure 2(a) and Figure 2(b), respectively. The generator model takes a low-resolution image as the input and generates a high-resolution image after passing through the convolution, ResNeXt, and upsampling layers. The discriminator model is utilized to distinguish the generated image from the ground-truth image by taking them as the input and providing a probability as the output. The ultimate goal of SRGAN-ResNeXt is to train the generator model to synthesize an image that can fool the discriminator completely. To achieve this, we need to design the generator model properly, use a large number of images as the dataset to train the models, and fine-tune the hyperparameters thoroughly. To train SRGAN-ResNeXt, we first trained the discriminator model by freezing the generator model. In the next step, we used an adversarial network to train the generator model; a minimal sketch of this alternating training scheme is shown below.
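The following is a hedged sketch of the alternating SRGAN training scheme described above, written in TensorFlow/Keras. The model-building helpers (build_generator, build_discriminator, build_adversarial) and the patch_dataset pipeline are hypothetical placeholders, and the content term is shown as a pixel MSE for brevity rather than the VGG19 feature loss used in this work.

```python
import numpy as np
import tensorflow as tf

# Hypothetical helpers assumed to exist elsewhere in the project.
generator = build_generator()          # LR patch -> SR patch
discriminator = build_discriminator()  # HR-sized patch -> real/fake probability

discriminator.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                      loss="binary_crossentropy")

# Combined adversarial model: generator followed by a frozen discriminator.
# Freezing after compiling the standalone discriminator keeps it trainable on its
# own but fixed inside the combined model (the usual Keras GAN pattern).
discriminator.trainable = False
adversarial = build_adversarial(generator, discriminator)
adversarial.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                    loss=["mse", "binary_crossentropy"],  # content + adversarial terms
                    loss_weights=[1.0, 0.5])              # adversarial weight C_w

for lr_batch, hr_batch in patch_dataset:                  # assumed (LR, HR) pairs
    # 1) Train the discriminator with the generator frozen.
    fake_hr = generator.predict(lr_batch, verbose=0)
    discriminator.train_on_batch(hr_batch, np.ones((len(hr_batch), 1)))
    discriminator.train_on_batch(fake_hr, np.zeros((len(fake_hr), 1)))

    # 2) Train the generator through the combined model (discriminator frozen),
    #    asking it both to match the HR target and to be classified as "real".
    adversarial.train_on_batch(lr_batch,
                               [hr_batch, np.ones((len(lr_batch), 1))])
```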
The adversarial network (Figure 2(c)) is a combination of models: the generator model, the discriminator model, and VGG19; the latter works as a feature extractor [35].

2.2.2 Generator model

The generator network is a deep convolution network containing a pre-residual layer, 16 parallel-residual layers (ResNeXt), a post-residual layer, two upsampling layers, and a final convolution layer, as shown in Figure 2(a). To assemble the generator model, the pre-residual block is the first block, which contains a single 2D convolution layer, and ReLU is used as the activation function. The second block is the 16 parallel-residual layers (ResNeXt architecture). Each convolution layer is followed by batch normalization with a momentum value of 0.8, and the activation function is also ReLU. For the ResNeXt block, the size of the transformation set, or the number of branches, is defined as the cardinality. Increasing the cardinality can improve the performance of the convolution neural network. However, an excessive cardinality could lead to expensive computation. Thus, we use a cardinality of eight for our generator model [Figure 2(a)], which is the optimal number for our task. The next block is the post-residual block, a simple convolution layer with batch normalization (momentum = 0.8). After that, the fourth block is the upsampling block, which has two sub-pixel convolution layers [36], upsampling the scale by four times. Lastly, the final convolution layer uses the Tanh activation function to form the generated image with R, G, and B color channels. To train the generator model, we need to use the joint model, which is the adversarial network [Figure 2(c)]. The discriminator and VGG19 models are untrainable during training of the generator model.

2.2.3 Discriminator model

The discriminator network [37] is a relatively simple convolution network, comprising eight convolutional layers and two fully connected layers, designed to evaluate the similarity between the ground truth and generated images. After each convolution block, a batch normalization layer is used, followed by an activation function, the Leaky ReLU function (α = 0.2). The number of 3x3 filter kernels increases by a factor of 2 from 64 (the first layer) to 512 (the eighth layer) kernels, similar to the VGG network. The last two layers are dense layers working as a classification block, predicting the probability of an image being either real or fake. We have to freeze the generator model, or make it untrainable, to train the discriminator model. The discriminator model learns remarkably faster than the generator model. Therefore, during training of the generator model, its learning progress must be slowed down, which is further discussed in the next section.

2.2.4 Loss functions

The perceptual loss function ($I^{SR}$) is highly significant to the performance of the generator model in the SRGAN-ResNeXt network. It is the weighted sum of a content loss (VGG19 loss, $I_X^{SR}$) and an adversarial loss (discriminator loss, $I_{Gen}^{SR}$), as shown in Equation (1):

$$I^{SR} = I_X^{SR} + C_w I_{Gen}^{SR} \qquad (1)$$

The generator exploits this loss function to optimize and update its trainable parameters. To achieve a well-trained generator model, the weight $C_w$ was assigned to the loss value from the discriminator model to slow down the learning progress, since the discriminator model can be trained faster than the generator model (a code sketch of this weighted loss is given below).
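The sketch below illustrates one plausible reading of this weighted perceptual loss (Equations (1)-(3)) in TensorFlow/Keras. The choice of VGG19 layer (block5_conv4), the omission of VGG input preprocessing, and the c_w_schedule helper are assumptions for illustration, not the exact implementation used in this dissertation.

```python
import tensorflow as tf
from tensorflow.keras.applications import VGG19

# Frozen VGG19 feature extractor; the layer choice is an illustrative assumption.
vgg = VGG19(include_top=False, weights="imagenet", input_shape=(256, 256, 3))
feature_extractor = tf.keras.Model(vgg.input, vgg.get_layer("block5_conv4").output)
feature_extractor.trainable = False

def content_loss(hr, sr):
    # Euclidean distance between VGG19 feature maps of the ground truth and the
    # generated image (Equation (2), up to the normalization constant).
    return tf.reduce_mean(tf.square(feature_extractor(hr) - feature_extractor(sr)))

def adversarial_loss(d_fake):
    # -log D(G(I_LR)) averaged over the batch (Equation (3)); d_fake holds the
    # discriminator probabilities for the generated images.
    return tf.reduce_mean(-tf.math.log(d_fake + 1e-8))

def perceptual_loss(hr, sr, d_fake, c_w):
    # Equation (1): I^SR = I_X^SR + C_w * I_Gen^SR.
    return content_loss(hr, sr) + c_w * adversarial_loss(d_fake)

def c_w_schedule(epoch):
    # Assumed reading of the schedule described in the text: start at 0.5 and
    # increase by 0.05 every 10,000 epochs (0.5 -> 0.7 over 50,000 epochs).
    return 0.5 + 0.05 * (epoch // 10_000)
```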
If the discriminator model performs excessively well at distinguishing between the generated image and the ground truth image, we would not be able to obtain an exceptional generator model, since the generated image cannot fool the discriminator model. In the original SRGAN training, $C_w$ is a constant for the whole learning process. However, in our model this weight started from 0.5 and increased by 0.05 every 10,000 epochs. Since the generator model gradually improves its performance and capability, we have to balance the performance of both the generator and discriminator models. The total number of epochs for training our model was 50,000. Therefore, $C_w$ varied from 0.5 to 0.7.

Figure 2. Super-resolution generative adversarial network based on ResNeXt (SRGAN-ResNeXt). (a) The architecture of the generator. (b) The architecture of the discriminator. (c) The combined models, the so-called adversarial model, for training the generator model.

Although using the pixel-wise mean square error (MSE) to compare the ground truth and the reconstructed image is undemanding to optimize, it returns a poor-quality image in terms of human perception. The output of the MSE is the average difference between the features of two images; therefore, it cannot capture high-dimensional features. However, the content loss or VGG loss ($I_X^{SR}$), defined as the Euclidean distance between the feature map of the generated image $G_{\theta_G}(I^{LR})$ and that of the ground truth $I^{HR}$, can help solve this problem. The $I_X^{SR}$ loss is based on the ReLU activation layers of the pre-trained 19-layer VGG network, and it can be calculated following Equation (2):

$$I_{VGG}^{SR} = \frac{1}{W_{i,j} H_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}} \left( \phi_{i,j}(I^{HR})_{x,y} - \phi_{i,j}(G_{\theta_G}(I^{LR}))_{x,y} \right)^2 \qquad (2)$$

where $W_{i,j}$ and $H_{i,j}$ describe the dimensions of the respective feature maps within the VGG network. The feature map $\phi_{i,j}$ is obtained from the j-th convolution before the i-th max-pooling layer within the VGG19 network. Apart from using a feature map for the VGG loss, the adversarial loss ($I_{Gen}^{SR}$) is also employed to differentiate the similarity of the two images. It is defined on the probabilities, varying from 0 to 1, returned by the discriminator model ($D_{\theta_D}(G_{\theta_G}(I^{LR}))$), as shown in Equation (3):

$$I_{Gen}^{SR} = \sum_{n=1}^{N} -\log D_{\theta_D}(G_{\theta_G}(I^{LR})) \qquad (3)$$

The perceptual loss effectively leverages the combination of these two loss functions to train a generator model that can generate highly detailed images.

2.2.5 Dataset for training the SRGAN-ResNeXt Model

To obtain breast cancer H&E images, female MUC1 double-transgenic mice with breast tumors [38] were euthanized and their tumors were sent out to the histopathology lab (MSU-IHPL Research facility) to prepare the H&E stained breast tumor slides. All procedures performed on animals were approved by the University's Institutional Animal Care & Use Committee (AUF 06/18-082-00) and were within the guidelines of humane care of laboratory animals. Four tumor mice were euthanized, and a tumor from each mouse was surgically removed to prepare four different tumor H&E slides. The H&E slides were then imaged by the commercial microscope (Nikon Eclipse Ci) with 40x magnification to prepare the dataset for training SRGAN-ResNeXt. The size of each whole slide image is greater than 80,000 x 80,000 pixels, and image patches with a size of 256 x 256 pixels were extracted from each whole slide image with a 50% overlapping area. Data augmentation was applied to these extracted image patches (a simple sketch of this patch extraction is shown below).
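As an illustration of this tiling step, the sketch below extracts 256 x 256 patches with 50% overlap and applies simple flip/rotation augmentation. It assumes the image fits in memory (real slides of over 80,000 x 80,000 pixels would be read region by region with a WSI library), and the file name and augmentation choices are assumptions rather than the exact ones used here.

```python
import cv2
import numpy as np

def extract_patches(image, patch=256, overlap=0.5):
    """Tile an H&E image into patch x patch crops with the given fractional overlap."""
    step = int(patch * (1.0 - overlap))          # 128-pixel stride for 50% overlap
    patches = []
    for y in range(0, image.shape[0] - patch + 1, step):
        for x in range(0, image.shape[1] - patch + 1, step):
            patches.append(image[y:y + patch, x:x + patch])
    return np.stack(patches)

def augment(patch):
    # Minimal augmentation example: random horizontal flip and 90-degree rotation.
    if np.random.rand() < 0.5:
        patch = np.fliplr(patch)
    return np.rot90(patch, k=np.random.randint(4))

slide = cv2.imread("breast_tumor_region.png")    # hypothetical pre-cropped region
hr_patches = extract_patches(slide)
hr_patches = np.stack([augment(p) for p in hr_patches])
```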
The total number of image patches, including the augmented images, is over 13,000 images, which were used for training only. To prepare the low-resolution images, we downsampled the original high-resolution image patches by a factor of 4 and added blurring noise using the normalized box filter with the kernel shown in Equation (4) below. We increased the kernel size until we could not discriminate the nuclei boundaries and the simulated low-resolution images were even worse than some native low-resolution images.

$$K = \frac{1}{\mathrm{ksize.width} \times \mathrm{ksize.height}} \begin{bmatrix} 1 & \cdots & 1 \\ \vdots & \ddots & \vdots \\ 1 & \cdots & 1 \end{bmatrix} \qquad (4)$$

where K is the normalized box blurring filter, ksize.width is the kernel width, and ksize.height is the kernel height. Figure 3(a) shows the cropping area from the large-FOV H&E image. Figure 3(b) shows the small patches that were cropped from the large-FOV image.

Figure 3. Dataset preparation for training SRGAN-ResNeXt, cropped images with a 50% overlapping area. (a) Large field of view H&E image. (b) The small patches of the large image (a) with a 50% overlapping area.

2.2.6 The Inception U-net architecture

Conventional CNNs for image segmentation tasks have two main components: an encoder and a decoder. Similarly, the U-net architecture has these two parts, but the skip connection is the crucial mechanism that allows U-net to surpass the conventional methods and perform better. This concept is akin to the residual block in that the input (encoder part) is concatenated with the output (decoder part) at the same dimension. However, each layer of the original U-net architecture is a simple convolution block, which might be insufficient to extract some crucial information. For this reason, the Inception architecture [39] was applied to improve the capability of the U-net Model. The Inception architecture uses a wide range of kernel sizes on the same input to simultaneously extract global and local features. A larger kernel size is suitable for information distributed globally, whereas a smaller kernel size is appropriate for information distributed locally. Consequently, the Inception CNN architecture can perform satisfactorily in extracting features from the data. Here, we applied four different kernel sizes in the Inception blocks of our U-net Model, as shown in Figure 4 below, by replacing each convolution block in the original U-net architecture with Inception blocks.

Figure 4. Inception U-net architecture for H&E image segmentation. Every single blue box corresponds to a multi-channel feature map. The value over the boxes represents the number of channels.

Figure 4 illustrates the Inception U-net architecture. The first part is the encoder (the left side of Figure 4), where Inception convolution blocks are utilized instead of simple convolution blocks. All Inception blocks in this part consist of parallel filters of different sizes (3x3, 5x5, and 1x1) (Inception structure), followed by a rectified linear unit (ReLU) and a 2x2 max-pooling operation with a stride of 2 for downsampling, and this process is repeated. The number of feature channels is doubled at each downsampling step. The second part is the decoder (the right side of Figure 4). It consists of feature map upsampling followed by a 2x2 up-convolution (halving the number of feature channels), a corresponding concatenation from the encoder part, and Inception blocks. The ReLU activation is used for each block (a minimal sketch of such an Inception block is given below).
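A minimal Keras sketch of the parallel-kernel Inception block that replaces the plain U-net convolutions is given below. The branch widths and the pooled branch are illustrative assumptions; they are not claimed to match the exact block used in this chapter.

```python
from tensorflow.keras import layers

def inception_block(x, filters):
    """Parallel 1x1, 3x3, and 5x5 convolutions (plus a pooled branch) concatenated
    along the channel axis, so global and local features are extracted side by side."""
    b1 = layers.Conv2D(filters, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    b5 = layers.Conv2D(filters, 5, padding="same", activation="relu")(x)
    bp = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    bp = layers.Conv2D(filters, 1, padding="same", activation="relu")(bp)
    return layers.concatenate([b1, b3, b5, bp])

def encoder_stage(x, filters):
    # One encoder step of the Inception U-net: Inception block, then 2x2 max pooling
    # with stride 2; the pre-pooling tensor is kept for the skip connection.
    skip = inception_block(x, filters)
    return layers.MaxPooling2D(2)(skip), skip
```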
The H&E images 17 and their corresponding segmentation masks are implemented to train this model as input and output, respectively. The loss function for U-net is a mean squared error (MSE) function as shown in Equation (5) shown below as 𝑀𝑆𝐸 = 1 𝑁 ∑ (𝑦𝑖 − 𝑦̂𝑖)2 𝑁 𝑖=1 , (5) where the MSE is the average of the squared differences between ground truth (𝑦𝑖) and predicted value from our model (𝑦̂𝑖) and N is the number of samples. 2.2.7 Data set for training the segmentation models Since image segmentation is a supervised task, the outputs or targets need to be labeled, which is expensive and time-consuming. Fortunately, several datasets provide the H&E images and their corresponding nuclei masks. Here, we used the dataset from the cancer imaging archive[34]. This dataset provides nucleus segmentation for the whole cancer slide over 1,000 images in the cancer genome atlas (TCGA) repository. These images are from 10 different cancer types such as bladder urothelial carcinoma (BLCA), invasive breast carcinoma (BRCA), cervical squamous cell carcinoma, and endocervical adenocarcinoma (CESC). 2.2.8 Jointly trained SRGAN-ResNeXt and Inception U-net Models The SRGAN-ResNeXt and Inception U-net Models were jointly trained by using the separately trained weights of the SRGAN-ResNeXt Model and the Inception U-net Model as the pre-trained weights for transfer learning. Figure 5(a) shows the joint models for training the generator model. The conception of the jointly trained generator (JTG) Model is akin to the adversarial model shown in Figure 2(c). Still, the JTG Model employs not only the content loss (returned by the VGG19 Model) and the adversarial loss (returned by the discriminator model) but also the segmentation loss of the generated high-resolution image and ground truth high-resolution image (returned by 18 the jointly trained Inception U-net). The combined loss of the JTG Model is shown in Equation (6) as 𝐼𝐽𝐺 = 𝐼𝑋 𝑆𝑅 + 𝐶𝑤𝐼𝐺𝑒𝑛 𝑆𝑅𝑆 , 𝑆𝑅 + 𝐶𝑤2𝐼𝐺𝑒𝑛𝑆 (6) Where 𝐼𝐽𝐺is the combined loss of the jointly trained generator model, IX SR is the content loss (VGG19 loss), IX SR is the adversarial loss (Discriminator loss), 𝐼𝐺𝑒𝑛𝑆 𝑆𝑅𝑆 is the segmentation loss (Jointly trained Inception U-net loss), and 𝐶𝑤 & 𝐶𝑤2 are hyperparameters. The VGG19 Model, the discriminator model, and the jointly trained Inception U-net Model are fixed as untrainable during training the JTG Model. The jointly trained Inception U-net (JTIU) Model was trained using the generated high-resolution image (returned by the JTG Model) and the ground truth of the high-resolution image as the model’s inputs. The outputs of both inputs have the same ground truth to calculate the loss value. Therefore, the JTIU can learn how to generate the same quality segmentation image from both generated high-resolution images and native high-resolution images. During training the JTIU Model, the JTG Model was fixed as well. 19 Figure 5. Jointly trained SRGAN-ResNeXt Model and Inception U-net Model. (a) The assembled models for the jointly trained generator (JTG) Model. (b) The assembled models for the jointly trained Inception U-net (JTIU) Model. 2.2.9 Data set for the jointly trained Models Two other tumor mice were sacrificed, and a tumor of each mouse was prepared for H&E slides. Therefore, we have two tumor H&E slides from different mice for training the jointly trained models. The 220 image patches with a size of 256 x 256 pixels were randomly extracted from these H&E slides (110 patches per slide). 
210 and 10 patches were used for training and testing, respectively. Each image patch was manually labeled for the segmentation ground truth. Thus, this dataset contains low-resolution, high-resolution, and segmentation images.

2.2.10 Training implementations

The separately trained SRGAN-ResNeXt and Inception U-net models were trained on Google Colaboratory-Pro (or Google Colab-pro) and implemented on a computer with a 9th Gen Intel Core i7-9750H CPU, 16 GB RAM, and an NVIDIA RTX 2060 graphics card. Since the jointly trained models require more resources for training due to the combination of several models, they were trained on Google Colaboratory-Pro+ (Google Colab Pro+), which provides faster GPUs and significantly more memory than Google Colab-pro.

2.3 Results and discussion

2.3.1 Super high-resolution image reconstruction and segmentation

The goal of SRGAN-ResNeXt is to have a well-trained generator model to reconstruct high-resolution images. We could not feed the large image into the generator model due to the computation restriction during implementation. Therefore, the large images were divided into several small images. Furthermore, an overlapping area between these divided images was required to stitch them back together to obtain the same field of view (FOV) as the original large image. Figure 6 shows the results of applying both the SRGAN-ResNeXt and the Inception U-net Models to a breast tumor H&E image. Figures 6(a1), 6(b1), and 6(c1) are small patches of the whole slide image from different areas. All these small images were downscaled and blurring noise was added, as shown in Figures 6(a2), 6(b2), and 6(c2). The SRGAN-ResNeXt Model was employed to enhance these low-resolution images by synthesizing high-resolution images (Figures 6(a3), 6(b3), and 6(c3)). The Inception U-net was then applied to these generated high-resolution images for segmentation (Figures 6(a4), 6(b4), and 6(c4)).

Figure 6. The whole slide image (WSI) of a breast tumor H&E slide and the result of our deep learning model. (a1, b1, and c1) The high-resolution images of the WSI from different areas. (a2, b2, and c2) The low-resolution images. (a3, b3, and c3) The reconstructed high-resolution images using our deep learning model (SRGAN-ResNeXt). (a4, b4, and c4) The corresponding nuclei segmentation of (a3, b3, and c3) using the Inception U-net Model.

Figures 7(a1) and 7(b1) show the low-resolution image and the enhanced-resolution image generated by the SRGAN-ResNeXt model, respectively. They were fed into the Inception U-net Model for nuclei segmentation. Figure 7(a2) shows the segmentation result of the low-resolution image and Figure 7(b2) shows the segmentation result of the enhanced image. It is relatively demanding to perform image segmentation on the low-resolution image without enhancing its resolution first. The CNNs cannot extract meaningful features from the blurry pixels, resulting in unsatisfactory segmentation performance. The mean square errors (MSE) of the blurry images and the generated high-resolution images are 21.24 and 2.75, respectively. The MSE of the blurry image is significantly higher than that of the generated high-resolution image. To circumvent this issue, we propose to apply the SRGAN-ResNeXt Model to improve the poor-quality image before characterizing or performing segmentation to obtain better results. Figures 7(c1) and 7(c2) show the ground truth for the high-resolution image and the segmentation image, respectively.

Figure 7.
The H&E image segmentation of the low-resolution image and the enhanced- resolution image. (a1-a2) The low-resolution image and its segmentation image (output of the Inception U-net). (b1-b2) The enhanced-resolution image (output of the SRGAN-ResNeXt) and its segmentation image (output of the Inception U-net). (c1-c2) The ground truth of the high- resolution image and the segmentation image. (d) Ground truth preparation for both of the high- resolution image and the segmented image. 2.3.2. Performance of the SRGAN-ResNeXt Model Peak signal to noise ratio (PSNR) is one of the ubiquitous methods used to quantify the quality of the generated image compared to the original image (ground truth) [31]. It is a ratio between the maximum possible power of a signal and the power of distorting noise, affecting its representation quality. The higher the PSNR, the better the quality of the generated image. To compute the PSNR, we have to calculate the mean squire error (MSE) first and use the Equation (7) below to define PSNR as 𝑃𝑆𝑁𝑅 = 20𝑙𝑜𝑔10( 𝑀𝐴𝑋𝑓 √𝑀𝑆𝐸 ) . The MSE is defined as the following 𝑀𝑆𝐸 = 1 𝑚𝑛 ∑ 𝑚−1 0 ∑ ‖𝑓(𝑖, 𝑗) − 𝑔(𝑖, 𝑗)‖2 𝑛−1 0 , Where f is the matrix data of the ground truth, g is the matrix data of the generated image, m is the number of rows of pixels of the images, 23 (7) (8) i represents the index of that row, n is the number of columns of pixels of the image, j represents the index of that column, and 𝑀𝐴𝑋𝑓 is the maximum signal value that exists in our ground truth. Structural similarity index measure (SSIM) is a perception-based model. It considers image distortion in terms of perceived change structural information (loss of correlation, luminance (2μ𝑥μ𝑦+c1)(2σ+c2) 2 +c1)(σ𝑥 2+μ𝑦 2+σ𝑦 , 2 +c2) (μ𝑥 (9) distortion, and contrast distortion) [40]. 𝑆𝑆𝐼𝑀 (𝑥, 𝑦) = Where μ𝑥 denotes the average of x, μ𝑦 denotes the average of y, 2 denotes the variance of x, σ𝑥 2denotes the variance of y, σ𝑦 σ denotes the covariance of x and y, and c1 and c2 are two variables to stabilize the division with a weak denominator. Here, we calculated the PSNR [dB] and SSIM index between the generated images reconstructed by our model and high-resolution images (ground truth) by using data from two different H&E breast cancer slides, which are not used to train the model (unseen data). For each slide, we used the random 54 small low-resolution images with a size of 64x64 pixels to reconstruct high- resolution images with a size of 256x 256 pixels compared to the ground. The results of PSNR/SSIM are shown in Table 1 below. In order to compare the performance of the generator models with different backbone architectures (ResNet (original SRGAN), Transformer, DenseNet, and ResNeXt), we trained them with the same dataset we acquired from the breast cancer H&E 24 slides. The proposed model can provide better results, which the average PSNR/SSIM of the data from both H&E slides is over 30 dB/0.92, whereas the average result from the traditional method (Bicubic interpolation), the typical SRGAN, SRGAN-DenseNet, and SRGAN-Transformer are 24.10 dB/0.848, 27.51 dB/0.915, 27.55 dB/0.93, and 18.50 dB/0.69, respectively. Table 1. PSNR/SSIM compares results between the generated high-resolution images and the ground truth (realistic high-resolution images) from the testing dataset. 
Method | Breast cancer 1, 40x | Breast cancer 2, 40x | Average PSNR/SSIM
Bicubic interpolation | 24.13 dB/0.84 | 24.07 dB/0.86 | 24.10 dB/0.85
SRGAN Model | 27.18 dB/0.92 | 27.84 dB/0.91 | 27.51 dB/0.915
SRGAN-DenseNet | 27.96 dB/0.93 | 27.15 dB/0.93 | 27.55 dB/0.93
SRGAN-Transformer | 18.68 dB/0.69 | 18.33 dB/0.68 | 18.50 dB/0.69
Our model (SRGAN-ResNeXt) | 32.34 dB/0.93 | 31.92 dB/0.93 | 32.13 dB/0.93
Ground truth (high-resolution image) | ∞/1 | ∞/1 | ∞/1

Figure 8 compares the reconstruction results of the typical SRGAN, SRGAN-Transformer, SRGAN-DenseNet, and our SRGAN-ResNeXt. Figures 8(a) and 8(b) illustrate the original high-resolution (ground truth) breast tumor H&E image and the bicubic interpolation of a low-resolution image, respectively. Figures 8(c), 8(d), 8(e), and 8(f) show the generated high-resolution H&E images reconstructed by the traditional SRGAN, the SRGAN-Transformer, the SRGAN-DenseNet, and our SRGAN-ResNeXt, respectively. The contrast of some areas of the SRGAN-DenseNet results looks slightly better than the SRGAN and SRGAN-ResNeXt results. However, some small details of the SRGAN-DenseNet results are missing, as shown in Figure 8(g) and pointed out by the red arrows. The SRGAN-Transformer cannot surpass the CNN-based SRGANs when trained with our limited custom dataset and computational resources. The model based on the Transformer architecture can potentially overcome the CNN models if the dataset is sufficiently large and the computational resources are powerful enough to increase the model complexity (increasing the number of attention heads, Transformer encoders, multilayer perceptrons, etc.).

Figure 8. Comparison of the results of our deep-learning model based on ResNeXt against bicubic interpolation of the low-resolution image, SRGAN, SRGAN-Transformer, and SRGAN-DenseNet. (a) The original ground truth image. (b) Bicubic interpolation of the low-resolution image. (c) The SRGAN result. (d) The SRGAN-Transformer result. (e) The SRGAN-DenseNet result. (f) Our model result. (g1-g6) Enlarged images in the red boxes from (a-f), respectively. (h1-h6) Enlarged images in the yellow boxes from (a-f), respectively.

2.3.3 Performance of the Inception U-net architecture

Intersection over Union (IoU), also known as the Jaccard index, is the benchmark used to evaluate the similarity between a predicted segmentation area and its labeled area (ground truth) [41]. The concept of the IoU is to measure the number of pixels common to the target and prediction masks (intersection) divided by the total number of pixels present across both the prediction mask and the ground truth (union), as shown in the equation below:

$$\mathrm{IoU} = \frac{\mathrm{target} \cap \mathrm{prediction}}{\mathrm{target} \cup \mathrm{prediction}} \qquad (10)$$

The IoU ranges from 0 to 1 (0-100%), with 0 indicating that there is no overlapping area, whereas 1 indicates a perfectly overlapping area. The Dice similarity coefficient (DSC) is another well-known parameter used to evaluate the similarity between the predicted area (our output) and the ground truth [32]. The DSC can be calculated following the equation below:

$$\mathrm{DSC} = \frac{2\,|X \cap Y|}{|X| + |Y|} \qquad (11)$$

It is remarkably similar to the IoU; they are positively correlated. The unseen H&E cancer images from the cancer imaging archive [34] were used to evaluate the performance of our Inception U-net and the typical U-net Models. Table 2 shows that the IoU and DSC of the Inception U-net Model are higher than those of the U-net Model (a short sketch of how these two metrics are computed is given below).
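For reference, a small NumPy sketch of Equations (10) and (11) on binary masks is shown below; the toy 2x2 example at the end is illustrative only.

```python
import numpy as np

def iou(pred, target):
    """Equation (10): intersection over union of two binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    union = np.logical_or(pred, target).sum()
    return np.logical_and(pred, target).sum() / union if union else 1.0

def dice(pred, target):
    """Equation (11): Dice similarity coefficient of two binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    denom = pred.sum() + target.sum()
    return 2.0 * np.logical_and(pred, target).sum() / denom if denom else 1.0

# Toy example: 1 overlapping pixel out of 2 predicted and 2 true pixels gives
# IoU = 1/3 and DSC = 0.5, illustrating that DSC is always >= IoU.
p = np.array([[1, 1], [0, 0]])
t = np.array([[1, 0], [1, 0]])
print(iou(p, t), dice(p, t))  # 0.333..., 0.5
```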
According to this result, the Inception U-net Model can surpass the original U-net Model by using the Inception architecture as the core structure instead of simple convolution blocks. Although the Inception U-net only slightly surpasses the original U-net, these improvements can have a tremendous impact on histopathology analyses: because histopathology image analysis must be performed over the vast area of an H&E whole slide image, the small accurate and inaccurate nucleus segmentations in each small patch accumulate and lead to correct or incorrect diagnosis results. For example, one of the criteria used to determine tumor stages is the density of inflammatory cells, and the segmentation area can be used to determine it. Suppose there is a small error in the segmentation of inflammatory cells in every small H&E image patch. In that case, the total number of inflammatory cells on the whole slide image might be less accurate than the actual one, so a pathologist could wrongly diagnose the tumor stage.

Table 2. The comparison of tumor cell nuclei segmentation performance using the U-net and Inception U-net architectures.

Model | IoU/Jaccard index | DSC/F1 score
U-net | 0.720 | 0.875
Inception U-net | 0.869 | 0.893

Figure 9. Comparison results between the traditional U-net and Inception U-net using H&E images and ground truth from the dataset [34]. (a) A low-density-of-nuclei H&E image. (b) A high-density-of-nuclei H&E image. The results from both models have been color coded such that green denotes false negative, yellow denotes true positive, and red denotes false positive pixels.

2.3.4 Performance of the jointly trained SRGAN-ResNeXt and Inception U-net Models

After jointly training the SRGAN-ResNeXt and Inception U-net Models on another unseen dataset, the performance of the ResNeXt generator was only slightly improved due to the limited amount of data (220 patches). Still, the performance of the Inception U-net was considerably enhanced, as shown in Figure 10, Table 3, and Table 4 below.

Figure 10. The improvement of the SRGAN-ResNeXt and Inception U-net after training them jointly. (a) Low-resolution image input. (b1-b2) The ResNeXt generator and Inception U-net models' results. (c1-c2) The jointly trained models' results. (d1-d2) High-resolution and segmentation ground truth images.

Table 3 and Table 4 show the performance improvement of the jointly trained SRGAN-ResNeXt and Inception U-net Models, respectively. Since the jointly trained models require a dataset that contains not only low-resolution and high-resolution images but also the corresponding segmentation masks, preparing a large dataset is expensive. Although the joint models were trained on a small dataset (220 patches from two different tumor mice), the results look promising. The performance of the jointly trained models can potentially be improved by training them on a larger dataset.

Table 3. PSNR/SSIM comparison between the generated high-resolution images and the ground truth (real high-resolution images) for the SRGAN-ResNeXt model and the jointly trained SRGAN-ResNeXt model.

Model | PSNR/SSIM
SRGAN-ResNeXt | 31.56 dB/0.91
Jointly trained SRGAN-ResNeXt | 31.63 dB/0.92

Table 4. The comparison of tumor cell nuclei segmentation performance using the Inception U-net and the jointly trained Inception U-net.
Model | IoU/Jaccard index | DSC/F1 score
Inception U-net | 0.50 | 0.75
Jointly trained Inception U-net | 0.84 | 0.91

2.4 Conclusion

In this work, we demonstrated a practical approach to enhancing low-resolution H&E stained images by using the state-of-the-art SRGAN-ResNeXt network. The model can deeply learn how to map the low-resolution images to their corresponding high-resolution images. Even though cell images contain sophisticated patterns and structures, the SRGAN-ResNeXt Model can still provide high-quality reconstruction results. Moreover, it can outperform the original SRGAN Model. Therefore, we take advantage of this to characterize and quantify the nuclei from the generated high-resolution images. The nuclei from those generated images were segmented using another neural network, the Inception U-net architecture. Since we have generated both high-resolution H&E images and their nuclei segmentation, we can derive the nuclei area, pixel intensity, and other essential parameters to assist pathologists' diagnoses. If the resolution of the H&E images is poor and unfavorable, the characterization could be inaccurate, leading to misdiagnosis. Moreover, the individually well-trained weights of the SRGAN-ResNeXt and Inception U-net Models can be applied as the pre-trained weights (transfer learning) for the jointly trained SRGAN-ResNeXt and Inception U-net Models. The performance of the jointly trained models is noticeably improved and promising. We anticipate this work can be applied in broad applications such as recovering image quality from compressed archived images transferred across data networks and enhancing image quality from a low-cost microscope. For the latter, these custom CNNs can help solve the inaccessibility of advanced microscopes by acquiring high-resolution images from low-performance microscopes located in remote clinical settings in developing nations. In future work, we intend to apply the proposed CNNs to decrease the image acquisition time of a WSI H&E scanner, which typically uses a high-NA objective lens in combination with a slow scan to acquire a high-resolution image.

CHAPTER 3: Recurrent and Convolution Neural Networks for Sequential Multispectral Optoacoustic Tomography (MSOT) Imaging

Reprinted with permission from “A. Juhong, et al., “Recurrent and Convolutional Neural Networks for Sequential Multispectral Optoacoustic Tomography (MSOT) Imaging", Journal of Biophotonics, 16, no.11 (2023): e202300142” [42], © 2023 The Authors, Journal of Biophotonics published by Wiley-VCH GmbH.

Volumetric optoacoustic imaging is a beneficial technique for diagnosing and analyzing biological samples since it provides meticulous details of anatomy and physiology. However, acquiring volumetric images with high through-plane resolution is time-consuming, requiring a precise motorized stage to move samples under the optoacoustic system along the z-axis. Here, we propose deep learning based on hybrid recurrent and convolution neural networks to generate sequential cross-sectional optoacoustic images. A multispectral optoacoustic tomography (MSOT) system was utilized to acquire the dataset from breast tumors for training our deep learning model. This system can simultaneously acquire the sequential images (cross-sectional images) of MSOT and ultrasound. Furthermore, it provides a spectral unmixing algorithm applied to the MSOT images for extracting the sequential images of a specific exogenous contrast agent.
This study used ICG-conjugated superparamagnetic iron oxide nanoworm particles (NWs-ICG) as the contrast agent. Our deep learning model applies to all three modalities (multispectral optoacoustic imaging at a specific wavelength, ultrasound, and NWs-ICG optoacoustic imaging). The generated 2D sequential images were compared to the ground truth 2D sequential images acquired using a small step size. The results for these three modalities achieve excellent image quality, where the average peak signal-to-noise ratio and the sum of absolute errors between the ground truths and the generated images are over 75 dB and less than 2,000, respectively. Instead of acquiring seven images with a step size of 0.1 mm, we can acquire two images with a step size of 0.6 mm as the input images for the proposed deep learning model. The deep learning model can then generate or interpolate the other five images with a step size of 0.1 mm between these two input images, meaning we can save acquisition time by approximately 71% (five of the seven cross-sections no longer need to be acquired, and 5/7 ≈ 71%).

3.1 Introduction

Multispectral Optoacoustic Tomography (MSOT) is an in vivo optical imaging modality for the fields of molecular, anatomical, and functional imaging [43, 44]. The principle of MSOT is based on the optoacoustic effect, i.e., a molecule is excited by an ultra-short laser pulse, which can penetrate several centimeters through tissue [45, 46], resulting in thermoelastic expansion surrounding the molecule that generates a photoacoustic wave [47]. The ultrasound transducer is then used to detect this wave as an ultrasound signal. The difference in absorption contrast of tissue in single-wavelength images is employed to reconstruct anatomical images. Using multiple wavelengths to excite the tissue, we can obtain multispectral images from intrinsic and extrinsic signals. A laser between 680 nm and 980 nm is the predominant source for intrinsic signals such as deoxygenated hemoglobin, oxygenated hemoglobin, melanin, myoglobin, bilirubin, fat, etc. Extrinsic signals do not usually occur in cells, tissue, or animals. Agents that can absorb in the near-infrared (NIR) range, such as indocyanine green, fluorescent proteins, nanoparticles, etc., can increase the optoacoustic signal (extrinsic signal). Thus, they can be distinguished from intrinsic tissue background signals by using effective spectral unmixing algorithms such as linear regression, guided independent component analysis (ICA), and principal component analysis (PCA) [48, 49]. MSOT is widely used for several types of studies such as cancer research [50-54], drug development [55, 56], and nanoparticle studies [57-60]. However, using multiwavelength excitation to scan the sample is time-consuming, especially cross-sectional scanning for 3D image reconstruction. Imaging needs to sweep all the wavelengths at every single scanning position.
For instance, deep learning for automatic segmentation of optoacoustic ultrasound (OPUS) images [76] used the U-net architecture [18] to perform the image segmentation. U-net is a well-known convolution neural network (CNN) architecture for image segmentation, particularly biomedical images [77- 80]. Nevertheless, there are no techniques based on deep learning to reduce the acquisition time of cross-sectional scanning for 3D photoacoustic imaging. Herein, we propose the hybrid architecture of convolution neural network (CNN) and recurrent neural network (RNN) for generating sequential optoacoustic, unmixed optoacoustic of a specific contrast agent, and ultrasound images to extend the stack of cross-sectional images and reduce acquisition time by approximately 71%. This hybrid architecture is called Inception Generator Long Sort-Term Memory (I-Gen-LSTM). The Inception Generator is a CNN model designed based on the Inception U-net architecture. Inception is a convolution layer [81] that convolves the input in parallel with different kernel sizes extracting more features than a simple convolution layer. RNN is a robust and effective approach for sequential problems. It is a feed-forward neural network with internal memory and performs the same function for every data input. In addition, the output of 34 the current input depends upon the previous output. However, the original RNN has drawbacks regarding exploding and vanishing gradients from backpropagation to update weights, particularly long sequential inputs. Long Short-Term Memory (LSTM) networks [82] are improved RNN networks capable of learning long-term dependencies by adding a forget gate, input gate, and output gate. Therefore, we leverage Inception Generator and LSTM networks to generate sequential images. Our results demonstrate that the I-Gen-LSTM model is a versatile method that can generate not only sequential optoacoustic images but also sequential unmixed optoacoustic and ultrasound images. 3.2 Methods 3.2.1 Data acquisition A commercial multispectral optoacoustic tomography (MSOT) system (inVision 512-echo, iThera Medical GmbH, Munich, Germany) was used to acquire the data for training the I-Gen-LSTM model. The MSOT system has a 270-degree ultrasound transducer tomographic array, which can acquire signals from multiple angles around an object. This tomographic array enables the system for imaging complex shapes since it can capture 2-dimensional signals in the imaging plane. Figure 11(a) shows the detection and illumination geometry in the imaging chamber of the MSOT system. In addition, this system provides a tunable laser with a range of 660-1,300 nm, which is particularly suitable for most biological samples. The excitation pulse laser is used to illuminate the sample. The sample absorbs this pulse and converts it to heat, which results in a transient thermoelastic expansion that generates an acoustic wave. The ultrasound transducer is then used to detect this acoustic wave, and the back-projection algorithm [83] is applied to the detected optoacoustic wave to reconstruct the images. For the dataset preparation, transgenic mice [84] with breast tumors were intravenously injected with indocyanine green (ICG)-conjugated superparamagnetic iron 35 oxide nanoworms (NWs-ICG) [85], which accumulate in tumors longer than pure ICG through the enhanced permeability and retention (EPR) effect [86]. Twenty-four hours after injection, the mice were euthanized and the tumors were removed and dissected for this study. 
All procedures performed on animals were approved by the University’s Institutional Animal Care & Use Committee and were within the guidelines of humane care of laboratory animals. To acquire images of the tumors, 4 mg of agarose powder was dissolved in 40 mL of warm deionized water. The breast tumor was put in this dissolved agarose solution, allowing approximately 15 minutes for the solution to solidify. The hardened agarose with the tumor inside shown in Figure 11(b), was grasped by the holder and then scanned by the inVision MSOT system with the excitation pulse at wavelengths from 800 nm to 1000 nm (a comprehensive range of the NWs-ICG study). Since the inVision MSOT system can provide corresponding ultrasound images, NWs-ICG optoacoustic images obtained through linear spectral unmixing algorithm [87], and each single- wavelength optoacoustic image, these three imaging modalities were simultaneously acquired in every scanning position. Figure 11(d1-d4) shows the ultrasound images of the breast tumor with different scanning positions, Figure 11(e1-e4) shows the corresponding NWs-ICG optoacoustic images reconstructed from multispectral optoacoustic imaging with the excitation pulse at wavelengths from 800 to 1,000 nm by using the multispectral unmixing algorithm; Figure 11(f1- f4) shows the corresponding single-wave optoacoustic image at 800 nm excitation; and Figure 11(g1-g4) shows the corresponding overlaid images of these three imaging modalities. 36 Figure 11. Ultrasound, NWs-ICG optoacoustic obtained through multispectral unmixing, and optoacoustic at 800 nm excitation imaging of an ex vivo breast tumor from a mouse intravenously injected with NWs-ICG. (a) The detection and illumination geometry in the imaging chamber of the MSOT system. (b) The breast tumor is embedded in agarose. (c) NWs- ICG structure. (d1-d4) Ultrasound images of the breast tumor with different step sizes. (e1-e4) The corresponding NWs-ICG optoacoustic images were obtained through multispectral unmixing. (f1-f4) The corresponding single-wavelength (λex = 800 nm) optoacoustic images. (g1-g4) with an overlay of the ultrasound, the NWs-ICG optoacoustic(colormap), and the single- wavelength optoacoustic images. 37 3.2.2 I-Gen-LSTM and discriminator models The I-Gen-LSTM model comprises three main neural networks depicted in Figure 12(a-c). The first neural network is the Inception encoder & decoder network based on Inception U-net architecture. The original U-net architect employs simple convolution blocks with the skip connection of encoders and decoders at the same dimension helping the model to circumvent the vanishing and exploding gradients problems. However, the simple convolution blocks might be insufficient to extract all crucial information comprehensively. Inception architecture is one of the effective CNNs architectures since it applies a wide range of kernel sizes to extract global and local features. A large and a small kernel size are tailored to extract information distributed globally and locally, respectively. With this attribute, the encoder & decoder network was designed using Inception U-net as its backbone as shown in Figure 2(a), for improving the model capability. This network takes two 2D images, acquired from an arbitrary consecutive position with a step size of 0.6 mm, as its inputs (input 1 and input 2, as shown in Figure 12(a)). The encoder shown on the left side of Figure 12(a) generates encoder outputs (E1n -E5n, where n is the input image number, i.e., 1 and 2). 
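To make the parallel-kernel design of these encoder blocks concrete, the following is a minimal, hypothetical sketch of an Inception-style encoder stage (1x1, 3x3, and 5x5 branches whose outputs are concatenated, followed by ReLU and 2x2 max pooling). The channel split and the exact layer ordering are assumptions for illustration, not the published I-Gen-LSTM implementation.

```python
import torch
import torch.nn as nn

class InceptionEncoderBlock(nn.Module):
    """Parallel 1x1 / 3x3 / 5x5 convolutions whose outputs are concatenated,
    followed by ReLU and 2x2 max pooling, so local and more global features
    are extracted side by side."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        b = out_ch // 3  # illustrative split of the output channels across branches
        self.branch1 = nn.Conv2d(in_ch, b, kernel_size=1)
        self.branch3 = nn.Conv2d(in_ch, b, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(in_ch, out_ch - 2 * b, kernel_size=5, padding=2)
        self.act = nn.ReLU(inplace=True)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        feat = self.act(torch.cat([self.branch1(x), self.branch3(x), self.branch5(x)], dim=1))
        return feat, self.pool(feat)   # skip-connection tensor, downsampled tensor

# usage: one encoder stage applied to a single-channel 256x256 input
x = torch.randn(1, 1, 256, 256)
skip, down = InceptionEncoderBlock(1, 64)(x)   # (1, 64, 256, 256), (1, 64, 128, 128)
```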
Inception architecture in the encoder, with three different kernel sizes (1x1, 3x3, and 5x5) assembled as parallel filters, is used to extract features from the tensors, followed by a rectified linear unit (ReLU) and a 2x2 max pooling with a stride of 2 for downsampling, respectively. Similarly, Inception architecture is also used in the decoder blocks. The decoder blocks generate the decoder outputs (D1n-D5n, where n is the input image number, i.e., 1 and 2), as shown on the right side of Figure 12(a), through a feature map upsampling, a 2x2 up-convolution (halving the number of feature channels), and a concatenation with the corresponding encoder output. The second neural network is the convolutional LSTM network (ConvLSTM) [88], a recurrent neural network for spatio-temporal prediction. It has a convolutional structure in both the input-to-state and state-to-state transitions, as shown in the bottom right of Figure 12(b). In other words, the internal matrix multiplications are exchanged with convolution operations. Consequently, the data flowing through the ConvLSTM cells keep the input dimensions instead of being flattened into a 1D feature vector. The main equations of ConvLSTM are expressed in Equations (12-16) below, where $*$ and $\circ$ represent the convolution operator and the Hadamard product (element-wise matrix multiplication), respectively. All variables in Equations (12-16) are shown in the "ConvLSTM block" in Figure 12(b).

$i_t = \sigma(W_{xi} * X_t + W_{hi} * H_{t-1} + W_{ci} \circ C_{t-1} + b_i)$   (12)
$f_t = \sigma(W_{xf} * X_t + W_{hf} * H_{t-1} + W_{cf} \circ C_{t-1} + b_f)$   (13)
$C_t = f_t \circ C_{t-1} + i_t \circ \tanh(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c)$   (14)
$o_t = \sigma(W_{xo} * X_t + W_{ho} * H_{t-1} + W_{co} \circ C_t + b_o)$   (15)
$H_t = o_t \circ \tanh(C_t)$   (16)

The ConvLSTM takes the outputs of the Inception encoder from both input images (E11-E51 and E12-E52) as its inputs to generate five sequential blocks (Recurrent Conv1 to Recurrent Conv5), as shown in Figure 12(b). Recurrent Conv 1, 2, 3, 4, and 5 have dimensions of (5x128x128x512), (5x64x64x512), (5x32x32x512), (5x16x16x512), and (5x8x8x512), respectively, where the first dimension represents the number of output images (five sequential output images). The last component is the sequential image generator network, inspired by the U-net architecture. It takes Recurrent Conv 1-5, the two input images, the encoder outputs (E11-E51 and E12-E52), and the decoder outputs (D11-D41 and D12-D42) to reconstruct five sequential images at different scanning positions, as shown in Figure 12(c). The left side of Figure 12(c) shows the concatenated encoder and decoder outputs generated by the Inception encoder & decoder (Figure 12(a)). The right side of Figure 12(c) shows the Conv2D transpose and Conv2D operations applied to the Recurrent Conv 1-5 blocks generated by the ConvLSTM (Figure 12(b)) and to the concatenated encoder & decoder outputs. All Conv2D transpose and Conv2D blocks use ReLU as their activation function except the last Conv2D*, which applies the hyperbolic tangent (tanh) as its activation function. Indeed, the Recurrent Conv blocks regulate the gradual change in the sequential output images. In short, the I-Gen-LSTM model takes two images acquired at consecutive positions with a 0.6 mm step size and generates the five sequential images between them, with gradual change following the scanning positions (step sizes of 0.1-0.5 mm). The ground truth images acquired using a small step size (0.1-0.5 mm) were used to determine the loss value of these five generated images. The loss functions will be elucidated in Section 3.2.3. A generic sketch of a ConvLSTM cell implementing Equations (12-16) is given below.
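This is not the exact I-Gen-LSTM implementation: the peephole terms on the cell state are kept per-channel for simplicity, and the channel/spatial sizes in the usage example follow the Recurrent Conv 5 dimensions quoted above.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Single ConvLSTM cell: convolutional input-to-state and state-to-state
    transitions (Eqs. 12-16), with per-channel peephole terms on the cell state."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        # one convolution produces the stacked pre-activations of i, f, g, o
        self.conv = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)
        # W_ci, W_cf, W_co applied as Hadamard products (per-channel simplification)
        self.w_ci = nn.Parameter(torch.zeros(1, hid_ch, 1, 1))
        self.w_cf = nn.Parameter(torch.zeros(1, hid_ch, 1, 1))
        self.w_co = nn.Parameter(torch.zeros(1, hid_ch, 1, 1))

    def forward(self, x, state):
        h_prev, c_prev = state
        gates = self.conv(torch.cat([x, h_prev], dim=1))
        gi, gf, gg, go = torch.chunk(gates, 4, dim=1)
        i = torch.sigmoid(gi + self.w_ci * c_prev)       # input gate,   Eq. (12)
        f = torch.sigmoid(gf + self.w_cf * c_prev)       # forget gate,  Eq. (13)
        c = f * c_prev + i * torch.tanh(gg)              # cell state,   Eq. (14)
        o = torch.sigmoid(go + self.w_co * c)            # output gate,  Eq. (15)
        h = o * torch.tanh(c)                            # hidden state, Eq. (16)
        return h, c

# usage: roll the cell over two encoder feature maps (e.g., E5 of input 1 and input 2)
cell = ConvLSTMCell(in_ch=512, hid_ch=512)
h = c = torch.zeros(1, 512, 8, 8)
for feat in [torch.randn(1, 512, 8, 8), torch.randn(1, 512, 8, 8)]:
    h, c = cell(feat, (h, c))
```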
Figure 12. I-Gen-LSTM and discriminator architectures. (a) The Inception encoder and decoder network applied to both images (input 1 and input 2). (b) The ConvLSTM network for generating the sequential blocks (Recurrent Conv 1-5) fed to the sequential image generator network for reconstructing the sequential output images. (c) The sequential image generator network. (d) The discriminator network.
Figure 12 (cont'd).
The discriminator network shown in Figure 12(d) is a simple convolutional network designed to evaluate the similarity between the ground truths and the generated images. The model comprises eight convolutional layers and two fully connected layers. After each convolution block, a batch normalization layer is used, followed by a Leaky ReLU activation function (α = 0.2). The number of 3x3 filter kernels increases by a factor of 2 from 64 (the first layer) to 512 (the eighth layer). The last two layers are dense layers working as a classification block, predicting the probability of an image being either real or fake. To train the I-Gen-LSTM model, we assemble the models as a generative adversarial network (GAN) [89], as shown in Figure 13 below.
Figure 13. GAN with the combination of three loss functions (the content loss, the neighbor loss, and the adversarial loss functions) for training the I-Gen-LSTM model.
3.2.3 Loss functions
To optimize the I-Gen-LSTM model, we designed a custom loss function combining the content loss (VGG19 loss, $I^{SS}_{VGG}$) [35], the adversarial loss (discriminator loss, $I^{SS}_{Gen}$), and the neighbor loss ($I^{SS}_{N}$), as shown in Equation (17), where $C_{w1}$, $C_{w2}$, and $C_{w3}$ are hyper-parameters set to 0.7, 0.1, and 0.2, respectively.

$I^{SS} = C_{w1} I^{SS}_{VGG} + C_{w2} I^{SS}_{Gen} + C_{w3} I^{SS}_{N}$   (17)

The content loss or VGG loss ($I^{SS}_{VGG}$), defined as the Euclidean distance between the feature maps of the generated image ($G_{\theta_G}(I^{LS})$) and the ground truth ($I^{SS}$), extracts high-dimensional features that help the model generate images with perceptually satisfying solutions and without excessively smooth textures. The $I^{SS}_{VGG}$ loss is based on the ReLU activation layers of the pre-trained 19-layer VGG network and is calculated following Equation (18):

$I^{SS}_{VGG} = \frac{1}{W_{i,j} H_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}} \left( \phi_{i,j}(I^{SS})_{x,y} - \phi_{i,j}(G_{\theta_G}(I^{LS}))_{x,y} \right)^2$   (18)

where $W_{i,j}$ and $H_{i,j}$ describe the dimensions of the respective feature maps within the VGG network. The feature map ($\phi_{i,j}$) is obtained from the j-th convolution before the i-th max-pooling layer within the VGG19 network. Moreover, the adversarial loss ($I^{SS}_{Gen}$) is employed to encourage the generated images to be indistinguishable from real ones. It is defined from the probabilities, varying from 0 to 1, output by the discriminator model ($D_{\theta_D}(G_{\theta_G}(I^{LS}))$), as shown in Equation (19), where $I^{LS}$ denotes the input images, $G_{\theta_G}$ is the generator model, and $D_{\theta_D}$ is the discriminator model.

$I^{SS}_{Gen} = \sum_{n=1}^{N} -\log D_{\theta_D}(G_{\theta_G}(I^{LS}))$   (19)

Apart from the content and adversarial losses, the neighbor loss is also applied to optimize the model. Since the I-Gen-LSTM model generates sequential images, the neighbor loss is essential to regulate the change of each generated image in the sequence; a sketch of how the three terms can be combined is given below.
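The sketch assumes a torchvision VGG19 feature extractor (recent torchvision, ImageNet weights) for the content term of Eq. (18), a per-frame discriminator probability for the adversarial term of Eq. (19), and restricts the neighbor term to interior frames; all of these are simplifying assumptions rather than the published implementation.

```python
import torch
import torch.nn.functional as F
import torchvision

class SequenceLoss(torch.nn.Module):
    """Weighted combination of content (VGG19), adversarial, and neighbor terms,
    mirroring Eq. (17) with Cw1 = 0.7, Cw2 = 0.1, Cw3 = 0.2."""
    def __init__(self, cw=(0.7, 0.1, 0.2)):
        super().__init__()
        self.cw = cw
        # deep VGG19 feature block for the content loss (ImageNet weights assumed;
        # ImageNet normalization omitted for brevity)
        self.vgg = torchvision.models.vgg19(weights="DEFAULT").features[:36].eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)

    def content(self, gen, gt):
        # grayscale frames are repeated to 3 channels before the VGG pass, Eq. (18)
        return F.mse_loss(self.vgg(gen.repeat(1, 3, 1, 1)), self.vgg(gt.repeat(1, 3, 1, 1)))

    def forward(self, gen_seq, gt_seq, d_fake):
        # gen_seq, gt_seq: lists of five (B, 1, H, W) frames;
        # d_fake: discriminator probabilities on the generated frames
        content = sum(self.content(g, t) for g, t in zip(gen_seq, gt_seq))
        adversarial = sum(-torch.log(d + 1e-8).mean() for d in d_fake)      # Eq. (19)
        neighbor = sum(F.mse_loss(gen_seq[n], gen_seq[n - 1]) +
                       F.mse_loss(gen_seq[n], gen_seq[n + 1])
                       for n in range(1, len(gen_seq) - 1))                  # Eq. (20), interior frames only
        return self.cw[0] * content + self.cw[1] * adversarial + self.cw[2] * neighbor
```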
The concept of the neighbor loss function is to differentiate between the current generated image and the neighbor images in the same sequence as expressed in Equation (20) below as 𝑁 𝑆𝑆 = ∑(𝑚𝑠𝑒( 𝐼𝑁 𝑛=1 𝐼𝑛, 𝐼𝑛−1) + 𝑚𝑠𝑒(𝐼𝑛, 𝐼𝑛+1)) (20) The custom-made loss function effectively leverages the combination of these three loss functions to train the I-Gen-LSTM model that can generate high-quality sequential images. 44 3.2.4 I-Gen-LSTM model for Volumetric Imaging To collect the database for training the model, 16 breast tumors from mice intravenously injected with NWs-ICG were acquired by the MSOT system. The data from these tumors were allocated for training (11 tumors), validation (3 tumors), and testing (2 tumors) datasets. The training time on Google Colaboratory (CoLab) Pro is approximately 40 hours. After initializing and importing the model, the I-Gen-LSTM can generate five sequential images by taking less than 1 second for the five output images on a personal computer (PC) with an 11th Gen Intel core i7-11700k CPU, 16 GB RAM, and an NVIDIA RTX 3090 graphic card. 3.3 Results and discussion 3.3.1 Sequential NWs-ICG optoacoustic, ultrasound, and optoacoustic (λex = 800 nm) image reconstruction. The breast tumor dissected from an NWs-ICG-injected mouse was scanned under the MSOT system. Figure 14 shows the generated sequential images generated by the I-Gen-LSTM model. Two input images of each modality, acquired from consecutive stage positions with a step size of 0.6 mm, are used as the inputs for the I-Gen-LSTM model. Here, we demonstrate a z-scanning range from 9.7 mm-10.3 mm with a step size of 0.1 mm as a representative. 45 Figure 14. Results of sequential image reconstruction generated by the I-Gen-LSTM model. The two input images for each modality simultaneously acquired with a step size of 0.6 mm were fed into the I-Gen-LSTM model. The green, blue, and violet boxes show generated images (GEN), ground truth (GT), and the absolute error between GEN and GT images (|GT-GEN|) represented as color map images. The red-dashed boxes show the local features fairly change along the z-scanning position and the yellow-dashed boxes are the corresponding enlarged images of the red-dashed boxes. The scale bar is 5 mm. (a) NWs-ICG optoacoustic sequential image reconstruction result. (b) Ultrasound sequential image reconstruction result. (c) Single- wavelength optoacoustic (λex = 800 nm) reconstruction result. 46 Figure 14 (cont’d). The red-dashed boxes in Figure 14 show local features, which are fairly changing along the z- scanning position and are somewhat straightforward to observe. The orange-dashed boxes are the corresponding enlarged images of the red-dashed boxes. Figure 14(a) shows the sequential image reconstruction result of NWs-ICG optoacoustic imaging, Figure 14(b) shows the result of ultrasound imaging, and Figure 14(c) shows the result of single-wavelength optoacoustic (λex = 800 nm) imaging. The average Peak-signal-to-noise ratio (PSNR) dB/ the average summation of absolute errors (SAE) between the ground truths (GT) and generated images (GEN) for this scanning range of NWs-ICG optoacoustic, ultrasound, and optoacoustic (λex = 800 nm) imaging are 87.72 dB/923.66 ,78.83 dB/4,323.19, 75.60 dB/2,223.40, respectively. 
47 3.3.2 Three-dimensional reconstruction of the stack 2D NWs-ICG optoacoustic, ultrasound, and optoacoustic (λex = 800 nm) images Since the MSOT system and our deep learning model provide the stack of multiple cross-sectional images for NWs-ICG optoacoustic, ultrasound, and optoacoustic (λex = 800 nm) images, we can use these images to reconstruct three-dimensional (3D) images by using Amira (Mercury Computer system, Berlin, Germany) software. Figure 15 shows the 3D reconstruction results of the ground truth and the generated images. Figure 15(a) demonstrates the 3D reconstruction of generated images from the I-Gen-LSTM model and Figure 15(b) shows the reconstruction of the ground truths acquired by mechanical scanning. After finished the experiment, the tumor was removed from the agarose and sent to the histopathology lab (MSU-IHPL Research facility) to prepare a Hematoxylin-and-Eosin (H&E) stained breast tumor slide shown in Figure 15(c). 48 Figure 15. 3D image reconstruction of the breast tumor using cross-sectional NWs-ICG optoacoustic, ultrasound, and optoacoustic (λex = 800 nm) stacked images. (a) The 3D reconstruction result of the NWs-ICG optoacoustic, ultrasound, and optoacoustic (λex = 800 nm) images generated by the I-Gen-LSTM model with a step size of 0.1 mm. (b) The 3D reconstruction result acquired by mechanical scanning with a step size of 0.1 mm. (c) The photograph of the corresponding tumor and its H&E slide image. 49 3.3.3 Evaluations The NWs-ICG optoacoustic, ultrasound, and optoacoustic (λex = 800 nm) images from two tumors not used for training the model were utilized for the model evaluation. Each tumor was scanned with a step size of 0.1 mm. Every two-image (with a 0.6 mm scanning step in between) was assigned as the input for the I-Gen-LSTM model to generate five sequential images with a step size of 0.1 mm. Here, the model was evaluated using four quantitative metrics: the average PSNR, SAE (GEN, GT), SAE (𝐼𝑛𝑝𝑢𝑡1, 𝐺𝑇),and SAE (𝐼𝑛𝑝𝑢𝑡2, 𝐺𝑇). They were applied to the testing dataset acquired from the tumors for all scanning positions. A large PSNR and a small SAE (GEN, GT) imply high-quality generated images. Indeed, if the SAE (GEN, GT) can perform better than SAE (Input1-GT) and SAE (Input2-GT), it also means that the model can effectively generate sequential images. All average evaluation metrics can be calculated following Equation (21-23). Average PSNR Average SAE (GEN, GT) Average SAE (𝐼𝑛𝑝𝑢𝑡𝑘 ,GT) = = = 5 𝑖 𝑁 𝑗 ∑ ∑ 𝑃𝑆𝑁𝑅𝑗(𝐺𝐸𝑁𝑖 , 𝐺𝑇𝑖) 5 × 𝑁 5 𝑖 𝑁 𝑗 ∑ ∑ 𝑆𝐴𝐸𝑗(𝐺𝐸𝑁𝑖 , 𝐺𝑇𝑖) 5 × 𝑁 5 𝑖 𝑁 𝑗 ∑ ∑ 𝑆𝐴𝐸𝑗(𝐼𝑛𝑝𝑢𝑡𝑘 , 𝐺𝑇𝑖) 5 × 𝑁 (21) (22) (23) Where, N is the number of scanning positions with a step size of 0.6 mm, 𝐺𝐸𝑁𝑖 is the generated image at “i” scanning position in between two input images (acquired with a step size of 0.6 mm), 𝐺𝑇𝑖 is the corresponding ground truth, 𝐼𝑛𝑝𝑢𝑡𝑘 images are the two input images (k=1 and 2) acquired from arbitrary consecutive positions with a step of 0.6 mm. 50 Figure 16 shows the representative result from one of the evaluated tumors as the graph of the average PSNR and SAE (GEN, GT) vs. scanning positions. Table 5 shows the average evaluation metrics of the generated sequential NWs-ICG optoacoustic, ultrasound, and optoacoustic (λex= 800 nm) images for all testing datasets. Overall, the average PSNR and SAE between generated images and ground truths of all modalities are greater than 75 dB and less than 2,000, respectively. This indicates that the I-Gen-LSTM model can generate sequential images with promising results. 
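For reference, the averaged metrics of Equations (21) and (22) reduce to the sketch below, assuming 8-bit image data and one stack of five generated slices per 0.6 mm scanning position; Equation (23) is the same SAE average computed against the two input slices instead of the generated ones.

```python
import numpy as np

def psnr(gen, gt, data_range=255.0):
    """Peak signal-to-noise ratio between a generated slice and its ground truth."""
    mse = np.mean((gen.astype(np.float64) - gt.astype(np.float64)) ** 2)
    return np.inf if mse == 0 else 10.0 * np.log10(data_range ** 2 / mse)

def sae(a, b):
    """Summation of absolute errors between two slices."""
    return float(np.abs(a.astype(np.float64) - b.astype(np.float64)).sum())

def average_metrics(gen_stacks, gt_stacks):
    """gen_stacks / gt_stacks: lists of (5, H, W) arrays, one pair per 0.6 mm
    scanning position; returns the averages of Eqs. (21) and (22)."""
    psnrs, saes = [], []
    for gen, gt in zip(gen_stacks, gt_stacks):
        for g, t in zip(gen, gt):
            psnrs.append(psnr(g, t))
            saes.append(sae(g, t))
    return float(np.mean(psnrs)), float(np.mean(saes))
```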
To comprehensively evaluate the model performance, we also compared SAE (GEN, GT) to SAE (𝐼𝑛𝑝𝑢𝑡1, 𝐺𝑇) and SAE( 𝐼𝑛𝑝𝑢𝑡2, 𝐺𝑇) as the baseline for comparison. The average SAE (GEN, GT) of optoacoustic (λ= 800 nm) and ultrasound imaging performs better than the average SAE(𝐼𝑛𝑝𝑢𝑡1, 𝐺𝑇) and SAE( 𝐼𝑛𝑝𝑢𝑡2, 𝐺𝑇), but the NWs-ICG optoacoustic imaging does not (the average SAE (GEN, GT) is slightly higher than the average of SAE(𝐼𝑛𝑝𝑢𝑡1, 𝐺𝑇) and SAE( 𝐼𝑛𝑝𝑢𝑡2, 𝐺𝑇)) due to the tiny changing features in the sequential NWs-ICG optoacoustic imaging and the limited number of the training dataset. Although the overall result is favorable and encouraging, the deep learning model could be improved in future work. We will use a larger dataset with a larger image size to train the deep learning model so that the convolution/LSTM blocks can efficiently capture more sequential features, especially in a tiny changing feature modality such as NWs-ICG optoacoustic imaging. Table 5. Average quantitative metrics of optoacoustic (λex = 800 nm), NWs-ICG optoacoustic, and ultrasound images generated by the proposed deep learning model. Average quantitative metrics PSNR (dB) SAE (GEN, GT) SAE (𝐼𝑛𝑝𝑢𝑡1, 𝐺𝑇) SAE (𝐼𝑛𝑝𝑢𝑡2, 𝐺𝑇) Optoacoustic (λex = 800 nm) 76.53 1,706.12 6,812.92 5,294.94 NWs-ICG optoacoustic 83.75 858.54 406.59 284.02 Ultrasound 80.44 1,265.87 6,695.71 4,902.67 51 Figure 16. The PSNR and SAE (GEN, GT) evaluation in one of the testing tumors. (a-b) The graph between the PSNR and SAE (GEN, GT) values vs. scanning positions for all generated OPUS, NWs-ICG optoacoustic, and optoacoustic (λex = 800 nm) images, respectively. 3.4 Conclusion This work demonstrates a deep learning technique based on recurrent and convolution neural networks for generating sequential NWs-ICG optoacoustic (multispectral unmixing), ultrasound, and optoacoustic images. It has shown robust and promising performance in the accurate reconstruction of the sequential images for all modalities, according to the quantitative evaluation of model performance using the PSNR and SAE for all scanning positions of the generated images (reconstructed by the deep learning model) and ground truth (acquired by mechanical scanning). The architecture of our model is versatile since it can promisingly generate sequential cross- sectional images of three modalities from the commercial MSOT system. Using our deep learning can substantially reduce acquisition time. However, all the training data were acquired from ex vivo tissues completely fixed in agarose. Model performance with images acquired in vivo may be 52 affected by cardiac and respiratory motion. In the future, we will explore the possibility of optimizing and applying the model to generate sequential images of in vivo samples with motion artifacts. 53 CHAPTER 4: Multi-head Attention U-Net for MPI-CT Image Segmentation Reprinted with permission from “A. Juhong, et al., "Multi-head Attention U-Net for Magnetic Particle Imaging-Computed Tomography image segmentation." Advanced Intelligent Systems, 6, no. 10 (2024): 2400007” [90], © 2024 The Author(s), Advanced Intelligent Systems published by Wiley-VCH GmbH. Magnetic particle imaging (MPI) is an emerging non-invasive molecular imaging modality with high sensitivity and specificity, exceptional linear quantitative ability, and potential for successful applications in clinical settings. Computed tomography (CT) is typically combined with the MPI image to obtain more anatomical information. 
Herein, we present a deep learning-based approach for MPI-CT image segmentation. The dataset utilized in training the proposed deep learning model is obtained from a transgenic mouse model of breast cancer following administration of indocyanine green (ICG)-conjugated superparamagnetic iron oxide nanoworms (NWs-ICG) as the tracer. The NWs-ICG particles progressively accumulate in tumors due to the enhanced permeability and retention (EPR) effect. The proposed deep learning model exploits the advantages of the multi-head attention mechanism and the U-Net model to perform segmentation on the MPI-CT images, showing superb results. In addition, we characterized the model with the different number of attention heads to explore the optimal number for our custom MPI-CT dataset. 4.1 Introduction MPI is a highly sensitive imaging modality initially introduced in 2005 [91-93]. Unlike traditional imaging techniques such as magnetic resonance imaging (MRI), sonography, computed tomography (CT), and X-ray, MPI is not employed for structural imaging purposes. Nevertheless, it is a tracer imaging modality akin to positron emission tomography (PET) and single photon emission computed tomography (SPECT). The concept of MPI is to detect the three-dimensional 54 distribution of superparamagnetic iron-oxide nanoparticles (SPIONs) with extraordinary contrast and sensitivity, allowing us to track and quantify the tracer materials effectively. In addition, MPI signal can only be detected from the administered tracer providing an image without background as well as improving signal-to-noise ratios. Indeed, the development of MPI involved strengthening the existing imaging modalities (MRI, PET, SPECT, etc.). For instance, PET and SPECT tracers typically have half-lives in a range of minutes to hours, whereas the MPI tracer can last for several days to weeks [94]. Therefore, MPI is more eminently suitable for dynamic imaging applications than traditional tracer imaging methods. Numerous prototypes and commercial MPI scanners have demonstrated impressive results in in-vivo studies for vascular imaging [95-97], oncology [98-100], and cell tracking [101, 102]. The MPI system for humans is under development and may become available in the near future [103]. Like PET, an MPI image is frequently combined with a CT image for registering the particle signal (the MPI image) and the anatomical information (the CT image). This will enhance the diagnostic potential by identifying the precise location of functional events in the body [104]. Biocompatibility is one of the essential features for using biomaterials, particularly MPI tracers (iron oxide particles), for in-vivo applications and clinical trials. Nanoworms (NWs) are biocompatible iron oxide particles widely used for biomedical applications. NWs include a considerably lower inflammatory response than spherical iron oxide nanoparticles [105]. NWs are a nanostructure with an elongated assembly of iron oxide (IO) [106]. This structure can potentially augment the nanoparticles’ capability for circulation and tumor targeting. Due to their nanoscale dimensions, NWs can remain in tumors longer than pure fluorescence contrast agents, also recognized as the enhanced permeability and retention (EPR) effect [107, 108]. 
55 Recently, image processing based on deep learning has become a promising approach for medical applications due to the rapid development of computation technologies for image classification [109-111], regression [112-114], reconstruction [115-117], and segmentation [118- 121]. Deep learning models contain a large number of function approximators. As a result, the models without further modifications tend to neglect essential parts of the input and focus on others. The use of the attention mechanism [122] is one of the practical approaches to remedy this problem. The attention mechanism is an ingenious and powerful technique allowing neural networks to focus on meaningful parts of an input tensor. This mechanism is the key innovation behind numerous successful deep learning architectures such as TransUnet [123], BRET [124], and Swin transformer [125]. Multiplicative attention (Luong attention) [126] and additive attention (Bahanau attention) [127] are two initial instances of attention sparking the revolution. Since multiplicative attention implements matrix multiplication for calculating the output, it is more memory-efficient in practice and faster than additive attention. However, the additive attention can be superior to the multiplicative attention for large dimensional input features [128]. The U-Net architecture [129] is a widely recognized convolutional neural network (CNN) that has achieved prominence in the field of medical image segmentation due to its simplicity and remarkable performance. The original U-Net architecture contains two main components: an encoder and a decoder. The skip connection mechanism is added to the same dimensional encoder and decoder. Essentially, it combines spatial information from the down-sampling path (encoder) with the up- sampling path (decoder) to retain marvelous spatial information. In addition, the skip connection mechanism allows the gradient descent to readily propagate back to update the weights (learnable parameters). However, the skip connection mechanism brings along the poor feature representation from the encoder path. The attention U-Net architecture [40] can tackle this problem by 56 implementing the attention mechanism at the skip connection, allowing the model to actively suppress actions at irrelevant features. This reduces the computational resources wasted on irrelevant activations and provides superior network generalization. The attention mechanism applied in the attention U-Net is called the attention gates (AGs) [130] based on additive attention. The CNN model with AGs can be easily trained from scratch and boost the model’s performance by automatically learning to focus on some crucial features without additional supervision. Available MPI data are remarkably limited for a computational study of robust MPI image quantification. Herein, we propose a multi-head attention U-Net model for the MPI-CT image segmentation. The MPI-CT images acquired from mice with breast tumors were manually labeled as the ground truths for training the model. The attention U-Net model [131] inspires the proposed model. Still, we apply the attention mechanism in parallel (multi-head attention) to step up the model capability for focusing on noteworthy features. 4.2 Methods An extensive overview of the workflow involved in training the proposed multi-head attention U- Net model is shown in Figure 17 below. 
First, NWs were synthesized by the co-precipitation method of Fe2+ and Fe3+ salts with the polysaccharide dextran coating, as depicted in Figure 17(a1), the particles were then conjugated with ICG, resulting in the formation of conjugated superparamagnetic iron oxide nanoworms referred to as NWs-ICG [85]. In addition, we also acquired a transmission electron microscopy (TEM) image of NWs-ICG particles as shown in Figure 17(a2). With this structure, the detection of NWs-ICG can be achieved by fluorescence imaging and optoacoustic imaging, in addition to the use of MPI as shown in Figure 17(a3). Thus, this offers captivating prospects for a multimodal imaging study. However, this paper mainly focuses on MPI. A mouse with breast tumors was injected with NWs-ICG through the intravenous 57 administration injection method, followed by MPI-CT image acquisition. Figure 17(b1) shows the MPI and micro-CT image systems used in this work. The fundamental concept of MPI is illustrated in Figure 17(b2). In short, an intense magnetic field is generated by two permanent magnets, and the inside of this magnetic field contains a small area with low magnetic field intensity known as the field-free region (FFR). By rapidly moving the FFR across the imaging volume, the magnetization of SPIONs passing through the FFR induces a signal (oscillating changes in magnetization) in the imager’s receive coil. In other words, SPIONs not passing not passing through the FFR do not generate a signal in the receiver coil due to a strong magnetic field outside the FFR inhibiting SPIONs from rotating. Lastly, the MPI-CT images were manually labeled as the ground truths for training the deep learning model as shown in Figure 17(c). 58 Figure 17. Overview of MPI-CT image segmentation using the custom dataset. (a) An injected- NWs-ICG breast tumor mouse; (a1) the chemical structure of NWs-ICG; (a2) TEM image of NWs-ICG particles with a scale bar of 40 nm; (a3) the multimodality imaging (fluorescence, optoacoustic, and MPI) of the tumor dissected from the NWs-ICG injected mouse. (b) MPI-CT image acquisition; (b1) MPI scanner and Micro-CT imaging system; (b2) illustration of the MPI principle. (c) Ground truth labeling in MPI-CT image segmentation. 59 4.2.1 Dataset preparation To acquire a custom MPI-CT image dataset, MMTV-PyMT transgenic mice with breast cancer were intravenously injected with NWs-ICG at the concentration and volume of 2 mg/mL and 400 µL, respectively. All procedures used in experiments conducted on animals were approved by the Institutional Animal Care & Use Committee (IACUC) of Michigan State University. The Momentum MPI scanner (Magnetic Insight, Inc., Alameda, CA, USA) was employed to acquire the 3D MPI images of the NWs-ICG injected mice. The scanner was configured with the following parameters: 3D scan mode, Z FOV 10.0 cm, number of projections 21, and selection field gradient 5.7 T/m. The Micro CT system (PerkinElmer, Inc., Hopkinton, MA, USA) with the following parameters: speed scan mode and voltage of 90 kV was then used to acquire the corresponding CT images. Finally, 3D MPI-CT images were reconstructed using VivoQuant software (Magnetic Insight, Inc., Alameda, CA). The imaging was performed at four different time points: 1 hour, 24 hours, 48 hours, and 72 hours after injection. Therefore, with one mouse, we can obtain 3D datasets at these four different time points. However, we only focus on 2D images in this work. 
To obtain the 2D image dataset, the 3D images were rotated with random angles for capturing the 2D images, and we had to ensure that the perspectives or rotation angles were not the same (0 or 180 degrees from the existing images) for the data cleaning purpose. Figure 18(a) shows the MPI-CT images of the NWs-ICG injected mouse 1-72 hours post injection. MPI signal areas from MPI-CT images were manually labeled as the ground truth for training the segmentation deep learning model. There are 104 2D MPI-CT images and their corresponding ground truths from four different mice used for this study (91 images for a training dataset, 4 images for a validation dataset, and 9 images for a testing dataset). To affirm that there were NWs-ICG particles in the tumor tissues, after acquiring MPI-CT images, the tissues were dissected from the mice and preserved in a solution of 60 10% neutral buffered formalin (NBF). These NBF-fixed tissues were embedded in paraffin, followed by sectioning with a thickness of 5 μm and staining with Prussian Blue to detect ferric from iron and hematoxylin and eosin (H&E). All histological procedures were carried out by the Michigan State University investigative histopathology laboratory. Figure 18(c-d) show the Prussian blue stained histology image of one of the dissected tumors from NWs-ICG injected mice acquired by a commercially available microscope (Nikon Eclipse Ci, Nikon Inc, Tokyo, Japan). Figure 18. (a) MPI-CT images of the NWs-ICG injected mouse acquired from 1 – 72 hours post-injection. The yellow-dashed circles (MPI-CT image at 72 h) show the MPI signal of NWs-ICG from the tumors. (b) Photograph of the NWs-ICG injected mouse. (c-d) Prussian blue stained histological image of the breast tumor dissected from the NWs-ICG injected mouse acquired by 10x and 40x magnifications, respectively. 61 4.2.2. Multi-head attention U-Net The main structure of the multi-head attention U-Net model is somewhat similar to the original attention U-Net model, which consists of the encoder, bottleneck, decoder, and single-head attention layers. However, the proposed model applies parallel attention gates (AGs) in each skip connection from encoder to decoder instead of a single attention head. This modification allows the model to collect and incorporate more salient information effectively. In addition, employing parallel AGs enables the model to simultaneously process input from distinct representation subspaces at numerous locations [31]. Figure 19(a) illustrates the multi-head attention U-Net architecture. The first part is the encoder (the left side of Figure 19(a)). The input image is progressively filtered and down-sampled by applying a convolution block, then a rectified linear unit (ReLU), and max-pooling 2x2 filters with a stride of 2. Furthermore, the number of feature channels is doubled at each downsampling step. The second part is multi-head attention gates (MH-AGs). The features propagated through the skip connections are filtered by exploiting these MH-AGs, which can help the model localize and focus on relevant features without cropping regions of interest. The third part is the decoder (the right side of Figure 19(a)). It consists of a concatenation of the attention weights from the MH-AG layer, a convolution block with the ReLU activation function, and a feature map upsampling followed by a 2x2 up-convolution resulting in a reduction of the number of feature channels by half. 
Figure 19(b) shows the MH-AG architecture employed between the encoder and decoder of the U-Net in Figure 19(a). MH-AG is a parallel mechanism block that enhances the performance of the U-Net model while minimizing the number of additional weights (learnable parameters) that must be trained. Moreover, the MH-AG adopts the same transformation in all branches to minimize the need to adjust hyperparameters in each branch manually. The output of each branch in MH-AG is obtained by performing element-wise multiplication between the input feature maps and the attention coefficients ($\hat{x}_i^l = x_i^l \cdot \alpha_i^l$), allowing the model to identify salient information. To identify focus areas, a gating vector ($g_i$) is assigned to each pixel. The gating vector encompasses contextual information utilized to selectively suppress lower-level feature responses. The gating coefficient is derived through additive attention, mathematically represented as follows:

$q_{att}^{l} = \psi^{T} \sigma_1\!\left(W_x^{T} x_i^{l} + W_g^{T} g_i + b_g\right) + b_{\psi}$   (24)
$\alpha_i^{l} = \sigma_2\!\left(q_{att}^{l}(x_i^{l}, g_i; \Theta_{att})\right)$   (25)

where $\sigma_2(x_i) = \frac{1}{1+\exp(-x_i)}$ represents the sigmoid activation function, and $\Theta_{att}$ represents a group of parameters comprising the linear transformations $W_x \in \mathbb{R}^{F_l \times F_{int}}$, $W_g \in \mathbb{R}^{F_g \times F_{int}}$, $\psi \in \mathbb{R}^{F_{int} \times 1}$, and the bias terms $b_{\psi} \in \mathbb{R}$ and $b_g \in \mathbb{R}^{F_{int}}$. Channel-wise 1x1x1 convolutions of the input tensors are employed to compute the linear transformations.
Figure 19. Schematic of the multi-head attention U-Net (the proposed model) for MPI-CT image segmentation. (a) The left side of the schematic represents the encoder blocks; the tensor is progressively down-sampled by a factor of 2 (e.g., H1 = H5/16); the right side represents the decoder blocks; the tensor is up-sampled gradually by a factor of 2. The multi-head attention gates (MH-AGs) are applied between the encoder and decoder to assign weights (learnable parameters) to noteworthy features. (b) Multi-head attention gate (MH-AG) architecture (n is the number of attention heads). Input features ($x_n^l$) are scaled with the attention coefficients ($\alpha_n^l$) computed in each branch of MH-AG. The gating signal (g), collected from a coarser scale, provides activations and contextual information, which is applied to determine spatial regions. The output of each branch is then concatenated before being fed to the convolution layer, batch normalization, and sigmoid function to compute the final result of MH-AG.
4.2.3 Loss function
Dice loss is widely used for medical image segmentation, comparing the similarity of two binary images (the ground truth segmentation and the predicted segmentation). Since our custom MPI-CT image dataset is limited and we want to prove the concept that multi-head attention can potentially enhance the model performance for MPI-CT image segmentation, the Dice loss is used to train all models for the purpose of performance comparison. Equation (26) shows the Dice loss function,

$\mathrm{DiceLoss}(y, \hat{y}) = 1 - \frac{2 y \hat{y} + 1}{y + \hat{y} + 1}$   (26)

where $y$ represents the ground truth and $\hat{y}$ represents the predicted segmentation generated by a deep learning model. After assembling all the parts of the models, the MPI-CT images and their corresponding segmentation masks were then utilized to train the models as inputs and ground truths, respectively, with the following hyperparameters: an Adam optimizer [132] with an initial learning rate of 5x10-4, a batch size of 8, and 60 epochs. A minimal sketch of the MH-AG block and the Dice loss is given below.
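The sketch assumes the gating signal has already been resized to the spatial size of the skip-connection features; the fusion layer and channel counts are illustrative rather than the published implementation.

```python
import torch
import torch.nn as nn

class MultiHeadAttentionGate(nn.Module):
    """n parallel additive attention gates (Eqs. 24-25) whose gated outputs are
    concatenated and fused by a 1x1 convolution, batch normalization, and sigmoid."""
    def __init__(self, x_ch, g_ch, inter_ch, n_heads=4):
        super().__init__()
        def head():
            return nn.ModuleDict({
                "wx": nn.Conv2d(x_ch, inter_ch, 1),   # W_x
                "wg": nn.Conv2d(g_ch, inter_ch, 1),   # W_g
                "psi": nn.Conv2d(inter_ch, 1, 1),     # psi
            })
        self.heads = nn.ModuleList(head() for _ in range(n_heads))
        self.fuse = nn.Sequential(nn.Conv2d(n_heads * x_ch, x_ch, 1),
                                  nn.BatchNorm2d(x_ch), nn.Sigmoid())

    def forward(self, x, g):
        # x: skip-connection features; g: gating signal resized to x's spatial size
        gated = []
        for h in self.heads:
            q = torch.relu(h["wx"](x) + h["wg"](g))     # sigma_1 in Eq. (24)
            alpha = torch.sigmoid(h["psi"](q))          # sigma_2 in Eq. (25)
            gated.append(x * alpha)                     # x_hat = x . alpha
        return self.fuse(torch.cat(gated, dim=1))

def dice_loss(pred, target, eps=1.0):
    """Soft Dice loss of Eq. (26) for binary masks, with +1 smoothing."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
```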
All the models in this study were trained on a personal computer equipped with an 11th Gen Intel core i7-11700k CPU, 64 GB of RAM, and an NVIDIA RTX 3090 graphic card. 4.3 Results and discussion 4.3.1 Gradient-weighted class activation maps (Grad-CAM) Gradient-weighted class activation mapping (Grad-CAM) [133] is a class-discriminative localization technique. It can generate a visual representation of any CNN-based model without altering the model itself. Grad-CAM leverages the gradient information flowing through a specific convolutional layer to assign crucial weights to each neuron to determine a particular decision of interest. This gradient information is then used to calculate the localization map visualized as a heat map image. In short, the intuitive interpretation of Grad-CAM is based on the concept that 65 the model must observe some pixels and decide what object is present in the image, which can be interpreted as a gradient in mathematical terms. To compute Grad-CAM, the equations below are applied. Equation 27 is used to calculate the neuron’s important weight (𝛼𝑘 𝑐 ) by calculating the global average pooling of the gradient from backpropagation. 𝛼𝑘 𝑐 is then employed to calculate the localization map Grad-CAM as shown in Equation 27 and 28. 𝑐 = 𝛼𝑘 1 𝑍 (∑ ∑ 𝜕𝑦𝑐 𝑘 𝜕𝐴𝑖𝑗 𝑗 𝑖 ), 𝑐 𝐿𝐺𝑟𝑎𝑑−𝐶𝐴𝑀 = 𝑅𝑒𝐿𝑈(∑ 𝛼𝑘 𝑐𝐴𝑘 ), 𝑘 (27) (28) Where 𝜕𝑦𝑐 𝜕𝐴𝑖𝑗 𝑘 is the gradient from backpropagation, 𝐴𝑘 is feature map activation of a convolutional layer, 𝛼𝑘 𝑐 is neuron import weight, 𝐿𝐺𝑟𝑎𝑑−𝐶𝐴𝑀 𝑐 is localization map Grad-CAM (coarse heat map). Grad-CAM is applied to each multi-head attention layer (MH-AG layer 1-4) output in order to characterize and understand the multi-head attention U-Net model behavior. The attention weights of different MH-AG layers are visualized as shown in Figure 20. Figure 20(a) shows the input image, ground truth, and the segmentation outputs of 6-head, 4-head, and 2-head attention U-Net models. Figure 20(b) shows the Grad-CAM results of the corresponding attention U-Net models. According to these Grad-CAM results and final segmentation outputs, the 4-head attention U-Net model can exceptionally perform MPI-CT image segmentation and surpass 6-head and 2-head attention U-Net models since it can focus on more meaningful features and predict a more accurate result. It is interesting to note that each MH-AG layer output of the 4-head attention U-Net model pays attention to different meaningful features, the MH-AG layer 4 pays attention to the overall boundary of the MPI signal, the MH-AG layer 3 focuses on the increasingly precise boundary of the MPI signal, the MH-AG layer 2 changes the focus from the boundary of the MPI signal to the skeleton (bone structure, i.e., CT image), and the MH-AG layer 1 entirely focuses on the real target 66 MPI signal. With these different meaningful features, the learnable parameters of the model can be assigned to pay attention to the relevant features and circumvent irrelevant features for the final prediction. However, the 2-head and 6-head attention U-Net models behave in different ways. The MH-AG layers 4 and 3 of the 2-head attention U-Net poorly estimate the boundary of the MPI signal, and the MH-AG layers 2 and 1 focus on somewhat the same features (MPI signal areas). Although the MH-AG layers 4 and 3 of the 6-head attention U-Net can perform better than the 2- head attention model, the MH-AG layers 2 and 1 also pay attention to relatively the same features (MPI signal areas). 
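For completeness, the Grad-CAM computation of Equations (27) and (28) can be sketched with forward and backward hooks as below; the layer handle and the scalar score function in the usage comment are hypothetical placeholders, not names from the actual model.

```python
import torch

def grad_cam(model, layer, image, score_fn):
    """Coarse Grad-CAM heat map for one convolutional layer.
    score_fn maps the model output to the scalar y^c being explained
    (for segmentation, e.g., the summed foreground logit)."""
    acts, grads = {}, {}
    h1 = layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    try:
        out = model(image)
        model.zero_grad()
        score_fn(out).backward()                               # d y^c / d A^k
        weights = grads["g"].mean(dim=(2, 3), keepdim=True)    # alpha_k^c, Eq. (27)
        cam = torch.relu((weights * acts["a"]).sum(dim=1))     # L_Grad-CAM, Eq. (28)
    finally:
        h1.remove()
        h2.remove()
    return cam  # (B, H', W') coarse map; upsample to the input size for display

# e.g., cam = grad_cam(net, net.mh_ag1, mpi_ct_tensor, lambda o: o.sum())  # hypothetical names
```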
Indeed, the optimal number of attention heads depends on the tasks we desire to train the deep learning model and the data features. If there are a larger number of important features, the higher number of attention heads could potentially help the model perform better by capturing more essential information. Nevertheless, the excessive number of attention heads could lead to less impressive performance, according to the Grad-CAM results illustrated in Figure 20 and our quantitative experiment discussed in the next section. 67 Figure 20. A comparison of Grad-CAMs results of 2-head attention, 4-head attention, and 6- head attention U-Net architectures. (a) Input MPI-CT image, segmentation ground truth and outputs of each attention architecture. (b) The Grad-CAM results of the attention architectures at different MH-AG layers (MH-AG layer (1-4)). 4.3.2 Implementation and evaluation metrics Intersection over Union (IoU) is commonly used to evaluate the similarity between a predicted segmentation area and its ground truth [121]. The concept of IoU is to quantify the common area of the ground truth and prediction mask (intersection) divided by the entire number of pixels present across both the prediction mask and ground truth (union) as shown in the equation below. 68 𝐼𝑜𝑈 = 𝑔𝑟𝑜𝑢𝑛𝑑 𝑡𝑟𝑢𝑡ℎ ∩ 𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑖𝑜𝑛 𝑔𝑟𝑜𝑢𝑛𝑑 𝑡𝑟𝑢𝑡ℎ ∪ 𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑖𝑜𝑛 (29) The IoU ranges from 0 -1 (0-100%), with 0 indicating no overlapping area, whereas 1 indicates impeccably overlapping area. The dice similarity coefficient (DSC) is another well-known parameter used to evaluate the similarity between the predicted area (our output) and ground truth [32]. The DSC can be calculated following the equation below. 𝐷𝑆𝐶 = 2|𝑔𝑟𝑜𝑢𝑛𝑑 𝑡𝑟𝑢𝑡ℎ ∩ 𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑖𝑜𝑛| |𝑔𝑟𝑜𝑢𝑛𝑑 𝑡𝑟𝑢𝑡ℎ| + |𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑖𝑜𝑛| Precision is defined as the ratio of true positive results to the total number of positive results, which is the summation of true positive and false positive as shown in Equation 31. 𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 = 𝑇𝑃 𝑇𝑃+𝐹𝑃 , (30) (31) Sensitivity, also known as Recall, is the number of true positive results over the summation of true positive and false negative results as shown in Equation 32. 𝑅𝑒𝑐𝑎𝑙𝑙 = 𝑇𝑃 𝑇𝑃+𝐹𝑁 , (32) Accuracy, also known as the Rand index, is the number of correct predictions divided by the total number of predictions as shown in Equation 33. 𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦 = 𝑇𝑃+𝑇𝑁 𝑇𝑃+𝑇𝑁+𝐹𝑁+𝐹𝑃 , (33) Where TP is true positive, TN is true negative, FP is false positive, and FN is false negative. As previously stated, if the number of attention heads is excessive, the performance of a deep learning model based on the attention heads could deteriorate. Thus, we characterized the number of attention heads and employed Dice and IoU as the representative benchmarks. Figure 21 illustrates the characterization results of the U-Net based on the different number of attention heads. With regards to the plot of Dice/IoU scores vs the number of attention heads, it begins at 0.889/0.804 69 with the 1-head attention architecture, it gradually increases and then reaches the highest score at 0.909/0.835 with the 4-head attention architecture before declining progressively to 0.906/0.829 and 0.901/0.822 with 5 and 6 attention heads, respectively. Therefore, the multi-head attention U- Net with 4 heads is the optimal model providing the best result for the MPI-CT image segmentation. Figure 21. The performance of the multi-head attention U-Net models with the different number of attention heads (Dice/IoU scores vs the number of attention heads plot). 
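The evaluation metrics used below (Equations 29-33) reduce to simple counts over binary masks, as in the following sketch; the handling of empty masks is an arbitrary convention for illustration.

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """IoU, Dice, precision, recall, and accuracy (Eqs. 29-33) for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    union = tp + fp + fn
    return {
        "IoU": tp / union if union else 1.0,
        "Dice": 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0,
        "Precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "Recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "Accuracy": (tp + tn) / (tp + tn + fp + fn),
    }
```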
Table 6 shows the comprehensive characterization results of MPI-CT image segmentation for deep learning models with different architectures. Apart from using the Dice and IoU scores as model evaluation metrics, we also characterized the performance of each model using accuracy, precision, and recall. Overall, the 4-head attention U-Net model outperforms the other multi-head attention U-Net models, the original U-Net model, and the state-of-the-art Transformer U-Net model. Representative MPI-CT image segmentation results, together with the corresponding input images and ground truths of each architecture, are illustrated in Figure 22.
Table 6. Quantitative evaluation (average ± standard deviation of each metric) of the different deep learning architectures for MPI-CT image segmentation.
Methods | Accuracy | Precision | Recall | Dice | IoU
U-Net | 0.983 ± 0.004 | 0.891 ± 0.074 | 0.879 ± 0.076 | 0.883 ± 0.059 | 0.794 ± 0.089
Transformer U-Net | 0.985 ± 0.005 | 0.909 ± 0.057 | 0.878 ± 0.069 | 0.892 ± 0.053 | 0.809 ± 0.083
1-head Attention U-Net | 0.984 ± 0.005 | 0.892 ± 0.068 | 0.891 ± 0.069 | 0.889 ± 0.052 | 0.804 ± 0.083
2-head Attention U-Net | 0.985 ± 0.004 | 0.888 ± 0.063 | 0.911 ± 0.057 | 0.897 ± 0.041 | 0.816 ± 0.052
3-head Attention U-Net | 0.987 ± 0.005 | 0.926 ± 0.038 | 0.890 ± 0.065 | 0.906 ± 0.039 | 0.830 ± 0.063
4-head Attention U-Net (the proposed model) | 0.987 ± 0.005 | 0.920 ± 0.040 | 0.902 ± 0.058 | 0.909 ± 0.036 | 0.835 ± 0.060
5-head Attention U-Net | 0.986 ± 0.004 | 0.913 ± 0.049 | 0.903 ± 0.060 | 0.906 ± 0.030 | 0.830 ± 0.050
6-head Attention U-Net | 0.985 ± 0.005 | 0.894 ± 0.074 | 0.912 ± 0.053 | 0.901 ± 0.043 | 0.822 ± 0.070
Figure 22. Visualization of the semantic segmentation results of the proposed model compared to other traditional U-Net models. From left to right: input MPI-CT images, the ground truth images, and the segmentation results generated by U-Net, Trans-U-Net, Attention U-Net, and our proposed model (4-head attention), respectively.
4.4 Conclusion
Since MPI is a novel medical imaging technology, the available data are strictly limited for a robust computational study. This work demonstrates the multi-head attention U-Net model, an efficient end-to-end deep learning model based on the U-Net architecture and the multi-head attention mechanism, for MPI-CT image segmentation. The proposed model was trained using a custom MPI-CT image dataset collected from transgenic mice with breast tumors injected with a promising MPI tracer for tumor imaging, namely NWs-ICG. To examine the concept of multi-head attention, a simple convolution block is employed as the backbone structure of the U-Net architecture to minimize the influence of other factors. Naturally, the performance of the U-Net architecture could also be improved by using more efficient convolution blocks as the backbone. The optimal number of attention heads was experimentally determined in this study. Although an increase in the number of attention heads can potentially boost the model's capability, an excessive number of attention heads results in a decline in performance. Our study shows that the attention U-Net with 4 heads is the most favorable architecture for MPI-CT image segmentation. In future work, in addition to improving the model's performance, we would like to explore the possibility of exploiting deep learning for 3D MPI segmentation and MPI intensity segmentation. We anticipate this work will initiate an intensive study of MPI image analysis and its implementation in humans in the near future.
72 CHAPTER 5: Monocular Depth Estimation Based on Deep Learning for Intraoperative Guidance Using Surface-enhanced Raman Scattering (SERS) Imaging Reprinted with permission from “A. Juhong, et al., "Monocular depth estimation based on deep learning for intraoperative guidance surface-enhanced Raman scattering (SERS) imaging." Photonics Research, 13, no. 2, pp. 550-560 (2025)” [134], © Optica Publishing Group and Chinese Laser Press. Imaging of surface-enhanced Raman scattering (SERS) nanoparticles (NPs) has been intensively studied for cancer detection due to its high sensitivity, unconstrained low signal-to-noise ratios, and multiplexing detection capability. Furthermore, conjugating SERS NPs with various biomarkers is straightforward, resulting in numerous successful studies on cancer detection and diagnosis. However, Raman spectroscopy only provides the spectral data from an imaging area without co-registered anatomic context. This is not practical and suitable for clinical applications. Here, we propose a custom-made Raman spectrometer together with computer vision-based positional tracking and monocular depth estimation using deep learning (DL) for the visualization of 2D and 3D SERS NPs imaging, respectively. In addition, the SERS NPs used in this study (hyaluronic acid (HA)-conjugated SERS NPs) showed clear tumor targeting capabilities (target CD44 typically overexpressed in tumors) by an ex vivo experiment and immunohistochemistry. The combination of Raman spectroscopy, image processing, and SERS molecular imaging, therefore, offers a robust and feasible potential for clinical applications. 5.1 Introduction Surgical resection of a tumor is a standard of care therapy for most solid tumors. The ultimate goal of surgical resection is to remove the entire tumor with minimal damage to adjacent tissue, an outcome that strongly correlates with reduced tumor recurrence and improved survival [135, 136]. 73 Tumor margins in numerous aggressive cancers are typically indistinct due to the primary tumor’s propensity to invade into adjacent healthy tissue areas. As a result, defining appropriate margins for surgical resection remains challenging [137]. There are several modalities used in the clinic to visualize tumors and facilitate tumor removal such as magnetic resonance imaging (MRI), positron emission tomography (PET), and computed tomography (CT) [138-141]. However, these imaging modalities lack sufficient resolution needed to identify and remove microscopic sites of cancer invasion from the main tumor mass. To achieve precise tumor delineation and complete resection, a suitable intraoperative tool should meet the following requirements: high sensitivity and specificity, short acquisition time for real-time or near-real-time intraoperative detection, and high spatial resolution. With regards to imaging modalities, optical imaging exhibits distinct advantages compared to the previously mentioned non-optical imaging modalities in several aspects, such as lack of ionizing radiation, high sensitivity, and excellent spatiotemporal resolution [142-145]. Recently, surface-enhanced Raman spectroscopy (SERS) nanoparticles (NPs) imaging has increasingly been recognized as a promising molecular imaging technique for clear delineation of tumor margins and tumor surgical resection due to its exceptional sensitivity, distinctive Raman signature (fingerprint), multiplexing detection capability [146-152], and lack of autofluorescence and photobleaching problems associated with fluorescence imaging. 
SERS NPs are composed of a gold core, Raman active dye, and silica shell, which have been developed to function as tumor- targeting beacons showing substantially strong signals due to the surface plasmon resonance (SPR) effect [153] of the metallic core (gold). In addition, they can be effortlessly conjugated with various tumor-targeting ligands as well as fabricated with different Raman-active dyes. Each Raman dye emits a unique Raman spectrum, called “flavor”, facilitating multiplexing. Several research groups, as well as our group, have demonstrated encouraging results of SERS NPs imaging for ex 74 vivo, in vivo, and image-guided surgery experiments [154-158]. However, Raman spectroscopy predominantly provides spectral data, lacking the capability to co-register and visually represent anatomic features, limiting applications for image-guided surgery. To overcome this problem, we propose a custom-made Raman spectroscopy system together with computer vision-based positional tracking and DL-based techniques to visualize 2D and 3D SERS NPs imaging, respectively. Specifically, the traditional template matching algorithm [159] is employed for probe tracking, and the affine transformation [160] is then used to co-register a 2D SERS image (reconstructed by using the multiplexing algorithm [161, 162]) and a sample photograph. For 3D imaging, the image is reconstructed based on a deep-learning monocular depth estimation (distance relative to the camera) of each given pixel in the input image. Multiple Depth Estimation Accuracy with Single Network (MiDaS) is a promising DL technique that estimates depth from an arbitrary input image. MiDaS utilizes a conventional encoder-decoder structure to generate the depth map images. The legacy MiDaS V2.1 model [163] uses a residual network as the backbone for feature extraction as this network structure is invulnerable to vanishing gradients and allows MiDaS to extract multi-channel feature maps from input tensors. The vision transformer (ViT) [164] is the state-of-the-art model employed in computer vision tasks. It can surpass convolutional neural networks (CNNs)-based models across various domains and settings. Therefore, the latest MiDaS versions (3.0 [165] and 3.1 [166]) replace the CNNs backbone with vision transformer networks showing superior results. In this work, we directly utilized the pre- trained MiDaS 3.1 to reconstruct a 3D mouse image and co-register with the SERS image. 75 5.2 Methods 5.2.1 Raman spectrometer A schematic of the proposed Raman system is illustrated in Figure 23. A 785-nm laser (iBeam Smart 785, Toptica Photonics, Munich, Germany) is employed for the excitation source, the custom-made fiber bundle Raman catheter (Fiber guide Industries, Caldwell, ID, USA) is used for the laser illumination and the Raman spectra collection. A proximal end of the probe is made up of one single mode fiber (780HP, 4.4 µm core diameter) for 785 nm laser illumination and 36 multimode fibers (AFS200/220T, 200-µm core) for the Raman spectra collection as shown in Figure 23(b). The single-mode fiber for illumination is centrally positioned with the probe and encompassed by the 36 multimode fibers for Raman spectra acquisition. In addition, a fused silica plano-convex lens (L1, f=6.83 mm, PLCS-4.0-3.1-UV, CVI Laser Optics, Albuquerque, NM, USA) is placed in front of the probe to collimate the 785 nm laser illumination with a beam diameter of 1 mm and power of 30 mW on the sample. 
The distal end is arranged in a vertical (linear) array to effectively couple the light into the spectrometer (Kymera 193i-A, Andor Technology, Belfast, UK) by using optical relay lenses (L2, f = 100 mm, AC254-100-B and L3, f = 80 mm, AC254-080-B, Thorlabs Inc., Newton, NJ, USA). In addition, the Rayleigh scattering from the collected light is filtered out by a long-pass filter (LPF, λc = 830 nm; BLP01-830R-25, Semrock, Rochester, NY, USA) placed between the relay lenses. As a result, the light that traverses the spectrometer is solely Stokes-Raman scattering. The Stokes-Raman scattered light from the spectrometer is then collected by a cooled deep-depletion spectroscopic charge-coupled device (CCD) array (1024 x 256 pixels with a pixel size of 26 µm x 26 µm; DU920P Bx-DD, Andor Technology, Belfast, UK) with a wavelength range of 835-912 nm (Raman shift of 770-1777 cm-1). To achieve raster scanning, a two-axis translation stage is constructed by joining two linear stages in an orthogonal manner (DDS050, Thorlabs Inc., Newton, NJ, USA). Furthermore, a color monocular camera (ELP 5-50mm, with Sony IMX323 chip, Shenzhen, China) is applied to track the Raman probe position and capture the sample photograph to reconstruct the 2D and 3D co-registered SERS images.

Figure 23. Schematic of the custom-made Raman imaging system together with the visualization system. (a) The optical diagram of the Raman spectroscopy system. A 785 nm laser is used to illuminate the sample through a single-mode fiber and collimated by an L1 lens. The scattered light is then collected by the Raman probe and coupled into the spectrometer using the relay optics (L2 and L3 lenses) with an interchangeable mirror (IM) and a long-pass filter (LPF) in between. The spectrometer consists of a rotatable grating, three mirrors (M1: reflection mirror, M2: collimating mirror, and M3: focusing mirror), and a back-illuminated deep-depletion CCD. To perform 2D Raman imaging, the Raman probe is translated by a two-axis motorized stage. (b) The photograph of the distal and proximal ends of the custom-made fiber bundle. (c) Schematic of the visualization system for generating the 2D and 3D co-registered SERS imaging.
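As a quick arithmetic check on the detection window quoted above, the Stokes Raman shift of a collected wavelength under 785 nm excitation follows from the wavenumber difference, 10^7 (1/λ_ex − 1/λ). The short Python sketch below is illustrative only (it is not part of the acquisition software) and reproduces the stated 770-1777 cm-1 range to within a few wavenumbers.

import numpy as np  # only used to keep the sketch self-contained if extended

EXCITATION_NM = 785.0

def raman_shift_cm1(collection_nm: float, excitation_nm: float = EXCITATION_NM) -> float:
    """Stokes Raman shift (cm^-1) for light collected at `collection_nm` (nm)."""
    return 1e7 * (1.0 / excitation_nm - 1.0 / collection_nm)

if __name__ == "__main__":
    for wavelength in (835.0, 912.0):
        print(f"{wavelength:.0f} nm -> {raman_shift_cm1(wavelength):.0f} cm^-1")
    # Prints roughly 763 cm^-1 and 1774 cm^-1, consistent with the quoted
    # 770-1777 cm^-1 window once the calibrated band edges are taken into account.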
5.2.2 SERS NPs synthesis

SERS NPs were synthesized using the tris-assisted synthesis protocol with Au NP formation at elevated temperature, as shown in Figure 24(a). First, the sodium citrate reduction approach was employed to prepare 17 nm Au-NP seeds. The seeds were then mixed with tris at 98 °C, followed by adding gold chloride for seed-mediated growth to obtain 50 nm Au-NPs. The Raman dye was promptly added after the formation of the 50 nm Au-NPs, and the solution was stirred for one minute, followed by cooling in an ice bath. To functionalize the SERS NPs with biomolecules, particularly hyaluronic acid (HA) and polyethylene glycol (PEG), thiol groups were employed to attach these biomolecules to the Au NPs via the gold-thiol interaction [167-171]. S420 SERS NPs were mixed with thiolated HA, and this mixture solution was then incubated at 4 °C overnight. After that, unbound HA was removed by repeated centrifugation. Likewise, the procedure to conjugate PEG with S481 SERS NPs was the same as the HA conjugation. The size and shape of the synthesized SERS NPs were characterized by a transmission electron microscope (TEM; 2200FS, JEOL Ltd., Tokyo, Japan) and a dynamic light scattering particle analyzer (DLS; Zetasizer Nano ZS, Malvern Panalytical Ltd., Malvern, England, UK). The SERS NPs are homogeneous spheres approximately 50 nm in diameter, as shown in Figure 24(b). The DLS result further confirmed the size distribution, with a measured diameter of 56 nm, as shown in Figure 24(c). The comprehensive synthesis protocol and characterization of the SERS NPs are demonstrated in our previous work [157]. The normalized Raman spectra (acquired by our custom-made Raman spectrometer) of S420 and S481 SERS NPs at a concentration of 500 pM are shown in Figure 24(d).

Figure 24. Synthesis of the SERS NPs. (a) SERS NPs synthesis and HA/PEG conjugation procedure. First, 17 nm gold seeds (Au NPs) are formed. Second, the NPs further grow to 50 nm while different Raman reporters (S420 and S481) are attached to the gold surface. Lastly, the SERS NPs are functionalized with HA or PEG. (b) TEM image of the SERS NPs with a diameter of approximately 50 nm. (c) DLS result of the corresponding SERS NPs. The measured size is 56.16 nm in diameter. (d) Normalized Raman spectra of the stock SERS solutions of both flavors (S420 and S481).

5.2.3 Position tracking and image co-registration algorithms

Before processing the data acquired by a low-cost camera, camera calibration [172, 173] was applied to correct the image distortion due to the lens quality and optical alignment. The template matching algorithm [174] is then used to determine the precise position of a Raman probe image (the template image) in a large surgery area image (the input image). The concept of this algorithm is to slide the template image over the input image, akin to a 2D convolution operation, followed by a comparison of the template and the corresponding patch of the input image, which can be done by several methods. In this work, we employed the normalized correlation coefficient (TM_CCOEFF_NORMED) implemented in Python using the OpenCV library [175] to calculate the template matching for the Raman probe detection. With the Raman probe position, the scanning position can be easily estimated during data acquisition.

In addition, to accurately overlay the SERS image (X) and the surgery area image (Y), an image co-registration algorithm is required, which calculates the geometric transformation matrix (T) as shown in the equations below:

Y = TX,   (34)

X = \begin{bmatrix} x'_1 & x'_2 & \cdots & x'_n \\ y'_1 & y'_2 & \cdots & y'_n \\ 1 & 1 & \cdots & 1 \end{bmatrix},   (35)

Y = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \\ y_1 & y_2 & \cdots & y_n \\ 1 & 1 & \cdots & 1 \end{bmatrix},   (36)

T = \begin{bmatrix} m_{00} & m_{01} & m_{02} \\ m_{10} & m_{11} & m_{12} \\ 0 & 0 & 1 \end{bmatrix},   (37)

where (x'_n, y'_n) and (x_n, y_n) are the corresponding positions (n is the number of corresponding positions) in the input image X and the reference image Y, respectively, and m_{ij} are the simplified transformation matrix parameters derived from the rotation, scaling, shearing, and translation matrices, as shown in the equation below:

T = \begin{bmatrix} 1 & 0 & t_x \\ 0 & 1 & t_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} s_x & 0 & 0 \\ 0 & s_y & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & sh_x & 0 \\ sh_y & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix},   (38)

where, in the translation matrix, t_x and t_y are the displacements along the x and y axes, respectively; in the scaling matrix, s_x and s_y are the scale factors along the x and y axes, respectively; in the shear matrix, sh_x and sh_y are the shear factors along the x and y axes, respectively; and, in the rotation matrix, θ is the angle of rotation. Indeed, T can be estimated from the corresponding points by minimizing the least-squares error ε²:

ε² = ‖TX − Y‖²,   (39)

dε²/dT = −2(Y − TX)Xᵀ = 0,   (40)

YXᵀ = TXXᵀ,   (41)

T = (YXᵀ)(XXᵀ)⁻¹.   (42)
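For concreteness, the two steps above (probe localization with TM_CCOEFF_NORMED and the least-squares affine fit of Eqs. (34)-(42)) map to a few OpenCV/NumPy calls. The following is a minimal, self-contained sketch; the synthetic images and the fiducial coordinates are illustrative placeholders rather than data from the actual system.

import cv2
import numpy as np

# 1) Probe tracking by normalized correlation-coefficient template matching.
rng = np.random.default_rng(0)
frame = rng.integers(0, 60, (480, 640), dtype=np.uint8)   # stand-in for the surgery-area photo
probe = np.full((40, 40), 200, dtype=np.uint8)            # stand-in for the probe template
frame[300:340, 500:540] = probe                           # "place" the probe in the scene

score = cv2.matchTemplate(frame, probe, cv2.TM_CCOEFF_NORMED)
_, _, _, top_left = cv2.minMaxLoc(score)                  # (x, y) of the best-matching corner
probe_xy = (top_left[0] + probe.shape[1] // 2,
            top_left[1] + probe.shape[0] // 2)            # estimated probe centre

# 2) Least-squares affine co-registration from n corresponding points
#    (here the four SERS-image corners and four fiducials on the photo).
src = np.array([[0, 0], [255, 0], [255, 255], [0, 255]], dtype=np.float64)       # SERS image corners
dst = np.array([[102, 88], [395, 95], [390, 380], [98, 372]], dtype=np.float64)  # made-up fiducial marks

X = np.hstack([src, np.ones((4, 1))])       # rows are homogeneous source points
Y = np.hstack([dst, np.ones((4, 1))])
A, *_ = np.linalg.lstsq(X, Y, rcond=None)   # least-squares solve of X A = Y
T = A.T                                     # 3x3 affine matrix; Eq. (42) up to the row/column convention

sers = rng.integers(0, 255, (256, 256), dtype=np.uint8)   # stand-in for the reconstructed SERS weight map
registered = cv2.warpAffine(sers, T[:2, :], (frame.shape[1], frame.shape[0]))
print(probe_xy, np.round(T, 3))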
To obtain a more accurate co-registration result (the 2D co-registered SERS image), the estimated transformation matrix (T) is then applied to the reconstructed SERS image (X) derived from the demultiplexing algorithm. In our case, a raster scan was applied to reconstruct the SERS image, and fiducial landmarks (the four corners of the scanning area) were marked on the sample. Thus, the four corners of the SERS image were used as the points corresponding to the four fiducial points on the sample for the image co-registration.

5.2.4 Depth estimation using DL

MiDaS is considered a promising model for performing monocular depth estimation. The original MiDaS V2.1 [163] is based on a CNN backbone, whereas the newer versions (MiDaS V3.0 [165] and V3.1 [166]) employ transformer architectures as their backbones, which significantly outperform the original version. The training protocols of the MiDaS V2.1, 3.0, and 3.1 models are analogous. Briefly, the MiDaS models were trained by using a mixture of 12 datasets, multi-objective optimization [176] with Adam [177], and a scale-and-shift-invariant loss [178]. The encoder and decoder weights were updated by applying learning rates of 1e-5 and 1e-4, respectively. The models were initially pre-trained on a subset of the datasets for 60 epochs, followed by training for another 60 epochs on the full dataset. The complete training details are elucidated in the original MiDaS V2.1 paper. All DL models demonstrated in this work were implemented on a personal computer equipped with an 11th Gen Intel Core i7-11700K CPU, 64 GB of memory, and an NVIDIA RTX 3090 graphics processing unit (GPU).

Indeed, all MiDaS models are built using encoder and decoder structures. Each MiDaS model differs in the backbone of the encoder part (variants of CNN and transformer architectures), while the rest of the model remains consistent. Since the latest MiDaS V3.1 provides the best results compared to the other versions, it is used in this study. Bidirectional Encoder representation from Image Transformers (BEiT) [179] is used as the backbone of MiDaS V3.1, as shown in Figure 25(a-b). BEiT is a state-of-the-art architecture that enables self-supervised pretraining of vision transformers (ViT) to surpass supervised pretraining. The pretraining task in BEiT is masked image modeling (MIM), as shown in Figure 25(b). The concept of MIM is to recover the original visual tokens from the corrupted image patches. In other words, MIM uses two views of each image to train the model. First, the 2D image with a size of H x W x C is divided into a sequence of HW/P² patches for each channel, where (H, W) is the image size, C is the number of channels, and (P, P) is the patch size. All the patches are then flattened into vectors and linearly projected. Second, an image tokenizer converts the image into a sequence of discrete tokens rather than using raw pixels. A discrete variational autoencoder (dVAE) [180, 181] is directly used to train this image tokenizer. Indeed, the image tokenizer is a pre-trained token generator for the input patches.
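None of this pretraining has to be reproduced here: as stated above, the published pre-trained MiDaS V3.1 weights are used directly, so inference amounts to loading the model and running one forward pass per photograph. The sketch below is hedged accordingly: the torch.hub entry names ("DPT_BEiT_L_512", "transforms", "beit512_transform") follow the intel-isl/MiDaS repository at the time of writing and should be checked against the installed release, and the input file name is a placeholder.

import cv2
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the pre-trained model and a matching input transform from torch.hub.
midas = torch.hub.load("intel-isl/MiDaS", "DPT_BEiT_L_512").to(device).eval()
hub_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = getattr(hub_transforms, "beit512_transform", hub_transforms.dpt_transform)  # fallback if the name differs

img_bgr = cv2.imread("sample_photo.png")                  # placeholder file name
img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)

with torch.no_grad():
    batch = transform(img_rgb).to(device)                 # resize + normalize for the backbone
    prediction = midas(batch)                             # inverse-depth (disparity-like) map
    prediction = torch.nn.functional.interpolate(
        prediction.unsqueeze(1), size=img_rgb.shape[:2],
        mode="bicubic", align_corners=False).squeeze()    # resample to the photo resolution

inv_depth = prediction.cpu().numpy()
inv_depth = (inv_depth - inv_depth.min()) / (inv_depth.max() - inv_depth.min() + 1e-8)  # normalize for display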
Figure 25. (a) Overview of the MiDaS V3.1 architecture. The input image is embedded with a positional embedding, and a patch-independent readout token (orange) is included. These patches are fed to four BEiT stages. At each BEiT stage, the output tensor is passed through the Reassemble and Fusion blocks to predict the encoder outputs for each stage. (b) BEiT transformer architecture used in the encoder part in (a). (c) Reassemble block applied to assemble the tokens into feature maps with 1/s the spatial resolution of the input image. (d) Fusion block used to combine the features and upsample the feature maps by a factor of two.

The outputs from the tokenizer and MIM are used to determine the loss value to update the learnable parameters, allowing the network to obtain a deep understanding of the underlying image patterns without explicit labels. It is important to note that BEiT was initially designed for an image classification problem and does not provide depth estimation functionality. To assemble MiDaS V3.1, BEiT is used as a feature extractor and must be appropriately connected to the depth decoder. Regarding the encoder-decoder structure in MiDaS, the input is progressively processed through each encoder stage and, similarly, through each decoder stage. Thus, the BEiT backbone can be integrated by placing appropriate hooks, meaning a tensor computed in the encoder is taken and made available as input for the decoder at one of its stages. This requires a reassembling process to reshape the tensors to fit the decoder, as shown in Figure 25(c-d). Essentially, the input image is embedded as tokens, which are passed through several BEiT stages. At each stage, the tokens are reassembled into image-like representations with different resolutions. After that, the fusion module is employed to fuse and upsample these image-like representations in order to generate a fine-grained prediction. The final prediction is then fed to a task-specific output head to generate the depth map image. The depth map image generated by the MiDaS model is a disparity-like image (its intensity is inversely proportional to depth), which is then projected into 3D space using the reprojectImageTo3D function in OpenCV [175]. Lastly, the color of each pixel in the 2D co-registered SERS image is mapped onto the corresponding position (x-y plane) in the 3D space of the depth map image to obtain the final 3D SERS image.
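A compact sketch of this last step is given below. The disparity map, the overlay image, and the reprojection matrix Q are illustrative placeholders; because MiDaS returns relative (unscaled) inverse depth, the reconstructed cloud is defined only up to scale and orientation conventions, but the reprojectImageTo3D call and the per-pixel color mapping mirror the procedure described above.

import cv2
import numpy as np

# Stand-ins for the normalized MiDaS output and the 2D co-registered SERS overlay.
h, w = 480, 640
disparity = np.random.rand(h, w).astype(np.float32)
overlay_bgr = np.random.randint(0, 255, (h, w, 3), dtype=np.uint8)

# Illustrative reprojection matrix Q (relative units only).
f, cx, cy, baseline = 1.0, w / 2.0, h / 2.0, 1.0
Q = np.array([[1, 0, 0, -cx],
              [0, 1, 0, -cy],
              [0, 0, 0,  f],
              [0, 0, -1.0 / baseline, 0]], dtype=np.float32)

points = cv2.reprojectImageTo3D(disparity, Q)             # (h, w, 3) XYZ for every pixel
mask = disparity > 0.05                                   # discard near-zero disparities (depth -> infinity)
xyz = points[mask]                                        # N x 3 point cloud
rgb = cv2.cvtColor(overlay_bgr, cv2.COLOR_BGR2RGB)[mask]  # N x 3 colors taken from the SERS overlay
print(xyz.shape, rgb.shape)                               # xyz/rgb can now be rendered as a colored point cloud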
5.3 Results and discussion

5.3.1 Phantom characterizations

A step-wedge phantom with a step height of 9.5 mm, constructed from standard mounting bases (BA1S, Thorlabs Inc., Newton, NJ, USA), was used to characterize the depth estimation DL models. A photograph of this phantom was captured by the camera and used as the input for the three different MiDaS models (CNN, ViT, and BEiT) to estimate the depth and compare the performance of each model. To quantify the performance of each model, the depth map intensities from step 4 to step 1 (along the white dashed line) were plotted, as illustrated in Figure 26(a). The absolute errors were then calculated from the intensity profiles of each model and the ground truth (the black line). Table 7 shows the average absolute error ± standard deviation for each model. It shows that the MiDaS model based on the BEiT architecture surpasses the other models with the lowest average absolute error of 0.0485 ± 0.1737.

Figure 26. Validation of depth map imaging and Raman spectra at different distances from a camera and a Raman catheter, respectively. (a) Depth map imaging of a step-wedge phantom generated by MiDaS models based on three different backbones (CNN, ViT, and BEiT) and the comparison of the depth map intensity profiles of each model. (b) Depth map imaging of a tumor phantom at different distances from the camera. (c) The Raman spectra of S420 SERS NPs characterized at different distances from the Raman catheter by using the step-wedge phantom. (d) A linearity plot of the highest intensity of S420 (1614 cm-1) versus the distance from the Raman catheter.

Table 7. Depth map intensity characterization results (average absolute error ± standard deviation) of MiDaS models with three different architectures: CNN, ViT, and BEiT.

Model   Step 1          Step 2          Step 3          Step 4
CNN     0.074 ± 0.56    0.070 ± 0.046   0.135 ± 0.088   0.092 ± 0.077
ViT     0.318 ± 0.14    0.252 ± 0.01    0.018 ± 0.016   0.161 ± 0.10
BEiT    0.051 ± 0.56    0.032 ± 0.04    0.024 ± 0.012   0.087 ± 0.083

Furthermore, a 3D-printed tumor phantom was utilized for thorough characterization of the MiDaS models, as depicted in Figure 26(b). The distance between the phantom and the camera was varied from 5 cm to 11 cm with an increment of 2 cm. The phantom depth map images were then generated by the MiDaS models. The quality of the images captured at the out-of-focus distances (5 cm and 7 cm) is unsatisfactory, leading to deterioration of the depth map quality, as the models cannot correctly recognize some poorly resolved areas when generating the depth map image, especially the CNN-based MiDaS model. Nevertheless, the BEiT model can still generate reasonably good depth map images. Table 8 shows four evaluation metrics (average value over all distances ± standard deviation), namely IoU, F1-score, Recall, and Precision, of the depth map images and their corresponding masks. This evaluation shows the overall performance of the MiDaS models for generating depth map images of the same object with different image quality (in-focus and out-of-focus images); in particular, the BEiT MiDaS model surpasses the other models with promising scores on all evaluation metrics. In addition, the model complexity and the average execution time for one input image were evaluated to assess the feasibility for intraoperative guidance applications. Although we implemented MiDaS on a moderate-budget GPU (an NVIDIA RTX 3090), the execution time is feasible for intraoperative guidance applications. Indeed, the execution time can be improved by using more powerful GPUs currently available on the market.

Table 8. Tumor phantom characterization results of the three different MiDaS models.

Evaluation                 CNN             ViT             BEiT
IoU                        0.139 ± 0.026   0.241 ± 0.018   0.272 ± 0.033
F1-score                   0.244 ± 0.041   0.389 ± 0.024   0.426 ± 0.042
Recall                     0.262 ± 0.024   0.370 ± 0.027   0.402 ± 0.029
Precision                  0.234 ± 0.058   0.421 ± 0.074   0.466 ± 0.088
Execution time (second)    0.861           0.998           1.175
Number of parameters       105 M           334 M           345 M

In addition to the depth map image characterization, the intensity of the Raman spectra of the same sample at various distances from the Raman catheter was also characterized by using the step-wedge phantom from Figure 26(a) and an S420 SERS NPs solution with a concentration of 500 pM, as shown in Figure 26(c). The SERS NPs solution was dropped on each step with a volume of 20 µL, followed by acquiring the Raman spectra using 30 mW laser power and a 1 second exposure time. The linearity plot of the highest peak of S420 (1614 cm-1) versus the distance between the Raman catheter and the sample is illustrated in Figure 26(d). The intensity of the Raman spectra decreases as the distance between the catheter and the sample increases. Thus, this distance dependence has to be addressed to enhance the accuracy of clinical applications.
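For reference, the quantities reported above reduce to a few lines of NumPy. The helper functions below are a minimal sketch (with an arbitrary binarization threshold, not the exact evaluation script) of how the profile errors of Table 7 and the overlap metrics of Table 8 are computed.

import numpy as np

def profile_abs_error(pred_profile: np.ndarray, gt_profile: np.ndarray):
    """Average absolute error and its standard deviation, as reported in Table 7."""
    err = np.abs(pred_profile - gt_profile)
    return err.mean(), err.std()

def segmentation_metrics(pred_map: np.ndarray, mask: np.ndarray, thr: float = 0.5):
    """IoU, precision, recall, and F1-score between a thresholded depth map and its mask (Table 8)."""
    pred = pred_map >= thr
    gt = mask.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    iou = tp / (tp + fp + fn + 1e-8)
    precision = tp / (tp + fp + 1e-8)
    recall = tp / (tp + fn + 1e-8)
    f1 = 2 * precision * recall / (precision + recall + 1e-8)
    return {"IoU": iou, "Precision": precision, "Recall": recall, "F1-score": f1}

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    print(segmentation_metrics(rng.random((64, 64)), rng.random((64, 64)) > 0.7))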
5.3.2 Ex vivo experiment

To validate the targeting capability of the HA-conjugated SERS NPs, we performed an ex vivo experiment on tumor tissue and spleen connective tissue (control) harvested from the MUC1 breast tumor mouse model [38]. All procedures used in experiments conducted on animals were approved by the Institutional Animal Care & Use Committee (IACUC) of Michigan State University. The SERS NPs used in this experiment were also published in our previous work [157]. First, we scanned the background signal from all the tissues. Second, all tissues were incubated with a mixture solution of S420-HA and S481-PEG SERS NPs at a concentration of 250 pM for 15 minutes. The S481-PEG was used as a control (non-targeting) SERS NPs solution. In the next step, all the tissues were rinsed with phosphate-buffered saline (PBS) 4-5 times, followed by acquiring the Raman spectra and reconstructing the image using the demultiplexing algorithm [161, 162]. This algorithm is based on the direct classical least squares (DCLS) method, using the measured Raman spectra, the reference spectra of the SERS NPs of each flavor (spectra of a pure SERS NPs solution at a high concentration), and the background spectra as inputs to estimate the weight of a specific flavor.
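The core of that demultiplexing step is an ordinary linear least-squares fit per scan position. The sketch below uses synthetic spectra (not the actual reference library) purely to illustrate the DCLS idea of recovering the per-flavor weights from a measured spectrum.

import numpy as np

n_bands = 1024                                  # spectral points per measurement (illustrative)
rng = np.random.default_rng(2)
ref_s420 = rng.random(n_bands)                  # pure S420 reference spectrum (stand-in)
ref_s481 = rng.random(n_bands)                  # pure S481 reference spectrum (stand-in)
background = rng.random(n_bands)                # tissue/background spectrum (stand-in)

# Synthetic measurement: a linear mixture of the references plus noise.
measured = 0.7 * ref_s420 + 0.2 * ref_s481 + 1.0 * background + 0.01 * rng.standard_normal(n_bands)

A = np.column_stack([ref_s420, ref_s481, background])    # (n_bands, 3) design matrix
weights, *_ = np.linalg.lstsq(A, measured, rcond=None)   # [w_S420, w_S481, w_background]
print(np.round(weights, 3))                              # the S420-HA weight is what is mapped per scan position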
Figure 27. (a) Multiplexed Raman images of tissues topically stained with the mixture of SERS-HA (CD44 targeting) and SERS-PEG (control) solution, (a1) photographs of the mouse tumor tissue and spleen connective tissue (control), and (a2-a4) Raman images of the individual channels and the ratiometric results. (b) H&E and IHC-CD44 images of the corresponding tissues. (c) Representative enlarged IHC images in (b) of the breast tumor and normal tissues. Scale bars in (a-b) and (c) are 5 mm and 50 μm, respectively.

Ideally, by rinsing the tissues after incubation, the non-targeting NPs (S481-PEG) should be removed from the incubated tissues, and the majority of the targeting NPs (S420-HA) should remain on the tumor with overexpressed CD44. However, in the practical experiment, we detected signals from both S420-HA and S481-PEG in both the tumor and normal tissues, as shown in Figure 27(a1-a3), due to tissue texture and non-specific binding. Therefore, the Raman ratiometric image of S420-HA and S481-PEG was applied to evaluate the targeting of the NPs, as shown in Figure 27(a4). According to the ratiometric result, the ratio of the targeting NPs (S420-HA) on the tumor tissue is significantly higher than that on the control tissue, which is encouraging and promising. Furthermore, H&E and CD44 IHC staining of the corresponding tissues were prepared, and the results are shown in Figure 27(b1-b2), respectively. CD44 is labeled as brown areas, and the staining is intense (overexpressed) in the tumor tissue, as shown in Figure 27(c). This is also consistent with the ratiometric result.

5.3.3 Image-guided surgery experiment

In this experiment, we aimed to validate the capability of the proposed Raman system and SERS NPs and to closely replicate the clinical conditions of human surgery. A 5-month-old female C57BL/6 double transgenic mouse with breast cancer was used for this experiment. First, the operative surgery area (tumor area) was defined, followed by acquiring the Raman signal as the background signal. The mouse was then intratumorally injected with the S420-HA solution at a concentration of 500 pM, a volume of 100 µL, and an injection depth of approximately 2-3 mm. Forty-two hours after the injection, the mouse was anesthetized by using a table-top research anesthesia machine (V300PS, Parkland Scientific, USA) with 10 L/min of oxygen flow and 1.5% anesthetic agent vapor in oxygen during the image-guided surgery imaging. The skin over the tumor was then cut open, followed by rinsing the tumor area with PBS 4-5 times and acquiring the Raman spectra. After that, the Raman image (weight of S420-HA) of the scanned area was reconstructed, and the tumor was gradually resected following the white boundaries, as shown in Figure 28. It is important to note that the deeper the resection is performed, the weaker the signal of the SERS NPs is. This is due to the effective working distance of the Raman probe. Therefore, depth information of the operative area is essential for providing additional insights and guidance for more effective surgery, and we also demonstrate the concept of 3D SERS NPs imaging in the next section.

Figure 28. SERS image-guided surgery for resection of a mouse with a breast tumor. (a) Photographs of the tumor during the intraoperative SERS image-guided surgery from the first removal to the complete removal. (b) The corresponding SERS imaging (weight of S420-HA) reconstructed by the demultiplexing algorithm. The scale bar is 5 mm, and the white boundaries depict the resection regions.

5.3.4 2D tracking and 3D SERS imaging

In addition to the image-guided surgery and ex vivo experiments, we demonstrate our custom-made Raman system and monocular depth estimation based on DL to visualize the SERS NPs signal on the sample on 2D and 3D surfaces in the physical world. To simplify the experiment, the S420-HA solution with a concentration of 500 pM was directly dropped on the cut-open tumor of another breast tumor mouse with an incubation time of 15 minutes, followed by rinsing with PBS 4-5 times and acquiring the Raman spectra, respectively. Before applying this S420-HA solution, the background Raman signal was also acquired, as it is one of the input variables for the SERS image reconstruction. A color camera was used to record a video of the scanning area and capture the photograph of the sample to generate the 2D SERS mapping video and the 3D SERS image. To generate the 2D SERS mapping video, the template matching algorithm was applied to track the Raman catheter position and estimate the scanning positions. After that, the SERS signals (the weights of S420-HA) were mapped onto these estimated scanning positions, as shown in Figure 29(a). After completing the scanning, the image co-registration algorithm was applied to co-register the 2D SERS image with the sample photograph, and the BEiT-based MiDaS model was utilized to generate the depth map image. With these 2D co-registered SERS and depth map images, the 3D co-registered SERS image was reconstructed and projected as a point cloud in 3D space, as shown in Figure 29(b). Since Figure 29(a) shows the Raman catheter tracking with real-time 2D SERS image reconstruction, a large field of view (FOV) was needed to cover both the catheter and the scanning area. Nevertheless, a smaller FOV was employed to illustrate greater detail in the 3D SERS image shown in Figure 29(b). According to these promising results, the proposed method can facilitate 2D and 3D SERS imaging through the utilization of a Raman catheter system and a simple camera, which can greatly improve the visualization and precision of the SERS NPs distribution, leading to more efficient clinical applications. Specifically, it is beneficial for image-guided surgery by assisting surgeons to locate solid tumors and achieve more precise resections. However, there is an obvious artifact pattern in the 3D SERS imaging.
It is caused by the large excitation laser beam diameter (approximately 1 mm). This could be resolved by improving the optical design of the Raman system to reduce the beam size and adding a scanner to maintain the acquisition speed, which could be our future work.

Figure 29. (a) 2D SERS image during Raman spectra acquisition, (a1) before scanning, (a2) during scanning, and (a3) after complete scanning. (b) 3D images of the sample, SERS, and co-registered SERS reconstructed by using the affine transformation and the MiDaS 3.1 DL model with the BEiT backbone architecture. The scale bars of (a1) and (b1) are 10 mm and 8 mm, respectively.

5.4 Conclusion

Intraoperative imaging systems, in tandem with exogenous contrast agents, play a crucial role in tumor resection by assisting a surgeon to identify tumor areas with a high degree of sensitivity and specificity. However, traditional imaging systems commonly suffer from poor tumor margin visualization, particularly the weak signal of a tumor at deeper layers. Without depth information, these weak signals might be neglected, leading to ineffective tumor resection. Therefore, the whole tumor might not be completely removed, causing tumor recurrence. In recent years, SERS NPs imaging has been increasingly recognized as an encouraging molecular imaging technique due to its remarkable sensitivity, multiplexing detection capability, and photostability. In addition, it has demonstrated significant potential in cancer detection and in enhancing the delineation of tumor margins, as SERS NPs can be easily conjugated with various biomarkers.

In this work, we propose an approach to visualize 2D and 3D SERS imaging. A step-wedge phantom and a tumor phantom were used to evaluate the depth map estimation performance of MiDaS models with three different backbone architectures: CNN, ViT, and BEiT. MiDaS based on BEiT outperformed the other models; thus, it was employed for the 3D visualization of the SERS NPs. HA-conjugated SERS NPs were evaluated by ex vivo and image-guided surgery experiments using the traditional 2D SERS image reconstruction, showing promising results. Nevertheless, this reconstruction lacks the depth information needed for practical clinical applications, affecting surgery outcomes. Therefore, the proposed approach combines the use of a custom-made Raman spectrometer with computer vision-based positional tracking for 2D SERS imaging and monocular depth estimation based on the MiDaS model for 3D SERS imaging. This combination can overcome the disadvantage of the conventional Raman system, which only provides spectral information and is unsuitable for clinical applications. The 2D and 3D image co-registration between the Raman imaging and the sample photograph in the physical world enables better performance and efficiency of tumor resection, potentially leading to its implementation in human clinical trials in the near future.

Essentially, the proposed method is a proof-of-concept study of image-guided surgery using 2D and 3D SERS imaging. However, there are some limitations that need to be addressed in the future, particularly the resolution of the SERS imaging. The excitation laser beam diameter in the proposed system is somewhat large (roughly 1 mm), causing artifacts in the 2D and 3D image reconstruction, which is unsuitable for small tumor resection. Therefore, the optics should be re-designed to obtain a smaller beam size for enhanced resolution. In addition, the depth map estimation using MiDaS can be influenced by the resolution of an input image acquired at an out-of-focus distance.
Thus, auto-focus approaches, such as resolution enhancement deep learning or a hardware-based approach, should be considered to avoid this problem. The proposed method may be more feasible for future clinical applications as a result of these improvements.

CHAPTER 6: Summary and future work

In this dissertation, a wide range of biomedical applications based on different deep learning techniques have been presented. Firstly, a practical deep learning model for the resolution enhancement of H&E-stained images using the state-of-the-art SRGAN-ResNeXt network has been demonstrated. The model can deeply learn how to map the low-resolution images to their corresponding high-resolution images. Even though cell images contain sophisticated patterns and structures, the SRGAN-ResNeXt model can still provide high-quality reconstruction results. Moreover, it can outperform the original SRGAN model. Therefore, we take these advantages to characterize and quantify the nuclei from the generated high-resolution images.

Secondly, deep learning based on recurrent and convolutional neural networks has been demonstrated for generating sequential NWs-ICG optoacoustic (multispectral unmixing), ultrasound, and optoacoustic images. It has shown robust and promising performance in the accurate reconstruction of the sequential images for all modalities, according to the quantitative evaluation of model performance using the PSNR and SAE for all scanning positions of the generated images (reconstructed by the deep learning model) and the ground truth (acquired by mechanical scanning). The architecture of our model is versatile, since it can promisingly generate sequential cross-sectional images of three modalities from a commercial MSOT system. Using our deep learning model can substantially reduce the acquisition time. However, all the training data were acquired from ex vivo tissues completely fixed in agarose. Model performance with images acquired in vivo may be affected by cardiac and respiratory motion.

Thirdly, the proposed multi-head attention U-Net model, an efficient end-to-end deep learning model based on the U-Net architecture and a multi-head attention mechanism, was demonstrated for MPI-CT image segmentation. The proposed model was trained using a custom MPI-CT image dataset collected from transgenic mice with breast tumors injected with a promising MPI tracer for tumor imaging, namely NWs-ICG. The optimal number of attention heads was experimentally determined in this study. Although an increase in the number of attention heads can potentially boost the model's capability, an excessive number of attention heads results in a decline in performance. Our study shows that the attention U-Net with four heads is the most favorable architecture for MPI-CT image segmentation.

Lastly, we propose a method to generate 2D and 3D SERS imaging. The proposed method integrates the use of a custom-made Raman spectrometer with image processing and deep learning to generate 2D and 3D SERS images, which can overcome the drawback of the conventional Raman system, which only provides spectral information. The 2D and 3D image co-registration between the Raman imaging and the sample photograph in the physical world enables better performance and efficiency of tumor resection, potentially leading to its implementation in human clinical trials in the near future.

In addition to the applications mentioned above, I am working on virtual H&E imaging using deep learning.
In this work, a virtual H&E deep learning model is employed to transform autofluorescence images of unstained tissue slides into virtual H&E images. Another deep learning model is then applied to screen for cancerous areas. With this concept, it could potentially shorten the standard cancer diagnosis workflow and be useful for practical clinical applications. Furthermore, in my future work, I plan on developing a universal vision-language foundation deep learning model using a variety of pathology images and fundamental biomedical texts for cancer detection, with several downstream tasks related to pathology images, to achieve superb performance on pathology image classification, segmentation, and biomarker quantification.

BIBLIOGRAPHY

1. Juhong, A., Li, B., Yao, C.-Y., Yang, C.-W., Agnew, D. W., Lei, Y. L., Huang, X., Piyawattanametha, W., and Qiu, Z. (2022). Super-resolution and segmentation deep learning for breast cancer histopathology image analysis. Biomedical Optics Express: 14, 18-36.

2. Litjens, G., Sánchez, C. I., Timofeeva, N., Hermsen, M., Nagtegaal, I., Kovacs, I., Hulsbergen-Van De Kaa, C., Bult, P., Van Ginneken, B., and Van Der Laak, J. (2016). Deep learning as a tool for increased accuracy and efficiency of histopathological diagnosis. Scientific reports: 6, 1-11.

3. Mendez, A. J., Tahoces, P. G., Lado, M. a. J., Souto, M., and Vidal, J. J. (1998). Computer-aided diagnosis: Automatic detection of malignant masses in digitized mammograms. Medical Physics: 25, 957-964.

4. Bogoch, I. I., Koydemir, H. C., Tseng, D., Ephraim, R. K., Duah, E., Tee, J., Andrews, J. R., and Ozcan, A. (2017). Evaluation of a mobile phone-based microscope for screening of Schistosoma haematobium infection in rural Ghana. The American journal of tropical medicine and hygiene: 96, 1468.

5. Petti, C. A., Polage, C. R., Quinn, T. C., Ronald, A. R., and Sande, M. A. (2006). Laboratory medicine in Africa: a barrier to effective health care. Clinical Infectious Diseases: 42, 377-382.

6. Colley, D. G., Bustinduy, A. L., Secor, W. E., and King, C. H. (2014). Human schistosomiasis. The Lancet: 383, 2253-2264.

7. Irshad, H., Veillard, A., Roux, L., and Racoceanu, D. (2013). Methods for nuclei detection, segmentation, and classification in digital histopathology: a review—current status and future potential. IEEE reviews in biomedical engineering: 7, 97-114.

8. Sirinukunwattana, K., Raza, S. E. A., Tsang, Y.-W., Snead, D. R., Cree, I. A., and Rajpoot, N. M. (2016). Locality sensitive deep learning for detection and classification of nuclei in routine colon cancer histology images. IEEE transactions on medical imaging: 35, 1196-1206.

9. Song, Y., Zhang, L., Chen, S., Ni, D., Lei, B., and Wang, T. (2015). Accurate segmentation of cervical cytoplasm and nuclei based on multiscale convolutional network and graph partitioning. IEEE Transactions on Biomedical Engineering: 62, 2421-2433.

10. Xing, F., Xie, Y., and Yang, L. (2015). An automatic learning-based framework for robust nucleus segmentation. IEEE transactions on medical imaging: 35, 550-566.

11. Xing, F. and Yang, L. (2016). Robust nucleus/cell detection and segmentation in digital pathology and microscopy images: a comprehensive review. IEEE reviews in biomedical engineering: 9, 234-263.

12. Otsu, N. (1979). A threshold selection method from gray-level histograms. IEEE transactions on systems, man, and cybernetics: 9, 62-66.

13. Yang, X., Li, H., and Zhou, X. (2006).
Nuclei segmentation using marker-controlled watershed, tracking using mean-shift, and Kalman filter in time-lapse microscopy. IEEE Transactions on Circuits and Systems I: Regular Papers: 53, 2405-2414. Filipczuk, P., Kowal, M., and Obuchowicz, A. (2011) Automatic breast cancer diagnosis based on k-means clustering and adaptive thresholding hybrid segmentation. Image processing and communications challenges 3 (Springer), pp. 295-302. Graham, S., Vu, Q. D., Raza, S. E. A., Azam, A., Tsang, Y. W., Kwak, J. T., and Rajpoot, N. (2019). Hover-net: Simultaneous segmentation and classification of nuclei in multi- tissue histology images. Medical Image Analysis: 58, 101563. Schmidt, U., Weigert, M., Broaddus, C., and Myers, G. (2018). Cell detection with star- convex polygons. In International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer: 265-273. Chen, S., Ding, C., Liu, M., and Tao, D. (2021). CPP-net: Context-aware polygon proposal network for nucleus segmentation. arXiv preprint arXiv:2102.06867. Ronneberger, O., Fischer, P., and Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, Springer: 234-241. de Haan, K., Zhang, Y., Liu, T., Sisk, A. E., Diaz, M. F., Zuckerman, J. E., Rivenson, Y., Wallace, W. D., and Ozcan, A. (2020). Deep learning-based transformation of the H&E stain improves kidney disease diagnosis. arXiv preprint arXiv:2008.08871. into special stains Liu, T., De Haan, K., Rivenson, Y., Wei, Z., Zeng, X., Zhang, Y., and Ozcan, A. (2019). Deep learning-based super-resolution in coherent imaging systems. Scientific reports: 9, 1- 13. 21. Mukherjee, L., Keikhosravi, A., Bui, D., and Eliceiri, K. W. (2018). Convolutional neural networks for whole slide image superresolution. Biomedical optics express: 9, 5368-5386. 22. Rivenson, Y., Göröcs, Z., Günaydin, H., Zhang, Y., Wang, H., and Ozcan, A. (2017). Deep learning microscopy. Optica: 4, 1437-1443. 98 23. Wang, H., Rivenson, Y., Jin, Y., Wei, Z., Gao, R., Günaydın, H., Bentolila, L. A., Kural, C., and Ozcan, A. (2019). Deep learning enables cross-modality super-resolution in fluorescence microscopy. Nature methods: 16, 103-110. 24. 25. 26. 27. 28. Zhang, H., Fang, C., Xie, X., Yang, Y., Mei, W., Jin, D., and Fei, P. (2019). High- throughput, high-resolution deep learning microscopy based on registration-free generative adversarial network. Biomedical optics express: 10, 1044-1063. Zheng, T., Oda, H., Moriya, T., Sugino, T., Nakamura, S., Oda, M., Mori, M., Takabatake, H., Natori, H., and Mori, K. (2020). Multi-modality super-resolution loss for GAN-based super-resolution of clinical CT images using micro CT image database. In Medical Imaging 2020: Image Processing, International Society for Optics and Photonics: 1131305. Ledig, C., Theis, L., Huszár, F., Caballero, J., Cunningham, A., Acosta, A., Aitken, A., Tejani, A., Totz, J., and Wang, Z. (2017). Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, 4681-4690. He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770-778. Iandola, F., Moskewicz, M., Karayev, S., Girshick, R., Darrell, T., and Keutzer, K. (2014). Densenet: Implementing efficient convnet descriptor pyramids. 
arXiv preprint arXiv:1404.1869. 29. Wang, X., Yu, K., Wu, S., Gu, J., Liu, Y., Dong, C., Qiao, Y., and Change Loy, C. (2018). Esrgan: Enhanced super-resolution generative adversarial networks. In Proceedings of the European conference on computer vision (ECCV) workshops, 0-0. 30. 31. 32. 33. 34. Bianco, S., Cadene, R., Celona, L., and Napoletano, P. (2018). Benchmark analysis of representative deep neural network architectures. IEEE access: 6, 64270-64277. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems: 30. Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. (2017). Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 1492-1500. Delibasoglu, I. and Cetin, M. (2020). Improved U-Nets with inception blocks for building detection. Journal of Applied Remote Sensing: 14, 044512. Hou, L., Gupta, R., Van Arnam, J. S., Zhang, Y., Sivalenka, K., Samaras, D., Kurc, T. M., and Saltz, J. H. (2020). Dataset of segmented nuclei in hematoxylin and eosin stained histopathology images of ten cancer types. Scientific data: 7, 1-12. 99 35. 36. 37. 38. 39. Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Shi, W., Caballero, J., Huszár, F., Totz, J., Aitken, A. P., Bishop, R., Rueckert, D., and Wang, Z. (2016). Real-time single image and video super-resolution using an efficient sub- pixel convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, 1874-1883. Radford, A., Metz, L., and Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. Stergiou, N., Gaidzik, N., Heimes, A.-S., Dietzen, S., Besenius, P., Jäkel, J., Brenner, W., Schmidt, M., Kunz, H., and Schmitt, E. (2019). Reduced breast tumor growth after immunization with a tumor-restricted MUC1 glycopeptide conjugated to tetanus toxoid. Cancer Immunology Research: 7, 113-122. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, 1-9. 40. Wang, Z., Bovik, A. C., Sheikh, H. R., and Simoncelli, E. P. (2004). Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing: 13, 600-612. 41. 42. Chang, H.-H., Zhuang, A. H., Valentino, D. J., and Chu, W.-C. (2009). Performance measure characterization segmentation algorithms. for evaluating neuroimage Neuroimage: 47, 122-135. Juhong, A., Li, B., Liu, Y., Yao, C. Y., Yang, C. W., Agnew, D. W., Lei, Y. L., Luker, G. D., Bumpers, H., and Huang, X. (2023). Recurrent and convolutional neural networks for sequential multispectral optoacoustic imaging. Journal of Biophotonics: 16, e202300142. tomography (MSOT) 43. Ntziachristos, V. and Razansky, D. (2010). Molecular imaging by means of multispectral optoacoustic tomography (MSOT). Chemical reviews: 110, 2783-2794. 44. Wang, L. V. and Hu, S. (2012). Photoacoustic tomography: in vivo imaging from organelles to organs. science: 335, 1458-1462. 45. 46. Buehler, A., Kacprowicz, M., Taruttis, A., and Ntziachristos, V. (2013). Real-time handheld multispectral optoacoustic imaging. 
Optics letters: 38, 1404-1406. Dima, A. and Ntziachristos, V. (2016). In-vivo handheld optoacoustic tomography of the human thyroid. Photoacoustics: 4, 65-69. 100 47. 48. 49. 50. 51. 52. 53. 54. 55. 56. 57. Tam, A. C. (1986). Applications of photoacoustic sensing techniques. Reviews of Modern Physics: 58, 381. Razansky, D., Distel, M., Vinegoni, C., Ma, R., Perrimon, N., Köster, R. W., and Ntziachristos, V. (2009). Multispectral opto-acoustic tomography of deep-seated fluorescent proteins in vivo. Nature photonics: 3, 412-417. Tzoumas, S., Deliolanis, N. C., Morscher, S., and Ntziachristos, V. (2013). Unmixing molecular agents from absorbing tissue in multispectral optoacoustic tomography. IEEE transactions on medical imaging: 33, 48-60. Diot, G., Metz, S., Noske, A., Liapis, E., Schroeder, B., Ovsepian, S. V., Meier, R., Rummeny, E., and Ntziachristos, V. (2017). Multispectral optoacoustic tomography (MSOT) of human breast cancer. Clinical Cancer Research: 23, 6912-6922. Quiros-Gonzalez, I., Tomaszewski, M. R., Aitken, S. J., Ansel-Bollepalli, L., McDuffus, L.-A., Gill, M., Hacker, L., Brunker, J., and Bohndiek, S. E. (2018). Optoacoustics delineates murine breast cancer models displaying angiogenesis and vascular mimicry. British journal of cancer: 118, 1098-1106. Ron, A., Deán-Ben, X. L., Gottschalk, S., and Razansky, D. (2019). Volumetric optoacoustic imaging unveils high-resolution patterns of acute and cyclic hypoxia in a murine model of breast cancer. Cancer research: 79, 4767-4775. Taruttis, A., van Dam, G. M., and Ntziachristos, V. (2015). Mesoscopic and macroscopic optoacoustic imaging of cancer. Cancer research: 75, 1548-1559. Tomaszewski, M. R., Gehrung, M., Joseph, J., Quiros-Gonzalez, I., Disselhorst, J. A., and Bohndiek, S. E. (2018). Oxygen-enhanced and dynamic contrast-enhanced optoacoustic tomography provide surrogate biomarkers of tumor vascular function, hypoxia, and necrosis. Cancer research: 78, 5980-5991. Regensburger, A. P., Fonteyne, L. M., Jüngert, J., Wagner, A. L., Gerhalter, T., Nagel, A. M., Heiss, R., Flenkenthaler, F., Qurashi, M., and Neurath, M. F. (2019). Detection of collagens by multispectral optoacoustic tomography as an imaging biomarker for Duchenne muscular dystrophy. Nature medicine: 25, 1905-1915. Song, W., Tang, Z., Zhang, D., Burton, N., Driessen, W., and Chen, X. (2015). Comprehensive studies of pharmacokinetics and biodistribution of indocyanine green and liposomal indocyanine green by multispectral optoacoustic tomography. RSC advances: 5, 3807-3813. Anani, T., Brannen, A., Panizzi, P., Duin, E. C., and David, A. E. (2020). Quantitative, real-time in vivo tracking of magnetic nanoparticles using multispectral optoacoustic tomography (MSOT) imaging. Journal of pharmaceutical and biomedical analysis: 178, 112951. 101 58. 59. Gurka, M. K., Pender, D., Chuong, P., Fouts, B. L., Sobelov, A., McNally, M. W., Mezera, M., Woo, S. Y., and McNally, L. R. (2016). Identification of pancreatic tumors in vivo with ligand-targeted, pH responsive mesoporous silica nanoparticles by multispectral optoacoustic tomography. Journal of controlled release: 231, 60-67. Li, D., Zhang, G., Xu, W., Wang, J., Wang, Y., Qiu, L., Ding, J., and Yang, X. (2017). Investigating the effect of chemical structure of semiconducting polymer nanoparticle on photothermal therapy and photoacoustic imaging. Theranostics: 7, 4029. 60. Wang, S., Zhang, L., Zhao, J., He, M., Huang, Y., and Zhao, S. (2021). 
A tumor for microenvironment–induced simultaneously activated photoacoustic imaging and photothermal therapy. Science Advances: 7, eabe3588. nanoparticle red-shifted absorption polymer 61. 62. 63. 64. 65. 66. 67. 68. Gröhl, J., Schellenberg, M., Dreher, K., Holzwarth, N., Tizabi, M. D., Seitel, A., and Maier- Hein, L. (2021). Semantic segmentation of multispectral photoacoustic images using deep learning. arXiv preprint arXiv:2105.09624. Yuan, A. Y., Gao, Y., Peng, L., Zhou, L., Liu, J., Zhu, S., and Song, W. (2020). Hybrid deep learning network for vascular segmentation in photoacoustic imaging. Biomedical Optics Express: 11, 6445-6457. Luke, G. P., Hoffer-Hawlik, K., Van Namen, A. C., and Shang, R. (2019). O-Net: a convolutional neural network for quantitative photoacoustic image segmentation and oximetry. arXiv preprint arXiv:1911.01935. Lan, H., Jiang, D., Yang, C., and Gao, F. (2019). Y-Net: a hybrid deep learning reconstruction in vivo. arXiv preprint arXiv:1908.00975. for photoacoustic framework imaging Zhang, J., Chen, B., Zhou, M., Lan, H., and Gao, F. (2018). Photoacoustic image classification and segmentation of breast cancer: a feasibility study. IEEE Access: 7, 5457- 5466. Chen, T., Lu, T., Song, S., Miao, S., Gao, F., and Li, J. (2020). A deep learning method based on U-Net for quantitative photoacoustic imaging. In Photons Plus Ultrasound: Imaging and Sensing 2020, International Society for Optics and Photonics: 112403V. Bench, C., Hauptmann, A., and Cox, B. T. (2020). Toward accurate quantitative photoacoustic imaging: learning vascular blood oxygen saturation in three dimensions. Journal of Biomedical Optics: 25, 085003. Yang, C., Lan, H., Zhong, H., and Gao, F. (2019). Quantitative photoacoustic blood oxygenation imaging using deep residual and recurrent neural network. In 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), IEEE: 741-744. 102 69. 70. 71. 72. 73. 74. 75. 76. 77. 78. 79. 80. Gröhl, J., Kirchner, T., Adler, T., and Maier-Hein, L. (2019). Estimation of blood oxygenation with learned spectral decoloring for quantitative photoacoustic imaging (LSD-qPAI). arXiv preprint arXiv:1902.05839. Cai, C., Deng, K., Ma, C., and Luo, J. (2018). End-to-end deep neural network for optical inversion in quantitative photoacoustic imaging. Optics letters: 43, 2752-2755. Allman, D., Reiter, A., and Bell, M. A. L. (2018). Photoacoustic source detection and reflection artifact removal enabled by deep learning. IEEE transactions on medical imaging: 37, 1464-1477. Davoudi, N., Deán-Ben, X. L., and Razansky, D. (2019). Deep learning optoacoustic tomography with sparse data. Nature Machine Intelligence: 1, 453-460. Hariri, A., Alipour, K., Mantri, Y., Schulze, J. P., and Jokerst, J. V. (2020). Deep learning improves contrast in low-fluence photoacoustic imaging. Biomedical optics express: 11, 3360-3373. Lu, T., Chen, T., Gao, F., Sun, B., Ntziachristos, V., and Li, J. (2021). LV‐GAN: A deep learning approach for limited‐view optoacoustic imaging based on hybrid datasets. Journal of biophotonics: 14, e202000325. Sivasubramanian, K. and Xing, L. (2020). Deep learning for image processing and reconstruction to enhance led-based photoacoustic imaging. LED-Based Photoacoustic Imaging: From Bench to Bedside, 203-241. Lafci, B., Merčep, E., Morscher, S., Deán-Ben, X. L., and Razansky, D. (2020). Deep learning for automatic segmentation of hybrid optoacoustic ultrasound (OPUS) images. IEEE transactions on ultrasonics, ferroelectrics, and frequency control: 68, 688-696. 
Aydın, M., Kiraz, B., Eren, F., Uysallı, Y., Morova, B., Ozcan, S. C., Acilan, C., and Kiraz, A. (2022). A Deep Learning Model for Automated Segmentation of Fluorescence Cell images. In Journal of Physics: Conference Series, IOP Publishing: 012003. de Haan, K., Ceylan Koydemir, H., Rivenson, Y., Tseng, D., Van Dyne, E., Bakic, L., Karinca, D., Liang, K., Ilango, M., and Gumustekin, E. (2020). Automated screening of sickle cells using a smartphone-based microscope and deep learning. NPJ digital medicine: 3, 76. Ibtehaz, N. and Rahman, M. S. (2020). MultiResUNet: Rethinking the U-Net architecture for multimodal biomedical image segmentation. Neural networks: 121, 74-87. Punn, N. S. and Agarwal, S. (2022). Modality specific U-Net variants for biomedical image segmentation: a survey. Artificial Intelligence Review: 55, 5845-5889. 103 81. 82. 83. 84. 85. 86. 87. 88. 89. 90. 91. 92. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2818-2826. Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural computation: 9, 1735-1780. Xu, M. and Wang, L. V. (2005). Universal back-projection algorithm for photoacoustic computed tomography. Physical Review E: 71, 016706. Stergiou, N., Gaidzik, N., Heimes, A.-S., Dietzen, S., Besenius, P., Jäkel, J., Brenner, W., Schmidt, M., Kunz, H., and Schmitt, E. (2019). Reduced Breast Tumor Growth after Immunization with a Tumor-Restricted MUC1 Glycopeptide Conjugated to Tetanus ToxoidImmunization against Tumor-Restricted MUC1 in Breast Cancer. Cancer Immunology Research: 7, 113-122. Yang, C.-W., Liu, K., Yao, C.-Y., Li, B., Juhong, A., Qiu, Z., and Huang, X. (2022). Indocyanine Green-Conjugated Superparamagnetic for Multimodality Breast Cancer Imaging. ACS Applied Nano Materials: 5, 18912-18920. Iron Oxide Nanoworm Greish, K. (2010). Enhanced permeability and retention (EPR) effect for anticancer nanomedicine drug targeting. Cancer nanotechnology: Methods and protocols, 25-37. Keshava, N. and Mustard, J. F. (2002). Spectral unmixing. IEEE signal processing magazine: 19, 44-57. Xingjian, S., Chen, Z., Wang, H., Yeung, D.-Y., Wong, W.-K., and Woo, W.-c. (2015). Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in neural information processing systems, 802-810. Creswell, A., White, T., Dumoulin, V., Arulkumaran, K., Sengupta, B., and Bharath, A. A. (2018). Generative adversarial networks: An overview. IEEE signal processing magazine: 35, 53-65. Juhong, A., Li, B., Liu, Y., Yang, C. W., Yao, C. Y., Agnew, D. W., Lei, Y. L., Luker, G. D., Bumpers, H., and Huang, X. (2024). Multihead Attention U‐Net for Magnetic Particle Imaging–Computed Tomography Image Segmentation. Advanced Intelligent Systems: 6, 2400007. Bulte, J. W. (2019). Superparamagnetic iron oxides as MPI tracers: A primer and review of early applications. Advanced drug delivery reviews: 138, 293-301. Gleich, B. and Weizenecker, J. (2005). Tomographic imaging using the nonlinear response of magnetic particles. Nature: 435, 1214-1217. 104 93. 94. 95. 96. 97. 98. 99. Scarfe, L., Brillant, N., Kumar, J. D., Ali, N., Alrumayh, A., Amali, M., Barbellion, S., Jones, V., Niemeijer, M., and Potdevin, S. (2017). Preclinical imaging methods for assessing the safety and efficacy of regenerative medicine therapies. NPJ Regenerative medicine: 2, 28. Zheng, B., Vazin, T., Goodwill, P. 
W., Conway, A., Verma, A., Ulku Saritas, E., Schaffer, D., and Conolly, S. M. (2015). Magnetic particle imaging tracks the long-term fate of in vivo neural cell implants with high image contrast. Scientific reports: 5, 14055. Rahmer, J., Gleich, B., Weizenecker, J., and Borgert, J. (2010). 3D real-time magnetic particle imaging of cerebral blood flow in living mice. In Proceedings of the International Society for Magnetic Resonance in Medicine, 714. Ludewig, P., Gdaniec, N., Sedlacik, J., Forkert, N. D., Szwargulski, P., Graeser, M., Adam, G., Kaul, M. G., Krishnan, K. M., and Ferguson, R. M. (2017). Magnetic particle imaging for real-time perfusion imaging in acute stroke. ACS nano: 11, 10480-10488. Orendorff, R., Keselman, K., and Conolly, S. (2018). Quantitative cerebral blood flow and volume measurements by magnetic particle imaging. In 13th European Molecular Imaging Meeting, 20-23. Fu, A., Wilson, R. J., Smith, B. R., Mullenix, J., Earhart, C., Akin, D., Guccione, S., Wang, S. X., and Gambhir, S. S. (2012). Fluorescent magnetic nanoparticles for magnetically enhanced cancer imaging and targeting in living subjects. ACS nano: 6, 6862-6869. Tomitaka, A., Arami, H., Gandhi, S., and Krishnan, K. M. (2015). Lactoferrin conjugated iron oxide nanoparticles for targeting brain glioma cells in magnetic particle imaging. Nanoscale: 7, 16890-16898. 100. Finas, D., Baumann, K., Sydow, L., Heinrich, K., Gräfe, K., Rody, A., Lüdtke-Buzug, K., and Buzug, T. (2013). Lymphatic tissue and superparamagnetic nanoparticles-magnetic particle imaging for detection and distribution in a breast cancer model. Biomedical Engineering/Biomedizinische Technik: 58, 000010151520134262. 101. Song, G., Chen, M., Zhang, Y., Cui, L., Qu, H., Zheng, X., Wintermark, M., Liu, Z., and Rao, J. (2018). Janus iron oxides@ semiconducting polymer nanoparticle tracer for cell tracking by magnetic particle imaging. Nano letters: 18, 182-189. 102. Zheng, B., von See, M. P., Yu, E., Gunel, B., Lu, K., Vazin, T., Schaffer, D. V., Goodwill, P. W., and Conolly, S. M. (2016). Quantitative magnetic particle imaging monitors the transplantation, biodistribution, and clearance of stem cells in vivo. Theranostics: 6, 291. 103. Wu, L. C., Zhang, Y., Steinberg, G., Qu, H., Huang, S., Cheng, M., Bliss, T., Du, F., Rao, J., and Song, G. (2019). A review of magnetic particle imaging and perspectives on neuroimaging. American Journal of Neuroradiology: 40, 206-212. 105 104. Herz, S., Vogel, P., Dietrich, P., Kampf, T., Rückert, M. A., Kickuth, R., Behr, V. C., and Bley, T. A. (2018). Magnetic particle imaging guided real-time percutaneous transluminal angioplasty in a phantom model. Cardiovascular and interventional radiology: 41, 1100- 1105. 105. Hossaini Nasr, S., Tonson, A., El-Dakdouki, M. H., Zhu, D. C., Agnew, D., Wiseman, R., Qian, C., and Huang, X. (2018). Effects of nanoprobe morphology on cellular binding and inflammatory responses: hyaluronan-conjugated magnetic nanoworms for magnetic resonance imaging of atherosclerotic plaques. ACS applied materials & interfaces: 10, 11495-11507. 106. Park, J. H., von Maltzahn, G., Zhang, L., Schwartz, M. P., Ruoslahti, E., Bhatia, S. N., and Sailor, M. J. (2008). Magnetic iron oxide nanoworms for tumor targeting and imaging. Advanced materials: 20, 1630-1635. 107. Iyer, A. K., Khaled, G., Fang, J., and Maeda, H. (2006). Exploiting the enhanced permeability and retention effect for tumor targeting. Drug discovery today: 11, 812-818. 108. Kobayashi, H., Watanabe, R., and Choyke, P. L. (2014). 
109. Li, Q., Cai, W., Wang, X., Zhou, Y., Feng, D. D., and Chen, M. (2014). Medical image classification with convolutional neural network. In 2014 13th international conference on control automation robotics & vision (ICARCV), IEEE: 844-848.
110. Liu, Q., Yu, L., Luo, L., Dou, Q., and Heng, P. A. (2020). Semi-supervised medical image classification with relation-driven self-ensembling model. IEEE transactions on medical imaging: 39, 3429-3440.
111. Deepa, S. and Devi, B. A. (2011). A survey on artificial intelligence approaches for medical image classification. Indian Journal of Science and Technology: 4, 1583-1595.
112. Goldstein, B. A., Navar, A. M., and Carter, R. E. (2017). Moving beyond regression techniques in cardiovascular risk prediction: applying machine learning to address analytic challenges. European heart journal: 38, 1805-1814.
113. Maulud, D. and Abdulazeez, A. M. (2020). A review on linear regression comprehensive in machine learning. Journal of Applied Science and Technology Trends: 1, 140-147.
114. Christodoulou, E., Ma, J., Collins, G. S., Steyerberg, E. W., Verbakel, J. Y., and Van Calster, B. (2019). A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. Journal of clinical epidemiology: 110, 12-22.
115. Zhang, S., Liang, G., Pan, S., and Zheng, L. (2018). A fast medical image super resolution method based on deep learning network. IEEE Access: 7, 12319-12327.
116. Wang, G., Ye, J. C., Mueller, K., and Fessler, J. A. (2018). Image reconstruction is a new frontier of machine learning. IEEE transactions on medical imaging: 37, 1289-1296.
117. Lundervold, A. S. and Lundervold, A. (2019). An overview of deep learning in medical imaging focusing on MRI. Zeitschrift für Medizinische Physik: 29, 102-127.
118. Hesamian, M. H., Jia, W., He, X., and Kennedy, P. (2019). Deep learning techniques for medical image segmentation: achievements and challenges. Journal of digital imaging: 32, 582-596.
119. Tajbakhsh, N., Jeyaseelan, L., Li, Q., Chiang, J. N., Wu, Z., and Ding, X. (2020). Embracing imperfect datasets: A review of deep learning solutions for medical image segmentation. Medical Image Analysis: 63, 101693.
120. Maier, A., Syben, C., Lasser, T., and Riess, C. (2019). A gentle introduction to deep learning in medical image processing. Zeitschrift für Medizinische Physik: 29, 86-101.
121. Wang, R., Lei, T., Cui, R., Zhang, B., Meng, H., and Nandi, A. K. (2022). Medical image segmentation using deep learning: A survey. IET Image Processing: 16, 1243-1267.
122. Niu, Z., Zhong, G., and Yu, H. (2021). A review on the attention mechanism of deep learning. Neurocomputing: 452, 48-62.
123. Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli, E., Wang, Y., Lu, L., Yuille, A. L., and Zhou, Y. (2021). TransUNet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306.
124. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
125. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, 10012-10022.
126. Luong, M.-T., Pham, H., and Manning, C. D. (2015). Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
127. Chorowski, J. K., Bahdanau, D., Serdyuk, D., Cho, K., and Bengio, Y. (2015). Attention-based models for speech recognition. Advances in neural information processing systems: 28.
128. Britz, D., Goldie, A., Luong, M.-T., and Le, Q. (2017). Massive exploration of neural machine translation architectures. arXiv preprint arXiv:1703.03906.
129. Ronneberger, O., Fischer, P., and Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, Springer: 234-241.
130. Schlemper, J., Oktay, O., Schaap, M., Heinrich, M., Kainz, B., Glocker, B., and Rueckert, D. (2019). Attention gated networks: Learning to leverage salient regions in medical images. Medical image analysis: 53, 197-207.
131. Oktay, O., Schlemper, J., Folgoc, L. L., Lee, M., Heinrich, M., Misawa, K., Mori, K., McDonagh, S., Hammerla, N. Y., and Kainz, B. (2018). Attention U-Net: Learning where to look for the pancreas. arXiv preprint arXiv:1804.03999.
132. Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
133. Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. (2017). Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, 618-626.
134. Juhong, A., Li, B., Liu, Y., Yao, C.-Y., Yang, C.-W., Atique Ullah, A., Liu, K., Lewandowski, R. P., Harkema, J. R., and Agnew, D. W. (2025). Monocular depth estimation based on deep learning for intraoperative guidance using surface-enhanced Raman scattering imaging. Photonics Research: 13, 550-560.
135. Lukianova-Hleb, E. Y., Kim, Y.-S., Belatsarkouski, I., Gillenwater, A. M., O'Neill, B. E., and Lapotko, D. O. (2016). Intraoperative diagnostics and elimination of residual microtumours with plasmonic nanobubbles. Nature Nanotechnology: 11, 525-532.
136. Wang, T., Wang, D., Yu, H., Feng, B., Zhou, F., Zhang, H., Zhou, L., Jiao, S., and Li, Y. (2018). A cancer vaccine-mediated postoperative immunotherapy for recurrent and metastatic tumors. Nature communications: 9, 1532.
137. Anup, N., Gadeval, A., and Tekade, R. K. (2023). A 3D-printed graphene BioFuse implant for postsurgical adjuvant therapy of cancer: proof of concept in 2D- and 3D-spheroid tumor models. ACS Applied Bio Materials: 6, 1195-1212.
138. Aydın, H., Sillenberg, I., and von Lieven, H. (2001). Patterns of failure following CT-based 3-D irradiation for malignant glioma. Strahlentherapie und Onkologie: 177, 424-431.
139. Gao, R. W., Teraphongphom, N. T., van den Berg, N. S., Martin, B. A., Oberhelman, N. J., Divi, V., Kaplan, M. J., Hong, S. S., Lu, G., and Ertsey, R. (2018). Determination of tumor margins with surgical specimen mapping using near-infrared fluorescence. Cancer research: 78, 5144-5154.
140. Gao, X., Yue, Q., Liu, Z., Ke, M., Zhou, X., Li, S., Zhang, J., Zhang, R., Chen, L., and Mao, Y. (2017). Guiding brain-tumor surgery via blood–brain-barrier-permeable gold nanoprobes with acid-triggered MRI/SERRS signals. Advanced Materials: 29, 1603917.
141. Kunjachan, S., Ehling, J., Storm, G., Kiessling, F., and Lammers, T. (2015). Noninvasive imaging of nanomedicines and nanotheranostics: principles, progress, and prospects. Chemical reviews: 115, 10907-10937.
142. Kircher, M. F., Mahmood, U., King, R. S., Weissleder, R., and Josephson, L. (2003). A multimodal nanoparticle for preoperative magnetic resonance imaging and intraoperative optical brain tumor delineation. Cancer research: 63, 8122-8125.
143. Pal, S., Ray, A., Andreou, C., Zhou, Y., Rakshit, T., Wlodarczyk, M., Maeda, M., Toledo-Crow, R., Berisha, N., and Yang, J. (2019). DNA-enabled rational design of fluorescence-Raman bimodal nanoprobes for cancer imaging and therapy. Nature communications: 10, 1926.
144. Qi, J., Li, J., Liu, R., Li, Q., Zhang, H., Lam, J. W., Kwok, R. T., Liu, D., Ding, D., and Tang, B. Z. (2019). Boosting fluorescence-photoacoustic-Raman properties in one fluorophore for precise cancer surgery. Chem: 5, 2657-2677.
145. Zysk, A. M., Chen, K., Gabrielson, E., Tafra, L., May Gonzalez, E. A., Canner, J. K., Schneider, E. B., Cittadine, A. J., Scott Carney, P., and Boppart, S. A. (2015). Intraoperative assessment of final margins with a handheld optical imaging probe during breast-conserving surgery may reduce the reoperation rate: results of a multicenter study. Annals of surgical oncology: 22, 3356-3362.
146. Laing, S., Jamieson, L. E., Faulds, K., and Graham, D. (2017). Surface-enhanced Raman spectroscopy for in vivo biosensing. Nature Reviews Chemistry: 1, 0060.
147. Langer, J., Jimenez de Aberasturi, D., Aizpurua, J., Alvarez-Puebla, R. A., Auguié, B., Baumberg, J. J., Bazan, G. C., Bell, S. E., Boisen, A., and Brolo, A. G. (2019). Present and future of surface-enhanced Raman scattering. ACS nano: 14, 28-117.
148. Li, M., Cushing, S. K., and Wu, N. (2015). Plasmon-enhanced optical sensors: a review. Analyst: 140, 386-406.
149. Li, D., Hui, H., Zhang, Y., Tong, W., Tian, F., Yang, X., Liu, J., Chen, Y., and Tian, J. (2020). Deep learning for virtual histological staining of bright-field microscopic images of unlabeled carotid artery tissue. Molecular imaging and biology: 22, 1301-1309.
150. Pan, X., Li, L., Lin, H., Tan, J., Wang, H., Liao, M., Chen, C., Shan, B., Chen, Y., and Li, M. (2019). A graphene oxide-gold nanostar hybrid based-paper biosensor for label-free SERS detection of serum bilirubin for diagnosis of jaundice. Biosensors and Bioelectronics: 145, 111713.
151. Shan, B., Pu, Y., Chen, Y., Liao, M., and Li, M. (2018). Novel SERS labels: Rational design, functional integration and biomedical applications. Coordination Chemistry Reviews: 371, 11-37.
152. Wang, Y., Kang, S., Khan, A., Ruttner, G., Leigh, S. Y., Murray, M., Abeytunge, S., Peterson, G., Rajadhyaksha, M., and Dintzis, S. (2016). Quantitative molecular phenotyping with topically applied SERS nanoparticles for intraoperative guidance of breast cancer lumpectomy. Scientific reports: 6, 21242.
153. Liang, A., Liu, Q., Wen, G., and Jiang, Z. (2012). The surface-plasmon-resonance effect of nanogold/silver and its analytical applications. TrAC Trends in Analytical Chemistry: 37, 32-47.
154. Davis, R. M., Campbell, J. L., Burkitt, S., Qiu, Z., Kang, S., Mehraein, M., Miyasato, D., Salinas, H., Liu, J. T., and Zavaleta, C. (2018). A Raman imaging approach using CD47 antibody-labeled SERS nanoparticles for identifying breast cancer and its potential to guide surgical resection. Nanomaterials: 8, 953.
155. Gao, H. (2016). Progress and perspectives on targeting nanoparticles for brain drug delivery. Acta Pharmaceutica Sinica B: 6, 268-286.
156. Huang, R., Harmsen, S., Samii, J. M., Karabeber, H., Pitter, K. L., Holland, E. C., and Kircher, M. F. (2016). High precision imaging of microscopic spread of glioblastoma with a targeted ultrasensitive SERRS molecular imaging probe. Theranostics: 6, 1075.
157. Liu, K., Ullah, A. A., Juhong, A., Yang, C. W., Yao, C. Y., Li, X., Bumpers, H. L., Qiu, Z., and Huang, X. (2024). Robust synthesis of targeting glyco-nanoparticles for surface enhanced resonance Raman based image-guided tumor surgery. Small Science, 2300154.
158. Zavaleta, C. L., Smith, B. R., Walton, I., Doering, W., Davis, G., Shojaei, B., Natan, M. J., and Gambhir, S. S. (2009). Multiplexed imaging of surface enhanced Raman scattering nanotags in living mice using noninvasive Raman spectroscopy. Proceedings of the National Academy of Sciences: 106, 13511-13516.
159. Brunelli, R. (2009). Template matching techniques in computer vision: theory and practice. (John Wiley & Sons).
160. Mikolajczyk, K. and Schmid, C. (2004). Scale & affine invariant interest point detectors. International journal of computer vision: 60, 63-86.
161. Garai, E., Sensarn, S., Zavaleta, C. L., Van de Sompel, D., Loewke, N. O., Mandella, M. J., Gambhir, S. S., and Contag, C. H. (2013). High-sensitivity, real-time, ratiometric imaging of surface-enhanced Raman scattering nanoparticles with a clinically translatable Raman endoscope device. Journal of biomedical optics: 18, 096008.
162. Zavaleta, C. L., Garai, E., Liu, J. T., Sensarn, S., Mandella, M. J., Van de Sompel, D., Friedland, S., Van Dam, J., Contag, C. H., and Gambhir, S. S. (2013). A Raman-based endoscopic strategy for multiplexed molecular imaging. Proceedings of the National Academy of Sciences: 110, E2288-E2297.
163. Ranftl, R., Lasinger, K., Hafner, D., Schindler, K., and Koltun, V. (2020). Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE transactions on pattern analysis and machine intelligence: 44, 1623-1637.
164. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
165. Ranftl, R., Bochkovskiy, A., and Koltun, V. (2021). Vision transformers for dense prediction. In Proceedings of the IEEE/CVF international conference on computer vision, 12179-12188.
166. Birkl, R., Wofk, D., and Müller, M. (2023). MiDaS v3.1 -- a model zoo for robust monocular relative depth estimation. arXiv preprint arXiv:2307.14460.
167. Gotov, O., Battogtokh, G., Shin, D., and Ko, Y. T. (2018). Hyaluronic acid-coated cisplatin conjugated gold nanoparticles for combined cancer treatment. Journal of industrial and engineering chemistry: 65, 236-243.
168. Lee, H., Lee, K., Kim, I. K., and Park, T. G. (2008). Synthesis, characterization, and in vivo diagnostic applications of hyaluronic acid immobilized gold nanoprobes. Biomaterials: 29, 4709-4718.
169. Lee, M.-Y., Yang, J.-A., Jung, H. S., Beack, S., Choi, J. E., Hur, W., Koo, H., Kim, K., Yoon, S. K., and Hahn, S. K. (2012). Hyaluronic acid–gold nanoparticle/interferon α complex for targeted treatment of hepatitis C virus infection. ACS nano: 6, 9522-9531.
170. Li, X., Zhou, H., Yang, L., Du, G., Pai-Panandiker, A. S., Huang, X., and Yan, B. (2011). Enhancement of cell recognition in vitro by dual-ligand cancer targeting gold nanoparticles. Biomaterials: 32, 2540-2545.
171. Xue, Y., Li, X., Li, H., and Zhang, W. (2014). Quantifying thiol–gold interactions towards the efficient strength control. Nature communications: 5, 1-9.
172. Juhong, A., Li, B., Yao, C.-Y., Yang, C.-W., Liu, K., Agnew, D. W., Lei, Y. L., Luker, G. D., Bumpers, H., and Huang, X. (2023). Cost-effective near infrared fluorescence wide-field camera for breast tumor imaging. IEEE Photonics Technology Letters: 35, 813-816.
173. Zhang, Z. (2000). A flexible new technique for camera calibration. IEEE Transactions on pattern analysis and machine intelligence: 22, 1330-1334.
174. Jurie, F. and Dhome, M. (2001). A simple and efficient template matching algorithm. In Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, IEEE: 544-549.
175. Bradski, G. and Kaehler, A. (2000). OpenCV. Dr. Dobb's journal of software tools: 3.
176. Sener, O. and Koltun, V. (2018). Multi-task learning as multi-objective optimization. Advances in neural information processing systems: 31.
177. Kingma, D. P. (2014). Adam: A method for stochastic optimization.
178. Li, Z. and Snavely, N. (2018). MegaDepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2041-2050.
179. Bao, H., Dong, L., Piao, S., and Wei, F. (2021). BEiT: BERT pre-training of image transformers. arXiv preprint arXiv:2106.08254.
180. Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., and Sutskever, I. (2021). Zero-shot text-to-image generation. In International conference on machine learning, PMLR: 8821-8831.
181. Rolfe, J. T. (2016). Discrete variational autoencoders. arXiv preprint arXiv:1609.02200.