HIGH RECALL DEEP LEARNING METHODS FOR PEDIATRIC RIB FRACTURE
DETECTION

By

Jonathan Burkow

A DISSERTATION

Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of

Computational Mathematics, Science, and Engineering—Doctor of Philosophy

2024

ABSTRACT

Numerous data-driven, machine-learned models have been developed to localize pathologies in

medical images. The work in this dissertation focuses on model development for the application

of rib fracture detection in pediatric radiographs. Child abuse leading to physical harm is a severe

problem, with over 100,000 confirmed instances in the United States during 2020, and children

under 3 having higher susceptibility for fatal outcomes. Rib fractures are the most prevalent fracture

site observed in young children with a near 100% positive predictive value for abuse if discovered.

However, the detection of rib fractures in pediatric radiographic scans proves an arduous challenge,

with even specialized radiologists failing to identify up to two-thirds of present fractures upon

initial reviews. There are various reasons for this difficulty: fractures can be obliquely oriented to

the imaging detector, obfuscated by other structures, incomplete, subtle, and/or non-displaced.

In this dissertation, we curate a custom dataset of 1,109 pediatric chest radiographs that were

labeled by seven board-certified pediatric radiologists, yielding 624 fracture-present images. We

analyze fracture prevalence patterns within the dataset, stratifying by age groups and examining

factors such as fracture sidedness and rib locations. We investigate several methods for improving

the sensitivity performance of two one-stage deep convolutional neural network (CNN) object-

detection architectures, RetinaNet and YOLOv5. These include testing different input image

filtering techniques, model ensembling approaches that combine predictions from multiple models,

and a novel avalanche decision scheme we developed that dynamically adjusts the acceptance

threshold during inference based on the number of fractures already detected.

We examine the performance of these networks using several metrics, including F2 score which

summarizes precision and recall weighted toward high-sensitivity tasks. Fine-tuning of prior

state-of-the-art one-stage models, RetinaNet and YOLOv5, achieves F2 scores of 0.48 ± 0.01 and

0.48 ± 0.04 respectively. Our best performing model used three ensembled YOLOv5 models with

multiple image filters and an avalanche decision scheme, achieving an F2 score of 0.73 ± 0.01.

When we conducted an expert inter-reader performance evaluation on the same test set of images,

it resulted in an F2 score of 0.732. We then take inspiration from our fracture prevalence analysis

to create a domain-specific two-stage detection scheme, reliant on a prevalence-weighted region

proposal method. The results demonstrated throughout this dissertation assert that a combination

of sensitivity-driving methods yields object detector performance that approaches capabilities of

expert radiologists, suggesting these methods may provide a viable approach to help identify these

difficult-to-find rib fractures.

Copyright by
JONATHAN BURKOW
2024

This thesis is dedicated to everyone who has influenced my academic and personal growth. To my
parents, Richard and Linda Burkow, who have provided so much support and unwavering love. To
my sister-in-law, Amy Burkow, for consistently pushing me to be confident in myself and become
a better scholar. To my brother, Daniel Burkow, you have always been there to make the mistakes
first so I could learn how to be better. In all seriousness, you have always been a benchmark for
me to measure against, and your encouragement and guidance means more than you know. And
finally, to the love of my life, Tiffany Thornton. You believed in me through all the times I
struggled to and picked me up when I needed it. Your love, patience, and compassion are all why I
have been able to complete this degree.

v

ACKNOWLEDGEMENTS

Throughout most of my education from undergrad and below, I felt like I was just going through the

motions academically. That changed when I became a part of the Research in Industrial Projects for

Students (RIPS) summer research experience hosted at UCLA. The project I worked on, sponsored

by the Aerospace Corporation, in addition to the other projects, showed me the potential of applying

mathematics and computation in real-world settings. From then on, I knew I wanted to contribute

to innovative and exciting applications within industry. The Computational Mathematics, Science,

and Engineering department at Michigan State University felt like a perfect fit to advance my

academic career with this focus in mind; though, it was still incredibly difficult to consider leaving

my Business Analyst job at Discover for.

I remember going back and forth on the day of the

deadline to accept entry into the program, to the point I considered missing a Nightwish concert

I planned to attend that night because of the stress (I did end up going and having a great time).

Despite the process of completing this degree proving incredibly challenging, it has been one of

the most rewarding experiences and has ultimately changed my life for the better.

I want to thank my graduate student peers and faculty of the CMSE department I have met over

the years that I have had the pleasure to talk to and learn from. I particularly want to thank Dr.

Bob Termuhlen for being a great classmate and friend through all of our core classes and for giving

me countless rides home from campus, especially when we worked past when busses would stop

operating.

I also want to thank past and present students of the Medical Imaging and Data Integration

(MIDI) lab, for being wonderful resources for any questions I had and for providing suggestions to

my work that helped it to become what is shown throughout this dissertation. A massive shoutout

to Dr. Muneeza Azmat, who was always available to either vent to or get hype from both when I

was there in person and over the phone or Zoom. Though, to be fair, when we were in the office

together we probably spent far too much time gossiping or talking about K-Pop that we should have

spent working.

I want to thank my committee members, Dr. Adam Alessio, Dr. Vishnu Boddeti, Dr. Arun

vi

Ross, and Dr. Yuying Xie, for all their guidance and helping me become the scholar I am today. I

cannot thank my advisor, Adam, enough for everything he has done for me throughout my doctorate

journey. Throughout the Covid-19 pandemic, he offered so much kindness and understanding, not

only to me but to all members of the lab, constantly checking in to make sure everyone was doing

alright. This was particularly impactful for me, as I had moved back home to Arizona and was

across the country from everyone. His signing off on me staying home allowed me to be present

in the last couple years of my cat Pierre’s life, which I am eternally grateful for. I wholeheartedly

believe that his mentorship is one of the biggest reasons I was able to complete this journey.

vii

TABLE OF CONTENTS

CHAPTER 1

INTRO TO COMPUTER VISION IN MACHINE LEARNING . . . . .

1

CHAPTER 2

DEEP LEARNING IN MEDICAL IMAGING . . . . . . . . . . . . .

. 23

CHAPTER 3

PEDIATRIC RIB FRACTURES - DATA AND PATTERNS . . . . . . . 35

CHAPTER 4

IMPROVING ONE-STAGE DETECTORS . . . . . . . . . . . . . . .

. 51

CHAPTER 5

DOMAIN-SPECIFIC TWO-STAGE DETECTOR . . . . . . . . . . . . 78

CHAPTER 6

CONCLUSION .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. 93

BIBLIOGRAPHY . .

. .

.

.

.

.

.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

APPENDIX

SUPPLEMENTAL TABLES . . . . . . . . . . . . . . . . . . . . . . . 109

viii

CHAPTER 1

INTRO TO COMPUTER VISION IN MACHINE LEARNING

1.1 Computer Vision

Computer vision is the use of computers to extract high-level information from visual me-

dia, such as digital images and video, to provide automated decisions and/or recommendations.

Traditional computer vision techniques include edge detection, noise reduction, and image en-

hancement. Recent advancements in machine learning (ML) from both algorithmic improvements

and computational accessibility have garnered massive interest in computer vision tasks. Image

recognition, a major sub-task of computer vision, has especially experienced immense growth from

the advancements. There are three prevailing tasks within image recognition: image classification,

object detection, and segmentation.

Classification attributes a label that describes the overall image or video. Figure 1.1(a) illustrates

this, classifying the image as a whole as “cat." For use cases requiring both classification and

localization of objects, whole-image classification proves insufficient. Object detection is a task

that ascribes a rectangular box around objects of interest and classifying each of them with an

associated label. An example is shown in Figure 1.1(b). Rather than the entire image labeled as

"cat," the cat object is roughly surrounded with a box and inherits the "cat" label; a spoon and

lamp are each also bounded by boxes with their respective labels. Object detection provides more

information about what is in the image due to the localization of the objects, thus is closer to what

humans do when viewing images. Segmentation outputs a pixel-level representation of where each

object is in the image. Figure 1.1(c) shows a segmentation output where the exact pixels of the cat

are highlighted, rather than just a box surrounding it. The two most common types of segmentation

are semantic and instance segmentation. Semantic segmentation splits the segmentations by the

class type; i.e., the output of multiple overlapping objects of the same class in close proximity will be

combined into one cluster of pixels. Instance segmentation further refines the output by separating

instances of objects of the same class; i.e., a group of people standing together in an image would

have boundaries separating each individual into different clusters of pixels with individual labels.

1

(a) classification

(b) detection

(c) segmentation

Figure 1.1 Examples of the three main image recognition tasks.

1.2 Artificial Neural Networks

The root for the idea of artificial neural networks was introduced in 1958 by Frank Rosen-

blatt [1]. Dubbed the perceptron, the idea was influenced by the way the brain sends and processes

information signals; information transmits from neurons to neuron via electrical impulses through

connections known as synapses. Translated directly, the machine neuron receives an input, pro-

cesses a calculation through an activation, and directs the result on through the next layer. These

bits of information connecting neurons are known as weights. The use of the word “network"

implies the existence of many interconnected neurons. Indeed, artificial neural networks can be

made up of anywhere from dozens to millions of neurons, with associated weights connecting each,

that encode the information learned from the model. The basic structure of an artificial neural

network starts with the input layer, then a various number of inner layers known as ‘hidden" layers,

all directing toward the final output layer. The size of these neural networks can grow in two ways:

increasing the number of hidden layers in the network (known as increasing depth), and increasing

the amount of neurons in each layer (increasing width). Generally, all of these layers are called fully

connected layers, meaning every neuron in each layer is connected to every neuron in the previous

and the following layers, with an associated weight for each connection. The value of each weight

represents the quantity of information sent between neurons, i.e., larger weight values represent

2

avenues of higher information passage through the neurons. Each layer of the network generally

also has an associated bias value, which acts as a small shift to the output of the activation function.

Over numerous repetitions through the training data, these weight values are updated so that the

overall output of the network improves for the given task. This process allows a network to learn

complex mappings between given input data and the output.

1.2.1 Activation Layers

As mentioned previously, the activation layer is the final computational step before sending

information to subsequent layers in the network. These are various non-linear transformations

that take in a weighted sum of the input values from the prior layer with the associated weights.

There are various functions that can be used as activation functions; a couple of the most common

ones are provided in Figure 1.2. Different functions offer different benefits. For instance, the

(a) sigmoid

(b) tanh

(c) ReLU

Figure 1.2 Visualizations of the sigmoid, tanh, and ReLU activation functions.

Rectified Linear Unit (ReLU) [2] function maps positive inputs to a linear function and negative

inputs to zero, reducing computational overhead during backpropagation (which will be explored

in the following section) since its derivative is either 0 or 1 (except at 𝑥 = 0). The sigmoid, or

more commonly the logistic function, is useful at the output layer of a binary classification network

since the output values lie strictly between (0, 1); we can differentiate class outputs by defining

𝑦 ≥ 0.5 → 𝑐𝑙𝑎𝑠𝑠1 and 𝑦 < 0.5 → 𝑐𝑙𝑎𝑠𝑠0. The hyperbolic tangent function behaves similarly to

the sigmoid, however values 𝑥 < 0 are mapped to negative output values.

1.2.2 Optimization

Before training a neural network from scratch, all weight values dictated by the number and size

of all convolutional kernels and fully connected layers are randomly initialized. After propagating

3

forward through the entire neural network, the network then begins a process called backpropa-

gation. First introduced in 1968 by Rumelhart et al.

[3], backpropagation is necessary for the

neural network to reach an optimal solution by which all the unknown weight parameters from

all kernels or neurons throughout the network are updated. In order for both the network to learn

and to optimize the weights, a loss function is needed. Loss functions, denoted as L (·), measure

how well-performing the model is after each iteration while training; the ultimate goal is to reduce

the loss (as calculated from the loss function) as close to zero as possible. For example, in a

classification problem with 𝑐 classes the most standard loss function is cross entropy,

L (𝑦𝑖, 𝑝𝑖) = −

𝑐
∑︁

𝑖=1

𝑦𝑖 log( 𝑝𝑖),

(1.1)

where 𝑦𝑖 is 1 if item 𝑖 belongs to class label 𝑦𝑖 and 𝑝𝑖 is the probability of the model predicting that

class. The higher the probability, the lower the loss function, and thus the better performing the

model is. Weight values are optimized by backpropagation computing gradients for every weight

value throughout the network, starting at the loss function and traversing backward, and updating

them all. One of the most common implementations for these gradient computations is gradient

descent [4]. A tunable parameter 𝛼, also known as the learning rate, modulates the magnitude of

changes to the weights and therefore how long it takes for the model to converge during training.

The set of all weights 𝜃 will be updated during each iteration, 𝑡, of backpropagation by

𝜃𝑡 = 𝜃𝑡−1 − 𝛼∇𝜃 𝐿 (𝜃𝑡−1).

(1.2)

where 𝐿(𝜃) is the loss function with respect to all weights of the network. There are a couple

variations of gradient descent that update the model in slightly different ways. The first is batch

gradient descent, which waits until the gradients from processing through the entire training set have

been calculated and then updates all weights. This offers a decently stable path for convergence,

but the computational complexity is proportional to the size of the dataset. On the opposite side

of the spectrum, stochastic gradient descent updates weights after calculating the gradient from

each training example. This speeds up the computation, but suffers from a more erratic and less

stable convergence. Instead, the most commonly implemented optimization technique—and what

4

is generally abbreviated as SGD—is mini-batch stochastic gradient descent; the training dataset is

divided into smaller batches and the weights of the network are updated after gradients for each

batch are calculated.

An improved optimization technique coined Adam—from adaptive moment estimation—was

introduced in 2017 [5]. As the size of datasets (explored later in Section 2.3 and network complexity

increased, the desire for more efficient optimization methods arose. By calculating adaptive

learning rates independently for parameters based on estimates of the first two moments (mean,

and uncentered variance) of the gradients, Adam is able to update parameters more effectively,

especially in situations of parameter sparsity. Some parameters may need to update more quickly

or slower than others, which stochastic gradient descent does not handle well since every parameter

is given the same learning rate. Overall, Adam is more computationally efficient while requiring

less memory, an extremely important attribute with larger and larger datasets.

1.3 Convolutional Neural Networks

Feed-forward, fully connected neural networks are powerful; however, using images as inputs

can quickly run into computational bottlenecks. In order to be sent through a traditional multi-

layer neural network, an image must be flattened into a one-dimensional vector which becomes

exponentially computationally demanding as image sizes increase. For example, given a three-

channel RGB image with dimensions 512 × 512, the resulting vector has a length of 786, 432

that needs to be sent through the network. Convolutional neural networks (CNNs) bypass these

limitations and are able to handle image data much more efficiently and effectively, ultimately

being able to find patterns within images. The primary components of CNNs that enable network

processing of large images is the use of convolutional and pooling layers, which dramatically

reduces the number of the unknown, trainable parameters found in conventional fully connected

neural networks.

1.3.1 Convolutional Layers

In order for a computer model to learn from an image, it needs a way of traversing across

the image to extract meaningful information. Convolutional neural network layers achieve this

5

through a process known as convolution and using a convolutional kernel: a small, square matrix

of initially randomized values. Similar to the associated weights between neurons in a multi-layer

fully connected network, the values in these convolutional kernels are the values learned during the

training process. Common sizes for convolutional kernels include 3 × 3, 5 × 5, and 7 × 7. Figure 1.3

illustrates the first few steps of applying a kernel to an input. The 3 × 3 kernel is first placed at the

top left of the image, then the Hadamard (element-wise) product is computed on all 9 overlapping

cells, and the sum of that product matrix becomes the first value of the output matrix. The entire

3 × 3 kernel is then shifted to the right one cell, and the process repeated. This is done over and

over, until the right-most column of the kernel reaches the right-most column of the image; then,

the kernel is moved down a row and back to the left. Colloquially known as the sliding window,

this process repeats until the bottom-right kernel value reaches the bottom right of the image. The

final output of the kernel applied across the image is known as a feature map. Early on in the

convolutional neural network, these kernels extract low-level features such as edges and corners.

As the network gets deeper, the kernels start associating these features from previous layers into

more abstract things such as shapes and textures.

Figure 1.3 Example of the first three steps of convolution.

There are a few options that modify how a convolutional kernel is applied on the input, the

first of which is the stride. In the example above, the kernel slid one space to the right after each

summed element-wise product; this corresponds to a stride of 1.

Increasing stride to 2 would

shift the kernel by two spaces after each summed element-wise product with the input. Another

6

option when customizing convolution is padding, also known as zero-padding, that optionally adds

a boundary of zeros around the input. Figure 1.4 illustrates applying a padding of 1, which adds

two extra rows and columns to the input before the kernel slides across. By padding with zeros,

the kernel is able to encode more information from the edges of the input since it passes over those

values more. Padding is also used to counteract the downsampling effect of applying a kernel; by

adding a sufficient number of zeros around the image the output can retain the same dimensions as

the input, allowing chained convolution layers at the same scale.

Figure 1.4 Zero-padding an input with one level of zeros.

The last option is called dilation. This alters the kernel by adding spaces between the kernel

values for values greater than 1. A 3 × 3 kernel is unchanged with a dilation of 1; with a dilation

of 2, a single space is added between each filter value, as shown in Figure 1.5. Adding dilation

can be beneficial where an excessive number of parameters across all convolutional filters creates

a performance bottleneck, where using a smaller kernel with dilation gives effectively the same

receptive field on the input with significantly fewer parameters to learn per kernel (9 for a 3 × 3

kernel versus 25 for a 5 × 5 kernel).

Given an input with height 𝐻 and width 𝑊 with 𝐶 channels, kernel size 𝑘, stride 𝑠, zero-pad 𝑝,

and dilation 𝑑, the dimensions of the resulting feature map can be calculated as
(cid:18)𝑊 − (𝑘 𝑑 − 𝑑 + 1) + 2𝑝
𝑠

(cid:18) 𝐻 − (𝑘 𝑑 − 𝑑 + 1) + 2𝑝
𝑠

𝐻 × 𝑊 × 𝐶𝑖𝑛

Conv2D
−−−−−−→

+ 1

×

(cid:19)

(cid:19)

+ 1

× 𝐶𝑜𝑢𝑡 . (1.3)

Multiple convolutional kernels can be used on the same input within each convolutional layer,

adjusting the third dimension of the output layer. For instance, one could create 16 distinct,

7

Figure 1.5 (left) a 3 × 3 kernel with dilation 1. (right) a 3 × 3 kernel with dilation 2.

randomly generated 3 × 3 kernels and apply each one of them across the input image. This would

change 𝐶𝑜𝑢𝑡 in Equation 1.3 to 16, each channel being the resulting values from each respective

kernel applied on the input image. At the end of each convolution layer, much like traditional

artificial neural networks, an activation function is applied which normalizes the values of the

feature map computed with the convolutional kernel.

1.3.2 Pooling Layers

(a) average pooling

(b) max pooling

Figure 1.6 The two types of pooling operations.

After one or multiple consecutive convolutional layers, a layer called a pooling layer is added

as a downsampling technique to reduce the spatial dimensions for the next stage of the network.

One pooling method is known as average pooling, shown in Figure 1.6(a), where small sections of

the feature map get averaged into a single value. Another method is max pooling, shown in Figure

1.6(b), where the largest values in similarly sized sections of the feature map are retained. Max

pooling is more common due to improved generalization and training performance [6, 7]. Typical

sizing for pooling layers is 2 × 2 with a stride of 2, which reduces each of the spatial dimensions

of the input by a factor of 2. Pooling layers serve two purposes; first, reducing the spatial size of

the feature map decreases the computational overhead for further convolutional layers. Secondly,

8

by reducing the spatial size, dominant features learned by the CNN become more invariant to

positional changes, i.e., if similar objects are in the same image at different locations, the CNN will

be less likely to have issues predicting them as the same objects.

1.3.3 Classification Network Performance

Convolutional neural networks experienced a surge of interest predominantly because of the

success of AlexNet [8] in 2012 after it achieved a significant improvement in accuracy score on the

ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [9] dataset compared to previous,

hand-crafted methods. Winners of the ILSVRC competition thereafter continued using CNN-

based models, solidifying the transition away from traditional methods to these highly-performing

deep learning models for image recognition tasks. Table 1.1 provides the sizes and ImageNet-1K

performance for some of the most common CNN architectures over the past decade. The ubiquity

Table 1.1 Performance for various classification architectures over time. Top-5 Error (%) is
calculated on the ImageNet-1K test set (except for * for DenseNet where only validation set error
was given). The number of layers for ConvNeXt-Base was not provided.

Year

Model

Layers Parameters Top-5 Error (%)

LeNet [10]
1998
AlexNet [8]
2012
GoogLeNet [11]
2014
VGG-16 [12]
2015
ResNet-50 [13]
2016
ResNet-152 [13]
2016
2017
ResNeXt-101 [14]
2018 DenseNet-121 [15]
2020 EfficientNet-B7 [16]
2022 ConvNeXt-Base [17]

7
8
22
16
50
152
101
121
36

60K
∼62M
∼6.7M
∼138M
∼25M
∼60M
∼44M
∼8M
∼66M
∼89M

-
16.4
6.7
7.3
5.25
3.57
5.3
7.71*
3.06
3.13

of powerful modern computational hardware has allowed the creation of deeper and more complex

CNN architectures, further dominating the object detection task and spearheading advancements

in applications such as automated medical diagnosis and autonomous vehicle operation. This can

be seen with networks such as VGG-16 with over 138 million parameters, all the way through

ConvNeXt-Base with 89 million. Despite AlexNet’s success with a top-5 error of 16.4, networks

have achieved remarkably high scores in the low 3% ranges. For this image recognition task, human

9

error has been estimated to be near 5% [9]; ever since ResNet-152 from 2016, there have existed

machine learning classifiers capable of out-performing humans on this type of dataset.

As evident in Table 1.1, a notable improvement in classification performance was seen going

from the VGG-16 network to the ResNet-50 network. Not only did the error decrease by over 2%,

the more important factor was that the number of parameters was drastically reduced from over 138

million to 25 million despite the depth of the network increasing. This was due to the introduction

of the residual block (Figure 1.7(b)) allowing more convolutional layers and the reduction in fully

connected layers at the end of the network. The residual block adds a “skip" connection from the

(a)

(b)

Figure 1.7 (a) standard convolution block as used in VGG-16. (b) residual block as proposed in
ResNet. Images from [18].

beginning of a convolution block that concatenates with the output. This ultimately helps training

of the neural network by reducing the chance that deeper networks reach a point where training error

begins to increase, as well as providing an identity mapping that reduces the chance of a vanishing

gradient during backpropagation. The vanishing gradient problem occurs as values from the end

of the network are sent backward through the network and get exponentially squished toward zero,

therefore preventing the network from learning [19].

10

1.4 Object Detection

The overwhelming success in classification tasks has extended interest in convolutional neural

networks for object detection tasks. This is a natural extension since the same networks used in

the classification tasks above can be reused for object detection. When used as the foundation of

object detection architectures, these networks are called the “backbone" networks . Traditionally,

these networks are pre-trained on massive datasets like the ImageNet [20] dataset mentioned above.

It is then common to use these pre-trained weights as the starting point for detection tasks; this

is known as transfer learning. With all of the low-level features learned through the backbone

architecture through the initial training, detection tasks can begin with a pre-defined starting point

and the network can be fine-tuned with the dataset of the given task.

Broadly, there are two ways to set up an object detection network; one-stage and two-stage

methods.

In a one-stage network, also known as single-shot networks, the entire process from

inputting an image to an output with both bounding box and label predictions is handled end-to-end

within a single framework. For two-stage networks, there is an initial stage used as a method for

proposing regions, with the second stage designed to properly label the predicted regions. Figure 1.8

illustrates the difference between one- and two-stage networks. In the earlier competitions between

the two, there was generally an understood trade-off of detection capabilities and inferencing time;

i.e., a one-stage network would process images quicker than a two-stage network at the cost of lower

detection accuracy.

1.4.1 Two-Stage Detectors

The first well known two-stage detector was introduced in 2014 by Girshick et al.

, named

R-CNN [21]. In their implementation, they employ a selective search algorithm [22] as the first

stage to propose 2, 000 regions from the image and sends each of them separately into its second

stage (see Figure 1.9(a) for an example). However, due to the first stage being completely detached

from the backbone CNN, every region proposal gets sent individually through the classifier stage.

The Fast R-CNN improved upon the proposal stage by pooling regions of interest into feature maps

before sending them through the classifier [23]. Due to the sharing of computation across the

11

Figure 1.8 Flow chart comparing basic steps of one-stage versus two-stage object detectors

network from the RoI pooling, training time was 18 times quicker, but possibly more important,

the time required a model to provide predictions on a test image switched from being measured in

seconds per image to images per second.

Faster R-CNN introduced a new first stage with a region proposal network (RPN) that was

created to output proposal regions for objects [24]. The output from the RPN has a sliding-window

process to generate a list of anchor boxes that are then output for each step in the sliding window.

Mask R-CNN took this a step further to have a second stage that not only provided the class and

bounding box predictions but also output a pixel-level binary mask for each prediction, allowing

the capability for full object segmentation [25]

1.4.2 One-Stage Detectors

Examples of one-stage detectors are Single-Shot Detector (SSD) [27], You Only Look Once

(YOLO) [26], and RetinaNet [28]. In the YOLO architecture, images are divided into a 𝑆 × 𝑆 grid

where each grid cell is responsible for predicting bounding boxes and confidence scores for any

objects contained within the cell (Figure 1.9(b)). By sending the image through the network only

12

(a)

(b)

Figure 1.9 (a) region proposal example for R-CNN [21] (©2014).
YOLO [26] (©2016).

(b) grid cells detection in

once, the time to train the CNN architectures as well as the time it takes for them to propose bounding

boxes on new images is greatly reduced compared to their two-stage counterparts. Originally praised

and implemented due to their faster real-time inferencing capabilities, single-stage architectures and

their variants have become more common and have been achieving results on par with two-stage

methods.

Table 1.2 provides a breakdown various different detection networks, with their mean Average

Precision (mAP) performance on either the Pascal VOC 2007 [29] or the Microsoft COCO [30]

datasets, in addition to how many images the trained networks can detect on per second (abbreviate

FPS for frames per second).

Of the detection architectures in Table 1.2, this dissertation has a large focus on RetinaNet and

13

Table 1.2 Table of mainstream, state-of-the-art detection architectures of recent years. Performance
of the architectures is provided on two datasets: mAP1 is on the Pascal Visual Object Classes (VOC)
2007 [29] test dataset, mAP2
50 is on the Microsoft Common Objects in Context (COCO) [30] test
dataset.

Year

Architecture

Backbone

mAP1 mAP2

2014 R-CNN [21]
2015 Fast R-CNN [23]

2015 You Only Look Once (YOLO) [26]

2016 Faster R-CNN [24]
2016 Single Shot Detector (SSD) [27]
2017 Mask R-CNN [25]
2018 RetinaNet [28]
2018 YOLOv3 [31]
2020 EfficientDet [32]
2020 YOLOv5-l6 [33]

66.0
70.0
63.4
66.4
73.2
76.8

VGG-16
VGG-16
Custom
VGG-16
VGG-16
VGG-16
ResNet-101
ResNet-101
Darknet-53
EfficientNet
CSP-Darknet53

50 FPS
0.08
0.5
45
21
17
22
5
5
20
35
63

42.7
46.5
60.3
61.1
57.9
65.9
71.3

YOLOv5. RetinaNet was one of the first object detection architectures to incorporate a Feature

Pyramid Network (FPN) [34] on top of the backbone network, which can be seen in Figure 1.10(a-

b). During convolution, output feature maps are sent directly through to a matching upscaling

level at three resolution scales. After these two feature maps are concatenated at each level in the

pyramid, anchor boxes are used to scan through the feature maps to send object proposals through

to two sub-networks (Figure 1.10(c-d)). There are nine anchor boxes per level that vary in aspect

ratio (1:1, 1:2, and 2:1), as well as three scaled sizes per aspect ratio. After anchors are sent through

to the sub-networks, one sub-network is responsible for the classification of the object and the

second is tasked with the bounding box regression to localize the detection around the object. On

one hand, the incorporation of the FPN and classification and regression sub-networks drastically

increased RetinaNet’s mean average precision (mAP) performance on the MS COCO dataset, but

also causing it to have a much slower inference time compared to SSD or YOLO (5 FPS vs 59 and

45, respectively). The other major contribution from the RetinaNet model was a new loss function

they dubbed Focal Loss (Equation 1.4. They simplified the standard cross entropy loss function

(Equation 1.1) into the − log( 𝑝𝑡) term by defining 𝑝𝑡 = 𝑝 if 𝑦 = 1 and 𝑝𝑡 = 1 − 𝑝 and then defined

14

Figure 1.10 Diagram of the RetinaNet detection architecture [28] (©2020).

Focal Loss (FL) as:

𝐹 𝐿( 𝑝𝑡) = −𝛼𝑡 (1 − 𝑝𝑡)𝛾 log( 𝑝𝑡),

(1.4)

where 𝛼𝑡 is a class-balancing hyperparameter. The most important term of this new loss function

is the (1 − 𝑝𝑡)𝛾 with what they call the “focusing" parameter, 𝛾 (which they set to 2). In essence,

depending on how confident the model is for a given prediction, this component adjusts how much

the loss contributes. For highly confident predictions this will dramatically reduce the effect of

the loss; e.g., for 𝑝𝑡 = 0.99, this term becomes (1 − 𝑝𝑡)𝛾 = (1 − 0.99)2 = 0.012 = 0.0001. By

introducing this term, the model effectively learns to disregard highly confident predictions and

“focus" on improving and correcting more difficult examples.

1.4.3 Challenges for Object Detection

There are numerous challenges encountered in object detection tasks, each causing different

effects in the performance of the deep learning models. By far the most common, especially

depending on the specific application, is the access and quality of the data. With the increasing size

and complexity of models over the years (such as the parameter and depths of backbone networks

seen in Table 1.1), there is a proportional need for more and better data to improve training quality.

With smaller datasets, there is a chance for the detector to overfit to the training set and not be as

capable when generalizing to outside datasets.

There are augmentation techniques that can somewhat mitigate the effects of smaller datasets.

These include transformations such as rotating or flipping images horizontally or vertically. Hue,

brightness, and saturation modifications modify the values and pixel intensities that can also

improve detector performance if given images that are drastically different in terms of exposure or

15

light-levels. Over time, more complex augmentation strategies have been considered. For instance,

YOLOv4[35] introduced the concept of mosaic augmentation, where segments from four different

training samples are mixed together at varying sizes into new training samples. This changes the

context of the objects being detected, limiting chances for the model to make direct associations

from certain contexts to the existence of objects. Even more recently, there has been interest in

synthetically generating images as an augmentation technique [36, 37, 38] (a direct application of

this will be discussed later in Section 4.9).

Another hurdle for detection tasks, especially with earlier architectures, is object scale variation.

Even though humans can easily perceive objects at smaller sizes to be the same, the difference in

scale can prove difficult for detectors. The SSD architecture [27] was one of the first one-stage

networks to implement sending feature maps at multiple stages during downsampling as outputs

straight to the detection and classification end of the network, allowing the network to learn about

objects at multiple scales. The Feature Pyramid Network (FPN) [34] discussed above is a more

complex implementation of handling the multi-scale problem, since the feature maps from the earlier

downsampling stages are concatenated with feature maps of upsampling stages prior to being sent

to the regression and classification heads. These led to marked improvements in detecting objects

of different sizes.

Object occlusion also hampers localization capabilities for detectors, as the more an object

gets obscured by another object the more challenging it becomes to separate them into separate

objects, or perhaps even outputting a prediction for the occluded object [39]. Imagine an automated

driverless car with vision of an upcoming intersection, but a human behind a sign is not recognized

and therefore runs the risk of the car not properly stopping in time if they decide to move onto the

street.

1.5 Evaluation Metrics

There are various metrics in use to evaluate the performance of machine learning models across

classification, detection, and segmentation tasks. Here, we provide background on a few of the

main ones, especially those that will be used throughout this dissertation.

16

1.5.1 Confusion Matrix

For classification tasks, the overall performance of a model can be visualized in what is known

as a confusion matrix. It quantitatively measures how the predicted values from the model compare

to their true values. Table 1.3 shows what a confusion matrix looks like for a binary classification

task with only two labels, 1 (positive) and 0 (negative). From the values within this matrix, there

Table 1.3 A typical layout for a confusion matrix for a binary classifier.

True Label

1
True Positive (TP)

0
1
False Positive (FP)
0 False Negative (FN) True Negative (TN)

Predicted Label

are multiple other metrics that can be derived. The first of which is accuracy,

𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦 =

𝑇 𝑃 + 𝑇 𝑁
𝑇 𝑃 + 𝐹𝑃 + 𝐹 𝑁 + 𝑇 𝑁

,

(1.5)

that provides a very general overview of how well the model performs.

It is bounded between

0, representing the model achieving no correct predictions, and 1 where the model is correct

on everything. Accuracy can be misleading, however, especially for imbalanced datasets. For

example, for a dataset containing forty-five 0-labeled instances and five 1-labeled instances, if the

model predicted every instance as a negative, or 0, label, the accuracy would be 0.90. There are

additional metrics that provide a more fine-grained perspective of model performance. Precision,

otherwise known as the positive predictive value (PPV), is a ratio comparing the number of correctly

labeled instances to all of the instances labeled as 1, i.e.,

𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 =

𝑇 𝑃
𝑇 𝑃 + 𝐹𝑃

,

(1.6)

with values approaching 1 representing models that are accurate with the labels they predict as

positive. Specificity, or the true negative rate (TNR), captures how many 0-labeled instances

compare to the total number of true 0-labeled instances,

𝑆 𝑝𝑒𝑐𝑖 𝑓 𝑖𝑐𝑖𝑡𝑦 =

𝑇 𝑁
𝑇 𝑁 + 𝐹𝑃

,

(1.7)

17

where values approaching 1 indicate the percentage of negative classes properly predicted as

negative. Recall, otherwise known as sensitivity (note: we use these two terms interchangeably

throughout the dissertation), demonstrates how well the model is able to accurately predict the

positive classes in the dataset, calculated as

𝑅𝑒𝑐𝑎𝑙𝑙 =

𝑇 𝑃
𝑇 𝑃 + 𝐹 𝑁

.

(1.8)

All three of these scores (precision, specificity, and recall) are bounded in the range [0, 1] and, as

discussed, values closer to 1 represent better model performance. By splitting the performance

into these separate metrics, we are able to better grasp the capabilities of machine learning models

depending on the perspective we want to approach evaluation.

Note that for object detection tasks, there are slight modifications to the application of these

three metrics. Given that object detection carries a localization aspect, a "true negative" cannot be

measured as it requires the absence of an object which the detector should not apply a bounding box

for. This eliminates specificity as a measure for object detectors. For precision and recall, this can

be calculated for all proposed boxes that carry a probability score exceeding the model acceptance

threshold; this is how these metrics are used throughout this dissertation.

1.5.2 Receiver Operating Characteristic

While the confusion matrix, and therefore precision, specificity, and recall, are all great ways to

measure the performance of classification models, they require the probabilities for each instance

output by the model to be assigned to either the positive 1 or negative 0 classes based on a defined

threshold, such as 0.50 for a sigmoid output activation. To get a more holistic understanding of how

the model performs given all probability scores of the output, the receiver operating characteristic

(ROC) curve presents the sensitivity and false positive rate (calculated as 1 − 𝑠𝑝𝑒𝑐𝑖 𝑓 𝑖𝑐𝑖𝑡𝑦) for each

of the predicted probabilities. This provides insight into the tradeoff of sensitivity and specificity

for different operating points for the model (different threshold settings).

From Figure 1.11, it can be seen that model performance tends toward the top-left. A perfect

model would appear as a single dot in the top-left most corner, showing the model has a sensitivity

18

of 1 (every positive class in the dataset was correctly predicted) with a false positive rate of 0 (no

items were falsely predicted). The dashed gray line in the diagonal is the ROC to be expected from

a completely random classifier, i.e., having a 50/50 chance of scoring a given instance as positive

or negative. From the ROC curve, a single measure called the AUC (area-under-the-curve, also

ROC-AUC) can be calculated as the name implies: the area underneath the computed ROC curve.

Values approaching 1 represent model performance tending towards a perfect classifier, whereas

an AUC score of 0.50 matches a random classifier.

Figure 1.11 An example ROC curve. The dashed diagonal line represents performance expected
from a completely random classifier. The shaded region represents the area-under-the-curve (AUC)
for the orange curve. Lines tending toward the top-left of the graph represent better model
performance.

1.5.3 F2 Score

A common evaluation metric used in classification and detection tasks is F1 score, which is the

harmonic mean between precision and recall.

𝐹1 = 2 ·

precision · recall
precision + recall

(1.9)

19

10011 -Specificity (FPR)SensitivityThis is a specific implementation of a general 𝐹𝛽 score where both precision and recall are considered

equally as important in the final calculation. The constant term 𝛽 can be changed to weight precision

or recall 𝛽-times more than the other. For instance, if one wants recall to carry 𝛽-times as much

importance as precision in the calculation, the equation becomes

𝐹𝛽 = (1 + 𝛽2) ·

precision · recall
(𝛽2 · precision) + recall

(1.10)

If instead one desired to weight precision 𝛽-times more than recall, the denominator would change

to (precision + (𝛽2 · recall)). For our evaluation, we are placing a larger emphasis on recall over

precision, as we want the deep learning models to be liberal with predicting potential fracture

locations even if the predictions actually do not contain fractures. In the long term, this tool is

intended to serve as a reading aid for radiologists to flag suspicious regions; the final determination

of the presence or absence of fractures will be determined by the radiologist aided by the model

output. For these reasons, we are predominantly evaluating all models by F2 score and placing

twice as much weight on recall as precision, i.e., setting 𝛽 = 2 in equation (1.10).

1.5.4 Mean Average Precision (mAP)

As discussed above in Section 1.5.1, precision represents the number of labels predicted as

positive that are correctly labeled, whereas recall represents the number of correct labels identified

of all possible positive objects. In object detection, the values found for TP, FP, and FN labels can

be tweaked in two ways: varying the model confidence for accepting bounding box predictions, or

varying the IOU threshold [40]. We can visualize precision and recall performance across all model

confidences by computing the precision-recall curve, where the IOU threshold is held constant.

Then, to summarize the performance into a single metric, we can calculate the Average Precision

(AP) by taking the weighted sum of precision values at equal intervals across recall values. Going

further, the mean Average Precision (mAP) includes multiple precision-recall curves across many

IOU thresholds and computes the mean of the average precisions from all plots; e.g., if the score is

represented as mAP[0.05:0.05:0.95], precision-recall curves are generated for every IOU threshold

between 0.05 and 0.95 at intervals of 0.05, and the average precisions across all 19 precision-recall

20

curves is averaged. This provides a much more broad understanding of the performance of an

object detection architecture.

1.5.5

Intersection-over-Union (IOU)

The Jaccard Index, or Intersection-over-Union (IOU) [41], is a measure for quickly and easily

scoring the overlap from a predicted region and a ground-truth region. This can be used for

both bounding boxes in object detection, and pixel-level predictions from image segmentation. It

computes the overlapping region between two regions and divides it by all elements/pixels among

the two regions, shown in Equation 1.11,

𝐼𝑂𝑈 =

| 𝐴 ∩ 𝐵|
| 𝐴 ∪ 𝐵|

,

(1.11)

where | · | is the number of pixels in the region. Bounded in the range [0, 1], higher values represent

predicted regions that more closely mirror the true regions. Assuming a proposed box contains

within it the entire true box, the IOU score may still be small if the size of the proposed box is

much greater than the true box.

1.5.6 Dice Coefficient

The Dice coefficient [42] is a measure for evaluating the similarity between two sets. For image

segmentation, it scores how many pixels from a model output overlap with ground-truth pixel

annotations. Equation 1.12 gives the mathematical expression to compute the Dice coefficient

Dice Coefficient = 2 ·

| 𝐴 ∩ 𝐵|
| 𝐴| + |𝐵|

(1.12)

where | · | represents the number of pixels in the respective set. The Dice coefficient ranges from

[0, 1], with 1 representing pixel-perfect coverage of the ground-truth annotation.

1.5.7 Alternative Free-Response Receiver Operating Characteristic (AFROC)

In traditional classification tasks, the receiver operating characteristic (ROC) curve is an evalu-

ation measure showing the relationship between the true positive rate and false positive rate across

a wide range of thresholds. From this, the accompanying area under the curve (AUC) can be

computed to summarize ROC curve performance between [0, 1]. Unfortunately, this measure is

21

unsuitable for detection tasks as predictions are localized within the image, rather than the image

as a whole. The alternative free-response receiver operating characteristic (AFROC) [43] resolves

this issue by incorporating local-level annotation information. Note, for the following equations the

word lesion represents a hand-labeled region of interest on a given image (e.g., a cancerous mass in

an organ and a bone fracture would both be considered lesions). Two main equations are needed to

calculate the AFROC curve, the lesion localization factor (LLF) and non-lesion localization factor

(NLF):

𝐿𝐿𝐹 =

𝑁 𝐿𝐹 =

# correctly marked lesions
total # of lesions
# incorrectly marked lesions
total # of images

.

(1.13)

(1.14)

Note that 𝐿𝐿𝐹 ∈ [0, 1] but 𝑁 𝐿𝐹 ∈ [0, ∞). To bound both axes, AFROC assumes the likelihood of

marking 𝑛 localized regions inaccurately per image to be determined by the Poisson distribution,
𝑃(𝑛) = 𝑒−𝜆𝜆𝑛
𝑛!

, with 𝜆 being the average number of incorrectly marked localizations, i.e., the NLF.

Then, the false positive rate can be computed as

𝐹𝑃𝑅 = 1 − 𝑃(0)

𝑒−𝑁 𝐿𝐹 𝑁 𝐿𝐹0

= 1 −

0!
= 1 − 𝑒−𝑁 𝐿𝐹 .

(1.15)

(1.16)

(1.17)

Therefore the AFROC curve, like the original ROC curve, has both a x- and y-axis restricted to

[0, 1]. This allows the calculation of the area under the curve once again, with AUC values closer

to 1 representing perfect performance.

22

CHAPTER 2

DEEP LEARNING IN MEDICAL IMAGING

2.1 Significance

Deep learning-based architectures have been gaining popularity within the medical field over

the past several years. The capabilities to automatically classify, localize regions of interest, and/or

provide pixel-level segmentations of pathologies in images has numerous applications within the

medical field that can ultimately lead to improvement of patient outcomes. This is incredibly

poignant knowing there is a perpetual shortage of physicians in the country. With a smaller ratio

of physicians to population, the ability to effectively and efficiently diagnose patients becomes

more and more of a challenge, especially for individuals living in more rural communities with

even less access to high quality healthcare. The issue compounds further when specialists are

needed, since they are much more difficult to find than general physicians. From the mid 1990’s

to early 2010’s, despite a raw increase in number of radiologists of nearly 40%, the proportion

of trained radiologists compared to general physicians dropped by 8.8% [44]. Without enough

access to trained and specialized radiologists, there is a two-fold effect on patient outcomes: either

patients get care from less specialized doctors that may miss important and possibly life-threatening

conditions, or the specialized doctors have less time to interpret the medical images and therefore

also risk missing critical regions. While overlooking a fractured bone in a hand or foot from a

radiological scan is not likely to cause life threatening issues, missing a malignant tumor in an organ

can lead to terminal outcomes, especially if not caught early enough. By leveraging the powers

of computers to automate initial evaluation of medical images, the likelihood of detecting more of

these lesions increases, and improving the chances for positive patient outcomes.

2.2 Medical Images and Modalities

Compared to natural images, there are various different challenges when attempting to use

medical imaging for deep learning. The most obvious is the manner in which images are captured.

Figure 2.1 presents example images from different imaging modalities. Each image was generated

23

Figure 2.1 Example images from common medical imaging exams of the chest region. Displays
the diversity of image presentations and the reality that medical images differ from natural images
in multiple respects. Modified with permission under CC 4.0 International from [48].

from very different physical processes as outlined below. The focus of this work will use X-ray

radiographs of the chest. In brief, x-rays are generated in a x-ray tube and transmitted throughout

a subject. An imaging detector is situated on the other side of the subject and collects the x-rays

that are not attenuated by the subject. This creates an image of the x-ray attenuation. Figure 2.2

compares images from common chest radiograph acquisitions.

While the work in this dissertation focuses on X-ray radiographs, there are many other methods.

A common medical imaging modality is computed tomography (CT), represented by the top-left

image in Figure 2.1. In CT scans, x-rays are sent out through the patient as the x-ray tube rotates

around them, capturing many views from all angles around them. There is generally a trade-off of

image quality and patient radiation dose, and there have been numerous efforts to retain as much

information with lower dosages [45, 46, 47]. Positron emission tomography (PET) (top-right image

in Figure 2.1) involves injecting the patient with a glucose-like substance that has been labeled with

radioactive isotopes that, once in the tissue, decay and release positrons that annihilate in collisions

with nearby electrons, generating two gamma rays that get sent in opposing directions that the

detector captures.

For X-ray, CT, and nuclear medicine modalities, there has been a persistent concern over the

24

Figure 2.2 Examples of common chest radiograph acquisitions using x-ray and patient positioning
for AP and PA views. The work in this dissertation uses AP chest radiographs. Used with
permission from [49].

amount of ionizing radiation absorbed by the patient. Each procedure has an approximate effective

dose measured in millisieverts (mSv) that changes based on numerous factors including the machine

used as well as the region being imaged [50]. Generally, each year an average person receives an

effective dose of approximately 3 mSv from natural sources of radiation. A thoracic x-ray imparts

approximately 0.1 mSv, whereas a CT scan of the chest imparts approximately 6.1 mSv. There

have been efforts to quantify annual effective doses and cumulative effective doses (CED) as ways

to monitor the amount of radiation from imaging across age groups and populations. Fazel et

al.

[51] found that the annual effective dose increases as ages in the population grow, with 52

per 1000 patients experiencing between 20-50 mSv annually compared to only 4.9 per 1000 in

younger adults aged 18-34. They also found that on a yearly basis, CT scans of the abdomen

represent 18.3% of the annual effective dose across all imaging procedures. In many studies, the

threshold given for cumulative effective dose is greater than 100 mSv [52]. Frush et al. found in

their meta-analysis that for pediatric patients, the number of patients exceeding this CED ranged

generally represented less then 1% of the patients, but across all ages could be as high as 2.9%.

There remain unknown associations between the risk of cancer and the low levels of radiation

25

imparted in medical imaging. In the face of this unknown knowledge, the general practice is to

limit image quality to ensure as-low-as-reasonably-achievable radiation doses are used [53].

While X-ray, CT, and PET involve some level of radiation exposure, there are other imaging

procedures that do not involve ionizing radiation. One of these is magnetic resonance imaging

(MRI), shown in the top-middle image in Figure 2.1. MRI machines generate a steady magnetic

field around the patient, aligning the atoms inside them in a certain way. Then, bursts of radio

frequencies are sent toward the patient, and as the atoms relax and re-align with the magnetic field

they emit signals that the machine capture. MRI is widely considered to be the superior modality

for soft tissue imaging. Lastly, ultrasound (bottom-left image in Figure 2.1) uses sound waves

emitted at a higher frequency than we can hear into the patient, and reflections from these sound

waves generate the image seen during an ultrasound scan. One of the major benefits of ultrasound

is that the image can be generated in real-time so it is very beneficial for patient-facing applications

such as fetal development during pregnancy [54].

From the source, medical images are also stored differently. Natural images, especially those

taken by cell phones or general consumer cameras, are commonly stored in PNG or JPEG formats

which can be easily adapted or directly used in machine learning models. It is common for medical

images to be stored in DICOM (Digital Imaging and Communications in Medicine) files in hospital

storage systems called PACS (picture archiving and communication system) [55]. These DICOM

files contain the pixel values that were reconstructed from the raw detector measurements, as well

as metadata that provides information such as the scanner used, the settings of the machine, pixel

spacing and slice thickness, but also patient health information (PHI) such as the patient name, age,

and imaging center where the image was taken. Whereas natural images are generally three-channel

RGB images with 8-bits of information per channel, medical images tend to be monochromatic

single-channel images with anywhere from 12-bit to 16-bit values. For a single radiograph, these

images might not be the largest—for instance, an uncompressed 2000 × 3000 16-bit radiograph

would be about 12MB in size–compared to images taken from a phone camera which may be in the

4-7MB range for a similar 12MP image. However, consider a 3D CT volume for a single patient,

26

where each image is 512 × 512 and there are 500 slices; at a pixel-spacing of 0.1mm, this only

represents 50mm of physical space imaged, but is 512 · 512 · 500 · 16/8 = 262MB in size. A group

of 50 patients yields a sizeable 13GB dataset in that case.

2.3 Challenges in Obtaining Medical Imaging Datasets

One of the leading issues with applying deep learning to medical imaging applications is the

difficulty in attaining adequate amounts of data. Natural image datasets like the ones mentioned in

Chapter 1—PASCAL VOC, ImageNet, and COCO—all have the advantage of being massive sets of

images. The 2007 iteration of the PASCAL VOC dataset had 9, 963 images with 24, 640 uniquely

annotated objects among 20 classes [29]. At the time of its initial release in 2009, ImageNet

consisted of over 3.2 million images with 5, 247 unique labels, where each label has an average

of over 600 images each [20]. The standard dataset used for pre-training networks, as well as

the ILSVRC classification challenges, used a subset from the ImageNet dataset of 1, 000 classes

with over 1.2 million images commonly referred to as ImageNet-1K [9]. ImageNet expanded to

contain well over 14 million images with nearly 22, 000 classes; however, these classes are not

mutually exclusive like the smaller set and there is heavy class imbalance between labels [56].

While not nearly as large, the Microsoft COCO dataset still had over 328, 000 images with 2.5

million individual annotations of over 90 classes [30]. However, none of these datasets approach

the enormity of Google’s internal dataset, JFT-300M [57]. This massive image dataset has over

300 million images containing over 375 million labels spread across 18, 291 classes.

For these natural image datasets, bridging the gap from classification tasks to object detection

and segmentation is not incredibly difficult. Normal human readers are able to easily discern object

classes from one another and either place a box around each one or create pixel-level segmentations

of each object. There are various methods these labels can be obtained, such as giving smaller

chunks of images to colleagues and friends or crowd-sourcing by providing an online platform for

users around the world to submit their labels. Thus, these images are easy to obtain and inexpensive

to get labeled. This is unfortunately not the case for medical imaging datasets.

One of the first roadblocks to retrieving medical imaging data is the regulations surrounding

27

them. In the United States, any information on patients garnered through doctors visits is protected

by the Health Insurance Portability and Accountability Act (HIPAA). This includes all types of

medical imaging performed, whether it be a radiograph, MRI, or PET scan. In order for medical

images to be usable by outside personnel such as machine learning researchers, a request needs

to be made to an institutional review board (IRB) and they must grant approval for the data to be

released. Then, all potentially identifying patient information needs to be properly anonymized so

that any external use cannot possibly be linked back to the actual patient. This is a tedious and

time-consuming process, making it quite difficult to acquire data in the first place, let alone large

quantities of it. Meanwhile, it is estimated that over 3.6 billion medical diagnostic examinations

are completed annually worldwide [58], with approximately 70-80 million chest radiographs and

CT scans each year within the United States alone [59, 60].

Even after medical images are granted approval to be used and properly retrieved, there is a

chance they cannot be immediately used for computer vision purposes. In some cases, such as for

whole-image classification, it is possible that the diagnoses in the charts associated with the images

can be extracted; e.g., a patient who received a chest x-ray with pneumonia would have the label

extracted simultaneously. However, for localization tasks like object detection and segmentation

the problem becomes much more challenging. For each task one wants to apply deep learning to,

they must get the hand-labeled annotations on the images which requires specific domain experts.

For instance, radiographs containing various types of bone fractures would likely benefit from a

musculoskeletal radiologists as opposed to an oncologist who specializes in cancer. Moreover,

due to the inherent challenges posed by providing accurate labels on the set of images, the best

case scenario for the least noisy labels is to obtain multiple reads on the same images with a final

consensus read. This compounds both the requirement in the number of collaborating medical

professionals as well as the amount of time it takes to not only acquire the images but to get

accurate labels.

Due to these challenges and the increasing interest and demand for applying deep learning

methods to medical imaging tasks, there have been a few large-scale imaging datasets released for

28

public use. One of the most well known chest radiograph datasets is CheXpert [61]. In it, they

collected 224, 316 chest radiographs from 65, 240 unique patients. There are 14 labels, such as

pneumonia, pneumothorax, edema, and fracture. In order to obtain ground-truth labels for their

data, they used a consensus result from five different board-certified radiologists. DeepLesion [62]

is a CT-based dataset released by the NIHCC that comprises over 33, 000 images from 4, 477

unique patients. They used bookmarked images that contained relevant annotations, such as arrows

or diameters of lesions, added by radiologists over time. This dataset serves as a way to measure

performance of deep learning models in localizing and detecting lesions of various regions, such

as on the liver, lungs, and abdomen.

The data science competition website Kaggle has an extensive collection of datasets for computer

vision tasks. Currently, it houses over 2, 900 datasets under the "Computer Vision" category.

However, when it comes to medical imaging datasets the collection appears to be relatively limited.

A search for “x-ray" yields only 48 datasets, while searches for "computed tomography" or "MRI"

return 12 and 23 datasets, respectively. Notably, the vast majority of datasets on Kaggle are intended

for non-medical applications, demonstrating the divide in publicly available datasets for medical

imaging compared to natural imaging tasks.

2.4 Brief Overview of Previous Applications

Given the widespread interest in machine learning and AI, there is an equally wide array of

studies applying their automated classification and localization prowess on radiological datasets.

Deep learning algorithms have been tasked with classifying breast density, classifying malignancy

in lung nodules, identifying various lung disease, and even determining bone age [63]. Below, we

provide a few specific examples of classification and detection tasks applied to medical images.

2.4.1 Classification Examples

Deep convolutional neural networks have been able to perform exceptionally well in classifying

cases of skin cancer [64, 65]. Esteva et al. utilized a GoogleNet Inception v3 architecture pre-

trained on ImageNet-1K and fine-tuned on a curated dataset of over 129, 000 clinical images

showing 2, 032 various diseases, with 1, 942 test images properly labeled via biopsy. Compared

29

to expert dermatologists scoring between 65 and 66% on a portion of the validation samples, the

deep learning model achieved a result of 72.1 ± 0.9% accuracy. More impressively, the CNN

architecture scored above nearly all 21 expert dermatologists who scored a subset of the biopsy-

labeled test set, achieving ROC AUC values ranging from 0.91 to 0.96 depending on the type of

skin cancer. Similarly, Haenssle et al. trained an Inception v4 architecture to classify dermoscopic

images of melanocytic lesions as melanoma or benign nevi. Using a test set of 300 images, the

CNN achieved a sensitivity of 95%, specificity of 80%, and AUC of 0.95. When tested against 58

dermatologists on a subset of 100 difficult cases, the CNN performed comparably or better than

most of the dermatologists, achieving higher specificity at the dermatologists’ average sensitivity.

In other words, the average sensitivity of the dermatologists was 86.6% with which they achieved a

specificity of 71.3% where the Inception V4 network reached 82.5%, yielding a higher ROC AUC

of 0.86 compared to the dermatologists’ 0.79. When dermatologists were given additional clinical

information alongside images, the mean sensitivity increased to 88.9% with a specificity of 75.7%,

whereas the CNN at an equivalent sensitivity attained a 82.5% specificity.

There are also many applications of deep learning architectures on more traditional medical

imaging modalities such as radiographs, CT, and MRI. Multiple studies have shown the effect of deep

classification networks on identifying beast cancer in both mammography and MRI [66, 67, 68].

Hu et al. sought to classify 927 unique breast lesions as either benign (N=199) or malignant

(N=728). First using an ImageNet pre-trained VGG19 network as a feature extractor and sending

the outputs through various SVM classifiers they were able to achieve up to 77.9% sensitivity

and 78.5% specificity. Shen et al. created a whole-image breast cancer classifier by first training

ResNet50 patch classifiers and adding additional convolutional layers. Testing the performance

on two datasets, CBIS-DDSM and INbreast, they achieved an ROC AUC score of 0.91 and 0.95

for each, respectively. Whereas these examples used just images for training and evaluation, there

has been investigation into the idea of incorporating both image and non-image data for improved

classification performance. Holste et al. [69] found that on a dataset of over 17, 000 breast MRIs,

using just images for a ResNet50 classifier gave an AUC of 0.849 but by adding in a feature vector

30

extracted from thirty-three non-image inputs, such as patient age and breast density, the AUC

increased to 0.898 which could be improved even further through ensembling to 0.903.

The cohort that released the CheXpert dataset from earlier published in-house results of networks

on their dataset [61]. They trained a DenseNet121 network on their data and compared the

performance to three radiologists in a smaller 500 image subset that they had annotated by five

radiologists, making a total of eight radiologists involved in labeling and scoring the test set.

Across a handful of the 14 classes, the trained models were able to perform relatively close to the

board-certified radiologists with AUCs ranging from 0.85 to 0.97.

2.4.2 Examples of Localization and Detection in Medical Images

There is a wealth of object detection applications on a variety of medical imaging tasks. A

review by Mazurowski et al. [70] listed 17 studies focusing on detection tasks, compared to 12 on

various classification tasks and 13 for segmentation. These detection applications vary wildly by

region, such as breast cancers and tumors, pulmonary embolisms, and cerebral hemorrhages. Of

the 17, only one was specified as a fracture detector, demonstrating the quite wide range of tasks

that object detection can be applied to in the medical field.

When introducing the DeepLesion dataset, the authors trained a Faster R-CNN based model

on the dataset for single-lesion and multi-lesion detection. The model obtains a slightly higher

detection accuracy when doing multi-class detection compared with single-class, increasing from

59.45% to 64.3% [62]. Prior to conducting the work discussed through the remainder of this

dissertation, we compared three different object detection CNNs on their ability to detect lesions.

Using a subset from the DeepLesion dataset, we created a training set of 7, 500 lesion-present

images and a test set of 1, 000 lesion-present images. Table 2.1 shows the performance of the SSD,

YOLOv3, and RetinaNet architectures. Performance varies wildly between the three architectures.

YOLOv3, as an example, barely captured any of the lesions present in the test set, but was the

most confident in the regions it proposed. RetinaNet performed overall the best, with the highest

F1 score of 0.412; even then, it was only able to capture just below 34% of all lesions. Figure 2.3

provides a single slice and the detection performance from each of the models.

31

Table 2.1 Performance of SSD, YOLOv3, and RetinaNet at detecting lesions on a subset of the
DeepLesion [62] dataset.

SSD YOLOv3 RetinaNet

Precision
Recall
F1 Score

0.657
0.241
0.353

0.909
0.010
0.019

0.526
0.339
0.412

Figure 2.3 Example slice from the DeepLesion test dataset with ground truth (yellow) and predicted
(teal) boxes from SSD (left), YOLOv3 (center), and RetinaNet (right).

Numerous deep learning methods for rib fracture detection have been developed in the last 5

years for volumetric CT images [71, 72, 73]. Yao et al. implemented a three-stage process for rib

fracture detection, beginning with a U-Net for bone segmentation of the CT image, isolating the

ribs and removing additional bony structures such as scapulae, and classifying whether a fracture

was present via a 3D DenseNet [74]. A similar approach was taken by Zhang et al. utilizing a

nnU-Net to segment areas of ribs that may contain a fracture and a secondary stage with a DenseNet

to classify the segmented region [75]. MICCAI hosted the RibFrac challenge in 2020 that invited

methods to detect the location and classify fractures into four clinical categories on CT images [76].

The three leading methods from this challenge used RetinaNet or variations of masked R-CNNs.

Fewer methods have been developed for rib fracture detection on 2D radiographs.

There have been substantial efforts to apply deep learning to chest x-ray images, thanks partially

to large publicly available data sets of common chest pathologies [61, 77]. These efforts largely focus

on classification of the image and on lung diseases such as fibrosis, pneumonia, and COVID-19 [78].

There have been successful efforts for improved wrist fracture detection on radiographs [79], but

32

there has been less effort focused on rib fracture assessment using radiographs [80]. Gao et al.

performed rib fracture detection on radiographs with their proposed CCE-Net where multiple feature

extraction modules were fused together as inputs to a two-stage detection network demonstrating

improved performance compared to competing R-CNN and YOLOv4 architectures [81]. At the

time of writing this dissertation, we are not aware of any other published methods attempting to

automatically detect the location of rib fractures on pediatric chest x-rays.

2.5 Challenges of Pediatric Rib Fracture Detection in Medical Imaging

Detection of rib fractures on pediatric chest radiographs is challenging for a host of reasons [82].

Depending on the child’s orientation to the imaging detector, a region containing a fracture on the

rib can be obscured by the same rib. For example, on an AP-view radiograph a fracture occurring at

posterior arc of the rib can be covered by the posterior part of the rib, making it particularly difficult

to identify especially if the fracture is not recent. The age of fractures also make identification

difficult. Acute rib fractures may be undetectable on chest radiographs and in some cases only

become evident after callus formation (new bone growth) develops 10 to 14 days into the healing

process [83]. There are also age-related variations in the rib structure; in children, each rib can

grow by nearly 10% per year with middle ribs growing the fastest [84]. Lastly, overlying anatomy

and artifacts, such as monitor leads, support devices, and clothing, can create further perceptional

differences that can hinder outcomes of the interpretations. It is therefore not surprising that missed

rib fractures in children are quite common and that the sensitivity for detection of any rib fracture

on pediatric chest radiography by experts is only about 31% [85, 86]. In practice, it is desired

to have multiple expert radiologists read images and use either a combination of annotations or a

separate consensus read to act as ground truth. As we reveal later in Section 4.2, even intra-reader

variability on these difficult pediatric datasets can still leave a lot of fractures missed.

As was discussed previously, the a major challenge to overcome in object detection is access to

large enough datasets, especially for medical image applications. This challenge grows even more

when looking specifically at pediatric datasets, considering children are a vulnerable population

and also generally have less medical imaging encounters than older patients. Ghosh et al. presented

33

results for a similar task as rib fracture detection; specifically, they perform classification of sub-

patches of the radiograph, which provides coarse localization of objects. For their study, they were

also challenged by dataset volume and were only able to procure a total of 415 chest radiographs

of children under the age of 2 including chest radiographs, which were imaged over the course of

18 years [87].

34

CHAPTER 3

PEDIATRIC RIB FRACTURES - DATA AND PATTERNS

3.1 Background

Child physical abuse is a serious problem with over 108,000 confirmed cases in the US in 2020,

leading to an estimated 1,750 deaths (2.38 for every 100,000 children) from abuse and neglect [88].

This issue becomes even more severe when focusing on children under the age of 3, where over

70% of infant deaths were from children 3 years and younger [89]. Bone fractures are the second

most common presentation for abused children, after soft-tissue injuries [90]. The most common

fracture site in abused children is the ribcage, accounting for over 70% of fractures [91]. In young

children, studies have shown rib fractures are a result of child abuse 80-100% of the time [92, 93]

and is 23.7 times more likely from abuse than from an accident [94]. Barsness et al.

[95] found

that out of 78 children, 62 were under the age of 3 but represented over 94% of the total number

of rib fractures. After removing accidental or disease-related fractures, they calculated a 100%

positive predictive value for rib fracture occurrence stemming from abuse. In short, rib fractures

are extremely important to detect as they are a sentinel injury for physical abuse in young children

that portend poor outcomes; a single rib fracture in children associated with a 2.5 times increase in

mortality rate [96].

Detection of rib fractures in pediatric radiographs is difficult. Expert, specialized radiologists

performing their first reads on x-rays can miss up to two-thirds of all present rib fractures [97].

The skeletal structure of small children is not the same as adults; rather than being rigid, bones

are much more elastic and flexible. This causes breaks to appear differently as opposed to adults

where a rib fracture is usually a much more defined break. Additionally, small children may not

have their chest imaged immediately after injury considering they are unable to express the cause

for discomfort. If rib fractures were indeed present, it would likely not coincide with the timing of

the imaging and they may present in various stages of healing. In addition to the base challenge of

reading these exams, there are decreasing numbers and availability of radiologists to interpret these

studies. While the raw number of trained radiologists has increased 39.2% from 1995 to 2011, the

35

ratio of radiologists to general physicians has decreased from 4.0% to 3.7% in the same time [44].

With the ever-increasing application of deep learning to the medical field, the implementation of

computer vision models as a first-read augmentation technique could improve both detection of rib

fractures and speed of interpretation of radiologists, especially for non-expert readers.

Moreover, a driving motivation for our work is to improve the sensitivity of fracture detection to

ensure that few to no fractures are missed during interpretation. The pediatric rib fracture detection

task warrants methods that are high sensitivity, even if they have reduced precision, because of the

high positive-predictive value of rib fractures as an indicator of child abuse and the downstream

risk of missing a fracture is considerable [95]. Automatic object detectors with high sensitivity

(identical to high recall) have the potential to augment radiologist performance by flagging multiple

suspicious regions in the image leading to higher sensitivity interpretations.

3.2 Dataset Curation and Pre-Processing

Our pediatric rib fracture dataset was collected in collaboration with Seattle Children’s Hospital

and approved by their institutional review board (STUDY00000853) with informed consent waived

due to the study design. Data were collected for this minimal risk retrospective analysis and all

methods and experiments in this study were carried out in accordance with relevant guidelines and

regulations of the institution (Seattle Children’s Hospital). In this convenience sample, we first

searched the medical record for patients with chest radiographs with confirmed rib fractures and

identified 624 cases in the period of Aug 15, 2006 through Aug 15, 2017. Gender and age statistics

for these fracture-present studies were extracted. An age- and gender-matched sample of chest

radiographs with no rib fractures was created. Chest radiographic images were extracted from the

medical record and fully anonymized. These images had (height by width) dimensions on average

of 2348 ± 685 by 2134 ± 500 pixels with pixel spacing of 0.128±0.023 mm. All images were

quantized from 12- or 16-bit integer to 8-bit integer precision and analyzed with their original pixel

spacing.

The images we have are chest radiographs in an anterior-posterior perspective (front-to-back).

The files were provided in Digital Imaging and Communications in Medicine (DICOM) format,the

36

standard file type for storing medical imaging data and related information. There is four-tiered

hierarchy that governs the organization of DICOM files. First is the patient receiving the imaging

procedure. The second is the study, or the specific imaging procedure being performed on the

patient (e.g. radiograph, PET, CT, MRI). Third is series, which could be different regions being

images for the given study, such as extremities, head, abdomen, etc. The last are the instances,

which are individual slices or images performed for that series. In a CT scan, for instance, each

instance corresponds to a single slice of a 3D CT volume; a single series could contain hundreds

of instances for a full-body, high resolution CT scan.

Figure 3.1 The Orthanc web viewer for providing annotations.

In order to perform object detection, we needed a way to get ground-truth annotations for all

images. Despite anonymizing each image by removing any and all protected health information, we

still did not want to distribute all DICOM files themselves. Another prohibitive factor is the size of

the dataset, which would require long data transfers to and from colleagues. Orthanc [98], an open-

source, locally-hosted web server, allows us to locally upload all DICOM images and colleagues

gain access to them through a browser. An example of the web viewer interface colleagues used

for annotating is shown in Figure 3.1.

37

In total, the dataset contains 1, 109 unique patients, of which 624 are fracture present and

485 are fracture absent. There are 385 (34.7%) female and 724 (65.3%) male patients. The

average age of patients is 281.74 ± 769.42 (range 0-7300; median 84; IQR 224) days. In order to

perform and evaluate object detection, we obtained hand-drawn ground-truth annotations for all

images. In short, the fracture present cases were each interpreted by one of seven board-certified

pediatric radiologists with 5-20 years of experience. During interpretation, the radiologists had

access to all of the available radiographic views of the chest (usually supine anterior-posterior (AP)

although occasionally other views were available). They were instructed to draw bounding boxes

as closely around each detected fracture as possible on the AP view only; the object detection

methods discussed below were only applied to the AP view image. From our total batch of 624

fracture-present images that had a single interpretation each, a subset of 338 were given a second

examination by a second board-certified pediatric radiologist to enable estimation of inter-reader

variability. This inter-reader variability estimate served as a performance baseline for evaluating

the proposed methods.

We created custom scripts using RestAPI to pull the bounding box locations drawn by the

radiologists to our workflow.

Individual JSON files were made for each patient that saved the

associated top left 𝑥 and 𝑦 and bottom right 𝑥 and 𝑦 values of each bounding box.

3.3 U-Net Segmentation and Cropping

An inherent issue with medical images is that there tends to be a large amount of unnecessary

regions surrounding the object or region of interest. Because all of this space provides no additional

information but requires larger image sizes and more space, we wanted get as closely around the

thoracic region as possible. To achieve this, we utilized a U-Net style model [99] we trained

and evaluated to segment chest radiographs into multiple anatomic regions. A custom MATLAB

tool was developed to enable annotation of target labels. Three trained users drew boundaries

around the lungs, spine, and bottom of the thorax. The tool smoothed all drawn lines and used

these user-drawn boundaries to divide the chest into seven non-overlapping regions: left and right

lung, left and right “subdiaphragm” (the thorax below the superior boundary of diaphragm), spine,

38

mediastinum, and background. In total, 469 radiographs were manually labeled for segmentation.

This is an extension of the work presented by Holste et al. [100].

We adapted a U-Net3+ architecture which includes full-scale skip connections [101] and report

Dice coefficients (discussed in Section 1.5.6) on the test set. After each inference from the U-Net3+

model, the proposed segmentation maps are refined with basic morphological operations to remove

small spurious disconnected regions from background and close all foreground regions. We also

utilized per-image pixel spacing to add 5mm of physical space to all four sides of the segmented

region in order to account for possible fracture bounding boxes along the edge.

Figure 3.2 Representative results from multi-class segmentation showing manually labeled images
(left) and U-Net results (right) with final automatically-generated cropped region represented by
red box.

39

Of the 469 labeled images, the U-Net3+ model was trained with 422 and tested on the remaining

47 images. Representative segmentations from the test set are presented in Figure 3.2. The mean

Dice coefficient for each of the seven regions exceeded 0.88 and visual assessment confirmed that

all final cropped images in the test set contained only the thoracic cavity.

3.4 Fracture Prevalence

Before we could take a deeper dive into the ground truth data and identify any potential

underlying patterns, we need to understand the previously established literature on rib fracture

patterns in pediatric patients. Males tend to make up a majority of cases of non-accidental

rib fractures in children, with percentages ranging anywhere from the mid 50’s up to 68% of

patients [102, 95, 103, 104, 105, 106]. Various studies have looked at different ranges of children,

such as 0-12 months [102], 0-24 months [103], and 0-36 months [107], while a few had extended

ranges up through 18 years of age [95, 108, 109, 110]. The common discovery among these studies

is that rib fractures tend to be most common in children under 2 years of age, with an especially

high rate in infants under the age of 12 months.

It has been repeatedly found that up to 80% of patients present with multiple rib fractures [103,

108, 110]. This is an especially salient statistic as multiple rib fractures has high specificity for

child abuse [111, 112]. There tends to be an association with which ribs fractures frequently occur.

More specifically, prior studies have shown the fifth through eighth ribs to be the most likely to

present fractures [105, 103, 102]. However, there does not appear to be a definitive location along

the ribs where fractures are most common. Aboughalia et al.

, Barber et al.

, and Barsness et

al. all noted higher rates of fracture in the posterior arc, whereas Kriss et al. saw higher lateral

fractures [113, 102, 95, 103]. Sidedness is another common pattern described, with studies showing

anywhere between 1.4-2 times higher likelihood of fractures on the patient’s left side as opposed

to their right, accounting for 59-67% of all fractures.

3.4.1 Stratifying Data into Age Groups

Starting with our 624 unique patients containing at least one rib fracture, 17 patients were above

the age of 2 years (ranging from 2-21 years old) and were removed, resulting in 607 patients under

40

the age of 2 that we use for the rest of this section. This data consists of 399 (65.73%) males and

208 (34.27%) females with an average age of 120 ± 105 days.

As previously mentioned, most of the established literature on the patterns of rib fractures in

children point out the higher tendency of presentation in children under 18 months of age. Multiple

studies investigated further by breaking up the cohort into different age groups. Quiroz et al. used

groups of 0-12 months and 12-48 months for younger patients [110]; this was similar to Worlock

et al. that used splits of 0-18 months and 19-60 months [106] in their study. The splits from Loder

and Feinberg stayed slightly younger at 0-12 months and 12-24 months, with older patients being

in much larger groups of 3-12 years and 13-20 years [109]. These helped influence the range of

ages we split our data into to further investigate fracture prevalence.

Table 3.1 Patient sex and age summary for the age-separated groups of the fracture-present data set
used for investigating fracture prevalence.

Age Group

Patients

Male

Female

Mean Age (days)

0-3 months
3-6 months
6-12 months
12-24 months
Age Unknown

116
112
58
16
305

70 (60.34%)
67 (59.82%)
50 (86.21%)
8 (50.00%)
204 (66.89%)

46 (39.66%)
45 (40.18%)
8 (13.79%)
8 (50.00%)
101 (33.11%)

40 ± 16
105 ± 22
222 ± 42
435 ± 102
–

Because of the size of our dataset, we were able to create relatively smaller age groups (0-3

months, 3-6 months, 6-12 months, and 12-24 months of age) than prior literature. It should be

noted that while the largest group is of patients with unknown ages, these patients were pulled

alongside the other examples in our dataset and likely correlate with the age ranges shown. For

groups with known ages, the largest were the 0-3 month and 3-6 month groups with 116 and 112

patients, respectively, accounting for over a third of our overall data. The calculated mean ages for

each group are estimates, as values in the DICOM headers with week, month, and year markers

were given approximate day equivalents (i.e., if a patient’s age was marked as “2M" the output

would be 56 days).

Much like previous studies, our overall dataset exhibits the same male majority with over 65%

of our patients being male. This gender skew is even higher in the 6-12 month age group, where

41

over 86% of patients were male. Only in the 12-24 month group with the fewest number of overall

patients were the gender ratios even. While there may not be a definitive reason provided for why

this slant exists, it is reassuring that our data aligns with previous investigations.

Table 3.2 shows the rib fracture statistics per age group. Looking first at patients with a known

age, the group with the highest number of annotated fractures were patients 3-6 months old, with

520 total fractures averaging 4.64 per image. Patients aged 6-12 months saw the highest count of

multiple fractures averaging 5.64 per patient and exhibited a staggering inter-quartile range of 10

whereas the next highest IQR was 5 (3-6 months and unknown age).

The number of left and right side fractures was calculated by finding the centroid of the thoracic

cavity and dividing the fractures based on the side from the centroid they belonged to. The overall

dataset has a slight left-side skew to fracture occurrence, 55.6% compared to 44.4%. However,

patients over the age of 6 months had a right-sidedness, especially in the cases of patients older

than 1 year of age (72.7%). Patients with unknown ages had the highest left-sidedness with over

60%.

Table 3.2 Fracture statistics for each age group.

Age Group

Fractures (per patient) Median IQR Left Fractures Right Fractures

All Patients
0-3 months
3-6 months
6-12 months
12-24 months
Age Unknown

2770 (4.56)
402 (3.47)
520 (4.64)
327 (5.64)
22 (1.38)
1499 (4.91)

3
3
3.5
3
1
4

4
4
5
10
1
5

1540 (55.6%)
227 (56.5%)
267 (51.3%)
137 (41.9%)
6 (27.3%)
903 (60.2%)

1230 (44.4%)
175 (43.5%)
253 (48.7%)
190 (58.1%)
16 (72.7%)
596 (39.8%)

3.4.2 Translating Fractures onto Reference Images

To get a visual sense of where fractures are more likely to present, we developed a method to

combine all ground truth annotations onto one image. For a given reference image, we utilized its

associated segmentation mask from the pre-processing cropping stage to locate the centroid of the

rib cage. We then chose an inner boundary of the segmentation to contain 85% of the width and

height pixels to create what we call the interquartile range (IQR) of the segmentation. This can be

42

(a)

(b)

Figure 3.3 (a) color-separated segmentation mask from the U-Net trained in Section 3.3. (b) stacked
segmentation mask with centroid (blue dot) and IQR box (teal).

illustrated by Figure 3.3(b) with the centroid on the mask shown as a blue dot and the inner 85%

IQR box of the mask in teal. This IQR range and the previously computed centroid are used to

create offset ratios for each of the bounding boxes in each image. These ratios were computed as

follows:

𝑅𝑥 =

𝑅𝑦 =

,

𝐵𝑥 − 𝐶𝑥
𝐼𝑄𝑅𝑥
𝐵𝑦 − 𝐶𝑦
𝐼𝑄𝑅𝑦

(3.1)

(3.2)

where 𝑅𝑥 and 𝑅𝑦 are 𝑥 and 𝑦 ratios from the centroid, respectively, 𝐵𝑥 and 𝐵𝑦 are the center pixels

of the ground truth bounding box, 𝐶𝑥 and 𝐶𝑦 are the centroid of the image’s segmentation mask,

and 𝐼𝑄𝑅𝑥 and 𝐼𝑄𝑅𝑦 are the interquartile ranges of the 𝑥 and 𝑦 pixels of the segmentation mask,

respectively. Once these ratios are computed for each bounding box on their respective images,

we then choose one of the images to serve as a reference image to place all of the bounding boxes

onto. Once that reference image is chosen and the centroid and IQR for it are calculated, the new

locations of the bounding boxes on the reference image can be calculated by solving for a new 𝐵𝑥

and 𝐵𝑦, i.e.,

𝐵𝑛𝑒𝑤
𝑥

= 𝑅𝑥 · 𝐼𝑄𝑅ref

𝑥 ,
𝑥 + 𝐶ref

𝐵𝑛𝑒𝑤
𝑦

= 𝑅𝑦 · 𝐼𝑄𝑅ref

𝑦 ,
𝑦 + 𝐶ref

43

(3.3)

(3.4)

Figure 3.4 shows every hand-annotated rib fracture in our dataset superimposed onto a single

reference image. These images are anterior-posterior (AP) views, thus the patient’s left side is on

the right side of the image. The heatmap colors were generated by creating a circle of a fixed radius

around the center of every fracture location after being mapped to the reference image and stacking

them onto a new array. A greater number of overlapping fractures is indicated by the brighter colors

on the image as shown in the accompanying colorbar.

Figure 3.4 Fracture prevalence map of all fractures in the dataset combined onto a single reference
image.

When looking at all fractures mapped onto a single image, the location of fractures are spread

fairly evenly across the ribs. There does not appear to be a specific region over another or specific

ribs for fracture prevalence. The lateral regions of the ribs do appear to have slightly higher

concentrations of fractures opposed to the anterior or posterior parts of the ribs, i.e., the regions

closer to the sternum (anterior) or spine (posterior). Due to limited knowledge from our annotation

process, we are not able to specify whether fractures are on the anterior or posterior of the rib;

going forward, we will be labeling fractures in these areas as either the inner or horizontal parts of

the rib.

After splitting the fractures into their respective age groups and mapping onto a reference image

44

(a)

(b)

(c)

(d)

(e)

Figure 3.5 Fracture prevalence maps for all age groups: (a) 0-3 months, (b) 3-6 months, (c) 6-12
months, (d) 12-24 months, (e) age unknown.

within each set, there are more noticeable regions of prevalence. In Figure 3.5(a) with patients aged

0-3 months, there seems to be a left-sidedness to the fractures with a heavy emphasis on the lateral

regions of the ribs. The 3-6 month group also shows a higher prevalence for the lateral region, but

instead on patient right side. An interesting separation of fracture clusters appears in the right side

of the 6-12 month group with highest overlap along the lateral but also high overlap on the inner

45

ribs. It is in the 3-6 month and 6-12 month groups (Figures 3.5(b) and (c)) that we see a pattern of

specific ribs that matches the established literature, where the highest concentration of overlapping

ribs appear on the fifth through ninth ribs.

Despite there being a comparative lack of patients and therefore fractures in the 12-24 month

group, there is a definite right-sidedness in these patients, going against the studied behavior of

left-sidedness. Overall, the age unknown group in Figure 3.4(e) presents very similarly to Figure 3.5

with all fractures, likely because approximately half of all fractures in our dataset exist in this group.

Though, there is still a slight bias toward the patient’s left side.

3.4.3 Three-by-Three Split Grid Prevalence Maps

While the prevalence maps provide an overall presentation of the patterns of fractures across

the stratified patient groups, we wanted to investigate further to see if we could more objectively

correlate findings with the previous studies, especially in terms of the most common ribs fractured

and the lateral versus inner rib (again, we note that the literature is able to differentiate posterior

versus anterior for the inner parts of the rib while we have to combine both). To achieve this,

we integrated an image-splitting mechanism to break each of the fracture prevalence maps into a

three-by-three grid, where we could then calculate the presence of fractures individually within

each.

Figure 3.6 visualizes two of the ways the prevalence images could be split.

In (a), the top

horizontal split was found by calculating the centroid of the segmentations of the left and right
)(cid:3). For the lower horizontal
lungs and setting it as the higher of the two, i.e., (cid:2) min(𝐶right lung

, 𝐶left lung
𝑦

𝑦

split, the highest pixel location based on the left and right subdiaphragms is used. The left vertical

index is based on the right lung’s centroid value, 𝐶right lung

𝑥

, and the right vertical index is from

the left lung’s centroid, 𝐶left lung

𝑥

. In the bottom image in Figure 3.6, it does appear to separate the

lateral regions of the ribs from the central region of the rib cage. However, this creates a very wide

central region with comparatively small left and right thirds, leading to quite large center top and

center bottom segments and much smaller middle left and middle right segments.

Method (b) is much simpler. From the combined segmentation mask of the thoracic cavity,

46

(a)

(b)

Figure 3.6 Splitting techniques for creating a 3x3 grid for fracture prevalence images. (a) uses the
left and right lung centroids and top of the subdiaphragms for indices, (b) takes the height and
width indices of the thoracic segmentation mask and splits them into three even sizes.

the height and width indices are extracted and the left vertical split index is the value one-third

through the index array, and the right vertical is the value two-thirds through the array. This process

is replicated for the two horizontal splitting indices. We proceeded with this method both for its

general simplicity and its consistency across images.

Looking across the five age group’s respective three-by-three grids, we can more easily extract

certain patterns in the prevalence of fractures. In every case, the highest horizontal third is the

middle third, ranging from 64.78% at the lowest across all age unknown cases to 77.27% in the

12-24 month group. This middle region accounts for the ribs previously mentioned to have the

highest likelihood of fracture presentation: ribs 5-8.

Multiple studies mentioned the left-sidedness in fracture presentation, and this is shown in most

47

Figure 3.7 Three-by-three grid of fracture prevalence for the 0-3 month age group.

Figure 3.8 Three-by-three grid of fracture prevalence for the 3-6 month age group.

of the groups with 43.03% of fractures in the 0-3 month group, 39.76% in the 6-12 month group,

and 50.63% in the age unknown group. The largest outlier is the 12-24 month group with 50%

fracture in the left vertical third, with the 3-6 month group also showing 4.62% more fractures in

the left third than the right vertical third.

3.5 Summary

This chapter outlines the custom data collection, labeling, and summarization efforts to support

the proposed rib fracture detection methods presented in the following chapters. The major

work presented here includes the development of a system to collect expert labels from pediatric

radiologists. This chapter also presents a method to map labeled fractures to a shared coordinate

system to enable analysis of fracture prevalence patterns. The results found when investigating

48

Figure 3.9 Three-by-three grid of fracture prevalence for the 6-12 month age group.

Figure 3.10 Three-by-three grid of fracture prevalence for the 12-24 month age group.

fracture prevalence corroborates previous literature on the study of patterns of fractures in physically

abused children. Furthermore, insights gained from these prevalence patterns, as well as a direct

integration of the segmentation-guided shared coordinate system, are used later in Chapter 5. Work

demonstrated in this chapter contributed to the following academic product:

• Gongol, Burkow, Junewick, Simms, Alessio, "Pediatric rib fracture prevalence patterns on

chest radiographs." In preparation for Pediatric Radiology, 2024.

49

Figure 3.11 Three-by-three grid of fracture prevalence for the age unknown group.

50

CHAPTER 4

IMPROVING ONE-STAGE DETECTORS

4.1 Training and Evaluation Setup

To perform detection on our radiographs, we chose to evaluate two state-of-the-art single-stage

detection architectures, RetinaNet and YOLOv5. Both models used were adapted from public

GitHub repositories, from user yhenon for RetinaNet [114] and from Ultralytics for YOLOv5 [33].

These are the hyperparameters and training setups we used for all runs.

For RetinaNet, we used a ResNet-50 backbone with pre-trained weights on ImageNet-1K. We

trained all RetinaNet models on our dataset using a NVIDIA V100S for a maximum of 300 epochs

at a batch size of 8 using the Adam optimizer. The learning rate was set initially to 0.0001 and was

decreased by one-tenth if validation performance did not improve within 4 epochs. The dataset

was augmented with a 50% chance of applying any of the following transformations to each image:

shift/scale/rotate, horizontal flip, random brightness, random contrast, or Gaussian blur. Training

would cease via an early stopping clause if performance on the validation set had not improved in

30 epochs.

We used the large YOLOv5-L6 model pre-trained on the COCO dataset prior to training on our

dataset. Similarly, all YOLOv5 models were trained on a NVIDIA V100S for a max of 300 epochs

and batch size of 8. A stochastic gradient descent (SGD) optimizer was used with momentum 0.937

and weight decay of 0.0005. Learning rate was initially 0.01 and decreased linearly each epoch.

There was also an early stopping feature that stopped training if no improvement in validation was

observed after 100 epochs.

Twenty percent of the total dataset were withheld as the fixed test set (N=222 images), with half

randomly drawn from fracture-present images and the other half randomly drawn from fracture-

absent images. While we know this split is heavily skewed toward the presence of fracture containing

images compared to real-world data, we want to ensure the architectures get enough information to

aid in discerning fracture present from absent. The remaining 80% of data was then used to create

the training (70%; N=776) + validation (10%; N=111) sets. All evaluations were performed after

51

training with a 10-fold cross-validation strategy in order to examine the range in model performance;

the ten separate training and validation sets were randomly drawn with replacement between each

set. It is worth noting that during the sampling process, unique patients were sampled rather than

bounding boxes to ensure that no images would overlap between training, validation, and test sets.

Figure 4.1 Summary of performance metrics versus IOU setting, which dictates whether a proposed
region matches an expert labeled region. The top row presents results from single-model RetinaNet
and the bottom row presents results from a suite of all different models. For all models, performance
is very similar for IOU thresholds ranging from 0.1 to 0.4. We selected an IOU setting of 0.3 as a
compromise that is slightly higher in this range requiring more overlap for concordance. It should
be noted that unlike many object detection tasks in the computer vision literature, rib fractures
generally do not have clear margins leading to clear, unambiguously defined bounds. For this
reason, IOU thresholds below the conventional 0.5 setting are warranted.

Object detection performance was evaluated in terms of recall, precision, and F2 score on the

fixed test set (as discussed in Section 1.5.3, we use F2 score to weight recall twice as much as

precision). An intersection-over-union (IOU) threshold of 0.30 was applied across all model and

ensemble evaluations to identify concordance between model predictions and labeled annotations.

52

Figure 4.1 provides rationale for the selection of this IOU threshold. We used F2 score rather

than F1 to give sensitivity (i.e., recall) twice the importance of precision, considering this task

warrants high sensitivity performance as discussed above. Max F2 scores are also provided for

each combination by finding the highest F2 score achieved across all potential decision thresholds.

Furthermore, to summarize average performance across a range of settings, we calculated mean

average precision (mAP) by computing the areas under multiple precision-recall curves generated at

IOU thresholds ranging from 0.25 to 0.75 in 0.05 increments. Note that mAP was not be calculated

for the avalanche decision schemes since precision-recall curves are not analogous between fixed

and dynamic decision thresholds.

For single-model calculations, we evaluated performance metrics for all ten trained models

(trained on each of the ten folds) and report the average ± standard deviation across these models.

For two-, three-, and six-model ensembles, twenty ensemble combinations were arbitrarily selected

from the multitude of different ways to combine 2, 3, or 6 models from the 10-fold data; In other

words, twenty model combinations were taken from the 10 choose 2 (45), 10 choose 3 (120), and

10 choose 6 (210) possible combinations, respectively, that were then evaluated and averaged. This

random selection of twenty models was also applied to the three-model ensembles that incorporate

multiple image processing techniques, which yield over 1, 000 total possible combinations, as well

as the hybrid-model, hybrid-input ensembles where there are as many as 1, 000, 000 total possible

combinations.

Figure 4.2 Overall workflow of our one-stage detector work.

53

4.2

Inter-reader Variability

In order to determine a baseline level of expert human performance, we explored inter-reader

variability among the expert radiologists. Of the total 625 fracture-present images, 338 were read

by two board-certified radiologists. We calculated inter-reader performance for two different data

sets: 1) images from the fixed test set (which contains 222 images, although only 111 of these

are fracture-present and therefore interpreted by radiologists), and 2) images from the set of 338

fracture-present images that have been read by two radiologists. For clarity, this inter-reader study

was performed on fracture-present only images, while the deep learning training and testing was

performed on present and absent images.

On the test set, the first reader marked 536 total rib fractures for an average of 4.83 ± 3.30 (range

1-14; median 4; IQR 5) fractures per image. The second reader marked 486 fractures overall,

averaging 4.38 ± 3.74 (range 1-27; median 4; IQR 4) fractures per image. Setting the first reader’s

annotations as “ground truth" between the two, fractures were scored as true positive (second reader

box matches a reader 1 box), false positive (reader 2 box has no corresponding reader 1 box), or

false negative (reader 1 box has no matching reader 2 box), dictated by an intersection-over-union

(IOU) threshold of 0.30.

Three-hundred eighty-five fractures were counted as true positive matches, with 101 false

positives and 151 false negatives. This led to the second reader scoring a precision of 0.792, recall

of 0.718, and F2 score of 0.732. Essentially, the second reader “detected" just under 72% of the

rib fractures discovered by the first reader. With these scores, the second reader’s boxes overlapped

reader 1 on average by 84% with a mean intersection-over-union of 0.63 across the 111 images. For

clarity, overlapping represents the percentage of reader 1’s annotated box pixels that are covered by

the pixels from reader 2’s matching box.

Inter-reader performance metrics remained very similar when looking at all 338 multi-read

images. Reader 1 marked 1, 719 fractures, averaging 5.09 ± 4.30 (range 1-22; median 4; IQR 5)

fractures per image and reader 2 marked 1, 567 fractures for an average of 4.64 ± 4.08 (range 1-27;

median 4; IQR 4) fractures per image. Percent overlap and IOU remained essentially identical at

54

84% and 0.62. Precision, recall, and F2 score all decreased slightly to 0.777, 0.709, and 0.721.

If we were to assume the first reader caught all fractures during their reads, the second reader was

able to find 71% of the fractures in their reads. This again leaves over one-quarter of all fractures

undetected between expert radiologists.

4.3 Base Network Results

Representative test set images with ground truth annotations and model predictions are presented

in Figure 4.3.

Figure 4.3 Test set images with ground truth (teal, red) and model predictions (green, yellow),
with true positives (green), false positives (yellow), and false negatives (red). Predictions from
the 6x-YOLOv5 ensemble trained on histogram equalized input images with a 𝛾 = 0.20 avalanche
scheme, achieving 0.536 ± 0.044 precision, 0.795 ± 0.022 recall, and 0.723 ± 0.010 F2 score.

Base network performance was evaluated for single-model performance of both RetinaNet or

YOLOv5 using histogram equalization image pre-processing (method (a) from Fig. 4.4). These

results are presented in the Standard rows in Table 4.2 (and below with the nomenclature 1x-Ra and

1x-Ya). RetinaNet achieved 0.892 ± 0.015 precision, 0.430 ± 0.014 recall, and 0.480 ± 0.014 F2

score, whereas YOLOv5 scored 0.897 ± 0.032 precision, 0.434 ± 0.040 recall, and 0.484 ± 0.037

F2 score. When compared to expert-level human performance, both networks had marked higher

values in precision but lower recall and therefore F2 scores. If either network were to predict a

region for a potential rib fracture, they were essentially 90% likely to be correct in that prediction.

However, both networks detected less than half of all rib fractures in the test set.

Changing from the histogram equalized inputs to the two alternative image processing methods

55

((b) binary and (c) blended) slightly alters performance for both networks in different ways.

RetinaNet with binary inputs (1x-Rb) incurs a small decrease in precision to 0.852 ± 0.015 (−4.5%)

and large decreases in recall and F2 score at 0.344 ± 0.027 (−18.75%) and 0.390 ± 0.028 (-20%),

respectively. Blended inputs (1x-Rc) also cause a 1.01-1.25% drop in all three measures compared

with method (a). YOLOv5 sees a similar drop in performance using the binary inputs, with 1x-Yb

achieving 0.872 ± 0.060 (-2.79%) precision, 0.320 ± 0.049 (-26.27%) recall, and 0.365 ± 0.050

(−24.59) F2 score. However, the blended inputs of method (c) for YOLOv5 led to the best single-

model recall and F2 performance of all variants: 1x-Yc scored 0.880 ± 0.024 (-1.9%) precision,

0.464 ± 0.043 (+6.91%) recall, and 0.512 ± 0.041 (+5.79%) F2 score. As presented later, the

YOLOv5 architectures using the varied inputs (ensembles of models using input processing a, b,

and c) had improvement over the other processing methods.

4.4

Image Processing

We wanted to explore how varying the type of processing performed on the input images

changed the performance of the trained models. All types of processing were applied following

the segmentation and cropping via the U-Net discussed previously. For the approaches discussed

above, we were processing our dataset by taking a single-channel array of histogram equalized

images and stacking identical arrays in order to create an RGB image to input into the network. We

thought that these additional channels could offer new ways to incorporate different information that

could potentially improve model performance. Figure 4.4 provides a visualization of the different

types of processing and how they are combined to create input images for training and evaluation.

Method a applies histogram equalization to the single-channel, grayscale image array after which

the array is replicated three times to provide the three channel input to the object detector models.

The two additional variations go an additional step by utilizing the pixel-level segmentation

information from the U-Net from the previous pre-processing stage. After cropping the image

around the segmented thoracic region, all background pixels are masked out to generate a masked

foreground image containing only anatomical structures. In method b, adaptive thresholding is

applied using a Gaussian weighting method to determine threshold values in a given neighborhood

56

Figure 4.4 The three types of image processing used in the varied-input-processing ensemble
models. a) normal-cropped histogram equalized 3x-stacked images; b) segmentation-masked
adaptive thresholding 3x-stacked images; c) segmentation-masked raw, histogram equalized, and
bilateral low-pass filtered blend.

of pixels. This transforms the image from grayscale to binary, with 1 (white) representing pixels

above the threshold and 0 (black) below, providing a rough segmentation of just the ribs. This

binary mask is then stacked three times as the final image.

In method c, the masked image goes through two separate filtering operations, inspired by

Heidari et al. [115]: histogram equalization (like method a) for increased contrast and bilateral

low-pass filtering for edge-preserving noise reduction. The low-pass filter uses mid-line 𝜎-space

and range values of 150, with a 9 pixel neighborhood diameter. The original masked image,

histogram equalized masked image, and bilateral filtered masked image are then stacked as the

three channels for the detector input, which we label as "blended" input.

57

4.5 Ensembling

We also investigated the impact of model ensembles on rib fracture detection which is commonly

one of the first ways to approach improving the performance of machine learning networks. Model

ensembles with deep neural networks have shown better generalizability as well as improved

performance on tasks with smaller datasets [116, 117, 118]. To survey this, we tested the following

types of ensembles:

(1) Same-Model Ensemble: The simplest form of ensembling is the combination the proposal

results of multiple identical models each with different training runs with either differently initialized

seeds similar to the deep ensembles analyzed by Lakshminarayanan et al. [119], or with different

combinations of training data.

(2) Hybrid-Model Ensemble: This is a slight variation to the same-model ensemble, combining

an equal number of training runs of both deep learning architectures we tested; for example,

combining one run of RetinaNet with one run of YOLOv5.

(3) Varied-Input-Processing Ensemble: The final type of ensembling models incorporated

all three of the different image pre-processing operations as summarized in Figure 4.4, requiring at

minimum three trained models trained on each of the input processing variations.

Prior to final evaluation, proposed bounding boxes from all members of each ensemble were

aggregated together and overlapped boxes then removed via non-maximum suppression (NMS)

with an intersection-over-union (IOU) threshold of 0.55. This threshold was set based on initial

validation experiments, but not fully optimized across all model variants. There is an interesting

aspect differentiating predictions from RetinaNet and YOLOv5: RetinaNet offers many more box

predictions than YOLOv5. We were interested to see whether that aspect led to better performance

for one architecture over the other when incorporating ensembling.

Figure 4.5 provides a visual of a handful of detector F2 score performance along with their

respective coefficient of variation. As the ensembles of models become more complex in terms

of both number of ensemble members and sources of model diversity, F2 scores increase while

the variation in F2 scores decreases. The hybrid-model, varied-input ensemble achieves both the

58

Figure 4.5 F2 score (black dot) and accompanying coefficient of variation (blue x) for each model
and ensemble. As ensembles become more intricate and diverse, F2 scores improve while overall
variation in performance decreases.

highest F2 score and the lowest coefficient of variation for F2. In other words, our best-performing

ensemble exhibits both the highest performance and smallest uncertainty, perhaps indicating greater

generalization ability that could carry over its performance to new images.

4.6 Avalanche Decision Scheme

We developed a novel inferencing strategy we coined the avalanche decision scheme [120] that

makes the decision threshold for fracture-present model predictions a function of the number of

already detected bounding box proposals. In other words, the decision threshold is not fixed, but

rather changes depending on the number of high probability proposed regions. This approach is

motivated by the reality that if a subject has one fracture they are very likely to have more than one

fracture, and a subject with two fractures is very likely to have three, and so on.

In order to determine how to adjust the decision threshold as a function of number of cleared

proposals, we used the training dataset to calculate the probabilities of there being more proposals

59

given that at least 𝑋 fractures are currently present in the images, i.e., 𝑃(𝑋 > 1|𝑋 ≥ 1), . . . , 𝑃(𝑋 >

4|𝑋 ≥ 4) as presented in the third column of Table 4.1. Then, for a given starting model threshold

𝛼0, if the model predicted 1 bounding box with a probability greater than 𝛼0, we scale down the

threshold to 𝛼1 = 𝛼0 · (1 − 𝑃(𝑋 > 1|𝑋 ≥ 1)) and the number of bounding box predictions that

clear this threshold will be re-evaluated. If now three proposals have probabilities greater than 𝛼1,

we scale the threshold down to 𝛼3 = 𝛼1 · (1 − 𝑃(𝑋 > 2|𝑋 ≥ 2) · (1 − 𝑃(𝑋 > 3|𝑋 ≥ 3). These

likelihoods are presented in Table 4.1.

Table 4.1 Posterior likelihood for the training dataset, with each row summarizing probabilities of
having more fractures, 𝑋 (a random variable), when at least 𝑥=1, 2, 3, and 4 fractures are present
(a deterministic variable).

X ≥ x Fractures

𝑁

P(X>x|X≥ x)

X ≥ 1
X ≥ 2
X ≥ 3
X ≥ 4

444
327
266
204

73.6%
81.3%
76.7%
77.0%

We explored an alternative calculation where if one proposal was found in an image at the given

starting threshold 𝛼0, the next threshold to re-evaluate proposals would be 𝛼1 = 𝛼0·𝑃(𝑋 > 1|𝑋 ≥ 1).

This is a more conservative reduction in confidence thresholds, since with the prior calculation

each successive threshold 𝛼𝑖+1 would be approximately 16 − 24% of 𝛼𝑖. With the new calculation,

each 𝛼𝑖+1 would be 76 − 84% of 𝛼𝑖. To further investigate variants of this more moderate version,

we tested cases where the decrease between each successive 𝛼𝑖 threshold was a constant rate 𝛾,

ranging from 10 − 30%. For example, if three rib fractures were proposed for a given starting

threshold 𝛼0, the new threshold will reduce to 𝛼3 = 𝛼2 · 𝛾 = 𝛼0 · 𝛾3.

Figure 4.6 provides a visual of the standard, posterior, and conservative schemes as well as

three representative schemes with constant 𝛾 reduction. This illustrates the posterior avalanche

scheme has the tendency to severely lower the decision threshold as soon as one, or especially two,

rib fractures are proposed by the architectures that exceed the starting threshold 𝛼0.

In brief, we explored different relationships for decision thresholds vs apparent number of

60

accepted fractures as presented in Figure 4.6. For the standard approach, the decision threshold is

constant no matter how many proposed regions clear this threshold level, and typically the threshold

is set of 0.5 (all proposals with greater than 50% are accepted). For the avalanche approaches, the

decision threshold decreases as more proposed regions are accepted.

Figure 4.6 Plot of relative decision threshold for bounding box acceptance as a function of the
number of accepted proposals.

Figure 4.7 shows F2 score performance of a single RetinaNet model as the initial decision

threshold 𝛼0 changes, using the “standard" approach of a constant threshold to get all proposals,

the posterior distribution avalanche scheme, the conservative scheme, and constant 𝛾 reduction

schemes with 𝛾 ∈ [0.10, 0.15, 0.20, 0.25, 0.30]. This figure influenced the decision of the 𝛾 values

used in the avalanche scheme tests in Table 4.2.

Based on the starting threshold 𝛼0 that obtained the highest F2 score for 𝛾 = 0.15 and 𝛾 = 0.20

in Figure 4.7, we chose 𝛼0 = 0.55 and 𝛼0 = 0.75 for the constant reduction avalanche schemes,

respectively, to test alongside the posterior and conservative decision schemes.

We improved on our previous work by implementing a non-maximum suppression (NMS) step,

a common approach to filter out regions proposals with a large overlap of each other. This NMS

step is applied after the avalanche decision schemes have been applied on the given trained model

predictions. This is particularly effective on networks like RetinaNet where the number of bounding

61

Figure 4.7 F2 scores for various avalanche decision schemes (as described in section 4.6) applied
on a single RetinaNet model across all possible starting decision thresholds 𝛼0. The "standard"
approach uses a constant threshold set at 𝛼0 for accepting proposed boxes. Remaining lines
represent different avalanche schemes where the threshold starts at 𝛼0 and decreases as a function
of accepted boxes by a) fixed percentages, b) conservative reductions based on posterior likelihood
values in table 4.1, b) and posterior reductions based on (1-posterior likelihood values).

box proposals per image is significantly higher than the more reserved models such as YOLOv5.

Table 4.2 compares how applying the avalanche schemes affect performance for single RetinaNet

and YOLOv5 models trained on the histogram equalized inputs. The posterior scheme with

RetinaNet drops precision to 0.141 ± 0.015, a significant 84.19% drop, whereas recall saw a large

102.8% increase to 0.872 ± 0.013. This, however, leads to an F2 score of 0.427 ± 0.026 which

is 11% lower than standard. Instead, the best performing avalanche scheme for RetinaNet is the

conservative scheme, where the 40.6% reduction in precision and 69.8% increase in recall sees the

F2 score increase to 0.679 ± 0.010 (+41.5%).

Interestingly, YOLOv5 experiences a relatively minor drop in precision but marked improvement

in recall, and therefore F2 score, with the avalanche decision schemes. Precision drops by 7-15%

across the schemes while recall increases between 35-49%. This means the lowest performing

YOLOv5 model with the 𝛾 = 0.15 scheme scoring 0.615 ± 0.050 is still 27.1% better than standard;

the best performance comes from the posterior scheme with an F2 score of 0.652 ± 0.051 (+34.7%).

62

Table 4.2 RetinaNet and YOLOv5 single-model results comparing performance with the standard
fixed decision threshold and applying the various avalanche schemes. 𝛾 represents the constant rate
reduction between each decision threshold in the avalanche scheme.

Model

Scheme

Precision

Recall

F2

RetinaNet

YOLOv5

Standard
Posterior
Conservative
𝛾 = 0.15
𝛾 = 0.20

Standard
Posterior
Conservative
𝛾 = 0.15
𝛾 = 0.20

0.892 ± 0.015
0.141 ± 0.015
0.530 ± 0.023
0.304 ± 0.028
0.256 ± 0.023

0.897 ± 0.032
0.759 ± 0.164
0.831 ± 0.101
0.814 ± 0.120
0.816 ± 0.114

0.430 ± 0.014
0.872 ± 0.013
0.730 ± 0.015
0.766 ± 0.015
0.770 ± 0.024

0.434 ± 0.040
0.647 ± 0.101
0.590 ± 0.075
0.587 ± 0.077
0.593 ± 0.087

0.480 ± 0.014
0.427 ± 0.026
0.679 ± 0.010
0.586 ± 0.019
0.548 ± 0.015

0.484 ± 0.037
0.652 ± 0.051
0.622 ± 0.053
0.615 ± 0.050
0.621 ± 0.060

Considering the large number of different avalanche schemes, the remaining results will only present

schemes corresponding to best F2 score performance for each model and/or ensemble.

4.7 Combining Avalanching, Input Processing, and Ensembling

This section outlines some of the results of the combination of avalanche decision schemes,

varied input processing, and ensembling. The nomenclature for these combined models is presented

in figure 4.8.

Figure 4.8 Explanation of model nomenclature for ensembling combined with different input
processing techniques. The selection of input processing [a,b,c] is described in Figure 4.4 and
varied input processing [*] uses one from each input processing type. For example, results
presented for method 3x-R* would be for an ensemble of three RetinaNet models (trained on
different folds) using varied input processing.

When applying the novel avalanche decision scheme to the base networks as presented above

in Table 4.2, we see anticipated results that recall increases while precision decreases. The level

of these changes varies depending on the avalanche scheme. Posterior with RetinaNet provides

63

the highest recall but very low precision. Conservative decision threshold reduction with YOLOv5

provides a reasonable compromise of recall and precision.

Performance of combining methods are presented in Table 4.3 and Table 4.4, with the former

including evaluations only with the standard decision scheme (i.e., fixed acceptance threshold

of 0.50 for all model predictions) and the latter including the best avalanche scheme for each

model and/or ensemble. Table 4.3 also includes inter-reader variability performance at the top for

comparison between expert human readers and deep learning models. For full results of all models

and ensembles, see Supplementary Tables S1 and S2.

As the avalanche schemes are applied to the ensembled prediction results, we see the anticipated

trend: precision drops as recall improves. With the ensembles that incorporate the various image

processing techniques, the avalanche schemes provide the best overall results. The precision and

recall of the 3x-RetinaNet and 3x-YOLOv5 approach the values found in the inter-reader variability

study we discussed earlier.

Table 4.3 Performance results of selected models with standard decision threshold. I.R.V. represents
inter-reader variability performance between two radiologists. Bolded values represent the top two
scores for each metric. Superscripts a, b, and c represent the type of input processing to train the
models as shown in Fig. 4.4. Ensembles with * have hybrid inputs, i.e., each ensemble member
was trained on a different input processing method.

Models

Precision

I.R.V. (Test set)
I.R.V.

0.792
0.777

Recall

0.718
0.709

F2

0.732
0.721

Max F2

mAP

-
-

-
-

1x-Ra
1x-Yc
3x-R*
3x-Yc
6x-Rc
6x-Yc
3x-R*+3x-Y*

0.892 ± 0.015
0.880 ± 0.024
0.812 ± 0.011
0.814 ± 0.017
0.762 ± 0.008
0.756 ± 0.010
0.752 ± 0.019

0.430 ± 0.014
0.464 ± 0.043
0.523 ± 0.014
0.599 ± 0.021
0.571 ± 0.009
0.653 ± 0.008
0.625 ± 0.021

0.480 ± 0.014
0.512 ± 0.041
0.563 ± 0.013
0.633 ± 0.018
0.601 ± 0.008
0.671 ± 0.007
0.647 ± 0.017

0.630 ± 0.011
0.644 ± 0.046
0.649 ± 0.006
0.694 ± 0.008
0.666 ± 0.004
0.699 ± 0.006
0.686 ± 0.008

0.480 ± 0.008
0.555 ± 0.016
0.493 ± 0.008
0.559 ± 0.006
0.499 ± 0.004
0.555 ± 0.004
0.522 ± 0.004

Using the standard decision scheme, the four three-model ensembles, 3x-Rc, 3x-R*, 3x-Yc, and

3x-Y*, performed very similarly with regards to precision, scoring within 0.6% of one another.

64

Compared to their single-model versions, ensemble methods resulted in improved recall and thus

F2 scores. While the best single-model recall was 0.464 ± 0.043, the worst performing three-model

ensemble (3x-R*) achieved 0.523 ± 0.014 (+12.7%) and the best performing ensemble (3x-Yc)

reached 0.599 ± 0.021 (+29.1%). This led to an F2 score of 0.633 ± 0.018 with the 3x-Yc model.

The mean average precision (mAP) trended upward as ensemble size increased and had similar

trends as the F2 score, demonstrating that these models have similar rankings in performance across

a range of inference hyper-parameters.

After applying avalanche decision schemes, we see the expected decrease in precision and

increase in recall. The 3x-Yc ensemble with the 𝛾 = 0.20 decision scheme had the superior

performance among three-model ensembles achieving 0.725 ± 0.012 F2 score, which is within 1%

of expert human-level performance. One interesting thing to note is that the three-model ensembles

with standard decision schemes have lower F2 scores than single-models with avalanche schemes at

the trade-off of maintaining much higher precision values that exceed the inter-reader performance.

Table 4.4 Performance of models from Table 4.3 with their best corresponding avalanche decision
scheme result with respect to F2 score. Bolded values represent the top two scores for each metric.
Superscripts 𝑎, 𝑏, and 𝑐 represent the type of input processing to train the models as shown in
Fig. 4.4. Ensembles with * have hybrid inputs, i.e., each ensemble member was trained on a
different input processing method.

Models

Avalanche

Precision

Recall

F2

Max F2

1x-Ra
1x-Yc
3x-R*
3x-Yc
6x-Rc
6x-Ya
3x-R*+3x-Y* Conservative

Conservative
Posterior
Conservative
𝛾 = 0.20
Conservative
𝛾 = 0.20

0.530 ± 0.023
0.645 ± 0.130
0.340 ± 0.013
0.573 ± 0.058
0.311 ± 0.005
0.536 ± 0.044
0.314 ± 0.015

0.730 ± 0.015
0.724 ± 0.085
0.809 ± 0.009
0.780 ± 0.030
0.816 ± 0.005
0.795 ± 0.022
0.841 ± 0.014

0.679 ± 0.010
0.695 ± 0.041
0.634 ± 0.011
0.725 ± 0.012
0.616 ± 0.005
0.723 ± 0.010
0.630 ± 0.011

0.897 ± 0.054
0.812 ± 0.056
0.898 ± 0.026
0.812 ± 0.036
0.912 ± 0.025
0.797 ± 0.016
0.833 ± 0.052

Six-model ensembles with standard decision schemes have lower precision scores than prior

model and ensemble sizes, though still being on-par with expert-level performance. Once again,

YOLOv5 with the blended, method (c) input images achieved the highest F2 score at 0.671±0.007, a

6% improvement over its three-model variant. Incorporating avalanche schemes with 3x-Yc utilizing

65

the 𝛾 avalanche scheme with a fixed rate of 0.20 provided the highest F2 score of 0.725±0.012. The

most complex 3x-R*+3x-Y* model was unable to achieve the highest performance in any metric,

and in fact even had a slightly lower mAP score than the 1x-Y𝐶 models, though its recall and F2

performance were still second-highest among the standard decision threshold models/ensembles.

4.7.1 Max F2 Score between Standard and Avalanche Schemes

In order to get a better understanding of how well the avalanche schemes perform compared

to the standard inferencing technique, we plotted the F2 scores of a handful of test cases across

all possible decision threshold values in Figure 4.9. For the avalanche methods, the x-axis of

these plots represents the starting threshold (𝑎0) that is then potentially reduced if proposed regions

have probabilities greater than 𝑎0. We chose one model from the single-model group: the single

Figure 4.9 F2 scores across all possible confidence thresholds for 1x-Ra models, 3x-Yc ensembles,
and the hybrid, 3x-R* + 3x-Y* ensemble. Each dashed line represents performance from one
model or combination of ensembles. The best performing avalanche scheme (’Conservative’) is
compared to the ’standard’ decision scheme. Generally, the avalanche scheme performed better
than conventional inferencing, except for very low decision thresholds in the 1x-R and 3x-Y cases.
In the 3x-R*+3x-Y* ensemble, performance shifts around a threshold of 0.5. In every case, the
avalanche decision scheme reaches higher max F2 scores than the standard technique.

RetinaNet using the histogram equalized images. Then we chose the best three-member ensemble

with the 3x-Yc ensemble, and the most diverse ensemble with the six-member 3x-R*+3x-Y*

ensemble. Each of their best corresponding avalanche decision schemes was plotted along with

their traditional decision scheme performances. In the 1x-Ra and 3x-Yc cases, the avalanche scheme

performs better than the standard decision scheme across a majority of the possible decision

66

thresholds. For the 3x-R*+3x-Y* ensembles, the standard scheme outperforms the avalanche

scheme for thresholds less than around 0.50. In all three cases, not only do the avalanche schemes

generally perform better than their standard decision scheme counterparts, but the maximum F2

scores were also higher than what the standard decision versions could attain. This maximum F2

performance can also be seen in the last column of tables 4.3 and 4.4. For each model and/or

ensemble, the Max F2 value of its corresponding avalanche scheme is significantly higher than the

standard schemes, many reaching a score above 0.9.

4.8 Regions Where Deep Networks Fail

In order to gain a better understanding of where the architectures are struggling, we are going

to apply the same shared coordinate mapping that we used in our fracture prevalence analysis. For

this, we will be looking at a single RetinaNet (1x-R𝑎) and a single YOLOv5 (1x-Y𝑐) network, which

were the two higher performing single-model runs in our testing. In this analysis, all of the IOU

thresholds were held at 0.30 and all acceptance thresholds for the bounding box probabilities was

held at 0.50. First, we will look at RetinaNet’s performance, split into three figures: Figure 4.10

shows the true positive, matching predictions, Figure 4.11 shows all false positives, where the model

provided predictions without a ground truth fracture, and Figure 4.12 shows the false negatives

where the model missed fractures from the test set.

Figure 4.10 True positive fractures captured by 1x-R𝑎.

67

Out of a total possibly 536 labeled rib fractures in the test set, the 1x-R𝑎 model correctly caught

222 (41.4%). It seems to have predicted predominantly in the central rib region, where 77% of

its predictions are located. The RetinaNet model did not over-predict regions that did not contain

fractures; only 27 fractures total were counted as false positives. It over-predicted in the central

region, which may explain why such a high percentage of the true positive fractures were captured

in this central band. Interestingly, it predicted nearly twice as many fractures on patient-right sides

compared to central or left, whereas our fracture prevalence analysis revealed there to be a slight

left-sidedness to fracture presence. Shifting finally to all fractures missed by the network, we see

Figure 4.11 False positive fractures predicted by 1x-R𝑎.

that despite its tendency of predicting in the middle horizontal region, there were still over 53% of

all missed fractures residing there. Of the three vertical zones, the most missed fractures were in

the patient left side.

Let us now look at the YOLOv5 network performance. Figure 4.13 shows the true positive

performance, where it captured almost the same number of fracture as RetinaNet with 218 (40.7%)

and overall performs almost identically to the RetinaNet model. Looking at the false positive

performance in Figure 4.14, we see that YOLOv5 falls into the same issue as RetinaNet: predicting

more fracture regions on the patient right side compared to patient left.

It appears that both

RetinaNet and YOLOv5 have a tendency to falsely predict fractures in the top-left region (patient

68

Figure 4.12 False negative fractures missed by 1x-R𝑎.

right on ribs 1-4), with RetinaNet proposing 5 and YOLOv5 proposing 6 in that region. Once

Figure 4.13 True positive fractures captured by 1x-Y𝑐.

again, YOLOv5 missed slightly more fractures in the patient-left vertical region (43.59% compared

to 38.46% on patient-right). Overall, both the single RetinaNet and YOLOv5 models perform

similarly, and therefore miss fractures in very similar ways. Over half of all missed fractures appear

in the middle-third region, which we know is the horizontal region with the most fractures, and

both also seem to be missing a higher percentage on the patient-left side.

69

Figure 4.14 False positive fractures predicted by 1x-Y𝑐.

Figure 4.15 False negative fractures missed by 1x-Y𝑐.

4.9 GAN Synthetic Fracture Generation via Fracture Prevalence

As mentioned previously, data scarcity is an unavoidable issue with medical imaging datasets.

Synthetically generated images have seen increasing interest as a method to address the limited data

problem. Generative Adversarial Networks (GANs) were first introduced in 2014 by Goodfellow

et al. [121] which demonstrated a concept where one neural network, called the generator, creates

images to try and “fool" a second neural network, the discriminator, into believing the synthetic

image is, in fact, real. Variations on the GAN architectures have been proposed over the years,

such as pix2pix [122], StyleGAN [123], and CycleGAN [124]. While these were introduced using

70

natural images, various other studies sought to tailor them toward generating synthetic medical

images. Sorin et al. [125] conducted a review of over 30 different studies that implemented a GAN

for some radiological applications, ranging from image reconstruction and denoising, transferring

between modalities, data augmentation, and image segmentation. However, despite there being 33

studies overall, only one of them used chest radiographs—for image segmentation—whereas nearly

every one of them dealt with MRI or CT scans regardless of application.

Figure 4.16 Schematic of near-pair patch GAN training process. Two sets of generator/discriminator
pairs are trained simultaneously using near pair patches. Generator 1 converts real fracture-absent
patches to fracture-present patches. Discriminator 1 distinguishes between synthetic fracture-
present and real fracture-present patches. Generator 2 removes pathology to create synthetic
fracture-absent patches. Discriminator 2 distinguishes between synthetic and real fracture-absent
patches. Figure from [126].

The task of generating synthetic chest radiographs is not only a difficult problem, but also

uncommon and is thus quite important given the limiting constraints of data. Our colleague, Ethan

Tu, in the Medical Imaging and Data Integration (MIDI) Lab sought to synthetically generate rib

fractures in chest radiographs with the goal of improving object detector performance compared

to only real rib fracture containing images [126]. To achieve this, he used a CycleGAN trained

on localized patches. A CycleGAN is advantageous because it does not require identically-paired

images from the two domains; these types of paired images would be impossible to obtain for

our task of needing fracture-absent and fracture-present images. Additionally, this method relies

on localized patches of rib regions. Our goal is to generate a localized portion of a rib with a

synthetic fracture. This localized portion is then inserted back into the full radiograph image. This

71

greatly reduces the computational requirements for generating synthetic fractures and constrains

the problem to localized scenes. Finally, an innovation of Tu’s work is to use near-pair patches,

where the images are from similarly localized scenes with one having a present fracture and one

being fracture-absent.

We utilized a subset of the data used in this chapter, consisting of 704 pediatric patients with 515

fracture-present and 189 fracture-absent patients. Near-pair patches were found by first isolating the

hand-labeled rib fractures then manually finding a matching fracture-free rib from the same patient,

yielding a closely similar rib region where one has a fracture and one does not. Each of these

localized patches were of size 128 × 128. A second GAN was trained using another set of paired

patches, however this time the fracture-absent patches were randomly selected from fracture-absent

patients in the dataset. A third and final GAN was trained using the Fréchet Inception Distance

(FID) [127] which is a metric used to evaluate the realness of image generated by GANs. A

schematic of this model is presented in figure 4.16. Each of these trained GANs will be abbreviated

as NPPGAN, RandGAN, and FIDGAN, respectively.

Table 4.5 The various training sets used to train YOLOv5 to investigate the effect of radiographs
with synthetically generated rib fractures on detection performance.

Real Images Synthetic Images Total Training Images

0
50
50
250
250
500
500

500
0
500
0
500
0
500

500
50
550
250
750
500
1000

We evaluated the performance of adding the images with synthetically generated rib fractures

to the training sets for a single YOLOv5 network, i.e., 1x-Y𝑎. This was due to YOLO’s efficient

implementation, making it quick to train on various dataset sizes of real and real + synthetic training

sets, and its tendency to be more constrained in the number of predictions provided as opposed to

RetinaNet. Table 4.5 shows the various training set sizes and number of validation and test images

72

used to evaluate YOLOv5. Each model was trained on a NVIDIA V100S like before, with a batch

size of 8 and at most 300 epochs with an early stopping protocol if validation performance did not

improve within 100 epochs. Unlike previously where 10 separate models were trained on different

randomizations of the training set to generate error bars for results, error bars for this work were

calculated by performing 5, 000 iterations of stratified bootstrapping on the test set results.

Table 4.6 Results from adding synthetically generated rib fractures from the random-pair patch
CycleGan, i.e., RandGAN, on the detection performance of a single YOLOv5 network.

Training Dataset

Precision

Recall

F2 Score

0 Real 500 Synthetic

0.125 ± 0.128

0.004 ± 0.004

0.005 ± 0.005

50 Real 0 Synthetic
50 Real 500 Synthetic

0.735 ± 0.070
0.815 ± 0.091

0.214 ± 0.043 0.249 ± 0.048
0.200 ± 0.050
0.169 ± 0.044

250 Real 0 Synthetic
250 Real 500 Synthetic

0.820 ± 0.035
0.816 ± 0.039

0.452 ± 0.048
0.496 ± 0.047
0.456 ± 0.053 0.499 ± 0.052

500 Real 0 Synthetic
500 Real 500 Synthetic

0.901 ± 0.030
0.860 ± 0.033

0.448 ± 0.051
0.498 ± 0.051
0.493 ± 0.048 0.539 ± 0.047

Table 4.7 Results from adding synthetically generated rib fractures from the near-pair patch Cycle-
Gan, i.e., NPPGAN, on the detection performance of a single YOLOv5 network.

Training Dataset

Precision

Recall

F2 Score

0 Real 500 Synthetic

0.000 ± 0.000

0.000 ± 0.000

0.000 ± 0.000

50 Real 0 Synthetic
50 Real 500 Synthetic

0.779 ± 0.059
0.922 ± 0.046

0.254 ± 0.035 0.294 ± 0.038
0.213 ± 0.040
0.179 ± 0.035

250 Real 0 Synthetic
250 Real 500 Synthetic

0.894 ± 0.047
0.909 ± 0.033

0.468 ± 0.048 0.517 ± 0.048
0.499 ± 0.050
0.449 ± 0.050

500 Real 0 Synthetic
500 Real 500 Synthetic

0.825 ± 0.035
0.874 ± 0.036

0.536 ± 0.044 0.576 ± 0.042
0.569 ± 0.046
0.524 ± 0.047

Tables 4.6, 4.7, and 4.8 provide the results after fine-tuning YOLOv5 for each the RandGAN,

NPPGAN, and FIDGAN synthetically generated fracture images, respectively. Training purely with

synthetically generated fractures, regardless of GAN, leads to abysmal performance. Interestingly,

only the RandGAN provided any metrics on the test set where NPP and FID both gave scores of 0

across all metrics. Looking at Table 4.6 with RandGAN, precision increases 10.8% with a small 50

73

Table 4.8 Results from adding synthetically generated rib fractures from the near-pair patch Cy-
cleGan with added FID scoring, i.e., FIDGAN, on the detection performance of a single YOLOv5
network.

Training Dataset

Precision

Recall

F2 Score

0 Real 500 Synthetic

0.000 ± 0.000

0.000 ± 0.000

0.000 ± 0.000

50 Real 0 Synthetic
50 Real 500 Synthetic

0.764 ± 0.051
0.981 ± 0.019

0.327 ± 0.048 0.369 ± 0.051
0.239 ± 0.036
0.201 ± 0.031

250 Real 0 Synthetic
250 Real 500 Synthetic

0.883 ± 0.031
0.923 ± 0.027

0.434 ± 0.042 0.482 ± 0.042
0.458 ± 0.050
0.408 ± 0.049

500 Real 0 Synthetic
500 Real 500 Synthetic

0.991 ± 0.009
0.855 ± 0.034

0.412 ± 0.041
0.466 ± 0.042
0.488 ± 0.041 0.534 ± 0.040

real image dataset while decreasing for both 250 (−0.49%) and 500 (−4.55%) real image datasets.

This trend flips when looking at recall, where with 50 real images there is a 21% decrease, but 250

and 500 real images experience an increase of 0.88% and 10%, respectively.

NPPGAN results (Table 4.7) are quite different. In every case, adding synthetically generated

fracture images leads to precision improvement while recall decreases. This consequently reduces

F2 score in each case as well. The same cannot be said for results in Table 4.8 for the FIDGAN

images. Like the RandGAN images, precision values see marked increases for both 50 real images

(+28.4%) and 250 real images (+4.53%), with a decrease for 500 images (−13.72%). Recall

also decreases for both 50 images (−38.53%) and 250 images (−5.99%), but then experiences a

significant improvement with 500 images (+18.45%). This leads to the largest improvement in F2

scores across all training combinations, increasing 14.59% from 0.466 ± 0.042 to 0.534 ± 0.040.

It must be noted that detector performance on these sets of training data cannot be compared

directly between the three GANs. There are two factors for this: first, the real-only images randomly

chosen for each GAN were not drawn equally—which was unfortunately overlooked—and second,

the images with synthetically generated fractures were also different across the GANs. This is why

the real-only performances vary in each of the tables, and why performance changes discussed

above were largely in terms of percent increases or decreases.

The performance from these sets of images from the RandGAN, NPPGAN, and FIDGAN all

74

show there can be value added by being able to synthetically generate rib fractures to improve an

object detector’s ability in correctly identifying and localizing fractures, but more can be done. A

new set of synthetically generated images from each of the GANs was created, this time with each

synthetic bounding box region and all images kept the same across the three generative models. To

compare with the results from above, a similar number of 500 images with synthetically generated

rib fractures was curated as well as 500 real images that were also held consistent when training with

the three sets of GAN images. In other words, the training set consisted of 500 real images, 389 of

which are fracture-present—with a mean 4.65 ± 4.07 fractures per image—and 61 fracture-absent,

and 500 synthetically modified images with an average of 3.01 ± 1.45 fractures per image. The

validation set is 62 real images with 31 each fracture-present (mean 5.58 ± 4.84 fractures per image)

and fracture-absent. Lastly, the test set size is the same 120 real images also split evenly between

fracture-present (mean 3.68 ± 3.43 fractures per image) and absent.

The performance of of YOLOv5 with the new synthetic images from the three GANs is shown

in Table 4.9. Unlike previously, these metrics can be directly compared to one another. Starting

with a training set of 500 real-only images, the YOLOv5 model performs similarly to what we saw

for base performance for the entire dataset, being just shy of 90% accurate when it predicts a box

but only catching 42% of the rib fractures in the test set. In each case, adding the images with

synthetic rib fractures decreases precision performance while improving recall. This is especially

the case with the FIDGAN results, where precision is barely reduced (−0.78%) and recall improves

to 47% of fractures (+11.6%).

Table 4.9 Performance of 1x-YOLOv5 using 500 real and 500 updated synthetic images to train.
The first row ( - ) is a 1x-YOLOv5 trained on just 500 real images.

GAN

Precision

Recall

F2 Score

-

0.895 ± 0.034

0.422 ± 0.052

0.471 ± 0.053

Rand GAN 0.826 ± 0.042
NPP GAN 0.860 ± 0.040
FID GAN 0.888 ± 0.036

0.439 ± 0.045
0.437 ± 0.053
0.471 ± 0.052

0.484 ± 0.045
0.484 ± 0.053
0.520 ± 0.051

Qualitatively, the generated synthetic fractures appear realistic, but only around one-third of the

75

generated fractures were clear and convincing. This could be an indication of the GANs output

being inconsistent, but there is also a chance that our lack of expertise prevents us from knowing

for sure that a fracture was generated in a more difficult patch. Regardless, because of the results

in Table 4.9, it is clear that the synthetically generated fracture images as a data augmentation

technique provides useful information during training which could be quite beneficial for improved

detector performance in the future.

4.10 Summary

Throughout this chapter, we have detailed the training setup and data splitting techniques we used

to trained a plethora of different models. We carried out an expert inter-reader performance study to

serve as a baseline understanding of experienced radiologists in order to compare the performance

of the deep learning methods. We show that through various techniques such as model ensembling,

varied image processing techniques, and our novel “avalanche" decision scheme we are able to

drive the recall performance of these deep learning models upward.

We also visualized the performance of standalone RetinaNet and YOLOv5 models using the

same method as our fracture prevalence study from Section 3.4. This demonstrated that both models

in their base configurations perform very similarly, and tend to predict regions and miss fractures in

similar ways. We also provided a summary of the work carried out by our colleague, Ethan Tu, with

his work on synthetically generated fracture regions as a method of data augmentation that, when

combined with our object detector work, presented a promising initial look into the feasibility of

using these synthetic images to improve the capability of deep learning detectors in capturing more

of fractures. Work demonstrated throughout this chapter contributed to the following academic

products:

• Gadgeel, Burkow, Perez, Junewick, Zbojniewicz, Otjen, Alessio, “Evaluation of inter-reader

reproducibility for detection and labeling of pediatric rib fractures on radiographs,” Inter-

national Pediatric Radiology Congress, Rome [Virtual], Oct 11-15, 2021.

• Burkow, Holste, Otjen, Perez, Junewick, Alessio, "Avalanche Decision Schemes to Improve

76

Pediatric Rib Fracture Detection," Proc. SPIE 12033, Medical Imaging 2022: Computer-

Aided Diagnosis, 120332A, Apr. 2022.

• Tu, Burkow, Otjen, Perez, Junewick, Alessio, “Synthetic Pathology Generation with Near-

Pair Cyclic GANs for Object Detectors,” IEEE Nucl. Sci. Symp. and Med. Imaging Conf.,

Milan, Nov. 2022.

• Burkow, Holste, Otjen, Perez, Junewick, Zbojniewicz, Romberg, Menashe, Frost, Alessio,

"High sensitivity methods for automated rib fracture detection in pediatric radiographs," Sci

Rep 14, 8372 (2024). https://doi.org/10.1038/s41598-024-59077-5.

• Tu, Burkow, Tsai, Perez, Junewick, Alessio, "Near-Pair Patch Generative Adversarial Net-

work for Data Augmentation of Focal Pathology Object Detection Models," In review for

Journal of Medical Imaging, 2024.

77

CHAPTER 5

DOMAIN-SPECIFIC TWO-STAGE DETECTOR

5.1 Exhaustive Search

To get a preliminary idea of how a two-stage network would perform on our data, we manually

created a patch-based version of our dataset using an exhaustive search process. All images

were rescaled based on the median pixel spacing of 0.125 mm so that physical spatiality was

homogeneous among the images. To ensure a consistent resolution for patch generation, the

Figure 5.1 Representation of the pre-processing stage of rescaling an image array to a fixed (1024,
1024) resolution with original image aspect ratio preserved and histogram equalization applied.

matching pixel-spacing arrays were then rescaled to a square resolution of 1024 × 1024, with the

aspect ratios of the original images preserved by padding with zeros in either the height or width

dimension as necessary. During this process, each image received the same histogram equalization

as applied previously prior to the rescaling so that equalization is applied evenly across the entire

image rather than calculating it per patch later. An example of this process is shown in Figure 5.1.

All of the associated annotated bounding boxes underwent a similar rescaling process, first adjusting

for the pixel spacing and subsequently for the resolution rescaling.

From these now rescaled, square images, patches of size 128 × 128 were generated with a stride

78

(2071, 1140)(1024, 1024)of 64 pixels, resulting in a 50% overlap between adjacent patches in each direction and yielding 225

patches per image. Figure 5.2 illustrates a single image represented by all of its patches. Patches of

this size were chosen to minimize the chance of introducing too much information within a single

patch; generally, a patch will at most show parts from two ribs. Overall, from the entire dataset

of 1, 109 patients this produced a total of 140,400 patches, among which 7,921 (5.6%) contained

fractures, while the remaining 132,479 were fracture-absent.

Figure 5.2 A grid view showing all 225 patches associated with a single image.

5.1.1 Training Setup

The dataset was split into training, validation, and test sets based on 70%, 10%, and 20% ratios,

respectively. The training set consists of 98, 282 patches, with 5, 545 fracture-present and 92, 737

fracture-absent patches. The validation set includes 14, 039 patches, with 792 fracture-present and

79

13, 247 fracture-absent patches. Finally, the test set contains 28, 079 patches, of which 1, 584 were

fracture-present and 26, 495 were fracture-absent.

Two classification architectures were trained and evaluated: ResNet-50 and DenseNet-121.

Both ResNet-50 and DenseNet-121 were trained to a maximum of 100 epochs with batch sizes of

64 with an Adam optimizer. All layers with trainable layers were unfrozen with the output of the

final prediction layer reduced down to a binary class output of either fracture-absent or fracture-

present. Given the massive class imbalance of the dataset, a simple weighting of about 16.7 (ratio

of fracture-absent to fracture-present patches in the training set) was applied for the positive class

on the binary cross entropy loss function. Using a single A100 NVIDIA graphics card, completing

all epochs of training for ResNet-50 and DenseNet-121 took 481 minutes (8.02 hours) and 375

minutes (6.25 hours), respectively.

5.1.2 Results for Exhaustive Search

Table 5.1 displays how ResNet-50 and DenseNet-121 perform under the normal binary classifi-

cation acceptance threshold (i.e., 𝑝𝑖 >= 0.50 ⇒ 𝑦𝑖 = 1). Accuracy appears relatively high for both,

reaching nearly 78% for ResNet-50 and 82% for DenseNet-121, though it is worth remembering the

test dataset contained 1, 584 fracture-present and 26, 495 fracture-absent patches; simply scoring

every patch as fracture-absent would yield an accuracy of 0.944. Looking at precision and recall

scores provides additional clarity on how the models are scoring the patches. Precision for both

are below 0.2, indicating that both have a high likelihood of assigning false positives. A trade-off

of this is that the classifiers both identify nearly three-fourths of the fracture-present patches in the

test set.

Table 5.1 Performance of ResNet-50 and DenseNet-121 on classifying fracture-present versus
absent patches found via exhaustive search. Threshold of 0.50 used for setting labels as 1 and 0.

Accuracy Precision Recall F2 Score

ResNet-50
DenseNet-121

0.779
0.819

0.169
0.198

0.745
0.724

0.442
0.473

To gain a more holistic understanding of both networks across all acceptance thresholds, we

80

look at the precision-recall and ROC curves (shown in the left and right images in Figure 5.3,

respectively).

In both cases, DenseNet-121 outperforms the ResNet-50 models, achieving an

average precision of 0.414 and a ROC-AUC of 0.906. They both perform respectably in terms

of ROC, achieving AUC scores well above a random classifier, but the precision-recall curve

demonstrates how much both networks struggle at both capturing the fracture patches and being

confident in their predictions.

Figure 5.3 Precision-Recall and ROC Curves for ResNet-50 and DenseNet-121 performance on the
exhaustive search patch test set.

5.2 Fracture Prevalence-inspired Two-Stage Detection

After an initial investigation at classification performance on an exhaustive search region ex-

traction approach from our data, we shift our attention toward creating a domain-specific two-stage

implementation. Ultimately, the goal is to utilize the fracture prevalence-style mapping system we

discussed in Section 3.4 as an a priori region proposal strategy.

5.2.1 First stage: Fracture Prevalence and U-Net Region Proposal

Through our analysis earlier on the fracture prevalence patterns of our dataset, we were able

to ascertain a few key points: fractures are spread relatively evenly across the rib cage, a higher

proportion are present on the central ribs and lateral margins, and there is a slight patient left-

81

0.00.20.40.60.81.0Recall0.00.20.40.60.81.0PrecisionPrecision-Recall CurveResNet50: AP=0.334DenseNet121: AP=0.3600.00.20.40.60.81.0False Positive Rate0.00.20.40.60.81.0True Positive RateROC CurveResNet50: AUC=0.855DenseNet121: AUC=0.863Randomsidedness. Knowing these, we believe that directly using this in the method for region proposal

could lead to improved fracture detection by eliminating erroneous boxes that the one-stage detectors

we tested in Chapter 4 automatically generate through anchor boxes.

Figure 5.4 Randomly drawn set of 200 points using the fracture prevalence map intensities as
weighting for the random drawing.

Figure 5.4 illustrates an example of using the fracture prevalence array as a weighted random

location generator for 200 randomly generated points. To achieve this, we took the set of all

(𝑅𝑥, 𝑅𝑦) x- and y-ratio offsets calculated from each individual segmentation mask’s centroid from

the patients in the training and validation sets from Section 4.1—to eliminate the chance of test

set bounding box locations directly influencing the random location generation—and mapped them

onto a reference image from the test set.

The values of the 2D fracture prevalence map was flattened into a 1D vector and used as

weights for generating a series of random points from within the reference image’s segmentation

82

mask. After the indices from the 1D vector are randomly drawn, the indices get converted back

into 2D indices that become the center coordinates for bounding boxes (white dots in Figure 5.4).

Figure 5.5 Example image with 200 bounding boxes randomly drawn from the fracture prevalence
map. The yellow area represents the segmentation mask output from the trained U-Net. (left) all 200
boxes, where red represents boxes that do not have at least 65% of pixels within the segmentation
mask, (right) red boxes removed, yielding the final set of 188 bounding box regions for the image.

Figure 5.5 shows what the bounding boxes generated from those center coordinates look like on

the reference image. The center coordinates were all drawn from the set of indices present inside

the yellow segmentation mask. In the left image, there are red boxes that represent boxes that do

not have at least 65% pixel-overlap with the segmentation mask; we decide to remove them from

the set since being on the periphery of the mask yields a box with a large portion being non-rib

containing. On this example with 200 randomly generated boxes, 12 boxes are below this 65%

threshold and are removed leaving a set of 188 bounding box regions for the given image.

5.2.2 Second stage: Patch Fracture Classifier

The second-stage classifier portion of our method will again be using ResNet-50 and DenseNet-

121 like explored in the preliminary exhaustive search work. However, due to the large difference

in sizes and patch content between the bounding boxes proposed in the fracture prevalence region

proposal stage, we cannot directly utilize the trained networks from before. Thus, we create a new,

label and segmentation-guided random patch dataset.

83

With each of the 1, 109 patients in our overall dataset, we use its associated segmentation mask

to sample 10, 000 random points within the mask. Each random point is a potential location for

the center of a bounding box, and temporary bounding box coordinates are created for all points.

The random drawing of these points were done uniformly and not weighted based on the fracture

prevalence map like above. The width and height of the bounding box are randomly drawn from

the mean ± standard deviation of the widths and heights of all labeled fractures from our radiologist

reads. Each generated box is then compared to the set of all ground truth boxes within the given

image using an IOU threshold of 0.40 if it is a fracture-present patient image.

Figure 5.6 Example output of proposals for an image using the label and segmentation-guided region
generator. Green boxes represent ground truth fracture labels. Magenta boxes are boxes generated
from ground truth labels. Orange boxes are boxes randomly sampled from the segmentation mask
(yellow pixels).

Every box is marked as whether they have a 65% pixel overlap with the segmentation mask,

similar to the region proposal stage. Then, for boxes that have a non-zero IOU value with a ground

truth, an additional threshold of 60% overlap with ground truth was added to ensure generated

84

bounding boxes were large enough to contain enough relevant information. For all generated

boxes that both exceeded the IOU and overlap thresholds, two bounding boxes were chosen for

each associated ground truth label in the image. In other words, if an image contained 6 labeled

annotations, there will be 12 generated boxes with each actual fracture being represented by two

boxes of varying size and aspect ratio. From the remaining possible locations with no overlap with

any ground truth fractures, 100 of the boxes were randomly selected. Figure 5.6 illustrates this

label and segmentation-guided random patch proposal for a single image.

Figure 5.7 An example of 30 random patches generated for this dataset. 15 of these are fracture-
present, and 15 are fracture-absent. This illustrates the difficulty of classifying patches as fracture-
present or absent.

Figure 5.7 shows a small 30-patch subset from this random patch generation, illustrating the

difficulty in classifying these patches. This process yields an overall dataset size of 118, 223

patches, of which 5, 049 (4.3%) are fracture-present and 113, 174 are fracture-absent. This is an

even smaller ratio of fracture-present to absent patches, however the patches containing fractures

in this set are specifically based on the ground-truth labels. Rather than doing a generic 70/10/20

85

train/val/test split like we did in the exhaustive section, we instead used one of the cross-validation

folds with the same split of training, validation, and test patients used throughout Chapter 4.

5.2.3 Results for Two-Stage Detection

As before, both ResNet-50 and DenseNet-121 were trained to a maximum of 100 epochs

with batch sizes of 64 with an Adam optimizer. Using a single V100S NVIDIA graphics card,

completing all epochs of training for ResNet-50 and DenseNet-121 took 549 minutes (9.15 hours)

and 524 minutes (8.73 hours), respectively.

We tested three different class-balancing weightings for the binary cross-entropy function, with

a maximum of approximately 21.6 (the ratio of fracture-absent to fracture-present patches in the

training set) and two smaller ratios of this value, one 0.66× (around 14.3) and one 0.33× (around

7.1). The precision-recall and ROC curves for ResNet-50 and DenseNet-121 are provided in

Figures 5.8 and 5.9.

Figure 5.8 Precision-Recall and ROC Curves for ResNet-50 with three different positive class-
balancing weights.

We can see that the performance for both of these classification networks is very similar on

the label + segmentation-guided data set. DenseNet-121 slightly out-performs ResNet-50 in both

average precision as well as ROC-AUC. For ResNet-50, the best performing model used a 0.66×

86

0.00.20.40.60.81.0Recall0.00.20.40.60.81.0PrecisionPrecision-Recall Curve1.0: AP=0.2580.66: AP=0.3490.33: AP=0.3020.00.20.40.60.81.0False Positive Rate0.00.20.40.60.81.0True Positive RateROC Curve1.0: AUC=0.7820.66: AUC=0.8630.33: AUC=0.823RandomFigure 5.9 Precision-Recall and ROC Curves for DenseNet-121 with three different positive class-
balancing weights.

positive class-balancing weight of 14.3, whereas DenseNet-121 saw the highest performance with

the full 21.6 weighting value. Both models perform slightly better than their trained versions on the

exhaustive search set; for instance, ResNet-50 improved AP from 0.334 to 0.349 and ROC-AUC

from 0.855 to 0.863.

Table 5.2 shows the performance of each of the three weighted ResNet-50 and DenseNet-121

models with a static acceptance threshold of 0.50. Once again, both networks seem to perform worse

than if every patch was simply negatively-labeled as ’0’, though with much closer performance than

seen in the exhaustive search test. Precision and recall values at this acceptance threshold are

relatively low but more consistent with one-another. Interestingly, ResNet-50 had higher precision

and lower recall than DenseNet-121. However, due to our concern for higher recall, DenseNet-121

performs significantly better in terms of F2 score.

Finally, we explore how these classifiers perform when implementing the fracture prevalence-

weighted region proposal stage on the test set of images. We use the best performing version of

each model: ResNet-50 with a positive class-weighting of 14.3 and DenseNet-121 with a class-

weighting of 21.6. We test a range of box proposals for each image (i.e., 50, 100, 150, 200, 250,

87

0.00.20.40.60.81.0Recall0.00.20.40.60.81.0PrecisionPrecision-Recall Curve1.0: AP=0.3960.66: AP=0.3560.33: AP=0.3480.00.20.40.60.81.0False Positive Rate0.00.20.40.60.81.0True Positive RateROC Curve1.0: AUC=0.8780.66: AUC=0.8590.33: AUC=0.853RandomTable 5.2 Performance of ResNet-50 and DenseNet-121 on classifying fracture-present versus
absent patches found through a label and segmentation-guided random proposal process. Threshold
of 0.50 used for setting labels as fracture-present and fracture-absent.

Model

Weighting Ratio Accuracy Precision Recall F2 Score

ResNet-50

DenseNet-121

1.0
0.66
0.33

1.0
0.66
0.33

0.947
0.957
0.952

0.944
0.935
0.948

0.345
0.473
0.391

0.361
0.315
0.373

0.317
0.315
0.309

0.483
0.497
0.391

0.322
0.337
0.322

0.452
0.446
0.387

300) to gauge performance as the number of proposals increases. For 50 patches per image, the

overall number of patches for the test set is 11, 099; for 300 boxes per image, there are 66, 481

boxes. The predictions were passed through a non-max suppression stage with an IOU threshold

of 0.50 to remove overlapping predictions. For reference, this reduced the 50 boxes per image set

down to around 9, 700 and the 300 boxes per image set down to just under 37, 800.

Figure 5.10 shows how both ResNet-50 and DenseNet-121 perform across these various proposal

counts. As we would expect, increasing the number of regions proposed improves the recall

performance of both methods. However, due to the networks’ tendencies to falsely over-predict,

the precision performance decreases as well. This leads to an almost unchanging F2 score across

amounts of boxes proposed. Interestingly, we see the flipped precision and recall performances

between the two networks leading to near-identical F2 performance between 100 and 250 proposals

per image.

Table 5.3 Performance of using fine-tuned ResNet-50 and DenseNet-121 on detecting fractures in
the test set using 200 region proposals per image from the fracture prevalence-weighted proposal
stage. Error bars were calculated through a 1000 iteration bootstrapping of the test set.

Precision

Recall

F2 Score

ResNet-50
DenseNet-121

0.097 ± 0.009
0.076 ± 0.006

0.384 ± 0.028
0.507 ± 0.028

0.241 ± 0.019
0.236 ± 0.016

Table 5.3 shows the performance of ResNet-50 and DenseNet-121 models compared to ground

truth when using the same evaluation parameters as used in Chapter 4: a confidence acceptance

88

Figure 5.10 Performance of ResNet-50 and DenseNet-121 in terms of Precision, Recall, and F2
score across various numbers of proposed boxes per image from the fracture prevalence region
proposal stage.

threshold of 0.50 and an IOU threshold of 0.30. As a reminder, the fine-tuned RetinaNet achieved

0.892 ± 0.015, 0.430 ± 0.014, and 0.480 ± 0.014 scores for precision, recall, and F2, respectively;

YOLOv5 scored 0.897 ± 0.032, 0.434 ± 0.040, and 0.484 ± 0.037.

There are a couple reasons to explain such low performance. First, as mentioned previously

these classifier models have a high likelihood to produce false positive results on the patches,

leading to the demonstrably low precision score (e.g., DenseNet-121 only 271 generated bounding

boxes matching as true positive matches compared to 3, 319 false positives). We do see DenseNet

covering a higher number of fractures from the test set (50% opposed to 43% for RetinaNet and

YOLOv5) though with much lower precision. This leads to much lower F2 score values compared

to the base one-stage networks.

89

50100150200250300Number of Boxes Proposed0.00.20.40.60.81.0ScoreDenseNet PrecisionDenseNet RecallDenseNet F2ResNet PrecisionResNet RecallResNet F2Taking inspiration from the ensembling efforts on the one-stage architectures in the previous

chapter, we looked into two methods of combining the performance of the ResNet-50 and DenseNet-

121 classifiers to possibly improve detection performance on the test set. The first technique is a

very rough maximum probability selection, i.e., following inferencing with both models whichever

model provided a higher probability score for each box was the score kept. Since these scores

can only possibly increase, we expect that more boxes will have scores exceeding the 0.50 model

confidence and therefore lead to more false positives as well as more accepted proposals matching

ground truth. The second method can be considered a late-model fusion. The outputs from both

models are averaged just prior to applying sigmoid. This offers more nuance, since the amplitude

of discrepancies between the models can lead boxes that may have been marked as fracture-present

for one model to be labeled fracture-absent, and vice-versa.

Figure 5.11 Performance of ResNet-50 and DenseNet-121 as well as a max probability and average
fusion, in terms of Precision, Recall, and F2 scores, across increasing numbers of proposed boxes
per image from the fracture prevalence-weighted region proposal stage.

Figure 5.11 shows the performance of the individual ResNet-50 and DenseNet-121 classifiers

from Figure 5.10 as well as the two combination efforts. As expected, with the max probability

method we see an improvement in recall across all amounts of box proposals with a decrease in

precision. This leads to F2 score performance between 100 and 250 proposed boxes that is below

any other method. With the average fusion method, we see that recall does not experience much

of an increase in score; in fact, it gains a marginal improvement over the standalone ResNet-50.

90

1002003000.00.20.40.60.81.0PrecisionDenseNetResNetMax ProbabilityAverage Fusion1002003000.00.20.40.60.81.0RecallDenseNetResNetMax ProbabilityAverage Fusion1002003000.00.20.40.60.81.0F2 ScoreDenseNetResNetMax ProbabilityAverage FusionNumber of Boxes ProposedHowever, this method led to the highest precision values across all region proposal counts. These

two aspects combine to an overall best performing F2 score for our two-stage work.

Table 5.4 compares the results from our inter-reader study on the matching test set, the best

performing one-stage detector including all of our recall-improving techniques, and the late-model

fusion two-stage detector. It is clear that the severely low precision from both DenseNet-121 and

ResNet-50 on this task is leading to a considerable drop in F2 score performance compared to

radiologists and our best one-stage network.

Table 5.4 Performance of the best results across our work: expert radiologists, a 3x-YOLOv5
ensemble using a 𝛾 avalanche scheme, and our fracture prevalence region proposal (FPRP) with
average late-model fusion with 200 boxes per image.

Avalanche

Precision

Recall

F2 Score

Expert Radiologists
One-Stage: 3x-Yc
Two-Stage: FPRP + Fusion

-
𝛾 = 0.20
-

0.792
0.573 ± 0.058
0.122 ± 0.011

0.718
0.780 ± 0.030
0.412 ± 0.028

0.732
0.725 ± 0.012
0.278 ± 0.019

5.3 Summary and Future Work

Within this chapter, we present new work developing a custom, domain-inspired two-stage rib

fracture detection network based on a weighted region proposal guided by fracture prevalence in

our dataset. Fine-tuning a classifier network such as ResNet-50 and DenseNet-121 on a large set

of fracture-present and fracture-absent patches reveals a high tendency of over-predicting fractures

when considering a standard acceptance threshold of 0.50. When considering the best performing

version of the two-stage work, using an averaged late-model fusion, the recall score achieved nearly

matches the base RetinaNet and YOLOv5 networks; however, the precision is drastically lower with

only a 12% likelihood of a predicted box correctly containing a rib fracture.

Future work can be done on both the region proposal stage as well as the classifier stage.

Despite being weighted by our fracture prevalence map, regions are still drawn randomly. This

can lead to newly drawn regions significantly overlapping existing boxes, both reducing the chance

of representing enough of the fractures within the dataset as well as increasing the chance for

incorrectly labeled boxes. Additionally, a more sophisticated version of the region proposer could

91

include a way of segmenting out ribs themselves so that regions drawn from the prevalence map

are centered on a rib instead of predominantly the intercostal regions.

The poor capabilities of the classifier networks is a main contributor to the lackluster detection

performance of the two-stage work. There could be further effort to tune hyperparameters to

yield the best possible results, such as more thorough positive class-balancing weight or varied

probability acceptance thresholds. Given the sheer difficulty of classifying small image patches

and the incredible class imbalance, perhaps different loss function (e.g. focal loss [28]) would lead

to better performance. There may also be more impressive ways of implementing multi-model

fusion to combine results as opposed to averaging the pre-sigmoid outputs. Finally, since we have

separately segmented out regions from our U-Net3+ architecture, incorporating thoracic-region

location information alongside the image patches in a multi-modal fashion could provide additional

context to the classifiers to yield better performance across the metrics.

92

CHAPTER 6

CONCLUSION

Throughout this dissertation, we presented multiple approaches to improve the performance for

automatic pediatric rib fracture detection. We acknowledge that there are various limitations

throughout the work in this project. For example, we demonstrated the value of ensembling models

to increase recall; there could be multiple ways to combine regions from ensembled models. In

this effort, we only explored combining model results with non-maximum suppression (NMS) of

all proposed boxes. This is an affirmative approach that inherently will only result in equal or better

recall (coupled with equal or worse precision). Future work could explore additional methods for

merging model results.

The data used in our training and testing was limited. We used a relatively small dataset from

a single institution; future efforts are needed to replicate these results and ensure generalizability

in larger, more diverse datasets. Moreover, our 624 fracture-present images were labeled with

single reads and therefore our fracture-present labels are noisy and likely contain errors. This is

especially likely considering the challenge of detecting pediatric fractures and given that our inter-

reader variability sub-study showed that along with other measures, the recall between radiologists

was just under 71% demonstrating over one-quarter of all fractures were missed by the second

reader. Future work using consensus interpretation is needed to improve our labels. Finally, the

performance was evaluated on a test set with half fracture-present cases and half fracture-absent

cases. While the prevalence of rib fractures in real-world clinical settings will vary depending

on the site and nature of the practice (out-patient versus emergency room setting, etc.), this 50%

prevalence test set contains a higher likelihood of fractures than would be encountered in practice.

Future work is needed to determine performance in realistic clinical settings with more thorough

comparisons to expert human performance.

The hypothetical end-goal of this work is to be implemented as a semi-autonomous procedure,

i.e., the work demonstrated here would constitute the automated portion providing an initial read of

potential regions of interest with a follow-up by a trained radiologist to use the model predictions

93

as guidance for their review. Given the performance of the detection networks compared to expert

radiologists, we are unsure how effective the model predictions would actually be in a clinical

setting. For example, on images where the detector outputs numerous box proposals, e.g. greater

than 10, would it possibly take longer for the radiologist to thoroughly vet each of the regions

proposed to reach a conclusion of actual rib fracture regions as opposed to performing their own

interpretation without the AI influence. Therefore, future work is needed to evaluate the proposed

methods as an in-line augmentation strategy with a radiologist serving as the arbitrator of model

predictions. The importance of being able to trust the object detector proposals extends much

further when considering the possibility of implementing an entirely automated system to serve as

a sole diagnostic decider, i.e., an image receiving any number of predicted rib fractures directly

generating an alert of possible child abuse. This raises ethical concerns, as the accusation of child

abuse and neglect is not minor and has potentially massive implications for anyone who may be

involved. Even if the object detectors we trained demonstrated performance far exceeding the

approximately 0.7 precision and recall of the radiologist inter-reader sub-study, a lack of expert

human input could lead to doubt regarding the actual presence of abuse on the patient given the

black-box nature of these types of deep learning architectures.

Foundation models [128] are some of the most cutting-edge models and have begun taking over

the AI and machine learning spheres by storm in recent years. The idea behind these models is

that they are trained on such a large corpus of multi-modal information that they do not need to be

specifically fine-tuned to be able to provide output for a given task. Prominent examples include

the BERT [129] and GPT [130] language models and the DALL-E [131] and Stable Diffusion [132]

text-to-image models. One especially interesting foundation model is the Segment Anything project

by Meta AI [133] that provides zero-shot segmentations of any images uploaded. We wanted to

see whether this model could provide any sort of comparison to the task-specific, fine-tuned work

we have discussed throughout this dissertation. In its demo, three different tasks can be performed:

hovering over a region and clicking within it to segment it, drawing a box to segment what it

contains, and an automatic “segment every object" task they call “everything."

94

Figure 6.1 Examples of outputs from the Segment Anything model [133] on three different patients
in our dataset. The top row is a hover-click option where hovering will show the region it would
segment based on that location and placing a dot locks in that part of the segmentation. The center
shows segmenting based on a user-drawn box. The bottom row uses their “everything" option to
segment every object it finds.

95

With the hover-click option, we see quite erratic performance. For the first patient, we attempted

to have it segment the spine and put a dot on each of the spinal vertebrae; it instead just segmented

out the entire patient outline. For the second patient, we wanted to segment out the patient’s right

lung but it segmented both lungs as well as a protruding region down the spine. Like the first

patient, the last patient just resulted in outlining the patient regardless of where dots were placed

within the lungs. The box-segmentation task was slightly better.

In the first patient we see a

mostly segmented region of the right lung. We see similar results in the second patient, where the

segmentation clearly avoids the heart. It struggles the most in the last patient, where it really has no

idea what to segment within the box. Finally, using their “everything" option which attempts to find

and segment every object within the image, we see quite varying results between the three patients.

The first patient is the only to have some of the lateral arcs of the ribs being highlighted as separate

objects. In the second patient, the only segmented regions are very roughly around the lungs as

well as the skull (disregarding the added anomalous square in the top right of the image). When

using an extreme case of an image where most of the image is blank, we see the model created

a segmentation of the skull, the top part of the torso, and the bottom torso. Across the example

patient images, we see that the segmentation performance of the thoracic cavity is no where near as

capable as the custom fine-tuned model we trained and discuss in Section 3.3 that we use for tight

cropping, masking for the image processing techniques, and in the region proposal two-stage work.

Additionally, with each of these patients containing rib fractures (see Figure 4.3 for locations of

fractures in each image), we only see overlap with the first patient with the lateral arcs of the ribs

being segmented out. However, this is clearly due to those regions presenting as much brighter,

separating them from the surrounding rib pixels rather than any indication that they may be rib

fractures. Once again, we see that, for now, a task as difficult as the detection of rib fractures on

pediatric radiographs, there is a need for high performing, customized, fine-tuned detector models.

That said, if current performance trends in AI continue, future foundation models may have a role

for this challenging task.

In conclusion, we demonstrated multiple methods that improve the sensitivity (i.e., recall)

96

performance of state-of-the-art object detectors on a custom curated dataset of pediatric chest

radiographs. We present developments for two-stage and one-stage objects detectors. The one-stage

methods performed much better than two-stage methods with innovations that include our novel

avalanche decision scheme as well as three methods of pre-processing the images. Additionally,

various ensembling approaches combined with the aforementioned techniques were investigated.

These techniques provided reduced precision with higher recall resulting in improvements in F2

score by extension. Simple ensembles, such as same-model three- and six-model ensembles, offered

straightforward improvements over single-model detectors. This is likely due to the enhanced

generalizability by training each ensemble member on different cross-validation folds of the training

and validation data sets. Interestingly, many of the best performing models utilized the blended

method (c) of pre-processing, where each channel of the input images was processed differently.

The method with the highest F2 score was an ensemble of three YOLOv5 models using the input (c)

pre-processing and with the 𝛾 = 0.20 avalanche scheme. This model achieved an F2 score of 0.725

± 0.012, which was only approximately 1% below the inter-reader variability of expert radiologists

of 0.732 and with a recall score exceeding the experts at 0.780 ± 0.030 versus 0.718. Overall, this

work demonstrates promising methods for high sensitivity rib fracture detection that could serve

as automatic approaches to augment the performance of radiologists performing pediatric chest

radiograph interpretation.

97

BIBLIOGRAPHY

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

F. Rosenblatt, “The perceptron: A probabilistic model for information storage and organiza-
tion in the brain,” Psychological Review, vol. 65, no. 6, pp. 386–408, 1958.

G. E. Hinton, “Rectified Linear Units Improve Restricted Boltzmann Machines Vinod Nair,”
2010.

D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-
propagating errors,” Nature, vol. 323, pp. 533–536, Oct. 1986.

H. Robbins and S. Monro, “A Stochastic Approximation Method,” The Annals of Mathemat-
ical Statistics, vol. 22, pp. 400–407, Sept. 1951.

D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” Jan. 2017.

D. Scherer, A. Müller, and S. Behnke, “Evaluation of Pooling Operations in Convolutional
Architectures for Object Recognition,” in Artificial Neural Networks – ICANN 2010 (K. Dia-
mantaras, W. Duch, and L. S. Iliadis, eds.), Lecture Notes in Computer Science, (Berlin,
Heidelberg), pp. 92–101, Springer, 2010.

Y.-L. Boureau, J. Ponce, and Y. LeCun, “A Theoretical Analysis of Feature Pooling in Visual
Recognition,” in Proceedings of the 27th International Conference on Machine Learning
(ICML-10), pp. 111–118, 2010.

A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Con-
volutional Neural Networks,” Advances in Neural Information Processing Systems, vol. 25,
2012.

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpa-
thy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual
Recognition Challenge,” arXiv:1409.0575 [cs], Jan. 2015.

[10] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to docu-

ment recognition,” Proceedings of the IEEE, vol. 86, pp. 2278–2324, Nov. 1998.

[11] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and
A. Rabinovich, “Going Deeper with Convolutions,” arXiv:1409.4842 [cs], Sept. 2014.

[12] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image

Recognition,” arXiv:1409.1556 [cs], Apr. 2015.

[13] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778,
June 2016.

[14] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated Residual Transformations for

Deep Neural Networks,” arXiv:1611.05431 [cs], Apr. 2017.

98

[15] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely Connected Convolu-

tional Networks,” arXiv:1608.06993 [cs], Jan. 2018.

[16] M. Tan and Q. V. Le, “EfficientNet: Rethinking Model Scaling for Convolutional Neural

Networks,” arXiv:1905.11946 [cs, stat], Sept. 2020.

[17] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie, “A ConvNet for the

2020s,” Mar. 2022.

[18] S.

Sarin,

(The Vanishing Gradient
ResNet
https://towardsdatascience.com/vggnet-vs-resnet-924e9573ca5c, Apr. 2020.

“VGGNet

vs

Problem).”

[19] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training Recurrent Neural

Networks,” arXiv:1211.5063 [cs], Feb. 2013.

[20]

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale
hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern
Recognition, pp. 248–255, June 2009.

[21] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich Feature Hierarchies for Accurate
Object Detection and Semantic Segmentation,” in 2014 IEEE Conference on Computer
Vision and Pattern Recognition, pp. 580–587, June 2014.

[22]

J. R. R. Uĳlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders, “Selective Search
for Object Recognition,” International Journal of Computer Vision, vol. 104, pp. 154–171,
Sept. 2013.

[23] R. Girshick, “Fast R-CNN,” arXiv:1504.08083 [cs], Sept. 2015.

[24] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards Real-Time Object Detection

with Region Proposal Networks,” arXiv:1506.01497 [cs], Jan. 2016.

[25] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” arXiv:1703.06870 [cs], Jan.

2018.

[26]

J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You Only Look Once: Unified,
Real-Time Object Detection,” in 2016 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pp. 779–788, June 2016.

[27] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD: Single

Shot MultiBox Detector,” arXiv:1512.02325 [cs], vol. 9905, pp. 21–37, 2016.

[28] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal Loss for Dense Object Detec-
tion,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, pp. 318–327,
Feb. 2020.

[29] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The Pascal
Visual Object Classes (VOC) Challenge,” International Journal of Computer Vision, vol. 88,
pp. 303–338, June 2010.

99

[30] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ra-
manan, C. L. Zitnick, and P. Dollár, “Microsoft COCO: Common Objects in Context,”
arXiv:1405.0312 [cs], Feb. 2015.

[31]

J. Redmon and A. Farhadi, “YOLOv3: An Incremental Improvement,” arXiv:1804.02767
[cs], Apr. 2018.

[32] M. Tan, R. Pang, and Q. V. Le, “EfficientDet: Scalable and Efficient Object Detection,”

arXiv:1911.09070 [cs, eess], July 2020.

[33] G. Jocher, A. Chaurasia, A. Stoken, J. Borovec, NanoCode012, Y. Kwon, K. Michael,
TaoXie, J. Fang, imyhxy, Lorna, Z. Yifu, C. Wong, A. V, D. Montes, Z. Wang, C. Fati,
J. Nadar, Laughing, UnglvKitDe, V. Sonck, tkianai, yxNONG, P. Skalski, A. Hogan, D. Nair,
M. Strobel, and M. Jain, “Ultralytics/yolov5: V7.0 - YOLOv5 SOTA realtime instance
segmentation,” Nov. 2022.

[34] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature Pyramid

Networks for Object Detection,” arXiv:1612.03144 [cs], Apr. 2017.

[35] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, “YOLOv4: Optimal Speed and Accuracy

of Object Detection,” arXiv:2004.10934 [cs, eess], Apr. 2020.

[36] M. Frid-Adar, E. Klang, M. Amitai, J. Goldberger, and H. Greenspan, “Synthetic data aug-
mentation using gan for improved liver lesion classification,” in 2018 IEEE 15th International
Symposium on Biomedical Imaging (ISBI 2018), pp. 289–293, 2018.

[37] H. Rashid, M. A. Tanveer, and H. Aqeel Khan, “Skin lesion classification using gan based
data augmentation,” in 2019 41st Annual International Conference of the IEEE Engineering
in Medicine and Biology Society (EMBC), pp. 916–919, 2019.

[38] C. Bowles, L. Chen, R. Guerrero, P. Bentley, R. Gunn, A. Hammers, D. A. Dickie, M. V.
Hernández, J. Wardlaw, and D. Rueckert, “Gan augmentation: Augmenting training data
using generative adversarial networks,” 2018.

[39] K. Saleh, S. Szénási, and Z. Vámossy, “Occlusion handling in generic object detection:
A review,” in 2021 IEEE 19th World Symposium on Applied Machine Intelligence and
Informatics (SAMI), pp. 000477–000484, 2021.

[40] T. C. Arlen, “Understanding the mAP Evaluation Metric for Object Detection,” Mar. 2018.

[41] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese, “Generalized Inter-
section over Union: A Metric and A Loss for Bounding Box Regression,” arXiv:1902.09630
[cs], Apr. 2019.

[42] L. R. Dice, “Measures of the Amount of Ecologic Association Between Species,” Ecology,

vol. 26, no. 3, pp. 297–302, 1945.

[43] D. P. Chakraborty and L. H. Winter, “Free-response methodology: Alternate analysis and a
new observer-performance experiment.,” Radiology, vol. 174, pp. 873–881, Mar. 1990.

100

[44] A. B. Rosenkrantz, D. R. Hughes, and R. Duszak, “The U.S. Radiologist Workforce: An
Analysis of Temporal and Geographic Variation by Using Large National Datasets,” Radiol-
ogy, vol. 279, pp. 175–184, Apr. 2016.

[45] D. P. Naidich, C. H. Marshall, C. Gribbin, R. S. Arams, and D. I. McCauley, “Low-dose CT
of the lungs: Preliminary observations.,” Radiology, vol. 175, pp. 729–731, June 1990.

[46] C. Rampinelli, D. Origgi, and M. Bellomi, “Low-dose CT: Technique, reading methods and

image interpretation,” Cancer Imaging, vol. 12, pp. 548–556, Feb. 2013.

[47] S. Rob, T. Bryant, I. Wilson, and B. K. Somani, “Ultra-low-dose, low-dose, and standard-
dose CT of the kidney, ureters, and bladder: Is there a difference? Results from a systematic
review of the literature,” Clinical Radiology, vol. 72, pp. 11–15, Jan. 2017.

[48] S. A. Khan, Y. Gulzar, S. Turaev, and Y. S. Peng, “A modified hsift descriptor for medical

image classification of anatomy objects,” Symmetry, vol. 13, no. 11, 2021.

[49]

J. Kissane, J. A. Neutze, and H. Singh, Conventional Radiology, pp. 15–20. Cham: Springer
International Publishing, 2020.

[50] R. S.

o. N. A. R.

a. A. C.

of Radiology

(ACR),

“Radiation Dose.”

https://www.radiologyinfo.org/en/info/safety-xray, Nov. 2022.

[51] R. Fazel, H. M. Krumholz, Y. Wang, J. S. Ross, J. Chen, H. H. Ting, N. D. Shah, K. Nasir,
A. J. Einstein, and B. K. Nallamothu, “Exposure to Low-Dose Ionizing Radiation from
Medical Imaging Procedures,” New England Journal of Medicine, vol. 361, pp. 849–857,
Aug. 2009.

[52] D. Frush, “The cumulative radiation dose paradigm in pediatric imaging,” The British

Journal of Radiology, vol. 94, p. 20210478, Oct. 2021.

[53] A. Yeung, “The ’as low as reasonably achievable’ (alara) principle: a brief historical overview
and a bibliometric analysis of the most cited publications,” Radioprotection, 2019.

[54] E. Bercovich and M. C. Javitt, “Medical Imaging: From Roentgen to the Digital Revolution,

and Beyond,” Rambam Maimonides Medical Journal, vol. 9, p. e0034, Oct. 2018.

[55] M. Larobina and L. Murino, “Medical Image File Formats,” Journal of Digital Imaging,

vol. 27, pp. 200–206, Apr. 2014.

[56] T. Ridnik, E. Ben-Baruch, A. Noy, and L. Zelnik-Manor, “ImageNet-21K Pretraining for the

Masses,” Aug. 2021.

[57] C. Sun, A. Shrivastava, S. Singh, and A. Gupta, “Revisiting Unreasonable Effectiveness of

Data in Deep Learning Era,” Aug. 2017.

[58] World Health Organization, Communicating Radiation Risks in Paediatric Imaging: Infor-
mation to Support Health Care Discussions about Benefit and Risk. Geneva: World Health
Organization, 2016.

101

[59] L. Iyeke, R. Moss, R. Hall, J. Wang, L. Sandhu, B. Appold, E. Kalontar, D. Menoudakos,
M. Ramnarine, S. P. LaVine, S. Ahn, and M. Richman, “Reducing Unnecessary ‘Admission’
Chest X-rays: An Initiative to Minimize Low-Value Care,” Cureus, vol. 14, p. e29817, Oct.
2022.

[60] H.

H.

Publishing,

“Radiation

risk

from

medical

imaging.”

https://www.health.harvard.edu/cancer/radiation-risk-from-medical-imaging, Sept. 2021.

[61]

J. Irvin, P. Rajpurkar, M. Ko, Y. Yu, S. Ciurea-Ilcus, C. Chute, H. Marklund, B. Haghgoo,
R. Ball, K. Shpanskaya, J. Seekins, D. A. Mong, S. S. Halabi, J. K. Sandberg, R. Jones,
D. B. Larson, C. P. Langlotz, B. N. Patel, M. P. Lungren, and A. Y. Ng, “CheXpert: A Large
Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison,” Jan. 2019.

[62] K. Yan, X. Wang, L. Lu, and R. M. Summers, “DeepLesion Automated Deep Mining,
Categorization and Detection of Significant Radiology Image Findings using Large-Scale
Clinical Lesion Annotations,” arXiv:1710.01766 [cs], Oct. 2017.

[63] M. P. McBee, O. A. Awan, A. T. Colucci, C. W. Ghobadi, N. Kadom, A. P. Kansagra,
S. Tridandapani, and W. F. Auffermann, “Deep Learning in Radiology,” Academic Radiology,
vol. 25, pp. 1472–1480, Nov. 2018.

[64] A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun,
“Dermatologist-level classification of skin cancer with deep neural networks,” Nature,
vol. 542, pp. 115–118, Feb. 2017.

[65] H. A. Haenssle, C. Fink, R. Schneiderbauer, F. Toberer, T. Buhl, A. Blum, A. Kalloo,
A. B. H. Hassen, L. Thomas, A. Enk, L. Uhlmann, C. Alt, M. Arenbergerova, R. Bakos,
A. Baltzer, I. Bertlich, A. Blum, T. Bokor-Billmann, J. Bowling, N. Braghiroli, R. Braun,
K. Buder-Bakhaya, T. Buhl, H. Cabo, L. Cabrĳan, N. Cevic, A. Classen, D. Deltgen,
C. Fink, I. Georgieva, L.-E. Hakim-Meibodi, S. Hanner, F. Hartmann, J. Hartmann, G. Haus,
E. Hoxha, R. Karls, H. Koga, J. Kreusch, A. Lallas, P. Majenka, A. Marghoob, C. Massone,
L. Mekokishvili, D. Mestel, V. Meyer, A. Neuberger, K. Nielsen, M. Oliviero, R. Pampena,
J. Paoli, E. Pawlik, B. Rao, A. Rendon, T. Russo, A. Sadek, K. Samhaber, R. Schneiderbauer,
A. Schweizer, F. Toberer, L. Trennheuser, L. Vlahova, A. Wald, J. Winkler, P. Wölbing, and
I. Zalaudek, “Man against machine: Diagnostic performance of a deep learning convolutional
neural network for dermoscopic melanoma recognition in comparison to 58 dermatologists,”
Annals of Oncology, vol. 29, pp. 1836–1842, Aug. 2018.

[66] L. Shen, L. R. Margolies, J. H. Rothstein, E. Fluder, R. McBride, and W. Sieh, “Deep
Learning to Improve Breast Cancer Detection on Screening Mammography,” Scientific
Reports, vol. 9, p. 12495, Aug. 2019.

[67] Q. Hu, H. M. Whitney, and M. L. Giger, “A deep learning methodology for improved breast
cancer diagnosis using multiparametric MRI,” Scientific Reports, vol. 10, p. 10536, June
2020.

102

[68] S. M. McKinney, M. Sieniek, V. Godbole, J. Godwin, N. Antropova, H. Ashrafian, T. Back,
M. Chesus, G. S. Corrado, A. Darzi, M. Etemadi, F. Garcia-Vicente, F. J. Gilbert, M. Halling-
Brown, D. Hassabis, S. Jansen, A. Karthikesalingam, C. J. Kelly, D. King, J. R. Ledsam,
D. Melnick, H. Mostofi, L. Peng, J. J. Reicher, B. Romera-Paredes, R. Sidebottom, M. Su-
leyman, D. Tse, K. C. Young, J. De Fauw, and S. Shetty, “International evaluation of an AI
system for breast cancer screening,” Nature, vol. 577, pp. 89–94, Jan. 2020.

[69] G. Holste, S. C. Partridge, H. Rahbar, D. Biswas, C. I. Lee, and A. M. Alessio, “End-to-End
Learning of Fused Image and Non-Image Features for Improved Breast Cancer Classification
from MRI,” in 2021 IEEE/CVF International Conference on Computer Vision Workshops
(ICCVW), (Montreal, BC, Canada), pp. 3287–3296, IEEE, Oct. 2021.

[70] M. A. Mazurowski, M. Buda, A. Saha, and M. R. Bashir, “Deep learning in radiology: An
overview of the concepts and a survey of the state of the art with focus on MRI,” Journal of
Magnetic Resonance Imaging, vol. 49, no. 4, pp. 939–954, 2019.

[71] B. Zhang, C. Jia, R. Wu, B. Lv, B. Li, F. Li, G. Du, Z. Sun, and X. Li, “Improving rib fracture
detection accuracy and reading efficiency with deep learning-based detection software: A
clinical evaluation,” The British Journal of Radiology, vol. 94, p. 20200870, Feb. 2021.

[72] M. Kaiume, S. Suzuki, K. Yasaka, H. Sugawara, Y. Shen, Y. Katada, T. Ishikawa, R. Fukui,
and O. Abe, “Rib fracture detection in computed tomography images using deep convolu-
tional neural networks,” Medicine, vol. 100, p. e26024, May 2021.

[73]

J. E. Burns, J. Yao, and R. M. Summers, “Artificial Intelligence in Musculoskeletal Imaging:
A Paradigm Shift,” Journal of Bone and Mineral Research, vol. 35, no. 1, pp. 28–35, 2020.

[74] L. Yao, X. Guan, X. Song, Y. Tan, C. Wang, C. Jin, M. Chen, H. Wang, and M. Zhang, “Rib
fracture detection system based on deep learning,” Scientific Reports, vol. 11, p. 23513, Dec.
2021.

[75]

J. Zhang, Z. Li, S. Yan, H. Cao, J. Liu, and D. Wei, “An Algorithm for Automatic Rib Fracture
Recognition Combined with nnU-Net and DenseNet,” Evidence-Based Complementary and
Alternative Medicine, vol. 2022, p. e5841451, Feb. 2022.

[76] L. Jin, J. Yang, K. Kuang, B. Ni, Y. Gao, Y. Sun, P. Gao, W. Ma, M. Tan, H. Kang, J. Chen,
and M. Li, “Deep-learning-assisted detection and segmentation of rib fractures from CT
scans: Development and validation of FracNet,” eBioMedicine, vol. 62, p. 103106, Dec.
2020.

[77] X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, and R. M. Summers, “ChestX-Ray8: Hospital-
Scale Chest X-Ray Database and Benchmarks on Weakly-Supervised Classification and
Localization of Common Thorax Diseases,” in 2017 IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), pp. 3462–3471, IEEE Computer Society, July 2017.

[78] Y.-X. Tang, Y.-B. Tang, Y. Peng, K. Yan, M. Bagheri, B. A. Redd, C. J. Brandon, Z. Lu,
M. Han, J. Xiao, and R. M. Summers, “Automated abnormality classification of chest
radiographs using deep convolutional neural networks,” npj Digital Medicine, vol. 3, pp. 1–
8, May 2020.

103

[79] R. Lindsey, A. Daluiski, S. Chopra, A. Lachapelle, M. Mozer, S. Sicular, D. Hanel, M. Gard-
ner, A. Gupta, R. Hotchkiss, and H. Potter, “Deep neural network improves fracture detection
by clinicians,” Proceedings of the National Academy of Sciences, vol. 115, pp. 11591–11596,
Nov. 2018.

[80] S. Anis, K. W. Lai, J. H. Chuah, S. M. Ali, H. Mohafez, M. Hadizadeh, D. Yan, and Z.-C.
Ong, “An Overview of Deep Learning Approaches in Chest Radiograph,” IEEE Access,
vol. 8, pp. 182347–182354, 2020.

[81] Y. Gao, H. Liu, L. Jiang, C. Yang, X. Yin, J.-L. Coatrieux, and Y. Chen, “CCE-Net: A rib
fracture diagnosis network based on contralateral, contextual, and edge enhanced modules,”
Biomedical Signal Processing and Control, vol. 75, p. 103620, May 2022.

[82] P. K. Kleinman, S. C. Marks, M. R. Spevak, and J. M. Richmond, “Fractures of the rib head

in abused infants.,” Radiology, vol. 185, pp. 119–123, Oct. 1992.

[83] T. R. Sanchez, H. Nguyen, W. Palacios, M. Doherty, and K. Coulter, “Retrospective eval-
uation and dating of non-accidental rib fractures in infants,” Clinical Radiology, vol. 68,
pp. e467–e471, Aug. 2013.

[84] R. M. Schwend, J. A. Schmidt, J. L. Reigrut, L. C. Blakemore, and B. A. Akbarnia, “Patterns
of Rib Growth in the Human Child,” Spine Deformity, vol. 3, pp. 297–302, July 2015.

[85] S. L. Wootton-Gorges, R. Stein-Wexler, J. W. Walton, A. J. Rosas, K. P. Coulter, and K. K.
Rogers, “Comparison of computed tomography and chest radiography in the detection of rib
fractures in abused infants,” Child Abuse & Neglect, vol. 32, pp. 659–663, June 2008.

[86]

J. S. Kondis, J. Muenzer, and J. D. Luhmann, “Missed Fractures in Infants Presenting to the
Emergency Department With Fussiness,” Pediatric Emergency Care, vol. 33, p. 538, Aug.
2017.

[87] A. Ghosh, D. Patton, S. Bose, M. K. Henry, M. Ouyang, H. Huang, A. Vossough, R. Sze,
S. Sotardi, and M. Francavilla, “A Patch-Based Deep Learning Approach for Detecting Rib
Fractures on Frontal Radiographs in Young Children,” Journal of Digital Imaging, Mar.
2023.

[88] C. Kelly, C. Street, and M. E. S. Building, “Child Maltreatment 2020,” Child Maltreatment,

p. 313, 2020.

[89] C. Kelly, C. Street, and M. E. S. Building, “Child Maltreatment 2019,” Child Maltreatment,

p. 306, 2019.

[90] P. McMahon, W. Grossman, M. Gaffney, and C. Stanitski, “Soft-tissue injury as an indication

of child abuse.,” JBJS, vol. 77, p. 1179, Aug. 1995.

[91] A. M. Kemp, F. Dunstan, S. Harrison, S. Morris, M. Mann, K. Rolfe, S. Datta, D. P. Thomas,
J. R. Sibert, and S. Maguire, “Patterns of skeletal fractures in child abuse: Systematic
review,” BMJ, vol. 337, p. a1518, Oct. 2008.

104

[92] S. E. Darling, S. L. Done, S. D. Friedman, and K. W. Feldman, “Frequency of intrathoracic
injuries in children younger than 3 years with rib fractures,” Pediatric Radiology, vol. 44,
pp. 1230–1236, Oct. 2014.

[93] B. Bulloch, C. J. Schubert, P. D. Brophy, N. Johnson, M. H. Reed, and R. A. Shapiro, “Cause
and Clinical Characteristics of Rib Fractures in Infants,” Pediatrics, vol. 105, pp. e48–e48,
Apr. 2000.

[94] N. K. Pandya, K. Baldwin, H. Wolfgruber, C. W. Christian, D. S. Drummond, and H. S.
Hosalkar, “Child Abuse and Orthopaedic Injury Patterns: Analysis at a Level I Pediatric
Trauma Center,” Journal of Pediatric Orthopaedics, vol. 29, p. 618, Sept. 2009.

[95] K. A. Barsness, E.-S. Cha, D. D. Bensard, C. M. Calkins, D. A. Partrick, F. M. Karrer, and
J. D. Strain, “The Positive Predictive Value of Rib Fractures as an Indicator of Nonaccidental
Trauma in Children,” Journal of Trauma and Acute Care Surgery, vol. 54, pp. 1107–1110,
June 2003.

[96] G. Rosenberg, A. K. Bryant, K. A. Davis, and K. M. Schuster, “No breakpoint for mortality
in pediatric rib fractures,” Journal of Trauma and Acute Care Surgery, vol. 80, p. 427, Mar.
2016.

[97] D. F. Merten, M. A. Radkowski, and J. C. Leonidas, “The abused child: A radiological

reappraisal.,” Radiology, vol. 146, pp. 377–381, Feb. 1983.

[98] S. Jodogne, “The Orthanc Ecosystem for Medical Imaging,” Journal of Digital Imaging,

vol. 31, pp. 341–352, June 2018.

[99] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional Networks for Biomedical

Image Segmentation,” May 2015.

[100] G. Holste, R. Sullivan, M. Bindschadler, N. Nagy, and A. Alessio, “Multi-class semantic
segmentation of pediatric chest radiographs,” in Medical Imaging 2020: Image Processing,
vol. 11313, pp. 323–330, SPIE, Mar. 2020.

[101] H. Huang, L. Lin, R. Tong, H. Hu, Q. Zhang, Y. Iwamoto, X. Han, Y.-W. Chen, and J. Wu,

“UNet 3+: A Full-Scale Connected UNet for Medical Image Segmentation,” Apr. 2020.

[102] I. Barber, J. M. Perez-Rossello, C. R. Wilson, and P. K. Kleinman, “The yield of high-
detail radiographic skeletal surveys in suspected infant abuse,” Pediatric Radiology, vol. 45,
pp. 69–80, Jan. 2015.

[103] S. Kriss, A. Thompson, G. Bertocci, M. Currie, and V. Martich, “Characteristics of rib
fractures in young abused children,” Pediatric Radiology, vol. 50, pp. 726–733, May 2020.

[104] V. K. Kundal, P. R. Debnath, A. K. Meena, S. Shah, P. Kumar, S. S. Sahu, and A. Sen,
“Pediatric Thoracoabdominal Trauma: Experience from a Tertiary Care Center,” Journal of
Indian Association of Pediatric Surgeons, vol. 24, no. 4, pp. 264–270, 2019.

105

[105] A. Tsai, J. Perez-Rossello, S. A. Connolly, K. Kawai, and P. K. Kleinman, “Temporal Pattern
of Radiographic Findings of Costochondral Junction Rib Fractures on Serial Skeletal Surveys
in Suspected Infant Abuse,” American Journal of Roentgenology, vol. 216, pp. 1649–1658,
June 2021.

[106] P. Worlock, M. Stower, and P. Barbor, “Patterns of fractures in accidental and non-accidental
injury in children: A comparative study.,” Br Med J (Clin Res Ed), vol. 293, pp. 100–102,
July 1986.

[107] C. W. Paine, O. Fakeye, C. W. Christian, and J. N. Wood, “Prevalence of Abuse Among
Young Children with Rib Fractures: A Systematic Review,” Pediatric emergency care,
vol. 35, pp. 96–103, Feb. 2019.

[108] H. Carty and A. Pierce, “Non-accidental injury: A retrospective analysis of a large cohort,”

European Radiology, vol. 12, pp. 2919–2925, Dec. 2002.

[109] R. T. Loder and J. R. Feinberg, “Orthopaedic Injuries in Children With Nonaccidental
Trauma: Demographics and Incidence From the 2000 Kids’ Inpatient Database,” Journal of
Pediatric Orthopaedics, vol. 27, p. 421, June 2007.

[110] H. J. Quiroz, J. J. Yoo, L. C. Casey, B. A. Willobee, A. R. Ferrantella, C. M. Thorson, E. A.
Perez, and J. E. Sola, “Can we increase detection? A nationwide analysis of age-related
fractures in child abuse,” Journal of Pediatric Surgery, vol. 56, pp. 153–158, Jan. 2021.

[111] E. G. Flaherty, J. M. Perez-Rossello, M. A. Levine, W. L. Hennrikus, and the AMERI-
CAN ACADEMY OF PEDIATRICS COMMITTEE ON CHILD ABUSE AND NEGLECT,
SECTION ON RADIOLOGY, SECTION ON ENDOCRINOLOGY, SECTION ON OR-
THOPAEDICS, the SOCIETY FOR PEDIATRIC RADIOLOGY, C. W. Christian, J. E.
Crawford-Jakubiak, E. G. Flaherty, J. M. Leventhal, J. L. Lukefahr, R. D. Sege, C. I. Cas-
sady, D. I. Bulas, J. A. Cassese, A. R. Mehollin-Ray, M.-G. Mercado-Deane, S. S. Milla,
I. N. Sills, C. A. Bloch, S. J. Casella, J. M. Lee, J. L. Lynch, K. A. Wintergerst, R. M.
Schwend, J. E. Gordon, N. Y. Otsuka, E. M. Raney, B. A. Shaw, B. G. Smith, L. Wells, and
P. W. Esposito, “Evaluating Children With Fractures for Child Physical Abuse,” Pediatrics,
vol. 133, pp. e477–e489, Feb. 2014.

[112] P. Jayakumar, M. Barry, and M. Ramachandran, “Orthopaedic aspects of paediatric non-
accidental injury,” The Journal of Bone and Joint Surgery. British volume, vol. 92-B, pp. 189–
195, Feb. 2010.

[113] H. A. Aboughalia, A.-V. Ngo, S. J. Menashe, H. H. Kim, and R. S. Iyer, “Pediatric rib
pathologies: Clinicoimaging scenarios and approach to diagnosis,” Pediatric Radiology,
vol. 51, pp. 1783–1797, Sept. 2021.

[114] Y. Henon, “pytorch-retinanet.” https://github.com/yhenon/pytorch-retinanet,

2021.

[115] M. Heidari, S. Mirniaharikandehei, A. Z. Khuzani, G. Danala, Y. Qiu, and B. Zheng,
“Improving the performance of CNN to predict the likelihood of COVID-19 using chest

106

X-ray images with preprocessing algorithms,” International Journal of Medical Informatics,
vol. 144, p. 104284, Dec. 2020.

[116] M. A. Ganaie, M. Hu, M. Tanveer*, and P. N. Suganthan*, “Ensemble deep learning: A

review,” arXiv:2104.02395 [cs], Apr. 2021.

[117] S. Qummar, F. G. Khan, S. Shah, A. Khan, S. Shamshirband, Z. U. Rehman, I. Ahmed Khan,
and W. Jadoon, “A Deep Learning Ensemble Approach for Diabetic Retinopathy Detection,”
IEEE Access, vol. 7, pp. 150530–150539, 2019.

[118] V. Gavrishchaka, Z. Yang, R. Miao, and O. Senyukova, “Advantages of Hybrid Deep Learning
Frameworks in Applications with Limited Data,” International Journal of Machine Learning
and Computing, vol. 8, no. 6, p. 11, 2018.

[119] B. Lakshminarayanan, A. Pritzel, and C. Blundell, “Simple and Scalable Predictive Uncer-
tainty Estimation using Deep Ensembles,” arXiv:1612.01474 [cs, stat], Nov. 2017.

[120] J. Burkow, G. Holste, J. Otjen, F. Perez, J. Junewick, and A. Alessio, “Avalanche decision
schemes to improve pediatric rib fracture detection,” in Medical Imaging 2022: Computer-
Aided Diagnosis, vol. 12033, pp. 611–618, SPIE, Apr. 2022.

[121] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville,
and Y. Bengio, “Generative Adversarial Networks,” arXiv:1406.2661 [cs, stat], June 2014.

[122] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-Image Translation with Conditional

Adversarial Networks,” Nov. 2018.

[123] T. Karras, S. Laine, and T. Aila, “A Style-Based Generator Architecture for Generative

Adversarial Networks,” arXiv:1812.04948 [cs, stat], Mar. 2019.

[124] Q.-Q. Zhou, J. Wang, W. Tang, Z.-C. Hu, Z.-Y. Xia, X.-S. Li, R. Zhang, X. Yin, B. Zhang, and
H. Zhang, “Automatic Detection and Classification of Rib Fractures on Thoracic CT Using
Convolutional Neural Network: Accuracy and Feasibility,” Korean Journal of Radiology,
vol. 21, pp. 869–879, July 2020.

[125] V. Sorin, Y. Barash, E. Konen, and E. Klang, “Creating Artificial Images for Radiology
Applications Using Generative Adversarial Networks (GANs) – A Systematic Review,”
Academic Radiology, vol. 27, pp. 1175–1185, Aug. 2020.

[126] E. Tu, MACHINE LEARNED DATA AUGMENTATION TECHNIQUES FOR IMPROVING
PATHOLOGY OBJECT DETECTION. PhD thesis, Michigan State University, 2023.

[127] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “GANs Trained by
a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium,” Jan. 2018.

[128] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bern-
stein, J. Bohg, A. Bosselut, E. Brunskill, E. Brynjolfsson, S. Buch, D. Card, R. Castellon,
N. Chatterji, A. Chen, K. Creel, J. Q. Davis, D. Demszky, C. Donahue, M. Doumbouya,

107

E. Durmus, S. Ermon, J. Etchemendy, K. Ethayarajh, L. Fei-Fei, C. Finn, T. Gale, L. Gille-
spie, K. Goel, N. Goodman, S. Grossman, N. Guha, T. Hashimoto, P. Henderson, J. Hewitt,
D. E. Ho, J. Hong, K. Hsu, J. Huang, T. Icard, S. Jain, D. Jurafsky, P. Kalluri, S. Karamcheti,
G. Keeling, F. Khani, O. Khattab, P. W. Koh, M. Krass, R. Krishna, R. Kuditipudi, A. Kumar,
F. Ladhak, M. Lee, T. Lee, J. Leskovec, I. Levent, X. L. Li, X. Li, T. Ma, A. Malik, C. D.
Manning, S. Mirchandani, E. Mitchell, Z. Munyikwa, S. Nair, A. Narayan, D. Narayanan,
B. Newman, A. Nie, J. C. Niebles, H. Nilforoshan, J. Nyarko, G. Ogut, L. Orr, I. Papadim-
itriou, J. S. Park, C. Piech, E. Portelance, C. Potts, A. Raghunathan, R. Reich, H. Ren,
F. Rong, Y. Roohani, C. Ruiz, J. Ryan, C. Ré, D. Sadigh, S. Sagawa, K. Santhanam, A. Shih,
K. Srinivasan, A. Tamkin, R. Taori, A. W. Thomas, F. Tramèr, R. E. Wang, W. Wang, B. Wu,
J. Wu, Y. Wu, S. M. Xie, M. Yasunaga, J. You, M. Zaharia, M. Zhang, T. Zhang, X. Zhang,
Y. Zhang, L. Zheng, K. Zhou, and P. Liang, “On the Opportunities and Risks of Foundation
Models,” July 2022.

[129] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirec-

tional Transformers for Language Understanding,” May 2019.

[130] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan,
P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan,
R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler,
M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever,
and D. Amodei, “Language Models are Few-Shot Learners,” July 2020.

[131] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever,

“Zero-Shot Text-to-Image Generation,” Feb. 2021.

[132] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-Resolution Image

Synthesis with Latent Diffusion Models,” Apr. 2022.

[133] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead,

A. C. Berg, W.-Y. Lo, P. Dollár, and R. Girshick, “Segment Anything,” Apr. 2023.

108

APPENDIX

SUPPLEMENTAL TABLES

The following tables are the full versions of Table 4.3 and Table 4.4 discussed in Section 4.7,

containing all variations of ensembles and image processing methods.

Table A.1 Full set of results from Table 4.3. Precision, recall, and F2 values across all models were
measured using an IOU threshold of 0.30 with ground truth. mAP values were calculated using
IOU values from 0.25 to 0.75. Bolded values represent the top two scores for each ensemble size
and metric. Superscripts 𝑎, 𝑏, and 𝑐 represent the type of input processing to train the models as
described in Section 4.4. Ensembles with * have hybrid inputs, i.e., each ensemble member was
trained on a different input processing method.

Models
1x-Ra
1x-Rb
1x-Rc
1x-Ya
1x-Yb
1x-Yc
2x-Ra
2x-Rb
2x-Rc
2x-Ya
2x-Yb
2x-Yc
3x-Ra
3x-Rb
3x-Rc
3x-R*
3x-Ya
3x-Yb
3x-Yc
3x-Y*
6x-Ra
6x-Rb
6x-Rc
6x-Ya
6x-Yb
6x-Yc
3x-R*+3x-Y*

Precision

Recall

F2

Max F2

mAP

0.892 ± 0.015
0.430 ± 0.014
0.852 ± 0.015
0.344 ± 0.027
0.425 ± 0.027
0.883 ± 0.013
0.897 ± 0.032 0.434 ± 0.040 0.484 ± 0.037
0.872 ± 0.060
0.365 ± 0.050
0.320 ± 0.049
0.880 ± 0.024 0.464 ± 0.043 0.512 ± 0.041

0.630 ± 0.011
0.480 ± 0.014
0.390 ± 0.028
0.570 ± 0.015
0.474 ± 0.027 0.635 ± 0.012

0.480 ± 0.008
0.395 ± 0.010
0.473 ± 0.009
0.590 ± 0.062 0.563 ± 0.009
0.539 ± 0.077
0.462 ± 0.035
0.644 ± 0.046 0.555 ± 0.016

0.488 ± 0.014
0.408 ± 0.026
0.487 ± 0.025

0.495 ± 0.005
0.534 ± 0.013
0.859 ± 0.013
0.412 ± 0.009
0.453 ± 0.025
0.803 ± 0.014
0.488 ± 0.007
0.532 ± 0.023
0.847 ± 0.012
0.862 ± 0.032 0.523 ± 0.027 0.567 ± 0.024 0.672 ± 0.022 0.571 ± 0.015
0.610 ± 0.020
0.820 ± 0.062
0.468 ± 0.025
0.459 ± 0.042
0.414 ± 0.045
0.690 ± 0.012 0.562 ± 0.012
0.835 ± 0.017 0.558 ± 0.024 0.597 ± 0.021

0.651 ± 0.011
0.587 ± 0.011
0.652 ± 0.010

0.559 ± 0.011
0.492 ± 0.014
0.567 ± 0.014
0.563 ± 0.013

0.516 ± 0.012
0.500 ± 0.005
0.839 ± 0.012
0.452 ± 0.015
0.415 ± 0.008
0.769 ± 0.013
0.527 ± 0.015
0.494 ± 0.005
0.817 ± 0.015
0.493 ± 0.008
0.523 ± 0.014
0.812 ± 0.011
0.548 ± 0.024 0.589 ± 0.020 0.688 ± 0.016 0.572 ± 0.016
0.843 ± 0.029
0.623 ± 0.010
0.803 ± 0.050
0.467 ± 0.017
0.494 ± 0.030
0.451 ± 0.035
0.694 ± 0.008 0.559 ± 0.006
0.814 ± 0.017 0.599 ± 0.021 0.633 ± 0.018
0.543 ± 0.014
0.685 ± 0.011
0.598 ± 0.027
0.561 ± 0.033
0.817 ± 0.037

0.658 ± 0.007
0.598 ± 0.006
0.661 ± 0.006
0.649 ± 0.006

0.558 ± 0.007
0.501 ± 0.008
0.571 ± 0.009
0.609 ± 0.015
0.533 ± 0.021

0.507 ± 0.005
0.659 ± 0.006
0.593 ± 0.006
0.798 ± 0.008
0.423 ± 0.007
0.603 ± 0.004
0.532 ± 0.008
0.710 ± 0.007
0.601 ± 0.008
0.499 ± 0.004
0.666 ± 0.004
0.762 ± 0.008
0.638 ± 0.012 0.705 ± 0.003 0.568 ± 0.008
0.785 ± 0.017
0.725 ± 0.036
0.470 ± 0.172
0.635 ± 0.007
0.562 ± 0.016
0.756 ± 0.010 0.653 ± 0.008 0.671 ± 0.007 0.699 ± 0.006 0.555 ± 0.004
0.522 ± 0.004
0.752 ± 0.019

0.625 ± 0.021 0.647 ± 0.017

0.686 ± 0.008

109

Table A.2 Full set of results from Table 4.4. Precision, recall, and F2 values across all models were
measured using an IOU threshold of 0.30 with ground truth. mAP values were calculated using
IOU values from 0.25 to 0.75. Bolded values represent the top two scores for each ensemble size
and metric. Superscripts 𝑎, 𝑏, and 𝑐 represent the type of input processing to train the models as
described in Section 4.4. Ensembles with * have hybrid inputs, i.e., each ensemble member was
trained on a different input processing method.

Avalanche

Conservative
Conservative
Conservative
Posterior
Posterior
Posterior

Conservative
Conservative
Conservative
Posterior
Conservative
𝛾 = 0.20

Models
1x-Ra
1x-Rb
1x-Rc
1x-Ya
1x-Yb
1x-Yc
2x-Ra
2x-Rb
2x-Rc
2x-Ya
2x-Yb
2x-Yc
3x-Ra
3x-Rb
3x-Rc
3x-R*
3x-Ya
3x-Yb
3x-Yc
3x-Y*
6x-Ra
Conservative
6x-Rb
Conservative
6x-Rc
Conservative
𝛾 = 0.20
6x-Ya
6x-Yb
Conservative
6x-Yc
Conservative
3x-R*+3x-Y* Conservative

Conservative
Conservative
Conservative
Conservative
Posterior
Conservative
𝛾 = 0.20
Conservative

Precision

Recall

F2

Max F2

0.679 ± 0.010
0.897 ± 0.054
0.530 ± 0.023 0.730 ± 0.015
0.469 ± 0.030
0.634 ± 0.018 0.910 ± 0.067
0.695 ± 0.016
0.529 ± 0.026 0.745 ± 0.012 0.688 ± 0.012 0.908 ± 0.046
0.777 ± 0.045
0.759 ± 0.164
0.776 ± 0.090
0.581 ± 0.206
0.812 ± 0.056
0.645 ± 0.130

0.652 ± 0.051
0.647 ± 0.101
0.673 ± 0.141
0.620 ± 0.074
0.724 ± 0.085 0.695 ± 0.041

0.438 ± 0.011 0.768 ± 0.012
0.376 ± 0.017
0.743 ± 0.011
0.433 ± 0.021 0.779 ± 0.010
0.634 ± 0.160
0.582 ± 0.103
0.642 ± 0.065

0.736 ± 0.083 0.697 ± 0.028
0.648 ± 0.033
0.676 ± 0.060
0.720 ± 0.018
0.746 ± 0.038

0.667 ± 0.009 0.909 ± 0.033
0.621 ± 0.013
0.883 ± 0.040
0.671 ± 0.012 0.909 ± 0.031
0.786 ± 0.034
0.797 ± 0.081
0.807 ± 0.044

0.791 ± 0.010
0.394 ± 0.009
0.331 ± 0.010
0.767 ± 0.012
0.385 ± 0.012 0.797 ± 0.008
0.340 ± 0.013 0.809 ± 0.009
0.776 ± 0.056
0.558 ± 0.119
0.728 ± 0.043
0.523 ± 0.080
0.780 ± 0.030
0.573 ± 0.058
0.758 ± 0.028
0.590 ± 0.069

0.895 ± 0.041
0.658 ± 0.007
0.607 ± 0.010
0.886 ± 0.034
0.656 ± 0.008 0.916 ± 0.028
0.634 ± 0.011 0.898 ± 0.026
0.786 ± 0.029
0.710 ± 0.019
0.823 ± 0.089
0.670 ± 0.016
0.812 ± 0.036
0.725 ± 0.012
0.802 ± 0.040
0.714 ± 0.014

0.814 ± 0.006
0.320 ± 0.004
0.802 ± 0.007
0.258 ± 0.005
0.311 ± 0.005 0.816 ± 0.005
0.795 ± 0.022 0.723 ± 0.010
0.536 ± 0.044
0.664 ± 0.013
0.791 ± 0.010
0.405 ± 0.028
0.715 ± 0.005
0.508 ± 0.020
0.797 ± 0.010
0.630 ± 0.011
0.314 ± 0.015 0.841 ± 0.014

0.622 ± 0.006 0.867 ± 0.026
0.850 ± 0.023
0.564 ± 0.007
0.616 ± 0.005 0.912 ± 0.025
0.797 ± 0.016
0.851 ± 0.030
0.817 ± 0.026
0.833 ± 0.052

110