LEARNING THE INTRINSIC DIMENSION OF COMPLEX LARGE DATASETS

By Joseph M. Weaver

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Statistics - Master of Science

2024

ABSTRACT

Bayesian methods are used to estimate intrinsic dimensionality (ID) and to perform dimensionality reduction for a large, complex dataset. Using 130,000 images over 100 categories, we develop a process that reduces the dimensionality to a very small size while preserving the ability to classify the images. The novelty of our approach is two-fold: 1) the 2NN (two nearest neighbor) estimate of the target dimensionality is used as a prior distribution, and 2) the overall mapping is generated by a Bayesian neural network, which then selects an appropriate intrinsic dimension based on the prior, variational inference, and multidimensional scaling.

TABLE OF CONTENTS

1 Introduction
2 Method/Approach
  2.1 Bayesian Prior for ID Estimation
  2.2 Ensemble
  2.3 Bayesian Neural Network
  2.4 Optimization
  2.5 Accuracy
3 Theory
  3.1 Multidimensional Scaling
  3.2 Variational Inference
4 Numerical Studies
  4.1 Dataset (ImageNet100)
  4.2 TwoNN
  4.3 DeepMDS vs. BNN
  4.4 BNN with Prior ID Estimate
5 Conclusions
  5.1 Recreation of Baseline DeepMDS Version
  5.2 Bayesian Version
  5.3 BNN with Prior ID Estimate
6 Future Work
  6.1 Transfer Learning
  6.2 ImageNet 1000
  6.3 Asymptotic Maximum Intrinsic Dimension
REFERENCES

1 Introduction

Intrinsic dimensionality (ID), as coined by [Bennett, R. S. 1965], defines a lower bound on the number of free parameters capable of producing a close approximation of an originating signal or dataset. Instead of preserving all information, we discard superfluous information, preserving only what is needed for classification. As such, we define the intrinsic dimensionality as the number of parameters $d$ of the lowest $d$-dimensional space in which some function of the original $p$-dimensional space is mostly preserved, with $d \ll p$. Finding such a manifold is motivated by exponentially increasing data collection and the need to categorize it. Our attention is directed toward image data, where images are quantified in megapixels, or, statistically, millions of parameters. This leads to a classical large-$p$, small-$n$ problem, which is computationally costly and leads to model complexity.
Naturally, this drives the need to reduce the $p$ dimensions to a much smaller $d$-dimensional manifold.

Figure 1: Mapping of Raw Images to Intrinsic Manifold

For this reason, many techniques have been proposed to reduce dimensionality. Our project focuses on the branch of dimensionality reduction dealing with nonlinear projections to a lower-dimensional manifold, similar to multidimensional scaling. It closely follows two works: 1) On the Intrinsic Dimensionality of Image Representation [Gong, Boddeti, and Jain 2019], and 2) Variational Bayes Ensemble Learning Neural Networks With Compressed Feature Space [Liu, Bhattacharya, and Maiti 2021].

[Gong, Boddeti, and Jain 2019] proposed a dimensionality reduction process that first estimates the intrinsic dimension using a K-nearest neighbor approach and the Intrinsic Dimensionality Estimation Algorithm (IDEA). Second, they create a map to this intrinsic dimension using a deep neural network they call DeepMDS. Their process maintained a substantially higher degree of accuracy than PCA, Isomap, and denoising autoencoders when reducing to very small dimensions. The important result from [Gong, Boddeti, and Jain 2019] is a reduction of the ImageNet/ResNet34 512-dimensional image representation down to an intrinsic dimension of 19, with a loss of accuracy of 6% (39% down to 33%). As our first objective, we recreate this experiment as a baseline.

In [Liu, Bhattacharya, and Maiti 2021], the authors propose a variational inference-based approach with Bayesian model averaging, in the context of a Bayesian neural network, to improve upon the earlier work. This showed remarkable improvements on synthetic datasets such as hyperspheres and Swiss rolls, and on a real dataset of handwritten digits, MNIST. Given this success, our second objective is to test this process on a larger dataset, namely the ImageNet/ResNet34 512-dimensional image representation, and compare the results to the baseline described above.

Finally, we propose adding the 2NN estimation technique described in [Gong, Boddeti, and Jain 2019] as a prior distribution to the Bayesian neural network described in [Liu, Bhattacharya, and Maiti 2021]. Thus, as a third objective, we seek to obtain an intrinsic dimensionality estimate of the ImageNet data with quantified uncertainty.

Altogether, this study 1) recreates the experiment described in [Gong, Boddeti, and Jain 2019], 2) demonstrates the effectiveness of a BNN for dimensionality reduction at scale, and 3) quantifies the uncertainty of the intrinsic dimension using a 2NN prior.

2 Method/Approach

To implement our novel approach, we first estimate the intrinsic dimension using the two nearest neighbor (TwoNN) method, then train our map from ambient space to intrinsic space, using the TwoNN information as a prior to quantify the uncertainty of each intrinsic dimension.

2.1 Bayesian Prior for ID Estimation

To estimate the intrinsic dimension, we use a simple Bayesian formulation
$$p(d \mid \mu) \propto p(\mu \mid d)\, p(d),$$
where $\mu \in \mathbb{R}^n$ is the vector of two-nearest-neighbor ratios and $d$ is the intrinsic dimension. The likelihood $p(\mu \mid d)$ is that of the Pareto$(1, d)$ distribution,
$$p(\mu \mid d) = \prod_{i=1}^{n} \frac{d}{\mu_i^{d+1}} = d^n e^{-(d+1)v}, \qquad v = \sum_{i=1}^{n} \ln(\mu_i).$$
The prior $p(d) \sim \mathrm{Gamma}(a, b)$ is a minimally informative conjugate prior. Then
$$p(\mu \mid d)\, p(d) = d^n e^{-(d+1)v} \times \frac{b^a}{\Gamma(a)} d^{a-1} e^{-bd} = c\, d^{n+a-1} e^{-(b+v)d}, \qquad c = \frac{b^a}{\Gamma(a)} e^{-v},$$
which implies that $p(d \mid \mu) \sim \mathrm{Gamma}(a + n,\, b + v)$.
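For illustration, the following minimal sketch (Python, using numpy and scipy) carries out this conjugate update. It assumes the ratio vector $\mu$, computed as described in the next subsection, is already available, and the prior hyperparameters a and b shown here are arbitrary placeholders rather than values used in this study.

```python
import numpy as np
from scipy import stats

def id_posterior(mu, a=1.0, b=0.1):
    """Gamma posterior for the intrinsic dimension d under a Pareto(1, d) likelihood."""
    mu = np.asarray(mu, dtype=float)
    n = mu.size
    v = np.sum(np.log(mu))                        # v = sum_i ln(mu_i)
    shape, rate = a + n, b + v                    # posterior is Gamma(a + n, b + v)
    post = stats.gamma(shape, scale=1.0 / rate)   # scipy parameterizes by scale = 1 / rate
    return post, shape / rate, shape / rate ** 2  # distribution, mean, variance

# Example with ratios simulated from a Pareto(1, d) law with d = 20
rng = np.random.default_rng(0)
mu = (1.0 - rng.random(5000)) ** (-1.0 / 20.0)    # inverse-CDF sampling
post, mean, var = id_posterior(mu)
print(mean, var, post.interval(0.95))
```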
2.1.1 Two Nearest Neighbor Estimation of ID

Consider data $X \in \mathbb{R}^{n \times p}$, and let $X_i$ denote the $i$th row vector. Let $D \in \mathbb{R}^{n \times n}$ be the distance matrix with elements $d_{ij} = \|X_i - X_j\|_2$, the Euclidean distance between the $i$th and $j$th rows of $X$. Let $r_{i,1}, r_{i,2}$ be the two smallest distances (i.e., the two nearest neighbors) in the $i$th row of $D$, excluding $d_{ii} = 0$, and let $\mu_i = r_{i,2}/r_{i,1}$ be the ratio of the two nearest neighbor distances for the $i$th row of $X$. Then $\mu_i \sim \mathrm{Pareto}(1, d)$, for which the unbiased MLE of $d$ is
$$\hat{d} = \frac{n-1}{\sum_{i=1}^{n} \ln(\mu_i)}$$
[Allegra, et al. 2020].

2.1.2 Derivation of the Pareto Shape Parameter MLE

Due to conflicting material on the unbiased estimator of the Pareto shape parameter, we verify the correct estimator. Consider maximization with respect to $d$ of the log-likelihood of $\mu_i \sim \mathrm{Pareto}(1, d)$:
$$\ell(d, \mu) = \ln \prod_{i=1}^{n} \frac{d}{\mu_i^{d+1}} = n \ln(d) - (d+1) \sum_{i=1}^{n} \ln(\mu_i),$$
$$\frac{\partial \ell}{\partial d} = \frac{n}{d} - \sum_{i=1}^{n} \ln(\mu_i) = 0,$$
which gives the MLE $\tilde{d} = n / \sum_{i=1}^{n} \ln(\mu_i)$. According to [Rytgaard 1990], $E(\tilde{d}) = \frac{n}{n-1} d$. Correcting for this bias, and in agreement with [Denti, et al. 2022], the unbiased estimator of $d$ is
$$\hat{d} = \frac{n-1}{\sum_{i=1}^{n} \ln(\mu_i)}.$$
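As a small, self-contained illustration of this estimator (not our production code), the sketch below computes the ratios $\mu_i$ and the bias-corrected MLE for a dataset small enough that the full distance matrix fits in memory; the toy data and function name are assumptions for demonstration only.

```python
import numpy as np
from scipy.spatial.distance import cdist

def twonn_mle(X):
    """Bias-corrected Pareto MLE of the intrinsic dimension via TwoNN ratios."""
    n = X.shape[0]
    D = cdist(X, X)                        # full n x n Euclidean distance matrix
    np.fill_diagonal(D, np.inf)            # exclude d_ii = 0
    r = np.sort(D, axis=1)[:, :2]          # r_{i,1} and r_{i,2} for each row
    mu = r[:, 1] / r[:, 0]                 # two-nearest-neighbor ratios
    d_hat = (n - 1) / np.sum(np.log(mu))   # bias-corrected MLE of d
    return d_hat, mu

# Toy check: points on a 5-dimensional subspace embedded in R^50
rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 5)) @ rng.standard_normal((5, 50))
d_hat, mu = twonn_mle(X)
print(d_hat)   # should be close to 5
```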
2.2 Ensemble

To create the map from the high-dimensional space, we use a sequence of three neural networks with decreasing input and output dimensions, starting from an input dimension of 512 and reducing down to 256 and then 128. At the last stage, we create a network that reduces from 128 to every dimension between 1 and 64, as diagrammed in Figure 2.

Figure 2: Hierarchy of models

2.3 Bayesian Neural Network

Largely based on [Gong, Boddeti, and Jain 2019], [Liu, Bhattacharya, and Maiti 2021], and [Bishop 2006], the following is a mathematical formulation of the Bayesian neural network (BNN) used in this study.

Given data $D = [X \mid Y]$, the image categories are $Y \in \{1, 2, \ldots, k\}^n$ and the image representations are $X \in \mathbb{R}^{n \times p}$, where each row vector $X_i$ is assumed to be generated from a Poisson point process. Let $l \in \{0, 1, \ldots, L\}$ index the layers of the network, where $l = 0$ is the input layer and $l = L$ is the output layer. Let

- $p^{(l)} \in \mathbb{N}$ be the dimension of layer $l$,
- $a^{(l)} \in \mathbb{R}^{p^{(l)}}$ be the $p^{(l)}$-dimensional output vector of layer $l$,
- $W^{(l)} \in \mathbb{R}^{p^{(l)} \times p^{(l-1)}}$, with $W^{(l)} \sim N(\mu_W, \sigma_W^2)$, be the Bayesian weight matrix for layers $l = 1, 2, \ldots, L$, where
  - $\mu_W \in \mathbb{R}^{p^{(l)} \times p^{(l-1)}}$ are the parameters representing the means of the weights,
  - $\rho_W \in \mathbb{R}^{p^{(l)} \times p^{(l-1)}}$ are unconstrained parameters for the transformed standard deviation $\sigma_W = \log(1 + e^{\rho_W})$,
- $b^{(l)} \in \mathbb{R}^{p^{(l)}}$, with $b^{(l)} \sim N(\mu_v, \sigma_v^2)$, be the Bayesian bias vector of layers $l = 1, 2, \ldots, L$, where
  - $\mu_v \in \mathbb{R}^{p^{(l)}}$ is the mean of the bias,
  - $\rho_v \in \mathbb{R}^{p^{(l)}}$ is the transformed standard deviation parameter of the bias, $\sigma_v = \log(1 + e^{\rho_v})$,
- $f^{(l)}_{\mathrm{LReLU}} : \mathbb{R}^{p^{(l)}} \to \mathbb{R}^{p^{(l)}}$ be the LeakyReLU activation function of layer $l = 1, 2, \ldots, L-1$.

Then the general form of our model is
$$a^{(0)} = X, \qquad a^{(l)} = f^{(l)}_{\mathrm{LReLU}}\big(W^{(l)} a^{(l-1)} + b^{(l)}\big), \quad l = 1, 2, \ldots, L-1, \qquad Z = W^{(L)} a^{(L-1)} + b^{(L)},$$
where $Z$ is the estimated reduced-dimensionality representation.

In a Bayesian neural network, parameters are represented by distributions, and during the feed-forward pass each weight is sampled via the reparameterization $W^{(l)} = \mu_W + z \odot \log(1 + e^{\rho_W})$, where $z \sim MVN(0, I)$ is randomly generated. The bias vector $b^{(l)} = \mu_v + z \odot \log(1 + e^{\rho_v})$ is evaluated similarly.

To complete the description of the BNN, we define two different types of intermediate layers, Linear and BatchNorm. The layers described above are in fact a combination of a Linear layer followed by a BatchNorm layer, which is then fed into the activation function. Both Linear and BatchNorm layers have Bayesian parameters, but the latter also aggregates its inputs, defining $\mu = E(a^{(l-1)})$ and $\Sigma = \mathrm{Var}(a^{(l-1)})$, and normalizes them as follows:
$$a^{(l)} = W^{(l)} \frac{a^{(l-1)} - \mu}{\sqrt{\Sigma}} + b^{(l)}.$$
Finally, we implement a skip connection that forwards the output of the first two layers to the last two layers to enhance the training of the input layers. Altogether this results in the network diagrammed in Figure 3.

Figure 3: Bayesian NN Architecture

2.3.1 Loss Function

In training our neural network, the loss function uses the evidence lower bound (ELBO) [Liu, Bhattacharya, and Maiti 2021]. This loss is
$$\mathrm{Loss} = KL + LL,$$
$$KL = \sum_{l=1}^{L} \left[ \sum_{i=1}^{p^{(l)}} \sum_{j=1}^{p^{(l-1)}} KL\big(W^{(l)}_{ij} \,\|\, \phi\big) + \sum_{i=1}^{p^{(l)}} KL\big(b^{(l)}_i \,\|\, \phi\big) \right],$$
where $W^{(l)}_{ij}$ is the $ij$th weight random parameter of layer $l$, $b^{(l)}_i$ is the $i$th bias random parameter of layer $l$, and $\phi \sim N(0, \sigma_0^2)$ is the prior. For normal densities the KL divergence reduces to
$$KL(P \,\|\, Q) = \ln\!\left(\frac{\sigma_Q}{\sigma_P}\right) + \frac{\sigma_P^2 + (\mu_P - \mu_Q)^2}{2\sigma_Q^2} - \frac{1}{2}.$$
The likelihood term is
$$LL = \sum_{i=1}^{n} \sum_{j=1}^{n} \Big[ \|X_i - X_j\|_2 - \|Z_i - Z_j\|_2 \Big]^2,$$
where $X_i$ is the $i$th input image representation and $Z_i$ is the associated reduced image representation.

2.4 Optimization

Our model training follows the classical feed-forward, back-propagation approach for neural networks, using the AdamW optimizer. The process first initializes the parameters of the Bayesian weights and biases. Then, in successive steps, we feed forward a batch of input image representations, producing lower-dimensional image representations. We then calculate the loss gradient $\nabla f$ and back-propagate, updating the parameters of the Bayesian weights and biases.

2.4.1 Initialization

For each element $W^{(l)}_{ij}$, let $\mu^{(l)}_{ij}$ be randomly generated from $\mathrm{uniform}(-0.6, 0.6)$ and let $\Sigma^{(l)}_{ij} = \log(1 + e^{-6})$, and likewise for each element $b^{(l)}_i$, for $l = 1, 2, \ldots, L$.

2.4.2 AdamW

AdamW optimization [Loshchilov and Hutter 2017] is a variant of gradient descent that incorporates adaptive learning rates, momentum, and weight decay, and is specially designed for training deep neural networks. Let $\beta_1 = 0.9$ and $\beta_2 = 0.999$ control how much of the first two moments is retained between steps. Let $\lambda = 1\mathrm{e}{-4}$ be the weight decay, $\gamma_0 = 3\mathrm{e}{-4}$ the initial learning rate, $\epsilon = 1\mathrm{e}{-8}$ a small adjustment to prevent division by zero, and $m_0 = 0$, $v_0 = 0$ the initial values of the first two moments.

For $t = 1, 2, \ldots$, apply the following update rule until there is no significant change in loss:
$$\theta_t = \theta_{t-1} - \gamma_t \lambda \theta_{t-1} - \gamma_t \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon},$$
where the term $-\gamma_t \lambda \theta_{t-1}$ implements the weight decay, and the bias-corrected moments are
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\nabla f(\theta_{t-1}), \qquad \hat{m}_t = \frac{m_t}{1 - \beta_1^t},$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\big(\nabla f(\theta_{t-1})\big)^2, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}.$$
Finally, a variable learning rate, cosine annealing, is implemented as
$$\gamma_t = \gamma_{\min} + \frac{1}{2}(\gamma_{\max} - \gamma_{\min})\big(1 + \cos(\pi t / t_{\max})\big).$$
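For concreteness, the sketch below shows how this optimizer and schedule might be configured in PyTorch with the hyperparameter values listed above. The small deterministic network, random data, and simple distance-matching loss are placeholders standing in for the Bayesian model of Section 2.3 and the ELBO loss of Section 2.3.1.

```python
import torch
from torch import nn

# Stand-ins so the sketch runs end to end: in our study, `model` is the Bayesian
# network of Section 2.3 and the loss is the ELBO of Section 2.3.1.
model = nn.Sequential(nn.Linear(512, 128), nn.LeakyReLU(), nn.Linear(128, 19))
data = torch.randn(256, 512)              # placeholder 512-d representations
num_epochs = 5

def mds_loss(x, z):
    # squared discrepancy of pairwise Euclidean distances (stand-in objective)
    return ((torch.cdist(x, x) - torch.cdist(z, z)) ** 2).mean()

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4,
                              betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer,
                                                       T_max=num_epochs,
                                                       eta_min=0.0)

for epoch in range(num_epochs):
    for x in data.split(64):              # mini-batches of representations
        optimizer.zero_grad()
        loss = mds_loss(x, model(x))
        loss.backward()                   # back-propagate the loss gradient
        optimizer.step()                  # AdamW update with decoupled weight decay
    scheduler.step()                      # cosine-annealed learning rate
```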
2.5 Accuracy

The following section details the algorithms we use to determine whether the projection has maintained the same ability to classify as the input data. Primarily, we use mAP as both a baseline and post-training metric to quantify the accuracy, or change in accuracy, resulting from the data reduction. The goal is to maintain a nearly identical mAP value between stages of reduction down to a minimum intrinsic dimension.

2.5.1 mAP

The mean average precision (mAP) is the standard average of the average precision over all categories. For instance, the ImageNet100 data contains 100 categories, and we can calculate the average precision for each category; the mAP is then
$$\mathrm{mAP} = \frac{1}{100} \sum_{i=1}^{100} AP_i,$$
where $AP_i$ is the average precision for the $i$th category.

2.5.2 Average Precision

Average precision is a measure of Type I and Type II errors that takes into account the trade-off between the two for different choices of threshold. Specifically, average precision is the area under the precision-recall curve and is in practice estimated by summing the precision $P(k)$ times the change in recall $\Delta r(k)$ over all $k$ changes in recall:
$$AP = \sum_{k=1}^{n} P(k)\, \Delta r(k).$$

2.5.3 Precision and Recall

Precision and recall are measures of Type I and Type II errors, respectively. Precision is the ratio of true positives over all positive predictions, $TP/(TP + FP)$; a precision of 1.0 indicates that there are no false positives (Type I errors). Recall is the ratio of true positives over actual positives, $TP/(TP + FN)$; a recall of 1.0 indicates that there are no false negatives (Type II errors). As the threshold for deciding whether a prediction is positive increases, precision decreases and recall increases, creating a trade-off between the two. In a perfect prediction scenario there are no Type I or Type II errors, precision and recall are both always 1.0, and therefore the maximum value of average precision is also 1.0.

3 Theory

To derive our loss function, we combine concepts from multidimensional scaling (MDS) with variational inference (VI).

3.1 Multidimensional Scaling

At a high level, MDS is a technique that preserves the pairwise similarity or dissimilarity between a set of points that have been projected to a lower-dimensional space. In our model, the mapping between ambient space and intrinsic space is performed by a neural network $f_\theta(\cdot)$ with weights and biases $\theta$, such that $Z_i = f_\theta(X_i)$, where $X_i$ and $Z_i$ are the ambient and reduced vector representations, respectively. To preserve the pairwise Euclidean distances, we minimize
$$d_{ij} = \Big[\|X_i - X_j\|_2 - \|Z_i - Z_j\|_2\Big]^2$$
summed over all pairs. Assuming the distance discrepancies $\|X_i - X_j\|_2 - \|Z_i - Z_j\|_2$ are normally distributed $N(0, \sigma^2)$, the resulting log-likelihood is
$$\log L(\theta) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \Big[\|X_i - X_j\|_2 - \|Z_i - Z_j\|_2\Big]^2 + \mathrm{const},$$
where the constant does not depend on $\theta$. The scalar $\frac{1}{2\sigma^2}$ and the constant can be ignored during optimization, so maximizing the log-likelihood is equivalent to minimizing
$$\sum_{i=1}^{n} \sum_{j=1}^{n} d_{ij}.$$

3.2 Variational Inference

Using a Bayesian neural network composed of a myriad of weights and biases represented by distributions with unspecified parameters $\theta$, we employ VI to train these distribution parameters. We define the posterior distribution
$$p(\theta \mid \text{data}) \propto e^{\log L(\theta) + \log p(\theta)},$$
where $p(\theta) \sim N(0, \sigma_0^2 I)$ is a Gaussian prior and $\log L(\theta)$ is the MDS log-likelihood described above. Then let $q(\theta)$ be a variational family such that $q(\theta) \sim N(\mu, \Sigma)$, where $\mu$ and $\Sigma$ are the variational parameters. To find the variational posterior, we choose $q(\theta)$ such that the Kullback-Leibler divergence $KL\big(q(\theta),\, p(\theta \mid \text{data})\big)$ is minimized, or equivalently we minimize
$$E_q\big(\log q(\theta) - \log p(\theta \mid \text{data})\big) = E_q\big(\log q(\theta) - \log L(\theta) - \log p(\theta)\big) + C,$$
where $C$ is the proportionality constant in $p(\theta \mid \text{data})$, which is ignored under optimization. So finally we minimize
$$-E_q\big(\log L(\theta)\big) + KL(q, p).$$
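To tie the MDS likelihood and the KL penalty together, the following sketch shows one way the objective $-E_q(\log L(\theta)) + KL(q, p)$ could be assembled in PyTorch, approximating the expectation with a single Monte Carlo sample. The BayesLinear class, its initialization values, and the prior standard deviation sigma0 are illustrative assumptions rather than our exact implementation.

```python
import torch
from torch import nn
import torch.nn.functional as F

class BayesLinear(nn.Module):
    """Mean-field Gaussian linear layer with the reparameterization trick."""
    def __init__(self, d_in, d_out, sigma0=1.0):
        super().__init__()
        self.mu_w = nn.Parameter(torch.empty(d_out, d_in).uniform_(-0.6, 0.6))
        self.rho_w = nn.Parameter(torch.full((d_out, d_in), -6.0))
        self.mu_b = nn.Parameter(torch.empty(d_out).uniform_(-0.6, 0.6))
        self.rho_b = nn.Parameter(torch.full((d_out,), -6.0))
        self.sigma0 = sigma0                      # prior phi ~ N(0, sigma0^2)

    def forward(self, x):
        sigma_w = F.softplus(self.rho_w)          # sigma = log(1 + e^rho)
        sigma_b = F.softplus(self.rho_b)
        w = self.mu_w + sigma_w * torch.randn_like(sigma_w)   # W = mu + sigma * z
        b = self.mu_b + sigma_b * torch.randn_like(sigma_b)
        return F.linear(x, w, b)

    def kl(self):
        # closed-form KL( N(mu, sigma^2) || N(0, sigma0^2) ), summed over parameters
        def kl_normal(mu, sigma):
            return (torch.log(self.sigma0 / sigma)
                    + (sigma ** 2 + mu ** 2) / (2 * self.sigma0 ** 2) - 0.5).sum()
        return (kl_normal(self.mu_w, F.softplus(self.rho_w))
                + kl_normal(self.mu_b, F.softplus(self.rho_b)))

def elbo_loss(net, x):
    """-E_q(log L(theta)) + KL(q, p) with one Monte Carlo sample of theta."""
    z = net(x)                                            # stochastic forward pass
    mds = ((torch.cdist(x, x) - torch.cdist(z, z)) ** 2).sum()
    kl = sum(m.kl() for m in net.modules() if isinstance(m, BayesLinear))
    return mds + kl

# Tiny usage example on random data
net = nn.Sequential(BayesLinear(512, 128), nn.LeakyReLU(), BayesLinear(128, 19))
x = torch.randn(32, 512)
loss = elbo_loss(net, x)
loss.backward()
```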
4 Numerical Studies

4.1 Dataset (ImageNet100)

We obtained a cluster sample of 100 subject categories, each containing 1,350 images, from the sizable ImageNet image repository used in [Gong, Boddeti, and Jain 2019]. The images were processed using ResNet34, creating 512-dimensional image representations. These representations encapsulate the quantified visual information that differentiates images among the various categories. The data were separated into training data by a stratified sample without replacement of 1,300 image vectors from each of the 100 categories, with the remaining 50 vectors per category becoming the test data. The training data are used to calculate the TwoNN prior ID estimate and to train the BNN mapping to the lower dimension. The test data are used to validate the accuracy of the model by calculating the mAP of the original data and of the reduced data.

4.2 TwoNN

In this study, we estimate the intrinsic dimensionality of a high-dimensional dataset using TwoNN and the MLE of the Pareto shape parameter. Specifically, we calculate the Euclidean distance between all image representations and find the two nearest neighbors of each point. We then calculate the ratio of the two nearest neighbor distances, estimate the shape parameter of the Pareto distribution, and obtain the Gamma posterior distribution using the Pareto likelihood and its conjugate prior.

Figure 4: Estimated ID based on sample size
Figure 5: ID estimate distribution

To verify that the ID calculation is consistent, we calculated the TwoNN estimate for various stratified sample sizes, increasing up to our entire ImageNet100 sample, collecting 30 estimates per sample size. As seen in Figure 4, the ID estimate increases with sample size, but only up to a certain sample size.

As part of this experiment, we had to overcome the challenge of calculating and storing a distance matrix with 17 billion elements and a memory requirement of 540 GB. Since the ratio calculation for a single point is independent of the others, we implemented parallelization techniques to distribute the computation across a network of 130,000 concurrent processes.

Given the posterior distribution described above, we estimate the intrinsic dimension to follow a Gamma(359.5831, 11.4508) distribution with a mean of 28.96 and variance of 2.33. As graphed in Figure 5, this distribution becomes the prior for our final ID estimation via the neural network.
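Since each row of the distance matrix is only needed to extract its own two smallest entries, one way to sidestep the memory requirement is to compute distances in row blocks and discard them immediately. The sketch below (plain numpy/scipy with an arbitrary chunk size and simulated data; not our actual parallel implementation) illustrates the idea and ends with the conjugate posterior update of Section 2.1 under placeholder prior hyperparameters a and b.

```python
import numpy as np
from scipy.spatial.distance import cdist

def twonn_ratios_chunked(X, chunk=1024):
    """Two-nearest-neighbor ratios without storing the full n x n distance matrix."""
    n = X.shape[0]
    mu = np.empty(n)
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        D = cdist(X[start:stop], X)                # (chunk, n) block of distances
        rows = np.arange(stop - start)
        D[rows, np.arange(start, stop)] = np.inf   # drop the d_ii = 0 entries
        r = np.partition(D, 1, axis=1)[:, :2]      # two smallest distances per row
        mu[start:stop] = r[:, 1] / r[:, 0]
    return mu

# Posterior update as in Section 2.1, with placeholder prior hyperparameters a, b
X = np.random.default_rng(2).standard_normal((5000, 512))
mu = twonn_ratios_chunked(X)
n, v = mu.size, np.log(mu).sum()
a, b = 1.0, 0.1
print("posterior Gamma(%.1f, %.2f) with mean %.2f" % (a + n, b + v, (a + n) / (b + v)))
```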
4.3 DeepMDS vs. BNN

The details below catalog an experiment in which we directly compare the Bayesian neural network to its predecessor, the network outlined in [Gong, Boddeti, and Jain 2019]. We confirm the results of the prior study and also report on the effectiveness of the Bayesian version of the same model.

4.3.1 DeepMDS

In Figure 6, we show the final negative loss and mAP values of the DeepMDS process for reductions from 32 dimensions to each lower dimension between 2 and 31, including the final mAP value at the intrinsic dimension of 19. [Gong, Boddeti, and Jain 2019] reported a 6% loss of mAP (39% down to 33%); here we achieve the same reduction with an 8.8% loss of mean average precision (36.2% down to 27.4%).

Figure 6: DeepMDS mAP and Loss

4.3.2 BNN

In Figure 7, we plot the mAP and negative loss for the Bayesian model and see similar results. From the loss curve, we can estimate the point at which the loss begins to change little with increasing dimension, suggesting an intrinsic dimension between 15 and 25.

Figure 7: Bayesian mAP and Loss

With both models run, we can directly compare the mean average precision of the two methods. In Figure 8, we see a small (about 1%) increase in accuracy for dimensions near the intrinsic dimension for our Bayesian process.

Figure 8: DeepMDS vs. BNN Comparison of mAP

4.4 BNN with Prior ID Estimate

Figure 9 shows the transformation of the intrinsic dimension estimate produced by the BNN when given the prior estimate from our TwoNN ID estimation. The original distribution is shown in blue circles, and resting on top of it, and slightly below, is the transformed distribution in orange squares, suggesting nearly identical results.

Figure 9: Bayesian Neural Network with ID prior

5 Conclusions

5.1 Recreation of Baseline DeepMDS Version

Overall, the results of [Gong, Boddeti, and Jain 2019] were successfully reproduced. We achieved a 512-to-19 dimensionality reduction with an 8.8% loss of mean average precision (36.2% down to 27.4%). This was done with minimal parameter tuning, which may explain the 2% difference from Gong's results.

5.2 Bayesian Version

Swapping the loss function and neural network nodes for Bayesian equivalents, we also achieved comparable results, reducing dimensionality from 512 to 19 with a 7.7% loss (36.2% down to 28.5%), using hyperparameters identical to the baseline version. Thus, our Bayesian version attained results similar to, if not a small improvement over, the frequentist version. Moreover, the Bayesian version appears to offer a small improvement at all dimensions of 15 and above. Our technique provides a range of intrinsic-dimension accuracies suggesting an optimal intrinsic dimension of 15 to 24. Lastly, our process outputs a range of solutions that can quantify the uncertainty for a given intrinsic dimension, which is a notable improvement in itself. While our method showed modest gains on ImageNet, we suspect it would be more beneficial in other settings.

5.3 BNN with Prior ID Estimate

Lastly, adding the TwoNN prior to our BNN estimation resulted in a very small shift in values, likely due to the overwhelming amount of data. The result had a slightly more pronounced maximum and a slightly narrower range of intrinsic dimensions.

6 Future Work

6.1 Transfer Learning

We would like to explore transfer learning, as training on the ImageNet data requires a large amount of time. With transfer learning, we expect to be able to train a general form of the dimensionality reduction such that, given new categories, we can include them in the lower-dimensional manifold with minimal additional training.

6.2 ImageNet 1000

We seek to utilize the full ImageNet library of 1,000 categories, totaling over 1.3 million images, to analyze how the number of categories K impacts the ID.

6.3 Asymptotic Maximum Intrinsic Dimension

We wish to explore the question: "As the number of categories K goes to infinity, is there an upper bound on the intrinsic dimension such that we can capture 99% of all imaged subjects?"

REFERENCES

[Gong, Boddeti, and Jain 2019] Gong, S., Boddeti, V. N., and Jain, A. K. (2019). On the Intrinsic Dimensionality of Image Representation. IEEE Trans. Neural Networks Learn.

[Liu, Bhattacharya, and Maiti 2021] Liu, Z., Bhattacharya, S., and Maiti, T. (2021). Variational Bayes Ensemble Learning Neural Networks With Compressed Feature Space.
[Bennett, R. S. 1965] Bennett, R. S. (1965). Representation and Analysis of Signals Part XXI: The Intrinsic Dimensionality of Signal Collections. Johns Hopkins University, Baltimore.

[Rytgaard 1990] Rytgaard, R. (1990). Estimation in the Pareto Distribution. ASTIN Bulletin, Vol. 20, No. 2.

[Denti, et al. 2022] Denti, F., Doimo, D., Laio, A., and Mira, A. (2022). The generalized ratios intrinsic dimension estimator. Nature Portfolio.

[Bishop 2006] Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer, New York.

[Loshchilov and Hutter 2017] Loshchilov, I., and Hutter, F. (2017). Decoupled Weight Decay Regularization. arXiv preprint.

[Allegra, et al. 2020] Allegra, M., Facco, E., Denti, F., Laio, A., and Mira, A. (2020). Data segmentation based on the local intrinsic dimension. Nature.