PREDICTIVE AND GENERATIVE MODELING OF TIME SERIES EXTREMES By Asadullah Hill Galib A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science—Doctor of Philosophy 2024 ABSTRACT The accurate modeling of extreme values in time series data is a critical yet challenging task that has garnered significant interest in recent years. The impact of extreme events on human and natural systems underscores the need for effective and reliable modeling methods. To address this need, we develop novel methods that combine extreme value theory (EVT) with deep neural networks (DNNs) for a range of time series modeling tasks, including forecasting, representation learning, and generative modeling. While integrating EVT into DNNs can provide a robust framework for understanding the behavior of extreme values, several challenges arise in effectively integrating EVT with deep learning architectures. Successfully addressing these challenges necessitates a comprehensive strategy that leverages the strengths of both methodologies. Thus, this thesis posits that integrating extreme value theory within deep learning frameworks offers a sound approach to modeling extreme values in time series data by enhancing forecasting accuracy, advancing representation learning, and improving generative capabilities. This thesis introduces four novel deep learning frameworks: DeepExtrema, Self-Recover, SimEXT, and FIDE, which offer promising solutions for forecasting, imputation, representation learning, and generative modeling of extreme values in time series data. DeepExtrema focuses on integrating extreme value theory with deep learning formulation to improve the accuracy and reliability of extreme events forecasting. Self-Recover addresses data fusion challenges that arise from varying temporal coverage associated with long-term and random missing values of predictors. SimEXT explores how deep learning can be utilized to learn useful time series representations that effectively capture tail distributions for modeling extreme events. FIDE introduces a high-frequency inflation-based conditional diffusion model tailored towards preserving extreme value distributions within generative modeling. These frameworks are evaluated using real-world and synthetic datasets, demonstrating superior performance over existing state-of-the-art methods. The contributions of this research are significant in advancing the field of time series modeling and have practical implications across various domains, such as climate science, finance, and engineering. Copyright by ASADULLAH HILL GALIB 2024 This thesis is dedicated to my beloved parents Abdul Mannan and Khaleda Akter and my loving wife Samima Aktar Shoshy. iv ACKNOWLEDGEMENTS I would like to express my deepest gratitude to my Ph.D. advisor, Dr. Pang-Ning Tan, for his invaluable guidance, support, and encouragement throughout my research journey. His passion for the field of Computer Science has been a constant source of inspiration for me. Besides, I would like to extend my heartfelt thanks to my Ph.D. guidance committee members (Dr. Jiayu Zhou, Dr. Sijia Liu, and Dr. Lifeng Luo) for their invaluable feedback and constructive criticism throughout my research. Their expertise and insights have greatly enriched my work and helped me refine my ideas. I am also grateful to my lab mates at DMiner@MSU for their support and stimulating discussions. 
Their diverse perspectives, expertise, and willingness to help have been instrumental in shaping my ideas and approach to research. In addition, the research in this thesis was partially supported by the National Science Foundation under grant #IIS-2006633. I would also like to thank Dr. Pang-Ning Tan for providing assistantships during my Ph.D. I am deeply grateful to my parents, Abdul Mannan and Khaleda Aktar, and my brothers, Khaled Mosharraf Mukut and Akib Al Muyeed, for their unconditional love, support, and encouragement throughout my academic pursuits. Their guidance and prayers have been the driving force behind my success. I owe a special thanks to my wife, Samima Aktar, whose steadfast belief, patience, and understanding have been an unwavering source of strength during this challenging journey.

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION 1
CHAPTER 2 RELATED WORK 11
CHAPTER 3 DEEPEXTREMA: A DEEP LEARNING APPROACH FOR FORECASTING BLOCK MAXIMA IN TIME SERIES DATA 19
CHAPTER 4 SELF-RECOVER: FORECASTING BLOCK MAXIMA IN TIME SERIES FROM PREDICTORS WITH DISPARATE TEMPORAL COVERAGE USING SELF-SUPERVISED LEARNING 36
CHAPTER 5 SIMEXT: SELF-SUPERVISED REPRESENTATION LEARNING FOR EXTREME VALUES IN TIME SERIES 53
CHAPTER 6 FIDE: FREQUENCY-INFLATED CONDITIONAL DIFFUSION MODEL FOR EXTREME-AWARE TIME SERIES GENERATION 74
CHAPTER 7 CONCLUSIONS AND FUTURE WORK 98
BIBLIOGRAPHY 103

CHAPTER 1 INTRODUCTION

A time series is a sequence of observations recorded over time, where each observation depends on preceding ones, creating complex temporal interdependencies. Time series data is pervasive and comes in various forms, from stock prices and traffic patterns to climate data, disease outbreaks, and energy usage trends. Because time series data is so prevalent, capturing its dynamics has critical implications across numerous fields. For instance, in finance, time series analysis aids in stock market predictions; in energy, it informs load forecasting; in climate science, it models temperature and precipitation patterns. The broad applicability of time series data underscores its crucial role in scientific, economic, and social decision-making. With the growing availability and significance of time series data in decision-making, accurately modeling the underlying patterns, trends, and dynamics of time series is essential for strategic insights across diverse applications. Time series modeling includes several distinct tasks, such as forecasting, classification, generation, and anomaly detection. Numerous techniques have been developed to address these tasks, ranging from classical statistical approaches like GARCH (Generalized Autoregressive Conditional Heteroskedasticity) and ARIMA (Autoregressive Integrated Moving Average) to advanced machine learning techniques, broadly categorized into predictive and generative modeling paradigms. Predictive modeling focuses on learning mappings from input sequences to target variables, while generative modeling aims to capture the underlying data distribution for sequence generation. Recent advancements in deep learning have led to the development of sophisticated predictive modeling techniques for time series data, many of which are adapted from natural language processing.
These techniques include Recurrent Neural Networks (RNNs) [1], Long Short-Term Memory networks (LSTMs) [2, 3, 4], and attention-based models such as Informer [5] and Autoformer [6]. Parallel developments in generative modeling have led to specialized time series adaptations, including TimeGAN [7], TimeVAE [8], and Diffusion-TS [9], expanding the toolkit for time series modeling.

A key challenge in time series modeling is capturing extreme values, which are crucial due to their significant impact on human and environmental systems. There are two primary approaches to defining extremes in time series: the peak-over-threshold approach, which considers values exceeding a specified threshold, and the block maxima approach, which defines extremes as the maximum within a given time window. Block maxima analysis offers insights into the worst-case scenario within a time series, enabling a detailed understanding of maximum potential impacts. For instance, analyzing block maxima in hurricane data reveals the peak wind speeds of the hurricanes. Despite its importance, modeling block maxima poses significant challenges. Current time series modeling approaches primarily aim to minimize mean-square prediction error, focusing on predicting the conditional expectation rather than extreme values, thereby overlooking the critical tail distribution that characterizes extreme events. To address this limitation, Extreme Value Theory (EVT) can be useful, as it provides a robust statistical framework for modeling the distribution of extreme values. However, classical implementations of EVT struggle to capture the highly complex, nonlinear relationships inherent in time series data. This thesis hypothesizes that integrating EVT with advanced deep learning techniques can enhance the modeling of block maxima in time series by effectively capturing tail distributions while leveraging the representational power of neural networks. The primary challenge lies in developing an effective integration of EVT principles into deep learning architectures, a gap this research aims to address, with significant implications for risk management and decision-making under uncertainty across multiple domains.

1.1 Research Challenges

Accurately forecasting extremes in time series is a complex and challenging task that demands specialized models and approaches, as even advanced Transformer-based models struggle to predict extreme values with high accuracy. The difficulty of forecasting extremes is emphasized by Figure 1.1, which illustrates the limitations of existing models and techniques in predicting extreme events. The figure shows the results of three different settings for time series forecasting of a temperature dataset (https://www.narccap.ucar.edu/data/index.html) using a Transformer-based model.

Figure 1.1 The figure illustrates the challenge of forecasting extremes, as even advanced models like the Transformer struggle with accuracy. Directly predicting block maxima improves performance but still falls short of the ground truth distribution, as shown by Kernel Density Estimation (KDE). The left column shows a sample sequence, and the right column displays the KDE of all sequences in a temperature dataset.

The top row shows the performance of sequence-to-sequence forecasting, which does not perform well considering all values in the forecasting window, particularly for extreme values, as shown by the Kernel Density Estimation plots.
The middle row depicts the results of sequence-to-sequence forecasting where the model predicts all time steps in the forecasting window but only uses block maxima for optimization and evaluation; it also fails to perform well for extreme values. The bottom row shows the model's performance when directly predicting block maxima in the forecasting window, which is an improvement over the previous settings, but it still falls short of the ground truth distribution. This indicates that accurately capturing extremes remains a complex and challenging task that requires more sophisticated and specialized models.

The pursuit of accurate forecasting of extreme values in time series data presents several key research challenges.

First, the rarity of extreme values poses a significant challenge in modeling, as these extreme values represent a small fraction of the dataset, often leading to an insufficient representation of extreme events during training. In some cases, extreme values may have been discarded as outliers during training to improve the generalization performance of the model. This scarcity can hinder a model's ability to generalize and accurately predict future extremes.

Figure 1.2 The figure highlights that incorporating the GEV distribution into deep learning through our proposed approach performs well compared to the ground truth distribution of extreme values.

Second, regularizing deep learning models is crucial, especially with limited data. Deep learning models typically require large amounts of data to achieve optimal performance, and a scarcity of data can result in overfitting, where the model excels on training data but fails to generalize to new, unseen instances. In cases of limited data, regularization techniques become essential to mitigate overfitting risks. In this context, Extreme Value Theory (EVT) could offer a solution, as EVT provides a statistically sound approach to characterizing the limiting distribution of extreme values. Integrating EVT into deep learning models not only aids in regularization but also enhances the model's ability to capture extreme behaviors by establishing constraints based on the underlying statistical properties of extremes.

Third, modeling extreme values in time series data requires a specialized focus on the tail distribution, as opposed to conventional deep learning methods, which prioritize modeling the conditional mean of the target variable by minimizing the mean-square prediction error [10]. The rarity of extreme values also necessitates a distinctive modeling approach that emphasizes the tail distribution rather than the overall distribution of the dataset. A robust representation learning approach tailored to the tail distribution, or directly modeling the underlying distribution of extreme values, is essential for accurately capturing these phenomena. Such an approach must effectively address the unique characteristics of the tail distribution, ensuring that the model can capture the critical information inherent in the distribution of extreme values and enhance its predictive performance for future extremes.

Fourth, effective data fusion is essential for enhancing the modeling of extreme events. Relying solely on historical observations is often inadequate for forecasting extreme values. This is particularly true when dealing with non-stationary time series, where historical data may not provide a robust foundation for accurately predicting block maxima.
In such cases, incorporating domain-driven, process-based models can effectively supplement historical data and improve forecast accuracy. For example, when forecasting climate extremes, integrating predictors from global climate models (GCMs) can provide valuable insights into potential future extremes that historical data alone might overlook. However, the integration of these diverse data sources introduces significant technical and conceptual challenges. Key issues include handling missing or inconsistent data, addressing varying temporal coverage, and accounting for different sources of uncertainty. Specifically, for time series extremes, merging outputs from process-based models with historical observations can be complicated by disparities in temporal coverage among the different data sources. Addressing these challenges is critical to ensure the reliability and accuracy of extreme event predictions, thereby enhancing the effectiveness of decision-making and risk management strategies.

Fifth, enhancing generative modeling techniques for extreme values is critical for effective decision-making across various fields, such as climate science, energy consumption, and disease outbreak detection. Effective generative modeling of these extremes enables an understanding of the underlying data distribution, facilitates data augmentation, and enhances uncertainty estimation. These capabilities are crucial for developing robust risk management strategies and improving disaster preparedness measures. Despite the increasing interest in applying generative models to time series analysis [11, 12], their efficacy in accurately preserving the distribution of extreme values has been insufficiently explored. Notably, state-of-the-art generative models, including diffusion models [13, 9], demonstrate a strong ability to generate samples that conform to the overall data distribution. However, they often fail to effectively capture the distribution of block maxima values. This limitation highlights the urgent need for further research aimed at improving generative models' performance in preserving the characteristics of extreme values within time series data.

In short, the key research challenges include the rarity of extreme values, the need for specialized regularization, capturing tail distributions, effective data fusion, and enabling generative models to preserve the distribution of extreme values. Addressing these challenges is essential for advancing the modeling of extreme values in time series.

1.2 Thesis Statement

Integrating extreme value theory within deep learning frameworks offers a robust approach to modeling extreme values in time series data by enhancing forecasting accuracy, advancing representation learning, and improving generative capabilities.

1.3 Thesis Contributions

To validate this thesis statement, the primary objective of this thesis is to develop deep learning frameworks that tackle the key research challenges discussed in the previous section. Chapter 3 addresses the challenge of rarity in extreme values and limited training data by incorporating extreme value theory (EVT) as a form of regularization into deep learning models, improving their ability to forecast extreme events using historical data.
Chapter 4 extends this approach to multivariate model-based forecasts, with techniques for handling long-term and randomly occurring missing values, addressing both data fusion and generalization issues. Chapter 5 explores the challenge of accurately modeling tail distributions by developing deep learning methods that focus on learning effective representations for extremes in time series. Chapter 6 focuses on enhancing generative models to accurately preserve the distribution of extreme values by combining frequency-domain manipulation with EVT. A high-level overview of these contributions is presented in the following subsections. 1.3.1 DeepExtrema: A Deep Learning Approach for Forecasting Block Maxima in Time Series Data Forecasting block maxima in time series data is essential for assessing risk and impact in various domains but remains challenging due to the rarity of extreme values and limited training data, as highlighted in the first two research challenges. To address these, we introduce DeepExtrema, 6 a novel framework that integrates deep neural networks (DNNs) with EVT for effective block maxima forecasting. This approach leverages EVT, more specifically generalized extreme value (GEV), to capture the distribution of extreme values and regularize the training, while the DNN captures complex dependencies in the data. Integrating EVT into deep learning involves certain technical challenges, such as maintaining GEV constraints during training and preventing violations of regularity constraints due to random network initialization. DeepExtrema overcomes these issues with innovations like model bias offset, constraint reformulation, and reparameterization, effectively addressing the first two research challenges. Extensive experiments on synthetic and real-world datasets show that DeepExtrema significantly outperforms baseline methods, improving the accuracy of extreme event forecasting by 6.5%-16% in time series data. 1.3.2 Self-Recover: Forecasting Block Maxima in Time Series from Predictors with Disparate Temporal Coverage using Self-Supervised Learning Relying solely on historical observations is often insufficient for accurately predicting block maxima, particularly in non-stationary time series where historical data may not capture the complexities of future extremes. To address this, integrating domain-driven process-based models can significantly improve forecast accuracy by providing additional context and insights. However, this integration introduces challenges, such as handling missing data, reconciling different temporal coverage, and managing uncertainties from diverse sources. To tackle these issues, we propose Self-Recover, a self-supervised learning framework designed to predict block maxima by effectively fusing disparate temporal data sources. This approach employs a combination of contrastive learning and generative modeling techniques to impute long-term missing values and a denoising autoencoder for handling data that are missing at random. By seamlessly combining representations from historical observations and process model outputs, Self-Recover ensures comprehensive modeling of the available data while also incorporating the GEV distribution for consistency in predictions. Extensive experiments on real-world datasets demonstrate the superiority of Self-Recover in forecasting block maxima, showcasing its ability to enhance predictive performance through effective data fusion strategies. 
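As a rough illustration of one ingredient mentioned above, handling values that are missing at random with a denoising autoencoder, the following minimal PyTorch sketch trains a reconstruction network on artificially masked inputs. It is illustrative only and does not reflect the actual Self-Recover architecture; the class name DenoisingImputer, the helper masked_training_step, and all layer sizes are hypothetical.

```python
# Illustrative sketch (not the Self-Recover architecture): a denoising
# autoencoder that learns to impute randomly missing predictor values by
# reconstructing artificially masked entries.
import torch
import torch.nn as nn

class DenoisingImputer(nn.Module):
    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, n_features)

    def forward(self, x_corrupted: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x_corrupted))

def masked_training_step(model, x, optimizer, mask_rate=0.2):
    """One step: zero out a random mask, reconstruct, and penalize errors
    only on the positions that were artificially masked."""
    mask = (torch.rand_like(x) < mask_rate).float()
    x_corrupted = x * (1.0 - mask)          # simulate missing-at-random entries
    x_hat = model(x_corrupted)
    loss = ((x_hat - x) ** 2 * mask).sum() / mask.sum().clamp(min=1.0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

model = DenoisingImputer(n_features=10)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x_batch = torch.randn(32, 10)               # synthetic predictors
print(masked_training_step(model, x_batch, opt))
```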
7 1.3.3 SimEXT: Enhancing Time Series Forecasting of Extreme Values via Self-supervised Representation Learning This approach addresses the critical need for robust representation learning tailored specifically to extreme values. Modeling extreme values in time series data requires a specialized focus on the tail distribution rather than the overall distribution, as highlighted in the third research challenge. To effectively address this challenge, we introduce SimEXT, a self-supervised learning framework designed to learn robust representations that preserve the fidelity of the tail distribution. SimEXT utilizes a combination of contrastive learning and a reconstruction-based autoencoder architecture, facilitating the capture of temporal patterns linked to extreme events. To further enhance the learning process, the framework incorporates a wavelet-based data augmentation technique alongside a distribution-based loss function that emphasizes the extreme value distribution. We establish probabilistic guarantees for the wavelet-based augmentation, ensuring that perturbations to the wavelet coefficients do not significantly alter the extreme values within the time series. Experimental results on real-world datasets demonstrate that SimEXT effectively captures the unique properties of the tail distribution, significantly improving the performance of downstream tasks focused on forecasting block maxima. 1.3.4 FIDE: Frequency-Inflated Conditional Diffusion Model for Extreme-Aware Time Series Generation Time series generation is a crucial aspect of data analysis, playing a pivotal role in learning temporal patterns and their underlying dynamics across diverse domains. However, conventional time series generation methods often struggle to adequately capture extreme values, diminishing their value in critical applications such as scenario planning and risk management in healthcare, finance, climate change adaptation, and beyond. To tackle this challenge, we introduce a conditional diffusion model called FIDE, specifically designed to preserve the distribution of extreme values in generative modeling for time series. FIDE employs a novel high-frequency inflation strategy in the frequency domain to prevent the premature fade-out of extreme values. Additionally, it extends traditional diffusion-based models to enable the generation of samples conditioned on the block maxima, thereby enhancing the model’s capacity to capture extreme events. The FIDE framework 8 also incorporates the Generalized Extreme Value (GEV) distribution within its generative modeling framework, ensuring fidelity to both block maxima and overall data distribution. Experimental results on real-world and synthetic datasets demonstrate the efficacy of FIDE over baseline methods, highlighting its potential to advance generative AI for time series analysis, particularly in accurately modeling extreme events. This approach addresses the critical need for enhanced generative modeling techniques for extreme values, which is essential for effective decision-making across various fields. 1.4 Publications The following works were used as a source for some of the content in this thesis: • Asadullah Hill Galib, Andrew McDonald, Tyler Wilson, Lifeng Luo, & Pang-Ning Tan (2022, Jul.). DeepExtrema: A Deep Learning Approach for Forecasting Block Maxima in Time Series Data. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022 (pp. 2980-2986). 
(Chapter 3) • Asadullah Hill Galib,Andrew McDonald , Pang-Ning Tan & Luo, L. (2023, Jan.). Self- Recover: Forecasting Block Maxima in Time Series from Predictors with Disparate Temporal Coverage using Self-Supervised Learning. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI 2023 (pp. 3723-3731). (Chapter 4) • Asadullah Hill Galib, Pang-Ning Tan & Lifeng Luo (2023, Feb.). SimEXT: Self-supervised Representation Learning for Extreme Values in Time Series. In Proceedings of the 23rd IEEE International Conference on Data Mining, ICDM 2023, (pp. 1031-1036), IEEE. (Chapter 5) • Asadullah Hill Galib, Pang-Ning Tan & Lifeng Luo (2023, Feb.). FIDE: Frequency-Inflated Conditional Diffusion Model for Extreme-Aware Time Series Generation. Advances in Neural Information Processing Systems NeurIPS 2024, 36. [To Appear] (Chapter 6) In addition, here are the publications related to this research that I have published: • Yue Deng, Asadullah Hill Galib, Pang-Ning Tan & Lifeng Luo (2024, Aug.). Unraveling Block Maxima Forecasting Models with Counterfactual Explanation. In Proceedings of the 9 30th ACM SIGKDD 2022 Conference on Knowledge Discovery and Data Mining, KDD 2024 (pp. 562-573). • Tyler Wilson, Asadullah Hill Galib, Pang-Ning Tan, and Lifeng Luo. Beyond Point Prediction: Capturing Zero-Inflated & Heavy-Tailed Spatiotemporal Data with Deep Extreme Mixture Models. In Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2022, Washington, DC (2022). • Tyler Wilson, Pang-Ning Tan, Lifeng Luo, and Asadullah Hill Galib. Deep Learning With Extreme Value Theory for Modeling Precipitation Events. In AGU Fall Meeting Abstracts, vol. 2021, pp. A15Q-07. 2021. 1.5 Thesis Organization The remainder of this thesis is organized as follows. Chapter 2 provides a review of relevant literature and previous research related to the topic. Chapter 3 presents the DeepExtrema framework, which combines a deep neural network with a generalized extreme value distribution to predict the block maximum value of a time series. In Chapter 4, a new deep learning framework named Self-Recover is introduced, which overcomes the temporal data availability problem by using self-supervised learning to forecast block maxima. Chapter 5 presents SimEXT, a self-supervised learning framework that learns a robust representation of a time series to accurately capture its tail distribution. Chapter 6 presents FIDE, a frequency-inflated conditional diffusion model to preserve block maxima distribution in time series generation. Finally, Chapter 7 concludes this thesis by outlining potential future research. 10 CHAPTER 2 RELATED WORK In this section, we describe the current state of research in the application of deep learning to time series data. Additionally, we will explore the advancements in extreme value theory as it pertains to time series analysis. Also, we will discuss the emerging trend of self-supervised learning, which has been gaining traction as an effective approach to analyzing time series data. Finally, we will discuss anomaly detection in time series and how it differs from forecasting extreme values. 2.1 Deep Learning for Time Series Forecasting Deep Learning (DL) has emerged as a promising technique for time series forecasting due to its ability to handle complex patterns and nonlinear relationships in the data. In this section, we summarize recent advances and developments in DL-based methods for time series forecasting. 
One of the most common DL-based methods for time series forecasting is Recurrent Neural Networks (RNNs), which have been shown to be effective in capturing temporal dependencies in sequential data. In particular, Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) have been widely adopted in various applications. LSTM networks [14] are able to learn long-range dependencies in data, which is essential for forecasting time series that exhibit trends and seasonality. Also, it overcomes the vanishing and exploding gradient problem in other recurrent neural networks. Since its inception, LSTM and its variants have been successfully applied to forecasting time series in various application domains. For example, [3] used an LSTM-based model to predict time series electric load prediction [3]. [2] proposed a variation of LSTM to predict petroleum production. Another popular deep learning architecture for time series forecasting is the gated recurrent unit (GRU) network. GRU networks are similar to LSTM networks, but they are simpler and faster to train. [1] employed an encoder-decoder-based GRU to predict host workloads in a cloud computing environment. Another direction of DL-based methods for time series forecasting is using 1D Convolutional Neural Networks (CNNs), which have been shown to be effective in capturing local patterns in the data. Some of the earliest successful applications of neural networks have involved convolutional 11 neural networks. In applications to time series, a small filter is passed over the temporally localized region to model temporal relationships. For example, [15] proposed a deep CNN-based model for forecasting time series data, achieving competitive performance compared to traditional methods. In addition, [16] proposed a deep spatio-temporal CNN for traffic flow prediction, outperforming various baseline methods. [17] proposed a CNN architecture for activity recognition while [18] employed a CNN-based architecture for a variety of time series classification problems. [19] empirically demonstrated the effectiveness of CNNs as alternatives to LSTMs in sequence-based modeling. Moreover, DL-based methods for time series forecasting have also been extended to include attention mechanisms [20], which can selectively focus on important features and temporal patterns in the data. Attention-based RNN seemed to be a promising direction in time series analysis for its consideration of the context apart from the typical data-driven approach. For example, [21] proposed a deep learning model combining CNN, LSTM, and attention mechanism for stock price prediction, achieving superior performance compared to traditional methods. [22] proposed a transformer-based architecture for time series forecasting. [23] used attention-based RNN for multi-step time series prediction of process performance. [24] proposed an attention-based LSTM framework for financial time series forecasting. [5] focuses on transformer challenges and introduces an effective transformer-based model, called Informer, that significantly enhances the inference speed of long time-series forecasting (LSTF) by predicting entire sequences in one forward operation instead of step-by-step. There are a number of other deep learning architectures that have been used for time series forecasting. These include recurrent convolutional neural networks (RCNNs) [25, 26] temporal convolutional networks (TCNs) [27, 28], reservoir computing [29, 30], graph neural networks [31], etc. 
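For concreteness, a minimal sketch of the kind of recurrent forecaster surveyed above is shown below. It is a generic illustration rather than a reimplementation of any cited model; the class name LSTMForecaster and all hyperparameters are hypothetical.

```python
# Generic sketch of a seq-to-one LSTM forecaster: the last hidden state of
# an LSTM over an input window is mapped to a point forecast.
import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    def __init__(self, n_features=1, hidden_size=32, horizon=1):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, horizon)

    def forward(self, x):              # x: (batch, window_length, n_features)
        _, (h_n, _) = self.lstm(x)     # h_n: (num_layers, batch, hidden_size)
        return self.head(h_n[-1])      # (batch, horizon)

model = LSTMForecaster(n_features=1, hidden_size=32, horizon=1)
window = torch.randn(8, 24, 1)         # batch of 8 windows, 24 time steps each
forecast = model(window)               # point forecasts, typically trained with MSE
print(forecast.shape)                  # torch.Size([8, 1])
```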
While deep learning models have demonstrated substantial predictive accuracy in time series forecasting, capturing uncertainty remains a significant challenge, especially for extreme values that diverge from typical patterns. Traditional forecasting models are generally optimized for point 12 predictions, limiting their capacity to account for the inherent uncertainty in time series data and often overlooking extreme event probabilities. To address this, recent research has focused on integrating uncertainty quantification techniques with deep learning models to better capture the range of possible outcomes. For instance, [32] proposed a distribution-free novel approach called DeepPIPE. It simultaneously predicted point estimations as well as quantile estimations without any prior assumption about the data distribution. They proposed a hybrid loss function based on point estimations and point intervals to leverage point and quantile estimations. Although DeepPIPE provides a robust baseline for uncertainty estimation, it does not incorporate extreme value theory (EVT), which is crucial for accurately predicting rare events in the tails of the distribution. [4] proposed an LSTM-based architecture for time series extreme event forecasting where they combined Bootstrap and Bayesian approaches for uncertainty estimation. 2.2 Deep Learning for Time Series Extremes [33] is representative of much of the traditional statistical work utilizing Extreme Value Theory (EVT). They analyzed GEV parameters assuming there was a simple relationship between those parameters and a single predictor, i.e., time. However, they did not provide a method for making point estimates and their model was only able to model very simple relationships. Most of the prior works combining deep learning with EVT suffer from significant limitations. For instance, [34] incorporated the GP distribution in their loss function to forecast excesses over a threshold. However, instead of predicting the GP parameters from covariates, they assume the GP parameter values are known a-priori and can be provided as hyperparameters of their algorithm. Thus, the GP parameters are assumed to be fixed (constant) for the all-time series, which is a strong assumption especially if the time series is generated for different locations. [35] proposed to combine a deep learning model with extreme value theory to model the tail behavior of a time series. They applied the GP distribution for modeling excess values and used its negative log-likelihood as the loss function. However, a major issue with their proposed framework is that it does not incorporate a mechanism to enforce constraints on parameters of the learned GP distribution, which is essential to ensure the predicted distribution is well-behaved. [36] combine copulas and normalizing flows 13 with the GP distribution to model multivariate extremes. [37, 38] combine deep learning with the GP distribution for spatiotemporal data using the maximum log-likelihood of approach and a reparameterization technique. Similar to other works, their approach is not designed for block maxima prediction nor does it handle the varying temporal data availability issue. 2.3 Self-supervised Representation Learning for Time Series Self-supervised learning has become increasingly popular in recent years due to its excellent performance on various benchmark datasets along with strong theoretical foundations [39, 40, 41]. 
Self-supervised learning has been shown to outperform supervised learning in various instances [39, 42]. There are a number of studies focusing on the theoretical foundations of self-supervised learning. For example, [43] provided provable guarantees on the performance of the learned representations through self-supervised learning. [44] showed that self-supervised learned representations can extract task-relevant information and discard task-irrelevant information. Several studies theoretically explored these insights and demonstrated provable guarantees for contrastive [41, 45], non-contrastive [46], and reconstruction-based [47] self-supervised learning. In the context of time series, SSL has been shown to be an effective method for representation learning and feature extraction, which can benefit various downstream tasks such as forecasting, classification, and anomaly detection. One of the most popular SSL-based methods for time series is Autoencoding, which aims to learn a compressed representation of the input data by reconstructing it from a compressed latent space. For instance, [48] proposed a self-supervised framework for time series forecasting, which uses recurrent autoencoder ensembles to learn a compressed representation of the input data and generate future predictions. The proposed method achieved competitive performance compared to traditional methods. Another direction of SSL-based methods for time series is using contrastive learning, which aims to learn a representation that maximizes the similarity between positive examples and minimizes the similarity between negative examples. For example, [49] proposed a Contrastive Predictive Coding (CPC) framework that employs a contrastive objective to learn a representation that captures temporal dependencies in audio signals. Similarly, [50] proposed a Contrastive Multiview Coding 14 (CMC) framework for time series that uses a contrastive objective to learn a representation capturing different perspectives of the input data. The proposed method has shown exceptional performance in various downstream tasks such as forecasting and classification. Researchers have extended contrastive learning in several ways. For instance, [51] use contrastive learning hierarchically over augmented context views for representation learning. [52] apply intra-level contrastive learning to disentangle seasonal and trend representations from time series. [53] incorporate frequency domain information while maintaining consistency between time and frequency domains. [54] combine temporal and contextual contrasting for representation learning. Additionally, [55] propose novel data augmentation techniques that not only generate phase shifts and amplitude changes but also retain the structure and feature information of the time series along with contrastive learning. It is evident from these studies that Contrastive Learning is a powerful tool for representation learning in time series analysis. Moreover, SSL-based methods for time series have also been extended to include Generative Adversarial Networks (GANs), which can learn a generator model that can generate realistic samples from the input data distribution. For example, [56] proposed a self-supervised framework for time series forecasting, which uses GANs to generate future predictions from the input data. 
Similarly, [57] proposed a self-supervised framework for time series anomaly detection, which uses GANs to generate normal samples from the input data distribution and detect anomalies based on the reconstruction error. SSL-based methods for time series have shown remarkable progress in recent years, with a variety of architectures and techniques being proposed and achieving competitive performance. Future research directions may include exploring more complex models and architectures, developing more efficient training algorithms, and investigating the interpretability and explainability of SSL-based methods. 2.4 Generative Modeling for Time Series Time series generation has been an active area of research, with various statistical and machine learning techniques employed to capture temporal dependencies and complexities in data. Classical 15 statistical models, such as AR, ARMA, MA, and ARIMA [58], have been utilized for time series generation. However, these models often struggle to capture non-linear and complex relationships, limiting their applicability to diverse problems. To address these limitations, advanced generative models have emerged, including Generative Adversarial Networks (GANs) [59], Variational Autoencoders (VAEs) [60], normalizing flows [61], and diffusion-based approaches [13, 62]. These methods have demonstrated efficacy in time series generation due to their ability to learn underlying data distributions for data generation. While normalizing flows are constrained by computational complexity, limited expressiveness, and suboptimal sample quality, GANs and VAEs offer advantages in this regard. Numerous GAN architectures, such as RcGAN [63] and TimeGAN [7], have been proposed for generating realistic time series data. TimeGAN [7] employs an encoder-decoder architecture to transform time series samples into latent vectors. However, GAN-based models are susceptible to issues like mode collapse and unstable training behavior. VAEs, although not extensively applied to synthetic time series generation, have shown promise in related tasks like time series imputation [64], suggesting their potential utility in this domain. Diffusion-based models are also gaining traction for their ability to generate high-quality data, bypassing challenges associated with discriminator networks in GANs and avoiding artifact-prone lower-dimensional latent spaces of VAEs. While a few diffusion-based works [11, 12] have been employed for time series, they are specifically designed for discriminative tasks. Despite the advantages of generative AI for time series, modeling extreme values using these models remains largely unexplored. Previous research [65] has recognized the difficulty of capturing extremes using generative models like normalizing flows. Studies [66, 67] have highlighted the inability of normalizing flows to accurately capture heavy-tailed marginal distributions, as mapping heavy-tailed distributions to light-tailed distributions cannot maintain Lipschitz-boundedness. However, this challenge remains largely unaddressed within the realm of diffusion models. 16 2.5 Anomaly Detection in Time Series While it may seem that anomaly detection and extreme value forecasting are similar, it is crucial to recognize that they have distinct goals and necessitate different methods. Although the terms may occasionally be used interchangeably or confused with one another, it is essential to understand their differences. 
Anomaly detection primarily involves identifying abnormal patterns or data points in a time series, without necessarily predicting the specific magnitude or value of the anomaly. In contrast, extreme value forecasting aims to forecast the magnitude or value of a specific event, such as a maximum or minimum value, in the future. Anomaly detection in time series is a crucial task in various applications, including finance, healthcare, and industrial process control. The ability to detect deviations from the expected behavior in real time can prevent critical failures, avoid financial losses, and improve overall efficiency. One of the most popular approaches for anomaly detection in time series is statistical methods, which rely on modeling the probability distribution of the data and identifying deviations from it. For instance, the use of Gaussian distribution-based methods, such as the Z-score and the Mahalanobis distance, has been extensively studied in the literature [68]. However, these methods are often limited by their assumption of normality and may not be effective in detecting anomalies in complex and high-dimensional data. In recent years, machine learning techniques, especially deep learning-based methods, have shown remarkable progress in anomaly detection for time series data. One of the most popular deep learning-based methods for anomaly detection is the use of autoencoder networks, which can learn a compressed representation of the input data and reconstruct it from a latent space. Anomaly detection can then be performed by identifying samples with high reconstruction errors [69]. For example, [70] proposed a self-supervised framework for time series anomaly detection, which uses Variational Autoencoders (VAE) to learn a compressed representation of the input data and reconstruct it from the latent space. Another direction of deep learning-based methods for anomaly detection in time series is the use of recurrent neural networks (RNNs), which can capture temporal dependencies and patterns in sequential data. For instance, [71] proposed a framework for anomaly 17 detection in time series data based on LSTMs, which achieved state-of-the-art performance on benchmark datasets. Moreover, the use of attention mechanisms in RNN-based models has been shown to improve the accuracy of anomaly detection by focusing on important temporal features in the data [72]. Furthermore, recent studies have also explored the use of unsupervised learning methods for anomaly detection in time series data, which do not require labeled data for training. For example, [73] proposed an autonomous anomaly detection technique for multivariate time series data (TimeAutoAD) based on a novel self-supervised contrastive loss. Machine learning techniques, especially deep learning-based methods, have shown great potential in improving the accuracy and efficiency of anomaly detection in time series data. 18 CHAPTER 3 DEEPEXTREMA: A DEEP LEARNING APPROACH FOR FORECASTING BLOCK MAXIMA IN TIME SERIES DATA 3.1 Introduction Extreme events such as droughts, floods, and severe storms occur when the values of the corresponding geophysical variables (such as temperature, precipitation, or wind speed) reach their highest or lowest point during a period or surpass a threshold value. Extreme events have far-reaching consequences for both humans and the environment. 
For example, four of the most expensive hurricane disasters in the United States since 2005 (Katrina, Sandy, Harvey, and Irma) have each incurred over $50 billion in damages with enormous death tolls [74]. Accurate forecasting of extreme events [75] is therefore crucial, as it not only helps provide timely warnings to the public but also enables emergency managers and responders to better assess the risk of potential hazards caused by future extreme events.

Figure 3.1 Types of extreme values in a given time window.

Despite its importance, forecasting time series with extremes can be tricky, as the extreme values may have been ignored as outliers during training to improve the generalization performance of the model. Furthermore, as mentioned in Chapter 1, current approaches are mostly designed to minimize the mean-square prediction error, so their fitted models focus on predicting the conditional expectation of the target variable rather than its extreme values [76]. Extreme value theory (EVT) offers a statistically well-grounded approach to derive the limiting distribution governing a sequence of extreme values [10]. The two most popular distributions studied in EVT are the generalized extreme value (GEV) and generalized Pareto (GP) distributions. Given a prediction time window, the GEV distribution governs the distribution of its block maxima, whereas the GP distribution is concerned with the distribution of excess values over a certain threshold, as shown in Figure 3.1. This work focuses on forecasting the block maxima, as it allows us to assess the worst-case scenario in the given forecast time window and avoids making ad-hoc decisions regarding the choice of excess threshold to use for the GP distribution.

As mentioned in Chapter 1, classical EVT has limited capacity in terms of modeling the highly complex, nonlinear relationships present in time series data. For example, [33] uses a simple linear model to infer the GEV parameters. Deep learning methods have grown in popularity in recent years due to their ability to capture nonlinear dependencies in the data. Previous studies have utilized a variety of deep neural network architectures for time series modeling, including long short-term memory networks [2, 3, 4], convolutional neural networks [19, 18, 17], encoder-decoder-based RNNs [1], and attention-based models [24, 23]. However, these works are mostly focused on predicting the conditional mean of the target variable, as mentioned in Chapter 1. While there have been some recent attempts to incorporate EVT into deep learning [38, 34, 35], they are primarily focused on modeling the tail distribution, i.e., excess values over a threshold, using the GP distribution, rather than forecasting the block maxima using the GEV distribution. Furthermore, instead of inferring the distribution parameters from data, some methods [34] assume that the parameters are fixed at all times and can be provided as user-specified hyperparameters, while others [35] do not enforce the necessary constraints on the parameters of the extreme value distribution.

Incorporating the GEV distribution into the deep learning formulation presents many technical challenges. First, the GEV parameters must satisfy certain positivity constraints to ensure that the predicted distribution has a finite bound [10]. Maintaining these constraints throughout the training process is a challenge since the model parameters depend on the observed predictor values in a mini-batch.
Another challenge is the scarcity of data, since there is only one block maximum value per time window. This makes it hard to accurately infer the GEV parameters for each window from a set of predictors. Finally, the training process is highly sensitive to model initialization. For example, the random initialization of a deep neural network (DNN) can easily violate certain regularity conditions of the GEV parameters estimated using maximum likelihood (ML) estimation. An improper initialization may lead to $\xi$ estimates that defy the regularity conditions. For example, if $\xi < -0.5$, then the resulting ML estimators may not have the standard asymptotic properties [77], whereas if $\xi > 1$, then the conditional mean is not well-defined. Without proper initialization, the DNN will struggle to converge to a feasible solution with acceptable values of the GEV parameters. Thus, controlling the initial estimate of the parameters is difficult but necessary.

To overcome these challenges, we propose a novel framework called DeepExtrema that utilizes the GEV distribution to characterize the distribution of block maximum values for a given forecast time window. The parameters of the GEV distribution are estimated using a DNN, which is trained to capture the nonlinear dependencies in the time series data. This is a major departure from previous work by [34], where the distribution parameters are assumed to be a fixed, user-specified hyperparameter. DeepExtrema reparameterizes the GEV formulation to ensure that the DNN output is compliant with the GEV positivity constraints. In addition, DeepExtrema offers a novel model bias offset mechanism in order to ensure that the regularity conditions of the GEV parameters are satisfied from the beginning. In summary, the main contributions of the work are:

1. We present a novel framework to predict the block maxima of a given time window by incorporating the GEV distribution into the training of a DNN.
2. We propose a reformulation of the GEV constraints to ensure they can be enforced using activation functions in the DNN.
3. We introduce a model bias offset mechanism to ensure that the DNN output preserves the regularity conditions of the GEV parameters despite its random initialization.
4. We perform extensive experiments on both real-world and synthetic data to demonstrate the effectiveness of DeepExtrema compared to other baseline methods.

3.2 Preliminaries

3.2.1 Problem Statement

Let $z_1, z_2, \cdots, z_T$ be a time series of length $T$. Assume the time series is partitioned into a set of time windows, where each window $[t-\alpha, t+\beta]$ contains a sequence of predictors, $x_t = (z_{t-\alpha}, z_{t-\alpha+1}, \cdots, z_t)$, and target, $\tilde{y}_t = (z_{t+1}, z_{t+2}, \cdots, z_{t+\beta})$. Note that $\beta$ is known as the forecast horizon of the prediction. For each time window, let $y_t = \max_{\tau \in \{1, \cdots, \beta\}} z_{t+\tau}$ be the block maxima of the target variable at time $t$. Our time series forecasting task is to estimate the block maxima, $\hat{y}_t$, as well as its upper and lower quantile estimates, $\hat{y}_U$ and $\hat{y}_L$, of a future time window based on current and past data, $x_t$.

3.2.2 Generalized Extreme Value Distribution

The GEV distribution governs the distribution of block maxima in a given window. Let $Y = \max\{z_1, z_2, \cdots, z_t\}$.
If there exist sequences of constants $a_t > 0$ and $b_t$ such that $\Pr\{(Y - b_t)/a_t \le y\} \to G(y)$ as $t \to \infty$ for a non-degenerate distribution $G$, then the cumulative distribution function $G$ belongs to the family of GEV distributions of the form [10]:

$$G(y) = \exp\left\{ -\left[ 1 + \xi \left( \frac{y - \mu}{\sigma} \right) \right]^{-1/\xi} \right\} \tag{3.1}$$

The GEV distribution is characterized by the following parameters: $\mu$ (location), $\sigma$ (scale), and $\xi$ (shape). The expected value of the distribution is given by

$$y_{\text{mean}} = \mu + \frac{\sigma}{\xi}\left[ \Gamma(1 - \xi) - 1 \right] \tag{3.2}$$

where $\Gamma(x)$ denotes the gamma function of a variable $x > 0$. Thus, $y_{\text{mean}}$ is only well-defined for $\xi < 1$. Furthermore, the $p$-th quantile of the GEV distribution, $y_p$, can be calculated as follows:

$$y_p = \mu + \frac{\sigma}{\xi}\left[ (-\log p)^{-\xi} - 1 \right] \tag{3.3}$$

Given $n$ independent block maxima values, $\{y_1, y_2, \cdots, y_n\}$, with the distribution function given by Equation (3.1) and assuming $\xi \neq 0$, its log-likelihood function is given by:

$$\ell_{GEV}(\mu, \sigma, \xi) = -n \log \sigma - \left( \frac{1}{\xi} + 1 \right) \sum_{i=1}^{n} \log\left( 1 + \xi \frac{y_i - \mu}{\sigma} \right) - \sum_{i=1}^{n} \left( 1 + \xi \frac{y_i - \mu}{\sigma} \right)^{-1/\xi} \tag{3.4}$$

The GEV parameters $(\mu, \sigma, \xi)$ can be estimated using the maximum likelihood (ML) approach by maximizing (3.4) subject to the following positivity constraints:

$$\sigma > 0 \quad \text{and} \quad \forall i: \; 1 + \frac{\xi}{\sigma}(y_i - \mu) > 0 \tag{3.5}$$

In addition to the above positivity constraints, the shape parameter $\xi$ must be within a certain range of values in order for the ML estimators to exist and have regular asymptotic properties [10]. Specifically, the ML estimators have regular asymptotic properties as long as $\xi > -0.5$. Otherwise, if $-1 < \xi < -0.5$, then the ML estimators may exist but will not have regular asymptotic properties. Finally, the ML estimators do not exist if $\xi < -1$ [77].

3.3 Proposed Framework: DeepExtrema

This section presents the proposed DeepExtrema framework for predicting the block maxima of a given time window. The predicted block maxima $\hat{y}$ follows a GEV distribution whose parameters are conditioned on observations of the predictors $x$. Figure 3.2 provides an overview of the DeepExtrema architecture. Given the input predictors $x$, the framework uses a stacked LSTM network to learn a representation of the time series. The LSTM outputs a latent representation, which will be used by a fully connected layer to generate the GEV parameters:

$$(\mu, \sigma, \xi_u, \xi_l) = LSTM(x) \tag{3.6}$$

where $\mu$, $\sigma$, and the $\xi$'s are the location, scale, and shape parameters of the GEV distribution. Note that $\xi_u$ and $\xi_l$ are the estimated parameters due to the reformulation of the GEV constraints, which will be described in the next subsection.

Figure 3.2 Proposed DeepExtrema framework for predicting block maxima using GEV distribution.

The proposed Model Bias Offset (MBO) component performs bias correction on the estimated GEV parameters to ensure that the LSTM outputs preserve the regularity conditions of the GEV parameters irrespective of how the network was initialized. The GEV parameters are subsequently provided to a fully connected layer to obtain point estimates of the block maxima, which include its expected value $\hat{y}$ as well as upper and lower quantiles, $\hat{y}_U$ and $\hat{y}_L$, using the equations given in (3.2) and (3.3), respectively. The GEV parameters are then used to compute the negative log-likelihood of the estimated GEV distribution, which is combined with the root-mean-square error (RMSE) of the predicted block maxima to determine the overall loss function. Details of the different components are described in the subsections below.
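To make the preliminaries concrete, the following minimal sketch extracts block maxima from a synthetic series and fits a single global GEV distribution by maximum likelihood, the naive baseline revisited in Section 3.3.1 and used to initialize the MBO component in Section 3.3.2. It is illustrative only, not the DeepExtrema code; note that scipy's genextreme parameterizes the shape as c = -ξ relative to Equation (3.1).

```python
# Minimal sketch: block maxima extraction and a global GEV fit via ML.
# scipy's genextreme uses shape c = -xi relative to Equation (3.1).
import numpy as np
from scipy.special import gamma as gamma_fn
from scipy.stats import genextreme

rng = np.random.default_rng(0)
z = rng.gumbel(loc=20.0, scale=5.0, size=10_000)     # synthetic time series

beta = 50                                            # window (block) length
blocks = z[: len(z) // beta * beta].reshape(-1, beta)
y = blocks.max(axis=1)                               # block maxima y_1, ..., y_n

c, mu, sigma = genextreme.fit(y)                     # ML estimates; xi = -c
xi = -c
print(f"mu={mu:.2f}, sigma={sigma:.2f}, xi={xi:.3f}")

# Expected value (Eq. 3.2, defined for xi < 1) and the 95th percentile (Eq. 3.3)
if xi != 0 and xi < 1:
    y_mean = mu + sigma / xi * (gamma_fn(1 - xi) - 1)
    y_p95 = mu + sigma / xi * ((-np.log(0.95)) ** (-xi) - 1)
    print(f"mean={y_mean:.2f}, 95th percentile={y_p95:.2f}")
print("scipy check:", genextreme.ppf(0.95, c, loc=mu, scale=sigma))
```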
3.3.1 GEV Parameter Estimation

Let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ be a set of training examples, where each $x_i$ denotes the predictor time series and $y_i$ is the corresponding block maxima for time window $i$. A naïve approach is to assume that the GEV parameters $(\mu, \sigma, \xi)$ are constants for all time windows. This can be done by fitting a global GEV distribution to the set of block maxima values $y_i$'s using the maximum likelihood approach given in (3.4). Instead of using a global GEV distribution with fixed parameters, our goal is to learn the parameters $(\mu_i, \sigma_i, \xi_i)$ of each window $i$ using the predictors $x_i$. The added flexibility enables the model to improve the accuracy of its block maxima prediction, especially for non-stationary time series.

The estimated GEV parameters generated by the LSTM must satisfy the two positivity constraints given by the inequalities in (3.5). While the first positivity constraint on $\sigma_i$ is straightforward to enforce, maintaining the second one is harder, as it involves a nonlinear relationship between $y_i$ and the estimated GEV parameters $\xi_i$, $\mu_i$, and $\sigma_i$. The GEV parameters may vary from one input $x_i$ to another, and thus, learning them from the limited training examples is a challenge. Worse still, some of the estimated GEV parameters could be erroneous, especially at the initial rounds of the training epochs, making it difficult to satisfy the constraints throughout the learning process.

To address these challenges, we propose a reformulation of the second constraint in (3.5). This allows the training process to proceed even though the second constraint in (3.5) has yet to be satisfied, especially in the early rounds of the training epochs. Specifically, we relax the hard constraint by adding a small tolerance factor, $\tau > 0$, as follows:

$$\forall i: \; 1 + \frac{\xi}{\sigma}(y_i - \mu) + \tau \ge 0 \tag{3.7}$$

The preceding soft constraint allows for minor violations of the second constraint in (3.5) as long as $1 + \frac{\xi}{\sigma}(y_i - \mu) > -\tau$ for all time windows $i$. Furthermore, to ensure that the inequality holds for all $y_i$'s, we reformulate the constraint in (3.7) in terms of $y_{\min} = \min_i y_i$ and $y_{\max} = \max_i y_i$ as follows:

Theorem 1. Assuming $\xi \neq 0$, the soft constraint in (3.7) can be reformulated into the following bounds on $\xi$:

$$-\frac{\sigma}{y_{\max} - \mu}(1 + \tau) \le \xi \le \frac{\sigma}{\mu - y_{\min}}(1 + \tau) \tag{3.8}$$

where $\tau$ is the tolerance on the constraint in (3.5).

Proof. For the lower bound on $\xi$, set $y_i$ to be $y_{\max}$ in (3.7):

$$1 + \frac{\xi}{\sigma}(y_{\max} - \mu) + \tau \ge 0 \implies \frac{\xi}{\sigma}(y_{\max} - \mu) \ge -(1 + \tau) \implies \xi \ge -\frac{\sigma}{y_{\max} - \mu}(1 + \tau)$$

To obtain the upper bound on $\xi$, set $y_i$ to be $y_{\min}$ in (3.7):

$$1 + \frac{\xi}{\sigma}(y_{\min} - \mu) + \tau \ge 0 \implies \frac{\xi}{\sigma}(\mu - y_{\min}) \le (1 + \tau) \implies \xi \le \frac{\sigma}{\mu - y_{\min}}(1 + \tau) \qquad \square$$

Following Theorem 1, the upper and lower bound constraints on $\xi$ in (3.8) can be restated as follows:

$$\frac{\sigma}{\mu - y_{\min}}(1 + \tau) - \xi \ge 0, \qquad \xi + \frac{\sigma}{y_{\max} - \mu}(1 + \tau) \ge 0 \tag{3.9}$$

The reformulation imposes lower and upper bounds on $\xi$, which can be used to re-parameterize the second constraint in (3.5). Next, we describe how the reformulated constraints in (3.9) can be enforced by DeepExtrema in a DNN. Given an input $x_i$, DeepExtrema will generate the following four outputs: $\mu_i$, $P_{1i}$, $P_{2i}$, and $P_{3i}$. A softplus activation function, $\mathrm{softplus}(x) = \log(1 + \exp(x))$, which is a smooth approximation to the ReLU function, is used to enforce the non-negativity constraints associated with the GEV parameters. The scale parameter $\sigma_i$ can be computed using the softplus activation function on $P_{1i}$ as follows:

$$\sigma_i = \mathrm{softplus}(P_{1i}) \tag{3.10}$$

This ensures the constraint $\sigma_i \ge 0$ is met.
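A minimal PyTorch-style sketch of this reparameterization is given below; it maps the four network outputs to $\sigma_i$ via Equation (3.10) above and to the $\xi$ bounds via Equations (3.11)-(3.12) below. It is an illustration under assumed layer sizes, names, and tolerance value, not the released DeepExtrema implementation.

```python
# Illustrative sketch of the softplus-based reparameterization: the network
# emits (mu, p1, p2, p3), which are mapped to sigma (Eq. 3.10) and to the
# xi bounds (Eqs. 3.11-3.12). y_min, y_max, and tau are assumed constants.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GEVHead(nn.Module):
    def __init__(self, hidden_size, y_min, y_max, tau=0.1):
        super().__init__()
        self.fc = nn.Linear(hidden_size, 4)   # outputs: mu, p1, p2, p3
        self.y_min, self.y_max, self.tau = y_min, y_max, tau

    def forward(self, h):
        mu, p1, p2, p3 = self.fc(h).unbind(dim=-1)
        sigma = F.softplus(p1)                                        # Eq. (3.10)
        xi_u = sigma / (mu - self.y_min) * (1 + self.tau) - F.softplus(p2)
        xi_l = F.softplus(p3) - sigma / (self.y_max - mu) * (1 + self.tau)
        return mu, sigma, xi_u, xi_l

head = GEVHead(hidden_size=64, y_min=0.0, y_max=10.0)
h = torch.randn(32, 64)                       # e.g., the last LSTM hidden state
mu, sigma, xi_u, xi_l = head(h)
print(sigma.min() > 0)                        # the scale estimate is always positive
```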
The lower and upper bound constraints on $\xi_i$ given by the inequalities in (3.9) are enforced using the softplus function on $P_{2i}$ and $P_{3i}$:

$$\frac{\sigma_i}{\mu_i - y_{\min}}(1 + \tau) - \xi_{u,i} = \mathrm{softplus}(P_{2i}), \qquad \frac{\sigma_i}{y_{\max} - \mu_i}(1 + \tau) + \xi_{l,i} = \mathrm{softplus}(P_{3i}) \quad (3.11)$$

By re-arranging the above equations, we obtain

$$\xi_{u,i} = \frac{\sigma_i}{\mu_i - y_{\min}}(1 + \tau) - \mathrm{softplus}(P_{2i}), \qquad \xi_{l,i} = \mathrm{softplus}(P_{3i}) - \frac{\sigma_i}{y_{\max} - \mu_i}(1 + \tau) \quad (3.12)$$

DeepExtrema computes the upper and lower bounds on $\xi_i$ using the formulas in (3.12). During training, it minimizes the distance between $\xi_{u,i}$ and $\xi_{l,i}$ and uses the value of $\xi_{u,i}$ as the estimate for $\xi_i$. Note that the two $\xi_i$'s converge rapidly to a single value after a small number of training epochs.

3.3.2 Model Bias Offset (MBO)

Although the constraint reformulation approach described in the previous subsection ensures that the DNN outputs satisfy the GEV constraints, the random initialization of the network can produce estimates of $\xi$ that violate the regularity conditions described in Section 3.2.2. Specifically, the ML estimators may not have regular asymptotic properties when $\xi < -0.5$, while the conditional mean is not well-defined when $\xi > 1$. Additionally, the estimated location parameter $\mu$ may not fall within the desired range between $y_{\min}$ and $y_{\max}$ when the DNN is randomly initialized. Thus, without proper initialization, the DNN will struggle to converge to a good solution and produce acceptable values of the GEV parameters.

One way to address this challenge is to repeat the random initialization of the DNN until a reasonable set of initial GEV parameters, i.e., $y_{\min} \le \mu \le y_{\max}$ and $-0.5 < \xi < 1$, is found. However, this approach is infeasible given the size of the parameter space of the DNN. A better strategy is to control the initial output of the neural network in order to produce an acceptable set of GEV parameters $(\mu, \sigma, \xi)$ during initialization. Unfortunately, controlling the initial output of a neural network is difficult given its complex architecture.

We introduce a simple but effective technique called Model Bias Offset (MBO) to address this challenge. The key insight is to view the GEV parameters as a biased output due to the random initialization of the DNN and then perform bias correction to alleviate the effect of the initialization. To do this, let $\mu_{\text{desired}}$, $\sigma_{\text{desired}}$, and $\xi_{\text{desired}}$ be an acceptable set of initial GEV parameters. The values of these initial parameters must satisfy the regularity conditions $-0.5 < \xi_{\text{desired}} < 1$, $\sigma_{\text{desired}} > 0$, and $y_{\min} \le \mu_{\text{desired}} \le y_{\max}$. This can be done by randomly choosing a value from the preceding range of acceptable values or, more intelligently, by using the GEV parameters estimated from a global GEV distribution fitted to the block maxima $y_i$'s in the training data via the ML approach given in (3.4), without considering the input predictors (see the discussion at the beginning of Section 3.3.1). We find that the latter strategy works well in practice as it can lead to faster convergence, especially when the global GEV parameters are close to the final values after training.

Figure 3.3 Model Bias Offset (MBO) to ensure the initial estimates of the GEV parameters are reasonable even when the DNN is randomly initialized.

When the DNN is randomly initialized, let $\mu_0$, $\sigma_0$, $\xi_{u,0}$, and $\xi_{l,0}$ be the initial DNN output for the GEV parameters. These initial outputs may not necessarily fall within their respective range of acceptable values.
We consider the difference between the initial DNN output and the desired GEV parameters as a model bias due to the random initialization:

$$\mu_{\text{bias}} = \mu_0 - \mu_{\text{desired}}, \quad \sigma_{\text{bias}} = \sigma_0 - \sigma_{\text{desired}}, \quad \xi_{u,\text{bias}} = \xi_{u,0} - \xi_{u,\text{desired}}, \quad \xi_{l,\text{bias}} = \xi_{l,0} - \xi_{l,\text{desired}} \quad (3.13)$$

The model bias terms in (3.13) can be computed during the initial forward pass of the algorithm. The gradient calculation and back-propagation are disabled during this step to prevent the DNN from computing the loss and updating its weights with the unacceptable GEV parameters. After the initial iteration, the gradient calculation is enabled and the bias terms are subtracted from the DNN estimates of the GEV parameters in all subsequent iterations $t$:

$$\mu_t \to \mu_t - \mu_{\text{bias}}, \quad \sigma_t \to \sigma_t - \sigma_{\text{bias}}, \quad \xi_{u,t} \to \xi_{u,t} - \xi_{u,\text{bias}}, \quad \xi_{l,t} \to \xi_{l,t} - \xi_{l,\text{bias}} \quad (3.14)$$

Note that when $\mu_t$ is set to $\mu_0$, the debiased output $\mu_t - \mu_{\text{bias}}$ equals $\mu_{\text{desired}}$. By debiasing the output of the DNN in this way, we guarantee that the initial GEV parameters satisfy the GEV regularity conditions.

3.3.3 Block Maxima Prediction

Given an input $x_i$, the DNN estimates the GEV parameters needed to compute the block maxima $\hat{y}_i$ along with its upper and lower quantiles, $\hat{y}_{U,i}$ and $\hat{y}_{L,i}$, respectively. The quantiles are estimated using the formula given in (3.3). The GEV parameters are provided as input to a fully connected network (FCN) to generate the block maxima prediction, $\hat{y}_i$. DeepExtrema employs a combination of the negative log-likelihood function, $-\ell_{GEV}(\mu, \sigma, \xi)$, of the GEV distribution and a least-square loss function to train the model. This enables the framework to simultaneously learn the GEV parameters and make accurate block maxima predictions. The loss function to be minimized by DeepExtrema is:

$$\mathcal{L} = \lambda_1 \hat{\mathcal{L}} + (1 - \lambda_1) \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \quad (3.15)$$

where $\hat{\mathcal{L}} = -\lambda_2 \, \ell_{GEV}(\mu, \sigma, \xi) + (1 - \lambda_2) \sum_{i=1}^{n} (\xi_{u,i} - \xi_{l,i})^2$ is the regularized GEV loss. The first term in $\hat{\mathcal{L}}$ corresponds to the negative log-likelihood function given in Equation (3.4), while the second term minimizes the difference between the upper and lower-bound estimates of $\xi$. The loss function $\mathcal{L}$ combines the regularized GEV loss ($\hat{\mathcal{L}}$) with the least-square loss. Here, $\lambda_1$ and $\lambda_2$ are hyperparameters that manage the trade-off between the different factors of the loss function.

3.4 Experimental Evaluation

This section presents our experimental results comparing DeepExtrema against the various baseline methods. The code and datasets are available at https://github.com/galib19/DeepExtrema-IJCAI22-.

3.4.1 Data

3.4.1.1 Synthetic Data

As the ground truth GEV parameters are often unknown, we created a synthetic dataset to assess our method's ability to correctly infer the GEV parameters. The data is generated assuming the GEV parameters are functions of some input predictors $x \in \mathbb{R}^6$. We first generate $x$ by random sampling from a uniform distribution. We then assume a non-linear mapping from $x$ to the GEV parameters $\mu$, $\sigma$, and $\xi$ via the following equations:

$$\mu(x) = w_\mu^T (\exp(x) + x), \quad \sigma(x) = w_\sigma^T (\exp(x) + x), \quad \xi(x) = w_\xi^T (\exp(x) + x) \quad (3.16)$$

where $w_\mu$, $w_\sigma$, and $w_\xi$ are generated from a standard normal distribution. Using the generated $\mu$, $\sigma$, and $\xi$ parameters, we then randomly sample $y$ from the GEV distribution governed by these parameters. Here, $y$ denotes the block maxima as it is generated from a GEV distribution. We created 8,192 block maxima values for our synthetic data.

3.4.1.2 Real World Data

We consider the following 3 datasets for our experiments.
Hurricane: This corresponds to tropical cyclone intensity data obtained from the HURDAT2 database [78]. There are altogether 3,111 hurricanes spanning the period between 1851 and 2019. For each hurricane, wind speeds (intensities) were reported at every 6-hour interval. We consider only hurricanes that have at least 24 time steps for our experiments. For each hurricane, we created non-overlapping time windows of length 24 time steps (6 days). We use the first 16 time steps (4 days) in the window as the predictor variables and the block maxima of the last 8 time steps (2 days) as the target variable.

Solar: This corresponds to half-hourly energy use (kWh) for 55 families over the course of 284 days from the Ausgrid database [79]. We preprocess the data by creating non-overlapping time windows of length 192 time steps (4 days). We use the first 144 time steps (3 days) in the window as the predictor variables and the block maxima of the last 48 time steps (1 day) as the target variable.

Weather: We used a weather dataset from the Kaggle competition [80]. The data is based on hourly temperature data for a city over a ten-year period. We use the first 16 time steps (16 hours) in the window as the predictor variables and the block maxima of the last 8 time steps (8 hours) as the target variable.

3.4.2 Experimental Setup

For evaluation purposes, we split the data into separate training, validation, and testing sets with a ratio of 7:2:1. The data is standardized to have zero mean and unit variance. We compare DeepExtrema against the following baseline methods: (1) Persistence, which uses the block maxima value from the previous time step as its predicted value, (2) fully-connected network (FCN), (3) LSTM, (4) Transformer, (5) DeepPIPE [32], and (6) EVL [34]. We use the following metrics to evaluate the performance of the methods: (1) root mean squared error (RMSE) and correlation between the predicted and ground truth block maxima, and (2) negative log-likelihood (for synthetic data). Finally, hyperparameter tuning is performed by assessing the model performance on the validation set. The hyperparameters of the baseline and proposed methods are selected using Ray Tune, a tuning framework with an ASHA (asynchronous successive halving algorithm) scheduler for early stopping.

3.4.3 Experimental Results

3.4.3.1 Results On Synthetic Data

In this experiment, we compared the performance of DeepExtrema against using a single (global) set of GEV parameters to fit the data. Based on the experiments, DeepExtrema achieves a substantially lower negative log-likelihood of 4410 than the global GEV estimate, which has a negative log-likelihood of 4745. This result supports the assumption that each block maxima comes from a different GEV distribution rather than a single (global) GEV distribution. Figure 3.4 also illustrates that DeepExtrema can accurately predict the GEV parameters: location ($\mu$), scale ($\sigma$), and shape ($\xi$).

Figure 3.4 Actual and predicted GEV parameters (location, scale, and shape) using DeepExtrema.

                 Hurricanes          Ausgrid            Weather
Methods          RMSE      rho       RMSE      rho      RMSE      rho
Persistence      28.6      0.60      0.84      0.65     4.16      0.96
FCN              14.14     0.87      0.69      0.65     2.50      0.97
LSTM             13.31     0.88      0.65      0.64     2.53      0.97
Transformer      13.89     0.88      0.68      0.62     2.43      0.98
DeepPIPE         13.67     0.87      0.71      0.59     2.59      0.94
EVL              15.72     0.83      0.75      0.54     2.71      0.90
DeepExtrema      12.81     0.90      0.63      0.67     2.27      0.97

Table 3.1 Performance comparison on real-world data using RMSE and Correlation ($\rho$).
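As a point of reference for the global GEV baseline used in the synthetic-data comparison above (and for the global fit that initializes the Model Bias Offset), a single GEV distribution can be fitted to all block maxima with scipy, as sketched below. The Gumbel-generated sample is a stand-in used purely for illustration, and note that scipy parameterizes the GEV shape as $c = -\xi$ relative to (3.1); this sketch is not taken from the released code.

```python
import numpy as np
from scipy.stats import genextreme

# Illustrative block maxima (a Gumbel sample, i.e., GEV with xi = 0)
block_maxima = np.random.default_rng(0).gumbel(loc=10.0, scale=2.0, size=8192)

# Maximum-likelihood fit of a single (global) GEV distribution
c_hat, mu_hat, sigma_hat = genextreme.fit(block_maxima)
xi_hat = -c_hat   # convert scipy's shape convention to the xi of Eq. (3.1)

# Negative log-likelihood of the fitted global GEV (cf. Eq. (3.4))
nll = -genextreme.logpdf(block_maxima, c_hat, loc=mu_hat, scale=sigma_hat).sum()
```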
3.4.3.2 Results On Real World Data

Evaluation on real-world data shows that DeepExtrema outperforms the baseline methods used for comparison on all data sets (see Table 3.1). For RMSE, DeepExtrema achieves a lower RMSE than all the baselines on all 3 datasets, whereas for correlation, DeepExtrema outperforms the baselines on 2 of the 3 datasets.

Figure 3.5 Comparison between actual and predicted block maxima of hurricane intensities for (a) DeepExtrema and (b) EVL [34].

Figure 3.6 90% confidence interval of the hurricane intensity predictions for DeepExtrema, sorted in increasing block maxima values. Ground truth values are shown in red.

To demonstrate how well the model predicts the extreme values, Figure 3.5 shows a scatter plot of the actual versus predicted values generated by DeepExtrema on the test set of the hurricane intensity data. The results suggest that DeepExtrema can accurately predict hurricane intensities for a wide range of values, especially those below 140 knots. DeepExtrema also does a better job at predicting the high-intensity hurricanes compared to EVL [34]. Figure 3.6 shows the 90% confidence interval of the predictions. Apart from the point and quantile estimations, DeepExtrema can also estimate the GEV parameter values for each hurricane.

3.4.3.3 Ablation Studies

The hyperparameter $\lambda_1$ in the objective function of DeepExtrema controls the trade-off between the regularized GEV loss and the RMSE loss. Experimental results show that the RMSE of block maxima prediction generally decreases as $\lambda_1$ increases (see Table 3.2). This validates the importance of incorporating GEV theory to improve the accuracy of block maxima estimation instead of using the mean squared loss alone.

                       RMSE of Block Maxima
Hyperparameter    Hurricanes    Ausgrid    Weather
lambda_1 = 0.0    13.28         0.68       2.52
lambda_1 = 0.5    13.03         0.63       2.41
lambda_1 = 0.9    12.81         0.64       2.27

Table 3.2 Effect of $\lambda_1$ on RMSE of block maxima prediction.

3.4.3.4 Case Study

To further illustrate the efficacy of our proposed approach, DeepExtrema, we conducted a case study focused on forecasting maximum hurricane intensity. This case study analyzed two samples from our test dataset: Major Hurricane Harvey (2017) and Tropical Storm Ernesto (2012). Major Hurricane Harvey was a particularly devastating event, ranking as the second most expensive hurricane in U.S. history, causing $152.2 billion in damages and claiming at least 103 lives. Tropical Storm Ernesto also had significant impacts, with heavy rainfall and flooding resulting in $500 million in losses.

Figure 3.7 Case Study: Forecasting maximum hurricane intensity using DeepExtrema and Transformer for (a) Major Hurricane Harvey (2017) and (b) Tropical Storm Ernesto (2012).

In this study, we aimed to predict the maximum intensity for the next two days based on the intensity values from the preceding four days. By employing our DeepExtrema framework alongside the second-best performing baseline model, Transformer, we observed notable differences in their forecasting performance. As illustrated in Figure 3.7, DeepExtrema demonstrated a remarkably close alignment with the ground truth in forecasting maximum intensity. In contrast, the Transformer model consistently underestimated the maximum intensity for both hurricane events. This case study highlights the effectiveness and accuracy of DeepExtrema in capturing and forecasting extreme values in time series data, particularly in the context of high-impact weather events like hurricanes.
The superior performance of our proposed approach compared to the baseline model further reinforces the potential of DeepExtrema in enabling more accurate predictions of 34 extreme events, which can have far-reaching implications for disaster preparedness, risk mitigation, and resource allocation strategies. 3.5 Conclusion This work presents a novel deep learning framework called DeepExtrema that combines extreme value theory with deep learning to address the challenges of predicting extremes in time series. We offer a reformulation and re-parameterization technique for satisfying constraints and a model bias offset technique for proper model initialization. We evaluated our framework on synthetic and real-world data and showed its effectiveness. For future work, we plan to extend the formulation to enable more complex deep learning architectures such as attention mechanisms. Furthermore, the system will be expanded to model spatiotemporal extremes. 35 CHAPTER 4 SELF-RECOVER: FORECASTING BLOCK MAXIMA IN TIME SERIES FROM PREDICTORS WITH DISPARATE TEMPORAL COVERAGE USING SELF-SUPERVISED LEARNING 4.1 Introduction Forecasting of extreme values such as block maxima in a future time window is an important but challenging problem. Accurate forecasting of the block maxima enables us to anticipate the worst-case scenario expected to occur within the forecast period. As mentioned in chapter 2, deep learning [2, 3, 4, 1, 20, 24, 23] has become increasingly popular in recent years for time series forecasting due to their capacity to learn complex nonlinear relationships in data. As outlined in Chapter 1, these approaches are primarily designed to predict the conditional mean rather than the tail distribution, which often limits their effectiveness in forecasting extreme values. This has led to considerable interest in incorporating sound statistical principles from extreme value theory (EVT) into the deep learning formulation [81, 36, 38, 34, 35]. Another challenge in forecasting block maxima is the limited availability of extreme historical observations. Towards this end, process-based models have been developed in many domains to incorporate our understanding of the physical processes that regulate the dynamical system. For example, general circulation models (GCMs) are widely-used process-based models for generating robust representations of future climate scenarios. The forecasts generated from these process-based models can be used as domain-informed predictors that can be coupled with historical observations to enhance the performance of time series forecasting. However, a key challenge in integrating the Figure 4.1 Disparate temporal coverage between historical observations and process-based model outputs. 36 process-based models forecasts with historical observations is their disparate temporal coverage, as shown in Figure 4.1. For example, long-term historical records of climate observations, dating back before the 20th century, are available for various locations, whereas archived scenarios of GCM runs may only be available for the 20th century and beyond since scenario development is time-consuming and costly [82]. The presence of large blocks of missing values can significantly impede the ability to make accurate predictions [83]. Self-supervised learning [84, 85] offers a possible approach to address this challenge. Self- supervised learning is well-suited for learning representations from available data and for generating new data. 
This approach can help address the challenges presented by missing or disparately distributed predictor variables in time series. By using self-supervised techniques, it is possible to extract useful information from the data that is available and to generate new data that can be used to fill in gaps in the original data set. This can improve the accuracy of predictions and increase the overall utility of the data. However, despite its success in various domains such as image and language processing, its application in time series imputation remains relatively under-explored. In particular, how to account for the missing values from one of the two sources and integrate their representations to train an accurate DNN model are two key issues that need to be addressed. In this work, we present a framework called Self-Recover to enable accurate forecasts of block maxima in time series using both historical observations and model-based forecasts in addressing the issue of disparate temporal coverage. A combination of contrastive and generative self-supervised schemes followed by a denoising autoencoder is used to impute long-term and random missing values of model-based forecasts. Also, Self-Recover employs a residual learning technique to combine the representations of historical and model-based forecasts. The combined representation is used to generate parameters of the generalized extreme value (GEV) distribution characterizing the block maxima values. It also uses a reparameterization technique to ensure that the DNN output adheres to the GEV positivity constraints [10] and a model bias offset technique [81] to ensure that the GEV parameters’ regularity conditions are satisfied in spite of the random initialization of DNNs. 37 Figure 4.2 Illustration of the long-term and random missing values for model-based forecasts. Time period (a) is the case of long-term missing values where all of the model-based forecasts are missing. Time period (b) is the case of random missing values where a few of the model-based forecasts are missing. In summary, the main contributions of the work are: 1. We present a novel framework, Self-Recover, to forecast the block maxima of a future time window while simultaneously recover missing values from the model-based forecasts. 2. To impute the missing values, a novel self-supervised learning approach is proposed by integrating contrastive and generative schemes followed by a denoising autoencoder. This approach can impute both structured and random missing values in the model-based forecasts. 3. We present a residual technique to combine the representations from historical and model-based forecasts and show that it is better than simple concatenation of the two features. 4. We perform extensive experiments on real-world data to demonstrate the effectiveness of Self-Recover compared to state-of-the-art time series forecasting methods. 4.2 Preliminaries 4.2.1 Problem Statement Consider a time series data that is partitioned into a set of overlapping time windows, {𝑤1, 𝑤2, · · · , 𝑤𝑛}, where each window 𝑤𝑖 is defined by its start time, 𝑡𝑖 − 𝛼, and end time, 𝑡𝑖 + 𝛽. Let the sample of historical observation values is denoted by 𝑧𝐻 , · · · , 𝑧𝐻 , · · · , 𝑧𝐻 𝑡−𝛼, 𝑧𝐻 𝑡 , 𝑧𝐻 𝑡+1 𝑡+𝛽 for 𝑡−𝛼+1 𝑖 ∈ R. We assume the time window [𝑡 − 𝛼, 𝑡] contains time window [𝑡 − 𝛼, 𝑡 + 𝛽], where each 𝑧𝐻 historical observations while the time window [𝑡 + 1, 𝑡 + 𝛽] contains the future values to be predicted, where 𝛽 is the forecast horizon. 
We further assume there are 𝑚 process model forecasts associated with some of the time windows. Let 𝑧𝑀 𝑡 𝑖 ∈ R𝑚. For each time window [𝑡 − 𝛼, 𝑡 + 𝛽], let 𝑥𝐻 𝑡+𝛽, be the sample of model-based forecasts, 𝑡 ) ∈ R𝛼+1 𝑡−𝛼, 𝑧𝐻 where each 𝑧𝑀 𝑡 = (𝑧𝐻 , · · · , 𝑧𝑀 , · · · , 𝑧𝐻 , 𝑧𝑀 𝑡+1 𝑡−𝛼+1 38 Figure 4.3 Self-Recover: Proposed framework for forecasting block maxima by using both historical observations and model-based forecasts as predictors along with the generalized extreme value (GEV) distribution. be the historical predictors, 𝑥 𝑀0 and 𝑦𝑡 = max𝜏∈{1,··· ,𝛽} 𝑧(𝐻) model-based forecasts 𝑥 𝑀◦ 𝑡 𝑡 = (𝑧𝑀 𝑡 𝑡+𝛽) ∈ R𝑚×𝛽 be the initial model-based forecasts 𝑡+𝜏 ∈ R be the target variable (block maxima) to be predicted. The initial , · · · , 𝑧𝑀 , 𝑧𝑀 𝑡+1 could be partially or completely missing as illustrated in Figure 4.2. Our initial objective is to impute the partially and completely missing values of the initial ) for the future time window [𝑡 + 1, 𝑡 + 𝛽] using the historical predictors model-based forecasts (𝑥 𝑀0 (𝑥𝐻 initial model-based forecasts 𝑥 𝑀0 𝑡 𝑡 𝑡 ) in [𝑡 − 𝛼, 𝑡]. We will train a model 𝐹𝑖𝑚 𝑝𝑢𝑡𝑒 : R𝛼+1 → R𝑚×𝛽 to impute the missing values in the = 𝐹𝑖𝑚 𝑝𝑢𝑡𝑒 (𝑥𝐻 and get the full imputed model-based forecasts 𝑥 𝑀 𝑓 𝑡 ). 𝑡 Our final objective is to predict the block maxima value (𝑦𝑡) of the future time window [𝑡 +1, 𝑡 + 𝛽] using both the historical predictors (𝑥𝐻 𝑡 ) in [𝑡 − 𝛼, 𝑡] and model-based forecasts (𝑥 𝑀 𝑓 𝑡 ) in the future time window [𝑡 + 1, 𝑡 + 𝛽]. We will train a model 𝐹 𝑓 𝑜𝑟𝑒𝑐𝑎𝑠𝑡 : R𝛼+1 × R𝑚×𝛽 → R to predict the block maxima value, ˆ𝑦𝑡 = 𝐹 𝑓 𝑜𝑟𝑒𝑐𝑎𝑠𝑡 (𝑥𝐻 𝑡 , 𝑥 𝑀 𝑓 𝑡 ), of the time window along with its upper and lower quantiles, ˆ𝑦𝑈 and ˆ𝑦 𝐿. 39 4.2.2 DeepExtrema Framework The DeepExtrema framework [81] is designed to predict the block maxima of a time window by incorporating the Generalized Extreme Value (GEV) distribution [10] given in (3.1) into the training of deep neural networks (DNNs). DeepExtrema learns the parameters of the GEV distribution and predicts the block maxima using DNNs. Furthermore, it can produce estimates of uncertainties in its prediction using the 𝑝th quantile of the GEV distribution, 𝑦 𝑝 using (3.3). Given training data consisting of 𝑛 block maxima values, {𝑦1, 𝑦2, · · · , 𝑦𝑛}, DeepExtrema is trained to minimize the mean-square error of the block maxima prediction as well as the negative log-likelihood function in (3.4) to ensure its predictions are consistent with the GEV distribution. DeepExtrema also reformulates the GEV constraints given in (3.5) and reparameterizes the DNN output accordingly to ensure the constraints can be enforced during DNN training. It further uses a novel model bias offset mechanism to ensure that the DNN output preserves the GEV constraints in spite of its random initialization. Specifically, it computes the bias introduced by the randomly initialized network and performs bias correction to mitigate its effect in subsequent training epochs. This strategy of debiasing the DNN outputs guarantees that the GEV parameters estimated by the DNN would satisfy the regularity conditions at all times even if the DNN is not properly initialized. 4.3 Proposed Self-Recover Framework Self-Recover builds upon the DeepExtrema framework to enable the incorporation of both historical and model-based forecasts into its formulation. Figure 4.3 presents a high-level overview of the proposed Self-Recover framework. Given the historical observations and model-based forecasts, the framework initially imputes the long-term and random missing values. 
It employs a combination of contrastive and generative self-supervised learning schemes to impute long-term missing values of model-based forecasts. Also, a denoising autoencoder (DAE) is employed to impute random missing values. Then, it employs two separate DNNs to learn individual feature representations of the historical and model-based forecasts. Afterward, it uses a residual technique to augment the learned model-based forecasts representation, ℎ𝑀, into the representation of the 40 historical observations, ℎ𝐻. It then passes the combined representation to the downstream modules for parameter estimation of the GEV distribution and prediction of block maxima. These downstream modules are derived from DeepExtrema. Specifically, it assumes the following data generation process for the block maxima ˆ𝑦𝑡 in each time window [𝑡 − 𝛼, 𝑡 + 𝛽]: ℎ𝐻 = 𝑔𝐻 (𝑥𝐻) and ℎ𝑀 = 𝑔𝑀 (𝑥 𝑀 𝑓 ) (𝜇𝑡, 𝜎𝑡, 𝜉𝑡) = 𝑓 (ℎ𝐻 + 𝛼ℎ𝑀) ˆ𝑦𝑡 ∼ 𝐺𝐸𝑉 (𝜇𝑡, 𝜎𝑡, 𝜉𝑡) (4.1) (4.2) (4.3) Equation (4.1) states that the hidden representations for the historical observations and model-based forecasts are functions of their corresponding predictors. Equation (4.2) states that the GEV parameters are functions of these hidden representations. Finally, the observed block maxima are drawn from the GEV distribution according to Equation (4.3). 4.3.1 Self-supervised Learning for Handling Disparate Temporal Coverage To deal with the disparate temporal coverage, Self-Recover presents a self-supervised approach based on contrastive and generative learning schemes followed by a denoising autoencoder (DAE) for simultaneously imputing long-term and random missing values. The proposed approach is illustrated in Figure 4.4. First, it employs a contrastive learning scheme to learn representations from historical obser- vations. Contrastive learning [85] is a method of learning representations from unlabeled data by comparing different augmented versions of the same data sample. In the context of time series, this would involve training the model on a set of time series data and then presenting it with pairs of time series data, where one is a positive example (e.g., a similar time series) and the other is a negative example (e.g., a dissimilar time series). The model is then trained to identify the differences between the positive and negative examples. Given the historical time series 𝑥𝐻 𝑡 for each window, it augments the time series to get similar examples (𝑥𝐻 1,𝑡, 𝑥𝐻 2,𝑡). For data augmentation, jittering, scaling, and flipping are employed, but jittering performs better for imputation overall. All other samples in the mini-batch are treated 41 Figure 4.4 Long-term and random missing values imputation with the proposed self-supervised learning approach. For data augmentation, jittering, scaling, and flipping are used. as negative examples. Then, it passes the samples through an encoder-decoder model to generate model-based forecasts as follows: 𝑥 𝑀 ′ 𝑡 = 𝐷𝑒𝑐𝑜𝑑𝑒𝑟 (𝑒𝐻 𝑡 ) where 𝑒𝐻 𝑡 = 𝐸𝑛𝑐𝑜𝑑𝑒𝑟 (𝑥𝐻 𝑡 ) (4.4) The intermediate embedding (𝑒𝐻 𝑡 ) by the encoder is then projected using a projection head 𝑔(·) to get the projected values, 𝑐𝐻 𝑡 = 𝑔(𝑒𝐻 𝑡 ). The projected values should be similar for similar input samples. The projection head maps the intermediate embeddings to a lower-dimensional space, where the contrastive loss is calculated. The contrastive loss is a measure of similarity between the projected values of two similar samples. 
Therefore, the encoder part of the encoder-decoder model is used to learn embeddings of the historical observations, which are used in contrastive loss. The encoder-decoder model is trained to generate initial model-based forecasts (𝑥 𝑀 ′ 𝑡 ) from historical observations (𝑥𝐻 However, they may not be consistent with the true model-based forecasts 𝑥 𝑀 ′ 𝑡 ). These initial model-based forecasts do not have any missing values. . Thus, the initial 0 𝑡 model-based forecasts need to be calibrated with the available ground truth model-based forecasts. To ensure the initial model-based forecasts are robust against noise and to make the framework 42 capable of imputing random missing values, a denoising autoencoder (DAE) [86] is used. DAE is a type of autoencoder that is trained to reconstruct the original input from a corrupted version of the input. The goal is to learn a representation of the input that is robust to noise. A DAE can also be used for random missing value imputation. The idea is to treat the missing values as noise and train the DAE to reconstruct the original data without the missing values. For creating corrupting data, missing values need to be introduced as noise. To introduce missing values in the initially generated model-based forecasts, we consider two cases. If there are any ground truth model-based forecasts available with missing values, we introduce the same missing values for the generated model-based forecasts as well. In other cases (if there is not any ground truth available or there are not any missing values in the ground truth), we introduce missing values randomly. So, the corrupted model-based forecasts 𝑥 𝑀 ′ are generated from the initial model-based forecasts using 𝑜 𝑡 ground truth guided projection or random projection of missing values. Then, an autoencoder is used to reconstruct the original input from the corrupted data. Also, the intermediate embedding (𝑒𝑀 the encoder is projected using another projection head 𝑔′(·) to get the projected values 𝑐𝑀 𝑡 ) by 𝑡 = 𝑔′(𝑒𝑀 𝑡 ). Ideally, the projected values should be similar for similar input samples. Thereby, another contrastive loss can be used here. Finally, when the ground truth model-based forecasts 𝑥 𝑀0 are available, a root-mean-square error (RMSE) loss can be used to ensure the corrupted model-based forecasts 𝑥 𝑀 ′ is close to the ground truth. The output of the autoencoder is the final model-based forecasts (𝑥 𝑀 𝑓 0 𝑡 𝑡 ). 𝑡 4.3.2 Optimization of Self-supervised Learning Self-Recover trains the self-supervised imputation in two phases based on time period (a) and (b) as illustrated in Figure 4.2. It starts its training with the time period (b) as both historical and model-based forecasts are available. After several hundred epochs with time period (b) only, it starts simultaneous training with the both time periods (a) and (b). The optimization is carried out this way to take advantage of time period (b) that has the ground truth model-based forecasts. The model-based forecasts in time period (b) would facilitate the training initially with an additional RMSE loss. For the training batch with the time period (b), combines contrastive loss, reconstruction loss, 43 and RMSE loss. For contrastive learning, it adopts the NT-Xent (Normalized Temperature-scaled Cross Entropy) loss [87]. 
The contrastive losses for the historical ($\ell^H_{\text{contrast}}$) and model-based ($\ell^M_{\text{contrast}}$) representations are defined for a positive pair of examples $(i, j)$ as follows:

$$\ell^H_{\text{contrast}} = \sum_{i,j} -\log \frac{\exp\left(sim(c^H_{t,i}, c^H_{t,j})\right)}{\sum_{k=1}^{2N} \mathbb{1}_{k \ne i} \exp\left(sim(c^H_{t,i}, c^H_{t,k})\right)}, \qquad \ell^M_{\text{contrast}} = \sum_{i,j} -\log \frac{\exp\left(sim(c^M_{t,i}, c^M_{t,j})\right)}{\sum_{k=1}^{2N} \mathbb{1}_{k \ne i} \exp\left(sim(c^M_{t,i}, c^M_{t,k})\right)} \quad (4.5)$$

For the denoising autoencoder architecture, the reconstruction loss can be expressed as follows:

$$\ell^M_{\text{reconstruction}} = \frac{1}{n} \sum_{i=1}^{n} \left\| x^{M'}_{t,i} - x^{M_f}_{t,i} \right\|^2$$

where $x^{M'}_{t,i}$ is the initial generated model-based forecast and $x^{M_f}_{t,i}$ is the final model-based forecast for the $i$-th sample. The RMSE loss is defined between the ground truth $x^{M_0}_{t,i}$ and the projected model-based forecasts $x^{M'_0}_{t,i}$ as follows:

$$\ell^M_{\text{RMSE}} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( x^{M_0}_{t,i} - x^{M'_0}_{t,i} \right)^2 }$$

For a training batch from time period (b), the overall loss can be expressed as follows:

$$\mathcal{L}_{\text{SSL}} = \gamma_1 \, \ell^H_{\text{contrast}} + \gamma_2 \, \ell^M_{\text{contrast}} + \gamma_3 \, \ell^M_{\text{reconstruction}} + (1 - \gamma_1 - \gamma_2 - \gamma_3) \, \ell^M_{\text{RMSE}} \quad (4.6)$$

where $\gamma_1$, $\gamma_2$, and $\gamma_3$ are hyperparameters for the weights and $\gamma_4 = 1 - \gamma_1 - \gamma_2 - \gamma_3$ is the weight of the RMSE term. In summary, the self-supervised loss $\mathcal{L}_{\text{SSL}}$ is a weighted sum of the contrastive loss on historical observations, the contrastive loss on model-based forecasts, the reconstruction loss on model-based forecasts, and the RMSE on model-based forecasts. For a training batch from time period (a), $\gamma_4$ is set to 0 as there is no ground truth for time period (a).

4.3.3 Learning Temporal Representations using a Residual Learning Approach

After the self-supervised learning, all the long-term and random missing values are imputed. Self-Recover then learns the individual temporal representations ($h^H_t$ and $h^M_t$) of the historical predictors $x^H_t$ and the final model-based forecasts $x^{M_f}_t$ using stacked bi-directional LSTM networks followed by fully connected networks. The next step is to combine the two learned feature representations ($h^H_t$ and $h^M_t$) into a joint representation that can be used to predict the GEV distribution parameters. The simplest way to do this would be to concatenate them into a single representation, but this would increase the number of DNN parameters to be estimated from the data. Furthermore, the DNN can be susceptible to the vanishing gradient problem [88]. Alternatively, we may consider the learned representation from the historical predictors ($h^H_t$) as the basis for a residual learning approach. The learned representation from the model-based forecasts ($h^M_t$) can then be used to refine and correct the representation error in $h^H_t$:

$$h^{(H+M)}_t = h^H_t + \alpha \times h^M_t \quad (4.7)$$

where $\alpha$ is a scalar hyperparameter for the residual weight. In this work, we investigate the effectiveness of both approaches (simple concatenation versus residual learning) and compare their relative performance in terms of their effectiveness (forecast accuracy) and efficiency (convergence rate). As will be shown in our experiments, although the accuracy of both approaches is quite comparable, Self-Recover achieves faster convergence when using the residual approach compared to simple concatenation.

4.3.4 Incorporating GEV Distribution for Block Maxima Prediction

The combined representation $h^{(H+M)}_t$ in (4.7) is used as input to a fully connected network to estimate the GEV parameters $\mu$, $\sigma$, and $\xi$. Similar to DeepExtrema, Self-Recover must learn the GEV parameters in a way that preserves their inter-dependent constraints in (3.5).
To do this, the hard GEV constraints are reformulated as soft constraints as follows:

$$\forall i: \; 1 + \frac{\xi}{\sigma}(y_i - \mu) + \tau \ge 0. \quad (4.8)$$

The slack variable $\tau$ allows for minor violations of the second constraint in (3.5) as long as $1 + \frac{\xi}{\sigma}(y_i - \mu) > -\tau$ for all time windows $i$. The reformulation allows us to reparameterize the constraint as learning an upper bound ($\xi_u$) and a lower bound ($\xi_l$) on $\xi$:

$$\frac{\sigma}{\mu - y_{\min}}(1 + \tau) \;\ge\; \xi \;\ge\; -\frac{\sigma}{y_{\max} - \mu}(1 + \tau) \quad (4.9)$$

The fully connected network generates $\xi_u$ and $\xi_l$ as its output instead of just a single $\xi$ (along with $\mu$ and $\sigma$). A regularization penalty involving $(\xi_u - \xi_l)^2$ is also introduced into the loss function to ensure the estimated parameters satisfy $\xi_u \approx \xi_l \approx \xi$. The estimated GEV parameters are subsequently provided to the Model Bias Offset module to debias them in order to deal with issues arising from the random initialization of the deep neural networks. More details on the Model Bias Offset can be found in [81]. Finally, the debiased GEV parameters are passed to a fully connected layer to predict the block maxima ($\hat{y}_t$) for the time window. The debiased GEV parameters are also used to compute its upper and lower quantiles, $\hat{y}_U$ and $\hat{y}_L$, respectively, using the GEV quantile formula given in (3.3).

4.3.5 Optimization of Block Maxima Prediction

To learn the network parameters, Self-Recover integrates the regularized negative log-likelihood loss with the mean-square error function. This combination of loss functions allows the framework to find a GEV distribution that best fits the data while making an accurate point estimate of the block maxima. The combined loss function is given by:

$$\mathcal{L}_{\text{Block-maxima}} = \lambda_1 \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + (1 - \lambda_1) \left\{ -\lambda_2 \, \ell_{GEV}(\mu, \sigma, \xi) + (1 - \lambda_2) \sum_{i=1}^{n} (\xi_{u,i} - \xi_{l,i})^2 \right\} \quad (4.10)$$

$\mathcal{L}_{\text{Block-maxima}}$ combines the negative log-likelihood function of the GEV distribution and the difference between the upper and lower-bound estimates of $\xi$. Note that $\lambda_1$ and $\lambda_2$ are hyperparameters that manage the trade-off between the different components of the loss function. The Adam [89] optimizer is used during training. The trained network can be used to generate the block maxima prediction, $\hat{y}$, for any future input, along with its upper and lower bounds ($\hat{y}_L$, $\hat{y}_U$) as well as the GEV parameters ($\hat{\mu}$, $\hat{\sigma}$, $\hat{\xi}$).

4.4 Experimental Results

We performed extensive experiments to evaluate the performance of our Self-Recover framework. We consider the following two real-world datasets for our experiments.

Hurricane: We utilize the same HURDAT2 database [78] employed for DeepExtrema in Chapter 3. We created non-overlapping time windows of length 16 time steps (i.e., 4 days) from the 3,111 hurricanes. We use the first 8 time steps for the predictor variables and the remaining 8 time steps as the forecast window. We also extracted the corresponding model output forecasts for 396 hurricanes (between 2011 and 2020) from 21 statistical and dynamical models, such as CMC (Canadian Global Model Forecast), SHIP (SHIPS Model Intensity Forecast), etc., from the Hurricane Forecast Model Output website at the University of Wisconsin-Milwaukee.1 After preprocessing, we have 5,912 time windows, out of which 768 contain both historical observations and model-based forecasts.

Climate: We consider the daily maximum temperature data from the North American Regional Climate Change Assessment Program (NARCCAP) data archive2.
The historical observations contain daily maximum temperature data of 3 climate stations (Maple City, Hart, and Eau Claire) along the eastern shore of Lake Michigan for 7665 consecutive days (from 1978 to 1998). We then generate non-overlapping time windows of 14 days in length, with the first 7 days serving as the historical time window and the last 7 days serving as the target window for predicting its block maxima. The corresponding model-based forecasts are simulation data from four regional climate models (RCMs), namely, CRCM, WRFG, HRM3, and RCM3. These four RCM simulations are driven by the four GCMs - CCSM, CGCM3, GFDL, and HadCM3. The model-based simulation data consists of 8 predictor variables. For evaluating self-supervised missing values imputation, we completely remove model-based simulation data from 1978 to 1985 and randomly remove 5% data for each time window from 1986 to 1998. After the data preprocessing, there are 3285 time 1http://derecho.math.uwm.edu/models 2https://www.narccap.ucar.edu/data/index.html 47 windows in total. To demonstrate its effectiveness, we compare Self-Recover against the following baseline methods: (1) Persistence: This approach uses the block maxima from the historical time window as the block maxima for the forecast time window. (2) MF (Model-based Forecasts): This approach simply computes the block maxima of the model-based forecasts as its prediction. (3) FCN: This approach uses a fully connected network to predict the block maxima given a set of predictors. (4) LSTM: This approach employs a bi-directional stacked LSTM network followed by a fully connected network to predict the block maxima. (5) Transformer: This approach uses an attention-based transformer network to predict the block maxima. (6) InceptionTime[90]: This is a state-of-the-art CNN-based time series classification method. We replace the final softmax layer with a fully connected network for block maxima prediction. (7) DeepPIPE[32]: This is an uncertainty quantification-based approach to predict block maxima by using a hybrid loss function. (8) EVL[34]: This approach uses an ad-hoc EVT-based loss function to predict the block maxima. 4.4.1 Evaluation Setup For the evaluation of block maxima prediction, we split the data into separate training, validation, and testing sets with a ratio of 8:1:1. For evaluating self-supervised missing values imputation, we further split the training data of block maxima prediction into separate training, validation, and testing sets with a ratio of 8:1:1. We repeated our experiment 10 times, each time using a different random split. The data is standardized to have zero mean and unit variance. All the methods are trained by varying their DNN hyperparameters as follows: number of layers (2-6), number of nodes (8-128), learning rate (10−5, 10−4, 10−3, 10−2), and batch size (32, 64, 128, 256), while assessing their performance on the validation set. The best hyperparameters are chosen using Ray Tune, a tuning framework with an ASHA scheduler. The training and evaluation are carried out in an NVIDIA Tesla K80 GPU with 12GB RAM. We evaluate the performance of the proposed self-supervised missing values imputation on the corresponding test set using the following metrics: (1) Root mean squared error (RMSE) and (2) correlation between the imputed and ground truth values. For comparison, we also use a randomized 48 imputation technique based on the mean (𝜇) and standard deviation (𝜎) of the available data. 
The missing values are imputed by randomly drawing values from $\mathcal{N}(\mu, \sigma)$. For block maxima prediction, we evaluate the performance of the methods on the test set using: (1) RMSE and (2) correlation between the predicted and ground truth block maxima, as well as (3) accuracy and (4) F1 score for classifying extreme events. For the Hurricane dataset, we define extreme hurricane intensity as values that exceed either (1) 111 mph (i.e., category 3 and above hurricanes) or (2) 130 mph (i.e., category 4 and above hurricanes). For the Climate dataset, we define extreme temperature events as values that are above either the 80th or the 90th percentile of the temperature values at a location.

4.4.2 Results

                 Hurricane (Intensity)            Climate (Temperature)
Methods          RMSE           Correlation       RMSE           Correlation
Persistence      28.05          0.57              7.48           0.46
MF               17.12          0.80              3.94           0.58
FCN              16.62 +- 0.27  0.84 +- 0.04      3.81 +- 0.38   0.59 +- 0.06
LSTM             16.11 +- 0.24  0.85 +- 0.04      3.57 +- 0.34   0.62 +- 0.04
Transformer      15.31 +- 0.23  0.87 +- 0.04      3.10 +- 0.24   0.67 +- 0.03
Inception        15.51 +- 0.28  0.86 +- 0.03      3.07 +- 0.28   0.66 +- 0.04
DeepPIPE         17.02 +- 0.33  0.81 +- 0.04      3.76 +- 0.21   0.61 +- 0.05
EVL              15.61 +- 0.26  0.85 +- 0.04      3.28 +- 0.31   0.63 +- 0.05
DeepExtrema      15.86 +- 0.29  0.85 +- 0.03      3.49 +- 0.24   0.64 +- 0.05
Self-Recover     14.88 +- 0.22  0.90 +- 0.03      2.79 +- 0.26   0.71 +- 0.04

Table 4.1 Overall performance comparison in terms of RMSE and correlation of block maxima prediction.

Table 4.1 compares the forecasting performance of the various methods in terms of their RMSE and correlation. The results show that the proposed Self-Recover framework outperforms all the baselines on both the Hurricane and Climate datasets. To verify its effectiveness in terms of modeling extreme values, Figure 4.5 shows scatter plots of the actual versus predicted block maxima values for Self-Recover and other baseline methods (MF, EVL, Transformer, DeepPIPE, and Inception) on the Hurricane intensity data. The plots indicate that Self-Recover can accurately estimate the block maxima values of hurricane intensity, including the upper and lower extreme values. In contrast, the block maxima predictions generated by the other competing baselines have larger biases. For example, EVL tends to overestimate the block maxima values whereas Transformer, Inception, and DeepPIPE tend to underestimate the values, particularly at the upper extremes, which is not surprising since these methods do not incorporate EVT into their modeling approach.

Figure 4.5 Comparison of actual versus predicted block maxima for (a) Self-Recover, (b) MF, (c) EVL, (d) Transformer, (e) DeepPIPE, and (f) Inception on the Hurricane dataset.

                 Accuracy (F1 Score)
                 Hurricane (Intensity)                        Climate (Temperature)
Methods          Category 4 and above  Category 5 and above   Above 80th percentile  Above 90th percentile
MF               0.81 (0.80)           0.83 (0.84)            0.79 (0.80)            0.77 (0.75)
FCN              0.75 (0.78)           0.80 (0.82)            0.79 (0.77)            0.78 (0.79)
LSTM             0.81 (0.82)           0.82 (0.83)            0.80 (0.81)            0.79 (0.78)
Transformer      0.84 (0.85)           0.87 (0.88)            0.84 (0.82)            0.82 (0.83)
Inception        0.83 (0.82)           0.85 (0.86)            0.82 (0.84)            0.80 (0.81)
DeepPIPE         0.77 (0.80)           0.81 (0.82)            0.79 (0.78)            0.78 (0.80)
EVL              0.85 (0.79)           0.86 (0.81)            0.83 (0.81)            0.80 (0.77)
DeepExtrema      0.84 (0.85)           0.86 (0.87)            0.83 (0.85)            0.82 (0.82)
Self-Recover     0.89 (0.90)           0.93 (0.92)            0.89 (0.91)            0.86 (0.88)

Table 4.2 Performance comparison in terms of accuracy and F1 score for predicting extreme events only.
Finally, Table 4.2 compares the accuracy of each method in terms of classifying extremely high hurricane intensity and temperature events. Observe that Self-Recover outperforms all other baselines in terms of accuracy and F1 score.

Effect of Self-supervised Missing Values Imputation: Table 4.3 compares the missing values imputation performance of Self-Recover when using randomized imputation via a Gaussian distribution against the proposed self-supervised imputation approach. The results show that our self-supervised imputation approach outperforms random imputation in terms of RMSE and correlation.

                            Hurricane                        Climate
Methods                     RMSE           Correlation       RMSE           Correlation
Randomized Imputation       5.21 +- 0.31   0.70 +- 0.05      2.81 +- 0.28   0.73 +- 0.05
Self-supervised Imputation  4.51 +- 0.36   0.78 +- 0.04      2.26 +- 0.24   0.81 +- 0.03

Table 4.3 Comparing randomized missing values imputation via a Gaussian distribution against the proposed self-supervised missing values imputation in Self-Recover.

Effect of Residual Learning: Table 4.4 compares the performance of Self-Recover when using simple concatenation to combine its learned representations against the residual learning approach. While their accuracies are quite similar, the training runtime for the simple concatenation approach is significantly higher due to its slower convergence.

                 Hurricane                      Climate
Methods          RMSE           Runtime (s)    RMSE           Runtime (s)
Concatenation    14.95 +- 0.23  1503.4         2.86 +- 0.24   1012.7
Residual         14.88 +- 0.22  1215.9         2.79 +- 0.26   891.6

Table 4.4 Comparing concatenation against residual learning approaches for merging representations in Self-Recover.

4.4.3 Ablation Studies

We examine the effects of varying two hyperparameters of our algorithm, $\alpha$ and $\lambda_1$. The hyperparameter $\alpha$ determines the residual weight of the model-based representation when combined with the historical representation. As shown in Figure 4.6 (left), the RMSE generally increases with increasing values of $\alpha$. For both datasets, a smaller value of $\alpha$ results in higher performance, which validates our strategy of incorporating the model-based representation as a "residual" term to enhance the representation of historical observations.

Figure 4.6 Examining the effect of $\alpha$ (left) and $\lambda_1$ (right) on the RMSE of block maxima prediction for the Hurricane dataset with Self-Recover.

The trade-off between minimizing the point prediction error (RMSE loss) and preserving the GEV distribution (GEV loss) in Self-Recover is determined by the hyperparameter $\lambda_1$ in Equation (4.10). A smaller $\lambda_1$ places greater emphasis on the GEV loss, which focuses more on extreme values. This can be seen from the results shown in Figure 4.6 (right), where the RMSE of the block maxima prediction increases when $\lambda_1$ decreases. This suggests that the combined loss is more useful than using the RMSE loss alone, thus validating the need to incorporate the GEV distribution into the Self-Recover framework.

4.5 Conclusion

This work introduces Self-Recover, a novel deep learning framework for predicting the block maxima of time series by handling disparate temporal coverage of predictors. A combination of contrastive and generative self-supervised learning schemes followed by a DAE is proposed to impute long-term and random missing values of model-based forecasts. To learn and merge the representations from historical and model-based forecasts, we provide a residual technique and show that it is more efficient than the straightforward concatenation method.
Our experimental results on real-world data demonstrate the superiority of Self-Recover compared to other state-of-the-art methods. 52 CHAPTER 5 SIMEXT: SELF-SUPERVISED REPRESENTATION LEARNING FOR EXTREME VALUES IN TIME SERIES 5.1 Introduction Deep time series forecasting models are widely used to predict the future outcomes of complex processes that evolve over time as discussed in chapter 2. The accuracy and effectiveness of these models often depend on their ability to discern the underlying patterns of the data and utilize them to make inferences about the future evolution of the time series. One crucial aspect to consider in time series forecasting is the ability to predict extreme values, i.e., values that fall far outside the typical range of observations. Forecasting of extreme values is important in many application domains as extreme events may signify dire scenarios such as natural disasters, financial collapse, or public health crises. Block maxima or minima [10] are commonly used to define extreme values in a time series. Prediction of block maxima or minima is of paramount importance as it enables us to anticipate the worst-case scenario within the forecast period. For instance, predicting the maximum intensity of an upcoming hurricane or amplitude of seismic activity for a future time window can assist emergency planners in assessing its potential damages. Accurate forecasting of extreme values in time series is challenging for several reasons. First, the modeling of extreme values requires a specific focus on capturing the tail distribution [10], which is a major departure from conventional techniques that are often biased toward modeling the conditional mean. The problem is further exacerbated by the fact that the extreme values are rare, making them hard to predict even with a large amount of historical data available. Finally, the extreme values could be associated with certain peculiarities in the time series, such as abrupt changes, volatility clustering, persistent dependencies, etc [91, 92]. Advanced representation learning approach is therefore needed to learn the underlying patterns in the time series that can be utilized for extreme value forecasting. Self-supervised learning (SSL) [85, 84] is an emerging technique in machine learning to learn 53 useful feature representation from available data. One of the main advantages of self-supervised learning is that it can facilitate robust representation learning despite limited data. For example, self-supervised contrastive learning [85] is a specific approach that uses data augmentation to overcome the labeled data scarcity problem. It is based on the idea of comparing different augmented versions of the same input and learning representations that are robust to changes introduced by the augmentations. For time series, it can be used to learn representations that are invariant to time shifts, scaling, or warping [93], thereby improving the generalizability of the model to new data. Another advantage of using SSL is that it can help capture the complex patterns associated with the non-linear dependencies of the time series. For example, reconstruction-based self-supervised models such as auto-encoders can effectively capture non-linear relationships and subtle variations in the data. Though existing SSL approaches have shown promise in extracting meaningful representations from time series [85, 84], they are mostly focused on capturing the common patterns in the time series instead of its extreme values. 
Worse still, the perturbation introduced by existing data augmentation techniques may alter the extreme values of the time series, leading to a distorted representation of its tail distribution. Preserving the fidelity of the tail behavior during augmentation is crucial to ensure that the learned representations are robust to different scenarios yet able to capture the characteristics of extreme events. Therefore, it is important to develop data augmentation techniques that are aware of the extreme values when transforming the time series data for self-supervised learning. To address this problem, we propose an SSL framework called SimEXT to learn a useful representation that captures extreme values in time series. Our framework leverages the contrastive learning approach along with a reconstruction-based autoencoder architecture for time series modeling. A novel wavelet-based data augmentation technique is introduced to maintain the distribution of extreme values after data augmentation. Theoretical proofs are also given on the probabilistic guarantees provided by our wavelet-based augmentation to ensure that the block maxima (or minima) of the original time series are preserved by the augmented data. SimEXT uses the reconstruction loss between the original and reconstructed sample to promote robust representation 54 learning. In addition, we investigate two distribution loss functions to ensure the fidelity of the tail distribution is preserved—one based on the weighted distance of their empirical cumulative distribution functions (CDFs) while the other corresponds to the Cramér–von Mises distance of fitted Generalized Extreme Value (GEV) distributions. The learned representations are then applied to downstream tasks that focus on predicting future block maxima (or minima) of a time series. 5.2 Related Works Deep learning techniques such as LSTM networks [1, 3], convolutional neural networks (CNNs) [18, 17], attention models [22], and their variants have found success in various time series applications. Prior works on modeling extreme values with deep learning have explicitly incorporated either the Generalized Pareto (GP) distribution for modeling excess values over a threshold or the Generalized Extreme Value (GEV) distribution for modeling the block maxima (or minima) of a time series. For instance, [34] [37, 38] combined deep learning with the GP distribution for predictive modeling of spatiotemporal data using the maximum log-likelihood approach. [36] combined copula theory and normalizing flows with the GP distribution for generating multivariate extremes. [81] integrated the GEV distribution with deep learning formulation to predict block maxima in univariate time series. Self-supervised learning (SSL) has become increasingly popular in recent years due to its excellent performance on various benchmark datasets along with strong theoretical foundations [39, 40, 41]. In particular, it has been shown to outperform supervised learning methods in various instances [39, 42]. SSL for time series has been shown to be an effective method for representation learning and feature extraction, benefitting various downstream tasks such as forecasting, classification, and anomaly detection. One of the most popular SSL-based methods for time series is autoencoding, which aims to learn a compressed representation of the input data by reconstructing it from its latent features. 
For instance, [48] proposed an SSL framework that uses recurrent autoencoder ensembles to learn a compressed representation of the input data and generate future predictions. The proposed method achieved competitive performance compared to traditional methods. Contrastive learning is another well-known SSL method, which aims to learn a representation 55 that maximizes the similarity between positive examples and minimizes the similarity between negative examples. For example, [49] proposed a Contrastive Predictive Coding (CPC) framework that employs a contrastive objective to learn a representation that captures temporal dependencies in audio signals. Similarly, [50] proposed a Contrastive Multiview Coding (CMC) framework for time series that uses a contrastive objective to learn a representation capturing different perspectives of the input data. [52] applied intra-level contrastive learning to disentangle seasonal and trend representations from time series. [53] incorporated frequency information into contrastive learning while maintaining consistency between the time and frequency domains. [54] combined temporal and contextual contrasting for representation learning while [55] proposed augmentation techniques that not only generate phase shifts and amplitude changes but also retain the structure and feature information of the time series. Another class of SSL methods for time series is based on Generative Adversarial Networks (GANs), which can learn a generator model that produces realistic samples from the input data distribution. For example, [56] proposed an SSL framework for time series forecasting, which uses GANs to generate future predictions from the input data. Similarly, [57] proposed an SSL framework for time series anomaly detection, which uses GANs to generate normal samples from the input data distribution and detect anomalies based on the reconstruction error. Despite the extensive literature, to the best of our knowledge, none of the existing studies have focused on developing SSL for forecasting extreme values. 5.3 Preliminaries 5.3.1 Problem Statement Consider a time series of 𝑇 time steps, 𝑦1, 𝑦2, · · · , 𝑦𝑇 , where 𝑦𝑖 ∈ R. Assume the time series is divided into a set of non-overlapping time windows, where each window 𝑤𝑡 corresponds to the interval [𝑡 − 𝛼, 𝑡 + 𝛽] and contains a segment (𝑦𝑡−𝛼, · · · , 𝑦𝑡, · · · , 𝑦𝑡+𝛽) of the input time series. For each window 𝑤𝑡, we define its predictor window as the interval [𝑡 − 𝛼, 𝑡] and its forecast window as the interval [𝑡 + 1, 𝑡 + 𝛽]. Definition 1. Given a time window 𝑤𝑡, with its corresponding forecast window, [𝑡 + 1, 𝑡 + 𝛽], the 56 block maxima (or minima) of the time series for the forecast window is defined as follows: 𝑚𝑡 = max 𝜏∈[𝑡+1,𝑡+𝛽] 𝑦𝑡 or 𝜇𝑡 = min 𝜏∈[𝑡+1,𝑡+𝛽] 𝑦𝑡 Note that the block minima of a forecast window is equivalent to the block maxima of its flipped segment, (−𝑦𝑡+1, −𝑦𝑡+2, · · · , −𝑦𝑡+𝛽). For brevity, our discussion in the remainder of this chapter focuses on block maxima only, though the approach is equally applicable to block minima. Our first goal is to learn a robust feature representation of the time series in each predictor window, i.e., z𝑡 = ℎ𝜃 (x𝑡) ∈ R𝑑, where x𝑡 = (𝑦𝑡−𝛼, · · · , 𝑦𝑡) ∈ R𝛼+1 is the time series segment associated with the predictor window, ℎ𝜃 (·) is an encoder model that maps the input time series segment into its latent representation, and 𝜃 is the learnable parameters. 
Our second goal is to predict the block maxima, 𝑚̂𝑡, of the forecast window by learning a mapping function 𝑓𝜙(·) such that 𝑚̂𝑡 = 𝑓𝜙(x𝑡, z𝑡), where 𝜙 denotes the learnable parameters.

5.3.2 Contrastive Learning

Contrastive learning is an SSL approach that aims to learn the feature representation of data in such a way that similar instances are close to each other in the latent representation space while dissimilar instances are far apart. To apply contrastive learning, we first create two augmented views, denoted as x¹𝑡 and x²𝑡, respectively, of each input sample x𝑡. These augmented views are often generated by applying random transformation or augmentation to x𝑡. The choice of augmentation depends on the specific application and is typically designed to encourage robustness and invariance to various input perturbations. In contrastive learning, the terms "positive" and "negative" examples refer to the relationships between pairs of augmented views from the same input (positive) or augmented views of different inputs (negative). Given an augmented view x¹𝑡 from the input x𝑡, the corresponding positive example is denoted as x²𝑡, representing another augmented view of the same input x𝑡. Conversely, negative examples are augmented views x¹𝑡 and x²𝑡′, where x¹𝑡 is an augmented view of input x𝑡 while x²𝑡′ represents an augmented view of another input sample x𝑡′. Given a random minibatch of 𝑁 samples, {x𝑡₁, x𝑡₂, · · · , x𝑡ₙ}, let {x₁, x₂, · · · , x₂ₙ} be the set of derived pairs of augmented samples, where each (x₂ₖ₋₁, x₂ₖ) denotes a pair of augmented views generated from the sample x𝑡ₖ. Let sim(u, v) = u · v / (∥u∥ ∥v∥) represent the cosine similarity between the vectors u and v. The objective of contrastive learning is to minimize the following normalized temperature-scaled cross-entropy loss function, also known as the NT-Xent loss [87]:

$\ell(i, j) = -\log \frac{\exp\left[\mathrm{sim}\left(h(\mathbf{x}_i), h(\mathbf{x}_j)\right)/\tau\right]}{\sum_{k=1}^{2N} \mathbb{1}_{k \neq i} \exp\left[\mathrm{sim}\left(h(\mathbf{x}_i), h(\mathbf{x}_k)\right)/\tau\right]}$   (5.1)

where 1_{𝑘≠𝑖} is an indicator function whose value is equal to 1 if 𝑘 ≠ 𝑖, ℎ(·) denotes the representation learning function, and 𝜏 is the temperature hyperparameter. The final loss is computed across all positive pairs within the minibatch as follows [87]:

$\mathcal{L}_{\mathrm{contrastive}} = \frac{1}{2N} \sum_{k=1}^{N} \left[\ell(2k-1, 2k) + \ell(2k, 2k-1)\right]$   (5.2)

Minimizing the contrastive loss encourages the learned representation to maximize the similarity between positive pairs of instances and minimize the similarity between negative pairs.

5.3.3 Discrete Wavelet Transform

The wavelet transform is a powerful tool for analyzing non-stationary time series data with both high-frequency and low-frequency components. The discrete wavelet transform (DWT) decomposes a signal into a set of approximation (scaling) and detail (wavelet) coefficients. Through decomposition into detail and approximation coefficients, DWT enables us to perform a multi-resolution analysis of the input signal. The approximation coefficients represent low-frequency components, conveying information about the overall trend or shape of the signal, while the detail coefficients capture high-frequency components, revealing the noisy temporal variations of the signal.

Figure 5.1 The Discrete Wavelet Transform (DWT) employing the Cascading Filter Banks Algorithm with Haar filters, as denoted in Eq. 5.5. It is noteworthy that with the original sequence containing 8 data points, the number of approximation and detail coefficients are halved to 4 at Level 1.
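To make the decomposition concrete before the formal definitions that follow, the short sketch below applies a single-level Haar DWT to a toy 8-point sequence and prints its approximation and detail coefficients. It is only an illustrative sketch; the use of the PyWavelets package and the toy signal are assumptions, not details specified in this chapter.

```python
# Illustrative sketch (assumption: PyWavelets is available as `pywt`).
# A single-level Haar DWT of an 8-point signal yields 4 approximation
# coefficients (low-frequency trend) and 4 detail coefficients
# (high-frequency fluctuations), mirroring Figure 5.1.
import numpy as np
import pywt

x = np.array([3.1, 2.9, 3.4, 8.7, 3.2, 3.0, 2.8, 3.3])  # toy segment with one spike

approx, detail = pywt.dwt(x, "haar")   # level-1 cascading filter bank
print("approximation (c):", approx)    # 4 coefficients, overall shape of x
print("detail        (d):", detail)    # 4 coefficients, abrupt variations in x

# The signal is recovered exactly from both coefficient sets.
x_rec = pywt.idwt(approx, detail, "haar")
print("max reconstruction error:", np.max(np.abs(x - x_rec)))
```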
An input signal x𝑡 can be expressed as a linear combination of the scaling functions 𝜙(𝑡) and wavelet functions 𝜓(𝑡) using the detail and approximation coefficients as follows:

$\mathbf{x}_t = \sum_{k} c_j(k)\, \phi_{j,k}(t) + \sum_{j} \sum_{k} d_j(k)\, \psi_{j,k}(t)$   (5.3)

where 𝑗 and 𝑘 are the scale and dilation parameters respectively, 𝑐𝑗(𝑘) denotes the approximation (scaling) coefficient, and 𝑑𝑗(𝑘) denotes the detail (wavelet) coefficient. There are several popular wavelet families commonly used for DWT through their distinct scaling and wavelet functions. These wavelet families include Haar, Biorthogonal, Daubechies, etc. Using orthogonal scaling and wavelet functions, the coefficients 𝑐𝑗(𝑘) and 𝑑𝑗(𝑘) can be calculated by taking the inner product with the original signal x𝑡 as follows:

$c_j(k) = \langle \mathbf{x}_t, \phi_{j,k}(t) \rangle, \qquad d_j(k) = \langle \mathbf{x}_t, \psi_{j,k}(t) \rangle$   (5.4)

In most practical applications, it is important to note that the scaling and wavelet functions are not explicitly used to calculate the approximation and detail coefficients. Instead, the coefficients are obtained through the cascading filter banks algorithm, which allows for an efficient implementation of the DWT by employing a series of low-pass and high-pass filters. Specifically, the high-pass and low-pass filters can be applied in a cascading manner after signal extension to recursively compute the coefficients 𝑐𝑗(𝑘) and 𝑑𝑗(𝑘) according to the following equations:

$c_j(k) = \sum_{m} h_{\mathrm{lp}}(m - 2k)\, c_{j+1}(m), \qquad d_j(k) = \sum_{m} h_{\mathrm{hp}}(m - 2k)\, c_{j+1}(m)$   (5.5)

where $h_{\mathrm{lp}}(\cdot)$ and $h_{\mathrm{hp}}(\cdot)$ denote the low-pass and high-pass filters, respectively. Figure 5.1 depicts an example of how the cascading filter banks algorithm has been used to decompose a signal using Haar filters. The Haar scaling and wavelet filters are applied separately, followed by downsampling by a factor of 2 to obtain the detail and approximation coefficients. The selection of an appropriate wavelet function depends on the specific characteristics of the signal being analyzed and the application objectives. For instance, the Haar wavelet is often employed in scenarios where rapid computation is crucial. When using the Haar wavelet, the number of approximation and detail coefficients are halved at each level of decomposition, leading to a more compact representation of the signal. This reduction in coefficients can be advantageous when dealing with large-scale datasets or real-time applications where computational efficiency is paramount.

5.4 Methodology

Figure 5.2 An overview of the proposed SimEXT framework for time series forecasting of extreme values.

SimEXT is an SSL framework for representation learning of time series with an emphasis on preserving the fidelity of its tail distribution. An overview of the proposed framework is shown in Fig. 5.2. The framework combines a contrastive learning approach with a reconstruction-based autoencoder to generate a robust latent representation of its input time series. For contrastive learning, SimEXT uses a novel wavelet-based data augmentation technique to ensure that the augmented time series would maintain the block maxima (or minima) value of its original time series. Details of the data augmentation approach are given in Section 5.4.1. It also employs a distribution-based loss function to ensure that the learned representation better captures patterns characterizing the extreme values of the time series.
The learned representation can be utilized by any downstream forecasting models for block maxima prediction, as shown in the right part of Fig. 5.2.

5.4.1 Wavelet-based Data Augmentation

Figure 5.3 An overview of the wavelet-based data augmentation.

As mentioned in Section 5.3.2, contrastive learning requires performing data augmentation to generate conceptually similar samples by perturbing the original data. In principle, there are many ways to perform data augmentation for time series, which include jittering, flipping, shuffling, time warping, etc. [94]. Despite the wide variety of approaches available, choosing the right data augmentation approach is crucial for block maxima forecasting to ensure that the learned representation preserves the overall pattern of the time series without significantly altering its block maxima. To achieve this, the framework introduces a wavelet-based data augmentation technique. Before describing the approach, we first examine the effect of jittering on the block maxima of a time series.

Theorem 1. Let 𝑦1, 𝑦2, · · · , 𝑦𝑛 be a sequence, where 𝑦𝑖 ∈ R and 𝑀𝑛 = max𝑖 𝑦𝑖. Assuming 𝑀̂𝑛 = max𝑖 𝑦̂𝑖, where 𝑦̂𝑖 = 𝑦𝑖 + 𝜖𝑖 and 𝜖𝑖 ∼ N(0, 𝜎²), it can be shown that:

$\mathbb{E}_\epsilon\left[\hat{M}_n - M_n\right] \le \log n + \frac{\sigma^2}{2}$   (5.6)

Proof. First, observe that $\mathbb{E}_\epsilon[\hat{M}_n - M_n] = \mathbb{E}_\epsilon[\hat{M}_n] - M_n$. Since the exponential function is non-negative and monotonically increasing, the expected value of $\hat{M}_n$ can be expressed as follows:

$\mathbb{E}_\epsilon[\hat{M}_n] = \mathbb{E}_\epsilon\left[\max_i (\epsilon_i + y_i)\right] = \mathbb{E}_\epsilon\left[\max_i \left(\log e^{\epsilon_i + y_i}\right)\right] \le \mathbb{E}_\epsilon\left[\log \sum_{i=1}^{n} e^{\epsilon_i + y_i}\right]$

The inequality above can be further simplified using Jensen's inequality as follows:

$\mathbb{E}_\epsilon[\hat{M}_n] \le \log \mathbb{E}_\epsilon\left[\sum_{i=1}^{n} e^{\epsilon_i} e^{y_i}\right] = \log\left(\sum_{i=1}^{n} \mathbb{E}_\epsilon\left[e^{\epsilon_i}\right] e^{y_i}\right)$

Assuming $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$, it can be shown that

$\mathbb{E}\left[e^{\epsilon}\right] = \int_{-\infty}^{\infty} e^{\epsilon} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{\epsilon^2}{2\sigma^2}}\, d\epsilon = e^{\frac{\sigma^2}{2}}$

Replacing this expected value into the inequality above, we obtain the following:

$\mathbb{E}_\epsilon[\hat{M}_n] \le \log\left(\sum_{i=1}^{n} e^{\frac{\sigma^2}{2} + y_i}\right) \le \log\left(\sum_{i=1}^{n} e^{\frac{\sigma^2}{2} + M_n}\right) = \log n + \frac{\sigma^2}{2} + M_n$

The proof follows by subtracting $M_n$ from the expected value given above. □

Theorem 1 provides an upper bound on the expected value of the difference between the perturbed block maxima and the original block maxima of a time series of length 𝑛. Note that the bound is proportional to the variance of the noise as well as the number of perturbed data points. Thus, using a smaller variance during the jittering process, fewer perturbed data points, or both, can help avoid altering the block maxima value significantly.

Algorithm 5.1 summarizes the pseudocode of our proposed wavelet-based data augmentation approach.

Algorithm 5.1 Wavelet-based Data Augmentation Algorithm
Input: Time series predictor x𝑡 and noise variance 𝜎².
Output: Augmented time series pair, x¹𝑡 and x²𝑡.
  (𝑐_{𝑗0,𝑘}, 𝑑_{𝑗,𝑘}) ← DWT(x𝑡)
  for 𝑖 = 1 to 2 do
    𝛿ⁱ_{𝑗,𝑘} ← 𝑑_{𝑗,𝑘} + 𝜖𝑖, where 𝜖𝑖 ∼ N(0, 𝜎²)
    xⁱ𝑡 ← IDWT(𝑐_{𝑗0,𝑘}, 𝛿ⁱ_{𝑗,𝑘})
  end for
  return x¹𝑡 and x²𝑡

Given an input segment, x𝑡, we first employ DWT to decompose the time series into its approximation coefficients (𝑐_{𝑗0,𝑘}) and wavelet (detail) coefficients (𝑑_{𝑗,𝑘}). We apply the Daubechies 1 wavelet, also known as the Haar wavelet, in this study, but the methodology is applicable to other wavelet functions. As mentioned in Section 5.3.3 and shown in Figure 5.1, by employing the Haar wavelet, the number of detail and approximation coefficients will be half the length of the original time series.
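A minimal runnable sketch of Algorithm 5.1 is shown below, assuming the PyWavelets package and a single level of decomposition; as elaborated in the next paragraph, only the detail coefficients are jittered while the approximation coefficients are left untouched. The function name and toy segment are illustrative assumptions rather than details fixed by the chapter.

```python
# Minimal sketch of Algorithm 5.1 (assumption: PyWavelets as `pywt`).
# Only the detail (high-frequency) coefficients are jittered; the
# approximation coefficients, which carry the overall trend, are untouched.
import numpy as np
import pywt

def wavelet_augment_pair(x, sigma, wavelet="haar", seed=None):
    """Return two augmented views of the 1-D segment `x` (Algorithm 5.1 sketch)."""
    rng = np.random.default_rng(seed)
    approx, detail = pywt.dwt(np.asarray(x, dtype=float), wavelet)  # DWT(x_t)
    views = []
    for _ in range(2):
        jittered = detail + rng.normal(0.0, sigma, size=detail.shape)  # d + eps
        views.append(pywt.idwt(approx, jittered, wavelet))             # IDWT(c, d + eps)
    return views[0], views[1]

# Usage: the block maximum of each view stays close to that of the input
# when sigma is small, in line with the bound of Theorem 1.
x = np.sin(np.linspace(0, 6, 16)) + np.eye(16)[7] * 2.0   # segment with a spike
v1, v2 = wavelet_augment_pair(x, sigma=0.05, seed=0)
print(x.max(), v1.max(), v2.max())
```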
Since the approximation coefficients capture the general trend of the input signal while the detail coefficients capture the noisy, high-frequency components, we perform augmentation by applying jittering to the detail coefficients only. Specifically, each detail coefficient is perturbed by adding Gaussian noise with variance 𝜎². Since the number of detail coefficients is half the length of x𝑡 and the variance 𝜎² is kept low, Theorem 1 ensures that the block maxima of the augmented time series remain close to those of the input time series. Finally, we employ the inverse discrete wavelet transform (IDWT) to reconstruct the corresponding augmented time series using the approximation coefficients as well as the perturbed wavelet (detail) coefficients. In summary, the proposed wavelet-based data augmentation perturbs the detail coefficients only, without altering the approximation coefficients. This ensures that the overall temporal pattern and block maxima distribution are preserved, as only the high-frequency components, and essentially half of the data points in the signal, are perturbed.

5.4.2 Self-supervised Representation Learning

The wavelet-based data augmentation technique described in the previous section is used to create augmented pairs of similar samples for each predictor window x𝑡. Each augmented pair (x¹𝑡, x²𝑡) is passed to an autoencoder to generate its corresponding feature embedding (z¹𝑡, z²𝑡), capable of reconstructing the original sample:

$\mathbf{x}'^{\,i}_t = \mathrm{Decoder}(\mathbf{z}^i_t) \quad \text{where} \quad \mathbf{z}^i_t = \mathrm{Encoder}(\mathbf{x}^i_t)$   (5.7)

A reconstruction loss based on the squared Euclidean norm between the original and reconstructed samples after data augmentation is used to ensure that the learned representation preserves the important features of the original sample. The contrastive loss given by Eq. 5.2 is also calculated for all positive samples (z¹𝑡, z²𝑡) to ensure that the learned representations of similar augmented pairs are close to each other. In the context of the proposed framework, the combination of contrastive learning and a reconstruction-based autoencoder is used to facilitate robust representation learning. By jointly minimizing the reconstruction loss and contrastive loss, the framework is able to learn representations that are robust to minor perturbations of the original data.

5.4.3 Enforcing Fidelity of Tail Distribution

The reconstruction and contrastive losses described in the previous section are insufficient to ensure that the learned representation would preserve the fidelity of the block maxima (or minima) distribution. To overcome this limitation, we introduce a distribution loss into the objective function to emphasize learning features that consider extreme values of the time series. Let x𝑡 be the original input time series and x′𝑡 be the reconstructed time series. Their corresponding empirical cumulative distribution functions (ECDFs) are defined as

$F_x(z) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\left[x_{(i)} \le z\right], \qquad F_{x'}(z) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\left[x'_{(i)} \le z\right]$   (5.8)

where 𝑛 is the sample size, 𝑥₍ᵢ₎ is the 𝑖-th largest observation in x, and 1[𝑥₍ᵢ₎ ≤ 𝑧] is an indicator function whose value is 1 if 𝑥₍ᵢ₎ ≤ 𝑧 and 0 otherwise. We consider two approaches for measuring the distribution loss.

5.4.3.1 Tail-Weighted Distance (TWD) between Empirical Cumulative Distribution Functions (ECDFs)

Figure 5.4 Tail-Weighted Distance (TWD) between Empirical Cumulative Distribution Functions (ECDFs).

Figure 5.4 shows an overview of the proposed TWD approach.
The loss is motivated by the Kolmogorov–Smirnov (KS) distance [95] for determining whether a sample is drawn from a particular distribution: $D_{KS}(F_{x'}, F_x) = \sup_z |F_{x'}(z) - F_x(z)|$. Specifically, the KS distance between two ECDFs is given by the largest absolute deviation at any given position. However, the KS distance does not specifically assess whether the extreme values of the distribution are well preserved, and its estimates at the tails have high variance unless 𝑛 is sufficiently large. Instead, we introduce the following tail-weighted distance to put more emphasis on the extreme values:

$\mathrm{TWD} = \frac{1}{n}\sum_{j=1}^{n} \gamma(p_j) \cdot \left|F_{x'}(x_j) - F_x(x_j)\right|$   (5.9)

where $F_x$ and $F_{x'}$ are the ECDFs of the original and reconstructed time series respectively, $p_j = F_x(x_j)$ is the percentile of the 𝑗-th observation, and $\gamma(p_j) = p_j^2$ is a tail-weighted function of the percentile $p_j$, whose value grows quadratically with increasing $p_j$. This biases the TWD to put more emphasis on the right tail of the distribution. Alternatively, to put a large emphasis on both upper and lower tails, the weight function could be defined as $\gamma(p_j) = 2(p_j - 0.5)^2$.

5.4.3.2 Cramér–von Mises Distance (CMD) between the ECDFs of GEV Distribution for Block Maxima

In this approach, we employ the Cramér–von Mises (CM) distance [96, 97] to measure the deviation between the GEV distributions of the block maxima values in the original and reconstructed time series. Let $m = \max_{x_i \in \mathbf{x}} x_i$ be the block maxima of the original series and $m' = \max_{x_i \in \mathbf{x}'} x_i$ be the block maxima of the reconstructed series. Given a set of block maxima values generated from the predictor windows of the training data, we use the maximum likelihood approach, given in Eq. 3.4, to estimate the GEV parameters of the block maxima values. Let $G_x$ and $G_{x'}$ be the ECDFs of the fitted GEV distributions of block maxima values associated with the original and reconstructed time series, respectively. We then compute the distribution loss using the Cramér–von Mises distance [96, 97] between the ECDFs of the two GEV distributions as follows:

$\mathrm{CMD} = \sqrt{\sum_{i=1}^{n} \left|G_{x'}(z_i) - G_x(z_i)\right|}$   (5.10)

where $\{z_1, z_2, \cdots, z_n\}$ are the samples drawn from the fitted GEV distribution of the original time series and 𝑛 is the number of samples drawn.

5.4.4 Downstream Task

The utilization of learned representations derived from self-supervised learning holds promise in enhancing the performance of downstream extreme value prediction tasks. This application is illustrated in the right part of Figure 5.2. Specifically, an input time series x𝑡 is presented to the trained SimEXT model to generate its feature embedding, which can then be combined with the feature representation from a downstream model to predict the block maxima value associated with its forecast window. The integration of the SimEXT model's feature embedding with the feature representation from a downstream model creates a synergistic relationship, where the combined knowledge and insights contribute to more accurate and robust predictions of block maxima values.

5.4.5 Optimization

The SimEXT framework is trained to minimize the following loss function:

$\mathcal{L} = \lambda_1 \mathcal{L}_{\mathrm{contrastive}} + \lambda_2 \mathcal{L}_{\mathrm{recon}} + \lambda_3 \mathcal{L}_{\mathrm{dist}}$   (5.11)

where $\mathcal{L}_{\mathrm{contrastive}}$ is the contrastive loss given in Eq. 5.2, $\mathcal{L}_{\mathrm{recon}} = \sum_t \|\mathbf{x}_t - \mathbf{x}'_t\|_2^2$ is the reconstruction loss of the autoencoder, and $\mathcal{L}_{\mathrm{dist}}$ is either the TWD or CMD distribution loss described in Section 5.4.3. Note that $\lambda_1$, $\lambda_2$, and $\lambda_3$ are hyperparameters that manage the trade-off between the different components of the loss function.
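For reference, the sketch below assembles the three terms of Eq. 5.11 in plain NumPy, using the NT-Xent loss of Eq. 5.2 as the contrastive term and the tail-weighted distance of Eq. 5.9 as the distribution loss. The function names and this standalone NumPy formulation are assumptions made for illustration rather than the implementation used in SimEXT; the embedding matrix z is assumed to stack the 2N augmented views so that consecutive rows form positive pairs.

```python
# Illustrative NumPy sketch of the SimEXT objective (Eq. 5.11); the function
# names and this standalone formulation are assumptions for exposition only.
import numpy as np

def nt_xent(z, tau=0.5):
    """NT-Xent loss (Eq. 5.1-5.2); z has shape (2N, d), rows 2k and 2k+1 are a positive pair."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity via dot products
    sim = np.exp(z @ z.T / tau)
    np.fill_diagonal(sim, 0.0)                          # enforces the k != i indicator
    two_n = z.shape[0]
    pos = np.arange(two_n) ^ 1                          # index of each row's positive view
    losses = -np.log(sim[np.arange(two_n), pos] / sim.sum(axis=1))
    return losses.mean()                                # equals (1/2N) sum of l(i,j) + l(j,i)

def tail_weighted_distance(x, x_rec):
    """Tail-weighted ECDF distance (Eq. 5.9) with gamma(p) = p^2."""
    n = len(x)
    grid = np.sort(x)                                   # evaluate both ECDFs at the observations
    F_x = np.searchsorted(np.sort(x), grid, side="right") / n
    F_r = np.searchsorted(np.sort(x_rec), grid, side="right") / n
    return np.mean((F_x ** 2) * np.abs(F_r - F_x))      # p_j = F_x(x_j), gamma(p_j) = p_j^2

def simext_loss(z, x, x_rec, lambdas=(1.0, 1.0, 1.0)):
    """Total loss of Eq. 5.11 with the TWD distribution loss; x, x_rec: (batch, T)."""
    l_con = nt_xent(z)
    l_rec = np.mean(np.sum((x - x_rec) ** 2, axis=1))   # squared Euclidean reconstruction
    l_dist = np.mean([tail_weighted_distance(a, b) for a, b in zip(x, x_rec)])
    l1, l2, l3 = lambdas
    return l1 * l_con + l2 * l_rec + l3 * l_dist
```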
The Adam optimizer [89] is used during training. The trained network can then be used to generate the latent representation of a given input time series.

5.5 Experimental Evaluation

We have performed extensive experiments to evaluate the performance of our SimEXT framework.

5.5.1 Datasets

We consider the following three datasets for our experiments.

Hurricane. We utilize the same HURDAT2 database [78] employed for DeepExtrema in Chapter 3. For each hurricane, we created non-overlapping time windows of length 24 time steps (6 days). We use the first 16 time steps (4 days) as the predictor window and the last 8 time steps (2 days) as the forecast window.

Climate. We use the same daily maximum temperature data from the North American Regional Climate Change Assessment Program (NARCCAP) data archive (https://www.narccap.ucar.edu/data/index.html) employed for DeepExtrema in Chapter 3. We generate non-overlapping time windows of 14 days in length, with the first 7 days serving as the predictor time window and the last 7 days serving as the target window for predicting its block maxima. After the data preprocessing, there are 3285 time windows in total.

ECL. This dataset comprises the hourly electricity consumption of 321 clients obtained from the UCI repository (https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014). We partition the time series into non-overlapping time windows of 14-day duration. The initial 7-day period is used as the predictor time window, while the remaining 7-day interval defines the forecast window. After preprocessing, the cumulative number of time windows is equal to 5638.

5.5.2 Baseline Methods

In order to demonstrate the efficacy of our proposed self-supervised representation learning framework, SimEXT, we conduct a comparative analysis against the following state-of-the-art methods in time series representation learning:

1. CoST [52]: CoST focuses on acquiring disentangled seasonal and trend representations from time series data. It explores both the time domain and frequency domain concurrently by utilizing the discrete Fourier transform. Additionally, CoST incorporates intra-level contrastive learning to enhance the quality of learned representations.

2. TS2Vec [51]: TS2Vec adopts a hierarchical contrastive learning strategy, leveraging augmented context views to capture multi-scale contextual information. It introduces the concept of contextual consistency for the selection of positive pairs, thereby improving the discriminative power of learned representations.

3. TNC [98]: TNC employs a debiased contrastive objective to learn time series representations. It ensures that, in the encoding space, the distribution of signals within a neighborhood is distinguishable from the distribution of signals that are not neighboring.

4. TimeCLR [55]: This approach uses dynamic time warping for augmentation. This augmentation method not only considers phase shifts and amplitude changes but also preserves the underlying structure and feature information of the time series. TimeCLR further combines the augmentation technique with InceptionTime, a powerful deep architecture for time series prediction.

To evaluate the robustness of the learned representations generated by our proposed SimEXT framework, we conducted experiments using several downstream models for block maxima prediction. Specifically, we employ the following downstream models: (1) LSTM, which is a bi-directional stacked Long Short-Term Memory network, coupled with a fully connected network.
(2) Transformer, which is an attention-based transformer network. (3) Informer [5], which is a modified transformer-based framework. (4) EVL [34], which is a forecasting model that incorporates extreme value theory into its loss function. (5) DeepExtrema [81], which applies the Generalized Extreme Value (GEV) distribution for block maxima prediction. The efficacy of the features generated by our proposed SimEXT framework is demonstrated by comparing them under the following experimental settings: (1) Original Features: In this setting, we directly feed the original features of the time series into the different downstream forecasting models, without additional transformations. (2) Autoencoder only (AE): This approach utilizes only the autoencoder to reconstruct the original sample and learn the representation of the time series. (3) Autoencoder + Contrastive Learning (AE + CL): This setting incorporates both the autoencoder and contrastive learning modules to jointly learn the representation of the time series. (4) Autoencoder + Contrastive Learning + TWD (AE + CL + TWD): In this configuration, we employ the Tail-Weighted Distance (TWD) between empirical cumulative distribution functions (ECDFs) as the distribution loss to further enhance the representation learning of the time series. (5) Autoencoder + Contrastive Learning + CMD (AE + CL + CMD): This approach incorporates the Cramér–von Mises Distance (CMD) between the ECDFs of the GEV distributions for measuring the distribution loss, enabling improved representation learning of the time series.

Method | Climate RMSE | Climate Corr | Climate F1 | Hurricane RMSE | Hurricane Corr | Hurricane F1 | ECL RMSE | ECL Corr | ECL F1
CoST [52] | 3.08 ± 0.27 | 0.80 ± 0.03 | 0.85 ± 0.02 | 15.12 ± 0.25 | 0.87 ± 0.04 | 0.84 ± 0.04 | 0.98 ± 0.05 | 0.79 ± 0.03 | 0.82 ± 0.03
TS2Vec [51] | 2.99 ± 0.28 | 0.81 ± 0.03 | 0.86 ± 0.01 | 15.03 ± 0.26 | 0.88 ± 0.03 | 0.86 ± 0.03 | 0.95 ± 0.03 | 0.78 ± 0.03 | 0.80 ± 0.04
TNC [98] | 3.05 ± 0.23 | 0.80 ± 0.02 | 0.84 ± 0.03 | 15.06 ± 0.28 | 0.89 ± 0.02 | 0.85 ± 0.03 | 0.94 ± 0.04 | 0.78 ± 0.04 | 0.82 ± 0.04
TimeCLR [55] | 3.18 ± 0.28 | 0.79 ± 0.03 | 0.82 ± 0.03 | 15.22 ± 0.30 | 0.85 ± 0.03 | 0.83 ± 0.04 | 1.04 ± 0.07 | 0.76 ± 0.03 | 0.81 ± 0.03
SimEXT (TWD) | 2.82 ± 0.25 | 0.82 ± 0.02 | 0.87 ± 0.02 | 14.86 ± 0.22 | 0.90 ± 0.03 | 0.88 ± 0.04 | 0.88 ± 0.05 | 0.81 ± 0.03 | 0.85 ± 0.03
SimEXT (CMD) | 2.84 ± 0.28 | 0.80 ± 0.03 | 0.88 ± 0.03 | 14.82 ± 0.24 | 0.89 ± 0.03 | 0.90 ± 0.02 | 0.90 ± 0.04 | 0.80 ± 0.04 | 0.84 ± 0.04

Table 5.1 Performance comparison against state-of-the-art time series representation learning models for block maxima forecasting and classification. RMSE and correlation are calculated for block maxima prediction. The F1 score is measured for classification of extreme events (Category 4 and above for Hurricane; above the 90th percentile for Climate and ECL data).

5.5.3 Experiment Settings

We partitioned each dataset into training, validation, and test sets according to an 8:1:1 ratio. We repeated the experiments 5 times using different random partitionings of the data. Prior to applying the various algorithms, the time series data is standardized to have zero mean and unit variance. The encoder and decoder components of our framework employ a 4-layer bidirectional Long Short-Term Memory (LSTM) architecture, accompanied by fully connected layers. The training was facilitated using the Adam optimizer.
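For concreteness, a minimal PyTorch sketch of an encoder and decoder in the spirit of this setup (bidirectional LSTM layers followed by fully connected projections) is given below; the layer sizes, module names, and wiring are illustrative assumptions and not the exact architecture used in the experiments.

```python
# Minimal sketch (assumptions: PyTorch; layer sizes and names are illustrative only).
import torch
import torch.nn as nn

class LSTMEncoder(nn.Module):
    """Bidirectional LSTM encoder producing a fixed-length embedding z_t."""
    def __init__(self, input_dim=1, hidden_dim=64, num_layers=4, embed_dim=32):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, embed_dim)

    def forward(self, x):                      # x: (batch, seq_len, input_dim)
        out, _ = self.lstm(x)
        return self.fc(out[:, -1, :])          # embedding taken from the last time step

class LSTMDecoder(nn.Module):
    """Reconstructs the input window from the embedding."""
    def __init__(self, embed_dim=32, hidden_dim=64, seq_len=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, seq_len))

    def forward(self, z):                      # z: (batch, embed_dim)
        return self.net(z)                     # (batch, seq_len) reconstruction

# Usage on a batch of predictor windows of length 16 (e.g., the Hurricane setup).
enc, dec = LSTMEncoder(), LSTMDecoder()
x = torch.randn(8, 16, 1)
z = enc(x)
recon_loss = torch.mean((dec(z) - x.squeeze(-1)) ** 2)
```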
For all the methods, we perform extensive hyperparameter tuning on the length of the embedding vector, the number of hidden layers (ranging from 2 to 6), the number of nodes (ranging from 8 to 128), the learning rate (ranging from $10^{-5}$ to $10^{-2}$), and the batch size (ranging from 32 to 256). The optimal hyperparameters were determined using the Ray Tune framework, integrating an Asynchronous Successive Halving Algorithm (ASHA) scheduler to enable early stopping. All experiments were conducted on a single NVIDIA T4 GPU. To evaluate the performance of the proposed framework, we employ the following evaluation metrics: (1) Root Mean Squared Error (RMSE): This metric quantifies the root mean squared error between the predicted block maxima and the ground truth block maxima within the forecast window. (2) Correlation: The correlation between the predicted and ground truth block maxima within the forecast window serves as a measure of their alignment. (3) F1 score: This metric gauges the accuracy of extreme event detection within the forecast window. For the Hurricane dataset, an extreme event is defined as a hurricane intensity value surpassing 130 mph (i.e., Category 4 and above hurricanes). For the Climate and ECL datasets, extreme events are identified as temperature or electricity consumption values exceeding the 90th percentile of the values at a specific location.

Method | Configuration | Climate RMSE | Climate Corr | Climate F1 | Hurricane RMSE | Hurricane Corr | Hurricane F1 | ECL RMSE | ECL Corr | ECL F1
LSTM | Original Features | 3.22 ± 0.25 | 0.73 ± 0.04 | 0.74 ± 0.04 | 15.51 ± 0.35 | 0.76 ± 0.35 | 0.78 ± 0.03 | 1.22 ± 0.09 | 0.74 ± 0.05 | 0.76 ± 0.03
LSTM | AE | 3.18 ± 0.26 | 0.72 ± 0.05 | 0.75 ± 0.05 | 15.55 ± 0.37 | 0.78 ± 0.05 | 0.79 ± 0.04 | 1.12 ± 0.08 | 0.76 ± 0.03 | 0.77 ± 0.03
LSTM | AE + CL | 3.11 ± 0.24 | 0.75 ± 0.04 | 0.78 ± 0.03 | 15.34 ± 0.32 | 0.81 ± 0.03 | 0.82 ± 0.03 | 1.06 ± 0.05 | 0.76 ± 0.02 | 0.79 ± 0.02
LSTM | AE + CL + TWD | 3.02 ± 0.25 | 0.77 ± 0.02 | 0.78 ± 0.03 | 15.12 ± 0.26 | 0.82 ± 0.04 | 0.82 ± 0.03 | 1.00 ± 0.06 | 0.78 ± 0.03 | 0.80 ± 0.02
LSTM | AE + CL + CMD | 2.97 ± 0.24 | 0.76 ± 0.03 | 0.79 ± 0.04 | 15.09 ± 0.24 | 0.82 ± 0.03 | 0.84 ± 0.02 | 0.99 ± 0.07 | 0.79 ± 0.03 | 0.79 ± 0.03
Informer | Original Features | 3.19 ± 0.23 | 0.73 ± 0.04 | 0.78 ± 0.05 | 15.33 ± 0.32 | 0.79 ± 0.04 | 0.82 ± 0.03 | 1.11 ± 0.08 | 0.76 ± 0.03 | 0.77 ± 0.03
Informer | AE | 3.15 ± 0.21 | 0.74 ± 0.05 | 0.79 ± 0.05 | 15.30 ± 0.29 | 0.81 ± 0.03 | 0.81 ± 0.03 | 1.05 ± 0.05 | 0.75 ± 0.03 | 0.77 ± 0.04
Informer | AE + CL | 3.08 ± 0.23 | 0.77 ± 0.03 | 0.82 ± 0.04 | 15.08 ± 0.24 | 0.86 ± 0.04 | 0.84 ± 0.04 | 0.99 ± 0.07 | 0.77 ± 0.04 | 0.80 ± 0.03
Informer | AE + CL + TWD | 2.97 ± 0.25 | 0.81 ± 0.02 | 0.83 ± 0.03 | 14.94 ± 0.23 | 0.86 ± 0.04 | 0.86 ± 0.02 | 0.96 ± 0.05 | 0.79 ± 0.03 | 0.81 ± 0.03
Informer | AE + CL + CMD | 2.99 ± 0.26 | 0.78 ± 0.03 | 0.85 ± 0.04 | 14.97 ± 0.25 | 0.87 ± 0.04 | 0.89 ± 0.02 | 0.98 ± 0.06 | 0.80 ± 0.03 | 0.82 ± 0.03
Transformer | Original Features | 3.21 ± 0.26 | 0.72 ± 0.05 | 0.78 ± 0.05 | 15.41 ± 0.36 | 0.78 ± 0.03 | 0.81 ± 0.04 | 1.14 ± 0.07 | 0.75 ± 0.04 | 0.75 ± 0.04
Transformer | AE | 3.12 ± 0.23 | 0.73 ± 0.06 | 0.77 ± 0.06 | 15.36 ± 0.35 | 0.80 ± 0.04 | 0.82 ± 0.03 | 1.04 ± 0.06 | 0.77 ± 0.02 | 0.78 ± 0.03
Transformer | AE + CL | 3.03 ± 0.25 | 0.78 ± 0.04 | 0.83 ± 0.05 | 15.13 ± 0.29 | 0.87 ± 0.05 | 0.85 ± 0.04 | 1.01 ± 0.07 | 0.77 ± 0.03 | 0.79 ± 0.02
Transformer | AE + CL + TWD | 2.94 ± 0.29 | 0.80 ± 0.03 | 0.86 ± 0.03 | 14.98 ± 0.25 | 0.88 ± 0.03 | 0.87 ± 0.03 | 0.99 ± 0.06 | 0.80 ± 0.02 | 0.81 ± 0.04
Transformer | AE + CL + CMD | 2.91 ± 0.26 | 0.79 ± 0.04 | 0.85 ± 0.04 | 14.95 ± 0.28 | 0.86 ± 0.04 | 0.89 ± 0.02 | 0.95 ± 0.04 | 0.80 ± 0.01 | 0.83 ± 0.02
EVL | Original Features | 3.28 ± 0.27 | 0.72 ± 0.03 | 0.76 ± 0.05 | 15.44 ± 0.30 | 0.78 ± 0.04 | 0.78 ± 0.03 | 1.09 ± 0.07 | 0.77 ± 0.03 | 0.74 ± 0.05
EVL | AE | 3.21 ± 0.25 | 0.74 ± 0.04 | 0.78 ± 0.04 | 15.40 ± 0.32 | 0.80 ± 0.05 | 0.78 ± 0.03 | 1.04 ± 0.05 | 0.77 ± 0.02 | 0.77 ± 0.02
EVL | AE + CL | 3.18 ± 0.27 | 0.76 ± 0.03 | 0.82 ± 0.03 | 15.21 ± 0.28 | 0.80 ± 0.03 | 0.81 ± 0.03 | 1.02 ± 0.05 | 0.78 ± 0.03 | 0.78 ± 0.03
EVL | AE + CL + TWD | 3.07 ± 0.24 | 0.79 ± 0.02 | 0.84 ± 0.03 | 15.10 ± 0.21 | 0.84 ± 0.04 | 0.84 ± 0.04 | 0.98 ± 0.07 | 0.80 ± 0.02 | 0.82 ± 0.03
EVL | AE + CL + CMD | 3.05 ± 0.26 | 0.79 ± 0.03 | 0.83 ± 0.04 | 15.07 ± 0.24 | 0.83 ± 0.03 | 0.84 ± 0.02 | 0.99 ± 0.05 | 0.79 ± 0.02 | 0.81 ± 0.04
DeepExtrema | Original Features | 3.10 ± 0.17 | 0.73 ± 0.05 | 0.78 ± 0.04 | 15.18 ± 0.26 | 0.83 ± 0.04 | 0.84 ± 0.04 | 1.02 ± 0.07 | 0.77 ± 0.04 | 0.80 ± 0.03
DeepExtrema | AE | 3.02 ± 0.19 | 0.75 ± 0.04 | 0.79 ± 0.05 | 15.21 ± 0.28 | 0.82 ± 0.03 | 0.84 ± 0.03 | 0.98 ± 0.06 | 0.79 ± 0.02 | 0.82 ± 0.02
DeepExtrema | AE + CL | 2.92 ± 0.21 | 0.78 ± 0.05 | 0.84 ± 0.04 | 15.04 ± 0.25 | 0.88 ± 0.04 | 0.88 ± 0.03 | 0.95 ± 0.06 | 0.78 ± 0.03 | 0.83 ± 0.03
DeepExtrema | AE + CL + TWD | 2.79 ± 0.24 | 0.83 ± 0.02 | 0.88 ± 0.02 | 14.81 ± 0.23 | 0.90 ± 0.02 | 0.88 ± 0.03 | 0.86 ± 0.06 | 0.83 ± 0.02 | 0.85 ± 0.03
DeepExtrema | AE + CL + CMD | 2.81 ± 0.26 | 0.81 ± 0.04 | 0.89 ± 0.02 | 14.78 ± 0.25 | 0.89 ± 0.02 | 0.89 ± 0.02 | 0.89 ± 0.05 | 0.82 ± 0.03 | 0.86 ± 0.03

Table 5.2 Performance comparison of downstream models under different experiment settings for block maxima forecasting and classification. RMSE and correlation are calculated for block maxima prediction. The F1 score is measured for classification of extreme events (Category 4 and above for Hurricane; above the 90th percentile for Climate and ECL data).

5.5.4 Experimental Results

Table 5.1 provides a comprehensive comparison of the performance of SimEXT against other state-of-the-art representation learning methods. The results suggest that the proposed SimEXT framework, which incorporates distribution losses (TWD and CMD), outperforms all the baseline methods, namely CoST [52], TS2Vec [51], TNC [98], and TimeCLR [55], in terms of RMSE, correlation, and F1 score. To evaluate the significance of the observed performance improvement, a paired t-test was conducted to measure the significance of the difference in RMSE between SimEXT and the best-performing baseline method across 10 iterations for each dataset. The conventional threshold for statistical significance was set at p < 0.05. The results show that the p-value for RMSE is 0.0211 for the Climate dataset, 0.0024 for the Hurricane dataset, and 0.03379 for the ECL dataset.

Figure 5.5 Probability distribution comparison of ground truth, SimEXT, and TS2Vec for block maxima forecasting with hurricane data.

Furthermore, Figure 5.5 provides a visual representation of the probability distribution comparison between the ground truth, SimEXT, and TS2Vec for block maxima forecasting using the Hurricane dataset. The figure demonstrates that while the proposed SimEXT method does not accurately capture the ground truth distribution for the tropical depression and tropical storm categories, it significantly outperforms TS2Vec in matching the ground truth distribution for Category 1-5 hurricanes. From a practical perspective, the accurate forecasting of extreme events, particularly Category 1-5 hurricanes, holds far greater importance. These findings serve as compelling evidence supporting the superior performance of SimEXT in capturing tail distributions in time series while simultaneously learning representations. Notably, among the baseline methods, TS2Vec stands out as the top performer, as this approach proves to be more suitable for time series data with diverse distributions and scales.
Moreover, it is important to consider that extreme values in this context exhibit atypical distributions due to the influence of the GEV shape parameter, which governs their diverse tail shapes.

Table 5.2 presents the results of applying SimEXT to five downstream models (LSTM, Informer, Transformer, EVL, and DeepExtrema) for block maxima forecasting. The table showcases the performance of SimEXT under five different settings: Original Features, AE, AE + CL, AE + CL + TWD, and AE + CL + CMD. Consistently, the results demonstrate that the incorporation of distribution losses (AE + CL + TWD and AE + CL + CMD) significantly enhances the performance of block maxima prediction for all downstream models. This finding underscores the effectiveness of the proposed distribution losses in capturing extreme values and improving forecasting accuracy. Moreover, the results reveal that the DeepExtrema downstream model, when used with the SimEXT representation learning approach, achieves the best performance, exhibiting the lowest RMSE and the highest correlation and F1 score. The combined evidence from Tables 5.1 and 5.2 supports the superiority of SimEXT in capturing extreme values and enhancing representation learning for time series data. By incorporating distribution losses, SimEXT outperforms state-of-the-art methods. Additionally, when applied to downstream models for block maxima forecasting, SimEXT consistently improves performance across various settings.

5.5.5 Ablation Study

Figure 5.6 RMSE comparison of the different modules of SimEXT under different experimental settings for block maxima forecasting.

In this study, we investigate the effect of gradually incorporating the AE, CL, and TWD/CMD modules into the proposed SimEXT framework. Figure 5.6 shows the RMSE values obtained when using different modules of the SimEXT framework. The results suggest that incorporating CL (contrastive learning) alone yields more substantial benefits compared to incorporating only AE (autoencoder) for all datasets used. This finding aligns with the current understanding that contrastive learning is generally more beneficial in learning meaningful representations compared to autoencoder-based methods. Nevertheless, our results also highlight the positive impact of incorporating the distribution loss, ultimately leading to the best performance across all three datasets. This suggests that while the autoencoder module alone may not deliver optimal results on its own, it does exert a positive impact when integrated alongside the CL and distribution loss (TWD or CMD).

5.6 Conclusion

This chapter introduces SimEXT, a novel self-supervised learning framework for modeling extreme values in time series. The framework employs a wavelet-based data augmentation approach along with a combination of contrastive learning and autoencoders to learn the representation of the time series. To ensure that the representation preserves the extreme values, SimEXT incorporates a distribution loss function that is biased towards capturing the block maxima of the time series. Our experimental results on real-world data demonstrate that SimEXT can boost the performance of existing representation learning and downstream approaches for forecasting block maxima.
CHAPTER 6
FIDE: FREQUENCY-INFLATED CONDITIONAL DIFFUSION MODEL FOR EXTREME-AWARE TIME SERIES GENERATION

6.1 Introduction

Generative models [61, 59, 60] have revolutionized the AI landscape, demonstrating their broad applicability across diverse domains, including computer vision and natural language processing. Such models are designed to learn the underlying data distribution and exhibit resilience to overfitting while promoting automatic feature extraction. Diffusion-based models [13, 62], in particular, have emerged as a popular generative AI method due to their capability to generate realistic, high-quality data. This chapter examines the application of diffusion-based models for time series generation. In particular, we investigate the following issue: How well do existing diffusion models preserve the fidelity of extreme values (i.e., the tail distribution) of the original time series? The modeling of extreme values in time series is essential for informed decision-making across diverse applications, including weather forecasting, earthquake prediction, and disease outbreak detection. Effective generative modeling of these extremes is important as it aids in learning the underlying data distribution, facilitating data augmentation, and improving uncertainty estimation, all of which are crucial for developing robust risk management strategies and enhancing disaster preparedness measures. While there has been growing research on applying diffusion models for time series [11, 12], their ability to preserve the distribution of extreme values remains largely underexplored. In this study, we examine how effectively diffusion models preserve extreme values in the form of block maxima [10], defined as the peak value within a specified time window. To illustrate the difficulty of modeling the distribution of block maxima, Figure 6.1 shows the result of applying the Denoising Diffusion Probabilistic Model (DDPM) [13] to a synthetic AR(1) dataset. While DDPM shows proficiency in generating samples that closely align with the overall data distribution (left diagram), it struggles to preserve the distribution of block maxima values (right diagram) when the generated time series is partitioned into disjoint time windows.

Figure 6.1 Comparing the distributions of (a) all values and (b) block maxima values for real and generated samples using DDPM [13] when applied to the synthetic AR(1) dataset.

In this chapter, we identify the key shortcomings of existing diffusion models that hamper their ability to accurately model block maxima values. We then present a novel framework to overcome this limitation. Our key observation is that unusually large block maxima values, often linked to abrupt temporal changes, are strongly associated with the high-frequency components of the time series. As the diffusion-based generative model gradually introduces noise with a linearly increasing variance schedule, it slowly diminishes the long-term trends (low-frequency components) of the time series while quickly attenuating the high-frequency components. These high-frequency components are crucial for reproducing extreme block maxima values. This limitation hampers the accurate representation of the block maxima, necessitating the development of new techniques. To address this challenge, we propose an end-to-end diffusion model framework termed FIDE.
First, to mitigate the rapid dissipation of high-frequency components in the diffusion model, we introduce a novel high-frequency inflation strategy within the frequency domain. This strategic augmentation ensures the sustained emphasis on block maxima, preventing their premature fade-out. We further employ a conditional diffusion-based generative modeling approach to guide the time series generation by conditioning on their block maxima. To enhance the preservation of the block maxima distribution while learning the overall data distribution, we extend the conventional framework with a regularization term in the loss function based on the negative log-likelihood of the Generalized Extreme Value (GEV) distribution. Using these strategies, we empirically show that our approach effectively addresses the challenges of learning the overall data distribution while simultaneously preserving the block maxima distribution.

6.2 Related Works

Time series generation has been a subject of extensive research, leveraging a variety of statistical [58] and machine-learning [99, 7] techniques to capture temporal dependencies and complexities within data. Generative methods, including Generative Adversarial Networks (GANs) [59], Variational Autoencoders (VAEs) [60], normalizing flows [61], and diffusion-based approaches [13, 62], have demonstrated efficacy in time series generation and garnered interest due to their ability to learn underlying data distributions for data generation. Normalizing flows are constrained by their computational complexity, limited expressiveness, and suboptimal sample quality, thereby restricting their capacity for effective modeling. Numerous works have delved into enhancing GANs, introducing variations like RcGAN [63] and TimeGAN [7], which have demonstrated improvements in generating realistic time series data. TimeGAN [7] specifically adopts a GAN architecture to generate time series data, employing an encoder and decoder to transform a time series sample into latent vectors. However, GAN-based generative models are susceptible to issues like mode collapse and unstable behavior during training. While VAEs have not been extensively applied to synthetic time series generation, their effectiveness in addressing related challenges, such as time series imputation [64], suggests their potential utility in this domain. Diffusion-based models are also gaining traction for their ability to generate high-quality data such as images and videos, bypassing the challenges associated with discriminator networks in GANs and avoiding the artifact-prone lower-dimensional latent spaces of VAEs. There are a couple of diffusion-based works [11, 12] that have been employed for time series, but they are specifically designed for discriminative tasks. While generative AI for time series offers numerous advantages, it has not been extensively explored, especially in terms of modeling extreme values. The difficulty of modeling extremes using generative models such as normalizing flows [65] has been recognized in previous research. [66] and [67] highlight the inability of normalizing flows to accurately capture heavy-tailed marginal distributions. Specifically, these studies show that any attempt to map heavy-tailed distributions to light-tailed distributions (e.g., Gaussian) cannot maintain Lipschitz-boundedness. However, this challenge remains largely unexplored within the realm of diffusion models.
6.3 Preliminaries

Consider a time series dataset $\mathcal{D} = \{\mathbf{x}_{m,0}\}_{m=1}^{M}$ comprising 𝑀 samples, where each sample $\mathbf{x}_{m,0} = (x^1_{m,0}, x^2_{m,0}, \cdots, x^T_{m,0})$ is a univariate time series of finite length 𝑇. Let $\mathbf{f}_{m,0} \in \mathbb{R}^T$ be the Fourier coefficients, whose 𝑘-th frequency component is obtained by applying the following discrete Fourier transform on $\mathbf{x}_{m,0}$:

$f^k_{m,0} = \sum_{t=1}^{T} x^t_{m,0}\, e^{-i 2\pi t k / T} = \sum_{t=1}^{T} \left[ x^t_{m,0} \cos\left(\frac{2\pi t k}{T}\right) - i \cdot x^t_{m,0} \sin\left(\frac{2\pi t k}{T}\right) \right]$   (6.1)

The time series can be recovered from its Fourier coefficients using the following inverse discrete Fourier transform:

$x^t_{m,0} = \frac{1}{T} \sum_{k=1}^{T} f^k_{m,0}\, e^{i 2\pi t k / T} = \frac{1}{T} \sum_{k=1}^{T} \left( f^k_{m,0} \cos\left(\frac{2\pi t k}{T}\right) + i \cdot f^k_{m,0} \sin\left(\frac{2\pi t k}{T}\right) \right)$   (6.2)

For brevity, we will drop the sample subscript 𝑚 when it is clear from the context. Let $\omega_k = \frac{2\pi k}{T}$ be the 𝑘-th frequency of the Fourier transform. Given a sample $\mathbf{x}_0$, let $y_0$ be its corresponding block maxima value, where $y_0 = \max_{\tau \in \{1, \cdots, T\}} x^\tau_0$. The distribution of the block maxima values is governed by the Generalized Extreme Value (GEV) distribution given in (3.1). Given 𝑀 independent block maxima values, denoted as $\{y_{1,0}, y_{2,0}, \cdots, y_{M,0}\}$, with the cumulative distribution function given by Equation (3.1), the distribution parameters can be estimated using the maximum likelihood approach by minimizing the negative log-likelihood function in (3.4).

6.4 On the Rapid Dissipation of Block Maxima in Diffusion Models

While diffusion models have demonstrated remarkable capabilities in learning complex data distributions, a significant challenge arises in accurately capturing the distribution of block maxima values, as evidenced by Figure 6.1. Addressing this shortcoming is crucial for enhancing the performance and applicability of these models across various domains. In this section, we delve into the root cause of this phenomenon and present insightful observations that shed light on the underlying issue.

Figure 6.2 Removal of high-frequency components from a daily temperature time series significantly alters the magnitude of its block maxima value (at time step 20), as evidenced by its high residual.

6.4.1 Relationship between Abrupt Block Maxima and High-Frequency Components

Our first key observation reveals a connection between block maxima with abrupt changes and the high-frequency components of many real-world time series. Block maxima, often characterized by their rarity and abrupt temporal changes, are intrinsically linked to the high-frequency components of the data. This relationship is observed in many real-world datasets, where the block maxima values do not typically evolve smoothly but rather emerge through large deviations from their adjacent values. To illustrate this, consider the real-world temperature time series depicted in Figure 6.2. In this plot, we first transform the time series into the Fourier domain to obtain its frequency components and selectively zero out its top-5 highest frequency components. We then reconstruct the time series via its inverse Fourier transform and compute the difference between the original and reconstructed time series. The recovered signal exhibits a notable distortion around the block maxima value, as evidenced by the larger residual at time step 20, where the block maxima value occurs. This suggests that the removal of high-frequency components of a time series has a significant impact on the accurate representation of block maxima values.
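A short NumPy sketch of this check is given below: it zeroes out the highest-frequency Fourier coefficients of a series with an abrupt spike and inspects where the residual is largest. The synthetic series, the helper name, and the exact indexing of the top-5 components are assumptions made for illustration, not the temperature data used in Figure 6.2.

```python
# Illustrative sketch: zero out the highest-frequency Fourier components of a
# series and inspect the residual around the block maximum (cf. Figure 6.2).
# The synthetic series and helper name are assumptions, not the chapter's data.
import numpy as np

def residual_after_removing_high_freq(x, n_remove=5):
    """Zero the n_remove highest-frequency rFFT coefficients and return |x - x_rec|."""
    coeffs = np.fft.rfft(x)
    coeffs[-n_remove:] = 0.0          # highest frequencies sit at the end of the rFFT output
    x_rec = np.fft.irfft(coeffs, n=len(x))
    return np.abs(x - x_rec)

rng = np.random.default_rng(0)
t = np.arange(64)
x = 10 * np.sin(2 * np.pi * t / 64) + rng.normal(0, 0.3, size=64)
x[20] += 6.0                          # abrupt block maximum at time step 20

res = residual_after_removing_high_freq(x)
print("block maximum at:", np.argmax(x), " largest residual at:", np.argmax(res))
```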
Figure 6.3 Percentile distribution of first-order derivatives for the block maxima values in different time series datasets. Observe that the derivatives tend to exhibit elevated percentile values.

Definition 2 (Abrupt Block Maxima). Let $\mathbf{x}_0 \in \mathbb{R}^T$ be a time series of length 𝑇 and $y_0 \equiv x^\tau_0 = \max_{t \in 1, \ldots, T} x^t_0$ be its block maxima, where 𝜏 is the time step of the block maxima. Then, $x^\tau_0$ is considered an abrupt block maxima if $\frac{dx}{dt}\big|_{t=\tau} > \rho$, where 𝜌 is a threshold.

We argue that the block maxima values in real-world time series often exhibit an abrupt change behavior compared to the non-block maxima values. To substantiate this, we conduct an empirical analysis across five distinct datasets, wherein we assess the percentile distribution of the first-order derivatives associated with the block maxima values, as depicted in Figure 6.3. Specifically, given a time series, we first partition it into a set of disjoint time windows and compute the block maxima value within each window along with the first-order difference, $\Delta x^t = x^t_0 - x^{t-1}_0$, for each time step 𝑡. We then compute the percentile of $\Delta x^\tau$ associated with the block maxima $x^\tau_0$ of the window and plot its distribution using a boxplot as shown in Figure 6.3. Our findings affirm the conjecture that the block maxima tend to exhibit an elevated value for their time derivative (with a median larger than 70% of all first-order differences), thereby indicating a notable association between block maxima occurrences and the abrupt changes in a time series. More importantly, the abrupt changes are strongly influenced by the high-frequency components of the time series. This can be observed by differentiating the inverse Fourier transform shown in Equation 6.2 and decomposing the derivative into low and high-frequency components:

$\frac{dx^t_{m,0}}{dt} = \frac{1}{T} \sum_{k=1}^{T} f^k_{m,0} \frac{d}{dt}\left[e^{i 2\pi t k / T}\right] = \frac{1}{T} \sum_{k=1}^{T} \frac{i 2\pi k}{T} f^k_{m,0}\, e^{i 2\pi t k / T} = \frac{1}{T} \sum_{k=1}^{T} i \omega_k f^k_{m,0}\, e^{i \omega_k t},$

where $\omega_k = \frac{2\pi k}{T}$. Let 𝜅 be the threshold index for dividing the frequencies into low (𝑘 ≤ 𝜅) and high (𝑘 > 𝜅) frequency components. Then, we have:

$\frac{dx^t_{m,0}}{dt} = \frac{1}{T} \sum_{l=1}^{\kappa} i \omega_l f^l_{m,0}\, e^{i \omega_l t} + \frac{1}{T} \sum_{h=\kappa+1}^{T} i \omega_h f^h_{m,0}\, e^{i \omega_h t}$   (6.3)

Figure 6.4 The summation of high-frequency terms for abrupt (A) changes is consistently higher than for smooth (S) changes. (A) denotes abrupt changes, (S) denotes smooth changes.

To illustrate the impact of high-frequency components on the abrupt changes, we compute the value of the second term in Equation (6.3) for time steps with abrupt¹ (A) and non-abrupt (S) changes for various datasets. The results shown in Fig. 6.4 suggest that the sum of high-frequency terms for abrupt changes is consistently higher than the sum of high-frequency terms for non-abrupt changes. In essence, abrupt block maxima values often manifest as high-frequency components in the Fourier domain as they introduce sharp transitions in the time domain signal. This explains the high residual shown in Fig. 6.2 for the block maxima when the high-frequency components are zeroed out.

¹A time step has an abrupt change if its $\frac{dx}{dt}$ is in the top 90%.

6.4.2 Attenuation of Block Maxima in Diffusion Model

Our second key observation unveils a concerning behavior of diffusion models: the addition of noise diminishes high-frequency components, i.e., block maxima, at a faster rate compared to other
values in the signal.

Figure 6.5 A comparison of the effects of noise addition by existing DDPM versus high-frequency inflation on the block maxima of generated samples: (a) attenuation of block maxima by DDPM; (b) effect of high-frequency inflation.

As diffusion-based generative models gradually introduce noise characterized by a linearly increasing variance scheduler, they inadvertently attenuate the signals associated with high-frequency components. These components, as established in our first observation, are crucial for accurately reproducing block maxima. Concurrently, the models effectively capture the long-term trends and low-frequency components, which are conducive to learning the overall data distribution. However, the extents of extreme events dissipate more rapidly, hindering the model's ability to adeptly learn the distribution of these rare occurrences. Figure 6.5a illustrates this phenomenon. By tracking the evolution of residuals, or the differences between the original and perturbed time series generated by DDPM, we observe a discernible pattern: block maxima dissipate at a faster rate compared to other values, as evidenced by the higher residuals associated with these extreme points. Notably, in the early iterations highlighted by the green circle, the substantially higher residual suggests that the block maxima signal is rapidly transformed into noise, outpacing the dissipation rate of other values. This behavior poses a formidable challenge for diffusion models in effectively capturing the distributions of the block maxima values.

To substantiate our observations, we provide a theoretical justification for the rapid dissipation of block maxima during the forward process of the diffusion model. Let $\mathbf{x}_0$ be an input sample and $\mathbf{x}_n$ be the perturbed sample after 𝑛 iterations of the forward process, where $x^t_n = x^t_{n-1} + \epsilon^t_n$ and $\epsilon^t_n \sim \mathcal{N}(0, \sigma^2_{\epsilon_n})$ is Gaussian noise. Due to the linearity of the Fourier transform operator $\mathcal{F}$, we have:

$\mathcal{F}(\mathbf{x}_n) = \mathcal{F}(\mathbf{x}_{n-1}) + \mathcal{F}(\boldsymbol{\epsilon}_n) \;\Longrightarrow\; f^k_n = f^k_{n-1} + \mathcal{E}^k_n$   (6.4)

Let $\sigma^2_n$ and $\sigma^2_{\epsilon_n}$ be the variances of the perturbed time series and the noise, respectively, at diffusion step 𝑛, while $\sigma^2_{n-1}$ is the variance of the time series at diffusion step 𝑛 − 1. Note that $\sigma^2_{\epsilon_n}$ increases linearly according to a linear noise scheduler as the diffusion step increases. Let $\mathcal{S}^f_n(\omega_k)$ and $\mathcal{S}^{\mathcal{E}}_n(\omega_k)$ denote the power spectral density (PSD) of the 𝑘-th frequency component of the perturbed time series and the noise, respectively, for diffusion step 𝑛, where $\omega_k = \frac{2\pi k}{T}$. Our theoretical analysis is based on the following assumptions:

Assumption 1. An abrupt block maxima is linked to the high-frequency components of the time series.

Assumption 2. The noise $\epsilon^t_n$ is a stationary random process with a constant power spectral density (PSD) for diffusion step 𝑛, i.e., $\forall k: \mathcal{S}^{\mathcal{E}}_n(\omega_k) \approx \sigma^2_{\epsilon_n}$.

Assumption 3. The power spectral density (PSD) of the perturbed time series for diffusion step 𝑛 can be modeled using the following generalized Gaussian function:

$\mathcal{S}^f_n(\omega_k) = \sigma^2_n \cdot \exp\left(-\alpha_n |\omega_k|^{\beta_n}\right)$   (6.5)

where $\alpha_n$ is a scaling factor and $\beta_n$ is a shape parameter.

Remark 1. The rationale and supporting evidence for Assumption 1 is presented in Section 6.4.

Remark 2. Assumption 2 is intuitive as the Gaussian noise used in the diffusion model has approximately constant PSD over the frequency range of interest.

Remark 3.
Assumption 3 is reasonable as, for most real-world time series, the energy spectrum is localized at the lower frequency ($\omega_k \approx 0$), also known as the fundamental frequency, and quickly decays with increasing frequency ($\omega_k \to \omega_{\max}$) [100]. Therefore, we use the generalized Gaussian function to model this decaying behavior. Note that the exponential decay behavior will eventually transition into a uniform distribution according to the diffusion model's forward process. Consequently, the shape $\beta_n$ of the distribution function (Eq. 6.5) is not constant; instead, it evolves with the diffusion step 𝑛 to remain consistent with the diffusion model's forward process. Initially, when 𝑛 is small, $1 \le \beta_n \le 2$, representing an exponential decay. However, as $n \to N$, where 𝑁 is the final diffusion step, $\beta_n \to \infty$, and the PSD transitions to a uniform distribution. This transformation occurs because, according to the forward process of the diffusion model, as 𝑛 increases, Gaussian noise with linearly increasing variance is added to the perturbed time series, making the signal increasingly noise-like until it becomes white noise at step 𝑁. We then present the following lemma for the diffusion forward process.

Lemma 1. The difference between the variance of the perturbed time series $x^t_n$ and the variance of the noise $\epsilon^t_n$ is equal to a constant 𝜁 such that:

$\sigma^2_n - \sigma^2_{\epsilon_n} = \zeta, \quad \text{where } \zeta = \sigma^2_0 + \sum_{i=1}^{n-1} \sigma^2_{\epsilon_i}$   (6.6)

Proof. First, we have the following:

$x^t_n = x^t_{n-1} + \epsilon^t_n$   (6.7)

Applying the variance operator to both sides of Equation 6.7 and utilizing the property that the variance of a sum of independent random variables is the sum of their individual variances, we obtain:

$\sigma^2_n = \sigma^2_{n-1} + \sigma^2_{\epsilon_n}$   (6.8)

where $\sigma^2_{\epsilon_n}$ denotes the variance of the noise at step 𝑛. Recursively applying Equation 6.8, we can express the variance of the perturbed time series at step 𝑛 as:

$\sigma^2_n = \sigma^2_0 + \sum_{i=1}^{n} \sigma^2_{\epsilon_i}$   (6.9)

Subtracting $\sigma^2_{\epsilon_n}$ from both sides of Equation 6.9, we get:

$\sigma^2_n - \sigma^2_{\epsilon_n} = \sigma^2_0 + \sum_{i=1}^{n-1} \sigma^2_{\epsilon_i}$   (6.10)

Therefore, the difference between the variance of the perturbed time series and the variance of the noise at step 𝑛 is equal to the constant $\zeta = \sigma^2_0 + \sum_{i=1}^{n-1} \sigma^2_{\epsilon_i}$, which completes the proof. □

The lemma establishes a crucial bound on the difference between the variances of the perturbed time series and the noise, which is leveraged in the subsequent theorem to analyze the behavior of the Fourier transform of the perturbed time series at low and high frequencies. Using Lemma 1 and Assumptions 1, 2, and 3, we can present the following theorem:

Theorem 1. Under certain mild assumptions (1, 2, 3), the ratio of high-frequency and low-frequency components after perturbation during the forward process of the diffusion model is:

$\frac{\lim_{k \to k_{\max}} |f^k_n|^2}{\lim_{k \to 0} |f^k_n|^2} = \delta \ll 1$   (6.11)

where $k_{\max}$ is the index of the maximum frequency and $\delta = f^{k_{\max}}_n$, which is generally close to 0.

Proof. For low frequencies, i.e., $\omega_k \to 0$, taking the limit as $\omega_k \to 0$ on the expression for the signal power spectral density $\mathcal{S}^f_n(\omega_k)$, we have:

$\lim_{k \to 0} \mathcal{S}^f_n(\omega_k) = \lim_{k \to 0} \sigma^2_n \cdot \exp\left(-\alpha_n |\omega_k|^{\beta_n}\right) = \sigma^2_n$

since $\exp(-\alpha_n |0|^{\beta_n}) = \exp(0) = 1$. Therefore, as $k \to 0$, $\mathcal{S}^f_n(\omega_k) = |f^k_n|^2$ approaches the variance $\sigma^2_n$. So, we can write:

$\lim_{k \to 0} \mathcal{S}^f_n(\omega_k) = \lim_{k \to 0} |f^k_n|^2 = \sigma^2_n = \zeta + \sigma^2_{\epsilon_n} = \sigma^2_0 + \sum_{i=1}^{n-1} \sigma^2_{\epsilon_i} + \sigma^2_{\epsilon_n} = \sigma^2_0 + \sum_{i=1}^{n} \sigma^2_{\epsilon_i}$

Similarly, for high frequencies, i.e., $k \to k_{\max}$, taking the limit as $\omega \to \omega_{\max}$ on the expression for the signal power spectral density $\mathcal{S}^f_n(\omega_k)$ in Eq.
(6.5), we have:

$\lim_{k \to k_{\max}} \mathcal{S}^f_n(\omega_k) = \lim_{k \to k_{\max}} \sigma^2_n \cdot \exp\left(-\alpha_n |\omega_k|^{\beta_n}\right) \to \sigma^2_n \cdot \delta$

where $\delta = f^{k_{\max}}_n \approx 0$. As $k \to k_{\max}$, $\mathcal{S}^f_n(\omega_k)$ approaches $\sigma^2_n \cdot \delta$. We can also write:

$\lim_{k \to k_{\max}} |f^k_n|^2 = \delta \cdot \left(\zeta + \sigma^2_{\epsilon_n}\right) = \delta \cdot \left(\sigma^2_0 + \sum_{i=1}^{n} \sigma^2_{\epsilon_i}\right)$

Taking the ratio of high-frequency and low-frequency components after perturbations yields:

$\frac{\lim_{k \to k_{\max}} |f^k_n|^2}{\lim_{k \to 0} |f^k_n|^2} = \delta$ □

Thereby, high-frequency components, or abrupt block maxima, dissipate rapidly compared to low-frequency components, or smooth changes. In short, our findings shed light on a fundamental limitation of diffusion models while modeling block maxima and underscore the need for tailored approaches to preserve its distribution.

6.5 Proposed Framework: FIDE

In this section, we present the detailed methodology of our proposed framework, addressing the challenges associated with capturing extreme event distributions within generative modeling frameworks. Figure 6.6 provides an overview of the FIDE framework.

Figure 6.6 Proposed FIDE framework for generating time series with extreme events.

6.5.1 High Frequency Components Inflation

In order to counteract the rapid decay of high-frequency components in the frequency domain while adding noise in the forward process of DDPM, we present a strategy for high-frequency inflation. Let $\mathbf{f}_0 = \mathrm{FFT}(\mathbf{x}_0)$ denote the vector of Fourier coefficients resulting from applying the discrete Fourier transform to the time series $\mathbf{x}_0$. These coefficients are arranged in ascending order from lowest to highest frequency. Consequently, the last 𝜅 elements of $\mathbf{f}_0$ correspond to the coefficients associated with the 𝜅 highest frequencies. Our goal is to inflate the top-𝜅 frequency components of $\mathbf{f}_0$ as follows:

$\Gamma_i = \begin{cases} 1, & \text{if } i \le T - \kappa \\ \gamma, & \text{if } i > T - \kappa \end{cases} \qquad \text{and} \qquad \bar{\mathbf{f}}_0 = \boldsymbol{\Gamma} \odot \mathbf{f}_0$

where $\gamma > 1$ is the inflation weight and ⊙ denotes element-wise multiplication. With the modified coefficients $\bar{\mathbf{f}}_0$, the inverse Fourier transform (IFFT) is applied to get the modified time series, $\bar{\mathbf{x}}_0 = \mathrm{IFFT}(\bar{\mathbf{f}}_0)$, containing the inflated high-frequency components. Here, the high-frequency components are inflated by $\gamma > 1$. The following theorem shows how this inflation strategy helps the high-frequency components (block maxima) diminish less rapidly in the diffusion forward process compared to before.

Theorem 2. Let $\bar{f}^k_n$ be the Fourier coefficient after inflating the high-frequency components with a factor of $\gamma$ such that $\gamma > 1$. Let $\delta = f^{k_{\max}}_n$ be the Fourier coefficient of the maximum frequency before inflation and $\delta' = \delta \cdot \gamma = \bar{f}^{k_{\max}}_n$ be the Fourier coefficient of the maximum frequency after inflation. Then, using Lemma 1 and under certain mild assumptions (1, 2, 3), the ratio of high-frequency and low-frequency components after inflation and perturbations is:

$\frac{\lim_{k \to k_{\max}} |\bar{f}^k_n|^2}{\lim_{k \to 0} |\bar{f}^k_n|^2} = \delta \cdot \gamma$   (6.12)

Proof. For low frequencies, i.e., $\omega_k \to 0$, taking the limit as $\omega_k \to 0$ on the expression for the signal power spectral density $\mathcal{S}^{\bar{f}}_n(\omega_k)$, we have:

$\lim_{k \to 0} \mathcal{S}^{\bar{f}}_n(\omega_k) = \lim_{k \to 0} \sigma^2_n \cdot \exp\left(-\alpha_n |\omega_k|^{\beta_n}\right) = \sigma^2_n$

since $\exp(-\alpha_n |0|^{\beta_n}) = \exp(0) = 1$. Therefore, as $k \to 0$, $\mathcal{S}^{\bar{f}}_n(\omega_k) = |\bar{f}^k_n|^2$ approaches the variance $\sigma^2_n$. So, we can write:

$\lim_{k \to 0} \mathcal{S}^{\bar{f}}_n(\omega_k) = \lim_{k \to 0} |\bar{f}^k_n|^2 = \sigma^2_n = \zeta + \sigma^2_{\epsilon_n} = \sigma^2_0 + \sum_{i=1}^{n-1} \sigma^2_{\epsilon_i} + \sigma^2_{\epsilon_n} = \sigma^2_0 + \sum_{i=1}^{n} \sigma^2_{\epsilon_i}$

Similarly, for high frequencies, i.e., $k \to k_{\max}$, taking the limit as $\omega \to \omega_{\max}$ on the expression for the signal power spectral density $\mathcal{S}^{\bar{f}}_n(\omega_k)$ in Eq.
The following theorem shows how this inflation strategy helps the high-frequency components (block maxima) diminish less rapidly in the diffusion forward process compared to before.

Theorem 2. Let f̄_n^k be the Fourier coefficient after inflating the high-frequency components with a factor of γ such that γ > 1. Let δ = f_n^{k_max} be the Fourier coefficient of the maximum frequency before inflation and δ′ = δ · γ = f̄_n^{k_max} be the Fourier coefficient of the maximum frequency after inflation. Then, using Lemma 1 and under certain mild assumptions (1, 2, 3), the ratio of the high-frequency and low-frequency components after inflation and perturbation is:

    lim_{k→k_max} |f̄_n^k|^2 / lim_{k→0} |f̄_n^k|^2 = δ · γ        (6.12)

Proof. For low frequencies, i.e., ω_k → 0, taking the limit as ω_k → 0 on the expression for the signal power spectral density S_{f_n}(ω_k), we have:

    lim_{k→0} S_{f_n}(ω_k) = lim_{k→0} σ_n^2 · exp(−α_n |ω_k|^{β_n}) = σ_n^2

since exp(−α_n |0|^{β_n}) = exp(0) = 1. Therefore, as k → 0, S_{f_n}(ω_k) = |f̄_n^k|^2 approaches the variance σ_n^2. So, we can write:

    lim_{k→0} S_{f_n}(ω_k) = lim_{k→0} |f̄_n^k|^2 = σ_n^2 = ζ + σ_{ε_n}^2 = σ_0^2 + Σ_{i=1}^{n−1} σ_{ε_i}^2 + σ_{ε_n}^2 = σ_0^2 + Σ_{i=1}^{n} σ_{ε_i}^2

Similarly, for high frequencies, i.e., k → k_max, taking the limit as ω_k → ω_max on the expression for the signal power spectral density S_{f_n}(ω_k) (Eq. (6.5)), we have:

    lim_{k→k_max} S_{f_n}(ω_k) = lim_{k→k_max} σ_n^2 · exp(−α_n |ω_k|^{β_n}) → σ_n^2 · δ′,  where δ′ = δ · γ

As k → k_max, S_{f_n}(ω_k) approaches σ_n^2 · δ · γ. We can also write:

    lim_{k→k_max} |f̄_n^k|^2 = δ · γ · (ζ + σ_{ε_n}^2) = δ · γ · (σ_0^2 + Σ_{i=1}^{n} σ_{ε_i}^2)

Taking the ratio of the high-frequency and low-frequency components after perturbation yields:

    lim_{k→k_max} |f̄_n^k|^2 / lim_{k→0} |f̄_n^k|^2 = δ · γ        □

Thus, by applying high-frequency inflation, the high-frequency components, including abrupt block maxima, are preserved by a factor of γ compared to the previous case. We can see the effects of this inflation empirically as well. Figure 6.5b shows how inflating the high-frequency components helps preserve the block maxima values for more iterations of the diffusion model. This enables the block maxima after high-frequency inflation to dissipate at a rate similar to the other values in the earlier iterations, giving the diffusion model more iterations to capture the block maxima signal.

6.5.2 Forward Process

We use the inflated time series x̄_0, instead of x_0, as the input time series to be perturbed during the forward process. By adopting x̄_0 as the reference for the unperturbed sample, we ensure that the denoising diffusion process takes advantage of the enhanced representation provided by the inflated high-frequency components. This adjustment contributes to the efficacy of our proposed framework in capturing and preserving essential information during the diffusion process.

6.5.3 Conditional Reverse Diffusion Process

To enable the generation of samples conditioned on block maxima, we extend the conventional diffusion model to a conditional model. Here, the reverse process is conditioned on the block maxima y_0. Grounded in extreme value theory [10], the block maxima y_0 must adhere to an extreme value distribution, which diverges distinctly from the distribution of all values x_0 ∼ p_θ(x_0). This mandates a strategic shift in our learning objective. Rather than targeting the marginal p_θ(x_0), our objective now extends to learning the joint distribution p_θ(x_0, y_0), driven by an understanding of the unique characteristics inherent in extreme events and their crucial impact on the overall distribution. We formally extend the diffusion model's marginal distribution to a joint distribution in the following theorem.

Theorem 3. Let us consider the extension of the conventional diffusion model from learning a marginal distribution p_θ(x_0) to a joint distribution p_θ(x_0, y_0) conditioned on block maxima y_0. In this context, the variational lower bound can be formulated as follows:

    −log p_θ(x_0, y_0) ≤ E_q[ log ( q(x_{1:N} | x_0, y_0) / p_θ(x_{0:N}) ) ] − log p_θ(y_0)        (6.13)

Proof. The proof begins by expressing the negative log-likelihood of the joint distribution −log p_θ(x_0, y_0) in terms of conditional probabilities:

    −log p_θ(x_0, y_0) = −log ( p_θ(x_0 | y_0) · p_θ(y_0) )
    = −log p_θ(x_0 | y_0) − log p_θ(y_0)
    ≤ D_KL( q(x_{1:N} | x_0, y_0) ∥ p_θ(x_{1:N} | x_0, y_0) ) − log p_θ(x_0 | y_0) − log p_θ(y_0)
    = E_q[ log ( q(x_{1:N} | x_0, y_0) / p_θ(x_{0:N}) ) + log p_θ(x_0, y_0) ] − log p_θ(y_0)        (6.14)
    = E_q[ log ( q(x_{1:N} | x_0, y_0) / p_θ(x_{0:N}) ) ] − log p_θ(y_0)        □
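Before unpacking the bound, it may help to see how the conditioning on y_0 can enter the noise predictor. The sketch below is a toy version of a conditional ε̃_θ(x_n, n, y_0): the chapter's implementation uses a transformer-based encoder, whereas here the diffusion-step and block-maxima embeddings are simply concatenated to the input, so the architecture and all hyperparameters are our own simplification.

    import torch
    import torch.nn as nn

    class ConditionalNoisePredictor(nn.Module):
        """Minimal noise predictor conditioned on the diffusion step n and the
        block maximum y0 via learned embeddings concatenated to the input."""
        def __init__(self, seq_len: int, n_steps: int, hidden: int = 128):
            super().__init__()
            self.step_emb = nn.Embedding(n_steps + 1, hidden)
            self.y_proj = nn.Linear(1, hidden)
            self.net = nn.Sequential(
                nn.Linear(seq_len + 2 * hidden, hidden),
                nn.SiLU(),
                nn.Linear(hidden, hidden),
                nn.SiLU(),
                nn.Linear(hidden, seq_len),
            )

        def forward(self, x_n: torch.Tensor, n: torch.Tensor, y0: torch.Tensor) -> torch.Tensor:
            # x_n: (B, T) noisy series, n: (B,) step indices, y0: (B, 1) block maxima
            cond = torch.cat([x_n, self.step_emb(n), self.y_proj(y0)], dim=-1)
            return self.net(cond)

    eps_hat = ConditionalNoisePredictor(seq_len=64, n_steps=100)(
        torch.randn(4, 64), torch.randint(1, 101, (4,)), torch.randn(4, 1))
    print(eps_hat.shape)  # torch.Size([4, 64])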
First, we adopt the inflated series x̄_0 as the reference for the unperturbed sample x_0, as discussed in the previous subsection. After reparameterization and ignoring the weighting term, as suggested by [13], the first term of the variational lower bound can be expressed as:

    L_DDPM = E_{x_n, ε_n, n, y_0} ∥ ε̃_n(x_n, n, y_0) − ε_n ∥_2^2        (6.15)

Additionally, considering a Generalized Extreme Value (GEV) distribution for the block maxima, the second term simplifies to log L_GEV(μ, σ, ξ), as defined in Eq. (3.4). Conclusively, this establishes a clear link between the variational lower bound and an interpretable objective loss function:

    −log p_θ(x_0, y_0) ≤ L_DDPM − λ log L_GEV(μ, σ, ξ) := L        (6.16)

Here, L_DDPM represents the expected reconstruction error between the actual and estimated noise, and log L_GEV(μ, σ, ξ) captures the log-likelihood of the block maxima governed by a GEV distribution.

6.5.4 GEV Distribution Enforcement Module

To enforce fidelity to both the block maxima and the overall data distribution, we incorporate the Generalized Extreme Value (GEV) distribution within the DDPM framework following Theorem 3. We first fit a GEV distribution using maximum likelihood estimation with all the block maxima (y_0) values in the training data. The fitted GEV distribution is parameterized by μ, σ, and ξ, denoted as θ_gev = {μ, σ, ξ}. Using the conditional diffusion process, the estimated noise is given by ε̃_n(x_n, n, y_0). Consequently, the estimated denoised sample can be obtained as x̃_0 = x_n − ε̃_n. Then, utilizing the fitted GEV distribution, the log-likelihood of the estimated denoised block maxima, ỹ_0 = max_{τ∈{1,···,T}} x̃_0^τ, is calculated. This negative log-likelihood, −log L_GEV(μ, σ, ξ, ỹ_0), is finally incorporated into the training loss function.

6.5.5 Optimization

Algorithm 6.1 summarizes the pseudocode for training, and Algorithm 6.2 summarizes the pseudocode for sampling.

Algorithm 6.1 Training
    repeat
        x_0 ∼ q(x_0), where x_0 = (x_0^1, x_0^2, ···, x_0^T)
        f_0 = FFT(x_0)
        f̄_0 = Γ ⊙ f_0
        x̄_0 = IFFT(f̄_0)
        n ∼ Uniform({1, ···, N})
        ε_n ∼ N(0, I)
        y_0 = max(x̄_0)
        x_n = √(ᾱ_n) x̄_0 + √(1 − ᾱ_n) ε_n
        ỹ_0 = max(x_n − ε̃_n(x_n, n, y_0))
        Take the gradient step on ∇_θ [ ∥ ε̃_n(x_n, n, y_0) − ε_n ∥_2^2 − λ · log L_GEV(μ, σ, ξ, ỹ_0) ]
    until converged

Algorithm 6.2 Sampling
    Input: Block maxima ŷ_0 ∼ GEV(y_0) and trained model ε̃
    Output: Generated time series x̂_0
    x̂_N ∼ N(0, σ^2)
    for n = N, ···, 1 do
        z ∼ N(0, I)
        x̂_{n−1} = (1/√α_n) ( x̂_n − ((1 − α_n)/√(1 − ᾱ_n)) ε̃_n(x̂_n, n, ŷ_0) ) + σ_n z
    end for
    return x̂_0

The overall loss function L_FIDE is constructed by combining two key components: the DDPM loss L_DDPM and the negative log-likelihood of the Generalized Extreme Value (GEV) distribution, −log L_GEV. The formulation is expressed as follows:

    L_FIDE = E_{x_n, ε_n, n, y_0} ∥ ε̃_n(x_n, n, y_0) − ε_n ∥_2^2 − λ · log L_GEV(μ, σ, ξ, ỹ_0)        (6.17)

where λ is a hyperparameter controlling the influence of the GEV distribution on the loss. In this context, L_DDPM evaluates the mean squared difference between the estimated noise term ε̃_n and the true noise term ε_n within the conditional diffusion process. Its purpose is to guide the generative model towards effectively capturing the underlying data distribution. The second element, −log L_GEV(μ, σ, ξ, ỹ_0), encapsulates the negative log-likelihood of the GEV distribution. This component assesses how well the fitted GEV distribution aligns with the estimated block maxima values ỹ_0 derived from the denoised samples. Here, the log-likelihood is negated to represent a minimization objective, aligning with the overall goal of minimizing the loss function.
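Putting the pieces of this section together, one possible rendering of a single training step (Algorithm 6.1 and Eq. (6.17)) is sketched below. This is not the thesis implementation: the GEV parameters (μ, σ, ξ) are assumed to be pre-fitted on the training block maxima by maximum likelihood, λ = 0.1 is a placeholder, the GEV density is written for ξ ≠ 0, and the denoised estimate uses the simplified form x_n − ε̃_n from Section 6.5.4. It reuses the toy conditional predictor sketched earlier.

    import math
    import torch

    def gev_nll(y, mu, sigma, xi, eps=1e-6):
        """Negative log-likelihood of y under GEV(mu, sigma, xi), assuming xi != 0."""
        z = (y - mu) / sigma
        t = torch.clamp(1.0 + xi * z, min=eps)   # support constraint 1 + xi*z > 0
        return (math.log(sigma) + (1.0 + 1.0 / xi) * torch.log(t) + t ** (-1.0 / xi)).mean()

    def fide_training_loss(model, x0_bar, alpha_bar, mu, sigma, xi, lam=0.1):
        """Combined objective of Eq. (6.17) on a batch of inflated series x0_bar (B, T)."""
        B, _ = x0_bar.shape
        n = torch.randint(1, alpha_bar.shape[0], (B,))        # diffusion steps
        a = alpha_bar[n].unsqueeze(1)                         # cumulative alpha for step n
        eps = torch.randn_like(x0_bar)
        x_n = a.sqrt() * x0_bar + (1.0 - a).sqrt() * eps      # forward perturbation
        y0 = x0_bar.max(dim=1, keepdim=True).values           # conditioning block maxima
        eps_hat = model(x_n, n, y0)                           # predicted noise
        y0_hat = (x_n - eps_hat).max(dim=1).values            # block maxima of the denoised estimate
        ddpm_loss = ((eps_hat - eps) ** 2).mean()
        return ddpm_loss + lam * gev_nll(y0_hat, mu, sigma, xi)

The first term pulls the model toward the overall data distribution, while the second penalizes denoised samples whose block maxima are unlikely under the fitted GEV distribution.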
6.6 Experimental Evaluation

We have performed extensive experiments to evaluate the performance of our FIDE framework. All the code and datasets used in this chapter are available at https://github.com/galib19/FIDE.

6.6.1 Data

We performed our experiments using the following datasets. (1) Synthetic Data (AR2): The AR(2) dataset comprises synthetic time series data generated using an autoregressive model of order 2. (2) Financial Data (Stocks): This dataset features continuous-valued and aperiodic sequences, namely daily historical Google stock data spanning from 2004 to 2019. We consider the adjusted closing price data for this work. (3) Energy Data (Appliance Energy): The UCI Appliances energy prediction dataset [101] encompasses multivariate, continuous-valued measurements. We consider the appliance energy data for analysis. (4) Weather/Climate Data (Daily Minimum Temperature): This dataset [102] comprises daily minimum temperatures in Melbourne, Australia, from 1981 to 1990. (5) Medical Data (ECG5000: Congestive Heart Failure): The original dataset [103] for "ECG5000" originates from a 20-hour long electrocardiogram (ECG) obtained from the Physionet database. Specifically, it is derived from the BIDMC Congestive Heart Failure Database (chfdb), with the record labeled as "chf07." The processed data encompasses 5,000 heartbeats randomly selected from the original dataset.

6.6.2 Baseline Methods

We compared our proposed framework against various generative models: (1) GAN-based: We utilize two GAN-based approaches as our baselines. The first is Conditional GAN (cGAN [63]), which introduces conditional information into the training process, enabling targeted generation based on specified conditions. The second is TimeGAN [7], a generative model designed specifically for time series generation. (2) VAE-based: We employ beta-VAE [104], conditional beta-VAE [99], and TimeVAE [8] as baseline methods. Both beta-VAE and conditional beta-VAE incorporate a specific disentanglement objective to encourage the model to learn more interpretable and factorized representations, while TimeVAE [8] promotes interpretability. (3) Flow-based: We use normalizing flow-based approaches such as RealNVP [105] and Fourier-Flows [106]. (4) Diffusion-based: We consider two baselines, namely the denoising diffusion probabilistic model (DDPM) [13] and a time series diffusion model called Diffusion-TS [9].

6.6.3 Experimental Settings

We partitioned each dataset into training, validation, and testing sets according to an 8:1:1 ratio and repeated the experiments 5 times. Prior to applying the various algorithms, the time series data is standardized to have zero mean and unit variance. The encoder component of our framework employs a 3-layer transformer architecture, accompanied by fully connected layers.

Figure 6.7 Comparison of block maxima distribution and all values distribution for real and generated samples using the proposed FIDE model and DDPM [13] when applied to the synthetic AR(1) dataset. Panels: (a) FIDE, (b) DDPM [13], (c) FIDE, (d) DDPM [13].

The training was facilitated using the Adam optimizer.
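As a rough sketch of the preprocessing just described, the snippet below slices a series into fixed-length windows, standardizes with training-set statistics, splits 8:1:1, and extracts per-window block maxima for the GEV fit. The window length and the choice to standardize with global training statistics are our assumptions rather than details given in the chapter.

    import numpy as np

    def make_splits(series: np.ndarray, window: int = 64):
        """Window a univariate series, standardize with training statistics,
        and split 8:1:1 into train/validation/test."""
        n_win = len(series) // window
        windows = series[: n_win * window].reshape(n_win, window)
        n_train, n_val = int(0.8 * n_win), int(0.1 * n_win)
        train, val, test = np.split(windows, [n_train, n_train + n_val])
        mean, std = train.mean(), train.std()            # statistics from training data only
        std = std if std > 0 else 1.0
        splits = [(w - mean) / std for w in (train, val, test)]
        block_maxima = [w.max(axis=1) for w in splits]   # per-window block maxima for the GEV fit
        return splits, block_maxima

    (train, val, test), (y_tr, y_va, y_te) = make_splits(
        np.random.default_rng(2).standard_normal(10_000))
    print(train.shape, val.shape, test.shape, y_tr.shape)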
For all the methods, we perform extensive hyperparameter tuning on the length of the embedding vector, the number of hidden layers, the number of nodes, the learning rate, and the batch size. The optimal hyperparameters were determined using the Ray Tune framework, integrating an Asynchronous Successive Halving Algorithm (ASHA) scheduler to enable early stopping. All experiments were conducted on an NVIDIA T4 GPU.

To assess the effectiveness of the proposed framework, we utilize four metrics: Jensen-Shannon Divergence, KL Divergence, CRPS (Continuous Ranked Probability Score), and Predictive Score. The first three metrics are intricately linked to the analysis of distributions: Jensen-Shannon Divergence and KL Divergence measure the difference between probability distributions, while CRPS assesses the accuracy of predicted cumulative probabilities against observed outcomes. The fourth metric, the Predictive Score [7], was introduced to evaluate the ability of a generative model to accurately replicate the temporal characteristics of the original data. It involves training a naive LSTM-based sequence model for time series forecasting on the synthesized samples; the performance of this predictive model is then gauged by the mean absolute error (MAE) on the original test data, providing insight into the generative model's capacity to faithfully reproduce temporal properties. For our case, we evaluate the forecasting performance of block maxima on the test dataset by the model trained on generated data.

6.6.4 Experimental Results

Table 6.1 compares the performance of FIDE against several state-of-the-art baselines in terms of their ability to capture the block maxima distribution for 5 diverse datasets (AR1, Stock, Energy, Temperature, and ECG).
Metric / Method | AR1 | Stock | Energy | Temperature | ECG

JS Divergence
beta-VAE | 0.0211±0.0187 | 0.1105±0.0188 | 0.0722±0.0095 | 0.0140±0.0125 | 0.1210±0.0214
c-beta-VAE | 0.0190±0.0125 | 0.1011±0.0152 | 0.0710±0.0088 | 0.0109±0.0098 | 0.1120±0.0352
TimeVAE | 0.0015±0.0003 | 0.1054±0.0071 | 0.0795±0.0085 | 0.0096±0.0002 | 0.0985±0.0078
TimeGAN | 0.0840±0.0109 | 0.1411±0.1585 | 0.0950±0.0089 | 0.0112±0.0012 | 0.1620±0.0221
cGAN | 0.0690±0.0091 | 0.1211±0.0205 | 0.0890±0.0093 | 0.0091±0.0008 | 0.1440±0.0211
RealNVP | 0.0754±0.0121 | 0.1185±0.0108 | 0.0905±0.0084 | 0.0089±0.0007 | 0.1411±0.0116
Fourier-Flows | 0.0612±0.0045 | 0.1108±0.0195 | 0.0820±0.0044 | 0.0078±0.0010 | 0.1398±0.0202
DDPM | 0.0010±0.0007 | 0.0912±0.0062 | 0.0752±0.0082 | 0.0082±0.0009 | 0.1041±0.0122
Diffusion-TS | 0.0011±0.0008 | 0.0854±0.0045 | 0.0712±0.0071 | 0.0077±0.0008 | 0.1005±0.0108
FIDE (Ours) | 0.0004±0.0001 | 0.0700±0.0061 | 0.0680±0.0092 | 0.0007±0.0001 | 0.0930±0.0082

KL Divergence
beta-VAE | 0.0110±0.0024 | 0.1947±0.0184 | 0.1210±0.0146 | 0.0410±0.0128 | 0.2020±0.0048
c-beta-VAE | 0.0091±0.0012 | 0.1744±0.0105 | 0.1160±0.0174 | 0.0360±0.0114 | 0.1880±0.0079
TimeVAE | 0.0105±0.0007 | 0.2514±0.0152 | 0.1625±0.0095 | 0.0490±0.0006 | 0.2254±0.0068
TimeGAN | 0.1920±0.0156 | 0.2425±0.0251 | 0.1590±0.0198 | 0.0550±0.0145 | 0.2540±0.0254
cGAN | 0.1240±0.0122 | 0.2101±0.0115 | 0.1510±0.0211 | 0.0490±0.0125 | 0.2210±0.0184
RealNVP | 0.1298±0.0215 | 0.2295±0.0154 | 0.1605±0.0310 | 0.0512±0.0108 | 0.2305±0.0145
Fourier-Flows | 0.1235±0.0104 | 0.2045±0.0255 | 0.1458±0.0345 | 0.0505±0.0136 | 0.2254±0.0141
DDPM | 0.0062±0.0008 | 0.1915±0.0125 | 0.1120±0.0108 | 0.0326±0.0090 | 0.1905±0.0094
Diffusion-TS | 0.0054±0.0007 | 0.1889±0.0108 | 0.1089±0.0115 | 0.0311±0.0078 | 0.1894±0.0081
FIDE (Ours) | 0.0030±0.0009 | 0.1504±0.0128 | 0.0950±0.0098 | 0.0029±0.0008 | 0.1810±0.0084

CRPS
beta-VAE | 0.1247±0.0189 | 0.3149±0.0348 | 0.2410±0.0298 | 0.1554±0.0214 | 0.3059±0.0454
c-beta-VAE | 0.1154±0.0151 | 0.2698±0.0214 | 0.2574±0.0241 | 0.1420±0.0311 | 0.3150±0.0414
TimeVAE | 0.1511±0.0081 | 0.2547±0.0155 | 0.2853±0.1082 | 0.1847±0.0071 | 0.3252±0.0204
TimeGAN | 0.1858±0.0214 | 0.2825±0.0418 | 0.2685±0.0284 | 0.2110±0.0287 | 0.3240±0.0401
cGAN | 0.1224±0.0157 | 0.2689±0.0301 | 0.2385±0.0187 | 0.1990±0.0214 | 0.2985±0.0311
RealNVP | 0.1325±0.0144 | 0.2545±0.0258 | 0.2541±0.0214 | 0.2014±0.0354 | 0.2824±0.0425
Fourier-Flows | 0.1305±0.0254 | 0.2589±0.0214 | 0.2415±0.0211 | 0.1975±0.0251 | 0.2884±0.0215
DDPM | 0.0422±0.0084 | 0.2422±0.0187 | 0.2199±0.0874 | 0.1516±0.0211 | 0.2488±0.0388
Diffusion-TS | 0.0398±0.0092 | 0.2358±0.0211 | 0.2125±0.0454 | 0.1525±0.0315 | 0.2415±0.0451
FIDE (Ours) | 0.0310±0.0098 | 0.2115±0.0152 | 0.2085±0.0985 | 0.0517±0.0082 | 0.2345±0.0204

Predictive Score
beta-VAE | 0.6350±0.0201 | 0.9528±0.0314 | 0.7410±0.0187 | 0.6814±0.0108 | 0.9420±0.0142
c-beta-VAE | 0.6240±0.0145 | 0.9226±0.0165 | 0.7317±0.0163 | 0.6718±0.0025 | 0.9310±0.0214
TimeVAE | 0.6150±0.0104 | 0.9140±0.0218 | 0.7325±0.0195 | 0.6723±0.0036 | 0.9150±0.0112
TimeGAN | 0.6050±0.0104 | 0.8950±0.0198 | 0.7280±0.0187 | 0.6718±0.0047 | 0.8960±0.0084
cGAN | 0.6120±0.0014 | 0.9354±0.0210 | 0.7310±0.0147 | 0.6847±0.0041 | 0.9220±0.0191
RealNVP | 0.6884±0.0011 | 0.9988±0.0354 | 0.7898±0.0254 | 0.7852±0.0017 | 0.9730±0.0215
Fourier-Flows | 0.6925±0.0031 | 0.9844±0.0241 | 0.7955±0.0088 | 0.7871±0.0021 | 0.9655±0.0221
DDPM | 0.6148±0.0081 | 0.8997±0.0111 | 0.7350±0.0102 | 0.6708±0.0098 | 0.9121±0.0121
Diffusion-TS | 0.6105±0.0045 | 0.8912±0.0105 | 0.7355±0.0084 | 0.6708±0.0108 | 0.9089±0.0095
FIDE (Ours) | 0.6081±0.0098 | 0.8871±0.0104 | 0.7240±0.0087 | 0.6694±0.0082 | 0.9040±0.0112

Table 6.1 Comparison of generated samples' block maxima distribution metrics and predictive score. Bold and underlined entries denote the best and second-best result.

In terms of the 3 distribution metrics (JS divergence, KL divergence, and CRPS), FIDE consistently achieves the best results, providing evidence of FIDE's superior performance in preserving the block maxima distribution. For the Predictive Score metric, FIDE achieves the best results in 3 out of 5 datasets and ranks second in the remaining 2 datasets.
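The distribution metrics reported above could be computed along the following lines. This is a sketch of one plausible implementation (histogram-based estimates, with the binning and the empirical-CDF form of the CRPS-style score chosen by us), not the exact evaluation code used in the chapter.

    import numpy as np
    from scipy.spatial.distance import jensenshannon
    from scipy.stats import entropy

    def distribution_metrics(real_max: np.ndarray, gen_max: np.ndarray, bins: int = 50):
        """Histogram-based JS divergence, KL divergence, and an integrated squared
        CDF-difference score between real and generated block maxima."""
        lo = min(real_max.min(), gen_max.min())
        hi = max(real_max.max(), gen_max.max())
        edges = np.linspace(lo, hi, bins + 1)
        p, _ = np.histogram(real_max, bins=edges)
        q, _ = np.histogram(gen_max, bins=edges)
        p = (p + 1e-12) / (p + 1e-12).sum()
        q = (q + 1e-12) / (q + 1e-12).sum()
        js = jensenshannon(p, q) ** 2                 # squared distance -> divergence
        kl = entropy(p, q)                            # KL(p || q)
        grid = 0.5 * (edges[:-1] + edges[1:])
        crps_like = np.trapz((np.cumsum(p) - np.cumsum(q)) ** 2, grid)
        return js, kl, crps_like

    print(distribution_metrics(np.random.gumbel(size=1000),
                               np.random.gumbel(0.1, 1.1, 1000)))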
Metric / Method | AR1 | Stock | Energy | Temperature | ECG

JS Divergence
beta-VAE | 0.0013±0.0004 | 0.0091±0.0011 | 0.0091±0.0008 | 0.0012±0.0002 | 0.0015±0.0003
c-beta-VAE | 0.0011±0.0004 | 0.0088±0.0012 | 0.0089±0.0007 | 0.0010±0.0001 | 0.0014±0.0002
TimeVAE | 0.0010±0.0003 | 0.0086±0.0008 | 0.0081±0.0011 | 0.0009±0.0001 | 0.0014±0.0001
TimeGAN | 0.0015±0.0004 | 0.0092±0.0015 | 0.0082±0.0009 | 0.0011±0.0003 | 0.0018±0.0003
cGAN | 0.0012±0.0002 | 0.0091±0.0010 | 0.0085±0.0008 | 0.0008±0.0001 | 0.0015±0.0002
RealNVP | 0.0014±0.0003 | 0.0093±0.0011 | 0.0092±0.0008 | 0.0010±0.0002 | 0.0017±0.0001
Fourier-Flows | 0.0013±0.0004 | 0.0087±0.0012 | 0.0083±0.0007 | 0.0009±0.0001 | 0.0015±0.0002
DDPM | 0.0007±0.0001 | 0.0061±0.0007 | 0.0058±0.0005 | 0.0005±0.0001 | 0.0010±0.0001
Diffusion-TS | 0.0008±0.0001 | 0.0057±0.0008 | 0.0061±0.0005 | 0.0008±0.0001 | 0.0009±0.0001
FIDE (Ours) | 0.0008±0.0001 | 0.0068±0.0010 | 0.0056±0.0006 | 0.0006±0.0002 | 0.0011±0.0001

KL Divergence
beta-VAE | 0.0020±0.0003 | 0.0188±0.0016 | 0.0181±0.0015 | 0.0025±0.0003 | 0.0031±0.0004
c-beta-VAE | 0.0017±0.0004 | 0.0178±0.0019 | 0.0177±0.0017 | 0.0022±0.0004 | 0.0028±0.0004
TimeVAE | 0.0016±0.0003 | 0.0169±0.0015 | 0.0159±0.0021 | 0.0018±0.0002 | 0.0026±0.0003
TimeGAN | 0.0025±0.0003 | 0.0182±0.0025 | 0.0161±0.0016 | 0.0021±0.0005 | 0.0034±0.0005
cGAN | 0.0018±0.0003 | 0.0178±0.0018 | 0.0169±0.0016 | 0.0015±0.0003 | 0.0029±0.0003
RealNVP | 0.0023±0.0004 | 0.0185±0.0019 | 0.0185±0.0017 | 0.0019±0.0004 | 0.0036±0.0002
Fourier-Flows | 0.0019±0.0003 | 0.0173±0.0021 | 0.0165±0.0015 | 0.0017±0.0003 | 0.0028±0.0003
DDPM | 0.0010±0.0001 | 0.0117±0.0011 | 0.0114±0.0010 | 0.0009±0.0001 | 0.0019±0.0003
Diffusion-TS | 0.0011±0.0001 | 0.0114±0.0012 | 0.0116±0.0009 | 0.0010±0.0002 | 0.0019±0.0003
FIDE (Ours) | 0.0012±0.0001 | 0.0121±0.0015 | 0.0109±0.0009 | 0.0011±0.0002 | 0.0021±0.0004

CRPS
beta-VAE | 0.0201±0.0041 | 0.4955±0.0125 | 0.4985±0.0102 | 0.0914±0.0010 | 0.1425±0.0049
c-beta-VAE | 0.1984±0.0022 | 0.4205±0.0148 | 0.4514±0.0210 | 0.0899±0.0009 | 0.1388±0.0068
TimeVAE | 0.1848±0.0038 | 0.4841±0.085 | 0.4815±0.0189 | 0.0889±0.0009 | 0.1422±0.0077
TimeGAN | 0.2412±0.0019 | 0.3941±0.0115 | 0.4415±0.0171 | 0.0911±0.0008 | 0.1262±0.0062
cGAN | 0.1974±0.0012 | 0.4451±0.0201 | 0.3914±0.0211 | 0.0903±0.0007 | 0.1298±0.0056
RealNVP | 0.2511±0.0019 | 0.4254±0.0194 | 0.5125±0.0184 | 0.0919±0.0007 | 0.1405±0.0035
Fourier-Flows | 0.2214±0.0024 | 0.3814±0.0164 | 0.4514±0.0123 | 0.0912±0.0008 | 0.1281±0.0077
DDPM | 0.1595±0.0018 | 0.2955±0.0144 | 0.3215±0.0154 | 0.0875±0.0006 | 0.1028±0.0062
Diffusion-TS | 0.1565±0.0016 | 0.2985±0.0174 | 0.3285±0.0149 | 0.0863±0.0007 | 0.1018±0.0045
FIDE (Ours) | 0.1541±0.0021 | 0.3001±0.0191 | 0.3251±0.0177 | 0.0893±0.0007 | 0.1061±0.0054

Predictive Score
beta-VAE | 0.8121±0.0410 | 1.0915±0.0215 | 0.8515±0.0104 | 0.8021±0.0109 | 0.9911±0.0133
c-beta-VAE | 0.7951±0.0555 | 1.0841±0.0121 | 0.8442±0.0110 | 0.7958±0.0089 | 0.9891±0.0151
TimeVAE | 0.7714±0.0345 | 1.0662±0.0211 | 0.8394±0.0089 | 0.7821±0.0105 | 0.9822±0.0101
TimeGAN | 0.7514±0.0451 | 1.0621±0.0198 | 0.8379±0.0151 | 0.7856±0.0098 | 0.9862±0.0125
cGAN | 0.7694±0.0354 | 1.0721±0.0188 | 0.8433±0.0181 | 0.7905±0.0122 | 0.9874±0.0151
RealNVP | 0.8011±0.0384 | 1.0914±0.0178 | 0.8533±0.0154 | 0.8033±0.0135 | 0.9981±0.0201
Fourier-Flows | 0.7985±0.0324 | 1.1008±0.0205 | 0.8501±0.0151 | 0.7988±0.0140 | 0.9954±0.0188
DDPM | 0.7711±0.0441 | 1.0751±0.0184 | 0.8488±0.0133 | 0.7912±0.0125 | 0.9925±0.0167
Diffusion-TS | 0.7684±0.0405 | 1.0722±0.0189 | 0.8501±0.0125 | 0.7889±0.0129 | 0.9910±0.0155
FIDE (Ours) | 0.7651±0.0488 | 1.0692±0.0192 | 0.8458±0.0151 | 0.7895±0.0135 | 0.9852±0.0158

Table 6.2 Comparison of the generated sample distribution for the distribution metrics and predictive score. Bold and underlined entries denote the best and second-best result.
The distribution plots for the synthetic AR(1) data generated by DDPM [13] and FIDE, shown in Figure 6.7-(a) and (b), provide further evidence of FIDE's capabilities. While DDPM struggles to capture the block maxima distribution accurately, FIDE generates samples that more effectively maintain the fidelity of the distribution. This improvement is particularly noticeable in the upper tail behavior, which is critical for applications that require precise modeling of extreme block maxima. This superior performance is not surprising, as it directly results from our method's emphasis on the block maxima distribution, achieved through the introduction of frequency inflation, conditional generation based on block maxima, and the incorporation of the GEV distribution into the generative modeling framework.

As FIDE prioritizes the accurate modeling of block maxima, we have also evaluated its efficacy in capturing the distribution of all (block maxima and non-block maxima) values in the time series. The results are shown in Table 6.2. FIDE achieves performance comparable to state-of-the-art methods such as DDPM [13] and Diffusion-TS [9]. This is further illustrated by the distribution plots of all values for DDPM and FIDE given in Figure 6.7-(c) and (d). The results in Table 6.2 also show that FIDE consistently outperforms the VAE-based, GAN-based, and Flow-based alternatives. For the Predictive Score, while TimeGAN and TimeVAE show marginally better results, FIDE maintains competitive performance against the other baseline methods. These results suggest a minimal performance trade-off when applying FIDE to time series data: despite its emphasis on block maxima values, FIDE does not significantly compromise its ability to model the overall distribution. This positions FIDE as a robust and versatile generative model for capturing extreme values in time series.

6.6.5 Ablation Study

In our ablation study, depicted in Table 6.3, we systematically assessed the individual contributions of each component within our proposed framework. By selectively deactivating elements such as the GEV loss, the conditional block maxima input, and the high-frequency inflation module, we observed consistent performance degradation across all scenarios. Notably, the absence of the conditional block maxima input significantly impacted the Jensen-Shannon Divergence and KL Divergence metrics, while the lack of the GEV loss had the most pronounced effect on the CRPS metric. Surprisingly, the Predictive Score remained relatively resilient to the deactivation of any single component, suggesting a degree of redundancy or compensatory mechanisms among the remaining components. Overall, our ablation study highlights the indispensable role of each component in achieving optimal performance in our model. In summary, our findings underscore the holistic importance of the individual components, with their synergistic interplay contributing to the overall effectiveness of FIDE.
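The three ablations in Table 6.3 could be realized as simple configuration switches around the components described in Section 6.5. The sketch below uses hypothetical field names and default values of our own choosing.

    from dataclasses import dataclass

    @dataclass
    class FIDEConfig:
        gamma: float = 1.5        # high-frequency inflation weight (gamma = 1 disables inflation)
        kappa: int = 8            # number of top frequencies to inflate
        lam: float = 0.1          # weight of the GEV loss (lam = 0 disables GEV enforcement)
        conditional: bool = True  # condition the reverse process on block maxima

    ABLATIONS = {
        "FIDE":                       FIDEConfig(),
        "FIDE - frequency inflation": FIDEConfig(gamma=1.0),
        "FIDE - conditional":         FIDEConfig(conditional=False),
        "FIDE - GEV Loss":            FIDEConfig(lam=0.0),
    }

    for name, cfg in ABLATIONS.items():
        print(f"{name:30s} gamma={cfg.gamma} lam={cfg.lam} conditional={cfg.conditional}")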
Metric / Method | AR1 | Stock | Energy | Temperature | ECG

Jensen-Shannon Divergence
FIDE - frequency inflation | 0.0006±0.0001 | 0.0898±0.0054 | 0.0822±0.0084 | 0.0009±0.0001 | 0.0984±0.0058
FIDE - conditional | 0.0007±0.0001 | 0.1054±0.0089 | 0.0941±0.0098 | 0.0010±0.0002 | 0.1102±0.0098
FIDE - GEV Loss | 0.0005±0.0001 | 0.0813±0.0035 | 0.0715±0.0041 | 0.0008±0.0001 | 0.0922±0.0056
FIDE | 0.0004±0.0001 | 0.0700±0.0061 | 0.0680±0.0092 | 0.0007±0.0001 | 0.0930±0.0082

KL Divergence
FIDE - frequency inflation | 0.0042±0.0008 | 0.1559±0.0161 | 0.1054±0.0049 | 0.0036±0.0006 | 0.1854±0.0064
FIDE - conditional | 0.0051±0.0010 | 0.1689±0.0210 | 0.1089±0.0095 | 0.0041±0.0010 | 0.1901±0.0063
FIDE - GEV Loss | 0.0039±0.0007 | 0.1551±0.0188 | 0.1021±0.0088 | 0.0032±0.0009 | 0.1823±0.0092
FIDE | 0.0030±0.0009 | 0.1504±0.0128 | 0.0950±0.0098 | 0.0029±0.0008 | 0.1810±0.0084

CRPS
FIDE - frequency inflation | 0.0391±0.0078 | 0.2172±0.0158 | 0.2152±0.0791 | 0.0649±0.0081 | 0.2372±0.0181
FIDE - conditional | 0.0335±0.0089 | 0.2165±0.0132 | 0.2082±0.0768 | 0.0651±0.0047 | 0.2382±0.0184
FIDE - GEV Loss | 0.0415±0.0087 | 0.2232±0.0203 | 0.2189±0.0874 | 0.0815±0.0104 | 0.2456±0.0399
FIDE | 0.0310±0.0098 | 0.2115±0.0152 | 0.2085±0.0985 | 0.0517±0.0082 | 0.2345±0.0204

Predictive Score
FIDE - frequency inflation | 0.6070±0.0112 | 0.8942±0.0158 | 0.7264±0.0069 | 0.6711±0.0091 | 0.9081±0.0154
FIDE - conditional | 0.6095±0.0079 | 0.8901±0.0141 | 0.7261±0.0081 | 0.6715±0.0078 | 0.9059±0.0122
FIDE - GEV Loss | 0.6089±0.0089 | 0.8891±0.0122 | 0.7269±0.0074 | 0.6712±0.0009 | 0.9062±0.0058
FIDE | 0.6081±0.0098 | 0.8871±0.0104 | 0.7240±0.0087 | 0.6694±0.0082 | 0.9040±0.0112

Table 6.3 Ablation study of the generated samples' block maxima distribution metrics and predictive score using the proposed FIDE model with and without its individual components.

6.7 Conclusions

This framework entails a comprehensive strategy for refining the model's ability to capture and emphasize extreme event distributions. Through a thorough exploration of the constraints of current diffusion-based models, the introduced FIDE framework addresses these limitations by introducing strategies to maintain high-frequency components. Furthermore, it extends conventional diffusion models to enable conditional generation while integrating the Generalized Extreme Value (GEV) distribution. One limitation of the framework is that its theoretical justifications rest on some mild assumptions; however, we have justified those assumptions with empirical evidence and prior literature. The framework's superiority over baseline methods is confirmed through rigorous experiments on both real-world and synthetic data.

CHAPTER 7
CONCLUSIONS AND FUTURE WORK

This thesis explores the subject of modeling extremes in time series, with a focus on predictive and generative modeling techniques. The first three works of the thesis explore and develop predictive modeling techniques for time series extremes, with the goal of accurately forecasting and identifying extreme events in time series data. In the final work of the thesis, attention shifts towards conditional generative modeling for time series extremes, specifically diffusion-based conditional generative modeling. This dissertation presents significant advancements in the field of time series modeling by developing deep learning frameworks that improve the prediction and generative modeling of extreme values.
These contributions provide robust solutions to the challenges posed by extreme events in time series data, with applications spanning risk management, disaster preparedness, and various other domains. The methodologies introduced herein have the potential to advance both theoretical understanding and practical applications in time series forecasting and generative modeling. The key contributions of this thesis are as follows:

DeepExtrema: Introduced in Chapter 3, this framework combines a deep neural network with a generalized extreme value (GEV) distribution to forecast block maxima in time series data. This approach addresses the challenge of maintaining positivity constraints on the GEV parameters and demonstrates superior performance through extensive experiments on real-world and synthetic data.

Self-Recover: Presented in Chapter 4, this self-supervised learning framework addresses the challenge of disparate temporal coverage in time series predictors. It employs a combination of contrastive and generative self-supervised schemes to impute long-term missing values and a denoising autoencoder for random missing values, effectively enhancing block maxima predictions.

SimEXT: Detailed in Chapter 5, this self-supervised learning framework focuses on robust representation learning of time series data to capture tail distributions. By integrating wavelet-based data augmentation and a loss function based on empirical cumulative distribution functions, SimEXT improves the forecasting of extreme values.

FIDE: Introduced in Chapter 6, FIDE is a conditional diffusion model that addresses the generative modeling of extreme values in time series. By employing a high-frequency inflation strategy and incorporating the GEV distribution, FIDE enhances time series generation with a focus on extreme events, demonstrating its efficacy through experiments on both real-world and synthetic data.

7.1 Future Work

The application of deep learning to time series extremes presents numerous avenues for further research. Two broad areas warrant particular attention. First, extending research into modeling spatio-temporal relationships could significantly enhance the accuracy and comprehensiveness of forecasting models by capturing how extreme events evolve and interact across both space and time. This is especially relevant in fields like meteorology, environmental science, and public health. Second, improving the interpretability of models predicting time series extremes is crucial for gaining the trust of stakeholders and enabling informed decision-making. Enhancing model transparency and explainability through new visualization techniques and inherently interpretable model designs will facilitate a deeper understanding of the factors driving extreme events and improve the communication of these insights to decision-makers.

7.1.1 Spatio-Temporal Modeling of Extreme Events

A promising direction for future research is the development of frameworks that can effectively model the spatio-temporal dependencies and dynamics of extreme events. Extreme events, such as natural disasters, often exhibit complex spatial and temporal patterns, with their occurrences influenced by various environmental factors and their impacts propagating across different geographic regions over time. Capturing these intricate spatio-temporal relationships is essential for accurate forecasting and risk assessment.
Potential avenues for exploration include the integration of deep learning techniques with spatio-temporal models, such as convolutional LSTMs, graph neural networks, or transformer architectures. These approaches could leverage the ability of deep learning to learn complex representations from data, while explicitly accounting for spatial and temporal dependencies through the model structure. Additionally, incorporating domain knowledge and physics-based models into the deep learning frameworks could further enhance their capability to capture the underlying mechanisms driving extreme events. Furthermore, the development of multivariate and multi-task learning frameworks could enable the joint modeling of multiple extreme event types and their interactions. Such approaches could provide a more comprehensive understanding of the complex dynamics at play and facilitate more holistic risk assessment and decision-making. There are some potential challenges in this direction, as follows:

• Handling high-dimensional and multi-scale data: Extreme events often involve complex interactions across multiple spatial and temporal scales, requiring models to effectively process and integrate high-dimensional data from various sources.

• Accounting for non-stationarities and regime shifts: Many extreme events exhibit non-stationary behavior, with their patterns and drivers subject to abrupt changes or regime shifts over time, posing challenges for traditional spatio-temporal models.

• Incorporating physical constraints and domain knowledge: Integrating domain-specific knowledge, physical laws, and constraints into deep learning models can improve their ability to capture realistic spatio-temporal dynamics, but doing so in a principled and effective manner remains a challenge.

• Scalability and computational efficiency: Modeling intricate spatio-temporal dependencies across large geographic regions and extended time horizons can be computationally intensive, necessitating the development of scalable and efficient architectures.

Overcoming these challenges would enable the development of comprehensive spatio-temporal models that can accurately capture the complex interplay between extreme events and their environmental drivers across different spatial and temporal scales. Such models would have far-reaching applications in fields like meteorology, climate science, environmental monitoring, and disaster risk management. By providing accurate forecasts and insights into the spatial evolution and propagation of extreme events, these models could inform proactive mitigation strategies, resource allocation decisions, and targeted interventions to minimize the adverse impacts on communities and ecosystems.

7.1.2 Interpretability and Explainability in Extreme Event Prediction

While the deep learning models presented in this thesis have demonstrated remarkable predictive performance in modeling extreme events, their inherent complexity often hinders interpretability and explainability. Improving the transparency and interpretability of these models is crucial for gaining stakeholder trust, enabling effective communication with decision-makers, and facilitating a deeper understanding of the underlying factors driving extreme events. One promising direction is the development of inherently interpretable deep learning architectures, such as attention-based models, disentangled representations, or prototype-based models.
These approaches could provide insights into the specific features or patterns that contribute to the prediction of extreme events, enabling better interpretability. Another avenue for exploration is the integration of post-hoc explanation techniques, such as saliency maps, concept activation vectors, or counterfactual explanations. These methods aim to provide human-understandable explanations for model predictions by highlighting the most relevant input features or by generating counterfactual examples that elucidate the decision boundaries. Furthermore, the development of novel visualization techniques tailored for extreme event predictions could significantly enhance the communication and interpretation of model outputs. Interactive visualizations and dashboards that effectively convey the spatial and temporal characteristics of predicted extreme events, along with their associated uncertainties, could facilitate more informed decision-making by stakeholders. Potential challenges in this area include:

• Quantifying and communicating uncertainty: Extreme event predictions are inherently uncertain, and effectively quantifying and communicating this uncertainty to stakeholders in an interpretable manner is crucial but challenging.

• Maintaining interpretability in complex models: As models become more sophisticated (e.g., incorporating spatio-temporal dependencies or multi-task learning), preserving interpretability becomes increasingly challenging, requiring novel approaches to disentangle and explain the model's decision-making process.

• Evaluation of interpretability and explainability: Developing rigorous metrics and evaluation frameworks for assessing the quality and effectiveness of interpretability and explainability methods remains an open challenge.

• Balancing accuracy and interpretability: There is often a trade-off between model complexity (which improves accuracy) and interpretability, requiring careful model design and regularization techniques to strike an appropriate balance.

By addressing the interpretability and explainability challenges, these future research directions could contribute to building more trustworthy and actionable models for extreme event prediction, ultimately enabling better risk assessment, mitigation strategies, and decision-making processes.

BIBLIOGRAPHY

[1] Chenglei Peng, Yang Li, Yao Yu, Yu Zhou, and Sidan Du. Multi-step-ahead host load prediction with gru based encoder-decoder in cloud computing. In 2018 10th International Conference on Knowledge and Smart Technology (KST), pages 186–191. IEEE, 2018.

[2] Alaa Sagheer and Mostafa Kotb. Time series forecasting of petroleum production using deep lstm recurrent networks. Neurocomputing, 323:203–213, 2019.

[3] Shamsul Masum, Ying Liu, and John Chiverton. Multi-step time series forecasting of electric load using machine learning models. In International conference on artificial intelligence and soft computing, pages 148–159. Springer, 2018.

[4] Nikolay Laptev, Jason Yosinski, Li Erran Li, and Slawek Smyl. Time-series extreme event forecasting with neural networks at uber. In International conference on machine learning, volume 34, pages 1–5, 2017.

[5] Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 11106–11115, 2021.
[6] Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Advances in neural information processing systems, 34:22419–22430, 2021.

[7] Jinsung Yoon, Daniel Jarrett, and Mihaela Van der Schaar. Time-series generative adversarial networks. Advances in neural information processing systems, 32, 2019.

[8] Abhyuday Desai, Cynthia Freeman, Zuhui Wang, and Ian Beaver. Timevae: A variational auto-encoder for multivariate time series generation. arXiv preprint arXiv:2111.08095, 2021.

[9] Xinyu Yuan and Yan Qiao. Diffusion-TS: Interpretable Diffusion for General Time Series Generation. arXiv preprint arXiv:2403.01742, 2024.

[10] Stuart Coles, Joanna Bawa, Lesley Trenner, and Pat Dorazio. An introduction to Statistical Modeling of Extreme Values, volume 208. Springer, 2001.

[11] Yusuke Tashiro, Jiaming Song, Yang Song, and Stefano Ermon. CSDI: Conditional Score-based Diffusion Models for Probabilistic Time Series Imputation. Advances in Neural Information Processing Systems, 34:24804–24816, 2021.

[12] Marin Biloš, Kashif Rasul, Anderson Schneider, Yuriy Nevmyvaka, and Stephan Günnemann. Modeling Temporal Data as Continuous Functions with Stochastic Process Diffusion. In International Conference on Machine Learning, pages 2452–2470. PMLR, 2023.

[13] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.

[14] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.

[15] Anastasia Borovykh, Sander Bohte, and Cornelis W Oosterlee. Conditional time series forecasting with convolutional neural networks. arXiv preprint arXiv:1703.04691, 2017.

[16] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional lstm network: A machine learning approach for precipitation nowcasting. Advances in neural information processing systems, 28, 2015.

[17] Jianbo Yang, Minh Nhut Nguyen, Phyo Phyo San, Xiao Li Li, and Shonali Krishnaswamy. Deep convolutional neural networks on multichannel time series for human activity recognition. In Twenty-fourth international joint conference on artificial intelligence, 2015.

[18] Bendong Zhao, Huanzhang Lu, Shangfeng Chen, Junliang Liu, and Dongya Wu. Convolutional neural networks for time series classification. Journal of Systems Engineering and Electronics, 28(1):162–169, 2017.

[19] Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. arXiv:1803.01271 [cs], April 2018. arXiv: 1803.01271.

[20] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.

[21] Hao Li, Yanyan Shen, and Yanmin Zhu. Stock price prediction using attention-based multi-input lstm. In Asian conference on machine learning, pages 454–469. PMLR, 2018.

[22] George Zerveas, Srideepika Jayaraman, Dhaval Patel, Anuradha Bhamidipaty, and Carsten Eickhoff. A transformer-based framework for multivariate time series representation learning. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 2114–2124, 2021.

[23] Majid Moradi Aliabadi, Hajar Emami, Ming Dong, and Yinlun Huang. Attention-based recurrent neural network for multistep-ahead prediction of process performance.
Computers & Chemical Engineering, 140:106931, 2020. [24] Xuan Zhang, Xun Liang, Aakas Zhiyuli, Shusen Zhang, Rui Xu, and Bo Wu. At-lstm: An attention-based lstm model for financial time series prediction. In IOP Conference Series: Materials Science and Engineering, volume 569, page 052037. IOP Publishing, 2019. [25] Manuel R Vargas, Beatriz SLP De Lima, and Alexandre G Evsukoff. Deep learning for stock market prediction from financial news articles. In 2017 IEEE international conference on computational intelligence and virtual environments for measurement systems and applications (CIVEMSA), pages 60–65. IEEE, 2017. [26] Mohammed F Alsharekh, Shabana Habib, Deshinta Arrova Dewi, Waleed Albattah, Muham- mad Islam, and Saleh Albahli. Improving the efficiency of multistep short-term electricity load forecasting via r-cnn with ml-lstm. Sensors, 22(18):6913, 2022. 104 [27] Rajat Sen, Hsiang-Fu Yu, and Inderjit S Dhillon. Think globally, act locally: A deep neural network approach to high-dimensional time series forecasting. Advances in neural information processing systems, 32, 2019. [28] Pradeep Hewage, Ardhendu Behera, Marcello Trovati, Ella Pereira, Morteza Ghahremani, Francesco Palmieri, and Yonghuai Liu. Temporal convolutional neural (tcn) network for an effective weather forecasting using time-series data from the local weather station. Soft Computing, 24:16453–16482, 2020. [29] Daniel J Gauthier, Erik Bollt, Aaron Griffith, and Wendson AS Barbosa. Next generation reservoir computing. Nature communications, 12(1):5564, 2021. [30] Francis Wyffels and Benjamin Schrauwen. A comparative study of reservoir computing strategies for monthly time series prediction. Neurocomputing, 73(10-12):1958–1964, 2010. [31] Defu Cao, Yujing Wang, Juanyong Duan, Ce Zhang, Xia Zhu, Congrui Huang, Yunhai Tong, Bixiong Xu, Jing Bai, Jie Tong, et al. Spectral temporal graph neural network for multivariate time-series forecasting. Advances in neural information processing systems, 33:17766–17778, 2020. [32] Bin Wang, Tianrui Li, Zheng Yan, Guangquan Zhang, and Jie Lu. Deeppipe: A distribution- free uncertainty quantification approach for time series forecasting. Neurocomputing, 397:11–19, 2020. [33] Viatcheslav V Kharin and Francis W Zwiers. Estimating extremes in transient climate change simulations. Journal of Climate, 18(8):1156–1173, 2005. [34] Daizong Ding, Mi Zhang, Xudong Pan, Min Yang, and Xiangnan He. Modeling extreme events in time series prediction. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1114–1122, 2019. [35] Michael Polson and Vadim Sokolov. Deep learning for energy markets. Applied Stochastic Models in Business and Industry, 36(1):195–209, 2020. [36] Andrew McDonald, Pang-Ning Tan, and Lifeng Luo. Comet flows: Towards generative modeling of multivariate extremes and tail dependence. In Lud De Raedt, editor, Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, pages 3328–3334. International Joint Conferences on Artificial Intelligence Organization, 7 2022. Main Track. [37] Tyler Wilson, Andrew McDonald, Asadullah Hill Galib, Pang-Ning Tan, and Lifeng Luo. Beyond point prediction: Capturing zero-inflated & heavy-tailed spatiotemporal data with deep extreme mixture models. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 2020–2028, 2022. [38] Tyler Wilson, Pang-Ning Tan, and Lifeng Luo. 
DeepGPD: A Deep Learning Approach for Modeling Geospatio-Temporal Extreme Events. In Proceedings of the 36th AAAI Conference on Artificial Intelligence, 2022. 105 [39] Nenad Tomasev, Ioana Bica, Brian McWilliams, Lars Buesing, Razvan Pascanu, Charles Blundell, and Jovana Mitrovic. Pushing the limits of self-supervised resnets: Can we outperform supervised learning without labels on imagenet? arXiv preprint arXiv:2201.05119, 2022. [40] Priya Goyal, Dhruv Mahajan, Abhinav Gupta, and Ishan Misra. Scaling and benchmarking self-supervised visual representation learning. In Proceedings of the ieee/cvf International Conference on computer vision, pages 6391–6400, 2019. [41] Jeff Z HaoChen, Colin Wei, Adrien Gaidon, and Tengyu Ma. Provable guarantees for self-supervised deep learning with spectral contrastive loss. Advances in Neural Information Processing Systems, 34:5000–5011, 2021. [42] Xinrui Lyu, Matthias Hueser, Stephanie L Hyland, George Zerveas, and Gunnar Rätsch. Improving clinical predictions through unsupervised time series representation learning. arXiv preprint arXiv:1812.00490, 2018. [43] Sanjeev Arora, Hrishikesh Khandeparkar, Mikhail Khodak, Orestis Plevrakis, and Nikunj Saunshi. A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229, 2019. [44] Yao-Hung Hubert Tsai, Yue Wu, Ruslan Salakhutdinov, and Louis-Philippe Morency. Self- supervised learning from a multi-view perspective. arXiv preprint arXiv:2006.05576, 2020. [45] Zixin Wen and Yuanzhi Li. Toward understanding the feature learning process of self- supervised contrastive learning. In International Conference on Machine Learning, pages 11112–11122. PMLR, 2021. [46] Yuandong Tian, Xinlei Chen, and Surya Ganguli. Understanding self-supervised learning dynamics without contrastive pairs. In International Conference on Machine Learning, pages 10268–10278. PMLR, 2021. [47] Jason D Lee, Qi Lei, Nikunj Saunshi, and Jiacheng Zhuo. Predicting what you already know helps: Provable self-supervised learning. Advances in Neural Information Processing Systems, 34:309–323, 2021. [48] Alaa Sagheer and Mostafa Kotb. Unsupervised pre-training of a deep lstm-based stacked autoencoder for multivariate time series forecasting problems. Scientific reports, 9(1):1–16, 2019. [49] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018. [50] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020. [51] Zhihan Yue, Yujing Wang, Juanyong Duan, Tianmeng Yang, Congrui Huang, Yunhai Tong, and Bixiong Xu. Ts2vec: Towards universal representation of time series. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 8980–8987, 2022. 106 [52] Gerald Woo, Chenghao Liu, Doyen Sahoo, Akshat Kumar, and Steven Hoi. Cost: Contrastive learning of disentangled seasonal-trend representations for time series forecasting. arXiv preprint arXiv:2202.01575, 2022. [53] Xiang Zhang, Ziyuan Zhao, Theodoros Tsiligkaridis, and Marinka Zitnik. Self-supervised contrastive pre-training for time series via time-frequency consistency. arXiv preprint arXiv:2206.08496, 2022. [54] Emadeldeen Eldele, Mohamed Ragab, Zhenghua Chen, Min Wu, Chee Keong Kwoh, Xiaoli Li, and Cuntai Guan. 
Time-series representation learning via temporal and contextual contrasting. arXiv preprint arXiv:2106.14112, 2021. [55] Xinyu Yang, Zhenguo Zhang, and Rongyi Cui. Timeclr: A self-supervised contrastive learning framework for univariate time series representation. Knowledge-Based Systems, 245:108606, 2022. [56] Sven Festag, Joachim Denzler, and Cord Spreckelsen. Generative adversarial networks for biomedical time series forecasting and imputation a systematic review. Journal of Biomedical Informatics, page 104058, 2022. [57] Duc Hoang Tran, Van Linh Nguyen, Huy Nguyen, and Yeong Min Jang. Self-supervised learning for time-series anomaly detection in industrial internet of things. Electronics, 11(14):2146, 2022. [58] Mingda Zhang. Time series: Autoregressive models ar, ma, arma, arima. University of Pittsburgh, 2018. [59] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014. [60] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013. [61] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International conference on machine learning, pages 1530–1538. PMLR, 2015. [62] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020. [63] Olof Mogren. C-rnn-gan: Continuous recurrent neural networks with adversarial training. arXiv preprint arXiv:1611.09904, 2016. [64] Vincent Fortuin, Dmitry Baranchuk, Gunnar Rätsch, and Stephan Mandt. GP-VAE: Deep Probabilistic Time Series Imputation. In International conference on artificial intelligence and statistics, pages 1651–1661. PMLR, 2020. 107 [65] Andrew McDonald, Pang-Ning Tan, and Lifeng Luo. COMET Flows: Towards Generative Modeling of Multivariate Extremes and Tail Dependence. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, pages 3328–3334, 2022. [66] Magnus Wiese, Robert Knobloch, and Ralf Korn. Copula & marginal flows: Disentangling the marginal from its joint. arXiv preprint arXiv:1907.03361, 2019. [67] Priyank Jaini, Ivan Kobyzev, Yaoliang Yu, and Marcus Brubaker. Tails of lipschitz triangular flows. In International Conference on Machine Learning, pages 4673–4681. PMLR, 2020. [68] V Chandola, A Banerjee, and V Kumar. Anomaly detection: A survey. acm computing surveys. vol, 41:15, 2009. [69] Tung Kieu, Bin Yang, Chenjuan Guo, and Christian S Jensen. Outlier detection for time series with recurrent autoencoder ensembles. In IJCAI, pages 2725–2732, 2019. [70] Shuyu Lin, Ronald Clark, Robert Birke, Sandro Schönborn, Niki Trigoni, and Stephen Roberts. Anomaly detection for time series using vae-lstm hybrid model. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4322–4326. Ieee, 2020. [71] Pankaj Malhotra, Lovekesh Vig, Gautam Shroff, Puneet Agarwal, et al. Long short term memory networks for anomaly detection in time series. In ESANN, volume 2015, page 89, 2015. [72] Bo Zong, Qi Song, Martin Renqiang Min, Wei Cheng, Cristian Lumezanu, Daeki Cho, and Haifeng Chen. Deep autoencoding gaussian mixture model for unsupervised anomaly detection. In International conference on learning representations, 2018. 
[73] Yang Jiao, Kai Yang, Dongjing Song, and Dacheng Tao. Timeautoad: Autonomous anomaly detection with self-supervised contrastive loss for multivariate time series. IEEE Transactions on Network Science and Engineering, 9(3):1604–1619, 2022. [74] US GAO. Natural disasters: Economic effects of hurricanes katrina, sandy, harvey, and irma. https://www.gao.gov/products/gao-20-633r, 2020. Accessed: 2021-04-09. [75] Ding Wang and Pang-Ning Tan. Johan: A joint online hurricane trajectory and intensity forecasting framework. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 1677–1685, 2021. [76] Christopher M Bishop. Pattern recognition. Springer, New York, 2006. [77] Richard L. Smith. Maximum likelihood estimation in a class of nonregular cases. Biometrika, 72(1):67–90, 1985. [78] Christopher W Landsea and James L Franklin. Atlantic hurricane database uncertainty and presentation of a new database format. Monthly Weather Review, 141(10):3576–3592, 2013. 108 [79] Elizabeth L Ratnam, Steven R Weller, Christopher M Kellett, and Alan T Murray. Residential load and rooftop pv generation: an australian distribution network dataset. International Journal of Sustainable Energy, 36(8):787–806, 2017. [80] J Muthukumar. Weather dataset. https://www.kaggle.com/muthuj7/weather-dataset, Dec 2017. Accessed: 2021-09-30. [81] Asadullah Hill Galib, Andrew McDonald, Tyler Wilson, Lifeng Luo, and Pang-Ning Tan. DeepExtrema: A Deep Learning Approach for Forecasting Block Maxima in Time Series Data. arXiv preprint arXiv:2205.02441, 2022. [82] Julie A Winkler, Galina S Guentchev, Malgorzata Liszewska, and Pang-Ning Tan. Climate scenario development and applications for local/regional climate change impact assessments: An overview for the non-climate scientist: Part ii: Considerations when using climate change scenarios. Geography Compass, 5(6):301–328, 2011. [83] Zhengping Che, Sanjay Purushotham, Kyunghyun Cho, David Sontag, and Yan Liu. Recurrent neural networks for multivariate time series with missing values. Scientific reports, 8(1):1–12, 2018. [84] Linus Ericsson, Henry Gouk, Chen Change Loy, and Timothy M Hospedales. Self-supervised representation learning: Introduction, advances, and challenges. IEEE Signal Processing Magazine, 39(3):42–62, 2022. [85] Ashish Jaiswal, Ashwin Ramesh Babu, Mohammad Zaki Zadeh, Debapriya Banerjee, and Fillia Makedon. A survey on contrastive self-supervised learning. Technologies, 9(1):2, 2020. [86] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pages 1096–1103, 2008. [87] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020. [88] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(02):107–116, 1998. [89] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. [90] Hassan Ismail Fawaz, Benjamin Lucas, Germain Forestier, Charlotte Pelletier, Daniel F Schmidt, Jonathan Weber, Geoffrey I Webb, Lhassane Idoumghar, Pierre-Alain Muller, and François Petitjean. Inceptiontime: Finding alexnet for time series classification. 
Data Mining and Knowledge Discovery, 34(6):1936–1962, 2020. [91] Gordon J Ross. Modelling financial volatility in the presence of abrupt changes. Physica A: Statistical Mechanics and its Applications, 392(2):350–360, 2013. 109 [92] Dilip Kumar. Sudden changes in extreme value volatility estimator: Modeling and forecasting with economic significance analysis. Economic Modelling, 49:354–371, 2015. [93] Qingsong Wen, Liang Sun, Fan Yang, Xiaomin Song, Jingkun Gao, Xue Wang, and Huan Xu. Time series data augmentation for deep learning: A survey. arXiv preprint arXiv:2002.12478, 2020. [94] Brian Kenji Iwana and Seiichi Uchida. An empirical survey of data augmentation for time series classification with neural networks. Plos one, 16(7):e0254841, 2021. [95] Frank J Massey Jr. The kolmogorov-smirnov test for goodness of fit. Journal of the American statistical Association, 46(253):68–78, 1951. [96] HARALD CRAMAR. On the composition of elementary errors. Skand. Aktuarietids, 11:13–74, 1928. [97] CD Broad. Wahrscheinlichkeit, statistik, und wahrheit., 1937. [98] Sana Tonekaboni, Danny Eytan, and Anna Goldenberg. Unsupervised representation learning for time series with temporal neighborhood coding. arXiv preprint arXiv:2106.00750, 2021. [99] Alexander Nikitin, Letizia Iannucci, and Samuel Kaski. Tsgm: A flexible framework for generative modeling of synthetic time series. arXiv preprint arXiv:2305.11567, 2023. [100] Jonathan Crabbé, Nicolas Huynh, Jan Stanczuk, and Mihaela van der Schaar. Time series diffusion in the frequency domain. arXiv preprint arXiv:2402.05933, 2024. [101] Luis Candanedo. Appliances Energy Prediction. UCI Machine Learning Repository, 2017. DOI: https://doi.org/10.24432/C5VC8G. [102] Australian Bureau of Meteorology, 2016. [103] Ary L Goldberger, Luis AN Amaral, Leon Glass, Jeffrey M Hausdorff, Plamen Ch Ivanov, Roger G Mark, Joseph E Mietus, George B Moody, Chung-Kang Peng, and H Eugene Stanley. Physiobank, physiotoolkit, and physionet: components of a new research resource for complex physiologic signals. circulation, 101(23):e215–e220, 2000. [104] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In International conference on learning representations, 2016. [105] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density Estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016. [106] Ahmed Alaa, Alex James Chan, and Mihaela van der Schaar. Generative Time-series Modeling with Fourier Flows. In International Conference on Learning Representations, 2021. 110