PERIODIC ARMA MODEL: FORECASTING, PARSIMONY, ASYMPTOTIC NORMALITY AND AIC By Kai Zhang A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Statistics - Doctor of Philosophy 2014 ABSTRACT PERIODIC ARMA MODEL: FORECASTING, PARSIMONY, ASYMPTOTIC NORMALITY AND AIC By Kai Zhang Periodic autoregressive moving average (PARMA) models are indicated for time series whose mean, variance, and covariance function vary with the season. In this thesis, I develop and implement forecasting procedures for PARMA models. The required computations are documented in detail. An application to monthly river flow forecasting is provided. For my father. iii ACKNOWLEDGMENTS I was lucky to meet with my thesis advisor Dr. Mark Meerschaert, who led me into the fascinating world of time series and academic research. I want to say a million thanks, however it will not be enough for my gratitude. I could not make it without his encouragement and belief in me. I will also benefit from his life attitude for the rest of my life. I am also grateful for the tremendous help from Dr. Paul Anderson, who introduced me into the periodic ARMA modeling. Thanks him for numerous of transits to East Lansing for discussing my thesis. Special thanks go to my thesis committee member and department Chairperson Dr. Hira Koul. I am indebted for his help in every little detail of my doctoral years, including the learning in STT 954, failing and passing two prelim exams, student travel fund applications, summer research fellowships, job-hunting and thesis updates, etc. Additionally, Dr. Chae Young Lim was always supportive in my research and job-hunting process. I am deeply grateful of her time spent reading my thesis and valuable feedbacks. Lastly, I want to express my gratitude for Ms. Sue Watson. iv TABLE OF CONTENTS LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii Chapter 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1 Computation of PARMA Autocovariances . . . . . . . . . . . . . . . . . . . 1 8 Chapter 2 Forecasting and forecast error . . . . . . . . . . . . . . . . . . . . 2.1 Forecasting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2 Forecast errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 13 25 Chapter 3 Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.1 A simulation study for the convergence of the coefficients in innovations algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 Chapter 4 59 Application to a Natural River Flow . . . . . . . . . . . . . . . . Chapter 5 Maximum likelihood estimation and reduced model . 5.1 Maximum likelihood Function for PARMAS (p, q) Model . . . . . 5.2 Reduced PARMAS (1, 1) model. . . . . . . . . . . . . . . . . . . 5.2.1 Reduced model by asymptotic distribution of ψˆi ( ) . . . . 5.2.2 Reduced model by asymptotic distribution of MLE . . . . Chapter 6 . . . . . 67 67 83 83 85 Asymptotic normality of PARMA model . . . . . . . . . . . . . 98 Chapter 7 Periodic AIC for PARMAS (p, q) model . . . . . 7.1 Kullback-Liebler (K-L) Information . . . . . . . . . . . . . 7.2 Derivation of periodic AIC for PARMAS (p, q) process . . . 7.2.1 Application to model selection for the Fraser River 7.3 Future Research . . . . . . . . . . . . . . . . . . . . . . . . APPENDIX . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 . . . . . . . . . . 126 126 128 145 147 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 BIBLIOGRAPHY . . . . . . . . . . . . . v . . . . . . . . . . . . . . . . . 186 LIST OF TABLES Table 1.1 Parameters in PARMA12 (1, 1) for Example 1.1.1. . . . . . . . . . . 11 Table 1.2 Part of autocovariances in PARMA12 (1, 1) for Example 1.1.1. . . . . 12 Table 3.1 Moving average parameter estimates ψˆi ( ) at season i and lag = 1, 2, . . . , 6, and p-values, after k = 20 iterations of the innovations algorithm applied to average monthly flow series for the Fraser River near Hope BC. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 Table 3.2 Model parameters and estimates for simulated PARMA12 (1, 1) data. 56 Table 4.1 Parameter estimates for the PARMA model (1.0.1) of average monthly flow series for the Fraser River near Hope BC from October 1912 to September 1982. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 Sample mean, standard deviation and autocorrelation at lag 1 and 2 for an average monthly flow series for the Fraser River near Hope BC, from October 1912 to September 1982. . . . . . . . . . . . . . . 63 Estimated parameters for PARMA12 (1, 1) model, from innovations algorithm, using the logarithm of the original data. The resulting negative likelihood value is -362.655. . . . . . . . . . . . . . . . . . . 77 Table 5.2 A compare of different algorithm in optim function. . . . . . . . . . 78 Table 5.3 MLE result by BFGS method, with σi as nuisance parameters. The ˆ = −463.8585. . . . . . . . . . resulting MLE value was −2 log L(β) 79 MLE by BFGS method, using the values of σ ∗ from Figure 5.1. The ˆ = −506.6053. We take those resulting MLE value was −2 log L(β) parameters as our best results in optimization. . . . . . . . . . . . 83 Parameter estimates for the reduced PARMA model (1.0.1) of average monthly flow series for the Fraser River near Hope BC from October 1912 to September 1982. The resulting negative likelihood value is -351.3939. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 Table 4.2 Table 5.1 Table 5.4 Table 5.5 vi Table 5.6 MLE and its standard error. . . . . . . . . . . . . . . . . . . . . . . 88 Table 5.7 Model parameters by MLE optimization and their p-values. . . . . . 95 Table 7.1 Comparison of AIC values for different models . . . . . . . . . . . . 146 Table 7.2 The model parameters in PAR12 (p) automatic model fitting by pear package in R, where the number of estimable parameters is 19, assuming σ ˆ i as nuisance parameters in the AIC computation. . . . . . 146 Table 7.3 A compare of AIC values for different models, after taking the log of the data. Note that both full and reduced PARMA12 (1, 1) models perform better than PAR models. . . . . . . . . . . . . . . . . . . . 147 LIST OF FIGURES Convergence of ψˆi ( ) as iterations k increase,where 0 ≤ i < S −1, 1 ≤ ≤ 2, k = 1, 2, . . . , 30. For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation. . . . . . . . . . . . . . . . . . . . . . . 50 Convergence of vk, i−k as iterations k increase,where 0 ≤ i < S − 1, 1 ≤ ≤ 2, k = 1, 2, . . . , 30. . . . . . . . . . . . . . . . . . . . . . . 
51 Plots of R1k and R2k against iterations k, which show the convergence of ψˆi ( ) and vˆk, i−k as iterations k increase,where 0 ≤ i ≤ 11, 1 ≤ ≤ 2, k = 1, 2, . . . , 50. . . . . . . . . . . . . . . . . . . . . . . 57 Convergence of φˆi and θˆi as iterations k increase,where 0 ≤ i ≤ 11, k = 1, 2, . . . , 50. . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 Average monthly flows in cubic meters per second (cms) for the Fraser River at Hope, BC indicate a seasonal pattern. . . . . . . . . . . . . 60 Statistics for the Fraser river time series: (a) seasonal mean; (b) standard deviation; (c,d) autocorrelations at lags 1 and 2. Dotted lines are 95% confidence intervals. Season = 0 corresponds to October and Season = 11 corresponds to September. . . . . . . . . . . . . . 61 √ ACF and PACF for model residuals, showing the bounds ±1.96/ N , indicate no serial dependence. With no apparent pattern, these plots indicate that the PARMA12 (1, 1) model is adequate. . . . . . . . . . 62 24-month forecast (solid line with dots) based on 70 years of Fraser river data, with 95% prediction bounds (dotted lines). For comparison, the actual data (solid line) is also shown. This data was not used in the forecast. . . . . . . . . . . . . . . . . . . . . . . . . . . 64 Figure 4.5 Width of 95% prediction bounds for the Fraser river. . . . . . . . . 66 Figure 5.1 −2 log{L(β, σ 2 )} as a function of each σi , i = 0, . . . , 11, with the remaining parameters fixed. . . . . . . . . . . . . . . . . . . . . . . 80 Figure 3.1 Figure 3.2 Figure 3.3 Figure 3.4 Figure 4.1 Figure 4.2 Figure 4.3 Figure 4.4 viii Figure 5.2 Figure 5.3 −2 log{L(β, σ 2 )} as a function of each θi , i = 0, . . . , 11, with the remaining parameters fixed. . . . . . . . . . . . . . . . . . . . . . . . 81 −2 log{L(β, σ 2 )} as a function of each φi , i = 0, . . . , 11, with the remaining parameters fixed. . . . . . . . . . . . . . . . . . . . . . . 82 Figure 5.4 A 24-month forecast using the reduced PARMA12 (1, 1) model. Solid line is the original data; Solid line with dots is the reduced PARMA(1,1) forecast; The two dotted lines are 95% Gaussian prediction bounds. The 24-month forecast is from October 1982 to September 1984. . . 86 Figure 5.5 A comparison of 24-month forecasts between the full PARMA12 (1, 1) model and reduced PARMA12 (1, 1) model. In the first two graphs, dotted line is forecast, and solid line is real data. . . . . . . . . . . . 87 A 24-month forecast using the reduced PARMA12 (1, 1) model in Table 5.7. Solid line is the original data; Solid line with dots is the reduced PARMA(1,1) forecast; The two dotted lines are 95% Gaussian prediction bounds. The 24-month forecast is from October 1982 to September 1984. . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 A comparison of 24-month forecasts between the full PARMA12 (1, 1) model and reduced PARMA12 (1, 1) model, by asymptotic distribution of MLE. In the first two graphs, dotted line is forecast, and solid line is real data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 Figure 5.6 Figure 5.7 Chapter 1 Introduction Mathematical modeling and simulation of river flow time series are critical issues in hydrology and water resources. Most river flow time series are periodically stationary, that is, their mean and covariance functions are periodic with respect to time. To account for the periodic correlation structure, a periodic autoregressive moving average (PARMA) model can be useful. 
PARMA models are also appropriate for a wide variety of time series applications in geophysics and climatology. In a PARMA model, the parameters in a classical ARMA model are allowed to vary with the season. Since PARMA models explicitly describe seasonal fluctuations in mean, standard deviation and autocorrelation, they have been used to generate more faithful models and simulations of natural river flows. Historically, Gladyshev [18] first defined the concept of periodically correlated stochastic process; In [20], some early work of Jones and Brelsford studied the problem of predicting time series with periodic structure, including the estimation of necessary parameters; In [32] and [47], Pagano and Troutman studied the elementary properties of univariate processes, 1 and connections with stationary multivariate processes; For stationary time series models, Box and Jenkins [10] presented a systematic approach for modeling time series based on three stages: (1) Model identification; (2) parameter estimation; and (3) diagnostic checks or tests of goodness of fit. Since then, applications and extensions of this modeling approach to hydrology have been widespread. Adams and Goodwin [1] described an on-line parameter estimation technique, based on methods from automatic control, which provides consistent estimates of PARMA model parameters. Anderson and Vecchia [3] obtained the asymptotic distribution for the sample autocovariance and sample autocorrelation functions of PARMA process, and they also studied the asymptotic properties of the discrete Fourier transform of the estimated periodic autocovariance and autocorrelation functions. Anderson and Meerschaert [5] established the basic asymptotic theory for periodic moving averages of i.i.d. random variables with regularly varying tails. They showed that when the underlying random variables have finite variance but infinite fourth moment, the sample autocorrelations are asymptotically stable. Lund and Basawa [22, 23] explored recursive prediction and likelihood evaluation techniques for PARMA models. Time series analysis involves four general steps: model identification, parameter estimation, diagnostic checking, and forecasting. Model identification is the most difficult step for PARMA modeling. Noakes et al. [31] suggested examining the plots of the periodic partial autocorrelation function as the best approach to identify PARMA models. This method is highly recommended when the parameter space is not constrained, however it requires a high level of user experience. Another method is to use an automatic selection criterion, such as the Akaike Information Criterion (AIC) [2], or the Bayesian information criterion 2 (BIC) [37] when all possible candidates of models are examined. However, this procedure requires investigating quite a large number of models, especially when the number of parameter estimates is fairly large, for example, monthly data with 12 seasons. Ursu and Turkman [52] applied the genetic algorithm as a method of identifying the PAR model, which greatly improved the model selection efficiency. Salehi H. has also made tremendous contributions on periodic stochastic process, see [26, 27, 28, 29]. Additionally, model identification for general PARMA times series was discussed in Tesfaye, Meerschaert and Anderson [42]. Anderson, Meerschaert and Vecchia [6] developed an innovations algorithm for PARMA parameter estimation. 
Tesfaye, Meerschaert and Anderson [42] demonstrated model diagnostics for a PARMA model of monthly river flows for the Fraser River in British Columbia, Canada. In this thesis, I develop a practical method for forecasting PARMA models, and I demonstrate the method by forecasting monthly river flows for the same time series of monthly flows. ˜ } ˜ A stochastic process {X t t∈Z is periodically stationary if its mean EXt and covariance ˜ ,X ˜ Cov(X t t+h ) for h ∈ Z are periodic functions of time t with the same period S, i.e., for some integer S ≥ 1, for i = 0, 1, ..., S − 1, and for all integers k and h, I have ˜ = EX ˜ EX i i+kS ˜ ,X ˜ ˜ ˜ and Cov X i i+h = Cov Xi+kS , Xi+kS+h . ˜ } ia called a PARMA (p, q) process if the meanA periodically stationary process {X t S ˜ − µ is of the form centered process Xt = X t t p Xt − q φt (k)Xt−k = εt + θt (j)εt−j j=1 k=1 3 (1.0.1) where {εt } is a sequence of random variables with mean zero and standard deviation σt > 0 such that {δt = σt−1 εt } is independent and identically distributed. ({εt } is called periodic i.i.d. Gaussian noise if Xt is a Gaussian process.) Here the autoregressive parameters φt (j), the moving average parameters θt (j), and the residual standard deviations σt are all assumed to be periodic functions of t with the same period S ≥ 1. Throughout this paper I will also assume: (i) The model (1.0.1) admits a causal representation ∞ ψt (j)εt−j Xt = (1.0.2) j=0 where ψt (0) = 1 and ∞ |ψ (j)| < ∞ for all t. Note that ψ (j) = ψ t t+kS (j) for all j=0 t j. (ii) The model (1.0.1) also satisfies an invertibility condition ∞ πt (j)Xt−j εt = (1.0.3) j=0 where πt (0) = 1 and ∞ |π (j)| < ∞ for all t, and define X t−j = 0 when t − j < 0. j=0 t Again, πt (j) = πt+kS (j) for all j. The notation used in this paper is consistent with: Anderson and Vecchia [3]; Anderson and Meerschaert [4, 5]; Anderson, Meerschaert, and Vecchia [6]; Anderson and Meerschaert [7]; Tesfaye, Meerschaert, and Anderson [42]; and Tesfaye, Anderson, and Meerschaert [43]. This notation is also an extension of the notation in Brockwell and Davis [11]. ˜ ,X ˜ ˜ Suppose I have N years of data, consisting of n = N × S data points, X 0 1 , . . . , Xn−1 , 4 where S is the number of seasons. For example, for monthly data I have S = 12, and our convention is to let i = 0 represent the first month, i = 1 represent the second, . . . , and i = S − 1 = 11 represent the last. The sample mean for season i is µ ˆi = N −1 N −1 ˜ X kS+i . (1.0.4) k=0 The sample autocovariance for season i at lag γˆi ( ) = N −1 is N −1−hi ˜ X ˆi jS+i − µ ˜ X ˆi+ jS+i+ − µ , (1.0.5) j=0 where ≥ 0, hi = (i + )/S and = · is the greatest integer function. The sample autocorrelation for season i at lag ρˆi ( ) = is γˆi ( ) , γˆi (0)ˆ γi+ (0) (1.0.6) which is also the sample cross-correlation between two different seasons. In (1.0.5) the divisor N is used rather than N − hi , since this ensures that the autocoˆ (i) = γˆ (j − ) n variance matrix at season i, Γ i j, =1 is non-negative definite, where N      (i)  ˆ = γˆ (j − ) n Γ = i j, =1  N     γˆi (0) γˆi (1) ... γˆi (1) .. . γˆi (0) ... .. . γˆi (N − 1) γˆi (N − 2) . . . 5  γˆi (N − 1)   γˆi (N − 2)  .  ..  .   γˆi (0) ˆ (i) = 1 ΓΓ , where Γ is the N × 2N matrix To see this we may write Γ N N  0   0  Γ=  .. .   0  ... ... 0 Y1 Y2 . . . ... YN    . . . 0 Y1 Y2 . . . . . . YN 0   ,  . . . . . . .. .. . . . . . . . . . . . . . .    Y1 Y2 . . . YN 0 . . . . . . 0 ˜ ˆ (i) a = N −1 (a Γ)(a Γ) ≥ j = 1, . . . 
, N , and Yj = X µi . Then ∀N ×1 vector a, we have a Γ jS+i −ˆ N ˆ (i) = 0. By definition of non-negative definiteness, since a is arbitrary, then the matrix Γ N γˆi (j − ) n j, =1 is non-negative definite. Then the sample covariance matrix  γˆi (1)  γˆi (0)    γˆ (1) γˆi (0) i ˆ (i) =  Γ  k  ..  .   γˆi (k − 1) γˆi (k − 2) ... ... .. . ...  γˆi (k − 1)   γˆi (k − 2)    ..  .   γˆi (0) converges in the operator norm A = sup{ Ax : x = 1} to the covariance matrix  γi (1)  γi (0)    γi (0) (i)  γi (1) Γ = k  ..  .   γi (k − 1) γi (k − 2) ... ... .. . ...  γi (k − 1)   γi (k − 2)    ..  .   γi (0) in probability as N → ∞, if k → ∞ in such a way that k 2 /N → 0. This result also assumes 6 a spectral bound, see [6, Theorem 3.1], and that the underlying noise sequence has finite fourth moment. Note that, if every σi > 0, then Proposition 4.1 of Lund and Basawa [22] (i) shows that the covariance matrix Γn = γi (j − ) n j, =1 is invertible for every n ≥ 1 and each i = 0, 1, . . . , S − 1. Since the set of invertible matrices is open, the convergence in probability from [6, Theorem 3.1] implies that the sample covariance matrix is invertible with probability approaching 1 as k → ∞. Given a PARMAS (p, q) model (1.0.1) for a periodic time series, a recursive forecasting algorithm is developed in this thesis based on minimizing mean squared error. I detail the computation of h-step ahead forecasts for a PARMA model, based on the innovations algorithm, and an idea of Ansley (see Ansley [9]; Lund and Basawa [23]). I also have developed R codes to implement these forecasts, and compute the asymptotic variance of the forecast errors. All R codes are listed in the appendix. This thesis is laid out as follows. Chapter 2 develops the algorithms for computing h-step ahead forecasts for any h ≥ 1, and computes the associated forecast error variances. Chapter 3 specifies the details of computation in this paper, mainly by R software. Chapter 4 illustrates the methods of this thesis by forecasting average monthly flows for the Fraser River. Chapter 5 develops a reduced PARMAS (p, q) model to achieve parsimony. Chapter 6 derives the asymptotic theory of PARMA models, and Chapter 7 discusses the periodic AIC for automatic model selection. 7 1.1 Computation of PARMA Autocovariances Given a PARMAS (p, q) time series (1.0.1), Define the covariance function γj ( − j) = E(Xj X ) = Cov(Xj , X ), (1.1.1) and note that γj ( − j) = E(Xj X ) = E(X Xj ) = γ (j − ) for all j, ∈ Z. Then the covariance function γj ( −j) = E(Xj X ) can be explicitly computed by two methods. The first approach is to use the causal representation in (1.0.2). Given the model parameters, we may write  E Xj X ∞ = E  ∞ ψj (k)εj−k ψ (r)ε −r  r=0 k=0 ∞ ∞ ψj (k)ψ (r)E εj−k ε −r . = k=0 r=0 2 , when j − k = − r, and E ε Notice that E εj−k ε −r = σj−k j−k ε −r = 0, otherwise. Letting r = − j + k, therefore ∞ γj ( − j) = E Xj X = 2 . ψj (k)ψ ( − j + k)σj−k (1.1.2) k=0 However, (1.1.2) is computationally impractical since it requires determination and infinite summation of ψj (k) and ψ ( − j + k). Let γt (h) = Cov Xt , Xt+h be the autocovariance of Xt at season i = t − S t/S and lag h ≥ 0. Now I will consider the second method, by mimicking Yule-Walker methods for 8 stationary ARMA series. Multiplying both sides of (1.0.1) by Xt−h , where h > max (p, q), I have  Xt−h Xt −   p  q φt (k)Xt−k  = Xt−h εt + θt (j)εt−j  . 
j=1 k=1 Take expectations on both sides, and use causal representation to compute the right hand side:    p E Xt−h Xt −   φt (k)Xt−k  = E Xt−h εt + θt (j)εt−j  j=1 k=1  p  ∞ φt (k)γt−h (h − k) = E  γt−h (h) −  q θt (j)εt−j  ψt−h (k)εt−h−k  εt + k=0 k=1  q j=1 = 0, since εt−h−k ⊥ εt−j , ∀j = 0, 1, . . . , q, k = 0, 1, . . ., where x ⊥ y if and only if x, y = 0, and x, y is the inner product of x and y in Hilbert space H. Recall from Chapter 2 of Brockwell and Davis [11] that x, y is called the inner product of x and y, such that (a) x, y = y, x , the bar denoting complex conjugation, (b) x + y, z = x, z + y, z for all x, y, z ∈ H, (c) αx, y = α x, y for all x, y ∈ H and α ∈ R, (d) x, x ≥ 0 for all x ∈ H, (e) x, x = 0 if and only if x = 0. In this way I obtain p φt (k)γt−h (h − k). γt−h (h) = k=1 9 (1.1.3) Equation (1.1.3) expresses γt−h (h) in terms of autocovariances of the previous p lags when h > max (p, q). The advantage of this method is that the computational complexity of (1.1.3) does not increase with increasing h. Therefore, once γt−h (h) is identified for all lags 0 ≤ h ≤ max (p, q) and for all seasons i = 0, 1, . . . , S − 1, then the PARMA autocovariances at higher lags can be efficiently computed. Next I will focus on computation of γt−h (h) for 0 ≤ h ≤ max (p, q) and all seasons i = 0, 1, . . . , S − 1. With a similar technique, multiply both sides of (1.0.1) by Xt−h and take expectations:    p E Xt−h Xt −    q φt (k)Xt−k  = E Xt−h εt + θt (j)εt−j  . j=1 k=1 Notice that for 0 ≤ h ≤ max (p, q), then p φt (k)γmin(t−k,t−h) (|h − k|) γt−h (h) − k=1   ∞ q  θt (j)εt−j  ψt−h (k)εt−h−k εt + = E j=1 k=0 q 2 ψ = θt (j)σt−j t−h (j − h), j=h 2 , when k = j − h, and E(ε where θt (0) = 1 and E(εt−h−k εt−j ) = σt−j t−h−k εt−j ) = 0, otherwise. In this way I get the general form of autocovariance function for 0 ≤ h ≤ max (p, q), p q 2 ψ θt (j)σt−j t−h (j − h). φt (k)γmin(t−k,t−h) (|h − k|) + γt−h (h) = k=1 (1.1.4) j=h In a straightforward way, (1.1.4) is actually an S × [max(p, q) + 1] dimensional linear system. 10 Table 1.1: Parameters in PARMA12 (1, 1) for Example 1.1.1. season i φi θi σi 0 0.198 0.687 11875.479 1 0.568 0.056 11598.254 2 0.560 -0.052 7311.452 3 0.565 -0.050 5940.845 0.321 0.470 4160.214 4 5 0.956 -0.389 4610.209 1.254 -0.178 15232.867 6 7 0.636 -0.114 31114.514 8 -1.942 2.393 32824.370 9 -0.092 0.710 29712.190 0.662 -0.213 15511.187 10 0.355 0.322 12077.991 11 The matrix associated with this linear system is invertible as long as the PARMA model is causal. See the appendix for a simple R code written to solve the linear system, for γt−h (h), for all 0 ≤ h ≤ max (p, q) and i = 0, 1, . . . , S − 1. Note that (1.1.4) only requires ψt−h (j − h) for j ≤ q, which great reduces the computation, compared with the first method in (1.1.2). Example 1.1.1. Consider the PARMA12 (1, 1) model in Table 1.1.1, used in Tesfaye et al. [42] to fit the 72-year monthly observations of Fraser river. By (1.1.4), when 0 ≤ h ≤ max(1, 1), I have h = 0, 1, and   2 2  γt (0) − φt (1)γ t−1 (1) = θt (0)ψt (0)σt + θt (1)ψt (1)σt−1   γ 2 t−1 (1) − φt (1)γt−1 (0) = θt (1)σt−1 (1.1.5) Note that θt (0) = 1, ψt (0) = 1, ψt (1) = φt (1) + θt (1), and (1.1.5) is a 12 × 2 dimensional linear system containing 24 unknown paramters γt (0) and γt (1), for t = 0, 1, . . . , 11. 
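To make the computation concrete, the following is a minimal R sketch, separate from the appendix code, that assembles and solves the 24 x 24 linear system (1.1.5) for γ_t(0) and γ_t(1) using the parameter values of Table 1.1; the object names phi, theta, sigma, gamma0 and gamma1 are illustrative choices rather than the appendix notation.

```r
# Sketch: solve the linear system (1.1.5) for gamma_t(0), gamma_t(1), t = 0,...,11,
# for the PARMA_12(1,1) model with the parameters of Table 1.1.
S     <- 12
phi   <- c( 0.198, 0.568, 0.560, 0.565, 0.321,  0.956,
            1.254, 0.636,-1.942,-0.092, 0.662,  0.355)
theta <- c( 0.687, 0.056,-0.052,-0.050, 0.470, -0.389,
           -0.178,-0.114, 2.393, 0.710,-0.213,  0.322)
sigma <- c(11875.479, 11598.254, 7311.452, 5940.845, 4160.214, 4610.209,
           15232.867, 31114.514, 32824.370, 29712.190, 15511.187, 12077.991)

prev <- function(t) ((t - 2) %% S) + 1   # 1-based index of season t-1, modulo S
psi1 <- phi + theta                      # psi_t(1) = phi_t(1) + theta_t(1)

# Unknowns ordered as x = (gamma_0(0),...,gamma_11(0), gamma_0(1),...,gamma_11(1))
A <- matrix(0, 2 * S, 2 * S)
b <- numeric(2 * S)
for (t in 1:S) {
  tm1 <- prev(t)
  # gamma_t(0) - phi_t(1) gamma_{t-1}(1) = sigma_t^2 + theta_t(1) psi_t(1) sigma_{t-1}^2
  A[t, t]       <- 1
  A[t, S + tm1] <- -phi[t]
  b[t]          <- sigma[t]^2 + theta[t] * psi1[t] * sigma[tm1]^2
  # gamma_{t-1}(1) - phi_t(1) gamma_{t-1}(0) = theta_t(1) sigma_{t-1}^2
  A[S + t, S + tm1] <- 1
  A[S + t, tm1]     <- -phi[t]
  b[S + t]          <- theta[t] * sigma[tm1]^2
}
x      <- solve(A, b)
gamma0 <- x[1:S]             # gamma_t(0) for t = 0,...,11
gamma1 <- x[(S + 1):(2 * S)] # gamma_t(1) for t = 0,...,11
```

Under these assumptions, gamma0 and gamma1 should agree, up to rounding, with the first two columns of Table 1.2.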
Apply the computation in R, I can get γt (0) and γt (1) for all the seasons, which are shown in the 11 Table 1.2: Part season i 0 1 2 3 4 5 6 7 8 9 10 11 of autocovariances in PARMA12 (1, 1) for Example 1.1.1. γt (0) γt (1) γt (2) γt (3) 261385575 156364519 87564130 49473734 228262590 120832037 68270101 21914702 117569804 63754073 20465057 19564595 69938164 39038161 37320482 46799885 42959747 34336947 43058531 27385226 50262780 59246310 37680653 -73175828 302264368 165787551 -321959424 29620267 1059745614 258668383 -23797491 -15753939 1619934424 615947912 407757518 144753919 1298905828 671836226 238501860 47223368 600922799 290799803 57578361 32704509 301560482 159927070 90838576 50869602 first two columns of Table 1.1.1. For h > 1, γt−h (h) = φt γt−h (h − 1), (1.1.6) therefore γt (h) at higher lags h > 1 can be computed. Partial results for higher lags are shown in Table 1.1.1. 12 Chapter 2 Forecasting and forecast error 2.1 Forecasting The PARMA prediction equations are based on orthogonal projection, to minimize the mean squared prediction error among all linear predictors. If the PARMA process has Gaussian innovations, then this will also minimize the mean squared prediction error among all predictors. To simplify notation, I denote the first season to be forecast as season 0, and define the season of the oldest data point as season S − 1. If the total number of available data is not a multiple of S, I discard a few(< S) of the oldest observations, to obtain the data set ˜ ,X ˜ ˜ X 0 1 , . . . , Xn−1 , where n = N × S. ˜ − µ is the mean-centered process in (1.0.1). Fix a probability space Recall that Xt = X t t on which the PARMAS (p, q) model (1.0.1) is defined, and let ˜ n = sp{1, ˜ ,...,X ˜ H ¯ X ¯ X0 , . . . , Xn−1 } 0 n−1 } = sp{1, 13 denote the set of all linear combinations of these random variables in the Hilbert space of random variables on that probability space, with the inner product X, Y = E(XY ). Note that ˜ =P P˜ X ˜ n (Xn + µn ) Hn n H = µn + P ˜ Xn Hn (2.1.1) = µn + Psp{1,X Xn ¯ 0 ,...,Xn−1 } = µn + Psp{X Xn , ¯ 0 ,...,Xn−1 } ˜ can be accomplished by forecasting the meanso that forecasting for the original data X t centered process Xt , and then adding the seasonal mean. In order to develop a more efficient forecasting algorithm, it is useful to consider a transformed process (cf. Ansley [9]; Lund and Basawa [6]) defined by Wt =      Xt ,     Xt − t = 0, . . . , m − 1 p (2.1.2) φt (k)Xt−k , t ≥ m k=1 where m = max(p, q). Then it follows from (1.0.1) that, for t ≥ m, the transformed process (2.1.2) has the moving average representation q Wt = θt (j)εt−j , (2.1.3) j=0 where θt (0) = 1 for all t. Notice that φt (k) and θt (k) are periodic in S, such that φt (k) = φ t (k) and θt (k) = θ t (k), where t is the season corresponding to index t, so that t = t 14 mod S. Proposition 2.1.1. Hn = sp{X ¯ ¯ 0 , . . . , Xn−1 } = sp{W 0 , . . . , Wn−1 } (2.1.4) for all n ≥ 1. ∗ = sp{W Proof. Define Hn ¯ ¯ ¯ 0 , . . . , Wn−1 }. For j = 1, obviously H1 = sp{X 0 } = sp{W 0} = H1∗ . Assume that when j = n − 1, (2.1.4) holds, i.e. Hn−1 = sp{X ¯ 0 , . . . , Xn−2 } = ∗ . By (2.1.2), sp{W ¯ . , Wn−2 } = Hn−1 0, . .    X n−1 m − 1 ≥ j. Finally, for the last case, when m ≤ j ≤ , using (1.0.1) again I have  q E(Wj W ) = E   q θj (k)εj−k θ (r)ε −r  r=0 k=0 q θj (k)θ (r)E(εj−k ε −r ), = k=0 r=0 q and E(εj−k ε −r ) = 0 unless j − k = − r. Remark 2.1.4. Since θ0 (k) = 0 for k > q, it follows from the final case in Proposition 2.1.3 that C(j, ) = 0 whenever > m and > j + q. 
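Before applying the innovations algorithm, it may help to see the transformation (2.1.2) written out as code. The following minimal R sketch, with illustrative names transform_W and PHI that are not the appendix arrays, forms W_t from a mean-centered series whose first observation is season 0, given an S x p matrix of seasonal autoregressive coefficients (assuming the series is longer than m = max(p, q)).

```r
# Sketch of (2.1.2): x is the mean-centered series of length n = N*S, first
# observation is season 0; PHI[i+1, k] = phi_i(k) for seasons i = 0,...,S-1.
transform_W <- function(x, PHI, q) {
  S <- nrow(PHI); p <- ncol(PHI); m <- max(p, q)
  n <- length(x)
  W <- x                                  # W_t = X_t for t = 0,...,m-1
  for (t in (m + 1):n) {                  # R index t corresponds to time t-1
    season <- (t - 1) %% S + 1            # season of time t-1, in 1,...,S
    W[t] <- x[t] - sum(PHI[season, ] * x[(t - 1):(t - p)])
  }
  W
}
```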
Using the covariance function C(j, ) computed in Proposition 2.1.3, I can now apply the innovations algorithm from Brockwell and Davis [11, Proposition 5.2.2 ] to the transformed process (2.1.2) to compute the one-step ahead predictor n ˆn = W ˆ θn,j Wn−j − W n−j , (2.1.7) j=1 where θn,1 , . . . , θn,n are the unique projection coefficients that minimize the mean squared error ˆn vn = E Wn − W 20 2 . (2.1.8) The coefficients θn,j in (2.1.7) and vn in (2.1.8) are computed from the system of equations v0 = C(0, 0)  θn,n−k = v −1 C(n, k) − k  k−1 θk,k−j θn,n−j vj  j=0 n−1 vn = C(n, n) − θn,n−j 2 (2.1.9) vj j=0 solved in the order v0 , θ1,1 , v1 , θ2,2 , θ2,1 , v2 , θ3,3 , θ3,2 , θ3,1 , v3 , . . . and so forth. Proposition 2.1.5. Given a PARMAS (p, q) process (1.0.1), the innovations algorithm (2.1.9) applied to the transformed process (2.1.2) with covariance function C(j, ) from Proposition 2.1.3 yields      0       n   ˆ θn,j Wn−j − W n−j ˆ Wn = j=1      q    ˆ  θn,j Wn−j − W  n−j   j=1 n = 0, 1 ≤ n < m, (2.1.10) n ≥ m. In particular, I have θn,j = 0 whenever j > q and n ≥ m. Proof. If n ≥ m and j > q, then Remark 2.1.4 implies that C(n, n − j) = E(Wn Wn−j ) = 0. Since ˆ W ¯ 0 , . . . , Wn−j−1 }, n−j ∈ sp{W ˆ another application of Remark 2.1.4 shows that E(Wn W n−j ) = 0, and then it follows using the linearity of the expectation operator that ˆ E Wn (Wn−j − W n−j ) = 0. 21 (2.1.11) The proof of the innovations algorithm in [11, Proposition 5.2.2 ] shows that the random variables ˆ {Wn−j − W n−j : j = 1, 2, . . . , n} are uncorrelated, and hence the coefficients −1 E W (W ˆ θn,j = vn−j n n−j − W n−j ) (2.1.12) are uniquely defined. Then (2.1.10) follows from (2.1.11), (2.1.12), and (2.1.9). Remark 2.1.6. Proposition 2.1.5 shows the advantage of the transformed process (2.1.2) for computing the innovations, since the sum in (2.1.10) terminates after q lags when n ≥ m. Since the forecast equations developed later in this paper all depend on the innovations, this fact will be used to speed up the forecast computations. ˆ n for the best linear predictor, that The next result gives the one-step ahead predictors X minimize the mean squared prediction error. Equation (2.1.13) was proven by Lund and Basawa [23, Equation (3.4)] in a different notation. Theorem 2.1.7. The one-step predictors (2.1.6) for a PARMAS (p, q) process (1.0.1) can be computed recursively using  n   ˆ  θn,j Xn−j − X  n−j   j=1 ˆn = X p q    ˆ  φn (j)Xn−j + θn,j Xn−j − X  n−j  j=1 j=1 1≤n h, it follows using Brockwell and Davis [11, Proposition 2.3.2] that n+h ˆ θn+h,j Wn+h−j − W n+h−j . PHn Wn+h = (2.1.18) j=h+1 ˆ Since each Wn+h−j − W n+h−j lies in Hn , for j > h, and Wn+h −PHn Wn+h is orthogonal to any element of Hn , it follows that the mean square error of the h-step prediction is E Wn+h − PHn Wn+h 2 n+h = C(n + h, n + h) − θn+h,j j=h+1 24 2 vn+h−j . From (2.1.2), I can write PHn Wn+h = PHn Xn+h when 0 ≤ h ≤ m − n − 1, and p PHn Wn+h = PHn Xn+h − φn+h (k)PHn Xn+h−k (2.1.19) k=1 when h ≥ m − n. Substitute (2.1.15) into (2.1.18) to get n+h ˆ θn+h,j Xn+h−j − X n+h−j . PHn Wn+h = (2.1.20) j=h+1 If 0 ≤ h ≤ m − n − 1, then (3.0.6) and (2.1.2) imply that (2.1.16) holds. If h ≥ m − n, substitute (3.0.6) into (2.1.19) to get (2.1.17), using the fact that θn+h,j = 0 if j > q and h ≥ m − n. ˜ ,...,X ˜ Given a PARMAS (p, q) time series data set X 0 n−1 , I can forecast future values using the h-step ahead predictors from Theorem 2.1.9. 
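For concreteness, the innovations recursion (2.1.9) can be coded directly from a covariance kernel. Below is a minimal R sketch; the names innovations, Cfun, Theta and v are illustrative, and Cfun(j, l) is assumed to return C(j, ℓ) for 0-based indices, for example the covariance of the transformed process from Proposition 2.1.3.

```r
# Sketch of the innovations algorithm (2.1.9): returns theta_{n,j} and v_n
# for n = 1,...,nmax, given a covariance kernel Cfun(j, l) with 0-based indices.
innovations <- function(Cfun, nmax) {
  v <- numeric(nmax + 1)
  Theta <- matrix(0, nmax, nmax)      # Theta[n, j] stores theta_{n,j}, 1 <= j <= n
  v[1] <- Cfun(0, 0)                  # v_0
  for (n in 1:nmax) {
    for (k in 0:(n - 1)) {
      s <- Cfun(n, k)
      if (k > 0) {                    # for k = 0 the correction sum is empty
        j <- 0:(k - 1)
        s <- s - sum(Theta[k, k - j] * Theta[n, n - j] * v[j + 1])
      }
      Theta[n, n - k] <- s / v[k + 1] # theta_{n,n-k} = v_k^{-1} [ C(n,k) - sum ]
    }
    j <- 0:(n - 1)
    v[n + 1] <- Cfun(n, n) - sum(Theta[n, n - j]^2 * v[j + 1])
  }
  list(theta = Theta, v = v)          # v[n+1] = v_n, Theta[n, j] = theta_{n,j}
}
```

By Proposition 2.1.5, for n ≥ m only the q most recent coefficients θ_{n,1}, ..., θ_{n,q} are nonzero, so in practice the coefficient array can be stored in a much smaller n x q array, as is done in Chapter 3.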
Note that this requires computing all of the innovations ˆ Xt − X t for t = 0, 1, . . . , n − 1 using the recursive equation (2.1.13). Complete details will be provided in Chapter 3. In the next chapter, I explicitly compute the forecast errors, which will be needed to compute prediction bands for this forecast. 2.2 Forecast errors The next theorem is the main theoretical result of this chapter. It explicitly computes the variance of the forecast errors, and a simpler asymptotic variance that is useful for computations. Formula (2.2.12) for the covariance matrix of the forecast errors was established 25 by Lund and Basawa [23, Equation (3.41)] in a different notation. However, that paper did not develop an explicit formula for the forecast error variance. I begin with the data set {X0 , X1 , . . . , Xn−1 } as in Chapter 2, where n = N × S. Theorem 2.2.1. Define χj (0) = 1 for all j ≥ 0, χj (j − ) = 0 for all j ≥ 0 and > j, and recursively min(p, ) φj (k)χj−k ( − k) χj ( ) = (2.2.1) k=1 for all j ≥ 0 and 0 ≤ < j. 2 (h) = E Then the mean-squared error σn Xn+h − PHn Xn+h 2 of the h-step predictors PHn Xn+h for the PARMAS (p, q) process (1.0.1) can be computed recursively using 2 (h) = σn  h 2 j χh (k)θn+h−k,j−k  vn+h−j  j=0 (2.2.2) k=0 when n ≥ m := max(p, q), where the coefficients θn+h−k,j−k and vn+h−j are computed via the innovations algorithm (2.1.9) applied to the transformed process (2.1.2). Furthermore, the asymptotic mean squared error is given by 2 (h) → σn h 2 (j)σ 2 ψh h−j as n = N S → ∞, (2.2.3) j=0 where j χh (k)θh−k (j − k). ψh (j) = k=0 ˜ − µ is the mean-centered process in (1.0.1), and W is the Proof. Recall that Xt = X t t t 2 (h) = transformed process in (2.1.2). Note that the mean squared prediction error σn 26 E(Xn+h − PHn Xn+h )2 for the mean-centered PARMA process (1.0.1) is not the same as the mean squared prediction error E(Wn+h − PHn Wn+h )2 for the transformed process (2.1.2). When n ≥ m := max(p, q), Theorem 2.1.9 implies that (2.1.17) holds for any h ≥ 0, and note that the second term in (2.1.17) vanishes when h + 1 > q. Write ˆ ˆ Xn+h = X n+h + (Xn+h − Xn+h ), and since n = N S, then φn+h (j) = φh (j). Substitute (2.1.13) with n ≥ m to get Xn+h = φh (1)Xn+h−1 + . . . + φh (p)Xn+h−p q ˆ θn+h,j Xn+h−j − X + n+h−j , j=0 (2.2.4) with θn,0 = 1 for all n. Subtract (2.1.17) from (2.2.4) and rearrange terms to get p Xn+h − PHn Xn+h − φh (k) Xn+h−k − PHn Xn+h−k k=1 h (2.2.5) ˆ θn+h,j Xn+h−j − X n+h−j . = j=0 Define the random vectors     Mn =    ˆn Xn − X .. . ˆ Xn+h − X n+h           X n − PH n X n      . . and Fn =   .     Xn+h − PHn Xn+h . 27 (2.2.6) Write h Φh = −φj (j − ) j, =0 (2.2.7) where I define φj (0) = −1 for all j, and φj (k) = 0 for k > p or k < 0. Note that φj (k) is periodic in S, such that φj (k) = φ j (k), where j is the season corresponding to index j, so that j = j mod S. Then from the innovations algorithm, write h Θn = θn+j,j− j, =0 (2.2.8) where I define θn,0 = 1, and θn,k := 0 for k > q or k < 0. Then I can use (5.1.3) to write Φh Fn = Θn Mn , (2.2.9) where I note that   0 0 0  1    −φ (1) 1 0 0  1    −φ (2) −φ2 (1) 1 0  2 Φh =    −φ3 (3) −φ3 (2) −φ3 (1) 1   .. .. .. ..   . . . .   −φh (h) −φh (h − 1) −φh (h − 2) . . . 28 . . . 0   . . . 0    . . . 0    . . . 0   .. ..  . .   ... 1 and   0 0  1   θ 1 0  n+1,1   θ θn+2,1 1  n+2,2 Θn =    θn+3,3 θn+3,2 θn+3,1   .. ..  ..  . . .   θn+h,h θn+h,h−1 θn+h,h−2 . . . 0   . . . 0    . . . 0   ..  1 .   .. 
..  . .   ... 1 are lower triangular matrices. The entries of the innovations vector Mn in (2.2.6) are uncorrelated, with covariance matrix Vn := E Mn Mn = diag vn , vn+1 , . . . , vn+h , (2.2.10) where I recall that vn = E ˆn Xn − X 2 (2.2.11) are the one step ahead prediction errors. Then the covariance matrix of the vector Fn = Φ−1 Θn Mn of prediction errors is h Cn := E Fn Fn = Ψn Vn Ψn where Ψn = Φ−1 Θn h (2.2.12) and denotes the transpose. In order to compute the matrix Cn in (2.2.12), I first need to compute the inverse matrix h Φ−1 = χj (j − ) h j, =0 29 (2.2.13) and show that (2.2.1) holds. An elegant way to compute the inverse matrix (5.1.11) is to use operator notation along with the z-transform. Use (1.0.1) to write Φ(B)Xn+j = Θ(B)εn+j for j ≥ 0, (2.2.14) where p p p φn+j (k)z k = − Φ(z) = 1 − φn+j (k)z k = − φj (k)z k k=1 q k=0 k=0 q q Θ(z) = 1 + θn+j (k)z k = θn+j (k)z k = θj (k)z k , k=1 k=0 k=0 since n = N S, BXt = Xt−1 is the backward shift operator. Then the infinite order moving average representation (1.0.2) for the mean-centered process Xt can be written in the form Xn+j = Ψ(B)εn+j for j ≥ 0, (2.2.15) where Ψ(z) = Φ−1 (z)Θ(z) (2.2.16) Notice that Φh = Φ, however Θn = Θ and Ψn = Ψ. Write ∞ Ψ(z) = Φ−1 (z) = k=0 ∞ ψj (k)z k χj (k)z k k=0 extending the notation (5.1.11). If n ≥ m = max(p, q), then the infinite order moving 30 average representation (2.1.3) of the transformed process Wt can be written in the form for j ≥ 0 and n ≥ m. Wn+j = Θ(B)εn+j (2.2.17) Equating (2.2.15) and (2.2.17) and using (2.2.16) shows that Xn+j = Φ−1 (B)Wn+j for j ≥ 0. (2.2.18) Since Φ−1 (z)Φ(z) = I, I have  ∞    p χj+k (k)z k  1 − k=0 φj (k)z k  = 1 k=1 for all j ≥ 0. By expanding the above equation, I have χj (0) + χj+1 (1)z + · · · + χj+k (k)z k + · · · 1 − φj (1)z − · · · − φj (p)z p = 1, which could be separated as χj (0) + χj+1 (1) − χj (0)φj (1) z + · · · + χj+p (p) − χj+p−1 (p − 1)φj (1) − · · · − χj (0)φj (p) z p + · · · = 1 31 Setting every coefficient of z k to 0, for k = 1, 2, . . ., I have for all j ≥ 0, χj (0) = 1 χj+1 (1) = χj (0)φj (1) .. . χj+p (p) = χj+p−1 (p − 1)φj (1) − · · · − χj (0)φj (p). Hence, min(p, ) φj (k)χj−k ( − k), φj (k)χj−k ( − k) = χj ( ) = k=1 k=1 using the fact that φj (k) = 0, k > p and χj ( − k) = 0, k > , which proves (2.2.1). Since Φh = Φ, this shows that the inverse matrix (5.1.11) can be computed for any h ≥ 0 by taking χj (0) = 1 for all j ≥ 0, χj (j − ) = 0 for all j ≥ 0 and > j, and recursively applying the formula (2.2.1) for all j > 0 and 0 ≤ < j. Now I can proceed to compute the matrix Cn in (2.2.12). First write h Ψn = Φ−1 Θn = ψn+j,j− h j, =0 32 where h ψn+j,j− = s=0 h (Θ ) Φ−1 h j,s n s, χj (j − s)θn+s,s− = (2.2.19) s=0 j χj (j − s)θn+s,s− = s= since χj (j − s) = 0 for s > j, and θn+s,s− = 0 for s < . Substitute s = j − k in the sum (5.1.5) to obtain j− ψn+j,j− = χj (k)θn+j−k,j− −k k=0 33 (2.2.20) Now it follows from (5.1.5), (5.1.9), and (5.1.14) that 2 (h) = E σn Xn+h − PHn Xn+h 2 = Ch,h h h = Ψh,u Vu,w Ψw,h u=0 w=0 h h = Ψh,u Vu,w Ψh,w u=0 w=0 h Ψh,u Vu,u Ψh,u = u=0 h 2 = ψn+h,h−u vn+u u=0  2 h h−u  = χh (k)θn+h−k,h−u−k  vn+u u=0 k=0 (2.2.21) Substitute u = h − j to obtain (2.2.2). Next I require a few preparatory results. Lemma 2.2.2. With the innovations algorithm (2.1.9) applied to the transformed process (2.1.2), I have vr − σr2 → 0, as r → ∞. Proof. Recall that Hr = sp{X ¯ ¯ 0 , . . . , Xr−1 } = sp{W 0 , . . . , Wr−1 }, and define Hr = sp{W ¯ j , −∞ < j ≤ r − 1}. 
34 By the invertibility condition of Wr process in Proposition 2.1.2, ∞ σr2 = E(ε2r ) = E(Wr + j=1 πr (j)Wr−j )2 = E(Wr − PHr Wr )2 where ∞ j=1 πr (j)Wr−j = PHr (εr − Wr ) = −PHr Wr , since εr ⊥Hr and Wr−j ∈ Hr , j = 1, 2, . . . , ∞. Since Hr ⊆ Hr , I have PHr Wr ∈ Hr , thus by the Projection Theorem in Brockwell and Davis [11, Theorem 2.3.1 ], it follows that ˆ r )2 = vr . σr2 = E(Wr − PHr Wr )2 ≤ E(Wr − PHr Wr )2 = E(Wr − W On the other hand, r πr (j)Wr−j ∈ Hr , − j=1 so that by another application of the Projection Theorem in Brockwell and Davis [11, Theorem 2.3.1 ], I obtain ˆ r )2 = E(Wr − P Wr )2 vr = E(Wr − W Hr r ≤ E Wr − (− πr (j)Wr−j ) 2 j=1 r = E πr (0)Wr + πr (j)Wr−j 2 j=1 r = E( πr (j)Wr−j )2 j=0 35 Therefore, since εr ⊥ Hr ,  2 r vr ≤ E  πr (j)Wr−j  j=0 2  = E εr − πr (j)Wr−j  j>r  2 = E(εr )2 + E  πr (j)Wr−j  j>r   = σr2 + E  πr (j)Wr−j j>r ≤ σr2 + πr (k)Wr−k  k>r | πr (j) || πr (k) |E | Wr−j Wr−k | j,k>r ≤ σr2 + | πr (j) || πr (k) | C(r − j, r − j)C(r − k, r − k) j,k>r  ≤ σr2 +  2 | πr (j) | M, j>r where M = max{C(i, i) : i = 0, 1, . . . , S − 1} < ∞. In summary, I have proved | πr (j) | 2 M σr2 ≤ vr ≤ σr2 + j>r where 2 → 0 uniformly over all seasons, as r → ∞. Hence , as r → ∞, I j>r | πr (j) | have | πr (j) | 2 M → 0, |vr − σr2 | ≤ j>r which completes the proof. 36 ˆ r be given by (2.1.10). Then Lemma 2.2.3. Let Wr be given by (2.1.2) and W ˆ r − εr )2 = 0. lim E (Wr − W r→∞ q j=0 θr (j)εr−j in (2.1.3), I may write Proof. By the causal representation Wr = q ˆ r )] = E(ε2 ) = σ 2 , θr (j)εr−j − W r r ˆ r )] = E[εr (εr + E[εr (Wr − W j=1 ˆ r . Hence E[εr (Wr − W ˆ r )] = since εr ⊥ εr−j , for j = 1, . . . , q, and εr ⊥ Hr , therefore εr ⊥ W σr2 . Then ˆ r − εr )2 E (Wr − W ˆ r )2 − 2E[εr (Wr − W ˆ r )] + E(ε2 ) = E (Wr − W r = vr − 2σr2 + σr2 = vr − σr2 → 0, using Lemma 2.2.2. Proposition 2.2.4. Let θr,k be the projection coefficients from (2.1.7) and let θr (k) be the moving average coefficients from (1.0.1). Then |θr,k − θr (k)| → 0 as r → ∞, for all k = 1, 2, . . . Proof. I know that by (2.1.12) ˆ θr,k = v −1 E Wr (Wr−k − W r−k ) . r−k 37 I also have θr (k) = σ −2 E(Wr εr−k ), r−k since 2 , E(Wr εr−k ) = E θr (k)ε2r−k = θr (k)σr−k using the causal representation (2.1.3). By the triangle inequality, ˆ θr,k − θr (k) ≤ θr,k − σ −2 E Wr (Wr−k − W r−k ) r−k ˆ + σ −2 E Wr (Wr−k − W r−k − εr−k ) r−k = θr,k − σ −2 θr,k vr−k r−k (2.2.22) ˆ + σ −2 E Wr (Wr−k − W r−k − εr−k ) r−k ≤ θr,k 1 − σ −2 vr−k r−k 1 1 ˆ r − εr ) 2 2 , + σ −2 C(r, r) 2 E(Wr − W r−k using the Cauchy-Schwarz inequality. Note that θr,k is uniformly bounded in r over all seasons, since (2.1.12) and the Cauchy-Schwarz Inequality imply, ˆ θr,k = v −1 E Wr (Wr−k − W r−k ) r−k 2 ˆ ≤ v −1 C(r, r) E(Wr−k − W r−k ) r−k −1/2 =v C(r, r). r−k Then, as r → ∞, the first term on the right-hand side of (2.2.22) approaches 0 by Lemma 2.2.2. The second term on the right-hand side of (2.2.22) approaches 0 by Lemma 2.2.3 1 and the fact that σ −2 C(r, r) 2 is uniformly bounded in r. Thus, θr,k − θr (k) → 0 as r−k 38 r → ∞. Proof of Theorem 2.2.1 continued. Using the covariance function C(j, ) computed in Proposition 2.1.3, I can now apply the innovations algorithm from Brockwell and Davis [11, Proposition 5.2.2 ] to the transformed process (2.1.2) to compute the one-step ahead predictor n ˆn = W ˆ θn,j Wn−j − W n−j j=1 ˆ = 0. Proposition 2.2.4 shows that for n > 0, where W 0 θs, − θs ( ) → 0 as s → ∞, Note that (2.2.23) also holds for s = n + h − k and for all > 0. 
(2.2.23) = 0, since θs,0 = θs (0) = 1 by definition. Substitute = j − k to see that θn+h−k,j−k − θn+h−k (j − k) → 0 as n → ∞ (2.2.24) for all j ≥ k ≥ 0. Since n = N S, then θn+h−k (j − k) = θN S+h−k (j − k) = θh−k (j − k). Therefore (5.1.19) is better written as θN S+h−k,j−k − θh−k (j − k) → 0 as N → ∞. (2.2.25) The mean-centered process Xt has moving average representation (2.2.15), which can be related to the moving average representation (2.2.17) of the transformed process Wt by the 39 relation (2.2.16). Expanding both sides I obtain ∞ ψh (j)z j = j=0 q ∞ χh (j)z j j=0 θh (j)z j . j=0 Expanding the product on the right-hand side and equating coefficients, we have ∞ j=0 ψh (j)z j = ∞ χh (j)z j θh (0)1 + θh (1)z + · · · + θh (q)z q j=0 = χh (0)θh (0)1 + χh (0)θh (1) + χh (1)θh−1 (0) z j + ··· + χh (k)θh−k (j − k) z j + · · · k=0 j ∞ = χh (k)θh−k (j − k) z k . j=0 k=0 Then j χh (k)θh−k (j − k) ψh (j) = (2.2.26) k=0 for all h ≥ 0 and j ≥ 0. Now it follows from (5.1.5), (2.2.25) and (2.2.26) that lim ψ (j) − ψN S+h,j N →∞ h j j = lim χh (k)θh−k (j − k) − χh (k)θN S+h−k,j−k N →∞ k=0 k=0 j ≤ lim χh (k) θh−k (j − k) − θN S+h−k,j−k N →∞ k=0 (2.2.27) = 0, where for each h ≥ 0, k = 0, . . . , j and 0 ≤ j ≤ h, χh (k) is a constant independent of 40 n = N S. Notice that from (2.2.25), as N → ∞ I have j j χh (k)θh−k (j − k) χh (k)θN S+h−k,j−k → k=0 k=0 and therefore,  2 j  2 j χh (k)θh−k (j − k) χh (k)θN S+h−k,j−k  →   (2.2.28) k=0 k=0 2 is periodic with period S and thus Furthermore, by (2.2.28), since h is finite and σh−j bounded, then as N → ∞, h lim N →∞  j  j χh (k)θN S+h−k (j − k) −   j=0 2 k=0 2 χh (k)θN S+h−k,j−k  2 σh−j →0 k=0 (2.2.29) 41 Then it follows from (2.2.2), (2.2.27), (2.2.29) and Lemma 2.2.2 that lim σ 2 (h) − n→∞ n = = h 2 (j)σ 2 ψn+h n+h−j j=0 2 (h) − lim σN S N →∞ 2 (h) − lim σN S N →∞ h 2 (j)σ 2 ψh h−j j=0 h   2 j 2 χh (k)θh−k (j − k) σh−j j=0 k=0  2 j h 2 2 (h) −  χh (k)θN S+h−k,j−k  σh−j ≤ lim σN S N →∞ j=0 k=0  2  2 j j h 2  χh (k)θh−k (j − k) −  χh (k)θN S+h−k,j−k  σh−j + lim N →∞ j=0 k=0 k=0  2 j h 2 (h) − 2  χh (k)θN S+h−k,j−k  σh−j = lim σN S N →∞ j=0 k=0  2 j h 2 (h) − 2  χh (k)θN S+h−k,j−k  vN S+h−j + σh−j − vN S+h−j = lim σN S N →∞ j=0 k=0  2 j h 2 (h) −  χh (k)θN S+h−k,j−k  vN S+h−j ≤ lim σN S N →∞ j=0 k=0  2 j h 2  + lim χh (k)θN S+h−k,j−k  σh−j − vN S+h−j N →∞ j=0 k=0  2 j h 2 (h) −  = lim σn χh (k)θn+h−k,j−k  vn+h−j + 0 n→∞ j=0 k=0 = 0. This proves (2.2.3). 42 The following corollary gives asymptotic prediction intervals for the forecast, in the Gaussian case. Corollary 2.2.5. If {Xt } is a 0-mean Gaussian process, then the probability that Xn+h lies 1 h ψ 2 (j)σ 2 2 approaches (1 − α) as n → ∞, between the bounds PHn Xn+h ± zα/2 j=0 h h−j where zα is the (1 − α)-quantile of the standard normal distribution. Proof. Since (X0 , X1 , . . . , Xn+h ) has a multivariate normal distribution, then by Problem 2.20 in Brockwell and Davis [11], Xn+h = E(Xn+h |X0 , . . . , Xn−1 ). PHn Xn+h = Esp ¯ {X ,...,X 0 n−1 } Then since (2.2.3) holds, letting Φ(t) denote the cumulative distribution function of the standard normal distribution, it follows that, as n → ∞, h P PHn Xn+h − zα/2 1 2 (j)σ 2 2 ψh h−j j=0 h ≤ Xn+h ≤ PHn Xn+h + zα/2 1 2 (j)σ 2 2 ψh h−j j=0 = P |Xn+h − PHn Xn+h | ≤ zα/2 σn (h) → Φ(zα/2 ) − Φ(−zα/2 ) = 1 − α. 43 Chapter 3 Computation In this chapter, I outline the computations needed to produce forecasts and the associated prediction intervals. 
All the calculations are carried out by R software, and the codes are in the appendix. This chapter details the codes and their usage. The first step is model selection, i.e., the number of seasons S, the order p of the autoregressive part, the order q of the moving average part must be chosen. Usually S is known from the application. For example, I use S = 4 for quarterly data and S = 12 for monthly data. The next step is to estimate the autoregressive parameters φt (k) for k = 1, . . . , p, the moving average parameters θt (j) for j = 1, . . . , q, and the residual standard deviations σt of a PARMAS (p, q) model ˜ −µ for the sample-mean corrected data, Xt = X t ˆt . These two steps are closely connected, since the process of model selection requires fitting a proposed model to judge its adequacy. The entire procedure of model selection and parameter estimation is outlined in Tesfaye, et al.[42]. A brief synopsis is given in the following paragraph. ˜ ,X ˜ ˜ , we can extract a subset X ˜ ,X ˜ ˜ For any data set X 0 1 , . . . , Xn i i+1 , . . . , Xi+n−1 , where ˜ i represents the i-th season, n = N S, N equals the number of years of data and S equals the 44 number of seasons in a year. Use (1.0.4) and (1.0.5), respectively, to compute the seasonal sample means and autocovariances. The means µ ˆi for i = 0, 1, . . . , S − 1 are stored in an S × 1 array MU(I) for I= 1, . . . , S and the autocovariances γˆi ( ) for i = 0, 1, . . . , S − 1 and = 0, . . . , N − 1 are stored in an S × N array GAMMA(I,L) for I= 1, 2, . . . , S and L= 1, . . . , N . Since our notation begins with season i = 0 and lag = 0, and since many coding platforms (including R) do not allow zero or negative subscripts, there is a change of notation I= i + 1 and L= + 1. In this way, MU(I) = µ ˆi and GAMMA(I,L) = γˆi ( ). Now consider a general innovations algorithm for all seasons i, where i = 0, 1, . . . , S − 1: v0,i = C(i, i)  (i) θ = v −1 C(i + n, i + k) − n,n−k k,i n−1 vn,i = C(i + n, i + n) −  k−1 θ j=0 (i) (i) θn,n−j vj,i  k,k−j (3.0.1) 2 (i) vj,i , θn,n−j j=0 (i) (i) (i) (i) (i) (i) solved in the order v0,i , θ1,1 , v1,i , θ2,2 , θ2,1 , v2,i , θ3,3 , θ3,2 , θ3,1 , v3,i , . . . and so forth. Now k + 1 iterations of the innovations algorithm (3.0.1) with C(j, ) = γˆj ( − j) must be computed for every initial season i = 0, 1, . . . , S − 1 to obtain the estimates of the seasonal variances σ ˆi2 = vk, i−k , (3.0.2) and estimates of the coefficients in the infinite order moving average representation (1.0.2), ( i−k ) ψˆi ( ) = θ k, 45 (3.0.3) for = 0, . . . , k, where t is the season corresponding to the index t, so that jS + i = i. See Anderson, Meerschaert, and Vecchia [6] for more details. It is necessary to repeat the innovations algorithm for every initial season i = 0, 1, . . . , S − 1 because the final estimates of the seasonal variances σ ˆi2 and the infinite order moving average coefficients ψˆi (j) depend on both the starting season i and the number of iterations. The number of iterations k + 1 should be chosen so that all the parameter estimates σ ˆi2 for i = 0, 1, . . . , S − 1 and ψˆi (j) for i = 0, 1, . . . , S − 1 and = 0, . . . , m show evidence of convergence, where m = max{p, q}. I use the idea of relative error to show the convergence. For example, as the number of iterations increases, for a fixed season i, define σ ˆi2 (k) as the seasonal sample variance after k iterations, and σ ˆi2 (k + 1) the seasonal variance after k + 1 iterations. 
Since the value of variance can be large, I consider the relative change σ ˆi2 (k + 1) − σ ˆi2 (k) /ˆ σi2 (k). As a general rule of thumb, I interpret a relative error less than 0.05 after k + 1 iterations as evidence of convergence. The estimated seasonal variances are stored in an S × 1 array VAR(I) for I= 1, . . . , S and the estimated coefficients in the infinite order moving average representation are stored in an S × N array PSI(I,L) for I= 1, . . . , S and L= 1, . . . , N . In this way, VAR(I) = σ ˆi2 and PSI(I,L) = ψˆi ( ) , where I= i + 1 and L= + 1. Anderson and Meerschaert (2005) show that P vk, i−k −−−→ σi2 , (3.0.4) ( i−k ) P θˆ −−−→ ψi (j), k,j (3.0.5) 46 and ( i−k ) − ψi (j) ⇒ N 0, N 1/2 θˆ k,j j−1 σ 2 i−n ψ 2 (n) i 2 n=0 σi−j (3.0.6) as N → ∞ and k → ∞ for any fixed i = 0, 1, . . . , S − 1, where “ ⇒ ” indicates convergence in distribution, and N (m, v) is a normal random variable with mean m and variance v. The main technical condition for the convergence (3.0.6) to hold is that the noise sequence εt has a finite fourth moment. In practical applications, N is the number of years of data, k is the number of iterations of the innovations algorithm (typically on the order of k = 10, 15 or 20, see the discussion later), and the convergence in distribution is used to approximate the quantity on the lefthand side of (3.0.6) by a normal random variable. Equation (3.0.6) can be used to produce confidence intervals and hypothesis tests for the moving average parameters in (1.0.1). For example, an α-level test statistic rejects the null hypothesis H0 : ψi ( ) = 0 in favor of the alternative Ha : ψi ( ) = 0, indicating that the model parameter is statistically significantly different from zero) if |Z| > zα/2 . The p-value for this test is p = P (|Z| > |z|) ( i−k ) N 1/2 θˆ k, z= , W where and Z ∼ N (0, 1), −1 n=0 vˆk, i−k−n W2 = vˆk, i−k− ( i−k ) 2 θˆ k,n (3.0.7) . The innovations algorithm allows us to identify an appropriate model for the periodic time series at hand, and the p-value formula gives us a way to determine which coefficients in the identified PARMA model are statistically significantly different from zero (those with a 47 small p-value, say, p < 0.05). Once estimates of the infinite order moving average coefficients ψˆi (j) have been computed, a system of vector difference equations must be solved to determine estimates of the autoregressive parameters φˆi (j) for i = 0, 1, . . . , S − 1 and j = 0, . . . , p, and estimates of the moving average parameters θˆi (j) for i = 0, 1, . . . , S − 1 and j = 0, . . . , q. See Tesfaye, Anderson and Meerschaert [43] for complete details. In the special case of a PARMAS (1, 1) model, it is possible to solve those difference equations by hand to obtain φˆt (1) = ψˆt (2)/ψˆt−1 (1) and θˆt (1) = ψˆt (1) − φˆt (1) (3.0.8) where ψˆt (0) = 1. Hence k + 1 iterations of the innovations algorithm for every initial season i = 0, 1, . . . , S −1 are sufficient to estimate these parameters, assuming that k is large enough to ensure convergence for the variance estimates σ ˆi2 and the infinite order moving average coefficients ψˆi (j) for all seasons i = 0, 1, . . . , S − 1 and for all lags j = 0, . . . , m, where m = max{p, q}. In general, the number of iterations needed for convergence will depend on the order of the model being fit. I use an S ×(m+1) array to store θˆt (j), for a PARMAS (p, q) model, and use the same size array to store φˆt (j). In my R code, the corresponding names of these arrays are named as THETA and PHI respectively. 
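As a check on these formulas, here is a minimal R sketch of the PARMA_S(1,1) parameter extraction (3.0.8) and the p-value computation (3.0.7). It assumes the arrays described above: VAR with VAR[i+1] = σ̂²_i and PSI with PSI[i+1, ℓ+1] = ψ̂_i(ℓ) and PSI[, 1] = 1, together with N years of data. The helper names prev_s, PHI1, THETA1 and pval_psi are illustrative, not the appendix names.

```r
S <- 12
prev_s <- function(i) ((i - 2) %% S) + 1            # 1-based index of season i-1, mod S

# PARMA_S(1,1) parameter extraction, equation (3.0.8)
PHI1   <- PSI[, 3] / PSI[prev_s(1:S), 2]            # phi-hat_t(1) = psi-hat_t(2) / psi-hat_{t-1}(1)
THETA1 <- PSI[, 2] - PHI1                           # theta-hat_t(1) = psi-hat_t(1) - phi-hat_t(1)

# Two-sided p-value for H0: psi_i(l) = 0, using the normal limit (3.0.6)-(3.0.7);
# i is the 0-based season and l >= 1 the lag.
pval_psi <- function(i, l, PSI, VAR, N) {
  seas <- function(j) ((i - j) %% S) + 1            # 1-based index of season i-j
  W2   <- sum(VAR[seas(0:(l - 1))] * PSI[i + 1, (0:(l - 1)) + 1]^2) / VAR[seas(l)]
  z    <- sqrt(N) * PSI[i + 1, l + 1] / sqrt(W2)
  2 * pnorm(-abs(z))                                # p = P(|Z| > |z|)
}
```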
Table 3 lists moving average parameter estimates ψˆi ( ) at season i and lag = 1, 2, . . . , 6, and p-values, after k = 20 iterations of the innovations algorithm applied to average monthly flow series for the Fraser River near Hope BC. In the discussion of a PARMAS (1, 1) model, by (3.0.8) I only consider ψˆi ( ) at lag 1 and lag 2, and the ones with a higher p-value are shown in bold font in Table 3. In order to study the convergence of ψˆi ( ) as iterations k increase, I exclude ψˆi ( ) if its corresponding p-value is more than p0 of 0.05, since that value 48 Table 3.1: Moving average parameter estimates ψˆi ( ) at season i and lag = 1, 2, . . . , 6, and p-values, after k = 20 iterations of the innovations algorithm applied to average monthly flow series for the Fraser River near Hope BC. i ψˆi (1) p ψˆi (2) p ψˆi (3) p ψˆi (4) p ψˆi (5) p ψˆi (6) p 0 0.885 .00 0.134 .28 0.105 .10 0.163 .01 0.006 .93 0.038 1 0.625 .00 0.503 .00 0.085 .46 0.140 .02 0.077 .17 -0.004 .94 2 0.508 .00 0.350 .00 0.419 .00 0.032 .72 0.097 .03 0.019 .65 3 0.515 .00 0.287 .00 0.140 .07 0.239 .00 0.034 .60 0.030 .37 4 0.791 .00 0.165 .10 0.295 .00 0.112 .12 0.160 .03 0.045 .43 5 0.567 .00 0.757 .00 0.057 .61 0.250 .00 0.062 .40 0.139 .06 6 1.076 .01 0.711 .11 0.856 .01 0.415 .13 0.241 .17 0.112 .52 7 0.522 .03 0.684 .41 0.988 .28 1.095 .09 0.350 .51 0.198 .56 8 0.451 .00 -1.014 .00 -0.062 .66 -0.745 .50 0.128 .87 -0.635 .31 9 0.618 .00 -0.041 .77 -0.746 .01 -1.083 .26 -0.047 .97 0.514 .50 10 0.448 .00 0.409 .00 0.026 .78 -0.241 .20 -1.125 .08 0.799 .26 11 0.677 .00 0.159 .01 0.194 .00 0.050 .46 -0.190 .17 -0.402 .38 would not be significantly different from zero by (3.0.7). And consider the relative error ψˆi ( )(k+1) − ψˆi ( )(k) . ERRk ( ) = i ψˆi ( )(k) (3.0.9) And define the maximum of the relative error Rk = max{ERRk / Ik }, i ( ) : (i, ) ∈ where Ik is defined as Ik = {i, : p < p0 in (3.0.7)} Figure 3.1 shows plots of Rk against number of iterations k, and the convergence of ψˆi ( ) as k increase. Figure 3.1 also shows plots excluding ψˆi ( ) with p values above p0 of 0.01 and 0.10. Similar plots for vk, i−k are given in Figure 3.2. Once the model is fit, the adequacy of the model can be judged. One way to do this is 49 0.5 0.4 0.3 0.2 0.1 bound of 0.1 0.0 Value of max relative error 0.6 (a) Max relative error, P−value = 1% 0 5 10 15 20 25 30 Number of iterations Figure 3.1: Convergence of ψˆi ( ) as iterations k increase,where 0 ≤ i < S − 1, 1 ≤ ≤ 2, k = 1, 2, . . . , 30. For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation. to compute the residuals and check for any remaining serial correlation. To compute the residuals from the data, the invertible representation (1.0.3) is used and the weights πt (j) must be computed by solving another system of vector difference equations. In the special case of a PARMAS (1, 1) model, it is possible to solve those difference equations by hand (see Anderson, Meerschaert and Tesfaye [8]) to obtain δˆt = σ ˆt−1 Xt − t−i + (−1)j j=2 φˆt (1) + θˆt (1) Xt−1 (3.0.10) φˆt−j+1 (1) + θˆt−j+1 (1) θˆt (1)θˆt−1 (1)...θˆt−j+2 (1)Xt−j where i is the season of the first data point. This produces n − 1 residuals δˆt for t = i + 1, . . . , i + n − 1. In general, one obtains n − m residuals where m = max{p, q} depends on the order of the model being fit. 
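For the PARMA_S(1,1) case, the residuals (3.0.10) can also be evaluated by running the model recursion forward rather than expanding the finite sum explicitly; treating the unobserved values before the sample as zero, so that the first innovation equals the first observation, matches the finite expansion in (3.0.10). A minimal R sketch follows, assuming a mean-centered series x whose first observation has 0-based season i0, and length-S vectors PHI1, THETA1, SIGMA holding the fitted φ̂_t(1), θ̂_t(1), σ̂_t; the function name parma11_residuals is an illustrative choice.

```r
# Sketch: standardized residuals delta-hat_t of a fitted PARMA_S(1,1) model,
# computed by the recursion eps_t = x_t - phi_t x_{t-1} - theta_t eps_{t-1}.
parma11_residuals <- function(x, PHI1, THETA1, SIGMA, i0 = 0) {
  S <- length(PHI1)
  n <- length(x)
  season <- ((i0 + (0:(n - 1))) %% S) + 1     # 1-based season of each observation
  eps <- numeric(n)
  eps[1] <- x[1]                              # values before the sample treated as zero
  for (t in 2:n) {
    eps[t] <- x[t] - PHI1[season[t]] * x[t - 1] - THETA1[season[t]] * eps[t - 1]
  }
  delta <- eps / SIGMA[season]                # standardized residuals delta-hat_t
  delta[-1]                                   # n - 1 residuals, as in the text
}
```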
Now plot the autocorrelation function (ACF) and the partial 50 0.5 0.4 0.3 0.1 0.2 bound of 0.2 0.0 Value of max relative error 0.6 (a) Max relative error 0 5 10 15 20 25 30 35 Number of iterations Figure 3.2: Convergence of vk, i−k as iterations k increase,where 0 ≤ i < S − 1, 1 ≤ 2, k = 1, 2, . . . , 30. ≤ autocorrelation function (PACF) of the residuals and check for any remaining serial correlation, in exactly the same way as for ARMA modeling. An example could be seen in Figure 4.3 of next chapter. Since the standardized errors δt = σt−1 εt in a PARMAS (p, q) model are also iid observations under this model, 95% of the ACF and PACF values should fall √ within the confidence bands ±1.96/ n − m if the model is adequate, see Tesfaye, et al [42]. The principle of parsimony suggests that I choose an adequate model with m = max (p, q) as small as possible. Once the order p of the autoregressive part and the order q of the moving average part are chosen and found adequate, it is then advisable to fit a reduced model with fewer parameters. One method for finding a reduced model using discrete Fourier transforms is discussed in Anderson, Meerschaert and Tesfaye [8]. Validation of a time series model is tantamount to the application of diagnostic checks to 51 the model residuals, to see if they resemble white noise. The Ljung-Box test can be used to test the white-noise null hypothesis (see [11]). If the null hypothesis of white-noise residuals is not rejected, and if the autocorrelation and partial autocorrelation functions of the residuals show no evidence of serial correlation, then I judge the model to be adequate. Fitting a suitable distribution to the residuals allows for a faithful simulation based on this model. To obtain additional parsimony, it is also advisable to consider simpler models where some statistically insignificant model parameters are set to zero. If the resulting model residuals pass the same diagnostic tests, then the simplified model is also deemed adequate. Next I detail the computations required to produce forecasts, and their prediction intervals. As before, I assume that the first season to be forecast is season i = 0, and write the ˜ ,X ˜ ˜ data in the form X 0 1 , . . . , Xn−1 , discarding a few of the oldest observations if necessary. ˜ −µ Then I compute the sample means µ ˆi using (1.0.4) and set X(I) = X i ˆi for I = 1, . . . , S. Step 0: Compute the sample autocovariance by (1.0.5). Apply innovations algorithm with γˆi ( ), and get estimates for σ ˆi2 and ψˆi ( ) by (3.0.2) and (3.0.3). Then model parameter estimates θˆt (j) and φˆt (k) could be computed by (3.0.8). From the constructed model, create an S × Q array to store θˆt (j), j = 0, 1, . . . , q with θˆt (0) = 1 and t = 0, 1, . . . , S − 1, where Q = q + 1. Also, create an S × p array to store φˆt (k), k = 1, . . . , p and an S × 1 array to store σ ˆt2 . The corresponding array names in my R codes are THETA, PHI and SIGMA respectively. Step 1: Use Proposition 2.1.3 to compute the covariances, C(j, ), of the transformed process Wt given in (2.1.2). Notice that in the computation process, γj ( − j), θˆt (j), φˆt (k), and σ ˆt2 are obtained in Step 0. Create an n × n array C(J,L) to store C(j, ), where C(J,L) = C(j, ), for J = j + 1, L = + 1. Speed and efficiency of the algorithm given in Proposition 52 2.1.3 is provided for by noting that C(j, ) = 0 if − j > q and ≥ i + m where m = max(p,q). 
I may let the n × n array defined just now as a zero-array, in this way, with the previous condition, I could avoid a lot of computations. Step 2: From the innovations algorithm given by (2.1.9), once C(j, ) is given, calculate the coefficients θn,j . Since θn,j = 0 if j > q and n ≥ m, I will need an array of size n × Q (again, θn,0 = 1) to store the innovation coefficients, where the corresponding array name is THETA in my R code. The innovations algorithm is solved in the order v0 , θ1,1 , v1 , θ2,2 , θ2,1 , v2 , θ3,3 , θ3,2 , θ3,1 , v3 , . . . and so forth. Use an n × 1 array to store vk . ˆ ,X ˆ ˆ Step 3: Compute the one-step predictors, X 1 2 , . . . , Xn−1 , by (2.1.13), and they are stored in the array Xhat in my R code. At most only an m-step computation is needed, where m = max(p, q), since (2.1.13) only requires the storage of at most p past observations ˆ Xn−1 , . . . , Xn−p and at most q past innovations (Xn−j − X n−j ), j = 1, . . . , q. Therefore computation is much faster when p and q are small, for example, PARMA(1,1) model. The one-step prediction is based on the mean-subtracted data, so I add the seasonal mean back ˆ n once the prediction is obtained. See step 4 for specific storage of X ˆn. to X Step 4: The one- and h-step predictors are stored in the same vector Y of length n + h, where h is the desired number of forecast steps. For example, Y [n + 2] is the prediction at 2-step. Compute the h-step predictors PHn Xn , PHn Xn+1 , . . ., in order, using the recursion given by (2.1.17). The calculation of Y [h] is based on the information on Y [h − 1], so the computation is recursive. At most only an m-step computation is needed. Step 5: By Corollary 2.2.5, compute approximate 95% Gaussian prediction bounds for Xn+h given by PHn Xn+h ± 1.96 × σn (h). The large sample approximation of σn (h) is obtained from (2.2.3). The weights, ψt (j) in (2.2.3) are computed from the model parameters 53 φt and θt via the following recursions: p φi (k)ψi−k (j − k) = 0, j ≥ max(p, q + 1) ψi (j) − k=1 and j ψi (j) − φi (k)ψi−k (j − k) = 0, 0 ≤ j < max(p, q + 1) k=1 To avoid this computation, I may adopt the parameter estimate ψˆi (j) in (3.0.3). Finally, ˜ ˜ noting that X ˆn+h = Xn+h , the approximate 95% prediction bounds for X n+h − µ n+h ˆn+h = µ ˆ n+h . Two h-length vectors are (ˆ µn+h + PHn Xn+h ) ± 1.96 × σn (h). Also, µ U BOUND and L BOUND are defined to store the prediction bounds. 3.1 A simulation study for the convergence of the coefficients in innovations algorithm. To better testify the convergence of ψˆi ( ) and vˆk, i−k as iterations k increase in Figure 3.1 and Figure 3.2, we will conduct a detailed simulation to show the actual error in convergence of ψˆi ( ) and vˆk, i−k in innovations algorithm. 72-year of monthly data for PARMA12 (1, 1) were simulated, as shown in Table 3.1, and the innovations algorithm was used to obtain parameter estimates. Some general conclusions can be drawn from this study, which in practice proves the results in (3.0.4) and (3.0.5). Define the relative error for ψˆi ( ) as follows: ψˆi ( )(k+1) − ψˆi ( )(k) ERR1k ( ) = . i ψˆi ( )(k) 54 (3.1.1) And define the maximum of the relative error R1k = max{ERR1k / Ik }, i ( ) : (i, ) ∈ where Ik is defined as Ik = {i, : p < p0 in (3.0.7)} And define the relative error for vˆk, i−k : ERR2k i = vˆk+1, i−(k+1) − vˆk, i−k vˆk, i−k . (3.1.2) And define the maximum of the relative error R2k = max{ERR2k i : ∀i = 0, . . . 
, 11}, Figure 3.3 illustrates plots of R1k and R2k against iterations k, which shows the convergence of ψˆi ( ) and vˆk, i−k as iterations k increase,where 0 ≤ i ≤ 11, 1 ≤ ≤ 2, k = 1, 2, . . . , 50. For ψˆi ( ) , the maximum relative error over all seasons i drop below 5% when k is between 10 and 20, and it increases when k is larger than 20, since the sample autocovariance becomes relatively small when the lag is large. Similar observation could be seen for vˆk, i−k , where the maximum relative error over all seasons i drop below 7% when k is between 10 and 20, and it increases when k is larger than 20. Therefore, we conclude that 10 to 20 iterations of the innovations algorithm are usually sufficient to obtain reasonable estimates of the model parameters. Furthermore, a convergence test for the model parameters φˆ and θˆ is shown in FIgure 3.4, and it shows similar results, which proves our previous 55 Table 3.2: Model parameters and estimates season i φi φˆi θi 0 0.20 0.34 0.70 0.60 0.72 0.10 1 2 0.60 0.56 -0.10 0.60 0.45 -0.10 3 4 0.30 0.30 0.50 5 0.90 0.85 -0.40 6 1.30 2.03 -0.20 0.60 0.06 -0.10 7 8 -1.90 -4.85 2.40 -0.09 0.38 0.70 9 0.70 0.65 -0.20 10 11 0.40 0.10 0.30 conclusion. 56 for simulated PARMA12 (1, 1) data. θˆi σi σˆi 0.58 11900 8766 0.10 11600 7626 -0.16 7300 6693 0.05 6000 4402 0.48 4200 3521 -0.29 4600 4287 -0.23 15200 15880 0.20 31100 22339 5.43 32800 31761 0.17 29700 29657 -0.25 15500 13426 0.51 12100 11126 0.3 0.2 0.1 Max relative error 0.4 Convergence of psi (median result in 100 simulations) 0.0 5% 0 10 20 30 40 50 Number of iterations 0.4 0.3 0.2 Max relative error 0.5 0.6 Convergence of sigma (median result in 100 simulations) 0.1 7% 0 10 20 30 40 50 Number of iterations Figure 3.3: Plots of R1k and R2k against iterations k, which show the convergence of ψˆi ( ) and vˆk, i−k as iterations k increase,where 0 ≤ i ≤ 11, 1 ≤ ≤ 2, k = 1, 2, . . . , 50. 57 0.3 0.2 Max relative error 0.4 0.5 Convergence of phi (median result in 100 simulations) 0.0 0.1 9% 0 10 20 30 40 50 Number of iterations 3 2 50% 0 1 Max relative error 4 5 Convergence of theta (median result in 100 simulations) 0 10 20 30 40 50 Number of iterations Figure 3.4: Convergence of φˆi and θˆi as iterations k increase,where 0 ≤ i ≤ 11, k = 1, 2, . . . , 50. 58 Chapter 4 Application to a Natural River Flow In this chapter, I apply the R codes in Chapter 3 to forecast future flows for a time series of monthly river flows, get 95% Gaussian prediction bounds for these forecasts. I model a monthly river flow time series from the Fraser River at Hope, British Columbia, which is the longest river in British Columbia, traveling almost 1400 km and sustained by a drainage area covering 220,000 square kilometers. See for maps and river flow data downloads at http://www.wateroffice.ec.gc.ca/ . Daily discharge measurements, in cubic meters per second (cms), were averaged over each of the respective months to produce monthly Fraser River flow time series. The series contains 72 years of data from October 1912 to September 1984, and part of the data is shown in Figure 4.1. To better test the prediction results, I based our forecast on the first 70 years of data, from 1912 to 1982. Then, I computed a 24-month prediction from October 1982 to September 1984. In our analysis, i = 0 corresponds to October and i = 11 corresponds to September. Using the “water year” starting on 1 October is customary for modeling of river flows, because of low correlation between Fall monthly flows. 
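As a side note on data preparation, the monthly series and the 70-year fitting window can be assembled in a few lines of base R. The sketch below is only illustrative: the data frame daily and its columns Date and Flow are hypothetical names for the daily discharge records downloaded from the Water Office site referenced above.

# Sketch: monthly means and the 70-year fitting window (hypothetical input 'daily'
# with columns Date and Flow holding the daily discharge records in cms).
monthly <- aggregate(Flow ~ format(Date, "%Y-%m"), data = daily, FUN = mean)
names(monthly) <- c("month", "flow")
monthly <- monthly[order(monthly$month), ]        # "YYYY-MM" keys sort chronologically
flow <- ts(monthly$flow, start = c(1912, 10), frequency = 12)
fit_window <- window(flow, start = c(1912, 10), end = c(1982, 9))   # first 70 water years, used to fit
hold_out   <- window(flow, start = c(1982, 10), end = c(1984, 9))   # 24 months held back for comparison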
The sample seasonal mean, 59 250000 100000 0 Flow (cms) 10/1912 10/1918 10/1924 Month / Year Figure 4.1: Average monthly flows in cubic meters per second (cms) for the Fraser River at Hope, BC indicate a seasonal pattern. standard deviation and autocorrelations at lag 1 and lag 2 are given in Table 4 and Figure 4.2, with 95% confidence intervals. The non-stationary of the series is apparent, since the mean, standard deviation and correlation functions vary significantly from month to month (the confidence bands for some wet and dry seasons don’t overlap). Removing the periodicity in mean and variance will not yield a stationary series. Therefore a periodically stationary series model is appropriate. Tesfaye, et al. [42], identified a PARMA12 (1, 1) model Xt − φi Xt−1 = εt + θi εt−1 , 60 (4.0.1) 50000 0 20000 Sample sd (cms) 350000 150000 0 2 4 6 8 10 0 2 4 6 8 10 (c) Sample Autocorrelations : lag = 1 (d) Sample Autocorrelations : lag = 2 2 4 6 8 10 0.5 0.0 −0.5 0.8 0.4 0 1.0 Season Autocorrelations Season 0.0 Autocorrelations (b) Sample Standard Deviations 0 Sample mean (cms) (a) Sample Means 0 Season 2 4 6 8 10 Season Figure 4.2: Statistics for the Fraser river time series: (a) seasonal mean; (b) standard deviation; (c,d) autocorrelations at lags 1 and 2. Dotted lines are 95% confidence intervals. Season = 0 corresponds to October and Season = 11 corresponds to September. ˜ − µ , E (ε ) = 0, V (ε ) = σ 2 , σ −1 ε iid, for the series with S = 12, and where Xt = X t t t t t t t used the innovations algorithm at k = 20 iterations for periodically stationary processes by (3.0.1) and (3.0.8) to estimate φi (1), θi (1), and σi , i = 0, 1, . . . , 11. Table 4 gives the parameter estimates of the model where φˆ = (φ0 (1), . . . , φ11 (1)) , θˆ = (θ0 (1), . . . , θ11 (1)) , and σ ˆ = (σ0 , . . . , σ11 ) . I employ these estimates as the parameters of the model. Although the model is periodically stationary, the standardized residuals √ (3.0.10) should be stationary, so the standard 95% limits ( that is, 1.96/ n ) still apply. Figure 4.3 shows the ACF and PACF of the model residuals. Although a few values lie 61 √ Figure 4.3: ACF and PACF for model residuals, showing the bounds ±1.96/ N , indicate no serial dependence. With no apparent pattern, these plots indicate that the PARMA12 (1, 1) model is adequate. 62 Table 4.1: Parameter estimates for the PARMA model (1.0.1) of average monthly flow series for the Fraser River near Hope BC from October 1912 to September 1982. month φˆ θˆ σ ˆ OCT NOV DEC JAN FEB MAR APR MAY JUN JUL AUG SEP 0.187 0.592 0.575 0.519 0.337 0.931 1.286 1.059 -2.245 -1.105 0.679 0.353 0.704 0.050 -0.038 -0.041 0.469 -0.388 -0.088 -0.592 2.661 0.730 -0.236 0.326 11761.042 11468.539 7104.342 5879.327 4170.111 4469.202 15414.905 30017.508 32955.491 30069.997 15511.989 12111.919 Table 4.2: Sample mean, standard deviation and autocorrelation at lag 1 and 2 for an average monthly flow series for the Fraser River near Hope BC, from October 1912 to September 1982. 
month OCT NOV DEC JAN FEB MAR APR MAY JUN JUL AUG SEP µ ˆ 69850 55824 40502 33006 30740 29348 58959 173308 249564 198844 127138 86437 Parameter 1 γˆ (0) 2 19976 17709 12858 9269 8878 8864 20314 39437 45154 42627 28253 20071 63 ρˆ(1) 0.712 0.748 0.731 0.786 0.787 0.504 0.333 0.260 0.577 0.780 0.720 0.621 ρˆ(2) 0.515 0.577 0.541 0.697 0.380 0.286 -0.286 -0.031 0.499 0.456 0.308 0.472 350000 250000 150000 0 50000 Flow (cms) 10/1982 02/1983 06/1983 10/1983 02/1984 06/1984 Month / Year Figure 4.4: 24-month forecast (solid line with dots) based on 70 years of Fraser river data, with 95% prediction bounds (dotted lines). For comparison, the actual data (solid line) is also shown. This data was not used in the forecast. slightly outside of the 95% confidence bands, there is no apparent pattern. The p value from the Ljung-Box test was 0.08 indicating that I do not reject the null hypothesis that the residuals resemble iid white noise. Hence we conclude that the PARMA12 (1, 1) model is adequate. If I reject the null hypothesis that PARMA(1, 1) model residuals resemble iid noise, I would abandon that model and fit a PMA(q) model to the data, q ≥ 2. Using Theorem 1 in [8], I would identify the order, q, of the pure moving average and then parsimoniously estimate the moving average parameters that I deem to be nonzero. However, from Tesfaye, 64 et al. [42], the PARMA(1, 1) model is generally adequate to model most river flow time series. I then compute a 24-step future prediction for the Fraser river data, that is, a forecast for the next 24 months from October 1982 to September 1984.. The prediction is compared to the original data in Figure 4.4. Note that the forecast curve is close to the original data curve, and that the historical Fraser River data stay well within the 95% prediction bands. Figure 4.5 illustrates how the width of the prediction intervals vary with the season. This is a consequence of the non-stationarity of the river flow series, and specifically the fact that the standard deviation and the correlation functions vary significantly from month to month. 65 50000 0 −50000 Width of prediction bounds Width of prediction bounds (mean subtracted) 5 10 15 20 Month Figure 4.5: Width of 95% prediction bounds for the Fraser river. 66 Chapter 5 Maximum likelihood estimation and reduced model 5.1 Maximum likelihood Function for PARMAS (p, q) Model (i) (i) (i) By Yule-Walker equations, the vector of autoregressive coefficients φn = (φn,1 , . . . , φn,n ) solves the prediction equations (i) (i) Γn,i φn = γn (5.1.1) (i) with γn = (γi+n−1 (1), γi+n−2 (2), . . . , γi (n)) and Γn,i = γi+n− ( − m) (5.1.2) ,m=1,...,n is the covariance matrix of Xn,i = (Xi , ..., Xi+n−1 ) for each i = 0, ..., S − 1. If σi2 > 0 for i = 0, . . . , S − 1, then for a causal PARMAS (p, q) process the covariance matrix Γn,i is 67 nonsingular for every n ≥ 1 and each i in Lund and Basawa [22, Theorem 3.1]. Assuming that Xn,i is a Gaussian process, its likelihood function can be written using the general formula for a multivariate normal random vector with a given mean and covariance matrix L(Γn,i ) = (2π)−n/2 (det Γn,i )−1/2 exp − 21 Xn,i Γ−1 n,i Xn,i , (5.1.3) where Γn,i is the covariance matrix of Xn,i defined in (5.1.2). 
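To make the cost of (5.1.3) concrete, a direct evaluation of −2 log L would look like the short R sketch below, where the data vector x and the covariance matrix Gamma are assumed to be already available. The determinant and the quadratic form require an O(n³) factorization of the n × n matrix Γn,i, which is precisely the computation that the innovations representation developed next avoids.

# Sketch: direct evaluation of -2 log L(Gamma_{n,i}) in (5.1.3).
# x is the observed (mean-zero) data vector, Gamma its n x n covariance matrix.
neg2loglik_direct <- function(x, Gamma) {
  n <- length(x)
  R <- chol(Gamma)                          # upper-triangular factor, Gamma = t(R) %*% R
  z <- backsolve(R, x, transpose = TRUE)    # solves t(R) z = x, so sum(z^2) = x' Gamma^{-1} x
  n * log(2 * pi) + 2 * sum(log(diag(R))) + sum(z^2)
}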
Whenever Γn,i is invertible, the direct calculation of det Γn,i and Γ−1 n,i can be avoided ˆ by expressing it in terms of the one-step predictors X i+j and the mean square errors vj,i , both of which are easily calculated recursively from the innovations algorithm, where vj,i is from (3.0.1), and ˆ (i) ∼ N (0, v ), Xi+j − X j,i i+j j = 0, 1, . . . , n − 1, Recall the innovations representation for PARMA models in Anderson et al [6], where     t=0   0, (i) ˆ t X (i) i+t =  ˆ (i)  θt,j (Xi+t−j − X  i+t−j ),   j=1 (5.1.4) t = 1, . . . , n − 1. Then the computation for (5.1.3) can be simplified in the following theorem: Theorem 5.1.1. When Γn,i is invertible, the likelihood (5.1.3) can be equivalently written 68 in the form of L(Γn,i ) =(2π)−n/2 (v0,i v1,i · · · vn−1,i )−1/2 n−1 1 ˆ (i) )2 /v exp − 2 (Xi+j − X j,i , i+j j=0 (5.1.5) ˆ where X i+j are the one-step predictors in (2.1.13) and vj,i are the mean square errors in (3.0.1). Proof. For each i, define the n × n lower triangular matrix Ci = [θk,k−j (i)]n−1 , k,j=0 (5.1.6) where det Ci = 1, since all the diagonal elements θk,0 (i) = 1 for k = 0, . . . , n − 1, and define the n × n diagonal matrix, Di = diag(v0,i , v1,i , . . . , vn−1,i ). (5.1.7) ˆ (i) = (X ˆ (i) , . . . , X ˆ (i) Therefore X n,i i i+n−1 ) can be written in the from, ˆ ˆ (i) X n,i = (Ci − I)(Xn,i − Xn,i ), (5.1.8) where I is the n × n identity matrix. Hence, ˆ ˆ ˆ Xn,i = Xn,i − X n,i + Xn,i = Ci (Xn,i − Xn,i ). 69 (5.1.9) Since Di = E ˆ Xn,i − X n,i ˆ Xn,i − X n,i , and Γn,i = E Xn,i Xn,i , it follows that Γn,i = Ci Di Ci . (5.1.10) By (5.1.9) and (5.1.10), we obtain −1 ˆ ˆ Xn,i Γ−1 n,i Xn,i = (Xn,i − Xn,i ) Di (Xn,i − Xn,i ) = n−1 ˆ (i) )2 /v , (5.1.11) (Xi+j − X j,i i+j j=0 and det Γn,i = (det Ci )2 (det Di ) = v0,i v1,i · · · vn−1,i . (5.1.12) Therefore the likelihood (5.1.3) of the vector Xn,i reduces to (5.1.5). Whenever Γn,i is invertible, the likelihood in (5.1.3) has the equivalent innovations representation as (5.1.5). Actually Γn,i is invertible for causal PARMA models, see the proof in Lund and Basawa [23, Proposition 4.1]. Given the seasonal one-step error variances vj,i from the innovations and ˆ (i) , then L(Γ ) in (5.1.3) can easily be calculated. the one-step ahead predictor X n,i i+j For notation, let φt = (φt (1), . . . , φt (p)) and θ t = (θt (1), . . . , θt (q)) denote the autoregressive and moving-average parameters during season t, respectively. Then we can write L(Γn,i ) = L(β), where we use β = (φ0 , θ 0 , φ1 , θ 1 , . . . , φS−1 , θ S−1 ) to denote the collection of all PARMAS (p, q) parameters. The dimension of β is (p+q)S ×1. Following the ideas 2 ) from Basawa and Lund [24], we treat the white noise variances σ 2 = (σ02 , σ12 , . . . , σS−1 70 as nuisance parameters. Then the likelihood function is given by LX (β) = (2π)−n/2 (v0,i v1,i · · · vn−1,i )−1/2 exp − 21 n−1 ˆ (i) )2 /v (Xi+j − X j,i . i+j j=0 (5.1.13) Another way to compute the PARMA likelihood parameter estimates is to equivalently minimize the negative log likelihood n−1 n−1 ˆ (i) )2 /v . −2 log{L(β)} = n log(2π) + log(vj,i ) + (Xi+j − X j,i i+j j=0 j=0 (5.1.14) ˆ According to Basawa and Lund [24], once we obtain the maximum likelihood estimate β, the MLE of σi2 for 0 ≤ i ≤ S − 1, have the large sample form σ ˆi2 = N −1 N −1 ˆ 2, εjS+i (β) j=0 2 ) . However, they didn’t give a proof for the large sample form. where σ ˆ 2 = (ˆ σ02 , . . . 
, σ ˆS−1 In the following we propose our method of MLE computation, where there is an exact form of MLE for σ 2 , and the computation is much faster. Given the data set Xi , . . . , Xi+n−1 , ˆ (i) , j = 0, . . . , n − 1 apply innovations algorithm to compute the 1-step ahead predictors X i+j 2 ˆ (i) and the forecast errors vj,i = E Xi+j − X . i+j Corollary 5.1.2. |rj,i − 1| → 0 as j → ∞. Proof. By Anderson et al. [6, Corollary 2.2.1] vk, i−k → σi2 , 71 we have 2 | → 0, as j → ∞. |vj,i − σi+j 2 r , then we have Recall that vj,i = σi+j j,i 2 r − σ 2 | → 0, as j → ∞, |σi+j j,i i+j so that 2 |r − 1| → 0, as j → ∞. σi+j j,i 2 is periodic of S seasons, so Notice that for PARMAS (p, q) model σi+j 2 ≤ max{σ 2 , . . . , σ 2 } < ∞, 0 < σi+j 0 S−1 then we have proved |rj,i − 1| → 0, as j → ∞. 72 Then the negative log likelihood (5.1.14) becomes n−1 (β) = −2 log{L(β)} = n log(2π) + n−1 2 ˆ (i) Xi+j − X /vj,i i+j log(vj,i ) + j=0 S−1 N −1 = n log(2π) + j=0 log rkS+i σi2 i=0 k=0 S−1 N −1 ˆ (i) XkS+i − X kS+i + i=0 k=0 N −1 = n log(2π) + 2 (5.1.15) / rkS+i σi2 log(rj ) + N log(σi2 ) + j=0 where N −1 (X Si = k=0 S−1 Si /σi2 , i=0 2 ˆ (i) kS+i − XkS+i ) . rkS+i Then for i = 0, . . . , S − 1, we have ∂ ∂σi2 = N −1 + S = 0, 2 i σi2 2 σi therefore N σi2 − Si = 0, and S σi2 = i . N So the MLE of σi2 is 1 σ ˆi2 = N N −1 ˆ (i) XkS+i − X kS+i k=0 73 2 /rkS+i = Si /N, (5.1.16) ˆ (i) where X and rkS+i come from the innovations algorithm applied to the model. Now kS+i we can substitute σ ˆi2 for σi2 in the negative log likelihood to get n−1 −2 log{L(β)} = n log(2π) + n + S−1 log(rj,i ) + N j=0 log Si /N . i=0 Then it suffices to minimize ∗ (β) = 1 N n−1 S−1 log(rj,i ) + j=0 log Si /N . i=0 Also, since rj,i → 1 as j → ∞, then log(rj ) → 0, so 1 N and approximate MLE minimizes n−1 log(rj ) ≈ 0, j=0 S−1 log S /N . i i=0 Example 5.1.3. A PARMA(1, 1) example from Chapter 4 will be shown here, to demonstrate how to explicitly solve the mean square errors, and obtain the likelihood value and maximum likelihood estimate. We adopt the 70 years of monthly Fraser river data from October 1912 to September 1982, and since there are 840 observations, we denote them as {Xt } = {X0 , X1 , . . . , X839 }. Therefore the negative log likelihood is simplified as n−1 ˆ )2 /v log(vj ) + (Xj − X j j j=0 j=0 N −1 S−1 2 = n log(2π) + log(rj ) + N log(σi ) + Si /σi2 , j=0 i=0 −2 log{L(β, σ 2 )} = n log(2π) + n−1 74 (5.1.17) 2 ) . Computations in where β = (φ0 , θ0 , φ1 , θ1 , . . . , φ11 , θ11 ) and σ 2 = (σ02 , σ12 , . . . , σ11 Lund and Basawa [23, Equations (2.7), (2.11)] show that the PARMA(1, 1) model is causal when |φ0 φ1 · · · φ11 | < 1 and invertible when |θ0 θ1 · · · θ11 | < 1. By Theorem 2.1.7, we may write ˆ =0 X 0 ˆ =θ ˆ X 1 1,1 X0 − X0 (5.1.18) ˆ =φ X ˆ X t t t−1 + θt,1 Xt−1 − Xt−1 t = 2, . . . , 839. For the covariance structure of {Xt }, we will use the γt (h) results from Section 1.1. The innovations algorithm explicitly identifies the prediction coefficients as −1 C (t, t − 1) = v −1 θ σ 2 . θt,1 = vt−1 t−1 t t−1 (5.1.19) Use (5.1.19) in innovation algorithm gives −1 2 4 2 v 2 2 2 vt = C(t, t) − θt,1 t−1 = (σt + θt σt−1 ) − vt−1 θt σt−1 , (5.1.20) where t ≥ 1, and the initial condition is v0 = γ0 (0). Equation (5.1.20) is a difference equation for vt , which can be explicitly solved as follows. 
First, write y vt = σt2 t , yt − 1 yt−1 2 vt−1 = σt−1 , yt−1 − 1 75 (5.1.21) and substitute (5.1.21) in (5.1.20) to get 4 θt2 σt−1 (yt−1 − 1) yt 2 2 2 2 σt = (σt + θt σt−1 ) − 2 yt − 1 yt−1 σt−1 2 )y 2 2 σt2 yt yt−1 = (σt2 + θt2 σt−1 t−1 (yt − 1) − θt σt−1 (yt−1 − 1)(yt − 1), therefore yt = 1 + σt2 t ≥ 1, yt−1 σt−1 θt2 (5.1.22) σ0−2 γ0 (0) where the initial condition of (5.1.22) is y0 = −2 . Letting σ0 γ0 (0)−1 σt2 P (t) = , 2 θ2 σt−1 t (5.1.23) then the solution to (5.1.22) is yt = 1 + P (t)yt−1 = 1 + P (t) 1 + P (t − 1) yt−2 = 1 + P (t) 1 + P (t − 1) 1 + P (t − 2) yt−3 (5.1.24) .. .  t  t  t   =1+ P (r) y0 + P (r) , r=1 j=2 r=j In the computation, notice that t j=2 t ≥ 1. t r=j P (r) is the row-summation of the product 76 of each row in a matrix, where the matrix is   1 P (2) P (3) . . . . . . P (t)     1  1 P (3) . . . . . . P (t)       . 1 1 1 P (4) . . . P (t)     .. .. .. .. ..   .. . . . . . .      1 1 1 ... 1 P (t) Use (5.1.24) in (5.1.21) to compute vt . Then use vt in (5.1.19) to compute θt,1 . Next obtain ˆ in (5.1.18). Finally substitute into (5.1.17) to get the negative log likelihood function. X t In Chapter 4, we fit a PARMA(1,1) model to the 70-year Fraser river flows data. The model parameters θˆt , φˆt and σ ˆt are shown in Table 4 of Chapter 4. Using these values to ˆ in this example, we obtain the negative log likelihood value of 18543 compute vt and X t for this model. However, in the following we consider to take the logarithm of the data, for all computations in MLE. The model parameters for the logarithm of the data are shown in Table 5.1.3, where the negative log likelihood value is -362.655. Season θˆi φˆi σi Season θˆi φˆi σi To get the maximum 0 1 2 3 4 5 0.650 0.206 -0.026 -0.033 0.426 -0.400 0.393 0.740 0.712 0.715 0.431 1.020 0.165 0.187 0.164 0.159 0.135 0.134 6 7 8 9 10 11 0.446 -0.171 1.918 0.836 -0.299 0.540 0.425 0.313 -1.628 -0.045 0.979 0.414 0.260 0.184 0.127 0.143 0.108 0.138 Table 5.1: Estimated parameters for PARMA12 (1, 1) model, from innovations algorithm, using the logarithm of the original data. The resulting negative likelihood value is -362.655. likelihood estimators, we will differentiate −2 log L(β) partially with respect to the parame77 ters. A non-linear optimization algorithm will be applied, using R software to find the MLE. 2 ) as By our earlier assumption, we treat the white noise variances σ 2 = (σ02 , σ12 , . . . , σS−1 nuisance parameters, so that the independent variables in the optimization are the model ˆ we solve parameters β = (φ0 , θ0 , φ1 , θ1 , . . . , φS−1 , θS−1 ) . After we compute the MLE β, 2 ) using the MLE for σ ˆ 2 = (ˆ σ02 , . . . , σ ˆS−1 1 σ ˆi2 = N N −1 ˆ (i) XkS+i − X kS+i 2 /rkS+i = Si /N, k=0 The function optim in R is used, a general-purpose optimization based on Nelder–Mead, quasi-Newton and conjugate-gradient algorithms. It includes an option for box-constrained optimization and simulated annealing. Several different options were explored, using our likelihood function, and the initial values from Table 4. A summary of the performance of the different options is provided in Table 5.2. The BFGS option was found to be the fastest convergent method. This option uses the results published simultaneously by Broyden, Fletcher, Goldfarb and Shanno in 1970 [12], [17], [19] and [38], which uses function values and gradients to approximate the optimization surface. The parameter estimates that resulted using the BFGS algorithm are listed in Table 5.3. 
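The optimization itself reduces to a single call to optim once the negative log likelihood has been coded. The sketch below is illustrative only: negloglik is a hypothetical wrapper that unpacks the 24 parameters (φ0, θ0, . . . , φ11, θ11), runs the recursions of Example 5.1.3 on the log-transformed data and returns −2 log L(β), while beta0 holds the innovations-algorithm estimates used as starting values.

# Sketch: BFGS maximum likelihood fit, with the sigma_i^2 treated as nuisance
# parameters. 'negloglik' and 'beta0' are hypothetical (see the lead-in above).
fit <- optim(par = beta0, fn = negloglik, method = "BFGS",
             control = list(maxit = 500))
beta_hat <- fit$par        # MLE of (phi_0, theta_0, ..., phi_11, theta_11)
fit$value                  # attained value of -2 log L(beta_hat)
fit$convergence            # 0 indicates that optim reports convergence
# With beta_hat in hand, sigma_hat_i^2 = S_i / N follows from (5.1.16), using the
# innovations algorithm output evaluated at beta_hat.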
Table 5.2: A comparison of the different algorithms available in the optim function.

Method         −2 log L(β̂)    Running time    Converged    Depends on initial value
BFGS           −463.8585       137s            Yes          Yes
CG             −505.37         282s            No           Yes
SANN           −362.65         378s            Yes          Yes
Nelder-Mead    1182.61         17.29s          No           Yes

In order to test convergence, we use the result in Table 5.3 as an initial value and run the optimization procedure again. The algorithm converges quickly to the same value of −2 log{L(β̂)} as before.

Table 5.3: MLE results from the BFGS method, with the σi treated as nuisance parameters. The resulting MLE value was −2 log L(β̂) = −463.8585.

Season    θ̂i       φ̂i       σi     |  Season    θ̂i       φ̂i       σi
0         0.586    0.546    0.165  |  6         0.204    0.720    0.260
1         0.154    0.800    0.187  |  7         0.071    0.162    0.184
2         0.119    0.709    0.164  |  8         1.035   −0.690    0.127
3        −0.008    0.679    0.159  |  9         0.398    0.417    0.143
4         0.085    0.738    0.135  |  10       −0.110    0.855    0.108
5        −0.206    0.884    0.134  |  11        0.415    0.558    0.138

Next we plot −2 log{L(β, σ²)} as a function of each σi, for i = 0, 1, . . . , 11, giving the 12 plots shown in Figure 5.1. There is a clear minimum point in each plot, which we locate using the optimize function in R; the minimizing value is denoted σi*, with σ* = (σ0*, σ1*, . . . , σ11*). With these σi* values we run optim again for −2 log{L(β, σ²)}, using the same BFGS algorithm, resulting in a lower negative log likelihood value of −506.6053. For different starting values and different numbers of iterations, the routine converges to the same optimum. See the results in Table 5.4.

Finally, to check our results, we use one-variable optimizations to minimize the negative log likelihood as a function of each individual parameter, using similar techniques to find the minimum points θ* = (θ0*, θ1*, . . . , θ11*) in Figure 5.2 and φ* = (φ0*, φ1*, . . . , φ11*) in Figure 5.3. Since these results are consistent with the MLE results from the BFGS optimization shown in Table 5.4, we are confident that the values reported in Table 5.4 represent the true MLE. We therefore take the results in Table 5.4 as our final optimization results for the model parameters.

Figure 5.1: −2 log{L(β, σ²)} as a function of each σi, i = 0, . . . , 11, with the remaining parameters fixed.
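The twelve one-dimensional searches behind Figure 5.1 can be carried out with R's optimize function. In the sketch below, negloglik_sigma is a hypothetical function returning −2 log{L(β, σ²)} with β fixed at the BFGS estimate, the i-th standard deviation set to s, and the remaining σ's held at their current values; the search interval is likewise an assumption chosen to cover the range shown in the plots.

# Sketch: profile each sigma_i by one-dimensional minimization, as in Figure 5.1.
# 'negloglik_sigma(s, season)' is hypothetical (see the lead-in above).
sigma_star <- numeric(12)
for (i in 0:11) {
  opt <- optimize(function(s) negloglik_sigma(s, season = i),
                  interval = c(0.05, 1.5))
  sigma_star[i + 1] <- opt$minimum
}
# sigma_star is then held fixed while optim(..., method = "BFGS") is rerun,
# which is how the lower value -2 log L = -506.6053 in Table 5.4 was obtained.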
Figure 5.2: −2 log{L(β, σ²)} as a function of each θi, i = 0, . . . , 11, with the remaining parameters fixed.

Figure 5.3: −2 log{L(β, σ²)} as a function of each φi, i = 0, . . . , 11, with the remaining parameters fixed.

Table 5.4: MLE by the BFGS method, using the values of σ* from Figure 5.1. The resulting MLE value was −2 log L(β̂) = −506.6053. We take these parameters as our best optimization results.

Season    θ̂i       φ̂i       σi*    |  Season    θ̂i       φ̂i       σi*
0         0.586    0.546    0.199  |  6         0.204    0.720    0.300
1         0.154    0.800    0.208  |  7         0.071    0.162    0.219
2         0.119    0.709    0.194  |  8         1.035   −0.690    0.153
3        −0.008    0.679    0.171  |  9         0.398    0.417    0.162
4         0.085    0.738    0.166  |  10       −0.110    0.855    0.126
5        −0.206    0.884    0.151  |  11        0.415    0.558    0.154

5.2 Reduced PARMAS (1, 1) model.

To obtain additional parsimony, it is also advisable to consider simpler models in which some statistically insignificant model parameters are set to zero.

5.2.1 Reduced model by asymptotic distribution of ψ̂i(ℓ)

In the discussion of a reduced PARMAS (1, 1) model, by (3.0.8), φ̂t(1) = ψ̂t(2)/ψ̂t−1(1) and θ̂t(1) = ψ̂t(1) − φ̂t(1). In order to obtain the estimates of the PARMA12 (1, 1) model parameters, we therefore need only consider ψ̂i(ℓ) at lags 1 and 2. An α-level test rejects the null hypothesis H0 : ψi(ℓ) = 0 in favor of the alternative Ha : ψi(ℓ) ≠ 0 (indicating that the model parameter is statistically significantly different from zero) if |Z| > zα/2. The p-value for this test is given by (3.0.7), and it provides a way to determine which coefficients in the identified PARMA model are statistically significantly different from zero (those with a small p-value, say, p < 0.05). Coefficients with a higher p-value are shown in bold font in Table 3.1, and we set those coefficients ψ̂i(ℓ) = 0.

Table 5.5: Parameter estimates for the reduced PARMA model (1.0.1) of the average monthly flow series for the Fraser River near Hope, BC, from October 1912 to September 1982. The resulting negative log likelihood value is −351.3939.

i      0      1      2      3      4      5      6      7      8      9      10     11
θi     1.04   0.20  −0.02  −0.03   0.42  −0.40   0.87   0.00  −0.71   0.79  −0.29   0.54
φi     0.00   0.74   0.71   0.71   0.43   1.02   0.00   0.00   1.00   0.00   0.97   0.41
σi     0.16   0.18   0.16   0.15   0.13   0.13   0.26   0.18   0.12   0.14   0.10   0.13
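In code, the screening step described above is a vectorized two-sided Z-test. The sketch below is illustrative: psi and se are hypothetical S × 2 matrices holding the lag-1 and lag-2 estimates ψ̂i(ℓ) and their asymptotic standard errors, and the lag-1 entries are assumed to remain nonzero so that the ratio in (3.0.8) stays well defined.

# Sketch: set insignificant psi_i(l) to zero at level alpha = 0.05, then form the
# reduced PARMA(1,1) parameters via (3.0.8). 'psi' and 'se' are hypothetical
# S x 2 matrices (rows = seasons 0, ..., S-1; columns = lags 1 and 2).
alpha <- 0.05
pval  <- 2 * (1 - pnorm(abs(psi / se)))        # two-sided p-values, as in (3.0.7)
psi_red <- ifelse(pval < alpha, psi, 0)        # keep only the significant coefficients
S <- nrow(psi_red)
prev <- c(S, seq_len(S - 1))                   # row index of season i-1, wrapping around
phi_red   <- psi_red[, 2] / psi_red[prev, 1]   # phi_t(1) = psi_t(2) / psi_{t-1}(1)
theta_red <- psi_red[, 1] - phi_red            # theta_t(1) = psi_t(1) - phi_t(1)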
Using (3.0.8), we then list in Table 5.5 the parameter estimates of the reduced PARMA12 (1, 1) model, where autoregressive coefficients in season 0, 6, 7 and 9 are set to 0. Here we apply innovations algorithm to get our model parameters. Later in this chapter we will shown a different method using MLE. Once we obtain the estimates for the reduced PARMA12 (1, 1) model parameters, we can carry out forecast procedures similar to the full PARMA12 (1, 1) model, as follows. Since autoregressive coefficients in season 0, 6, 7 and 9 are set to 0, the computation is simpler for the Wt process. • Compute the transformed process (2.1.2) using the reduced model parameters. • Compute the sample autocovariance of that process by Proposition 2.1.3. • Apply the innovations algorithm (2.1.9) to get the projection coefficients θn,j . ˆ n for n = 1, 2, . . . , 840 = • Use (2.1.13) to compute the one-step-ahead predictors X 70 × 12. • Apply (2.1.17) to get the forecasts. • Use the asymptotic formula (2.2.3) to compute 95% prediction bounds, based on the assumption of Gaussian innovations. 84 The resulting prediction, along with the 95% prediction bands, is shown in Figure 5.4. The actual data (solid line) is also shown for comparison. Note that the forecast (solid line with dots) is in reasonable agreement with the actual data (which were not used in the forecast), and that the actual data lies well within the 95% prediction bands. Furthermore, we give a detailed comparison in Figure 5.5 between full model and the reduced one. As shown in the first two graphs, the prediction results are quite similar. In a further investigation, we check the difference and relative error between the two forecasts, for every step. The differences are small. In fact, when compared to predictions, the relative errors are small. When the step h ≥ 5, the relative errors are nearly zero. We conclude that the removed autoregressive coefficients in season 0, 6, 7 and 9 are insignificant, and the reduced model is a viable substitute for the full model. We prefer the reduced model, since it has fewer parameters. 5.2.2 Reduced model by asymptotic distribution of MLE Another method for obtaining a reduced model is to apply the asymptotics of MLE for the PARMA process, which were developed in Basawa and Lund [24]. For a causal and invertible Gaussian PARMA model, with the assumption of {εt } being periodic i.i.d. Gaussian noise, ˆ as N → ∞, Basawa and Lund [24, Theorem 3.1] gives the asymptotic distribution of β, N 1/2 (βˆ − β) → N 0, A−1 (β, σ 2 ) , (5.2.1) where A(β, σ 2 ) = S−1 i=0 85 σi−2 Γi (β, σ 2 ), (5.2.2) 11 9 10 Flow (cms) 12 13 Forecasting by reduced model 10/1982 02/1983 06/1983 10/1983 02/1984 06/1984 Month / Year Figure 5.4: A 24-month forecast using the reduced PARMA12 (1, 1) model. Solid line is the original data; Solid line with dots is the reduced PARMA(1,1) forecast; The two dotted lines are 95% Gaussian prediction bounds. The 24-month forecast is from October 1982 to September 1984. and Γi (β, σ 2 ) = E ∂εt (β) ∂εt (β) )( ∂β ∂β . Furthermore, Basawa and Lund [24, Remark 3.1] states that if {εt } is Gaussian, then the maximum likelihood estimate has the same asymptotic distribution as the weighted least ˆ the MLE of σ 2 for squares estimate. 
Once we obtain the maximum likelihood estimate β, i 0 ≤ i ≤ S − 1 could be computed by 1 σ ˆi2 = N N −1 ˆ (i) XkS+i − X kS+i k=0 86 2 /rkS+i = Si /N, (5.2.3) 12 10 9 5 10 15 20 5 10 15 20 Month Difference of forecast between full model and reduced model Relative error between full model and reduced model 0.010 0.002 0.006 0.10 Flow (cms) 0.15 0.014 Month 0.05 Flow (cms) 11 Flow (cms) 11 9 10 Flow (cms) 12 13 Reduced model forecast vs Data 13 Full model forecast vs Data 5 10 15 20 5 Month 10 15 20 Month Figure 5.5: A comparison of 24-month forecasts between the full PARMA12 (1, 1) model and reduced PARMA12 (1, 1) model. In the first two graphs, dotted line is forecast, and solid line is real data. 87 2 ). ˆS−1 where σ ˆ 2 = (ˆ σ02 , . . . , σ Example 5.2.1. We adopt the causal and invertible first-order PARMA12 (1, 1) model from Example 5.1.3, and show how to compute the asymptotic covariance matrix A−1 (β, σ 2 ). diag(A−1 ) The results are shown in Table 5.6, where se = stands for the standard error N 1/2 ˆ for β. In the following, p = q = 1, and S = 12 denotes the number of seasons, and Season 0 1 2 3 4 5 6 7 8 9 10 11 γˆi (0) se(θˆi ) se(φˆi ) 0.06 0.36 0.30 0.07 0.26 0.21 0.06 0.23 0.19 0.05 0.22 0.18 0.05 0.27 0.21 0.05 0.24 0.17 0.16 0.29 0.19 0.08 0.08 0.06 0.11 0.25 0.19 0.05 0.19 0.09 0.03 0.18 0.14 0.03 0.24 0.18 Table 5.6: MLE and its standard error. i = 0, 1, . . . , S − 1, and t denotes the corresponding season index for t, where t =     t − S[t/S] if t ≥ 0,    S + t − S[t/S + 1] if t < 0, and [ · ] stands for the greatest integer function. The model is Xt = φt Xt−1 + εt + θt εt−1 . 88 (5.2.4) Taking partial derivations in (5.2.4) gives ∂ε (β) ∂εt (β) = −e2 t +1 Xt−1 − θt t−1 − e2 t +2 εt−1 (β), ∂β ∂β (5.2.5) where ej denotes a (p + q)S × 1 unit vector whose entries are all zero, except for a 1 in the ∂ε (β) j th row. Note that in the following Xt−1 , θt and εt−1 (β) are scalars, however t and ∂β ej denote the (p + q)S × 1 vectors. Therefore by causality 2 , E Xt−1 εt−1 (β) = E εt−1 (β)Xt−1 = E ε2t−1 (β) = σt−1 by Leibniz integral rule, provided that εt−1 (β) and ∂εt−1 (β) are both continuous, we take ∂β the derivative outside the expectation, E ∂εt−1 (β) ∂ ∂ 2 = 0, Xt−1 = E εt−1 (β)Xt−1 = σt−1 ∂β ∂β ∂β where 0 is a (p + q)S × 1 zero vector. Similarly  E Xt−1 ∂εt−1 (β) ∂β  =0, and E ∂εt−1 (β) ∂ ∂ 2 εt−1 (β) = E εt−1 (β)εt−1 (β) = σt−1 = 0, ∂β ∂β ∂β and similarly  E  ∂εt−1 (β) ∂β εt−1 (β) = 0 . Therefore, if we multiply both sides of (5.2.5) by its own transpose and take an expectation, 89 we have E ∂εt (β) ∂β ∂εt (β) ∂β = E − e2 t +1 Xt−1 − θt ∂εt−1 (β) − e2 t +2 εt−1 (β) ∂β ∂εt−1 (β) − e2 t +2 εt−1 (β) ∂β ∂εt−1 (β) + var(Xt−1 )E2 t +1,2 t +1 ∂β − e2 t +1 Xt−1 − θt ∂εt−1 (β) = θt2 E ∂β 2 E2 t +2,2 t +2 + E2 t +2,2 t +1 + E2 t +1,2 t +2 + σt−1 + θt e2 t +1 0 + 0e2 t +1 + e2 t +2 0 + 0e2 t +2 ∂εt−1 (β) = θt2 E ∂β ∂εt−1 (β) ∂β + var(Xt−1 )E2 t +1,2 t +1 2 + σt−1 E2 t +2,2 t +2 + E2 t +2,2 t +1 + E2 t +1,2 t +2 +O ∂εt−1 (β) = θt2 E ∂β ∂εt−1 (β) ∂β + var(Xt−1 )E2 t +1,2 t +1 2 + σt−1 E2 t +2,2 t +2 + E2 t +2,2 t +1 + E2 t +1,2 t +2 , where var(Xt ) = γt (0) is the variance of Xt , and Ei,j = ei ej denotes a (p + q)S × (p + q)S matrix whose entries are all zero, except for a one in the ith row and j th column, and O is a (p + q)S × (p + q)S zero matrix. Recall that Γt (β, σ 2 ) = E ∂εt (β) ∂β ∂εt (β) ∂β then we obtain Γt (β, σ 2 ) = θt2 Γt−1 (β, σ 2 ) + Mt , 90 , where 2 Mt =σt−1 E2 t +2,2 t +2 + E2 t +2,2 t +1 + E2 t +1,2 t +2 + var(Xt−1 )E2 t +1,2 t +1 . 
For simplicity we adopt the notation Γi (β, σ 2 ) = θi2 Γi−1 (β, σ 2 ) + Mi , (5.2.6) with the boundary condition Γ0 (β, σ 2 ) = ΓS (β, σ 2 ), hence in this way all the index i are bounded, 0 ≤ i ≤ S − 1. The solution to (5.2.6) is 2 Γi (β, σ 2 ) = rθ,i 2 S−1 rθ,S−1 Mk Mk 2 +( )rθ,i , r2 1 − r2 r2 θ,S−1 k=0 θ,k k=0 θ,k i (5.2.7) where t rθ,t = θt , (5.2.8) j=0 the proof for (5.2.7) is given as follows. By (5.2.6), ∀ 0 ≤ i ≤ S − 1, we could write 91 recursively: Γi = θi2 Γi−1 + Mi 2 Γ = θi2 (θi−1 i−2 + Mi−1 ) + Mi 2 (θ2 Γ 2 = θi2 θi−1 i−2 i−3 + Mi−2 ) + θi Mi−1 + Mi 2 θ2 Γ 2 2 2 = θi2 θi−1 i−2 i−3 + θi θi−1 Mi−2 + θi Mi−1 + Mi ... = θi θi−1 · · · θi−(S−1) + θi θi−1 · · · θi−(S−2) 2 2 Γi−S Mi−(S−1) . . . + θi θi−1 · · · θ0 2 M−1 + θi θi−1 · · · θ1 2 M0 + θi θi−1 · · · θ2 2 M1 + . . . + (θi θi−1 )2 Mi−2 + θi2 Mi−1 + Mi 2 (5.2.9) = θi θi−1 · · · θi−(S−1) Γi−S + I + II, where we define I = θi θi−1 · · · θi−(S−2) 2 Mi−(S−1) . . . + θi θi−1 · · · θ0 2 M−1 II = θi θi−1 · · · θ1 2 M0 + θi θi−1 · · · θ2 2 M1 + . . . + (θi θi−1 )2 Mi−2 + θi2 Mi−1 + Mi , since Γi−S = Γi by periodic property, θi θi−1 · · · θi−(S−1) = rθ,S−1 by definition in 92 (5.2.8), we could write (5.2.9) as I + II 1 − r2 θ,S−1 2 2 II(1 − rθ,S−1 ) + I + II(rθ,S−1 ) = 1 − r2 θ,S−1 2 I + II(rθ,S−1 ) . = II + 1 − r2 θ,S−1 Γi = (5.2.10) Note that II = θi θi−1 · · · θ1 2 M0 + θi θi−1 · · · θ2 2 M1 + . . . + (θi θi−1 )2 Mi−2 + θi2 Mi−1 + Mi Mi M0 M1 + ... + = θi θi−1 · · · θ0 2 + (θ0 θ1 )2 (θ0 θ1 · · · θi )2 θ02   2  M0 + M1 + . . . + Mi  = rθ,i r2 r2 r2 θ,0 θ,1 θ,i i Mk 2 = rθ,i . r2 k=0 θ,k (5.2.11) On the other hand, we want to prove that 2 I + II(rθ,S−1 ) 1 − r2 θ,S−1 2 S−1 rθ,S−1 Mk 2 =( )rθ,i , 1 − r2 r2 θ,S−1 k=0 θ,k 93 (5.2.12) in this way we will finish the proof of (5.2.7). Note that 2 2 I + (rθ,S−1 )II = θi θi−1 · · · θi−(S−2) Mi−(S−1) . . . + θi θi−1 · · · θ0 2 M−1 2 Mi M0 + ... + (θ0 θ1 · · · θi )2 θ02 2 = θi θi−1 · · · θi−(S−2) Mi+1 . . . + θi θi−1 · · · θ0 2 MS−1 i Mk 2 2 + rθ,S−1 rθ,i r2 k=0 θ,k MS−1 Mi+1 2 θi · · · θ0 2 = θS−1 · · · θ0 + ... + (θ0 · · · θi+1 )2 (θ0 · · · θS−1 )2 i Mk 2 2 + rθ,S−1 rθ,i r2 k=0 θ,k S−1 i Mk Mk 2 2 2 2 = rθ,S−1 rθ,i + rθ,S−1 rθ,i r2 r2 k=i+1 θ,k k=0 θ,k S−1 Mk 2 2 = rθ,S−1 rθ,i , r2 k=0 θ,k + θS−1 θS−2 · · · θ0 θi θi−1 · · · θ0 2 2 dividing both sides by 1−rθ,S−1 , (5.2.12) is proved. Add (5.2.11) and (5.2.12) into (5.2.10), we complete the proof of (5.2.7). The invertibility of the model requires that |rθ,S−1 | < 1. In the end, A−1 (β, σ 2 ) is a (p + q)S × (p + q)S matrix, and it could be computed by the inverse of A(β, σ 2 ) in (5.2.2). Equation (5.2.1) can be used to produce confidence intervals and hypothesis tests for the model parameters β. An α-level test statistic rejects the null hypothesis H0 : β i = 0 in favor of the alternative hypothesis Ha : β i = 0, indicating that β i is statistically significantly 94 different from zero if |Z| > zα/2 . The p value for this test is p = P (|Z| > |z|), (5.2.13) where Z ∼ N (0, 1), and ˆ − 0)/se. Z = (β After we get the MLE optimization results from Example 5.1.3 and Table 5.4, we do a ˆ φ), ˆ where the standard error se is obhypothesis test for the model parameters βˆ = (θ, tained from Table 5.6. Using the level of significance of α = 5%, we get the reduced model parameters, which achieve the goal of parsimony. The results are shown in Table 5.7. 
Season θˆi p-value reduced θˆi φˆi p-value reduced φˆi 0 1 2 3 4 0.55 -0.01 0.04 -0.02 0.09 0.13 0.96 0.87 0.93 0.74 0.00 0.00 0.00 0.00 0.00 0.37 0.63 0.52 0.54 0.71 0.22 0.00 0.01 0.00 0.00 0.00 0.63 0.52 0.54 0.71 5 -0.26 0.27 0.00 0.86 0.00 0.86 6 0.01 0.98 0.00 1.18 0.00 1.18 7 8 0.37 1.89 0.00 0.00 0.37 1.89 0.23 -1.41 0.00 0.00 0.23 -1.41 9 10 0.27 -0.08 0.16 0.64 0.00 0.00 0.35 0.57 0.00 0.00 0.35 0.57 11 0.32 0.18 0.00 0.37 0.04 0.37 Table 5.7: Model parameters by MLE optimization and their p-values. We obtain the forecast in a similar manner as the previous reduced model. The resulting prediction, along with the 95% prediction bands, is shown in Figure 5.6. The actual data (solid line) is also shown for comparison. The forecast (solid line with dots) is in reasonable agreement with the actual data (which were not used in the forecast), and the actual data lies well within the 95% prediction bands. A detailed comparison can be seen in Figure 5.7 between full model and the reduced one. As shown in the first two graphs, the reduced model predicts as well as the full model. In the next two graphs, we check the difference and 95 relative error between the two forecasts, for every step. The differences are small. In fact, when compared to predictions, the relative errors are quite small. When the step h ≥ 5, the relative errors are nearly zero. We conclude that the removed autoregressive coefficients in Table 5.7 are insignificant, and the reduced model performs as well as the full model. In summary, we prefer the reduced PARMA12 (1, 1) model simply because it is a simpler model with fewer parameters and the same adequacy. 11 9 10 Flow (cms) 12 13 Forecasting by reduced PARMA model by asymptotic distribution of MLE 10/1982 02/1983 06/1983 10/1983 02/1984 06/1984 Month / Year Figure 5.6: A 24-month forecast using the reduced PARMA12 (1, 1) model in Table 5.7. Solid line is the original data; Solid line with dots is the reduced PARMA(1,1) forecast; The two dotted lines are 95% Gaussian prediction bounds. The 24-month forecast is from October 1982 to September 1984. 96 0 100000 Flow (cms) 100000 0 Flow (cms) 250000 Reduced model forecast vs Data 250000 Full model forecast vs Data 5 10 15 20 5 10 15 20 Month Difference of forecast between full model and reduced model Relative error between full model and reduced model 0.08 0.00 0.04 Flow (cms) 4000 0 Flow (cms) 8000 Month 5 10 15 20 5 Month 10 15 20 Month Figure 5.7: A comparison of 24-month forecasts between the full PARMA12 (1, 1) model and reduced PARMA12 (1, 1) model, by asymptotic distribution of MLE. In the first two graphs, dotted line is forecast, and solid line is real data. 97 Chapter 6 Asymptotic normality of PARMA model Let Xn,i = (Xi , ..., Xi+n−1 ) be a causal and invertible PARMAS (p, q) process (1.0.1), where n is the number of observations and i = 0, ..., S − 1. For notations, let φt = (φt (1), . . . , φt (p)) and θ t = (θt (1), . . . , θt (q)) denote the autoregressive and moving-average parameters during season t, respectively. Then we can write the likelihood function as L(β), where we use β = (φ0 , θ 0 , φ1 , θ 1 , . . . , φS−1 , θ S−1 ) to denote the collection of all PARMAS (p, q) parameters. The dimension of β is (p+q)S ×1. Following the ideas from Ba2 ) as nuisance sawa and Lund [24], we treat the white noise variances σ 2 = (σ02 , σ12 , . . . , σS−1 parameters. Then the likelihood function is given by (5.1.13) in Chapter 5. 
By (5.1.16), once ˆ the MLE of σ 2 for 0 ≤ i ≤ S − 1 could be we obtain the maximum likelihood estimate β, i computed by 1 σ ˆi2 = N N −1 ˆ (i) XkS+i − X kS+i k=0 98 2 /rkS+i = Si /N, 2 ) . Equivalently, we could minimize the negative log likelihood ˆS−1 where σ ˆ 2 = (ˆ σ02 , . . . , σ (β) = −2 log{L(β)} n−1 = n log(2π) + n−1 (X log(vj,i ) + j=0 n−1 = n log(2π) + j=0 2 r )+ log(σi+j j,i j=0 n−1 = n log(2π) + ˆ (i) 2 i+j − Xi+j ) vj,i n−1 (X j=0 2 )+ log(σi+j j=0 ˆ (i) 2 i+j − Xi+j ) 2 r σi+j j,i n−1 n−1 (X log(rj,i ) + j=0 j=0 Next we consider the properties of the first derivative (6.0.1) ˆ (i) 2 i+j − Xi+j ) . 2 r σi+j j,i ∂ (β) in a few lemmas. ∂βk Lemma 6.0.2. For k = 1, . . . , p+q, there exist constants C > 0 and s ∈ (0, 1) such that ∂rt,i (β) ≤ C(β)st , ∂βk where t ≥ 1, and 1 ≤ k ≤ p + q. Proof. Note that rt,i (β) = E Yt+i − Yˆt+i 2 ˆ E Xt+i − X t+i = 2 σt+i 2 , where Y t,i = (Yi , . . . , Yt+i−1 ) is the mean zero PARMA process with period S given by p Yt − q at (k)Yt−k = Zt + bt (j)Zt−j j=1 k=1 99 (6.0.2) where {Zt } is a sequence of random variables with mean zero and standard deviation 1. Here the autoregressive parameters at (j) and the moving average parameters bt (j) are assumed to be periodic functions of t with the same period S ≥ 1. By the invertibility assumption, we write ∞ Yt+i + Yt+i−j πt+i (j) = Zt+i , (6.0.3) j=1 where πt+i (j) = πt+i+kS (j) by the periodic property of PARMA model. In the following, we write πt+i (j) = πt+i (j; β) to emphasize the dependence of πt+i (j) on β, and the autocovariance function is ηi (h; β) = Cov(Yi , Yi−h ), By Anderson et al. [6], the best linear predictor of Yt+i is defined as (i) (i) (i) Yˆt+i = φt,1 Yt+i−1 + . . . + φt,t Yi , t ≥ 1, where the vector of coefficients, φt,i = (i) (i) φt,1 , . . . , φt,t , appears in the prediction equations Gt,i φt,i = η t,i , where η t,i = ηt+i−1 (1), ηt+i−2 (2), . . . , ηi (t) , and t Gt,i = ηt+i− ( − j; β) j, =1 is the covariance matrix of (Yt+i−1 , . . . , Yi ) . Proposition 2.1.1 of Anderson et al. [6] shows 100 that if σi2 > 0 for i = 0, 1, . . . , S − 1, then for a causal PARMAS (p, q) process the covariance matrix Gt,i is nonsingular for every t ≥ 1 and i. In this way we can generalize Corollary 5.1.1 in Brockwell and Davis [11] to periodically stationary process, rt,i (β) = ηt+i (0; β) − η t,i G−1 t,i η t,i . (6.0.4) If we multiple Yt+i−1 and take expectations on both sides of (6.0.3), we could obtain ∞ ηt+i−1 (1; β) = − πt+i (j; β)ηt+i−j (j − 1; β), j=1 similarly, if we multiple Yt+i−2 , Yt+i−3 , . . . individually on both sides of (6.0.3) and take expectations, we have ∞ ηt+i−2 (2; β) = − πt+i (j; β)ηt+i−j (j − 2; β), j=1 ∞ πt+i (j; β)ηt+i−j (j − 3; β), ηt+i−3 (3; β) = − j=1 .. . Therefore we could have η ∞,i = −G∞,i π ∞,i , where η ∞,i = ηt+i−1 (1; β), ηt+i−2 (2; β), . . . , ∞ G∞,i = ηt+i− (j − ; β) , j, =1 101 π ∞,i = πt+i (1; β), πt+i (2; β), . . . . Additionally, G−1 ∞,i can be written as G−1 i,∞ = Ti,∞ Ti,∞ , where Ti,∞ = πt+i (k − j; β) ∞ k,j=1 , πt+i (0; β) = 1 and πt+i (j; β) = 0, for j < 0. Furthermore, if we multiply Yt+i on both sides of (6.0.3) and take expectation, we have ηt+i (0; β) = π ∞,i G∞,i π ∞,i + 1. 
(6.0.5) Substituting (6.0.5) into (6.0.4), we have rt,i (β) = 1 + π ∞,i G∞,i π ∞,i − η t,i G−1 t,i η t,i −1 = 1 + η ∞,i G−1 ∞,i η ∞,i − η t,i Gt,i η t,i Therefore ∂rt,i (β 0 ) ∂βk = 2 ∂η ∞,i −1 ∂G∞,i G−1 η G−1 η + η G ∞,i ∞,i ∂β ∞,i ∞,i ∞,i ∞,i ∂βk k −2 ∂η t,i −1 ∂Gt,i G−1 η , η + η G G−1 t,i t,i ∂β t,i t,i ∂βk t,i t,i k where all the terms on the right hand side are evaluated at β = β 0 , and β 0 is the truth. By 102 G−1 t,i η t,i = φt,i , the above equation reduces to   ∂η ∂η ∂rt,i (β 0 ) t,i  ∞,i  = −2  π ∞,i + φt,i  ∂βk ∂βk ∂βk ∂G∞,i ∂Gt,i − π ∞,i π ∞,i − φt,i φ . ∂βk ∂βk t,i In Anderson et al. [6, Corollary 2.2.4], they show that (i) φ → −πt+i (k) as k → ∞, t,k and 2  n−1  2 2  (i)  πt+i (j)  M  , ≤ φt,j + πt+i (j)  πC j≥n j=0 where M = max{ηt+i (0; β), i = 0, . . . , S − 1}. Note that j≥n πt+i (j) 2 is bounded in n. Therefore we could have ∂rt,i (β 0 ) ∂βk  ≤ 2K1  n−1  (i) πt+i (j) + φt,j + πt+i (j)  j≥n j=0  t (i) (i) +K1  πt+i (i)πt+i ( ) − φt,j φ +2 πt+i (j) t, j≥n j, =1  πt+i ( )  , 1 π ∂g(λ;β0 ) dλ and g(λ; β ) is the spectral density matrix of the equivwhere K1 = 2π 0 −π ∂βk alent vector ARMA process with length S. Note that |A| is the determinant of A if A is a 103 matrix. By the Cauchy-Schwarz inequality, ∂rt,i (β 0 ) ∂βk ≤ K2 t1/2 st/2 + K3 πt+i (j) + K4 tst/2 j≥n (6.0.6) ≤ K5 st1 , where K2 , K3 , K4 , K5 are positive constants, 0 < s < 1, and 0 < s1 < 1. This finishes the proof. Lemma  6.0.3. For k = 1, . . . , p+q, n−1/2  n−1 log r t,i + t=0 ∂ ∂βk Proof. Given  2 ˆ p n−1 (Xt+i −Xt +i) ∂rt,i  → 0. 2 t=0 ∂βk rt,i β=βˆ ∂rt,i (β 0 ) 2 ≥ 1, we ≤ C(β){S(β)}t from Lemma 6.0.2 and the fact rt−1,i ∂βk have n−1/2 ∂ ∂βk n−1 ≤ n−1/2 log rt−1,i β=βˆ t=0 n−1 1 2 t=0 rt−1,i ∂rt−1,i ∂βk n−1 ∂r t−1,i ∂βk β=βˆ t=0 n−1 −1/2 st1 ≤ C1 n t=0 → 0. ≤ n−1/2 Therefore n−1/2 ∂ ∂βk n−1 p → 0. log rt−1,i β=βˆ t=0 104 β=βˆ On the other hand,   n−1 (X 2 ˆ  t+i − Xt+i ) ∂rt,i I(A) 2 ∂βk rt,i t=0 β=βˆ   n−1 (X ˆ −X )2 t t+i t+i  ≤ C1 E s1 I(A) 2 r t,i t=0   2 2 n−1 σ r t+i t,i t = C1  s1 I(A) 2 r t,i t=0  n−1/2 E  n−1 ≤ CC1 st1 t=0 → 0, where I(·) is the indicator function and C > 0 is a constant, A is an event with the probability arbitrarily close to 1 and on which βˆ − β 0 arbitrarily small for all large n, where we adopt the strong consistency for MLE of a vector ARMA model from Dunsmuir and Hannan [15], and apply it for PARMA model. n−1/2 n−1 (X t=0 2 ˆ t+i − Xt+i ) ∂rt,i 2 ∂βk rt,i p → 0. β=βˆ Remark 6.0.4. From Section 2 of Basawa and Lund [24], the PARMA model (1.0.1) has the S−variate vector ARMA representation → − Φ0 X N − p∗ → − − Φk X N −k = Θ0 → ε0+ k=1 q∗ k=1 105 − Θk → ε N −k , (6.0.7) → − → − − where { X N } and {→ ε N } are the S−variate series { X N } = XN S+1 , . . . , XN S+S and − {→ ε N } = εN S+1 , . . . , εN S+S . The model orders in (6.0.7) are p∗ = [p/S] and q ∗ = [q/S], where [x] denotes the smallest integer greater than or equal to x. Remark 6.0.5. In Dunsmuir and Hannan [15], the following assumption is needed for the strong consistency of MLE of the vector ARMA model. They restricted the elements of Φj , 1 ≤ j ≤ p∗ and Θj , 1 ≤ j ≤ q ∗ to lie in a ball specified by   tr  p∗ q∗ Φj Φj + 1 1   Θj Θj  < ∞. (6.0.8) With condition (6.0.8), Dunsmuir and Hannan [15, Theorem 3] gave the strong consistency of the MLE of the vector ARMA model. Remark 6.0.6. By Section 3 of Basawa and Lund [24], we could work in the univariate PARMA setting rather than transform to an S−variate vector ARMA model. This is eligible by two reasons. 
First of all, one has to invert the S-variate ARMA transformation to obtain the individual PARMA model cofficients. Hence the results derived directly in terms of the univariate PARMA model will be more readily usable. Second of all, even though one could obtain a standard vector ARMA model, the covariance matrix of the vector noises and the moving average parameters would still depend both the PARMA autoregressive parameters and the variance of vector ARMA model. For the above reasons, we will work directly in the PARMA setting. Therefore we adopt the strong consistency for MLE of a vector ARMA model from Dunsmuir and Hannan [15], and apply it for PARMA model. 106 Lemma 6.0.7. For k = 1, . . . , p + q, n−1/2 n−1 X ˆ ˆ p t+i − Xt+i + Zt+i ∂(Xt+i + Zt+i ) → 0, rt,i ∂βk β=βˆ t=0 n−1/2 n−1 X ˆ ˆ p t+i − Xt+i − Zt+i ∂(Xt+i − Zt+i ) → 0. rt,i ∂βk β=βˆ t=0 and Proof. We only prove n−1/2 ˆ ˆ n−1 Xt+i −Xt+i −Zt+i ∂(Xt+i −Zt+i ) rt,i t=0 ∂βk p → 0, since β=βˆ the other result could be proved in a similar way. By the invertible assumption of PARMAS (p, q) model, ∞ πt (j)Xt−j , Zt = j=0 and from the model set-up in (1.0.1), we can write φt (z) = 1 − φt (1)z − . . . − φt (p)z p , and θ t (z) = 1 + θt (1)z + . . . + θt (q)z q . Additionally, π t (z) = φt (z) =1+ θ t (z) ∞ πj z j . j=1 Notice that β = (φt (1), φt (2), . . . , φt (p), θt (1), θt (2), . . . , θt (q)) , and when k = 1, . . . , p, we 107 have ∞ −∂φt (z)/∂βk zk ∂ (φt (z)/θ t (z)) ∂π t (z) ∂πt (j) j = =− =− =− z , θ t (z) θ t (z) ∂βk ∂βk ∂βk j=1 and when k = 1, 2, . . . , q, we have ∂ (π t (z)) = ∂βp+k ∞ ∂πt (j) j ∂ (φt (z)/θ t (z)) φ (z) ∂θ t (z) φ (z) z = =− t = − t zk . ∂β ∂βp+k θ 2t (z) ∂βp+k θ 2t (z) j=1 p+k In summary,     zk ∞ ∂πt (j) z j   θ (z) = − j=1 ∂β t k  φt (z) k  ∞ ∂πt (j) j    θ 2 (z) z = − j=1 ∂βp+k z . t (6.0.9) Then we can prove from (6.0.9) that there exist constants C > 0 and s ∈ (0, 1) such that ∂πj ∂βk ≤ Csj , j ≥ 1, 1 ≤ k ≤ p + q. (i) (i) Furthermore, we may expand the Yule-Walker equation Γn,i φn = γ n as                  γi+n−1 (1)        γi+n−2 (2)        = γi+n−3 (3)      ..     .     γi (n)  (i) φn,1 (i) φn,2 (i) φn,3 .. . γi+n−1 (0) γi+n−1 (−1) . . . γi+n−1 (−n + 1)      γi+n−2 (1) γi+n−2 (0) . . . γi+n−2 (−n + 2)      γi+n−3 (2) γi+n−3 (1) . . . γi+n−3 (−n + 3)    .. .. ..   . . ... .   (i) γi (n − 1) γi (n − 2) ... γi (0) φn,n 108         ,       then for 1 ≤ j ≤ n, we have n (i) φ γi+n−j (j − ), n, γi+n−j (j) = =1 therefore for 1 ≤ k ≤ p + q, ∂γi+n−j (j) ∂βk n ∂γ n ∂φ(i) (j − ) (i) i+n−j n, = φ + γ (j − ). n, ∂βk ∂βk i+n−j =1 =1 (6.0.10) On the other hand, ∞ Zn+i = Xn+i + πn+i ( )Xn+i− , =1 if we multiply Xn+i−j on both sides and take expectations, we can obtain ∞ πn+i ( )γn+i−j (j − ), γn+i−j (j) = =1 then similarly for 1 ≤ k ≤ p + q, ∂γn+i−j (j) ∂βk = ∞ ∂π ∞ ∂γn+i−j (j − ) n+i ( ) γ (j − ) + πn+i ( ) . n+i−j ∂βk ∂βk =1 =1 (6.0.11) If we subtract (6.0.10) from (6.0.11), we have n 0 = ∂γi+n−j (j − ) (i) ∂γi+n−j (j − ) πi+n ( ) − φ + πi+n ( ) (6.0.12) n, ∂βk ∂βk =1 >n   (i) n ∂φ ∂πi+n ( ) n,   ∂πi+n ( ) − γi+n−j (j − ).  γi+n−j (j − ) +  ∂βk ∂βk ∂βk >n =1 109 Furthermore, we may write (6.0.12) as  (i) n ∂φ ∂π ( )  n, − i+n  γi+n−j (j − )  ∂βk ∂βk =1 n (i) ∂γi+n−j (j − ) = πi+n ( ) − φ n, ∂βk =1 ∂γi+n−j (j − ) ∂πi+n ( ) + πi+n ( ) + γi+n−j (j − ) . 
∂βk ∂βk >n  (6.0.13) Consequently, we write (6.0.13) as ∂ π n,i − φn,i ∂βk ∂Γn,i = Γ−1 n,i ∂β (φn,i − π n,i ) + dn,i , k where dn,i is an n × 1 vector with πi+n ( ) >n ∂γi+n−j (j − ) ∂βk + ∂πi+n ( ) γi+n−j (j − ) ∂βk as its j-th component. Then under the assumption of Z−t = X−t = 0 for all t ≥ 0 we have Zt+i = θ(B)−1 φ(B)Xt+i = π(B)Xt+i ∞ = Xt+i + Xt+i−j πt+i (j) j=1 n−1 = Xt+i + Xt+i−j πt+i (j). j=0 110 Now it follows from (6.0.13) that 2 (i) ˆ  ∂ Xt+i − Zt+i   E   ∂βk   (i) (i)  ∂ φt,1 Xt+i−1 + . . . + φt,t Xi − Xt+i − =E   ∂βk 2 n−1 X j=0 t+i−j πt+i (j)    2 (i) n−1 ∂ φt,j − πt+i (j)  =E  Xt+i−j    ∂βk j=0  = = ≤ ∂ φt,i − π t,i ∂βk Γt,i ∂ φt,i − π t,i ∂βk ∂Γt,i φt,i − π t,i + dn,i ∂βk 2 λmin (Γt,i ) ∂Γt,i Γ−1 φt,i − π t,i + dt,i t,i ∂β k ∂Γt,i φt,i − π t,i ∂βk  2+ d 2 t,i 2  ∂πt+i ( ) 2max{α, γi (0)}t  2   ≤ πt+i ( ) +  φt,i − π t,i +   λmin (Γt,i ) ∂βk >t  ≤C1 st1 , where λmin (Γt,i ) > 0 denotes the minimum eigenvalue of Γt,i , and 2 π ∂ θ(eiw )/φ(eiw ) α −π ∂βk dw ≥ ∂γi (j) , ∂βk for j ≥ 1 and 1 ≤ k ≤ p + q, and C1 > 0 and s1 ∈ (0, 1) are some constants, which depend on β ∈ B continuously. Additionally, a detailed discussion of spectral density for PARMA 111 model could be seen in Wyloma´ nska [56]. Next using the similar argument as in the proof of Lemma 6.0.3, we may show that ˆ ∂(X t+i − Zt+i ) I(A) E ∂βk 2 β=βˆ ≤ C2 st2 , t ≥ 1, 1 ≤ k ≤ p + q, where A is an event with the probability arbitrarily close to 1 and on which βˆ − β 0 arbitrarily small for all large n, and C2 > 0 and S2 ∈ (0, 1) are some constants. Hence   ˆ ˆ n−1 E (Xt+i −Xt+i −Zt+i ) ∂(Xt+i −Zt+i ) I(A) rt,i t=0 ∂βk ˆ β=β n−1/2 1/2 2 ˆ ∂(Xt+i −Zt+i ) n−1 E X ˆ I(A) t+i − Xt+i (β) − Zt+i (β) E t=0 ∂βk ˆ β=β t/2 Cn−1/2 n−1 t=0 s2  ≤ n−1/2 ≤ 2 → 0, where C > 0 is a constant. Thus n−1/2 n−1 (X ˆ ˆ t+i − Xt+i − Zt+i ) ∂(Xt+i − Zt+i ) → 0. rt,i ∂βk ˆ β=β t=0 In a similar manner, we could prove n−1/2 n−1 X ˆ ˆ p t+i − Xt+i + Zt+i ∂(Xt+i + Zt+i ) → 0, rt,i ∂βk β=βˆ t=0 and this ends the proof. We will show that the asymptotic variance of the estimated periodic AR and MA coeffi112 cient vector can be nicely represented in terms of the two PAR models. Define φt (B)ξt = Wt , and θt (B)ςt = Wt, where {Wt } ∼ WN(0, 1) is a white noise process with mean 0 and variance 1. Let ξ = ξ−1 , ξ−2 , . . . , ξ−p , ς−1 , . . . , ς−q , and W(β) = W(φt , θt ) = {Var(ξ)}−1 . Before we state our next lemma, we need to clarify notation. By the invertible assumption of PARMA model (1.0.1), we may write εt+i = εt+i (β) = Xt+i −φt (i)Xt+i−1 −. . . φt (p)Xt+i−p −θt (1)εt+i−1 . . . −θt (q)εt+i−q . Then for k = 1, . . . .p, we may write ∂ε (i) U = − t+i , tk ∂βk and for k = 1, 2, . . . , q, V ∂ε (i) = − t+i . tk ∂βp+k Let χ = (X, Z) and τ = (U , V ), where X and U are n × p matrices with Xi+j− (i) and U as their (j, )-th elements, respectively, and Z and V are n × q matrices with j 113 (i) εi+j− and V as their (j, )-th elements respectively. Denote R as the diagonal matrix j diag r0,i (β0 ), . . . , rn−1,i (β0 ) . P Lemma 6.0.8. n−1 τ R−1 τ → W∗ (β) = {Var(σ t ξ t )}−1 . (i) (i) Proof. In the following proof, all U ,V ,Zt+i , and rt,i are evaluated at β = β 0 . We tk tk adopt the notations that (i) (i) B k Utj = U , t−k,j and (i) (i) B k Vtj = V . t−k,j From the invertible representation, we have (i) Utj = θt−1 (B)Xt+i−j , and (i) Vtj = θt−1 (B)Zt+i−j = φt (B)θt−2 (B)Xt+i−j, assuming that X−t = Z−t = 0 for all t ≥ 0. 
Let     1/θt (z) = 1 − j≥k ψt (j)z    φt (z)/θ2 (z) = 1 − t 114 j j j≥1 ηj z , then (i) Utj =Xt+i−j − t+i−j−1 ψt (j)Xt+i−j−k k=1 =φt (B)−1 Zt+i−j + ψt (k)Xt+i−j−k (6.0.14) k≥t+i−j ˜(i) (i) =Utj + utj , and (i) Vtj =Xt+i−j − t+i−j−1 ηk Xt+i−j−k k=1 =θt (B)−1 Zt+i−j + ηk Xt+i−j−k (6.0.15) k≥t+i−j ˜(i) (i) =Vtj + vtj . Note that  2 ∞ 2 (i) = E E utj ψt (k)Xt+i−j−k  k≥t+i−j ∞ ≤ γt+i−j−k (k − )ψt (k)ψt ( ) =1,k≥t+i−j ∞ 2 ≤ M ψt+i−j−k k=1 ≤ Cst−j , (6.0.16) for t ≥ 1, and 1 ≤ j ≤ p, where 0 < s < 1 and M = max{γi (0) : i = 0, 1, . . . , S − 1}. 115 Similarly, we could have (i) 2 E vtj ≤ Cst−j , for t ≥ 1, and 1 ≤ j ≤ q. Notice that the (j, )−th element of n−1 τ R−1 τ is 1 n n−1 t=0 1 (i) (i) Utj U /rt,i = t n n−1 t=0 ˜(i) ˜(i) ˜(i) (i) (i) ˜(i) (i) (i) Utj U + Utj u + utj U + utj u /rt,i . t t t t By the ergodic theorem, under the assumption of a measurable density function for PARMA process, 1 n n−1 ˜ ˜ (i) (i) Utj U t t=0 = a.s. → = Cov = n−1 −1 [φ−1 t (B)εt+i−j ][φt (B)εt+i− ] t=0 ˜(i) ˜(i) Cov(Utj U ) t 1 n −1 φ−1 t (B)Zt+i−j , φt (B)Zt+i− (σt+i−j σt+i− )Cov(ξt+i−j , ξt+i− ). By Corollary 5.1.2 |rt,i − 1| → 0 as t → ∞, we have 1 n n−1 ˜ ˜ (i) (i) a.s. Utj U /rt,i → (σt+i−j σt+i− )Cov(ξt+i−j , ξt+i− ). t t=0 116 By Cauchy-Schwartz inequality and (6.0.16), 1 n n−1 t=0 1 (i) (i) E utj u /rt,i ≤ t n 1 ≤ n ≤ n−1 t=0 n−1 C n (i) (i) E utj u t t=0 n−1 1/2 (i) 2 (i) 2 E utj E u t s2t−j− t=0 → 0. Then 1 n n−1 t=0 p (i) (i) utj u /rt,i → 0. t With a similar method, we may also show that 1 n n−1 ˜ p (i) (i) Utj u /rt,i → 0, t t=0 and 1 n n−1 t=0 p (i) ˜(i) utj U /rt,i → 0. t Therefore for 1 ≤ j, ≤ p, 1 n n−1 t=0 p (i) (i) Utj U /rt,i → (σt+i−j σt+i− )Cov(ξt+i−j , ξt+i− ). t We can prove in a similar manner that for 1 ≤ j ≤ p and 1 ≤ ≤ q, 1 n n−1 t=0 p (i) (i) Utj V /rt,i → (σt+i−j σt+i− )Cov(ξt+i−j , ςt+i− ), t 117 (6.0.17) and 1 ≤ j, ≤ q, 1 n n−1 t=0 p (i) (i) Vtj V /rt,i → (σt+i−j σt+i− )Cov(ςt+i−j , ςt+i− ). t Putting the above three equations together, we have completed the proof. (i) Lemma 6.0.9. Let τ = (U , V ), U is an n × p matrix with U as its (j, )-th elej (i) ments, respectively, and V is an n × q matrix with V as their (j, )-th elements. Both j U and V follow the assumption of second finite moment. R denotes the diagonal matrix diag r0,i (β0 ), . . . , rn−1,i (β0 ) , and Z = Zt+i (β0 ), Zt+i+1 (β0 ) . . . , Zt+n−1 (β0 ) , then D n−1/2 τ R−1 Z → N 0, σt2 W ∗ (β 0 )−1 . Proof. Define (i) U˜t = (i) (i) (i) (i) U˜t1 , . . . , U˜tp , V˜t1 , . . . , V˜tp and (i) ut = (i) (i) (i) (i) ut1 , . . . , utp , vt1 , . . . , vtq , (i) (i) (i) (i) where U˜tj , V˜tj , utj and vtj are defined in (6.0.14) and (6.0.15). From the model invertible assumption, we may write Zt = εt + zt , where zt = − πt (j)Xt−j , and π t (z) = j≥t 118 φt (z) = θ t (z) 1+ ∞ π z j . then j=1 j n−1/2 τ R−1 Z = n−1/2 n−1 (i) (i) U˜t + ut εt + zt rt,i (6.0.18) t=0 n−1 U˜ (i) ε + U˜ (i) z + u(i) ε + u(i) z −1/2 t t t t t t t t. = n rt,i t=0 Using similar method as (6.0.16), we may obtain that for all t ≥ 1, E zt2 ≤ Cst , where C > 0, and s ∈ (0, 1) are some constants. Additionally based on the same argument as (6.0.17), we may obtain that n−1/2 n−1 U˜ (i) z + u(i) ε + u(i) z P t t t t t t → 0. rt,i t=0 (6.0.19) (i) (i) Let Ft be the σ−algebra generated by {εt+i−k, k ≥ 0}, then {α U˜t εt /rt,i } are martin(i) (i) gale differences with respect to {Ft }, for any α ∈ Rp+q . Furthermore, α U˜t εt /rt,i is (i) Ft −measurable and E (i) (i) α U˜t εt /rt,i |Ft−1 = 119 (i) α U˜t /rt,i Eεt = 0. 
Further for any ε > 0, 1 n ≤ 1 n n−1 2 (i) α U˜t εt /rt,i I E 2 (i) α U˜t εt I t=0 n−1 (i) α U˜t εt /rt,i (i) |Ft−1 (i) α U˜t εt > n1/2 ε t=0 I 1 ≤ n E (i) α U˜t > log n + I n−1 (i) 2 2 ˜ σt α Ut I (i) α U˜t ≤ log n (i) |Ft−1 (i) (i) 2 ˜ ˜ α Ut > log n + α Ut E ε2t I n1/2 ε |εt | > log n t=0 ∼σt2 E (i) 2 I α U˜1 (i) 2 + E α U˜1 E ε21 I (i) α U˜1 > log n |ε1 | > n1/2 ε log n →0. (i) The last limit follows from the fact that both εt and α U˜1 have finite second moments. Note that since rt,i → 1 as t → ∞, 1 n n−1 (i) α U˜t εt /rt,i 2 ∼ t=0 1 n n−1 2 (i) α U˜t εt t=0 2 (i) a.s. → E α U˜t εt (i) 2 = σt2 E α U˜t = σt4 α W (β 0 )−1 α = σt2 α W ∗ (β 0 )−1 α. 120 Then it follows from Theorem 4 of p. 511 of Shiryaev [40] that n−1/2 n−1 (i) α U˜t εt /rt,i d →N 0, σt2 α W ∗ (β 0 )−1 α , (6.0.20) t=0 for any α ∈ Rp+q . Now the limit n−1/2 n−1 D τ R−1 Z → N 0, σt2 W ∗ (β 0 )−1 t=0 follows from (6.0.20), (6.0.18) and (6.0.19). Theorem 6.0.10. Let Xt be the PARMAS (p, q) process defined by (1.0.1), and suppose that the vector of true parameter values β 0 ∈ B, where B is the parameter space containing all β. Then as n → ∞, D ˆ −β )→ n1/2 (β N 0, W ∗ (β 0 ) . 0 (i) X − Xˆ , where H (i) is given as ˆ ,...,X ˆ Proof. Let Xˆn = X n n i i+n−1 , then Xn = H         (i) H =        1 0 ... 0    (i) h11 1 0 ... 0     (i) (i) , h22 h21 1 ... 0     (i) (i) (i) h33 h32 h31 ... 0    (i) (i) (i) hn−1,n−1 hn−1,n−2 hn−1,n−3 . . . 1 121 0 (i) where the coefficients h22 depend on i. Notice that Consequently, Xn Σ(β)−1 Xn = n−1 2 ˆ (i) Xt+i − X t+i σt2 rt−1,i t=0 where |Σ(β)| = , S−1 σ 2 N , with N = n/S. Now it follows from (6.0.1) that i=0 i n−1 (β) = −2 log L(β) = n log(2π) + n + S−1 log(rj,i ) + N j=0 where N −1 ˆ (i) XkS+i − X kS+i k=0 rkS+i Si = log(Si /N ), i=0 2 . ˆ is the solution of the equation ∂ (β) = 0, and for 1 ≤ k ≤ p, the equality Note that β ∂β ∂ (β)| = 0 leads to ∂β β=βˆ n−1 Z (β)U ˆ (i) (β) ˆ t (i) tk 0 = +δ (6.0.21) k ˆ rt,i (β) t=0   (i) p q n−1 Uˆ (β 0 ) ˆ (j)X ˆ (p + i)Z ˆ − β ) + δ (i) , Xt −  tk = β β + η k (β t t 0 t−j − t−i (β 0 ) k rt,i t=0 j=1 i=1 122 where X−j = Z−j = 0 for all j ≥ 0, and n−1 ˆ )2 ∂rt,i σt2 ∂ n−1 1 (Xt − X (i) t δ = log rt,i − k 2 2 ∂βk 2 ∂βk r t,i t=0 t=0 n−1 ˆ + Z ∂(X ˆ +Z ) X −X ˆ − Z ∂(X ˆ −Z ) 1 Xt − X t t t t + t t t t t − 2 rt,i ∂βk rt,i ∂βk t=0 ˆ β=β , (6.0.22) and n−1 U (i) (β ) q (i) tk n ˆ β Ut−j (β n ) ηk = p+j r (β ) t=0 t−1 n j=1     (i) q p n−1 U ˆ  ∂  tk  Xt − βˆj Xt−j − β + p+j Zt−i (β n ) ∂β r t−1 t=0 j=1 j=1 β=β n   (i) n−1 n−1 U (i) (β ) q ∂  Utk  0 (i) tk Zt (β 0 ) β p+j Ut−j (β 0 ) + = rt−1 (β 0 ) ∂β rt−1 t=0 t=0 j=1 β=β 0 ˆ − β ). +Op (n β 0 (i) (i) (i) (i) Ut1 , . . . , Utp , Vt1 , . . . , Vtq , and β n is always between ∂ ˆ and β . Similarly, the equation β (β)| 0 ˆ = 0 (1 ≤ k ≤ q) leads to ∂βp+j β=β In the above expression, U t = n−1  p ˆ X β j t−j − Xt − 0= t=0 +η  q j=1 j=1 ˆ  (i) β p+j Zt−i (β 0 ) Vtk (β 0 )rt−1 (β 0 ) (i) (i) ˆ (i) (β − β 0 ) + δ , p+k p+k 123 where δ n−1 n−1 ˆ )2 ∂rt,i σt2 ∂ 1 (Xt − X t log rt,i − 2 2 ∂βp+k 2 ∂βp+k rt,i t=0 t=0 n−1 ˆ + Z ∂(X ˆ +Z ) X −X ˆ − Z ∂(X ˆ −Z ) 1 Xt − X t t t t + t t t t t − 2 rt,i ∂βp+k rt,i ∂βp+k t=0 (i) = p+k ˆ β=β , (6.0.23) and   (i) n−1 U (i) (β ) q n−1 V ∂  tk  (i) (i) tk 0 η = β p+j Ut−j (β 0 ) + Zt (β 0 ) p+k r (β ) ∂β rt−1 t=0 t−1 0 j=1 t=0 β=β 0 ˆ − β ). +Op (n β 0 It follows from (6.0.21) and (6.0.23) that ˆ = U R−1 Y + A (β ˆ − β ) + δ (i) , U R−1 X β 0 (6.0.24) where (i) (i) δ (i) = (δ1 , . . . 
, δp+q ) , and A is the (p + q) × (p + q) matrix with η (i) as its k-th column. Note that Y − Xβ 0 = Z k and  q    U =X− β p+j    j=1 124  U −j (β 0 ) .. . U n−1−j (β 0 )    .   By (6.0.21), (6.0.23) and (6.0.24), we could obtain ˆ − β ) = U R−1 Z + B (i) U R−1 U (β 0 ˆ − β ) + δ (i) , (β 0 where B (i) is the (p + q) × (p + q) matrix, and the sum of last two terms on the right hand side of (6.0.23) is the (p + k)-th term for B (i) , with k = 1, . . . , q. Therefore,  ˆ −β ) =  n1/2 (β 0 = U R−1 U n U R−1 U n − B (i) −1 n −1 U R−1 Z − δ (i)  n1/2 U R−1 Z + op (1), n1/2 (i) p where the last equality follows from Lemma 6.0.3 and Lemma 6.0.7, and the fact that Bn → 0, with a similar proof as (6.0.17). Then the theorem follows from Lemma 6.0.9. 125 Chapter 7 Periodic AIC for PARMAS (p, q) model 7.1 Kullback-Liebler (K-L) Information The development of the AIC is predicted on the Kullback-Liebler (K-L) information between two probability density functions f and g, where K-L information is defined to be I(f, g) = f (t) ln f (t) g(t) dt. The notation I(f, g) denotes a measure of the information lost when g is used to approximate f , which is also the expectation of the logarithmic difference between the two density functions f and g. Example 7.1.1. Suppose we approximate the normal distribution given by f (t|µ, σ 2 ) = 1 t−µ 2 1 t−ξ 2 √1 e− 2 ( σ ) with g(t|ξ, τ 2 ) = √1 e− 2 ( τ ) . Then σ 2π τ 2π ln f (t) g(t) 1 τ2 t−µ 2 t−ξ 2 = {ln −( ) +( ) }, 2 σ τ σ2 126 and the K-L information is 1 τ2 t−µ 2 t−ξ 2 ln − E( ) + E( ) 2 σ τ σ2 1 τ2 σ 2 + (µ − ξ)2 = ln −1+ . 2 σ2 τ2 I(f, g) = Ef ln f (t) g(t) = If the true distribution f is the standard normal and g ∼ N(0.1, 1.5), then 1.5 1 + (0 − 0.1)2 1 −1+ } = 0.0394. I(f, g) = {ln 2 1 1.5 Proposition 7.1.2. I(f, g) ≥ 0. Proof. For a convex function, C, Jensen’s Inequality from Durrett [16, Theorem 1.5.1] asserts that C [E(X)] ≤ E (C(X)). Therefore by letting C = − ln(t), we write I(f, g) = = = = ≥ = f (t) dt g(t) g(t) − f (t) ln dt f (t) g(t) dt f (t)C f (t) g(t) Ef C f (t) g(t) C Ef f (t) g(t) − ln f (t) dt f (t) f (t) ln = − ln = − ln 1 = 0. 127 g(t)dt 7.2 Derivation of periodic AIC for PARMAS (p, q) process Following the ideas from Basawa and Lund [24], we treat the white noise variances σ 2 = 2 ) as nuisance parameters. We use β = (φ , θ , φ , θ , . . . , φ (σ02 , σ12 , . . . , σS−1 0 0 1 1 S−1 , θ S−1 ) to denote the collection of all PARMAS (p, q) parameters. The dimension of β is (p+q)S ×1. ˆ (i) depend on β. We Then the likelihood function is given by (5.1.13), where vj,i and X i+j also need the asymptotic results of MLE for PARMA process in Basawa and Lund [24]. For a causal and invertible Gaussian PARMA model, with the assumption of {εt } being periodic i.i.d. Gaussian noise, Theorem 3.1 in Basawa and Lund [24] gave the asymptotic distribution ˆ of β, N 1/2 (βˆ − β) → N 0, A−1 (β, σ 2 ) , (7.2.1) where A(β, σ 2 ) = S−1 σi−2 Γi (β, σ 2 ), (7.2.2) i=0 and Γi (β, σ 2 ) = E ∂εt (β) ∂β ∂εt (β) ∂β , where the right hand side also depends on β and σ 2 by (5.2.7). However, their proof for the asymptotic distribution is based on the asymptotic equivalence of lease square estimator and MLE. In Chapter 6 of my thesis, I gave a direct proof in Theorem 6.0.10. ˆ the MLE of σ 2 for 0 ≤ i ≤ S − 1 Once we obtain the maximum likelihood estimate β, i 128 could be computed from (5.1.16), where 1 σ ˆi2 = N N −1 ˆ XkS+i − X kS+i 2 /rkS+i = Si /N, k=0 ˆ where X kS+i and rkS+i come from the innovations algorithm applied to the model. 
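Referring back to Example 7.1.1, the Gaussian K-L information is easy to check numerically. The following short R snippet is a minimal sketch (the helper name kl_normal is ours and not part of any package); it reproduces the value 0.0394 for f the standard normal density and g the N(0.1, 1.5) density.

# K-L information I(f,g) between two univariate normal densities
# f = N(mu1, v1) and g = N(mu2, v2), where v1 and v2 are variances:
# I(f,g) = (1/2) * ( log(v2/v1) - 1 + (v1 + (mu1 - mu2)^2)/v2 )
kl_normal <- function(mu1, v1, mu2, v2)
{
  0.5 * (log(v2 / v1) - 1 + (v1 + (mu1 - mu2)^2) / v2)
}
kl_normal(0, 1, 0.1, 1.5)   # 0.0394, as computed in Example 7.1.1
kl_normal(0, 1, 0, 1)       # 0, since I(f,g) = 0 when g = f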
Now we are ready to discuss the derivation of AIC. In the K-L information, f represents the true probability distribution and g represents a model distribution that estimates f . In the context of PARMA time series modeling, we assume that the truth f and all approximating alternatives g are Gaussian. Additionally, define it as g(t|β0 ). Suppose X is an n-length PARMAS (p, q) time series whose probability density is given by f (t) = g(t|β0 ), where X = (Xi , Xi+1 , . . . , Xi+n−1 ) . To see how we use the K-L information to determine which model, g(t|β), best fits the truth f , so that it would minimize I(f, g), we write I(f, g) = = f (t) ln( f (t) )dt g(t|β) f (t) ln(f (t))dt − f (t) ln(g(t|β))dt. (7.2.3) For all models g, the first integral on the right-hand side of (7.2.3) is a constant, so it suffices to maximize f (t) ln(g(t|β))dt. Given that we have data Y = (Yi , Yi+1 , . . . , Yi+n−1 ) as ˆ ), since a sample from the same truth f (t), the logical step would be to find the MLE β(Y ˆ ) approximates βˆ that minimizes the K-L information. Then we compute an estimate β(Y 0 of I(f, g(t|β0 )) as ˆ )) = I(f, g(t|β(Y f (t) ln( f (t) )dt. ˆ )) g(t|β(Y ˆ ) other than β results in I(f, g(t|β(Y ˆ ))) > Since f = g(t|β0 ), any value of β(Y 0 129 I(f, g(t|β0 )). Consider the method of repeated sampling as a guide to inference, and minimize the K-L ˆ )))]. We therefore want to select the model g that minimizes discrepancy EY [I(f, g(t|β(Y ˆ )))] = EY [I(f, g(t|β(Y f (t) ln(f (t))dt − EY ˆ )) dt f (t) ln g(t|β(Y ˆ ) = constant − EY EX ln g(t|β(Y = constant − T, where all expectations are taken under the assumption that f is the density of X and Y , and X and Y are independent. Hence, the K-L information criterion for selecting the best model, g, is to maximize the objective function denoted by ˆ ) T = EY EX ln g(t|β(Y . (7.2.4) ˆ ) We have assumed that f (t) has the true PARMAS (p, q) model structure, and g t|β(Y ˆ ) → β as N → ∞. Note that βˆ is not necessarily is an estimate of f (t). By (7.2.1), β(Y equal to βˆ0 or even of the same dimension. Here βˆ is the parameter vector for the PARMA model under consideration with the smallest K-L discrepancy from the true model. Without loss of generality we take the likelihood of β(X) as g(t|β(X)) = LX (β(X)), by simply then interpreting g as a function of β(X) given X, which equals the likelihood under data X. Similarly, under data Y we write g(t|β(Y )) = LY (β(Y )). In the following ˆ )))] would be ln(L (β(X))), ˆ we will show that our estimate of T = EY EX [ln(LY (β(Y X 130 where the data X = (Xi , Xi+1 , . . . , Xi+n−1 ) are given. The bias of this estimate is ˆ bias = EX [ln(LX (β(X)))] − T. (7.2.5) We will obtain a first order estimate of the bias, and then remove this. Ultimately, AIC is defined to be ˆ AIC = −2 ln(LX (β(X))) + 2 bias. (7.2.6) Next we will state a few prerequisite results for our main theorem. Define n−1 ε2 (β) N −1 S−1 ε2 i+j jS+k+i (β) S(β) = = , 2 2 σ σ i+j j=0 j=0 k=0 k+i (7.2.7) where n = N S, and p εt = X t − q φt (k)Xt−k − θt (j)εt−j , j=1 k=1 are the model residuals. Then we can have the following result, where we write the residual εt as εt (β) to emphasize explicit dependence of εt on β, and β needs to be estimated. Proposition 7.2.1. Under the assumption of finite second moment for the causal and invertible PARMA process, as N → ∞, we have 1 ∂ 2 S(β) → 2A(β, σ 2 ) in probability, N ∂β 2 where A(β, σ 2 ) is given in (7.2.2). Proof. 
In the following, β is a (p+q)S ×1 vector, 131 ∂ 2 S(β) is a (p+q)S ×(p+q)S dimensional ∂β 2 matrix. 1 ∂ 2 S(β) 1 = 2 N ∂β N ∂ 2( ∂( = 2 N 2 = N N −1 j=0 N −1 j=0 N −1 S−1 ε2 (β) S−1 jS+k+i ) k=0 σ2 k+i 2 ∂β S−1 εjS+k+i (β) ∂εjS+k+i (β) ) ∂β k=0 σ2 k+i ∂β σ −2 ( jS+k+i j=0 k=0 N −1 S−1 ε ∂εjS+k+i (β) ∂εjS+k+i (β) )( ) ∂β ∂β 2 jS+k+i (β) ∂ εjS+k+i (β) ∂β 2 σ2 j=0 k=0 k+i   S−1 N −1 ∂ε (β) ∂ε (β) 1 jS+k+i jS+k+i = 2 σ −2  ( )( ) i+k N ∂β ∂β j=0 k=0   2ε S−1 N −1 ε ∂ (β) (β) jS+k+i jS+k+i 1 , + 2 2 2 N ∂β σ j=0 k+i k=0 2 + N As N → ∞, by equation (3.13) in Basawa and Lund [24], 1 N N −1 ∂ε jS+k+i (β) ∂εjS+k+i (β) ( )( ) → Γi+k (β, σ 2 ) in probability, ∂β ∂β j=0 and by equation (3.14) in Basawa and Lund [24], 1 N N −1 ε 2 jS+k+i (β) ∂ εjS+k+i (β) → 0 in probability, 2 ∂β 2 σ j=0 k+i 132 therefore as N → ∞, S−1 1 ∂ 2 S(β) → 2 σ −2 Γi+k (β, σ 2 ) + 0 in probability i+k N ∂β 2 k=0 = 2A(β, σ 2 ). Lemma 7.2.2. Given X a random variable, if E|X| < ∞, then as x → ∞, E |X|I {|X|>x} → 0. Thus every random variable X such that E|X| < ∞ is by itself uniform integrable. Proof. 0 ≤ |X|I{|X|>x} is monotone increasing in x to |X|, and therefore using the monotone convergence theorem yields E |X|I{|X|≤x} → E|X|, as x → ∞. Notice that E|X| = E |X|I{|X|≤x} + E |X|I{|X|>x} , by assumption E|X| < ∞, we conclude that E |X|I {|X|>x} → 0, and X is by definition uniform integrable. 133 Lemma 7.2.3. If Xn → X in probability, the following statements are equivalent: (i) {Xn : n ≥ 0} is uniform integrable. (ii) Xn → X in L1 . (iii) E|Xn | → E|X| < ∞. Proof. See Durrett [16, pp. 221-222]. In the following, we rewrite the second derivatives of log likelihood function, to simplify notation. Define ∂ 2 ln(LX (β)) , Ω(β) = ∂β 2 then we have the following lemma, which was given in Lund et al. [25, Equation 10] without proof. A few necessary assumptions are needed for Lemma 7.2.4 and Theorem 7.2.5, and 2 ∗ 2 1 ∂ S (β) , 1 ∂ S(β) and 1 (−2Ω(β)) are not far apart. If we could they guarantee that N N ∂β 2 N ∂β 2 ˆ for PARMA model, some of the assumptions below may prove the strong consistency of β be reduced. This would be the direction of our further investigation. Currently we adopt the strong consistency for MLE of a vector ARMA model from Section 3 of Basawa and Lund [24], and apply it for PARMA model. • A1. Ω(β) is bounded and continuous on the parameter space B. 2 ∗ 2 1 ( ∂ S (β) − ∂ S(β) ) = o (1). • A2. N p ∂β 2 ∂β 2 1 • A3. N ∂ 2 S ∗ (β) Ω(β) + 21 ∂β 2 • A4. N β − βˆ β − βˆ = op (1). is uniform integrable. Lemma 7.2.4. − Ω(β) → A(β,σ 2 ) N in probability as 134 N → ∞. Proof. By Proposition 7.2.1, as N → ∞, 1 ∂ 2 S(β) → A(β, σ 2 ) in probability, 2N ∂β 2 then by assumption A2 and A3 we can write, as N → ∞, − Ω(β) 1 =− N N Ω(β) + 1 ∂ 2 S ∗ (β) 2 ∂β 2 + 1 N 1 ∂ 2 S ∗ (β) 1 ∂ 2 S(β) − 2 ∂β 2 2 ∂β 2 + 1 ∂ 2 S(β) 2N ∂β 2 1 1 ∂ 2 S(β) = −op (1) + op (1) + 2 2N ∂β 2 → A(β,σ 2 ) in probability, where the last equality is by Brockwell and Davis [11, Proposition 6.1.3]. Now we are ready to state our main theorem. Theorem 7.2.5. Suppose that Xt is a causal and invertible Gaussian PARMAS (p, q) process, such that assumptions A1 − A4 hold. Let f (t) be the joint probability density function ˆ of X = (Xi , Xi+1 , . . . , Xi+n−1 ) , and β(X) denote the maximum likelihood estimates given data X, suppose that we are given an independent realization of the same process ˆ ) as its MLE. Let T = E E [ln(L (β(Y ˆ )))], Y = (Yi , Yi+1 , . . . , Yi+n−1 ) , with β(Y Y X Y then ˆ T = EX [ln(LX (β(X)))] − (p + q)S + o(1). Proof. 
In the following, both the expectations EX and EY are with respect to f , where the samples X and Y are independent. Let true parameter values be β 0 ∈ B, where B is the 135 parameter space containing all β. From (7.2.5), ˆ EX [ln(LX (β(X)))] −T ˆ ˆ ))] =EX [ln(LX (β(X)))] − EY EX [ln(LX (β(Y ˆ ˆ ))] =EY EX [ln(LX (β(X))) − ln(LX (β(Y ˆ ˆ ))] =EY EX [ln(LX (β(X))) − ln LX (β0 ) ] + EY EX [ln(LX (β0 )) − ln(LX (β(Y ˆ ˆ ))] =EX [ln(LX (β(X))) − ln LX (β0 ) ] + EY EX [ln(LX (β0 )) − ln(LX (β(Y (7.2.8) ˆ We will prove in the following that EX ln(LX (β(X))) − ln LX (β0 ) = 21 (p+q)S +o(1) ˆ ))] = 1 (p + q)S + o(1), respectively. and EY EX [ln(LX (β0 )) − ln(LX (β(Y 2 ˆ First of all, we compute EX [ln(LX (β(X))) − ln LX (β0 ) ] in (7.2.8). Apply a Taylor ˆ series expansion to ln LX (β0 ) about the MLE β(X) for a sample of data X yielding ˆ ∂ ln(LX (β(X))) ˆ ˆ ln LX (β0 ) = ln(LX (β(X))) +[ ] (β0 − β(X)) ∂β (7.2.9) ˆ ∂ 2 ln(LX (β(X))) 1 ˆ ˆ + (β0 − β(X)) [ ](β0 − β(X)) + Re, 2 ∂β 2 where ˆ ∂ 2 ln(LX (β(X))) ˆ is a (p + q)S × (p + q)S matrix, β0 − β(X) is a 1 × (p + q)S ∂β 2 column vector, and Re represents the exact remainder term for the quadratic Taylor series 1 ), and the convergence expansion and assume it is uniformly integrable. Note that Re = op ( n of Re in probability could be inferred from Lemma 6.0.7 and Brockwell and Davis [11, Proposition 6.1.5]. 136 ˆ Since β(X) is the MLE from data X, then [ ˆ ∂ ln(LX (β(X))) ∂ ln(LX (β)) ]=[ ] = 0. β=βˆ ∂β ∂β Taking expectations on both sides of (7.2.9), ˆ EX [ln LX (β0 )] = EX [ln(LX (β(X)))] ˆ ∂ 2 ln(LX (β(X))) 1 ˆ ˆ (β0 − β(X))] + EX [Re] + EX [(β0 − β(X)) 2 ∂β 2 ˆ = EX [ln(LX (β(X)))] ˆ ∂ 2 ln(LX (β(X))) 1 ˆ ˆ + tr{EX [ ](β0 − β(X))(β 0 − β(X)) } + EX [Re], 2 ∂β 2 1 ) by Lemma 6.0.7 and Brockwell and Davis [11, Proposition 6.1.5]. Therewhere Re = op ( n fore the expectation of the remainder term is negligible for large sample sizes if we assume the uniform integrability, i.e. limN →∞ EX [Re] = 0. Additionally, because βˆ is the MLE under LX (β0 ), by (7.2.1), βˆ → β0 in probability as N → ∞. By assumption A1 and [11, Proposition 6.1.4], we have ˆ → Ω(β ) in probability as N → ∞. Ω(β) 0 ˆ is also uniformly integrable. Then by Lemma By assumption A1 and Lemma 7.2.2, Ω(β) 7.2.3 we can get: lim EX [ N →∞ ˆ ∂ 2 ln(LX (β)) ∂ 2 ln(LX (β(X))) ] = EX [ ]. ∂β 2 ∂β 2 137 Hence we may write ˆ ∂ 2 ln(LX (β(X))) ˆ ˆ (β0 − β(X)) (β0 − β(X)) ∂β 2 ˆ ˆ ˆ = (β0 − β(X)) Ω(β)(β 0 − β(X)) ˆ ˆ = (β0 − β(X)) Ω(β0 )(β0 − β(X)) ˆ + (β0 − β(X)) ˆ − Ω(β ) (β − β(X)) ˆ Ω(β) 0 0 ˆ ˆ = (β0 − β(X)) Ω(β0 )(β0 − β(X)) + op (1), ˆ since (β0 − β(X)) = op (1) so that ˆ − Ω(β ) = op (1), Ω(β) 0 and ˆ (β0 − β(X)) ˆ − Ω(β ) (β − β(X)) ˆ Ω(β) = op (1) 0 0 by Brockwell and Davis [11, Proposition 6.1.1]. Therefore ˆ ˆ ˆ ˆ ˆ (β0 − β(X)) Ω(β)(β 0 − β(X)) − (β0 − β(X)) Ω(β0 )(β0 − β(X)) = op (1). (7.2.10) Now by assumption A1 and A4, ˆ ˆ ˆ ˆ ˆ EX (β0 − β(X)) Ω(β)(β 0 − β(X)) − (β0 − β(X)) Ω(β0 )(β0 − β(X)) = o(1). 138 Therefore ˆ ˆ ˆ Ω(β)(β EX (β0 − β(X)) 0 − β(X)) ˆ ˆ ˆ ˆ ˆ = EX (β0 − β(X)) Ω(β)(β 0 − β(X)) − (β0 − β(X)) Ω(β0 )(β0 − β(X)) ˆ ˆ + EX (β0 − β(X)) Ω(β0 )(β0 − β(X)) ˆ ˆ = o(1) + EX (β0 − β(X)) Ω(β0 )(β0 − β(X)) ˆ = o(1) + tr EX Ω(β0 ) β0 − β(X) ˆ β0 − β(X) . 
Then ˆ lim EX [ln LX (β0 )] = lim EX [ln(LX (β(X)))] N →∞ N →∞ 1 ˆ ˆ ˆ + (β0 − β(X)) Ω(β)(β lim E 0 − β(X)) 2 N →∞ X + lim EX [Re] N →∞ ˆ lim EX [ln(LX (β(X)))] N →∞ 1 ˆ ˆ + + lim tr{EX Ω(β0 ) β0 − β(X) β0 − β(X) 2 N →∞ 1 ˆ = EX [ln(LX (β(X)))] − lim tr{EX AN WN }, 2 N →∞ = Ω(β0 ) ˆ N , and WN = N β0 − β(X) √ ˆ we define WN = ZN ZN , where ZN = N β0 − β(X) where we define AN = − } ˆ β0 − β(X) . Additionally → Z as N → ∞ and Z ∼ N(0, A−1 (β, σ 2 )) by (7.2.1). Next we will show that lim EX AN WN = I(p+q)S, N →∞ 139 (7.2.11) where I(p+q)S is the identity matrix. Notice that AN → A(β,σ 2 ) in probability as N → ∞, by Lemma 7.2.4, and WN → W = ZZ , then EWN → EW = E ZZ = A−1 (β,σ 2 ), where the uniform integrability is guaranteed in assumption A4. Hence EX AN WN = EX AN WN − W AN − A(β,σ 2 ) W + EX (7.2.12) + EX A(β,σ 2 )W . By assumption A1, |AN | < M , where M < ∞, therefore as N → ∞, 0 ≤ |EX AN WN − W | ≤ |EX |AN | WN − W | ≤ M |EX WN − W | → 0, then EX AN WN − W → 0. (7.2.13) Similarly, EX AN − A(β,σ 2 ) W → 0, (7.2.14) and EX A(β,σ 2 )W = A(β,σ 2 )EX [W ] = A(β,σ 2 )A−1 (β,σ 2 ) = I(p+q)S. (7.2.15) Substitute (7.2.13), (7.2.14) and (7.2.15) into (7.2.12), and let N → ∞, to arrive at (7.2.11). Then 1 ˆ lim tr{I(p+q)S } lim EX [ln LX (β0 )] = EX [ln(LX (β(X)))] − 2 N →∞ N →∞ 1 ˆ = EX [ln(LX (β(X)))] − (p + q)S. 2 140 So we obtain 1 ˆ lim {EX [ln(LX (β(X)))] − EX [ln LX (β0 )]} = (p + q)S, 2 N →∞ or, in other words, 1 ˆ EX [ln(LX (β(X)))] − EX [ln LX (β0 )] = (p + q)S + o(1). 2 (7.2.16) ˆ ))], the remaining term in the last Now let us consider EY EX [ln(LX (β0 )) − ln(LX (β(Y ˆ )) around β for any line of (7.2.8). Similarly, apply the Taylor expansion to ln(LX (β(Y 0 given data X yielding ˆ )) = ln(L (β )) + [ ∂ ln(LX (β)) ] (β(Y ˆ )−β ) ln LX (β(Y 0 X 0 ∂β ∂ 2 ln(LX (β)) ˆ 1 ˆ + (β(Y ) − β0 ) [ ](β(Y ) − β0 ) + Re. 2 ∂β 2 Taking expectations with respect to X yields ˆ ))] = E [ln(L (β ))] + E [ ∂ ln(LX (β)) ] (β(Y ˆ )−β ) EX [ln LX (β(Y 0 X X 0 X ∂β ∂ 2 ln(LX (β)) ˆ 1 ˆ + (β(Y ) − β0 ) EX [ ](β(Y ) − β0 ) 2 ∂β 2 +EX [Re] (7.2.17) where Y is independent of X. 141 Taking expectations of (7.2.17) with respect to Y yields ˆ ))] = E [ln(L (β ))] + E [ ∂ ln(LX (β)) ] E ˆ EY EX [ln LX (β(Y X X 0 X Y β(Y ) − β0 ∂β + 2 1 ˆ ) − β )] ˆ ) − β ) E [ ∂ ln(LX (β)) ](β(Y EY [(β(Y 0 0 X 2 ∂β 2 + EX [Re], ˆ ) − β → 0 by letting N → ∞ on both sides, the linear terms vanishes since EY β(Y 0 142 (7.2.1), then the above equations becomes ˆ ))] lim EY EX [ln LX (β(Y N →∞ ∂ ln(LX (β)) ˆ )−β = lim EX [ln(LX (β0 ))] + lim EX [ ] EY β(Y 0 ∂β N →∞ N →∞ ∂ 2 ln(LX (β)) ˆ 1 ˆ + lim E [(β(Y ) − β0 ) EX [ ](β(Y ) − β0 )] 2 N →∞ Y ∂β 2 + lim E [Re] N→∞ X ∂ ln(LX (β)) ˆ )−β ] lim EY β(Y 0 ∂β N →∞ 1 1 1 1 ∂ 2 ln(LX (β)) ˆ ) − β )N 2 (β(Y ˆ ) − β ) ]} + lim tr{EX [ ]EY [N 2 (β(Y 0 0 2 N →∞ N ∂β 2 =EX [ln(LX (β0 ))] + EX [ =EX [ln(LX (β0 ))] 1 1 1 1 ∂ 2 ln(LX (β)) ˆ ) − β )N 2 (β(Y ˆ ) − β ) ]} ]EY [N 2 (β(Y + tr{ lim EX [ 0 0 2 N →∞ N ∂β 2 1 ∂ 2 ln(LX (β)) 1 ]A(β, σ 2 )−1 } =EX [ln(LX (β0 ))] + tr{EX lim [ 2 ∂β 2 N →∞ N =EX [ln(LX (β0 ))] 1 1 ∂ 2 ln(LX (β)) 1 ∂ 2 S ∗ (β) + tr{EX lim [ + ]A(β, σ 2 )−1 } 2 2 ∂β 2 ∂β 2 N →∞ N 1 ∂ 2 S ∗ (β) 1 ]A(β, σ 2 )−1 } − tr{EX lim [ 2 2 2 ∂β N →∞ 1 =EX [ln(LX (β0 ))] + 0 − tr{[2A(β, σ 2 )A(β, σ 2 )−1 ]} 4 1 =EX [ln(LX (β0 ))] − tr{I(p+q)S } 2 1 =EX [ln(LX (β0 ))] − (p + q)S, 2 Hence ˆ ))]} = 1 (p + q)S, lim {EX [ln(LX (β0 ))] − EY EX [ln(LX (β(Y 2 N →∞ 143 or, in other words, ˆ ))] = 1 (p + q)S + o(1). 
E_X[ln L_X(β_0)] − E_Y E_X[ln L_X(β̂(Y))] = (1/2)(p + q)S + o(1).   (7.2.18)

Substituting (7.2.16) and (7.2.18) into (7.2.8), we have

E_X[ln L_X(β̂(X))] − T = (p + q)S + o(1),

which completes the proof.

Akaike [2] defined an information criterion (AIC) by multiplying ln L_X(β̂(X)) by −2, to get

AIC = −2 ln L_X(β̂(X)) + 2K,

where K is the bias of the maximized log-likelihood as an estimator of

T = E_Y E_X[ln L_Y(β̂(Y))],

and K equals the number of estimable parameters in the model. This has become known as Akaike's information criterion, or AIC. For a PARMA_S(p, q) model, if we treat the white noise variances as nuisance parameters, there are K = (p + q)S estimable parameters. This is also what Theorem 7.2.5 proves, since

lim_{N→∞} bias = lim_{N→∞} { E_X[ln L_X(β̂(X))] − T } = (p + q)S,

so that bias = (p + q)S + o(1). Therefore an asymptotically unbiased estimator of T = E_Y E_X[ln L_Y(β̂(Y))] is

E_X[ln L_X(β̂(X))] − (p + q)S.

From (7.2.6), the AIC for a PARMA model is therefore

AIC = −2 ln L_X(β̂(X)) + 2(p + q)S.   (7.2.19)

7.2.1 Application to model selection for the Fraser River

Besides the comparison of forecast plots in Chapter 5, we can also compute the value of the AIC for each candidate model; the model yielding the minimum AIC is the best model. The results are shown in Table 7.1. Among the full model candidates, the full PARMA_12(1, 1) model fitted by MLE in Table 5.4 has the minimum AIC, so it is the best full model. Among the reduced models, the model in Table 5.7, obtained from the asymptotic distribution of the MLE, has the minimum AIC, with only 13 estimable parameters. Lastly, for PAR_S(p) models we tried two different approaches. The first approach removes all θ̂_i in Table 5.7, since there are only three of them. In this way we obtain a PAR_12(1) model with only 12 estimable parameters; however, its AIC turns out to be very large. The second approach is more systematic: we use the pear package in R to perform automatic model selection for a PAR_12(p) model. The pear package was developed by A.I. McLeod and Mehmet Balcilar for estimating periodic autoregressive models, and it provides a built-in data set of the historical flows of the Fraser River. The best model selected is shown in Table 7.2; it is a PAR_12(3) model with 19 estimable parameters. Its AIC is also reported in Table 7.1. Additionally, Table 7.3 reports the AIC values after taking the logarithm of the data, where the model parameters have a larger impact on the AIC. Note that both the full and the reduced PARMA_12(1, 1) models perform better than the PAR models. Therefore the PARMA_12(1, 1) model is a better model for the Fraser River flows.
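For reference, the AIC values reported in Tables 7.1 and 7.3 below follow directly from (7.2.19) once the minimized value of −2 ln L_X(β̂(X)) is available from the fitting routine. A minimal R sketch is given below; the helper name parma_aic and its arguments are illustrative only and are not part of the pear or perARMA packages.

# AIC for a fitted periodic model, cf. (7.2.19):
# AIC = -2 log-likelihood + 2 * (number of estimable parameters).
# For a full PARMA_S(p,q) model with the S noise variances treated as
# nuisance parameters, k = (p + q) * S; for a reduced model, k is the
# number of coefficients actually retained.
parma_aic <- function(neg2loglik, k)
{
  neg2loglik + 2 * k
}
# Example with a hypothetical value of -2 log L for a full PARMA_12(1,1) fit:
# parma_aic(neg2loglik = 18428.47, k = (1 + 1) * 12)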
Model                                                     Number of parameters      AIC
Full PARMA_12(1, 1) model in Table 5.3                             24            18524.25
Full PARMA_12(1, 1) model in Table 5.4                             24            18476.47
Reduced PARMA_12(1, 1) model in Table 5.5                          19            18768.79
Reduced PARMA_12(1, 1) model in Table 5.7                          13            18528.13
PAR_12(1) model, removing all θ̂_i, in Table 5.7                    12            18754.09
PAR_12(3) model in Table 7.2                                       19            17714.33

Table 7.1: Comparison of AIC values for different models.

Season i    φ̂_i(1)    φ̂_i(2)    φ̂_i(3)        σ̂_i
   0         0.527     0.000     0.000     6325.612
   1         0.779    -0.231     0.189     5004.514
   2         0.764     0.000     0.000     5326.131
   3         1.188     0.000     0.000    17540.533
   4         0.647     0.000     0.000    37180.017
   5         0.411    -1.237     1.562    38084.824
   6         0.545     0.000     0.000    34809.521
   7         0.517     0.000     0.000    17666.091
   8         0.661    -0.127     0.000    13518.955
   9         0.890    -0.434     0.165    14685.378
  10         0.631     0.000     0.000    12434.951
  11         0.543     0.000     0.000     8535.279

Table 7.2: The model parameters from the automatic PAR_12(p) model fit by the pear package in R, where the number of estimable parameters is 19, treating the σ̂_i as nuisance parameters in the AIC computation.

Model                                                     Number of parameters      AIC
Full PARMA_12(1, 1) model in Table 5.3                             24            -415.8586
Full PARMA_12(1, 1) model in Table 5.4                             24            -506.6053
Reduced PARMA_12(1, 1) model in Table 5.5                          19            -313.3939
Reduced PARMA_12(1, 1) model in Table 5.7                          14            -297.9929
PAR_12(3) model in Table 7.2                                       19             520.9146

Table 7.3: Comparison of AIC values for different models after taking the logarithm of the data. Note that both the full and the reduced PARMA_12(1, 1) models perform better than the PAR models.

7.3 Future Research

In this section I list the open problems remaining from my research, which may also be helpful for researchers interested in studying this topic in more depth.

• Strong consistency of the MLE. This could follow the corresponding result for ARMA models in Yao and Brockwell [57]; the details of extending it to the PARMA model need to be worked out carefully.

• A stronger version of Theorem 7.2.5 with fewer assumptions. The uniform integrability of N(β − β̂)(β − β̂)′ still awaits a complete proof, and the properties of the second derivative of the PARMA likelihood function need to be studied in detail.

• I will work on cleaning up my R code and adding the forecasting tool to the perARMA package. This would be very useful for researchers who work on hydrology and on prediction for periodically stationary time series.
147 APPENDIX 148 APPENDIX # R code for the paper "Forecasting for periodic ARMA models" # by Anderson, Paul; Meerschaert, Mark; Zhang, Kai # Please put the file "frazierc.txt" under your work directory ####################################### # Seasonal Sample Mean for (4.1) ####################################### data <- read.table("frazierc.txt") XMEAN <- array(0,c(12)) # XMEAN is a vector of sample means for 12 seasons for(I in 1:12) { XMEAN[I] <- 0 for(T in 0:71) { XMEAN[I] <- XMEAN[I] + data[T*12+I,1] } XMEAN[I] <- XMEAN[I]/72 149 } ###################################### # Sample Autocovariance for (4.2) ###################################### # season i = 0, 1, ..., 11 # LAG = 0, 1, ..., 124 COVAR <- array(0,c(12,125)) for(L in 1:125) { LAG <- L-1 for (I in 1:12) { i <- (I-1) COVAR[I,L] <- 0 J <- as.integer((i+LAG)/12) K <- ((I+LAG)-(12*J)) for (T in 0:(71-J)) { COVAR[I,L] <- COVAR[I,L] +(data[T*12+I,1]-XMEAN[I])*(data[T*12+I+LAG,1]-XMEAN[K]) 150 } COVAR[I,L] <- COVAR[I,L]/(72-J) } } #################################### # Sample Autocorrelation for (4.3) #################################### rho <- array(0,c(12,125)) for (I in 1:12) { for (L in 1:125) { R <- (I+L-2)%%12 + 1 rho[I,L] <- COVAR[I,L]/((COVAR[I,1]*COVAR[R,1])^.5) } } ############################################## # Innovations Algorithm for X_t process ############################################## ## I = i + 1 151 ## J = j + 1 ## K = k + 1 ## L = ell + 1 ## N = n + 1 ## COVAR(I,L) = gamma_i(ell) ## V(N,I) = v_{n,i} ## THETA(N,M,I) = theta_{n,m}^{(i)} WHERE m = M + 1 ## NOTE THAT: theta_{n,n-k}^{(i)} = THETA(N,N-K+1,I) ## theta_{k,k-j}^{(i)} = THETA(K,K-J+1,I) ## gamma_k(n-k) = COVAR(K,N-K+1) V<- array(0,c(50,12)) THETA <- array(1,c(50,50,12)) for (I in 1:12) { V[1,I] <- COVAR[I,1] for (N in 2:50) { for (K in 1:(N-1)) { S <- 0 if (K == 1) { 152 K0 <- as.integer((I+1-2)/12) K1 <- (I+K-1)-12*K0 THETA[N,N,I] <- (COVAR[K1,N]-S)/V[1,I] } else { for (J in 1:(K-1)) { S <- S+THETA[K,K-J+1,I]*THETA[N,N-J+1,I]*V[J,I] K0 <- as.integer((I+K-2)/12) K1 <- (I+K-1)-12*K0 THETA[N,N-K+1,I] <- (COVAR[K1,N-K+1]-S)/V[K,I] } } } R <- 0 for(J in 1:(N-1)) { R <- R+V[J,I]*(THETA[N,N-J+1,I])^2 N0 <- as.integer((I+N-2)/12) N1 <- (I+N-1)-12*N0 V[N,I] <- COVAR[N1,1]-R 153 } } } # At k = 20 iterations, get the convergence of THETA and V psi1 <- array(0,c(12,12)) for ( I in 1:12 ) { R <- 0 for (J in 1:12) { R <- (I-20-1)%%12+1 psi1[I,J] <- THETA[21,J,R] } } sigma_square <- array(0,c(12)) for (I in 1:12) { S <- 0 S <- (I-20-1)%%12+1 sigma_square[I] <- V[21,S] } 154 ############################################ # Get model parameter estimates by (4.4) ############################################ phi <- array(0,c(12)) sigma <- array(0,c(12)) theta <- array(0,c(12,864)) for (I in 1:12) { R <- ((I-1)-1)%%12+1 phi[I] <- psi1[I,3]/psi1[R,2] } for (I in 1:12) { theta[I,1] <- -1 } for (I in 1:12) { theta[I,2] <- psi1[I,2]-phi[I] } for (I in 1:12) { 155 sigma[I] <- (sigma_square[I])^.5 } # A simple output of model estimates phi theta[,2] sigma # reduced model # phi[1] <- 0 # phi[5] <- 0 # phi[7] <- 0 # phi[8] <- 0 # phi[10] <- 0 ###################################### # The following is for prediction ###################################### # Autocovariances K(J,L) for W_t process # K(J,L) = C in (2.6) K <- array(0,c(865,865)) 156 for (I in 1:12) # I is season { K[I,I] <- COVAR[I,1] # when J = L = I for (J in 1:12) { for (L in 1:13) { if ( J <= I && L == (I+1) ) { s1 <- (J-1)%%12+1 l1 <- (L-1)%%12+1 K[J,L] <- COVAR[s1,abs(J-L)+1] - phi[l1]*COVAR[s1,abs(L-1-J)+1] } } } 
} for (I in 1:12) { 157 for (J in 1:865) { for (L in 1:865) { if (min(J,L) >= (I+1) && abs(J-L) <= 1) { s1 <- (J-1)%%12+1 n1 <- (L-1)%%12+1 q1 <- (J-2)%%12+1 K[J,L] <- theta[s1,1]*theta[n1,abs(J-L)+1]*(sigma[s1])^2 + theta[s1,2]*theta[n1,(abs(1+J-L))+1]*(sigma[q1])^2 } } } } ########################################## # Innovations algorithm for W_t process # The computation in (2.6) ########################################## V<- array(0,c(865,12)) THETA <- array(0,c(865,865,12)) 158 for (I in 1:12) { for (J in 1:865) { THETA[J,1,I] <- 1 } } for (I in 1:12) { V[1,I] <- K[I,I] for (N in 2:(865-I)) { THETA[N,2,I] <- K[I+N,I+N-1]/V[N-1,I] V[N,I] <- K[I+N,I+N]- V[N-1,I]*(THETA[N,2,I])^2 } } ##################################### # Computation of \hat{X} in (2.7) ##################################### 159 # Subtract seasonal mean fraser <- t(data) X <- array(0,c(864)) for(I in 1:12) { for(T in 0:71) { X[T*12+I] <- fraser[T*12+I]-XMEAN[I] } } # Xhat = \hat{X} in (2.7) Xhat <- array(0,c(12,880)) for (I in 1:12) { Xhat[I,1+I] <- 0 Xhat[I,2+I] <- THETA[2,2,I]*(X[I+1]-Xhat[I,I+1]) for (N in 3:(840-I)) { s1 <- (I+N-1)%%12+1 Xhat[I,N+I] <- phi[s1]*X[I+N-1]+ THETA[N,2,I]*(X[I+N-1]-Xhat[I,I+N-1]) } 160 } ############################################# # h-step prediction in (2.8) and (2.9) ############################################# for (I in 1:12) { s1 <- (841-1)%%12+1 Xhat[I,841] <- phi[s1]*X[840] + THETA[841-I,2,I]*(X[840]-Xhat[I,840]) for (h in 2:24) { m1 <- (840+h-1)%%12+1 Xhat[I,840+h] <- phi[m1]*Xhat[I,840+h-1] } } ############################# # Forecast error in (3.3) ############################# # Calculation of Casual represention for PARMA(1,1) # The casual coefficient psi is periodic in S psi <- array(0,c(12,24)) # I = 12, h =24 161 for (I in 1:12) { psi[I,1] <- 1 psi[I,2] <- (phi[I]+theta[I,2]) S <- 1 for (k in 3:24) { for (j in 0:(k-3)) { j0 <- (I-j-1)%%12+1 S <- S*phi[j0] } j1 <- (I-(k-1)-1)%%12+1 psi[I,k] <- S*(phi[j1]+theta[j1,2]) } } # h-step prediction error # Use this error for confidence band in Corr.3.2 sigma_h2 <- array(0,c(12,24)) for (I in 1:12) { for (h in 1:24) 162 { R <- 0 for (J in 1:h) { s0 <- (840+h-1)%%12+1 s1 <- (840+h-J-1)%%12+1 R <- R + (psi[s0,J])^2*(sigma[s1])^2 } sigma_h2[I,h] <- R } } # Add seasonal mean to Xhat Yhat <- array(0,c(880)) for(I in 1:12) { for(T in 0:71) { Yhat[T*12+I] <- Xhat[1,T*12+I]+XMEAN[I] } } # Computation of residuals 163 res <- X - Yhat[1:864] for (I in 1:12) { for (T in 0:71) { res[T*12+I] <- res[T*12+I]/sigma[I] } } ############################################## # 95% Prediction bounds for h-step prediction in Figure 3 ############################################## CI_low <- array(0,c(24)) CI_up <- array(0,c(24)) for (I in 1:12) { for (h in 1:24) { CI_low[h] <- Yhat[840+h]-1.96*sqrt(sigma_h2[I,h]) CI_up[h] <- Yhat[840+h]+1.96*sqrt(sigma_h2[I,h]) } } 164 ########################################### # 95% Confidence Intervals for sample mean ########################################### CI_low_mean <- array(0,c(12)) CI_up_mean <- array(0,c(12)) for (I in 1:12) { CI_low_mean[I] <- XMEAN[I]-1.96*(COVAR[I,1])^.5 CI_up_mean[I] <- XMEAN[I]+1.96*(COVAR[I,1])^.5 } ######################################### # 95% Confidence Intervals for gamma_0 ######################################### error <- array(0,c(12)) ## define error as sqrt{[(V_00)_ellell]/72} for (I in 1:12) { S <- 0 for (L in 0:10) { S <- S + 4*((COVAR[I,12*L+1])^2) } error[I] <- ((S - 2*((COVAR[I,1])^2))/72)^.5 165 ## subtract the exact one from niu = 0 } CI_low_gamma0 <- array(0,c(12)) 
CI_up_gamma0 <- array(0,c(12)) for (I in 1:12) { CI_low_gamma0[I] <- COVAR[I,1]-1.96*error[I] CI_up_gamma0[I] <- COVAR[I,1]+1.96*error[I] } ######################################### # 95% Confidence Intervals for rho_1 ######################################### W_11 <- array(0,c(12)) for (I in 1:12) { S <- 0 R <- ((I-1)%%12)+1 ## S is the sum at n = 0, niu = 12 S <- rho[I,1]*rho[R,1] + rho[I,2]*rho[R,2] - rho[I,2]*(rho[I,1]*rho[I,2]+ 166 rho[R,2]*rho[R,1]) - rho[I,2]*(rho[I,1]*rho[R,2]+rho[I,2]*rho[R,1]) + .5*rho[I,2]^2*(rho[I,1]^2+rho[I,2]^2+rho[R,2]^2+rho[R,1]^2) ## Next loop is sum from n = -10 to 10 for (L in 1:10) { S <- S + (rho[I,L*12+1]*rho[R,L*12+1] + rho[I,L*12+2]*rho[R,L*12] rho[I,2]*(rho[I,L*12+1]*rho[I,L*12+2]+rho[R,L*12]*rho[R,L*12+1]) rho[I,2]*(rho[I,L*12+1]*rho[R,L*12]+rho[I,L*12+2]*rho[R,L*12+1]) + .5*rho[I,2]^2*(rho[I,L*12+1]^2 +rho[I,L*12+2]^2+rho[R,L*12]^2+rho[R,L*12+1]^2)) } P <- 0 for (L in (-10):(-1)) { P <- P + (rho[I,abs(L*12)+1]*rho[R,abs(L*12)+1] + rho[I,abs(L*12+1)+1]*rho[R,abs(L*12-1)+1] rho[I,2]*(rho[I,abs(L*12)+1]*rho[I,abs(L*12+1)+1]+ rho[R,abs(L*12-1)+1]*rho[R,abs(L*12)+1]) rho[I,2]*(rho[I,abs(L*12)+1]*rho[R,abs(L*12-1)+1]+ rho[I,abs(L*12+1)+1]*rho[R,abs(L*12)+1]) + .5*rho[I,2]^2*(rho[I,abs(L*12)+1]^2+rho[I,abs(L*12+1)+1]^2 +rho[R,abs(L*12-1)+1]^2+rho[R,abs(L*12)+1]^2)) } 167 W_11[I] <- P+S } CI_low_rho1 <- array(0,c(12)) CI_up_rho1 <- array(0,c(12)) for (I in 1:12) { CI_low_rho1[I] <- rho[I,2]-1.96*((W_11[I]/72)^.5) CI_up_rho1[I] <- rho[I,2]+1.96*((W_11[I]/72)^.5) } ######################################### # 95% Confidence Intervals for rho_2 ######################################### W_22 <- array(0,c(12)) for (I in 1:12) { S <- 0 R <- ((I)%%12)+1 ## S is the sum at n = 0, niu = 12 S <- rho[I,1]*rho[R,1] + rho[I,3]*rho[R,3] rho[I,3]*(rho[I,1]*rho[I,3]+rho[R,3]*rho[R,1]) - 168 rho[I,3]*(rho[I,1]*rho[R,3]+rho[I,3]*rho[R,1]) + .5*rho[I,3]^2*(rho[I,1]^2+rho[I,3]^2+rho[R,3]^2+rho[R,1]^2) ## Next loop is sum from n = -10 to 10 for (L in 1:10) { S <- S + (rho[I,L*12+1]*rho[R,L*12+1] + rho[I,L*12+3]*rho[R,L*12-1] - rho[I,3]*(rho[I,L*12+1]*rho[I,L*12+3]+ rho[R,L*12-1]*rho[R,L*12+1]) - rho[I,3]*(rho[I,L*12+1]*rho[R,L*12-1]+ rho[I,L*12+3]*rho[R,L*12+1]) + .5*rho[I,3]^2*(rho[I,L*12+1]^2+ rho[I,L*12+3]^2+rho[R,L*12-1]^2+rho[R,L*12+1]^2)) } P <- 0 for (L in (-10):(-1)) { P <- P + (rho[I,abs(L*12)+1]*rho[R,abs(L*12)+1] + rho[I,abs(L*12+2)+1]*rho[R,abs(L*12-2)+1] rho[I,3]*(rho[I,abs(L*12)+1]*rho[I,abs(L*12+2)+1]+ rho[R,abs(L*12-2)+1]*rho[R,abs(L*12)+1]) rho[I,3]*(rho[I,abs(L*12)+1]*rho[R,abs(L*12-2)+1]+ rho[I,abs(L*12+2)+1]*rho[R,abs(L*12)+1]) + .5*rho[I,3]^2*(rho[I,abs(L*12)+1]^2+rho[I,abs(L*12+2)+1]^2+ 169 rho[R,abs(L*12-2)+1]^2+rho[R,abs(L*12)+1]^2)) } W_22[I] <- P+S } CI_low_rho2 <- array(0,c(12)) CI_up_rho2 <- array(0,c(12)) for (I in 1:12) { CI_low_rho2[I] <- rho[I,3]-1.96*((W_22[I]/72)^.5) CI_up_rho2[I] <- rho[I,3]+1.96*((W_22[I]/72)^.5) } ############# # Output ############# # Please remove "#" if you want to generate output files write(CI_low,file="CI_low_prediction.txt",ncolumns=1) write(CI_up,file="CI_up_prediction.txt",ncolumns=1) 170 write(fraser,file="Data of Fraser River.txt",ncolumns=1) write(Yhat[1:864],file="24-month Predictions.txt",ncolumns=1) # write(XMEAN,file="Sample Mean.txt",ncolumns=1) # write(CI_low_mean,file="CI_low_mean.txt",ncolumns=1) # write(CI_up_mean,file="CI_up_mean.txt",ncolumns=1) # write(COVAR[,1],file="Sample Variance.txt",ncolumns=1) # write(CI_low_gamma0,file="CI_low_variance.txt",ncolumns=1) # 
write(CI_up_gamma0,file="CI_up_variance.txt",ncolumns=1) # write(rho[,2],file="rho_1.txt",ncolumns=1) # write(CI_low_rho1,file="CI_low_rho1.txt",ncolumns=1) # write(CI_up_rho1,file="CI_up_rho1.txt",ncolumns=1) # write(rho[,3],file="rho_2.txt",ncolumns=1) # write(CI_low_rho2,file="CI_low_rho2.txt",ncolumns=1) # write(CI_up_rho2,file="CI_up_rho2.txt",ncolumns=1) # write(res,file="Residuals.txt",ncolumns=1) ########################### # Plots 171 ########################### # The plots in this paper were producted by Minitab # The following plots are provided as drafts for reference ############ # Figure 1 g_range <- range(0, data) plot(data[1:180,1], col = "black", ylab = "Flow (cms)", axes=FALSE, type = "l", ylim=g_range, xlab="Month / Year", cex.lab=1.5, lty = 5) axis(1,at = c(1,37,73,109,145,180),lab= c("10/1912", "10/1915","10/1918","10/1921","10/1924","09/1927")) axis(2, at = c(1,50000,100000,150000,200000,250000, 300000,350000,400000), lab =c("0","50000","100000", "150000","200000","250000","300000","350000","400000") ) ############ # Figure 3 ############ g_range <- range(0, CI_up) plot(CI_low, col = "blue", ylab = "Flow (cms)", lwd = 2, axes=FALSE, type = "l", ylim=g_range, xlab="Month / Year", 172 cex.main = 2, cex.lab=1.5, lty = 5) axis(1,at = c(1,5,9,13,17,21,24),lab= c("10/1982","02/1983", "06/1983","10/1983","02/1984","06/1984","")) axis(2) lines(CI_up, col = "blue", type="l", lty=5, lwd = 2) lines(Yhat[841:864], col = "red", type="o", pch=20, lty=1, lwd = 2) lines(data[841:864,1],col = "black", type="l", lty=1, lwd = 2) ############ # Figure 4 ############ g_range <- range((CI_low-c(XMEAN,XMEAN)), (CI_up-c(XMEAN,XMEAN))) plot((CI_low-c(XMEAN,XMEAN)), col = "blue", type = "l", ylab = "Width of prediction bounds", ylim=g_range, xlab="Month", main="Width of prediction bounds (mean subtracted)", lty = 5) lines(Yhat[841:864]-c(XMEAN,XMEAN), col = "red", type="o", pch=20, lty=1) lines((CI_up-c(XMEAN,XMEAN)), col = "blue", type="l", lty=5) ############ # Figure 2 ############ par(mfrow = c(2,2)) g_range <- range(0, (CI_up_mean)) 173 plot(XMEAN, col = "black", ylab = "Sample mean (cms)", lwd =2, ylim=g_range,axes=FALSE, xlab="Season", main="(a) Sample Means", lty = 5, cex.main = 2.5, cex.lab=1.5) axis(1,at = 1:12,lab= c("0","1","2","3","4","5","6","7","8","9","10","11")) axis(2) lines(XMEAN, type="o",lwd = 2,pch = 20) lines((CI_low_mean), col = "red", type="l", lty=5,lwd =2) lines((CI_up_mean), col = "red", type="l", lty=5,lwd =2) g_range <- range(0, (CI_up_gamma0)^.5 ) plot((COVAR[,1])^.5, col = "black", ylab = "Sample sd (cms)", axes=FALSE, lwd=2, ylim=g_range, xlab="Season", main="(b) Sample Standard Deviations", lty = 5,cex.main = 2.2, cex.lab=1.5) axis(1,at = 1:12,lab= c("0","1","2","3","4","5","6","7","8","9","10","11")) axis(2) lines((COVAR[,1])^.5, type="o",lwd = 2,pch = 20) lines((CI_low_gamma0)^.5, col = "red", type="l", lty=5,lwd=2) lines((CI_up_gamma0)^.5, col = "red", type="l", lty=5,lwd=2) g_range <- range(0, 1) plot(rho[,2], col = "black", ylab = "Autocorrelations", axes=FALSE, ylim=g_range, xlab="Season", main="(c) Sample Autocorrelations : lag = 1", lty = 5,cex.main = 1.9, cex.lab=1.5) 174 axis(1,at = 1:12,lab= c("0","1","2","3","4","5","6","7","8","9","10","11")) axis(2) lines(rho[,2], type="o",lwd = 2,pch = 20) lines(CI_low_rho1, col = "red", type="l", lty=5,lwd=2) lines(CI_up_rho1, col = "red", type="l", lty=5,lwd=2) g_range <- range(-.5, 1) plot(rho[,3], col = "black", ylab = "Autocorrelations", axes=FALSE,lwd=2, ylim=g_range, xlab="Season", main="(d) Sample 
Autocorrelations : lag = 2", lty = 5,cex.main = 1.9, cex.lab=1.5) axis(1,at = 1:12,lab= c("0","1","2","3","4","5","6","7","8","9","10","11")) axis(2) lines(rho[,3], type="o",lwd = 2,pch = 20) lines(CI_low_rho2, col = "red", type="l", lty=5,lwd=2) lines(CI_up_rho2, col = "red", type="l", lty=5,lwd=2) ################################################# # Computation of PARMA Autocovariances in Chapter 1.1 ################################################# # Model is PARMA_12(1,1) i.e. there are 12 seasons # Model is based on table 5 in Tesfaye, Meerschaert and Anderson (2006) # phi_1 <- c(.198,.568,.560,.565,.321,.956,1.254,.636,-1.942,-.092,.662,.355) 175 phi_1 <- c(0,.568,.560,.565,0,.956,0,0,-1.942,0,.662,.355) theta_1 <- c(.687,.056,-.052,-.05,.47,-.389,-.178,-.114,2.393,.71,-.213,.322) sigma <- c(11875.479,11598.254,7311.452,5940.845,4160.214,4610.209, 15232.867,31114.514,32824.370,29712.190,15511.187,12077.991) theta_0 <- array(1,c(12)) psi_0 <- array(1,c(12)) psi_1 <- phi_1 + theta_1 # By (16) in Tesfaye, Meerschaert and Anderson (2006) # Set AX = b, then solve X, where X is a vector # X gives ACVF = gamma_i(h), when h <= max(p,q), i = 0, 1, ... 11 # In PARMA_12(1,1), p = q = 1 A <- array(0,c(24,24)) for (i in 1:12) { A[i,i] <- 1 A[i+12,(i-2)%%12+1+12] <- 1 A[i,(i-2)%%12+1+12] <- (-phi_1[i]) A[i+12,(i-2)%%12+1] <- (-phi_1[i]) } b <- array(0,c(24)) 176 for (i in 1:12) { b[i] <- theta_0[i] * psi_0[i] * (sigma[i])^2 + theta_1[i] * psi_1[i] * (sigma[(i-2)%%12+1])^2 b[i+12] <- theta_1[i] * psi_0[(i-2)%%12+1] * (sigma[(i-2)%%12+1])^2 } X <- solve(A,b) X matrix(X, ncol = 2) # COVAR is autocovariance function # COVAR[I,H] = \gamma_{i}(h) # I = i + 1, i is season, i = 0, 1, 2, ... 11 # H = h + 1, h is lag, h = 0, 1, 2, ... 79 COVAR <- array(0,c(12,80)) COVAR[,1:2] <- X # Read X into first columns of COVAR # A good way to get rid of the for loop below #for (I in 1:12) # This loop reads X into COVAR, for h <= max(p,q); here h = 0,1 177 # { # COVAR[I,1] <- X[I] # COVAR[I,2] <- X[I+12] # } for (H in 3:80) # This loop computes ACVF for h > max(p,q) { for (I in 1:12) { COVAR[(I-H)%%12+1,H] <- phi_1[I]*COVAR[(I-H)%%12+1,H-1] #this one is right! 
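# For lags h > max(p,q) = 1 the moving average terms no longer contribute,
# so the PARMA_12(1,1) autocovariances satisfy the purely autoregressive
# recursion gamma_{(i-h) mod 12}(h) = phi_1(i) * gamma_{(i-h) mod 12}(h-1);
# the assignment above implements this recursion season by season.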
# (I-2)%%12+1 represents season I-1 # COVAR[I,H] <- phi_1[I]*COVAR[(I-2)%%12+1,H-1] } } ############################### # Innovations Algorithm ############################### V<- array(0,c(50,12)) THETA <- array(0,c(50,50,12)) 178 for (I in 1:12) # I = i+1, i is season { V[1,I] <- COVAR[I,1] for (N in 2:50) # N = n+1, n is number of iterations { for (K in 1:(N-1)) { S <- 0 if (K == 1) { K0 <- as.integer((I+1-2)/12) K1 <- (I+K-1)-12*K0 THETA[N,N,I] <- (COVAR[K1,N]-S)/V[1,I] } else { for (J in 1:(K-1)) { S <- S+THETA[K,K-J+1,I]*THETA[N,N-J+1,I]*V[J,I] K0 <- as.integer((I+K-2)/12) K1 <- (I+K-1)-12*K0 179 THETA[N,N-K+1,I] <- (COVAR[K1,N-K+1]-S)/V[K,I] THETA[N,1,I] <- 1 # This defines \theta_{n,0}(i) = 1 } } } R <- 0 for(J in 1:(N-1)) { R <- R+V[J,I]*(THETA[N,N-J+1,I])^2 N0 <- as.integer((I+N-2)/12) N1 <- (I+N-1)-12*N0 V[N,I] <- COVAR[N1,1]-R } } } ################################# # convergence of theta to psi ################################# psi_k <- array(0,c(50,12,50)) # psi_k is \psi_{i}(\ell) for (K in 1:50) # K = k+1, k is number of iterations { 180 for ( I in 1:12 ) # I = i+1, i is season { R <- 0 for (J in 1:50) { R <- (I-K)%%12+1 psi_k[K,I,J] <- THETA[K,J,R] } } } # Test output for lag = 1 # This matches values of psi_1 = phi_1 + theta_1, after 5 iterations psi_k[,,2] psi_1 # Test output for lag = 2 # psi_k[,,3] # This shows the error between psi_k[,,2] and psi_1 error <- array(0,c(50,12)) for (I in 1:12) { for (K in 1:50) 181 { error[K,I] <- psi_k[K,I,2]-psi_1[I] } } max_error <- array(0,c(49)) # for all season, for lag =1 only for (K in 1:49) for (I in 1:12) { { max_error[K] <- max(abs(error[K,I])) } } plot(max_error,ylab = "Value of convergence error", xlab="Number of iterations", main="error",cex.main = 2, cex.lab=1.8) lines(max_error, type="o", pch=20, lty=1, col="red") ######################################## # convergence of v to sigma^2 ######################################## 182 sigma_square <- array(0,c(50,12)) for (K in 1:50) { for (I in 1:12) { S <- 0 S <- (I-K)%%12+1 sigma_square[K,I] <- V[K,S] } } # Test output, which matches sigma, after 5 iterations sigma_square^.5 sigma # This gives the error between sigma_square and sigma^2 error2 <- array(0,c(50,12)) for (I in 1:12) { for (K in 1:50) { error2[K,I] <- (sigma_square[K,I]-(sigma[I])^2) } 183 } 184 BIBLIOGRAPHY 185 BIBLIOGRAPHY [1] Adams, G.J and G.C. Goodwin, (1995), Parameter estimation for periodically ARMA models. Journal of Time Series Analysis, 16, 127–145. [2] Akaike, H. (1974), A new look at the statistical model identification. IEEE Transactions of Automatic Control, 19, 716–723. [3] Anderson, P.L. and A.V. Vecchia (1993), Asymptotic results for periodic autoregressive moving -average processes. Journal of Time Series Analysis, 14, 1–18. [4] Anderson, P.L. and M.M. Meerschaert (1997), Periodic moving averages of random variables with regularly varying tails. Annals of Statistics, 25, 771–785. [5] Anderson, P.L. and M.M. Meerschaert (1998), Modeling river flows with heavy tails. Water Resources Research , 34 (9), 2271–2280. [6] Anderson, P.L., M.M. Meerschaert and A.V. Veccia (1999), Innovations algorithm for periodically stationary time series. Stochastic Processes and their Applications, 83, 149– 169. [7] Anderson, P.L. and M.M. Meerschaert (2005), Parameter estimates for periodically stationary time series. Journal of Time Series Analysis, 26, 489–518. [8] Anderson, P.L., M.M. Meerschaert, Y.G. Tesfaye (2007), Fourier-PARMA models and their application to modeling of river flows. 
Journal of Hydrologic Engineering, 12 (5), 462–472.

[9] Ansley, C.F. (1979), An algorithm for the exact likelihood of a mixed autoregressive-moving average process. Biometrika, 66 (1), 59–65.

[10] Box, G.E. and G.M. Jenkins (1976), Time Series Analysis: Forecasting and Control, 2nd edition, San Francisco: Holden-Day.

[11] Brockwell, P.J. and R.A. Davis (1991), Time Series: Theory and Methods, 2nd ed. New York: Springer-Verlag.

[12] Broyden, C.G. (1970), The convergence of a class of double-rank minimization algorithms. Journal of the Institute of Mathematics and Its Applications, 6, 76–90.

[13] Burnham, K.P. and Anderson, D.R. (2010), Model Selection and Multi-Model Inference: A Practical Information-Theoretic Approach, 1st ed. Springer.

[14] Chen, H.-L. and A.R. Rao (2002), Testing hydrologic time series for stationarity. Journal of Hydrologic Engineering, 7 (2), 129–136.

[15] Dunsmuir, W. and Hannan, E.J. (1976), Vector linear time series models. Advances in Applied Probability, 8 (2), 339–364.

[16] Durrett, R. (2010), Probability: Theory and Examples, 4th ed. Cambridge University Press.

[17] Fletcher, R. (1970), A new approach to variable metric algorithms. Computer Journal, 13 (3), 317–322.

[18] Gladyshev, E.G. (1961), Periodically correlated random sequences. Soviet Mathematics, 2, 385–388.

[19] Goldfarb, D. (1970), A family of variable metric updates derived by variational means. Mathematics of Computation, 24 (109), 23–26.

[20] Jones, R.H. and W.M. Brelsford (1967), Time series with periodic structure. Biometrika, 54, 403–408.

[21] Lehmann, E.L. and Casella, G. (1998), Theory of Point Estimation, 2nd ed. Springer.

[22] Lund, R.B. and I.V. Basawa (1999), Modeling and inference for periodically correlated time series. Asymptotics, Nonparametrics, and Time Series (ed. S. Ghosh), 37–62. Marcel Dekker, New York.

[23] Lund, R.B. and I.V. Basawa (2000), Recursive prediction and likelihood evaluation for periodic ARMA models. Journal of Time Series Analysis, 20 (1), 75–93.

[24] Basawa, I.V. and Lund, R.B. (2001), Large sample properties of parameter estimates for periodic ARMA models. Journal of Time Series Analysis, 22 (6), 651–663.

[25] Lund, R.B., Shao, Q. and I.V. Basawa (2006), Parsimonious periodic time series modeling. Australian & New Zealand Journal of Statistics, 48 (1), 33–47.

[26] Makagon, A., Miamee, A.G. and Salehi, H. (1991), Periodically correlated processes and their spectrum. Nonstationary Stochastic Processes and Their Applications, 147–164.

[27] Makagon, A., Miamee, A.G. and Salehi, H. (1994), Continuous time periodically correlated processes: spectrum and prediction. Stochastic Processes and their Applications, 49 (2), 277–295.

[28] Makagon, A., Miamee, A.G., Salehi, H. and Soltani, A.R. (2008), On spectral domain of periodically correlated processes. Theory of Probability and its Applications, 52 (2), 353–361.

[29] Makagon, A. and Salehi, H. (1993), Structure of periodically distributed stochastic sequences. Stochastic Processes, 245–251.

[30] McLeod, A.I. (1994), Diagnostic checking of periodic autoregression models with application. Journal of Time Series Analysis, 15 (2), 221–233.

[31] Noakes, D.J., McLeod, A.I. and Hipel, K.W. (1985), Forecasting monthly riverflow time series. International Journal of Forecasting, 1, 179–190.

[32] Pagano, M. (1978), On periodic and multiple autoregressions. Annals of Statistics, 6 (6), 1310–1317.

[33] R Development Core Team (2008), R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, http://www.R-project.org.

[34] Salas, J.D., D.C. Boes and R.A. Smith (1982), Estimation of ARMA models with seasonal parameters. Water Resources Research, 18, 1006–1010.

[35] Salas, J.D., G.Q. Tabios III and P. Bartolini (1985), Approaches to multivariate modeling of water resources time series. Water Resources Bulletin, 21, 683–708.

[36] Salas, J.D. and J.T.B. Obeysekera (1992), Conceptual basis of seasonal streamflow time series models. Journal of Hydraulic Engineering, 118 (8), 1186–1194.

[37] Schwarz, G. (1978), Estimating the dimension of a model. Annals of Statistics, 6, 461–464.

[38] Shanno, D.F. (1970), Conditioning of quasi-Newton methods for function minimization. Mathematics of Computation, 24 (111), 647–656.

[39] Shao, Q. and R.B. Lund (2004), Computation and characterization of autocorrelations and partial autocorrelations in periodic ARMA models. Journal of Time Series Analysis, 25 (3), 359–372.

[40] Shiryaev, A.N. (1984), Probability. Springer-Verlag, New York.

[41] Tesfaye, Y.G. (2005), Seasonal Time Series Models and Their Application to the Modeling of River Flows. PhD Dissertation, University of Nevada, Reno, Nevada.

[42] Tesfaye, Y.G., M.M. Meerschaert and P.L. Anderson (2006), Identification of PARMA models and their application to the modeling of river flows. Water Resources Research, 42 (1), W01419.

[43] Tesfaye, Y.G., P.L. Anderson and M.M. Meerschaert (2011), Asymptotic results for Fourier-PARMA time series. Journal of Time Series Analysis, 32 (2), 157–174.

[44] Tiao, G.C. and M.R. Grupe (1980), Hidden periodic autoregressive moving average models in time series data. Biometrika, 67, 365–373.

[45] Tjøstheim, D. and J. Paulsen (1982), Empirical identification of multiple time series. Journal of Time Series Analysis, 3, 265–282.

[46] Thompstone, R.M., K.W. Hipel and A.I. McLeod (1985), Forecasting quarter-monthly riverflow. Water Resources Bulletin, 25 (5), 731–741.

[47] Troutman, B.M. (1979), Some results in periodic autoregression. Biometrika, 66, 219–228.

[48] Ula, T.A. (1990), Periodic covariance stationarity of multivariate periodic autoregressive moving average processes. Water Resources Research, 26 (5), 855–861.

[49] Ula, T.A. (1993), Forecasting of multivariate periodic autoregressive moving average processes. Journal of Time Series Analysis, 14, 645–657.

[50] Ula, T.A. and A.A. Smadi (1997), Periodic stationarity conditions for periodic autoregressive moving average processes as eigenvalue problems. Water Resources Research, 33 (8), 1929–1934.

[51] Ula, T.A. and A.A. Smadi (2003), Identification of periodic moving average models. Communications in Statistics: Theory and Methods, 32 (12), 2465–2475.

[52] Ursu, E. and Turkman, K.F. (2012), Periodic autoregression model identification using genetic algorithms. Journal of Time Series Analysis, doi: 10.1111/j.1467-9892.2011.00772.x.

[53] Vecchia, A.V. (1985a), Periodic autoregressive-moving average (PARMA) modelling with applications to water resources. Water Resources Bulletin, 21, 721–730.

[54] Vecchia, A.V. (1985b), Maximum likelihood estimation for periodic moving average models. Technometrics, 27 (4), 375–384.

[55] Vecchia, A.V. and R. Ballerini (1991), Testing for periodic autocorrelations in seasonal time series data. Biometrika, 78 (1), 53–63.

[56] Wylomańska, A. (2008), Spectral measures of PARMA sequences. Journal of Time Series Analysis, 29 (1), 113.

[57] Yao, Q. and Brockwell, P.J. (2006), Gaussian maximum likelihood estimation for ARMA models I: time series. Journal of Time Series Analysis, 27 (6), 857–875.