ENVIRONMENTAL ADAPTIVE SAMPLING FOR MOBILE SENSOR NETWORKS USING GAUSSIAN PROCESSES

By

Yunfei Xu

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Mechanical Engineering

2011

ABSTRACT

ENVIRONMENTAL ADAPTIVE SAMPLING FOR MOBILE SENSOR NETWORKS USING GAUSSIAN PROCESSES

By

Yunfei Xu

In recent years, due to significant progress in sensing, communication, and embedded-system technologies, mobile sensor networks have been exploited in monitoring and predicting environmental fields (e.g., temperature, salinity, pH, or biomass of harmful algal blooms). The conventional inverse problem approach based on physical transport models is computationally prohibitive for resource-constrained, multi-agent systems. In contrast, emphasizing practicality and usefulness, this work relies extensively on phenomenological and statistical modeling techniques, in particular, Gaussian processes. However, such statistical models need to be carefully tailored so that they are practical and usable for mobile sensor networks with limited resources. In this dissertation, we consider the problem of using mobile sensor networks to estimate and predict environmental fields modeled by spatio-temporal Gaussian processes.

In the first part of the dissertation, we first present robotic sensors that learn a spatio-temporal Gaussian process and move in order to improve the quality of the estimated covariance function. For a given covariance function, we then theoretically justify the use of truncated observations in Gaussian process regression for mobile sensor networks with limited resources. We propose both centralized and distributed navigation strategies for resource-limited mobile sensing agents to move in order to reduce prediction error variances at points of interest. Next, we formulate a fully Bayesian approach for spatio-temporal Gaussian process regression such that the multifactorial effects of observations, measurement noise, and prior distributions of hyperparameters are all correctly incorporated in the posterior predictive distribution. To cope with computational complexity, we design sequential Bayesian prediction algorithms in which exact predictive distributions can be computed in constant time as the number of observations increases. Under this formulation, we provide an adaptive sampling strategy for mobile sensors, using the maximum a posteriori (MAP) estimation, to minimize the prediction error variances.

In the second part of the dissertation, we address the issue of computational complexity by exploiting the sparsity of the precision matrix used in a Gaussian Markov random field (GMRF). The main advantages of using GMRFs are: (1) the computational efficiency due to the sparse structure of the precision matrix, and (2) the scalability as the number of measurements increases. We first propose a new class of Gaussian processes that builds on a GMRF with respect to a proximity graph over the surveillance region, and provide scalable inference algorithms to compute predictive statistics. We then consider a discretized spatial field that is modeled by a GMRF with unknown hyperparameters. From a Bayesian perspective, we design a sequential prediction algorithm to exactly compute the predictive inference of the random field.
An adaptive sampling strategy is also designed for mobile sensing agents to find the most informative locations at which to take future measurements in order to minimize the prediction error and the uncertainty in the estimated hyperparameters simultaneously.

ACKNOWLEDGMENT

I would like to thank everybody who helped me to accomplish this dissertation. First of all, I would like to express my sincere appreciation and gratitude to my advisor, Dr. Jongeun Choi, who continuously encouraged and supported me. This thesis would not have been possible without his invaluable advice and his effort to improve the work. In addition, I would like to thank all my committee members, Dr. Ranjan Mukherjee, Dr. Xiaobo Tan, Dr. Guoming Zhu, and Dr. Tapabrata Maiti, for their contributions and suggestions. I would also like to thank my colleagues, Mr. Mahdi Jadaliha and Mr. Justin Mrkva, for their help in obtaining experimental data. Last but not least, I cannot thank my parents and my wife enough for always being there for me. I would never have been able to accomplish what I have without them.

TABLE OF CONTENTS

List of Tables
List of Figures

1 Introduction
1.1 Background
1.2 Contribution
1.3 Organization
1.4 Publication
1.4.1 Journal Articles
1.4.2 Conference Proceedings

2 Preliminaries
2.1 Mathematical Notation
2.2 Physical Process Model
2.2.1 Gaussian process
2.2.2 Spatio-temporal Gaussian process
2.2.3 Gaussian Markov random field
2.3 Mobile Sensor Network
2.4 Gaussian Processes for Regression

3 Learning the Covariance Function
3.1 Learning the Hyperparameters
3.2 Optimal Sampling Strategy
3.3 Simulation
3.3.1 Spatio-temporal Gaussian process
3.3.2 Time-varying covariance functions
3.3.3 Advection-diffusion process

4 Prediction with Known Covariance Function
4.1 GPR with Truncated Observations
4.1.1 Error bounds in using truncated observations
4.1.2 Selecting temporal truncation size
4.2 Optimal Sampling Strategies
4.2.1 Centralized navigation strategy
4.2.2 Distributed navigation strategy
4.3 Simulation
4.3.1 Gradient-based algorithm vs. exhaustive search algorithm
4.3.2 Centralized sampling scheme
4.3.3 Distributed sampling scheme

5 Fully Bayesian Approach
5.1 Fully Bayesian Prediction Approach
5.1.1 Prior selection
5.1.2 MCMC-based approach
5.1.3 Importance sampling approach
5.1.4 Discrete prior distribution
5.2 Sequential Bayesian Prediction
5.2.1 Scalable Bayesian prediction algorithm
5.2.2 Distributed implementation for a special case
5.2.3 Adaptive sampling
5.3 Simulation
5.3.1 MCMC-based approach on a 1-D scenario
5.3.2 Centralized scheme on 1-D scenario
5.3.3 Distributed scheme on 2-D scenario

6 Gaussian Process with Built-in GMRF
6.1 Spatial Prediction
6.1.1 Spatial model based on GMRF
6.1.2 Gaussian process regression
6.1.3 Sequential prediction algorithm
6.2 Distributed Spatial Prediction
6.2.1 Distributed computation
6.2.2 Distributed prediction algorithm
6.3 Simulation and Experiment
6.3.1 Simulation
6.3.2 Centralized scheme
6.3.3 Distributed scheme
6.3.4 Experiment

7 Bayesian Spatial Prediction Using GMRF
7.1 Problem Setup
7.1.1 Spatial field model
7.1.2 Mobile sensor network
7.2 Bayesian Predictive Inference
7.3 Sequential Bayesian Inference
7.3.1 Update full conditional distribution
7.3.2 Update likelihood
7.3.3 Update predictive distribution
7.4 Adaptive Sampling
7.5 Simulation

8 Conclusion and Future Work
8.1 Conclusion
8.2 Future Work

A Mathematical Background
A.1 Gaussian Identities
A.2 Matrix Inversion Lemma
A.2.1 Woodbury identity
A.2.2 Sherman-Morrison formula
A.3 Generating Gaussian processes
A.3.1 Cholesky decomposition
A.3.2 Circulant embedding

Bibliography

LIST OF TABLES

3.1 Centralized optimal sampling strategy at time t.
3.2 Parameters used in simulation.
4.1 Prediction means and variances using y, ym, and yr.
4.2 Centralized sampling strategy at time t.
4.3 Distributed sampling strategy at time t.
5.1 Gibbs sampler.
5.2 Centralized Bayesian prediction algorithm.
6.1 Sequential algorithm for field prediction.
6.2 Distributed algorithm for sequential field prediction.
7.1 Sequential Bayesian predictive inference.
A.1 Generating multivariate Gaussian samples by Cholesky decomposition.
A.2 Generating multivariate Gaussian samples by circulant embedding.

LIST OF FIGURES

2.1 Realization of a Gaussian process. For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation.
2.2 Realization of a Gaussian process at (a) t = 1, (b) t = 5, and (c) t = 10.
3.1 Snapshots of the realized Gaussian process at (a) t = 1, (b) t = 10, and (c) t = 20.
3.2 Monte Carlo simulation results (100 runs) for a spatio-temporal Gaussian process using (a) the random sampling strategy, and (b) the adaptive sampling strategy. The estimated hyperparameters are shown in blue circles with error-bars. The true hyperparameters used for generating the process are shown in red dashed lines.
3.3 Predicted fields along with agents' trajectories at (a) t = 1 and (b) t = 20.
3.4 (a) Weighting factor λ(t) and (b) the estimated λ(t).
3.5 Snapshots of the advection-diffusion process at (a) t = 1 and (b) t = 10.
3.6 Simulation results (100 runs) for an advection-diffusion process. The estimated hyperparameters with (a) random sampling and (b) optimal sampling.
4.1 Robot predicts a scalar value at x∗ (denoted by a red star) based on n cumulative spatio-temporal observations (denoted by blue crosses). Near-optimal prediction can be obtained using truncated observations, e.g., the last m observations. In this case, x = (sx, sy, t)^T.
4.2 Example of the selection of truncated observations. The parameters used in the example are: σ_f² = 1, σ = 0.2, σ_w = 0.1.
4.3 Example of selecting a temporal truncation size η. The parameters used in the example are: σ_f² = 1, σ_x = σ_y = 0.2, σ_t = 5, γ = 100.
4.4 Function Φ(d) in (4.22) with γ = 100, R = 0.4, and d0 = 0.1 is shown in a red dotted line. The function Φ(d) = γ is shown in a blue solid line.
4.5 Prediction error variances at t = 5 achieved by (a) using the gradient-based algorithm, and (b) using the exhaustive search algorithm. The trajectories of the agent are shown in solid lines.
4.6 Average of prediction error variances over target points (in blue circles) achieved by the centralized sampling scheme using all collective observations for (a) case 1, (b) case 2, and (c) case 3. In (a), the target points are fixed at time t = 10, and the counterpart achieved by the benchmark random sampling strategy is shown in red squares with error-bars. In (b) and (c), the target points are at t + 1 and change over time. The counterpart achieved by using truncated observations is shown in red squares.
4.7 Simulation results at t = 1 and t = 5 obtained by the centralized sampling scheme for case 2.
4.7 Simulation results at t = 1 and t = 5 obtained by the centralized sampling scheme for case 2 (cont'd).
4.8 Simulation results obtained by the centralized sampling scheme for case 3. The trajectories of agents are shown in solid lines.
4.9 Cost function Jd(q̃) from t = 1 to t = 2 with a communication range R = 0.4.
4.10 Average of prediction error variances over all target points and agents achieved by the distributed sampling scheme with a communication range (a) R = 0.3, and (b) R = 0.4. The average of prediction error variances over all target points and agents is shown in blue circles. The average of prediction error variances over local target points and agents is shown in red squares. The error-bars indicate the standard deviation among agents.
4.11 Simulation results obtained by the distributed sampling scheme with different communication ranges. The edges of the graph are shown in solid lines.
4.11 Simulation results obtained by the distributed sampling scheme with different communication ranges (cont'd). The edges of the graph are shown in solid lines.
5.1 Example with three agents sampling the spatio-temporal Gaussian process in 1-D space and performing Bayesian inference. In this example, σ_t = 2.5, η = 2, ∆ = 3, t = 15, c_t = 2, ȳ = (y_{1:3}^T, y_{6:8}^T)^T and ỹ = y_{13:15}.
5.2 Example with three groups of agents sampling the spatio-temporal Gaussian process in 2-D space and performing Bayesian prediction. The symbol 'o' denotes the position of a leader for a group and the symbol 'x' denotes the position of an agent. The distance between any two sub-regions is enforced to be greater than σ_s, which enables the distributed Bayesian prediction.
5.3 Posterior distribution of β, σ_f², σ_s, and σ_t at (a) t = 1, and (b) t = 20.
5.4 Prediction at (a) t = 1, and (b) t = 20 using the MCMC-based approach. The true fields are plotted in blue solid lines. The predicted fields are plotted in red dash-dotted lines. The area between red dotted lines indicates the 95% confidence interval.
5.5 Prediction at t = 1 using (a) the maximum likelihood based approach, and (b) the proposed fully Bayesian approach. The true fields are plotted in blue solid lines. The predicted fields are plotted in red dash-dotted lines. The area between red dotted lines indicates the 95% confidence interval.
5.6 (a) Prior distribution of θ, (b) posterior distribution of θ at time t = 100, (c) posterior distribution of θ at time t = 300.
5.7 Prediction at (a) t = 100, and (b) t = 300 using the centralized sequential Bayesian approach. The true fields are plotted in blue solid lines. The predicted fields are plotted in red dash-dotted lines. The area between red dotted lines indicates the 95% confidence interval.
5.8 Posterior distribution of θ at time t = 100 using the distributed algorithm.
5.9 Comparison of (a) the true field at t = 100 and (b) the predicted field at t = 100 using the distributed algorithm.
6.1 (a) Generating points in blue dots and the associated Delaunay graph with edges in red dotted lines. The Voronoi partition is also shown in blue solid lines. (b) Gaussian random field with a built-in GMRF with respect to the Delaunay graph in (a).
6.2 Example of computing (Λ^T Λ)ij = λ(s, pi)λ(s, pj).
6.3 Simulation results for the centralized scheme. (a) The true field, (b) the predicted field at time t = 1, (c) the predicted field at time t = 5, (d) the predicted field at time t = 20. The generating points are shown in black circles, and the sampling locations are shown in black crosses.
6.4 (a) Graph G = (V, E). (b) Sparsity structure of the precision matrix Q.
6.5 Simulation results for the distributed scheme. (a) The true field, (b) the predicted field at time t = 5. The generating points are shown in circles, and the sampling locations are shown in crosses.
6.6 (a) True field on grid positions obtained by the Kinect sensor and randomly sampled positions indicated in black crosses. (b) The fitted Gaussian random field with a built-in GMRF with respect to the Delaunay graph.
7.1 Elements of the precision matrix Q related to a single location.
7.2 Numerically generated spatial fields defined in (7.1) with µ(si) = β = 20, and Qη|θ constructed using (7.2) with hyperparameters being (a) θ = (4, 0.0025)^T, (b) θ = (1, 0.01)^T, and (c) θ = (0.25, 0.04)^T.
7.3 Posterior distributions of θ, i.e., π(θ|y1:t), at (a) t = 1, (b) t = 5, (c) t = 10, and (d) t = 20.
7.4 Predicted fields at (a) t = 1, (c) t = 5, (e) t = 10, and (g) t = 20. Prediction error variances at (b) t = 1, (d) t = 5, (f) t = 10, and (h) t = 20.
7.4 Predicted fields at (a) t = 1, (c) t = 5, (e) t = 10, and (g) t = 20. Prediction error variances at (b) t = 1, (d) t = 5, (f) t = 10, and (h) t = 20 (cont'd).
7.5 (a) Estimated β, and (b) root mean square error.
Chapter 1

Introduction

In recent years, due to drastic global climate changes, it has become necessary to monitor the changing ecosystems over vast regions in lands, oceans, and lakes. For instance, under certain environmental conditions, rapidly reproducing harmful algal blooms in lakes can cause the death of nearby fish and produce harmful conditions for aquatic life as well as human beings (see http://www.glerl.noaa.gov/res/Centers/HABS/habs.html for more details). Besides natural disasters, there exist growing ubiquitous possibilities of the release of toxic chemicals and contaminants in the air, lakes, and public water systems. Hence, there are strong motivations to monitor and predict environmental fields undergoing often complex transport phenomena (common examples are diffusion, convection, and advection). In this dissertation, we consider the problem of using mobile sensor networks to estimate and predict environmental fields that are modeled by spatio-temporal Gaussian processes.

1.1 Background

Due to the recent advances of micro-electro-mechanical systems technology, wireless communications, and digital electronics, the concept of sensor networks has been made viable [2]. A sensor network consists of a collection of low-cost, low-power, multifunctional sensing devices that are small in size and communicate over short distances. Endowing the nodes in a sensor network with mobility increases the network's capabilities drastically [7]. Sensor networks that consist of mobile sensing agents are more flexible than those with only static nodes. For example, mobility allows the network to handle a large number of data sources with a much smaller number of moving sensors that visit the sources over time. In a mobile sensor network, the resource-limited sensing agents are required to collaborate in order to meet a specific objective, and so cooperative control becomes essential. The most popular applications are in networks of autonomous ground vehicles, underwater vehicles, or aerial vehicles. Emerging technologies have been reported on the coordination of mobile sensing agents [28,37,46,52,68,69].

The mobility of mobile agents can be designed in order to perform optimal sampling of the field of interest. Optimal sampling design is the process of choosing where to take samples in order to maximize the information gained. Recently in [35], Leonard et al. developed mobile sensor networks that optimize ocean sampling performance defined in terms of uncertainty in a model estimate of a sampled field. However, this approach optimized the collective patterns of mobile agents parameterized by a restricted number of parameters rather than optimizing individual trajectories. In [10], distributed learning and cooperative control were developed for multi-agent systems to discover peaks of an unknown field based on the recursive estimation of that field. A typical sensor placement technique [16] that puts sensors at the locations where the entropy is high tends to place sensors along the borders of the area of interest [32]. In [32], Krause et al. showed that seeking sensor placements that are most informative about unsensed locations is NP-hard, and they presented a polynomial time approximation algorithm by exploiting the submodularity of mutual information [14]. In a similar approach, in [58], Singh et al. presented an efficient planning of informative paths for multiple robots that maximizes the mutual information.
To find the locations that predict the phenomenon best, one needs a model of the spatio-temporal phenomenon. To this end, we use Gaussian processes (Gaussian random fields) to model fields undergoing transport phenomena. Nonparametric Gaussian process regression (or Kriging in geostatistics) has been widely used as a nonlinear regression technique to estimate and predict geostatistical data [15,23,38,51]. A Gaussian process is a natural generalization of the Gaussian probability distribution. It generalizes a Gaussian distribution with a finite number of random variables to a Gaussian process with an infinite number of random variables over the surveillance region [51]. Gaussian process modeling enables us to predict physical values, such as temperature, salinity, pH, or biomass of harmful algal blooms, at any point with a predicted uncertainty level efficiently. For instance, near-optimal static sensor placements with a mutual information criterion in Gaussian processes were proposed in [31,32]. A distributed Kriged Kalman filter for spatial estimation based on mobile sensor networks was developed in [13]. Multi-agent systems that are versatile for various tasks by exploiting predictive posterior statistics of Gaussian processes were developed in [9] and [8]. However, Gaussian process regression, based on the standard mean and covariance functions, requires an inversion of a covariance matrix whose size grows as the number of observations increases.

The advantage of a fully Bayesian approach is that the uncertainty in the model parameters is incorporated in the prediction [5]. In [22], Gaudard et al. presented a Bayesian method that uses importance sampling for analyzing spatial data sampled from a Gaussian random field whose covariance function was unknown. However, the solution often requires Markov chain Monte Carlo (MCMC) methods, which greatly increase the computational complexity. In [25], an iterative prediction algorithm without resorting to MCMC methods has been developed based on analytical closed-form solutions from results in [22], by assuming that the covariance function of the spatio-temporal Gaussian random field is known up to a constant.

Recently, there have been efforts to find a way to fit a computationally efficient Gaussian Markov random field (GMRF) on a discrete lattice to a Gaussian random field on a continuum space [17,27,56]. Such methods have been developed using a fitting with a weighted L2-type distance [56], using a conditional-mean least-squares fitting [17], and for dealing with large data by fast Kriging [27]. It has been demonstrated that GMRFs with small neighborhoods can approximate Gaussian fields surprisingly well [56]. The approximated GMRF and its regression are very attractive for resource-constrained mobile sensor networks due to their computational efficiency and scalability [34], as compared to the standard Gaussian process and its regression, which is not scalable as the number of observations increases.

1.2 Contribution

Here, we summarize the specific contributions of this dissertation in the order of chapters.

In Chapter 3, we develop covariance function learning algorithms for the sensing agents to perform nonparametric prediction based on a properly adapted Gaussian process for a given spatio-temporal phenomenon. By introducing a generalized covariance function, we expand the class of Gaussian processes to include anisotropic spatio-temporal phenomena.
Maximum likelihood (ML) optimization is used to estimate hyperparameters for the associated covariance function. The proposed optimal navigation strategy for autonomous vehicles maximizes the Fisher information [30], improving the quality of the estimated covariance function.

In Chapter 4, we first present a theoretical foundation of Gaussian process regression with truncated observations. In particular, we show that the quality of prediction based on truncated observations does not deteriorate much as compared to that of prediction based on all cumulative data under certain conditions. The error bounds for using truncated observations are analyzed for prediction at a single point of interest. A way to select the temporal truncation size for spatio-temporal Gaussian processes is also introduced. Inspired by the analysis, we then propose both centralized and distributed navigation strategies for mobile sensor networks to move in order to reduce prediction error variances at points of interest. In particular, we demonstrate that the distributed navigation strategy produces an emergent, swarming-like, collective behavior to maintain communication connectivity among mobile sensing agents.

In Chapter 5, we formulate a fully Bayesian approach for spatio-temporal Gaussian process regression under practical conditions such as measurement noise and unknown hyperparameters (particularly, the bandwidths). Thus, multifactorial effects of observations, measurement noise, and prior distributions of hyperparameters are all correctly incorporated in the computed posterior predictive distribution. Using discrete prior probabilities and compactly supported kernels, we provide a way to design sequential Bayesian prediction algorithms that can be computed (without using the Gibbs sampler) in constant time as the number of observations increases. An adaptive sampling strategy for mobile sensors, using the maximum a posteriori (MAP) estimation, is proposed to minimize the prediction error variances.

In Chapter 6, we propose a new class of Gaussian processes for resource-constrained mobile sensor networks that builds on a Gaussian Markov random field (GMRF) with respect to a proximity graph over the surveillance region. The main advantages of using this class of Gaussian processes over standard Gaussian processes defined by mean and covariance functions are its numerical efficiency and scalability due to its built-in GMRF and its capability of representing a wide range of non-stationary physical processes. The formulas for predictive statistics are derived and a sequential field prediction algorithm is provided for sequentially sampled observations. For a special case using compactly supported weighting functions, we propose a distributed algorithm to implement field prediction by correctly fusing all observations.

In Chapter 7, we then consider a discretized spatial field that is modeled by a GMRF with unknown hyperparameters. From a Bayesian perspective, we design a sequential prediction algorithm to exactly compute the predictive inference of the random field. The main advantages of the proposed algorithm are: (1) the computational efficiency due to the sparse structure of the precision matrix, and (2) the scalability as the number of measurements increases. Thus, the prediction algorithm correctly takes into account the uncertainty in the hyperparameters in a Bayesian way and is also scalable enough to be usable for mobile sensor networks with limited resources.
An adaptive sampling strategy is also designed for mobile sensing agents to find the most informative locations at which to take future measurements in order to minimize the prediction error and the uncertainty in the estimated hyperparameters simultaneously.

1.3 Organization

This dissertation is organized as follows. In Chapter 2, we first introduce the basic mathematical notation that will be used throughout the thesis. Then, we describe general Gaussian processes and their use in nonparametric regression problems. The notation for mobile sensor networks is also introduced in Chapter 2. In Chapter 3, we deal with the case where the hyperparameters in the covariance function are deterministic but unknown. We design an optimal sampling strategy to improve the maximum likelihood estimation of these hyperparameters. In Chapter 4, we assume the hyperparameters in the covariance function are given, which can be obtained using the approach proposed in Chapter 3. We then analyze the error bounds of the prediction error using Gaussian process regression with truncated observations. Inspired by the analysis, we propose both centralized and distributed navigation strategies for mobile sensor networks to move in order to reduce prediction error variances at points of interest. In Chapter 5, we consider a fully Bayesian approach for Gaussian process regression in which the hyperparameters are treated as random variables. Using discrete prior probabilities and compactly supported kernels, we provide a way to design sequential Bayesian prediction algorithms that can be computed in constant time as the number of observations increases. To cope with the computational complexity brought by using standard Gaussian processes with covariance functions, in Chapter 6, we exploit the sparsity of the precision matrix by using Gaussian Markov random fields (GMRF). We first introduce a new class of Gaussian processes with a built-in GMRF and show its capability of representing a wide range of non-stationary physical processes. We then derive the formulas for predictive statistics and design sequential prediction algorithms with fixed complexity. In Chapter 7, we consider a discretized spatial field that is modeled by a GMRF with unknown hyperparameters. From a Bayesian perspective, we design a sequential prediction algorithm to exactly compute the predictive inference of the random field. An adaptive sampling strategy is also designed for mobile sensing agents to find the most informative locations at which to take future measurements in order to minimize the prediction error and the uncertainty in the estimated hyperparameters simultaneously.

1.4 Publication

In this section, I list journal articles and conference proceedings that have been published (or will be published) related to the topic of this dissertation. Some of the work will be described in the following chapters.

1.4.1 Journal Articles

(J1) Yunfei Xu, Jongeun Choi, Sarat Dass, and Taps Maiti, "Bayesian prediction and adaptive sampling algorithms for mobile sensor networks," IEEE Transactions on Automatic Control, (to appear, 2012).

(J2) Yunfei Xu, Jongeun Choi, "Spatial prediction with mobile sensor networks using Gaussian Markov random fields," Automatica, (in review, 2011).

(J3) Yunfei Xu, Jongeun Choi, and Songhwai Oh, "Mobile sensor network navigation using Gaussian processes with truncated observations," IEEE Transactions on Robotics, vol. 27, no. 6, pp. 1118-1131, 2011.
(J4) Yunfei Xu, Jongeun Choi, "Stochastic adaptive sampling for mobile sensor networks using kernel regression," International Journal of Control, Automation and Systems, (conditionally accepted, 2011).

(J5) Yunfei Xu, Jongeun Choi, "Adaptive sampling for learning Gaussian processes using mobile sensor networks," Sensors, vol. 11, no. 3, pp. 3051-3066, 2011.

(J6) Mahdi Jadaliha, Yunfei Xu, and Jongeun Choi, "Gaussian process regression for sensor networks under localization uncertainty," IEEE Transactions on Signal Processing, (in review, 2011).

(J7) Jongeun Choi, Yunfei Xu, Justin Mrkva, Joonho Lee, and Songhwai Oh, "Navigation strategies for swarm intelligence using spatio-temporal Gaussian processes," Robotics and Autonomous Systems, (in review, 2010).

1.4.2 Conference Proceedings

(C1) Yunfei Xu, Jongeun Choi, Sarat Dass, and Taps Maiti, "Efficient Bayesian spatial prediction with mobile sensor networks using Gaussian Markov random fields," in Proceedings of the 2012 American Control Conference (ACC), June 27-29, Montréal, Canada. (in review).

(C2) Mahdi Jadaliha, Yunfei Xu, and Jongeun Choi, "Gaussian process regression using Laplace approximations under localization uncertainty," in Proceedings of the 2012 American Control Conference (ACC), June 27-29, Montréal, Canada. (in review).

(C3) Yunfei Xu, Jongeun Choi, "Spatial prediction with mobile sensor networks using Gaussian Markov random fields," in Proceedings of the 2011 ASME Dynamic Systems and Control Conference (DSCC), October 31-November 2, 2011, Arlington, VA, USA.

(C4) Yunfei Xu, Jongeun Choi, Sarat Dass, and Taps Maiti, "Bayesian prediction and adaptive sampling algorithms for mobile sensor networks," in Proceedings of the 2011 American Control Conference (ACC), June 29-July 1, 2011, San Francisco, California, USA.

(C5) Songhwai Oh, Yunfei Xu, and Jongeun Choi, "Explorative navigation of mobile sensor networks using sparse Gaussian processes," in Proceedings of the 49th IEEE Conference on Decision and Control (CDC), December 15-17, 2010, Atlanta, Georgia, USA.

(C6) Yunfei Xu, Jongeun Choi, "Stochastic adaptive sampling for mobile sensor networks using kernel regression," in Proceedings of the 2010 American Control Conference (ACC), June 30-July 2, 2010, Baltimore, Maryland, USA.

(C7) Yunfei Xu, Jongeun Choi, "Optimal coordination of mobile sensor networks using Gaussian processes," in Proceedings of the 2009 ASME Dynamic Systems and Control Conference (DSCC), October 12-14, 2009, Hollywood, California, USA.

(C8) Yunfei Xu, Jongeun Choi, "Mobile sensor networks for learning anisotropic Gaussian processes," in Proceedings of the 2009 American Control Conference (ACC), June 10-12, 2009, St. Louis, Missouri, USA.

Chapter 2

Preliminaries

2.1 Mathematical Notation

Standard notation is used throughout this dissertation. Let R, R≥0, R>0, Z, Z≥0, Z>0 denote the sets of real numbers, non-negative real numbers, positive real numbers, integers, non-negative integers, and positive integers, respectively. Let E, Var, Corr, Cov denote the expectation operator, the variance operator, the correlation operator, and the covariance operator, respectively. Let A^T ∈ R^{m×n} be the transpose of a matrix A ∈ R^{n×m}. Let tr(A) and det(A) denote the trace and the determinant of a matrix A ∈ R^{n×n}, respectively. Let row_i(A) ∈ R^m and col_j(A) ∈ R^n denote the i-th row and the j-th column of a matrix A ∈ R^{n×m}, respectively. The positive definiteness and the positive semi-definiteness of a square matrix A are denoted by A ≻ 0 and A ⪰ 0, respectively.
Let |x| denote the absolute value of a scalar x. Let ‖x‖ denote the standard Euclidean norm (2-norm) of a vector x. The induced 2-norm of a matrix A is denoted by ‖A‖. Let ‖x‖_∞ denote the infinity norm of a vector x. Let 1 denote the vector with all elements equal to one and I denote the identity matrix of appropriate size. Let e_i be the standard basis vector of appropriate size with 1 as the i-th element and 0 as all other elements. The symbol ⊗ denotes the Kronecker product. The symbol ◦ denotes the Hadamard product (also known as the entry-wise product and the Schur product).

A random variable x, which is distributed by a normal distribution with mean µ and covariance matrix Σ, is denoted by x ∼ N(µ, Σ). The corresponding probability density function is denoted by N(x; µ, Σ). The relative complement of a set A in a set B is denoted by B \ A := B ∩ A^c, where A^c is the complement of A. For a set A ⊂ I, we define z_A = {z_i | i ∈ A}. Let −A denote the set I \ A. An undirected graph G = (V, E) is a tuple consisting of a set of vertices V := {1, · · · , n} and a set of edges E ⊂ V × V. The neighbors of i ∈ V in G are denoted by N_i := {j ∈ V | {i, j} ∈ E}. Other notation will be explained in due course.

2.2 Physical Process Model

In this section, we review important notions of the Gaussian process, which will be used to model the physical phenomenon. In particular, we introduce a class of spatio-temporal Gaussian process models with anisotropic covariance functions. The properties of Gaussian Markov random fields (GMRF) are also briefly reviewed.

2.2.1 Gaussian process

A Gaussian process can be thought of as a generalization of a Gaussian distribution over a finite vector space to a function space of infinite dimension. It is formally defined as follows [50,51]:

Definition 2.2.1. A Gaussian process (GP) is a collection of random variables, any finite number of which have a consistent joint Gaussian distribution. (Consistency is also known as the marginalization property; it means simply that the random variables obey the usual rules of marginalization.)

A Gaussian process

z(x) ∼ GP(µ(x), C(x, x′; θ))        (2.1)

is completely specified by its mean function µ(x) and covariance function C(x, x′; θ), which are defined as

µ(x) = E[z(x)],
C(x, x′; θ) = E[(z(x) − µ(x))(z(x′) − µ(x′)) | θ].

Although this is not required, we take the mean function to be zero for notational simplicity, i.e., µ(x) = 0 (this is not a drastic limitation, since the mean of the posterior process is not confined to zero [51]). If the covariance function C(x, x′; θ) is invariant to translations in the input space, i.e., C(x, x′; θ) = C(x − x′; θ), we call it stationary. Furthermore, if the covariance function is a function of only the distance between the inputs, i.e., C(x, x′; θ) = C(‖x − x′‖; θ), then it is called isotropic. In practice, a parametric family of functions is used instead of fixing the covariance function [5]. One common choice of a stationary covariance function is

C(x, x′; θ) = σ_f² exp( − Σ_{ℓ=1}^{D} (x_ℓ − x′_ℓ)² / (2σ_ℓ²) ),        (2.2)

where x_ℓ is the ℓ-th element of x ∈ R^D. From (2.2), it can easily be seen that the correlation between two inputs decreases as the distance between them increases. This decreasing rate depends on the choice of the length scales {σ_ℓ}. A very large length scale means that the predictions would have little bearing on the corresponding input, which is then said to be insignificant. σ_f² gives the overall vertical scale relative to the mean of the Gaussian process in the output space.
These parameters play the role of hyperparameters, since they correspond to the hyperparameters in neural networks and in the standard parametric model. Therefore, we define θ = (σ_f², σ_1, · · · , σ_D)^T ∈ R^{D+1} as the hyperparameter vector. A realization of a Gaussian process that is numerically generated is shown in Fig. 2.1.

Figure 2.1: Realization of a Gaussian process. For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation.

2.2.2 Spatio-temporal Gaussian process

In the first part of this dissertation, spatio-temporal Gaussian processes are of particular interest. Consider a spatio-temporal Gaussian process

z(s, t) ∼ GP(µ(s, t), C(s, t, s′, t′; θ)),

which is a special case of the Gaussian process defined in (2.1), where x = (s^T, t)^T ∈ R^d × R≥0. We consider the following generalized anisotropic covariance function C(x, x′; θ) with a hyperparameter vector θ := (σ_f², σ_1, · · · , σ_d, σ_t)^T ∈ R^{d+2}:

C(x, x′; θ) = σ_f² exp( − Σ_{ℓ=1}^{d} (s_ℓ − s′_ℓ)² / (2σ_ℓ²) ) exp( − (t − t′)² / (2σ_t²) ),        (2.3)

where s, s′ ∈ Q ⊂ R^d and t, t′ ∈ R≥0. {σ_1, · · · , σ_d} and σ_t are kernel bandwidths for space and time, respectively. Equation (2.3) shows that points close in space and in time indices are strongly correlated and produce similar values. In reality, the larger the temporal distance between two measurements, the less correlated they become, which strongly supports our generalized covariance function in (2.3). This may also justify the truncation (or windowing) of the observed time series data to limit the size of the covariance matrix and thus reduce the computational cost. A spatially isotropic version of the covariance function in (2.3) has been used in [35]. A realization of a spatio-temporal Gaussian process that is numerically generated is shown in Fig. 2.2.

Figure 2.2: Realization of a Gaussian process at (a) t = 1, (b) t = 5, and (c) t = 10.
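For illustration, the following minimal Python sketch evaluates the anisotropic spatio-temporal covariance (2.3) between two space-time inputs. The function names, the toy inputs, and the bandwidth values are illustrative assumptions and are not taken from the dissertation.

```python
import numpy as np

def st_cov(x, xp, sigma_f, sigma_s, sigma_t):
    """Anisotropic spatio-temporal covariance (2.3).

    x, xp    : arrays of shape (d+1,) holding (s_1, ..., s_d, t).
    sigma_s  : length-d array of spatial bandwidths; sigma_t : temporal bandwidth.
    """
    s, t = np.asarray(x[:-1], dtype=float), float(x[-1])
    sp, tp = np.asarray(xp[:-1], dtype=float), float(xp[-1])
    spatial = np.exp(-np.sum((s - sp) ** 2 / (2.0 * np.asarray(sigma_s, dtype=float) ** 2)))
    temporal = np.exp(-((t - tp) ** 2) / (2.0 * sigma_t ** 2))
    return sigma_f ** 2 * spatial * temporal

# Two observations close in space but far apart in time are weakly correlated.
x1 = np.array([0.10, 0.20, 1.0])   # (s_x, s_y, t)
x2 = np.array([0.15, 0.20, 9.0])
print(st_cov(x1, x2, sigma_f=1.0, sigma_s=[0.2, 0.2], sigma_t=5.0))
```

As expected from (2.3), increasing the temporal separation between the two inputs drives the covariance toward zero even when the spatial locations coincide, which is what motivates the windowing of old observations mentioned above.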
If a graph G has small cardinalities of the neighbor sets, its precision matrix Q becomes sparse with many zeros in its entries. This plays a key role in computation efficiency of a GMRF which can be greatly exploited by the resource-constrained mobile sensor network. For instance, some of the statistical inference can be obtained directly from the precision matrix Q with conditional interpretations. Theorem 2.2.4. ( [54, Theorem 2.3]) Let z be a GMRF with respect to G = (V, E) with mean µ and precision matrix Q 0, then we have E(zi | z−i ) = µi − 1 (Q)ii j∈Ni (Q)ij (zj − µj ), 1 , (Q)ii (Q)ij Corr(zi , zj | z−ij ) = − , (Q)ii (Q)jj Var(zi | z−i ) = 17 ∀i = j. 2.3 Mobile Sensor Network In this section, we explain the sensor network formed by multiple mobile sensing agents and present the measurement model used throughout the thesis. Let N be the number of sensing agents distributed over the surveillance region Q ∈ Rd . The identity of each agent is indexed by I := {1, 2, · · · , N }. Assume that all agents are equipped with identical sensors and take noisy observations at time t ∈ Z>0 . At time t, the sensing agent i takes a noise-corrupted measurement yi (t) at its current location qi (t) ∈ Q, i.e., yi (t) = z(qi (t), t) + i , i i.i.d. 2 ∼ N (0, σw ), where the sensor noise i is considered to be an independent and identically distributed 2 Gaussian random variable. σw > 0 is the noise level and we define the signal-to-noise ratio as 2 σf γ= 2. σw Notice that when a static field is considered, we have z(s, t) = z(s). For notational simplicity, we denote the collection of positions of all N agents at time t as q(t), i.e., q(t) := q1 (t)T , · · · , qN (t)T T ∈ QN . The collective measurements from all N mobile sensors at time t is denoted by yt := (y1 (t), · · · , yN (t))T ∈ RN . 18 The cumulative measurements from time t ∈ Z>0 to time t ∈ Z>0 is denoted by T T yt:t := yt , · · · , yt T ∈ RN (t −t+1) . The communication network of mobile agents can be represented by an undirected graph. Let G(t) := (I, E(t)) be an undirected communication graph such that an edge (i, j) ∈ E(t) if and only if agent i can communicate with agent j = i at time t. We define the neighborhood of agent i at time t by Ni (t) := {j ∈ I | (i, j) ∈ E(t)}. Similarly, let q [i] (t) denote the vector [i] form of the collection of positions in qj (t) | j ∈ {i} ∪ Ni (t) . Let yt denote vector form of the collection of observations in y(qj (t), t) | j ∈ {i} ∪ Ni (t) . The cumulative measurements [i] t:t of agent i from time t to time t as y 2.4 . Gaussian Processes for Regression Suppose we have a data set D = (x(i) , y (i) ) | i = 1, · · · , n collected by mobile sensing agents where x(i) denotes an input vector of dimension D and y (i) denotes a scalar value of the noise corrupted output. The objective of probabilistic regression is to compute the predictive distribution of the function values z∗ := z(x∗ ) at some test input x∗ . For notational simplicity, we define the design matrix X of dimension n × D as the aggregation of n input vectors (i.e., rowi (X) := (x(i) )T ), and the outputs are collected in a vector y := (y (1) , · · · , y (n) )T . The corresponding vector of noise-free outputs is defined as z := (z(x(1) ), · · · , z(x(n) ))T . The advantage of the Gaussian process formulation is that the combination of the prior and noise models can be carried out exactly via matrix operations [62]. 
The idea of Gaus19 sian process regression is to place a GP prior directly on the space of functions without parameterizing the function z(·), i.e., π(z|θ) = N (µ, K), where µ ∈ Rn is the mean vector obtained by (µ)i = µ(x(i) ), and K := Cov(z, z|θ) ∈ Rn×n is the covariance matrix obtained by (K)ij = C(x(i) , x(j) ; θ). Notice that the GP model and all expressions are always conditional on the corresponding inputs. In the following, we will always neglect the explicit conditioning on the input matrix X. The inference in the Gaussian process model is as follows. First, we assume a joint GP prior π(z, z∗ |θ) over functions, i.e.,     k  µ   K  π(z, z∗ |θ) = N  ,  , T C(x , x ; θ) µ(x∗ ) k ∗ ∗ (2.4) where k := Cov(z, z∗ |θ) ∈ Rn is the covariance between z and z∗ obtained by (k)i = C(x(i) , x∗ ; θ). Then, the joint posterior is obtained using Bayes rule, i.e., π(z, z∗ |θ, y) = π(y|z)π(z, z∗ |θ) , π(y|θ) where we have used π(y|z, z∗ ) = π(y|z). Finally, the desired predictive distribution π(z∗ |θ, y) is obtained by marginalizing out the latent variables in z, i.e., π(z∗ |θ, y) = π(z, z∗ |θ, y)dz 1 = π(y|θ) π(y|z)π(z, z∗ |θ, y)dz. 20 (2.5) Since we have the joint Gaussian prior given in (2.4) and 2 y|z ∼ N z, σw I , the integral in (2.5) can be evaluated in closed-form and the predictive distribution turns out to be Gaussian, i.e., 2 z∗ |θ, y ∼ N µz∗ |θ,y , σz∗ |θ,y , (2.6) 2 µz∗ |θ,y = µ(x∗ ) + k T (K + σw I)−1 (y − µ), (2.7) 2 2 σz∗ |θ,y = C(x∗ , x∗ ; θ) − k T (K + σw I)−1 k. (2.8) where and For notational simplicity, we define the covariance matrix of the noisy observations as C := 2 Cov(y, y|θ) = K + σw I. 21 Chapter 3 Learning the Covariance Function Even though, there have been efforts to utilize Gaussian processes to model and predict the spatio-temporal field of interest, most of recent papers assume that Gaussian processes are isotropic implying that the covariance function only depends on the distance between locations. Many studies also assume that the corresponding covariance functions are known a priori for simplicity. However, this is not the case in general as pointed out in literature [31, 32, 44], in which they treat the non-stationary process by fusing a collection of isotropic spatial Gaussian processes associated with a set of local regions. Hence, our objective in this Chapter is to develop theoretically-sound algorithms for mobile sensor networks to learn the anisotropic covariance function of a spatio-temporal Gaussian process. Mobile sensing agents can then predict the Gaussian process based on the estimated covariance function in a nonparametric manner. In Section 3.1, we introduce a covariance function learning algorithm for an anisotropic, spatio-temporal Gaussian process. The covariance function is assumed to be deterministic but unknown a priori and it is estimated by the maximum likelihood (ML) estimator. In 22 Section 3.2, an optimal sampling strategy is proposed to minimize the Cram´r-Rao lower e bound (CRLB) of the estimation error covariance matrix. In Section 3.3, simulation results illustrate the usefulness of our proposed approach and its adaptability for unknown and/or time-varying covariance functions. 
Chapter 3

Learning the Covariance Function

Even though there have been efforts to utilize Gaussian processes to model and predict the spatio-temporal field of interest, most recent papers assume that Gaussian processes are isotropic, implying that the covariance function depends only on the distance between locations. Many studies also assume that the corresponding covariance functions are known a priori for simplicity. However, this is not the case in general, as pointed out in the literature [31,32,44], in which they treat the non-stationary process by fusing a collection of isotropic spatial Gaussian processes associated with a set of local regions. Hence, our objective in this chapter is to develop theoretically sound algorithms for mobile sensor networks to learn the anisotropic covariance function of a spatio-temporal Gaussian process. Mobile sensing agents can then predict the Gaussian process based on the estimated covariance function in a nonparametric manner.

In Section 3.1, we introduce a covariance function learning algorithm for an anisotropic, spatio-temporal Gaussian process. The covariance function is assumed to be deterministic but unknown a priori, and it is estimated by the maximum likelihood (ML) estimator. In Section 3.2, an optimal sampling strategy is proposed to minimize the Cramér-Rao lower bound (CRLB) of the estimation error covariance matrix. In Section 3.3, simulation results illustrate the usefulness of our proposed approach and its adaptability for unknown and/or time-varying covariance functions.

3.1 Learning the Hyperparameters

Without loss of generality, we consider a zero-mean spatio-temporal Gaussian process

z(s, t) ∼ GP(0, C(s, t, s′, t′; θ)),

with the covariance function

C(s, t, s′, t′; θ) = σ_f² exp( − Σ_{ℓ∈{x,y}} (s_ℓ − s′_ℓ)² / (2σ_ℓ²) ) exp( − (t − t′)² / (2σ_t²) ),

where s, s′ ∈ Q ⊂ R² and t, t′ ∈ R≥0, for modeling the field undergoing a physical transport phenomenon. θ = (σ_f, σ_x, σ_y, σ_t)^T ∈ R^m is the hyperparameter vector, where m = 4. The assumption of a zero mean is not a strong limitation, since the mean of the posterior process is not confined to zero [51].

If the covariance function C(s, t, s′, t′; θ) of a Gaussian process is not known a priori, mobile agents need to estimate the parameters of the covariance function (i.e., the hyperparameter vector θ ∈ R^m) based on the observed samples. In the case where the measurement noise level σ_w is also unknown, it can be incorporated in the hyperparameter vector and be estimated. Thus, we have θ = (σ_f, σ_x, σ_y, σ_t, σ_w)^T ∈ R^m, where m = 5.

Existing techniques for learning the hyperparameters are based on the likelihood function. Given the observations y = (y^(1), · · · , y^(n))^T ∈ R^n collected by mobile sensing agents, the likelihood function is defined as L(θ|y) = π(y|θ). Notice that in this chapter, the hyperparameter vector θ is considered to be deterministic, and hence π(y|θ) should not be considered a conditional distribution.

At time t, a point estimate of the hyperparameter vector θ can be made by maximizing the log likelihood function. The maximum likelihood (ML) estimate θ̂ ∈ R^m of the hyperparameter vector is obtained by

θ̂ = arg max_{θ∈Θ} log L(θ|y),        (3.1)

where Θ is the set of all possible choices of θ. The log likelihood function is given by

log L(θ|y) = − (1/2) y^T C^{−1} y − (1/2) log det(C) − (n/2) log 2π,

where C := Cov(y, y|θ) ∈ R^{n×n} is the covariance matrix, and n is the total number of observations. Maximization of the log likelihood function can be done efficiently using gradient-based optimization techniques such as the conjugate gradient method [26,43]. The partial derivative of the log likelihood function with respect to a hyperparameter θ_i ∈ R, i.e., the i-th entry of the hyperparameter vector θ, is given by

∂ log L(θ|y)/∂θ_i = (1/2) y^T C^{−1} (∂C/∂θ_i) C^{−1} y − (1/2) tr( C^{−1} ∂C/∂θ_i ) = (1/2) tr( (αα^T − C^{−1}) ∂C/∂θ_i ),

where α = C^{−1} y ∈ R^n. In general, the log likelihood function is a non-convex function and hence it can have multiple maxima.

As an alternative, when certain prior knowledge is available on the hyperparameters, a prior distribution π(θ) can be imposed on the hyperparameter vector. Using Bayes' rule, the posterior distribution π(θ|y) is proportional to the likelihood π(y|θ) times the prior distribution π(θ), i.e., π(θ|y) ∝ π(y|θ)π(θ). Then the maximum a posteriori (MAP) estimate θ̂ ∈ R^m of the hyperparameter vector can be obtained similarly by

θ̂ = arg max_{θ∈Θ} ( log L(θ|y) + log π(θ) ).        (3.2)

Notice that when no prior information is available, the MAP estimate is equivalent to the ML estimate. Once the estimate of the hyperparameter vector θ is obtained with confidence, it can be used as the true one for the mobile sensor network to predict the field of interest using Gaussian process regression in (2.6).
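As a rough sketch of the ML estimation in (3.1), the code below minimizes the negative log likelihood over log-transformed hyperparameters so that all of them stay positive. For brevity it relies on a numerical gradient rather than the analytical derivative given above, and the toy data set, bounds, and initial point are assumptions made for illustration only.

```python
import numpy as np
from scipy.optimize import minimize

def st_cov(x, xp, theta):
    # Spatio-temporal covariance of Chapter 3 with theta = (sigma_f, sigma_x, sigma_y, sigma_t).
    sigma_f, sx, sy, st = theta
    d = np.asarray(x) - np.asarray(xp)
    return sigma_f**2 * np.exp(-d[0]**2/(2*sx**2) - d[1]**2/(2*sy**2) - d[2]**2/(2*st**2))

def neg_log_likelihood(log_theta, X, y):
    """Negative log likelihood; theta = (sigma_f, sigma_x, sigma_y, sigma_t, sigma_w)."""
    theta = np.exp(log_theta)
    sigma_w = theta[-1]
    n = len(X)
    K = np.array([[st_cov(xi, xj, theta[:-1]) for xj in X] for xi in X])
    C = K + sigma_w**2 * np.eye(n)
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return 0.5 * (y @ alpha + logdet + n * np.log(2 * np.pi))

# Toy data: inputs are (s_x, s_y, t) triples in a 20 x 20 region.
rng = np.random.default_rng(1)
X = rng.uniform(0, 20, size=(30, 3))
y = rng.standard_normal(30)
res = minimize(neg_log_likelihood, x0=np.log([1.0, 10.0, 10.0, 1.0, 0.1]),
               args=(X, y), method="L-BFGS-B", bounds=[(-5.0, 5.0)] * 5)
print(np.exp(res.x))   # ML estimate of (sigma_f, sigma_x, sigma_y, sigma_t, sigma_w)
```

The MAP estimate in (3.2) follows the same pattern with the log prior subtracted from the objective.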
3.2 Optimal Sampling Strategy

Agents should find new sampling positions to improve the quality of the estimated covariance function at the next iteration at time t + 1. For instance, to precisely estimate an anisotropic phenomenon, i.e., a process with different covariances along the x and y directions, sensing agents need to explore and sample measurements along different directions.

To this end, we consider a centralized scheme. Suppose that a central station (or a leader agent) has access to all measurements collected by agents. Assume that at time t + 1, agent i moves to a new sampling position q̃_i ∈ Q and makes an observation y_i(t+1) ∈ R. The collections of the new sampling positions and new observations from all agents are denoted by q̃ ∈ Q^N and ỹ ∈ R^N, respectively. The objective of the optimal sampling strategy is to find the best sampling positions q̃ such that the maximum likelihood (ML) estimate θ̂_{t+1} ∈ R^m at time t + 1 is as close to the true hyperparameter vector θ* ∈ R^m as possible.

Consider the Fisher information matrix (FIM) that measures the information produced by y_{1:t} ∈ R^{Nt} and ỹ ∈ R^N for estimating the true hyperparameter vector θ* ∈ R^m at time t + 1. The Cramér-Rao lower bound (CRLB) theorem states that the inverse of the Fisher information matrix (denoted by M ∈ R^{m×m}) is a lower bound on the estimation error covariance matrix [30,39]:

E[ (θ̂_{t+1} − θ*)(θ̂_{t+1} − θ*)^T ] ⪰ M^{−1},

where θ̂_{t+1} ∈ R^m represents the ML estimate of θ* at time t + 1. The Fisher information matrix (FIM) [30] is given by

(M)_{ij} = − E[ ∂² log L(θ|ỹ, y_{1:t}) / (∂θ_i ∂θ_j) ],

where L(θ|ỹ, y_{1:t}) is the likelihood function at time t + 1, and the expectation is taken with respect to π(y_{1:t}, ỹ|θ). Notice that the likelihood is now a function of θ and ỹ. The analytical form of the FIM is given by

(M)_{ij} = (1/2) tr( C̃^{−1} (∂C̃/∂θ_i) C̃^{−1} (∂C̃/∂θ_j) ),

where C̃ ∈ R^{N(t+1)×N(t+1)} is defined as

C̃ := Cov( (y_{1:t}^T, ỹ^T)^T, (y_{1:t}^T, ỹ^T)^T | θ* ).

Since the true value θ* is not available, we evaluate the FIM at the current best estimate θ̂_t.

We can expect that minimizing the Cramér-Rao lower bound results in a decrease of uncertainty in estimating θ [41]. The most common optimality criterion is D-optimality [21,49]. It corresponds to minimizing the volume of the ellipsoid which represents the maximum confidence region for the maximum likelihood estimate of the unknown hyperparameters [21]. Using the D-optimality criterion [21,49], the objective function J(·) is given by

J(q̃) := det(M^{−1}).

However, if one hyperparameter has a very large variance compared to the others, the ellipsoid will be skinny and thus minimizing the volume may be misleading [21]. As an alternative, A-optimality, which minimizes the sum of the variances, is often used. The objective function J(·) based on the A-optimality criterion is

J(q̃) := tr(M^{−1}).

Hence, a control law for the mobile sensor network can be formulated as follows:

q(t + 1) = arg min_{q̃ ∈ Q^N} J(q̃).        (3.3)

In (3.3), we only consider the constraint that robots should move within the region Q. However, mobility constraints, such as the maximum distance that a robot can move between two time indices, or the maximum speed with which a robot can travel, can be incorporated as additional constraints in the optimization problem [13].
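Under the same assumptions as the earlier sketches, the code below evaluates the Fisher information matrix and the A-optimality cost J(q̃) = tr(M⁻¹) for candidate next positions. The derivatives ∂C̃/∂θ_i are approximated by finite differences, the noise level is held fixed rather than estimated, and the small grid search over random candidates merely stands in for the constrained optimization in (3.3); none of these choices are prescribed by the dissertation.

```python
import numpy as np

def st_cov(x, xp, theta):
    sigma_f, sx, sy, st = theta
    d = np.asarray(x) - np.asarray(xp)
    return sigma_f**2 * np.exp(-d[0]**2/(2*sx**2) - d[1]**2/(2*sy**2) - d[2]**2/(2*st**2))

def cov_matrix(X, theta, sigma_w):
    K = np.array([[st_cov(a, b, theta) for b in X] for a in X])
    return K + sigma_w**2 * np.eye(len(X))

def fisher_information(q_new, t_next, X_past, theta, sigma_w, eps=1e-5):
    """(M)_ij = 0.5 tr(C~^{-1} dC~/dtheta_i C~^{-1} dC~/dtheta_j); the derivatives of
    C~ are taken numerically, and C~ covers past inputs stacked with candidates."""
    X_all = np.vstack([X_past, np.column_stack([q_new, np.full(len(q_new), t_next)])])
    Cinv = np.linalg.inv(cov_matrix(X_all, theta, sigma_w))
    dC = []
    for k in range(len(theta)):
        tp = np.array(theta, dtype=float); tp[k] += eps
        tm = np.array(theta, dtype=float); tm[k] -= eps
        dC.append((cov_matrix(X_all, tp, sigma_w) - cov_matrix(X_all, tm, sigma_w)) / (2*eps))
    m = len(theta)
    return np.array([[0.5*np.trace(Cinv @ dC[i] @ Cinv @ dC[j]) for j in range(m)] for i in range(m)])

def a_optimality(q_new, t_next, X_past, theta, sigma_w):
    # J(q~) = tr(M^{-1}); smaller values mean more informative sampling positions.
    return np.trace(np.linalg.inv(fisher_information(q_new, t_next, X_past, theta, sigma_w)))

# Pick the best of a few random candidate position sets for two agents.
rng = np.random.default_rng(2)
X_past = np.column_stack([rng.uniform(0, 20, (10, 2)), np.repeat(np.arange(1, 6), 2)])
theta = (5.0, 4.0, 2.0, 8.0)
candidates = [rng.uniform(0, 20, (2, 2)) for _ in range(5)]
best = min(candidates, key=lambda q: a_optimality(q, 6.0, X_past, theta, 0.5))
print(best)
```

Replacing the trace in the last helper with det(M⁻¹) gives the corresponding D-optimality cost.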
For i ∈ I, agent i performs:
1: make an observation at its current position q_i(t), i.e., y_i(t)
2: transmit the observation y_i(t) to the central station

The central station performs:
1: collect the observations from all N agents, i.e., y_t
2: obtain the cumulative measurements, i.e., y_{1:t}
3: compute the maximum likelihood estimate θ̂_t from
   θ̂_t = arg max_{θ∈Θ} log L(θ|y_{1:t}),
   starting from the initial point θ̂_{t−1}
4: compute the control that minimizes the cost function J(q̃) via
   q(t+1) = arg min_{q̃∈Q^N} J(q̃)
5: send the next sampling positions {q_i(t+1) | i ∈ I} to all N agents

For i ∈ I, agent i performs:
1: receive the next sampling position q_i(t+1) from the central station
2: move to q_i(t+1) before time t + 1

3.3 Simulation

In this section, we evaluate the proposed approach for a spatio-temporal Gaussian process (Section 3.3.1) and an advection-diffusion process (Section 3.3.3). For both cases, we compare the simulation results obtained with the proposed optimal sampling strategy to those obtained with a benchmark random sampling strategy. In the random sampling strategy, each agent is initially deployed at a random location in the surveillance region Q. At time t ∈ Z>0, the next sampling position of agent i is generated randomly under the same mobility constraint, viz., a random position within a square region of side length 2 centered at the current position q_i(t). For a fair comparison, the same values are used for all other conditions. In Section 3.3.2, our approach based on truncated observations is applied to a Gaussian process with a time-varying covariance function to demonstrate the adaptability of the proposed scheme.

3.3.1 Spatio-temporal Gaussian process

We apply our approach to a spatio-temporal Gaussian process that was numerically generated for the simulation [51]. The hyperparameters used in the simulation were chosen to be θ = (σ_f, σ_x, σ_y, σ_t, σ_w)^T = (5, 4, 2, 8, 0.5)^T. Snapshots of the realized Gaussian random field are shown in Fig. 3.1. In this case, N = 5 mobile sensing agents were initialized at random positions in a surveillance region Q = [0, 20] × [0, 20]. The initial values for the algorithm were θ_0 = (1, 10, 10, 1, 0.1)^T. The prior on the hyperparameter vector was selected as

\[
\pi(\theta) = \pi(\sigma_f)\,\pi(\sigma_x)\,\pi(\sigma_y)\,\pi(\sigma_t)\,\pi(\sigma_w),
\]

where π(σ_f) = π(σ_x) = π(σ_y) = π(σ_t) = Γ(5, 2) and π(σ_w) = Γ(5, 0.2). Here Γ(a, b) denotes a Gamma distribution with mean ab and variance ab², whose support contains only positive values. The gradient method was used to find the MAP estimate of the hyperparameter vector.

Figure 3.1: Snapshots of the realized Gaussian process at (a) t = 1, (b) t = 10, and (c) t = 20.

For simplicity, we assumed that the global basis is the same as the model basis. We considered a situation in which, at each time step, the measurements of the agents are transmitted to a leader (or a central station) that runs our Gaussian process learning algorithm and sends the optimal control back to the individual agents for the next iteration, in order to improve the quality of the estimated covariance function. The maximum distance an agent can move in one time step was chosen to be 1 in both the x and y directions. The A-optimality criterion was used for optimal sampling. For both the proposed and the random strategies, 100 Monte Carlo simulations were run, and the statistical results are shown in Fig. 3.2.
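The A-optimality cost used for selecting the next sampling positions can be evaluated numerically along the following lines. This is a minimal sketch under stated assumptions: the FIM entries use the closed form from Section 3.2 with the partial derivatives replaced by central finite differences, the FIM is evaluated at the current estimate of the hyperparameters, and the minimization over q̃ is replaced by scoring a handful of random candidate configurations rather than the gradient-based optimization used in the dissertation; all function names and data are illustrative.

```python
import numpy as np

def full_cov(X, theta):
    """Squared-exponential space-time covariance with noise on the diagonal.
    Rows of X are (s_x, s_y, t); theta = (sigma_f, sigma_x, sigma_y, sigma_t, sigma_w)."""
    sigma_f, sx, sy, st, sigma_w = theta
    d2 = ((X[:, None, :] - X[None, :, :]) / np.array([sx, sy, st])) ** 2
    return sigma_f ** 2 * np.exp(-0.5 * d2.sum(-1)) + sigma_w ** 2 * np.eye(len(X))

def fisher_information(X, theta, eps=1e-4):
    """(M)_ij = 0.5 tr(C^-1 dC/dtheta_i C^-1 dC/dtheta_j), derivatives by central differences."""
    C_inv = np.linalg.inv(full_cov(X, theta))
    dC = []
    for i in range(len(theta)):
        tp, tm = np.array(theta, float), np.array(theta, float)
        tp[i] += eps
        tm[i] -= eps
        dC.append((full_cov(X, tp) - full_cov(X, tm)) / (2 * eps))
    return np.array([[0.5 * np.trace(C_inv @ dC[i] @ C_inv @ dC[j])
                      for j in range(len(theta))] for i in range(len(theta))])

def a_optimality(q_new, X_past, t_next, theta):
    """J(q) = tr(M^-1), with the FIM built from past inputs plus candidate positions at t+1."""
    X_new = np.column_stack([q_new, np.full(len(q_new), t_next)])
    M = fisher_information(np.vstack([X_past, X_new]), theta)
    return np.trace(np.linalg.inv(M))

# Illustrative use: score random candidate configurations for N = 5 agents at time t+1 = 6,
# with the FIM evaluated at the current hyperparameter estimate.
rng = np.random.default_rng(1)
X_past = np.column_stack([20 * rng.random(25), 20 * rng.random(25),
                          np.repeat(np.arange(1, 6), 5)])
theta_hat = (5.0, 4.0, 2.0, 8.0, 0.5)
candidates = [20 * rng.random((5, 2)) for _ in range(20)]
best = min(candidates, key=lambda q: a_optimality(q, X_past, t_next=6, theta=theta_hat))
```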
The estimates of the hyperparameters (shown as circles with error bars) tend to converge to the true values (shown as dotted lines) for both strategies. As can be seen, the proposed scheme outperforms the random strategy in terms of the A-optimality criterion.

Figure 3.2: Monte Carlo simulation results (100 runs) for a spatio-temporal Gaussian process using (a) the random sampling strategy and (b) the adaptive sampling strategy. The estimated hyperparameters are shown as blue circles with error bars. The true hyperparameters used for generating the process are shown as red dashed lines.

Fig. 3.3 shows the predicted field along with the agents' trajectories at times t = 1 and t = 20 for one trial. As shown in Fig. 3.1(a) and Fig. 3.3(a), at time t = 1 the predicted field is far from the true field due to the inaccurate hyperparameter estimates and the small number of observations. As time increases, the predicted field becomes closer to the true field thanks to the improved quality of the estimated covariance function and the accumulated observations. As expected, at time t = 20 the quality of the predicted field is very good near the sampled positions, as shown in Fig. 3.3-(b). With 100 observations, the running time is around 30 s using Matlab R2008a (MathWorks) on a PC with a 2.4 GHz dual-core processor; no attempt has been made to optimize the code. After converging to a good estimate of θ, the agents can switch to a decentralized configuration and collect samples for other goals, such as peak tracking and prediction of the process [8-10].

Figure 3.3: Predicted fields along with agents' trajectories at (a) t = 1 and (b) t = 20.

3.3.2 Time-varying covariance functions

To illustrate the adaptability of the proposed strategy to time-varying covariance functions, we introduce a Gaussian process defined by the following covariance function. The time-varying covariance function is modeled as a time-varying weighted sum of two known covariance functions C_1(·,·) and C_2(·,·),

\[
C(\cdot,\cdot) = \lambda(t)\,C_1(\cdot,\cdot) + (1-\lambda(t))\,C_2(\cdot,\cdot), \tag{3.4}
\]

where λ(t) ∈ [0, 1] is a time-varying weight factor that needs to be estimated. In the simulation study, C_1(·,·) is constructed with σ_f = 1, σ_x = 0.2, σ_y = 0.1, σ_t = 8, and σ_w = 0.1, and C_2(·,·) with σ_f = 1, σ_x = 0.1, σ_y = 0.2, σ_t = 8, and σ_w = 0.1. The Gaussian process defined in (3.4) with these particular C_1 and C_2 effectively models hyperparameter changes in the x and y directions.

To improve adaptability, the mobile sensor network uses only the observations sampled during the last 20 iterations for estimating λ(t) online. The true λ(t) and the estimated λ(t) are shown in Fig. 3.4(a) and (b), respectively. From Fig. 3.4, it is clear that the weighting factor λ(t) can be estimated accurately after a delay of about 5-8 iterations. The delay arises because the truncated observations still contain past observations while the time-varying covariance function changes continuously in time.
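A minimal sketch of one way to estimate λ(t) online from a truncated window of observations follows: with C_1 and C_2 fixed, the window log likelihood is maximized over the scalar λ ∈ [0, 1]. The kernel construction, the synthetic window contents, and the function names are illustrative assumptions; the dissertation does not spell out its estimator at this level of detail.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def se_cov(X, sigma_f, sx, sy, st, sigma_w):
    """Squared-exponential space-time covariance with noise on the diagonal."""
    d2 = ((X[:, None, :] - X[None, :, :]) / np.array([sx, sy, st])) ** 2
    return sigma_f ** 2 * np.exp(-0.5 * d2.sum(-1)) + sigma_w ** 2 * np.eye(len(X))

def estimate_lambda(X_win, y_win):
    """ML estimate of the weight lambda in C = lambda*C1 + (1-lambda)*C2 over a window."""
    C1 = se_cov(X_win, 1.0, 0.2, 0.1, 8.0, 0.1)   # component kernels of Section 3.3.2
    C2 = se_cov(X_win, 1.0, 0.1, 0.2, 8.0, 0.1)

    def nll(lam):
        C = lam * C1 + (1.0 - lam) * C2
        L = np.linalg.cholesky(C)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_win))
        return 0.5 * y_win @ alpha + np.log(np.diag(L)).sum()

    return minimize_scalar(nll, bounds=(0.0, 1.0), method="bounded").x

# Illustrative call on a synthetic 20-step window of observations from 5 agents.
rng = np.random.default_rng(2)
X_win = np.column_stack([rng.random(100), rng.random(100), np.repeat(np.arange(20), 5)])
y_win = rng.standard_normal(100)
lam_hat = estimate_lambda(X_win, y_win)
```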
Figure 3.4: (a) True weighting factor λ(t) and (b) the estimated λ(t).

3.3.3 Advection-diffusion process

We apply our approach to a spatio-temporal process generated by physical phenomena (advection and diffusion). This study can be viewed as statistical modeling of a physical process, i.e., as an effort to fit a Gaussian process to a physical advection-diffusion process in practice. The advection-diffusion model developed in [29] was used to generate the experimental data numerically. An instantaneous release of Q kg of gas occurs at a location (x_0, y_0, z_0) and is then spread by the wind with mean velocity u = (u_x, 0, 0)^T. Assuming that all measurements are recorded at the level z = 0 and that the release occurs at ground level (i.e., z_0 = 0), the concentration C at an arbitrary location (x, y, 0) and time t is described by the following analytical solution [11]:

\[
C(x, y, 0, t) = \frac{Q \exp\left(-\dfrac{(\Delta x - u_x \Delta t)^2}{4 K_x \Delta t} - \dfrac{\Delta y^2}{4 K_y \Delta t}\right)}{4 \pi^{3/2} (K_x K_y K_z)^{1/2} (\Delta t)^{3/2}}, \tag{3.5}
\]

where Δx = x − x_0, Δy = y − y_0, and Δt = 5(t − 1) + t_0. The parameters used in the simulation study are listed in Table 3.2. Notice that this process generates an anisotropic concentration field, with K_x = 20 m²/min and K_y = 10 m²/min as in Table 3.2. The fields at times t = 1 and t = 10 are shown in Fig. 3.5; notice that the center of the concentration has moved.

Table 3.2: Parameters used in simulation.

Parameter                      Notation   Unit      Value
Number of agents               N_s        -         5
Sampling time                  t_s        min       5
Initial time                   t_0        min       100
Gas release mass               Q          kg        10^6
Wind velocity in x axis        u_x        m/min     0.5
Eddy diffusivity in x axis     K_x        m²/min    20
Eddy diffusivity in y axis     K_y        m²/min    10
Eddy diffusivity in z axis     K_z        m²/min    0.2
Location of explosion (x)      x_0        m         2
Location of explosion (y)      y_0        m         5
Location of explosion (z)      z_0        m         0
Sensor noise level             σ_w        kg/m³     0.1

In this case, N = 5 mobile sensing agents were initialized at random positions in the surveillance region Q = [−50, 150] × [−100, 100]. The initial value for the algorithm was chosen to be θ_0 = (100, 100, 100)^T, where we assumed σ_f = 1 and σ_w = 0.1. For this application we did not assume any prior knowledge about the covariance function; hence, the MAP estimator coincides with the ML estimator.

Figure 3.5: Snapshots of the advection-diffusion process at (a) t = 1 and (b) t = 10.

The gradient method was used to find the ML estimate. We again assumed that the global basis is the same as the model basis and, for simplicity, that all agents have the same measurement noise level. In our simulation study, the agents start sampling at t_0 = 100 min and take measurements with a sampling period of t_s = 5 min, as in Table 3.2. Monte Carlo simulations were run 100 times, and Fig. 3.6 shows the estimated σ_x, σ_y, and σ_t obtained with (a) the random sampling strategy and (b) the optimal sampling strategy. With 100 observations, the running time at each time step is around 20 s using Matlab R2008a (MathWorks) on a PC with a 2.4 GHz dual-core processor; no attempt has been made to optimize the code. As can be seen in Fig. 3.6, the estimates of the hyperparameters tend to converge to similar values for both strategies. Clearly, the proposed strategy outperforms the random sampling strategy in terms of the estimation error variance.
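For concreteness, the data-generating model (3.5) of this subsection can be sketched in a few lines, with the values of Table 3.2 as default arguments. The function name and the grid evaluation are illustrative assumptions; the dissertation generated its data in Matlab.

```python
import numpy as np

def concentration(x, y, t, Q=1e6, ux=0.5, Kx=20.0, Ky=10.0, Kz=0.2,
                  x0=2.0, y0=5.0, t0=100.0, ts=5.0):
    """Ground-level concentration (3.5) of an instantaneous release, using the
    parameter values of Table 3.2 as defaults; dt = ts*(t - 1) + t0."""
    dx, dy = np.asarray(x) - x0, np.asarray(y) - y0
    dt = ts * (t - 1) + t0
    num = Q * np.exp(-(dx - ux * dt) ** 2 / (4 * Kx * dt) - dy ** 2 / (4 * Ky * dt))
    den = 4 * np.pi ** 1.5 * np.sqrt(Kx * Ky * Kz) * dt ** 1.5
    return num / den

# Evaluate the field over the surveillance region Q = [-50, 150] x [-100, 100] at t = 1.
xs, ys = np.meshgrid(np.linspace(-50, 150, 201), np.linspace(-100, 100, 201))
field_t1 = concentration(xs, ys, t=1)
```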
Figure 3.6: Simulation results (100 runs) for an advection-diffusion process. The estimated hyperparameters with (a) random sampling and (b) optimal sampling.

Chapter 4

Prediction with Known Covariance Function

The main reason why nonparametric prediction using Gaussian processes is not popular for resource-constrained multi-agent systems is that the optimal prediction must use all cumulatively measured values in a non-trivial way [23, 38]. In this case, a robot needs to compute the inverse of a covariance matrix whose size grows as it collects more measurements, and with this operation the robot will quickly run out of memory. Therefore, it is necessary to develop a class of prediction algorithms using spatio-temporal Gaussian processes under a fixed memory size.

The space-time Kalman filter model proposed in [18, 40] and utilized in [9] partially solves this problem by modeling the spatio-temporal field as the sum of a zero-mean Gaussian process, which is uncorrelated in time, and a time-varying mean function (see (6) and (12) in [18]). The zero-mean Gaussian process represents a spatial structure that is independent from one time point to the next, as described in [18], under the assumption that the dynamical environmental process is governed by a relatively large time scale. This formulation in turn provides the Markov property in time, which makes the optimal prediction recursive in time. However, the value of the temporal mean function at a point (realized by a stable linear system) consists of a linear sum of colored noises and transient responses that converge to zero exponentially fast [9], which cannot represent a wide range of spatio-temporal phenomena in a fully nonparametric manner [51].

A simple way to cope with this dilemma is to design a robot so that it predicts the spatio-temporal Gaussian process at the current (or a future) time based on truncated observations, e.g., the last m observations out of a total of n observations, as shown in Fig. 4.1. This is intuitive in the sense that the last m observations are more correlated with the point of interest than the other r = n − m observations (Fig. 4.1) when predicting values at the current or a future time. Therefore, it is important to analyze the performance degradation and trade-off effects of prediction based on truncated observations compared to prediction based on all cumulative observations.

The second motivation is to design and analyze distributed sampling strategies for resource-constrained mobile sensor networks. Developing distributed estimation and coordination algorithms for multi-agent systems using only local information from neighboring agents has been one of the most fundamental problems in mobile sensor networks [10, 13, 28, 46, 52, 68, 69]. Emphasizing practicality and usefulness, it is critical to synthesize and analyze distributed sampling strategies under practical constraints such as measurement noise and a limited communication range.

In Section 4.1, we propose to use only truncated observations to bound the computational complexity. The error bounds in using truncated observations are analyzed for prediction at
a single point in Section 4.1.1. A way of selecting the temporal truncation size is also discussed in Section 4.1.2. To improve the prediction quality, centralized and distributed navigation strategies for mobile sensor networks are proposed in Section 4.2. In Section 4.3, simulation results illustrate the usefulness of our schemes under different conditions and parameters.

Figure 4.1: The robot predicts a scalar value at x∗ (denoted by a red star) based on n cumulative spatio-temporal observations (denoted by blue crosses). Near-optimal prediction can be obtained using truncated observations, e.g., the last m observations. In this case, x = (s_x, s_y, t)^T.

4.1 GPR with Truncated Observations

As mentioned above, one drawback of Gaussian process regression is that its computational complexity and memory requirement increase as more measurements are collected, making the method prohibitive for robots with limited memory and computing power. To overcome this increase in complexity, a number of approximation methods for Gaussian process regression have been proposed. In particular, the sparse greedy approximation method [59], the Nyström method [63], the informative vector machine [33], the likelihood approximation [57], and the Bayesian committee machine [61] have been shown to be effective for many problems. However, these approximation methods have been proposed without theoretical justification.

In general, if measurements are taken at nearby locations (or space-time locations), the correlation between the measurements is strong, and the correlation decays exponentially as the distance between locations increases. If the correlation function of a Gaussian process has this property, intuition suggests that we can make a good prediction at a point of interest using only nearby measurements. In the next subsection, we formalize this idea and provide a theoretical foundation justifying the Gaussian process regression with truncated observations proposed in this chapter.

4.1.1 Error bounds in using truncated observations

Consider a zero-mean Gaussian process

\[
z(x) \sim \mathcal{GP}\big(0, \sigma_f^2 C(x, x')\big). \tag{4.1}
\]

Notice that we write the covariance function as σ_f² C(x, x′), in which C(x, x′) := Corr(z(x), z(x′)) is the correlation function. Recall that the predictive distribution of z∗ := z(x∗) at a point of interest x∗ given the observations y = (y^(1), · · · , y^(n))^T is Gaussian, i.e.,

\[
z_* \mid y \sim \mathcal{N}\big(\mu_{z_*|y}, \sigma^2_{z_*|y}\big), \tag{4.2}
\]

where

\[
\mu_{z_*|y} = k^T C^{-1} y, \tag{4.3a}
\]

and

\[
\sigma^2_{z_*|y} = \sigma_f^2\big(1 - k^T C^{-1} k\big). \tag{4.3b}
\]

In (4.3a) and (4.3b), we have defined C := Corr(y, y) ∈ R^{n×n} and k := Corr(y, z∗) ∈ R^n. Notice that in this chapter we assume the hyperparameter vector θ ∈ R^m is given, and hence we suppress the explicit conditioning on θ.

Without loss of generality, we assume that the first m out of the n observations are used to predict z∗. Let r = n − m, y_m = (y^(1), · · · , y^(m))^T, and y_r = (y^(m+1), · · · , y^(n))^T. Then the covariance matrix K ∈ R^{n×n} and the vector k ∈ R^n can be partitioned as

\[
K = \begin{pmatrix} K_m & K_{mr} \\ K_{mr}^T & K_r \end{pmatrix}, \qquad
k = \begin{pmatrix} k_m \\ k_r \end{pmatrix}.
\]

Using the truncated observations, we can predict the value z∗ as

\[
\mu_{z_*|y_m} = k_m^T C_m^{-1} y_m, \tag{4.4}
\]

with a prediction error variance given by

\[
\sigma^2_{z_*|y_m} = \sigma_f^2\big(1 - k_m^T C_m^{-1} k_m\big), \tag{4.5}
\]

where C_m = K_m + σ_w² I ∈ R^{m×m}. The following result quantifies the gap between the predicted values obtained using truncated measurements and all measurements.

Theorem 4.1.1.
Consider a Gaussian process z(x) ∼ GP(0, σf C(x, x )), we have −1 T −1 T −1 T µz∗ |y − µz∗ |ym = (kr − Kmr Cm km )T (Cr − Kmr Cm Kmr )−1 (yr − Kmr Cm ym ), (4.6a) and 2 2 2 T −1 −1 T −1 T σz∗ |y − σz∗ |ym = −σf (kr − Kmr Cm km )T (Cr − Kmr Cm Kmr )−1 (kr − Kmr Cm km ) < 0. (4.6b) Proof. We can rewrite (4.3a) as  T km  µz∗ |y =   kr −1    Cm Kmr  ym     , T yr Kmr Cr  (4.7a) and (4.3b) as −1    km   Cm Kmr  km  2 2 σz∗ |y = σf 1 −       .   T kr Kmr Cr kr   T  (4.7b) Using the identity based on matrix inversion lemma (see Appendix A.2), (4.7a) and (4.7b) become T −1 µz∗ |y = km Cm ym T −1 T −1 T −1 + (kr − Kmr Cm km )T (Cr − Kmr Cm Kmr )−1 (yr − Kmr Cm ym ), 42 and T −1 2 2 σz∗ |y = σf 1 − km Cm km 2 T −1 T −1 T −1 − σf (kr − Kmr Cm km )T (Cr − Kmr Cm Kmr )−1 (kr − Kmr Cm km ). Hence, by the use of (4.4) and (4.5), we obtain (4.6a) and 4.6b. 2 Corollary 4.1.2. The prediction error variance σz |y is a non-increasing function of m. ∗ m Proof. The proof is straightforward from Theorem 4.1.1 by letting n = m + 1. Considering an ideal case in which the measurements ym are not correlated with the remaining measurements yr , we have the following result. Proposition 4.1.3. Under the assumptions used in Theorem 4.1.1 and for given yr ∼ 2 2 T −1 2 T −1 N (0, Cr ), if Kmr = 0, then µz∗ |y − µz∗ |ym = kr Cr yr and σz |y − σz |y = −σf kr Cr kr . ∗ m ∗ In addition, we also have T −1 µz∗ |y − µz∗ |ym ≤ kr Cr √ r¯(p1 ) y with a non-zero probability p1 . For a desired p1 , we can find y (p1 ) by solving ¯  p1 =   ¯ 1 − 2Φ − y (p1 )  , 1/2 λi 1≤i≤r (4.8) where Φ is the cumulative normal distribution and {λi | i = 1, · · · , r} are the eigenvalues of Cr = U ΛU T with a unitary matrix U , i.e., Λ = diag(λ1 , · · · , λr ). Proof. The first statement is straightforward from Theorem 4.1.1. 43 1/2 For the second statement, we can represent yr as yr = Cr u = U Λ1/2 u = U y , where u ˜ 1/2 is a vector of independent standard normals and Cr = U ΛU T and Cr = U Λ1/2 . By using the Cauchy-Schwarz inequality and norm inequalities, we have T −1 T −1 ˜ µz∗ |y − µz∗ |ym = kr Cr yr = kr Cr U y T −1 ≤ kr Cr T −1 U y = kr C r ˜ √ T −1 ≤ kr Cr y ˜ T −1 r y ∞ ≤ kr C r ˜ √ r¯. y Recall that we have u ∼ N (0, I) and y ∼ N (0, Λ), where Λ = diag(λ1 , · · · , λr ). Then we ˜ can compute the probability p1 = Pr( y ∞ ≤ y ) as follows. ˜ ¯ p1 = Pr max y (i) ≤ y = Pr ˜ ¯ 1≤i≤r 1/2 Pr λi |ui | ≤ y = ¯  =  1≤i≤r = 1/2 max λ 1≤i≤r i  1≤i≤r ¯ ui ≤ y Pr |ui | ≤  y  ¯ 1/2 λi  ¯ 1 − 2Φ − y  , 1/2 λi 1≤i≤r where Φ(·) is the cumulative standard normal distribution. Hence, if the magnitude of Kmr is small, then the truncation error from using truncated T −1 measurements will be close to kr Cr kr . Furthermore, if we want to reduce this error, we want kr to be small, i.e., when the covariance between z∗ and the remaining measurements yr is small. In summary, if the following two conditions are satisfied: (1) the correlation between measurements ym and the remaining measurements yr is small and (2) the correlation between z∗ and the remaining measurements yr is small, then the truncation error is small and µz∗ |ym can be a good approximation to µz∗ |y . This idea is formalized in a more general 44 setting in the following theorem. 2 Theorem 4.1.4. Consider a zero-mean Gaussian process z(x) ∼ N (0, σf C(x, x )) with the correlation function C(x, x ) = exp − x−x 2 2 2 , (4.9) and assume that we have collected n observations, y (1) , · · · , y (n) . 
Suppose that Kmr is small −1 T −1 T enough such that Kmr Cm km ≤ kr , and Kmr Cm ym ≤ δ2 yr and for some δ2 > 0. (i) < y (p ) with probability p and Given 0 < p2 < 1, choose y (p2 ) such that maxn ¯ ¯ 2 2 i=m+1 y > 0 such that < 2γr(1 + δ2 )¯(p2 ) where γ is the signal-to-noise ratio. For x∗ , if the last y r = n − m data points satisfy x(i) − x∗ 2 1 > 2σ 2 log 2γ r(1 + δ2 )¯(p2 ) , y then, with probability p2 , we have µz∗ |y − µz∗ |ym < . T −1 −1 Proof. Let A = Cm Kmr and B = Kmr Cm Kmr for notational convenience. Then T T µz∗ |y − µz∗ |ym = (kr − km A)(Cr − B)−1 (yr − AT ym ) T T ≤ kr − km A (Cr − B)−1 (yr − AT ym ) T T ≤ kr − km A × ( (Cr − B)−1 yr + (Cr − B)−1 AT ym ) ≤ 2 kr (Cr − B)−1 yr + (Cr − B)−1 AT ym Since Kr is positive semi-definite, and Cm is positive definite, we haveKr − B is positive 45 semi-definite. Then we have (Cr − B)−1 = (Kr + 1/γI − B)−1 γI. Combining this result, we get µz∗ |y − µz∗ |ym ≤ 2γ kr ( yr + AT ym ) ≤ 2γ(1 + δ2 ) kr yr √ ≤ 2γ(1 + δ2 ) rCmax yr , (i) ≤ where C(x(i) , x∗ ) ≤ Cmax for i ∈ {m + 1, · · · , n}. Define y (p2 ) such that maxn ¯ i=m+1 y y (p2 ) with probability p2 . Then, with probability p2 , we have ¯ µz∗ |y − µz∗ |ym ≤ 2γr(1 + δ2 )Cmax y (p2 ). ¯ Hence, for > 0, if Cmax < 2γr(1 + δ2 )¯(p2 ) y (4.10) with probability p2 , we have µz∗ |y − µz∗ |ym < . Let l2 = min x(i) − x∗ 2 for any i ∈ {m + 1, · · · , n}. Then (4.10) becomes, with probability 46 p2 , exp − l2 2σ 2 l2 > −2σ 2 log For ≤ Cmax < 2γr(1 + δ2 )¯(p2 ) y 2γr(1 + δ2 )¯(p2 ) y < 2γr(1 + δ2 )¯(p2 ), we have y 1 y l2 > 2σ 2 log 2γ r(1 + δ2 )¯(p2 ) , and this completes the proof. Remark 4.1.5. The last part of Proposition 4.1.3 and Theorem 4.1.4 seek a bound for the difference between predicted values using all and truncated observations with a given probability since the difference is a random variable. Example 4.1.6. We provide an illustrative example to show how to use the result of The2 orem 4.1.4 as follows. Consider a Gaussian process defined in (4.1) and (4.9) with σf = 1, σ = 0.2, and γ = 100. If we have any randomly chosen 10 samples (m = 10) within (0, 1)2 and we want to make prediction at x∗ = (1, 1)T . We choose y (p2 ) = 2σf = 2 such that ¯ (i) < y (p ) with probability p = 0.95. According to Theorem 4.1.4, if we have ¯ 2 maxn 2 i=m+1 y an extra sample x(11) (r = 1) at (2.5, 2.5)T , which satisfies the condition x(11) − x∗ > 0.92, then the difference in prediction using with and without the extra sample is less than = 0.01 with probability p2 = 0.95. Example 4.1.7. Motivated by the results presented, we take a closer look at the usefulness of using a subset of observations from a sensor network for a particular realization of the Gaussian process. We consider a particular realization shown in Fig. 4.2, where crosses 47 2 represent the sampling points of a Gaussian process defined in (4.1) and (4.9) with σf = 1, σ = 0.2, and γ = 100 over (0, 1)2 . We have selected ym as the collection of observations (blue crosses) within the red circle of a radius R = 2σ = 0.4 centered at a point (a red star) located at x∗ = (0.6, 0.4)T . If a measurement is taken outside the red circle, the correlation between this measurement and the value at x∗ decreases to 0.135. The rest of observations (blue crosses outside of the red circle) are selected as yr . The prediction results are shown in Table 4.1. In this particular realization, we have z∗ = 1.0298. It can be seen that the prediction means and variances using only ym are close to the one using all observations. 
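The comparison just described can be reproduced in a few lines. The following sketch draws a fresh realization of the Gaussian process (4.1) with the correlation function (4.9) and the parameters of Example 4.1.7, so the printed numbers will differ from Table 4.1; the function names, the random seed, and the small jitter are illustrative assumptions.

```python
import numpy as np

def corr(XA, XB, sigma=0.2):
    """Isotropic squared-exponential correlation (4.9) over (0, 1)^2 inputs."""
    d2 = ((XA[:, None, :] - XB[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

# A fresh realization in the spirit of Example 4.1.7: sigma_f^2 = 1, sigma = 0.2, gamma = 100.
rng = np.random.default_rng(4)
sigma_f, gamma = 1.0, 100.0
X = rng.random((20, 2))                                  # n = 20 sampling points
x_star = np.array([[0.6, 0.4]])
X_all = np.vstack([X, x_star])
z = rng.multivariate_normal(np.zeros(21),
                            sigma_f ** 2 * corr(X_all, X_all) + 1e-10 * np.eye(21))
y, z_star = z[:20] + rng.normal(0.0, sigma_f / np.sqrt(gamma), 20), z[20]

def predict(idx):
    """GP prediction (4.3a)-(4.3b) at x_star using only the observations indexed by idx."""
    C = corr(X[idx], X[idx]) + np.eye(len(idx)) / gamma
    k = corr(X[idx], x_star)[:, 0]
    mu = k @ np.linalg.solve(C, y[idx])
    var = sigma_f ** 2 * (1.0 - k @ np.linalg.solve(C, k))
    return mu, var

near = np.linalg.norm(X - x_star, axis=1) <= 0.4         # y_m: within R = 2*sigma of x_star
print("true z*        :", z_star)
print("all (y)        :", predict(np.arange(20)))
print("nearby (y_m)   :", predict(np.where(near)[0]))
print("far only (y_r) :", predict(np.where(~near)[0]))
```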
We also compute the prediction at x∗ using y_r alone, which is far from the true value and has a large variance.

Figure 4.2: Example of the selection of truncated observations. The parameters used in the example are σ_f² = 1, σ = 0.2, and σ_w = 0.1.

Table 4.1: Prediction means and variances using y, y_m, and y_r.

                  y (n = 20)    y_m (m = 12)    y_r (r = 8)
μ_{z∗|·}          1.0515        1.0633          0.3491
σ²_{z∗|·}         0.0079        0.0080          0.9364

The result of Theorem 4.1.4 and Examples 4.1.6 and 4.1.7 all suggest using the observations that are highly correlated with the point of interest.

4.1.2 Selecting temporal truncation size

In the previous subsection, we obtained error bounds for the prediction at a single point. In general, the observations made close to that point are more informative than the others. Consider a zero-mean spatio-temporal Gaussian process

\[
z(s, t) \sim \mathcal{GP}\big(0, \sigma_f^2 C(s, t, s', t')\big), \tag{4.11}
\]

with the covariance function

\[
C(x, x') = C_s(s, s')\,C_t(t, t')
= \exp\left(-\sum_{\ell \in \{x,y\}} \frac{(s_\ell - s'_\ell)^2}{2\sigma_\ell^2}\right)
  \exp\left(-\frac{(t - t')^2}{2\sigma_t^2}\right). \tag{4.12}
\]

We define η as the truncation size; our objective is to use only the observations made during the last η time steps, i.e., from time t − η + 1 to time t, to make the prediction at time t. In general, a small η yields faster computation but lower accuracy, while a large η yields slower computation but higher accuracy. Thus, the truncation size η should be selected according to this trade-off between accuracy and efficiency. Next, we show an approach to selecting the truncation size η in an averaged performance sense.

Given the observations and the associated sampling locations and times (denoted by D, which depends on η), the generalization error ε_{x∗,D} at a point x∗ = (s∗^T, t∗)^T is defined as the prediction error variance σ²_{z∗|D} [60, 64]. For a given t∗, without knowing the user-specified s∗ a priori, we seek an η that guarantees a low prediction error variance uniformly over the entire space Q, i.e., we want ε_D = E_{s∗}[σ²_{z∗|D}] to be small [60, 64]. Here E_{s∗} denotes the expectation with respect to the uniform distribution of s∗.

According to Mercer's theorem, the kernel function C_s can be decomposed as

\[
C_s(s, s') = \sum_{i=1}^{\infty} \lambda_i \phi_i(s)\phi_i(s'),
\]

where {λ_i} and {φ_i(·)} are the eigenvalues and corresponding eigenfunctions, respectively [60]. In a similar way to [60], the input-dependent generalization error ε_D for our spatio-temporal Gaussian process can be obtained as

\[
\varepsilon_D = \mathbb{E}_{s_*}\!\left[\sigma_f^2\Big(1 - \operatorname{tr}\big(k k^T (K + \tfrac{1}{\gamma} I)^{-1}\big)\Big)\right]
= \sigma_f^2\Big(1 - \operatorname{tr}\big(\mathbb{E}_{s_*}[k k^T]\,(K + \tfrac{1}{\gamma} I)^{-1}\big)\Big). \tag{4.13}
\]

We have

\[
\mathbb{E}_{s_*}[k k^T] = \big(\Psi \Lambda^2 \Psi^T\big) \circ \big(k_t k_t^T\big), \tag{4.14}
\]

and

\[
K = \big(\Psi \Lambda \Psi^T\big) \circ K_t, \tag{4.15}
\]

where (Ψ)_ij = φ_j(s_i), (k_t)_j = C_t(t^(j), t∗), (K_t)_ij = C_t(t^(i), t^(j)), and (Λ)_ij = λ_i δ_ij, with δ_ij the Kronecker delta and ∘ the Hadamard (element-wise) product [60]. Hence, the input-dependent generalization error ε_D can be computed analytically by plugging (4.14) and (4.15) into (4.13). Notice that ε_D is a function of the inputs (i.e., the sampling locations and times). To obtain an averaged performance level without knowing the algorithmic sampling strategy a priori, we use an appropriate sampling distribution that models the stochastic behavior of the sampling strategy. Further averaging over the observation set D with this sampling distribution yields ε(η) = E_D[ε_D], which is a function of the truncation size η only. This averaging can be done using Monte Carlo methods. The truncation size η can then be chosen based on the averaged performance measure ε(η) under the sampling distribution.
An alternative way, without using the eigenvalues and eigenfunctions, is to directly and numerically compute ε_D = E_{s∗}[σ²_{z∗|D}] uniformly over the entire space Q with random sampling positions at each time step. The averaged generalization error can then be plotted against the temporal truncation size using such Monte Carlo methods, and the truncation size η can be chosen such that a given level of the averaged generalization error is achieved.

Example 4.1.8. Consider the problem of selecting a temporal truncation size η for spatio-temporal Gaussian process regression using observations from 9 agents. The spatio-temporal Gaussian process is defined in (4.1) and (4.9) with σ_f² = 1, σ_x = σ_y = 0.2, σ_t = 5, and γ = 100 over (0, 1)². The Monte Carlo simulation result is shown in Fig. 4.3, where the achieved generalization errors ε_D are plotted as blue circles with error bars as a function of the temporal truncation size η. As can be seen, an averaged generalization error (blue circles) below 0.1 can be achieved by using the observations taken during the last 10 time steps. Notice that the prediction error variances can be reduced significantly by optimally selecting the sampling positions. Hence, the selected η guarantees at least the averaged performance level of the sensor network when the optimal sampling strategy is used.

Figure 4.3: Example of selecting a temporal truncation size η. The parameters used in the example are σ_f² = 1, σ_x = σ_y = 0.2, σ_t = 5, and γ = 100.

By using a fixed truncation size η, the computational complexity and memory space required for making the prediction (i.e., evaluating (4.3a) and (4.3b)) do not increase as more measurements are collected. Our next objective is to improve the quality of the prediction by carefully selecting the future sampling positions of the mobile sensor network.

4.2 Optimal Sampling Strategies

At time t, the goal of the mobile sensor network is to make predictions at pre-specified points of interest {p_j = (v_j, τ_j) | j ∈ J}, indexed by J := {1, · · · , M}. From here on, the points of interest will be referred to as target points. The introduction of target points is motivated by the fact that potential environmental concerns should be monitored frequently. For instance, target points can be assigned at the interface of a factory and a lake, at sewage systems, or at polluted beaches. Thus, the target points, which can be specified arbitrarily by a user, provide a flexible way to define the geometric shape of a subregion of interest within the surveillance region. Notice that the target points can be changed by the user at any time. In particular, we allow the number of target points M to be larger than the number of agents N, which is often the case in practice. The prediction of z_j := z(p_j) of the Gaussian process at a target point p_j can be obtained as in (4.3a) and (4.3b).

4.2.1 Centralized navigation strategy

Consider the case in which a central station receives the collective measurements from all N mobile sensors and performs the prediction. Let the central station discard the oldest set of measurements y_{t−η+1} after making the prediction at time t. At the next time index t + 1, using the remaining observations y_{t−η+2:t} in memory along with the new measurements y_{t+1} from all N agents at time t + 1, the central station will predict z(s∗, t∗) evaluated at the target points {p_j | j ∈ J}.
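The prediction step just described, namely evaluating the prediction error variance (4.3b) at the target points from temporally truncated observations, can be sketched as follows; the average of the resulting variances over the target points is exactly the quantity that the navigation cost J_c in (4.16) below minimizes, and it is also the quantity averaged in the Monte Carlo selection of η above. The kernel parameters, the random observation set, and the function names are illustrative assumptions.

```python
import numpy as np

def corr(XA, XB, sx=0.2, sy=0.2, st=5.0):
    """Separable squared-exponential correlation; rows are (s_x, s_y, t)."""
    d2 = ((XA[:, None, :] - XB[None, :, :]) / np.array([sx, sy, st])) ** 2
    return np.exp(-0.5 * d2.sum(-1))

def avg_pred_error_variance(X_obs, X_targets, sigma_f=1.0, gamma=100.0):
    """Average over targets of sigma_f^2 (1 - k^T C^{-1} k), cf. (4.3b)."""
    C = corr(X_obs, X_obs) + np.eye(len(X_obs)) / gamma
    K = corr(X_obs, X_targets)               # columns are the k-vectors of the targets
    V = sigma_f ** 2 * (1.0 - np.einsum('ij,ij->j', K, np.linalg.solve(C, K)))
    return V.mean()

# Illustrative use: truncated observations from N = 9 agents over the last eta = 10 steps,
# and a 6 x 6 grid of target points at the next time step t + 1.
rng = np.random.default_rng(3)
eta, N, t = 10, 9, 20
X_obs = np.column_stack([rng.random(N * eta), rng.random(N * eta),
                         np.repeat(np.arange(t - eta + 1, t + 1), N)])
g = np.linspace(0.1, 0.9, 6)
gx, gy = np.meshgrid(g, g)
X_targets = np.column_stack([gx.ravel(), gy.ravel(), np.full(gx.size, t + 1)])
print(avg_pred_error_variance(X_obs, X_targets))
```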
Hence, the agents should move to the most informative locations for taking measurements at time t + 1 [32].

For notational simplicity, let ȳ ∈ R^{N(η−1)} denote the remaining observations, i.e., ȳ := y_{t−η+2:t}, and let ỹ ∈ R^N denote the measurements that will be taken at the positions q̃ = (q̃_1^T, · · · , q̃_N^T)^T ∈ Q^N at time t + 1. In contrast to information-theoretic control strategies based on the conditional entropy or the mutual information criterion [14, 32], in this chapter the mobility of the robotic sensors is designed to directly minimize the average of the prediction error variances over the target points, i.e.,

\[
J_c(\tilde q) = \frac{1}{|\mathcal{J}|} \sum_{j \in \mathcal{J}} \sigma^2_{z_j|\bar y, \tilde y}(\tilde q), \tag{4.16}
\]

where |J| = M is the cardinality of J. The prediction error variance at each of the M target points is given by

\[
\sigma^2_{z_j|\bar y, \tilde y}(\tilde q) = \sigma_f^2\Big(1 - k_j(\tilde q)^T C(\tilde q)^{-1} k_j(\tilde q)\Big), \qquad \forall j \in \mathcal{J},
\]

where k_j(q̃) and C(q̃) are defined as

\[
k_j(\tilde q) = \begin{pmatrix} \mathrm{Corr}(\bar y, z_j) \\ \mathrm{Corr}(\tilde y, z_j) \end{pmatrix}, \qquad
C(\tilde q) = \begin{pmatrix} \mathrm{Corr}(\bar y, \bar y) & \mathrm{Corr}(\bar y, \tilde y) \\ \mathrm{Corr}(\tilde y, \bar y) & \mathrm{Corr}(\tilde y, \tilde y) \end{pmatrix}.
\]

In order to reduce the average of the prediction error variances over the target points {p_j | j ∈ J}, the central station solves the following optimization problem:

\[
q(t+1) = \arg\min_{\tilde q \in Q^N} J_c(\tilde q). \tag{4.17}
\]

Notice that in this problem set-up we only consider the constraint that the robots move within the region Q. However, mobility constraints, such as the maximum distance a robot can move between two time indices or the maximum speed at which a robot can travel, can be incorporated as additional constraints in the optimization problem [13].

The sensor network configuration q(t) can be controlled by a gradient descent algorithm such that q(t) moves to a local minimum of J_c for the prediction at time t + 1. The gradient descent control law is

\[
\frac{dq(\tau)}{d\tau} = -\nabla_q J_c(q(\tau)), \tag{4.18}
\]

where ∇_x J_c(x) denotes the gradient of J_c(x) at x. A critical point of J_c(q̃) obtained by (4.18) yields q(t + 1). The analytical form of ∂σ²_{z_j|ȳ,ỹ}(q̃)/∂q̃_{i,ℓ}, where q̃_{i,ℓ} is the ℓ-th element of q̃_i ∈ Q, is

\[
\frac{\partial \sigma^2_{z_j|\bar y,\tilde y}(\tilde q)}{\partial \tilde q_{i,\ell}}
= \sigma_f^2 \left( k_j^T C^{-1} \frac{\partial C}{\partial \tilde q_{i,\ell}} C^{-1} k_j
- 2\, k_j^T C^{-1} \frac{\partial k_j}{\partial \tilde q_{i,\ell}} \right),
\qquad \forall i \in \mathcal{I},\ \ell \in \{1, 2\}.
\]

Other, more advanced nonlinear optimization techniques may also be applied to solve the optimization problem in (4.17) [3]. The centralized sampling strategy for the mobile sensor network with the cost function J_c in (4.16) is summarized in Table 4.2. Notice that the prediction in the centralized sampling strategy uses temporally truncated observations. A decentralized version of the centralized sampling strategy in Table 4.2 may be developed using the approach proposed in [42], in which each robot incrementally refines its decision while intermittently communicating with the rest of the robots.

4.2.2 Distributed navigation strategy

Now we consider the case in which each agent in the sensor network can only communicate with other agents within a limited communication range R and no central station exists. In this section, we present a distributed navigation strategy for mobile agents that uses only local information in order to minimize a collective network performance cost function.

The communication network of the mobile agents can be represented by an undirected graph. Let G(t) := (I, E(t)) be an undirected communication graph such that an edge (i, j) ∈ E(t) exists if and only if agent i can communicate with agent j at time t. We define the neighborhood of agent i at time t by Ni(t) := {j ∈ I | (i, j) ∈ E(t)}.
In particular, we have Ni (t) = j ∈ I | qi (t) − qj (t) < R, j = i . Note that in our definition above, “ < ” is used instead of “ ≤ ” in deciding the communication range. At time t ∈ Z>0 , agent i collects measurements 55 yj (t) | j ∈ {i} ∪ Ni (t) sampled at Table 4.2: Centralized sampling strategy at time t. Input: (1) Number of agents N (2) Positions of agents {qi (t) | i ∈ I} 2 (3) Hyperparameters of the Gaussian process θ = (σf , σx , σy , σt )T (4) Target points pj | j ∈ J (5) Truncation size η Output: (1) Prediction at target points µz |y |j ∈ J j t−η+1:t (2) Prediction error variance at target points 2 σz |y |j ∈ J j t−η+1:t For i ∈ I, agent i performs: 1: make an observation at current position qi (t), i.e., yi (t) 2: transmit the observation yi (t) to the central station The central station performs: 1: collect the observations from all N agents, i.e., yt = (y1 (t), · · · , yN (t))T T T 2: obtain the cumulative measurements, i.e., yt−η+1:t = (yt−η+1 , · · · , yt )T 3: for j ∈ J do 4: make prediction at a target point pj = k T C −1 y, µz |y j t−η+1:t with a prediction error variance given by 2 2 = σf (1 − k T C −1 k), σz |y j t−η+1:t 5: 6: 7: 8: 9: where y = yt−η+1:t , k = Corr(y, zj ), and C = Corr(y, y) end for if t ≥ η then discard the oldest set of measurements taken at time t − η + 1, i.e., yt−η+1 end if compute the control with the remained data yt−η+2:t q(t + 1) = arg minq∈QN Jc (˜), q ˜ via dq(τ ) dτ = − q Jc (q(τ )) 10: send the next sampling positions {qi (t + 1)}N (a critical point of Jc (˜)) to all q i=1 N agents For i ∈ I, agent i performs: 1: receive the next sampling position qi (t + 1) from the central station 2: move to qi (t + 1) before time t + 1 56 qj (t) | j ∈ {i} ∪ Ni (t) from its neighbors and itself. The collection of these observations [i] [i] and the associated sampling positions in vector forms are denoted by yt and qt , respectively. Similarly, for notational simplicity, we also define the cumulative measurements that have been collected by agent i from time t − η + 1 to t as T [i] [i] [i] . yt−η+1:t = (yt−η+1 )T , · · · , (yt )T In contrast to the centralized scheme, in the distributed scheme, each agent determines the sampling points based on the local information from neighbors. After making the predic[i] tion at time t, agent i discards the oldest set of measurements yt−η+1 . At time t+1, using the [i] [i] remained observations yt−η+2:t in the memory along with new measurements yt+1 from its neighbors in Ni (t + 1), agent i will predict z(s∗ , t∗ ) evaluated at target points pj | j ∈ J . For notational simplicity, let y [i] be the remained observations of agent i, i.e., y [i] := ¯ ¯ [i] yt−η+2:t . Let y [i] be the new measurements that will be taken at positions of agent i and its ˜ neighbors q [i] ∈ Q|Ni (t+1)|+1 , and at time t + 1, where |Ni (t + 1)| is the number of neighbors ˜ of agent i at time t + 1. The prediction error variance obtained by agent i at each of M target points (indexed by J ) is given by σ2 (˜[i] ) q zj |¯[i] ,˜[i] y y [i] [i] 2 q = σf 1 − kj (˜[i] )T C [i] (˜[i] )−1 kj (˜[i] ) , q q ∀j ∈ J , [i] where kj (˜[i] ) and C [i] (˜[i] ) are defined as q q   [i] , z ) y j  [i] Corr(¯ kj (˜[i] ) =  q , [i] , z ) Corr(˜ y j   [i] , y [i] ) Corr(¯[i] , y [i] ) y y ˜  Corr(¯ ¯ C [i] (˜[i] ) =  q . 
[i] , y [i] ) Corr(˜[i] , y [i] ) Corr(˜ ¯ y y ˜ 57 (4.19) The performance of agent i can be evaluated by the average of the prediction error variances over target points, i.e., J [i] (˜[i] ) = q 1 |J | σ2 j∈J (˜[i] ), q zj |¯[i] ,˜[i] y y ∀i ∈ I. One criterion to evaluate the network performance is the average of individual performance, i.e., J(˜) = q 1 |I| J [i] (˜[i] ). q (4.20) i∈I However, the discontinuity of the function J occurs at the moment of gaining or losing neighbors, e.g., at the set q | q i − qj = R . ˜ ˜ ˜ A gradient decent algorithm for mobile robots that minimizes such J may produce hybrid system dynamics and/or chattering behaviors when robots lose or gain neighbors. Therefore, we seek to minimize an upper-bound of J that is continuously differentiable. Consider the following function σ2 ¯ [i] zj |¯[i] ,˜[i] y y [i] 2 ¯ q (˜[i] ) = σf 1 − kj (˜[i] )T C [i] (˜[i] )−1 kj (˜[i] ) , q q q ∀j ∈ J , (4.21) ¯ q where C [i] (˜[i] ) is defined as   [i] , y [i] ) [i] , y [i] ) y Corr(¯ ˜ y Corr(¯ ¯  ¯ q C [i] (˜[i] ) =  . [i] , y [i] ) Corr(˜[i] , y [i] ) + C [i] (˜[i] ) ˜ q Corr(˜ ¯ y y ˜ ¯ q ˜ q Notice that C [i] (˜[i] ) is obtained by adding a positive semi-definite matrix C [i] (˜[i] ) to the 58 lower right block of C [i] (˜[i] ) in (4.19), where q 1 ˜ q C [i] (˜[i] ) = diag Φ(di1 )−1 , · · · , Φ(di(|N (t+1)|+1) )−1 − I, i γ where dij := qi − qj ˜ ˜ is the distance between agent i and agent j, ∀j ∈ {i} ∪ Ni (t + 1). Φ : [0, R) → (0, γ] is a continuously differentiable function defined as Φ(d) = γφ d + d0 − R d0 , (4.22) where φ(h) =    1,   exp h ≤ 0, −h2 1−h2 , 0 < h < 1. An example of Φ(d) where γ = 100, R = 0.4, and d0 = 0.1 is shown in the red dotted line in Fig. 4.4. Notice that if Φ(d) = γ is used (the blue solid line in Fig. 4.4), we have ¯ q C [i] (˜[i] ) = C [i] (˜[i] ). We then have the following result. q 100 Φ(d) 80 60 40 20 0 0 0.1 0.2 0.3 0.4 0.5 d Figure 4.4: Function Φ(d) in (4.22) with γ = 100, R = 0.4, and d0 = 0.1 is shown in a red dotted line. The function Φ(d) = γ is shown in a blue solid line. 59 Proposition 4.2.1. σ 2 ¯ zj |¯[i] ,˜[i] y y (˜[i] ) is an upper-bound of σ 2 q zj |¯[i] ,˜[i] y y (˜[i] ), ∀i ∈ I. q ˜ q Proof. Let A := C [i] (˜[i] ) and B := diag(0, C [i] (˜[i] )). The result follows immediately from q the fact that (A + B)−1 A−1 for any A 0 and B 0. Hence, we construct a new cost function as Jd (˜) = q 1 |I| i∈I 1 |J | σ2 ¯ j∈J q [i] [i] ,˜[i] (˜ ). zj |¯ y y (4.23) By Proposition 4.2.1, Jd in (4.23) is an upper-bound of J in (4.20). Next, we show that Jd is continuously differentiable when agents gain or lose neighbors. In doing so, we compute the partial derivative of Jd with respect to qi, , where qi, is the ˜ ˜ -th element in qi ∈ Q, as follows. ˜ ∂Jd (˜) q 1 = ∂ qi, ˜ |I| = 1 |I| k∈I 1 |J | ∂ σ2 ¯ zj |¯[k] ,˜[k] y y ∂ qi, ˜ j∈J k∈{i}∪Ni 1 |J | (˜[k] ) q (˜[k] ) q zj |¯[k] ,˜[k] y y (4.24) ∂ σ2 ¯ j∈J ∂ qi, ˜ , ∀i ∈ I, ∈ {1, 2} . We then have the following. Proposition 4.2.2. The cost function Jd in (4.23) is of class C 1 , i.e., it is continuously differentiable. Proof. We need to show that the partial derivatives of Jd with respect to qi, , ∀i ∈ I, ∈ ˜ {1, 2} exist and are continuous. Without loss of generality, we show that ∂Jd /∂ qi, , ∀ ∈ ˜ 60 {1, 2} is continuous at any point q ∗ in the following boundary set defined by ˜ Sik := {˜ | dik = qi − qk = R} . q ˜ ˜ First, we consider a case in which q ∈ Sik and dik < R, i.e., k ∈ Ni and i ∈ Nk . 
By ˜ / the construction of σ 2 ¯ in (4.21) using (4.22), when we take the limit of the partial zj |¯[i] ,˜[i] y y derivative, as dik approaches R from below (as q approaches q ∗ ), we have that ˜ ˜ ∂ σ2 ¯ zj |¯[i] ,˜[i] y y lim ∂ σ2 ¯ zj |¯[k] ,˜[k] y y dik →R− ∂ σ2 ¯ (˜[k] ) q ∂ qi, ˜ zj |¯[i] ,˜[i] y y = ∂ qi, ˜ dik →R− lim (˜[i] ) q (˜[i] \˜k ) q q ∂ qi, ˜ ∂ σ2 ¯ = zj |¯[k] ,˜[k] y y , (˜[k] \˜i ) q q ∂ qi, ˜ = 0, where q [a] \˜b denotes the collection of locations of agent a and its neighbors excluding qb . ˜ q ˜ Hence we have ∂Jd (˜) q ∂J (˜∗ ) q = d . ˜ ∂ qi, ˜ dik →R− ∂ qi, lim (4.25) Consider the other case in which q ∈ Sik and dik > R, i.e., k ∈ Ni and i ∈ Nk . When dik ˜/ / / approaches R from above (as q approaches q ∗ ), we have ˜ ˜ ∂ σ2 ¯ lim dik →R+ zj |¯[i] ,˜[i] y y (˜[i] ) q ∂ σ2 ¯ = ∂ qi, ˜ 61 zj |¯[i] ,˜[i] y y ∂ qi, ˜ (˜[i] ) q , and hence ∂Jd (˜) q ∂J (˜∗ ) q = d . ˜ ∂ qi, ˜ dik →R+ ∂ qi, lim (4.26) Therefore, from (4.25) and (4.26), we have ∂Jd (˜) q ∂Jd (˜) q ∂J (˜∗ ) q = lim = d . ˜ ˜ ∂ qi, ˜ dik →R− ∂ qi, dik →R+ ∂ qi, lim This completes the proof due to Theorem 4.6 in [53]. By using Jd in (4.23), a gradient descent algorithm can be used to minimize the network performance cost function Jd in (4.23) for the prediction at t + 1. dq(τ ) = − q Jd (q(τ )). dτ (4.27) Note that the partial derivative in (4.24), which builds the gradient flow in (4.27), is a function of positions in ∪j∈N (t) Nj (t) only. This makes the algorithm distributed. A distributed i sampling strategy for agent i with the network cost function Jd in (4.23) is summarized in Table 4.3. In this way, each agent with the distributed sampling strategy uses spatially and temporally truncated observations. 4.3 Simulation In this section, we apply our approach to a spatio-temporal Gaussian process with a covariance function in (4.12). The Gaussian process was numerically generated through circulant 62 embedding of the covariance matrix for the simulation [20]. The hyperparameters used in 2 the simulation were chosen to be θ = (σf , σx , σy , σt )T = (1, 0.2, 0.2, 5)T . The surveillance region Q is given by Q = (0, 1)2 . The signal-to-noise ratio γ = 100 is used throughout the simulation which is equivalent to a noise level of σw = 0.1. In our simulation, N = 9 agents sample at time t ∈ Z>0 . The initial positions of the agents are randomly selected. The truncation size η = 10 is chosen using the approach introduced in Section 4.1.2 that guarantees the averaged performance level (η = 10) < 0.1 under a uniform sampling distribution (see Example 4.1.8). In the figures of simulation results, the target positions, the initial positions of agents, the past sampling positions of agents, and the current positions of agents are represented by white stars, yellow crosses, pink dots, and white circles with agent indices, respectively. 4.3.1 Gradient-based algorithm vs. exhaustive search algorithm To evaluate the performance of the gradient-based algorithm presented in Section 4.2, we compare it with the exhaustive search algorithm over sufficiently many grid points, which guarantees the near-optimum. Due to the exponential complexity of the grid-based exhaustive search algorithm as the number of agents increases, its usage for multiple robots is prohibitive. Hence, we consider a simple case in which only one mobile agent samples and makes prediction on 21 × 21 target points over Q. The grid points used in the exhaustive search are the same as the target points, i.e., 21 × 21 grid points. 
The initial positions of the agents for both cases were set to (0.2, 0.3)T . The prediction error variances at t = 5 for the proposed algorithm and the exhaustive search algorithm are shown in Figs. 4.5-(a) and (b), respectively. At time t = 5, the averaged prediction error variance over target points is 63 0.636 which is close to 0.613 achieved by the exhaustive search. Therefore, this simulation study shows that the performance of the gradient-based algorithm is comparable to that of the exhaustive search algorithm for the given problem. (a) Gradient-based algorithm (b) Exhaustive search algorithm Figure 4.5: Prediction error variances at t = 5 achieved by (a) using the gradient-based algorithm, and (b) using the exhaustive search algorithm. The trajectories of the agent are shown in solid lines. 4.3.2 Centralized sampling scheme Consider a situation where a central station has access to all measurements collected by agents. At each time, measurements sampled by agents are transmitted to the central station that uses the centralized navigation strategy and sends control commands back to individual agents. Case 1: First, we consider a set of fixed target points, e.g., 6 × 6 grid points on Q at a fixed time t = 10. At each time step, the cost function Jc in (4.16), which is the average of prediction error variances at target points, is minimized due to the proposed centralized navigation strategy in Section 4.2.1. As a benchmark strategy, we consider 64 a random sampling scheme in which a group of 9 agents takes observations at randomly selected positions within the surveillance region Q. In Fig. 4.6-(a), the blue circles represent the average of prediction error variances over target points achieved by the centralized scheme, and the red squares indicate the average of prediction error variances over target points achieved by the benchmark strategy. Clearly, the proposed scheme produces lower averaged prediction error variances at target points as time increases, which demonstrates the usefulness of our scheme. Case 2: Next, we consider the same 6 × 6 grid points on Q as in case 1. However, at time t, we are now interested in the prediction at the next sampling time t + 1. At each time step, the cost function Jc is minimized. Fig. 4.6-(b) shows the average of prediction error variances over target points achieved by the centralized scheme with truncation (in red squares) and without truncation (in blue circles). With truncated observations, i.e., with only observations obtained from latest η = 10 time steps, we are able to maintain the same level of the averaged prediction error variances (around 0.05 in Fig. 4.6-(b)). Fig. 4.7-(a), (c), and (e) show the true field, the predicted field, and the prediction error variance at time t = 1, respectively. To see the improvement, the counterpart of the simulation results at time t = 5 are shown in Fig. 4.7-(b), (d), and (f). At time t = 1, agents have little information about the field and hence the prediction is far away from the true field, which produces a large prediction error variance. As time increases, the prediction becomes close to the true field and the prediction error variances are reduced due to the proposed navigation strategy. Case 3: Now, we consider another case in which 36 target points (plotted in Fig. 4.8 as white stars) are evenly distributed on three concentric circles to form a ring shaped subregion 65 of interest. As in case 2, we are interested in the prediction at the next time iteration t + 1. 
The average of prediction error variances over these target points at each time step achieved by the centralized scheme with truncation (in red squares) and without truncation (in blue circles) are shown in Fig. 4.6-(c). The prediction error variances at time t = 1 and t = 5 are shown in Fig. 4.8-(a) and (b), respectively. It is shown that agents are dynamically covering the ring shaped region to minimize the average of prediction error variances over the target points. 4.3.3 Distributed sampling scheme Consider a situation in which the sensor network has a limited communication range R, i.e., Ni (t) := j ∈ I | qi (t) − qj (t) < R, j = i . At each time t ∈ Z>0 , agent i collects measurements from itself and its neighbors Ni (t) and makes prediction in a distributed fashion. The distributed strategy is used to navigate itself to move to the next sampling position. To be comparable with the centralized scheme, the same target points as in case 2 of Section 4.3.2 are considered. Fig. 4.9 shows that the cost function, which is an upper-bound of the averaged prediction error variance over target points and agents, deceases smoothly from time t = 1 to t = 2 by the gradient descent algorithm with a communication range R = 0.4. Significant decreases occur whenever one of the agent gains a neighbor. Notice that the discontinuity of minimizing J in (4.20) caused by gaining or losing neighbors is eliminated due to the construction of Jd in (4.23). Hence, the proposed distributed algorithm is robust to gaining or losing neighbors. The following study shows the effect of different communication range. Intuitively, the larger the communication range is, the more information can be obtained by the agent and 66 hence the better prediction can be made. Figs. 4.10-(a) and (b) show the average of prediction error variances over all target points and agents in blue circles with error-bars indicating the standard deviation among agents for the case R = 0.3 and R = 0.4, respectively. In both cases, d0 = 0.1 in (4.22) was used. The average of prediction error variances is minimized quickly to a certain level. It can be seen that the level of achieved averaged prediction error variance with R = 0.4 is lower than the counterpart with R = 0.3. Now, assume that each agent only predict the field at target points within radius R (local target points). The average of prediction error variances, over only local target points and agents, are also plotted in Fig. 4.10 in red squares with the standard deviation among agents. As can be seen, the prediction error variances at local target points (the red squares) are significantly lower than those for all target points (the blue circles). Fig. 4.11 shows the prediction error variances obtained by agent 1 along with the edges of the communication network for different communication range R and different time step k. In Fig. 4.11, the target positions, the initial positions, and the current positions are represented by white stars, yellow crosses, and white circles, respectively. Surprisingly, the agents under the distributed navigation algorithm produce an emergent, swarm-like behavior to maintain communication connectivity among local neighbors. Notice that this collective behavior emerged naturally and was not generated by the flocking or swarming algorithm as in [10]. This interesting simulation study (Fig. 
4.11) shows that agents won’t get too close each other since the average of prediction error variances at target points can be reduced by spreading over and covering the target points that need to be sampled. However, agents won’t move too far away each other since the average of prediction error variances can be 67 reduced by collecting measurements from a larger population of neighbors. This trade-off is controlled by the communication range. With the intertwined dynamics of agents over the proximity graph, as shown in Fig. 4.11, mobile sensing agents are coordinated in each time iteration in order to dynamically cover the target positions for better collective prediction capability. 68 Table 4.3: Distributed sampling strategy at time t. Input: (1) Number of agents N (2) Positions of agents {qi (t) | i ∈ I} 2 (3) Hyperparameters of the Gaussian process θ = (σf , σx , σy , σt )T (4) Target points pj | j ∈ J (5) Truncation size η Output: (1) Prediction at target points µ |i ∈ [i] zj |yt−η+1:t  (2) Prediction error variances at target points I, j ∈ J    | i ∈ I, j ∈ J σ2   z |y[i] j t−η+1:t For i ∈ I, agent i performs: 1: make an observation at qi (t), i.e., yi (t) 2: transmit the observation to the neighbors in Ni (t) 3: collect the observations from neighbors in Ni (t), i.e., y [i] (t) 4: obtain the cumulative [i] [i] measurements, i.e., T (yt−η+1 )T , · · · , (yt )T 5: for j ∈ J do 6: make prediction at a target point pj µ [i] zj |yt−η+1:t [i] yt−η+1:t = = k T C −1 y, with a prediction error variance given by 2 σ 2 [i] = σf (1 − k T C −1 k), [i] 7: 8: 9: 10: 11: 12: 13: zj |yt−η+1:t where y = yt−η+1:t , k = Corr(y, zj ), and C = Corr(y, y) end for if t ≥ η then [i] discard the oldest set of measurements taken at time t − η + 1, i.e., yt−η+1 end if while t ≤ τ ≤ t + 1 do [i] compute q J [i] with the remained data yt−η+2 send q J [i] to agent J[ ] in Ni (τ ) receive qi from all neighbors in Ni (τ ) [ ] compute the gradient qi Jd = ∈Ni (τ ) qi J /|I| update position according to qi (τ + δt) = qi (τ ) − α qi Jd for a small step size α 17: end while 14: 15: 16: 69 0.7 Prediction error variance Prediction error variance 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 2 4 6 8 0.6 0.5 0.4 0.3 0.2 0.1 10 t (a) Case 1 0 2 4 6 8 10 12 14 16 18 20 t (b) Case 2 Prediction error variance 0.45 0.4 0.35 0.3 0.25 0.2 0.15 0.1 0.05 0 2 4 6 8 10 12 14 16 18 20 t (c) Case 3 Figure 4.6: Average of prediction error variances over target points (in blue circles) achieved by the centralized sampling scheme using all collective observations for (a) case 1, (b) case 2, and (c) case 3. In (a), the target points are fixed at time t = 10, and the counterpart achieved by the benchmark random sampling strategy is shown in red squares with errorbars. In (b) and (c), the target points are at t + 1 and change over time. The counterpart achieved by using truncated observations are shown in red squares. 70 (a) True field at t = 1 (b) True field at t = 5 (c) Predicted field at t = 1 (d) Predicted field at t = 5 Figure 4.7: Simulation results at t = 1 and t = 5 obtained by the centralized sampling scheme for case 2. 71 (e) Prediction error variance at t = 1 (f) Prediction error variance at t = 5 Figure 4.7: Simulation results at t = 1 and t = 5 obtained by the centralized sampling scheme for case 2 (cont’d). (a) Prediction error variance at t = 1 (b) Prediction error variance at t = 5 Figure 4.8: Simulation results obtained by the centralized sampling scheme for case 3. The trajectories of agents are shown in solid lines. 
72 Cost function 0.85 0.8 0.75 0.7 0.65 200 400 600 800 Number of iterations 1000 1200 Figure 4.9: Cost function Jd (˜) from t = 1 to t = 2 with a communication range R = 0.4. q 1 Prediction error variance Prediction error variance 1 0.8 0.6 0.4 0.2 0 2 4 6 8 10 12 14 16 18 20 t (a) R = 0.3 0.8 0.6 0.4 0.2 0 2 4 6 8 10 12 14 16 18 20 t (b) R = 0.4 Figure 4.10: Average of prediction error variances over all target points and agents achieved by the distributed sampling scheme with a communication range (a) R = 0.3, and (b) R = 0.4. The average of prediction error variances over all target points and agents are shown in blue circles. The average of prediction error variance over local target points and agents are shown in red squares. The error-bars indicate the standard deviation among agents. 73 (a) R = 0.3, t = 1 (b) R = 0.3, t = 2 (c) R = 0.3, t = 5 (d) R = 0.3, t = 20 Figure 4.11: Simulation results obtained by the distributed sampling scheme with different communication ranges. The edges of the graph are shown in solid lines. 74 (e) R = 0.4, t = 1 (f) R = 0.4, t = 2 (g) R = 0.4, t = 5 (h) R = 0.4, t = 20 Figure 4.11: Simulation results obtained by the distributed sampling scheme with different communication ranges (cont’d). The edges of the graph are shown in solid lines. 75 Chapter 5 Fully Bayesian Approach Recently, there has been an increasing exploitation of mobile sensor networks in environmental monitoring [10, 13, 35, 37]. Gaussian process regression (or kriging in geostatistics) has been widely used to draw statistical inference from geostatistical and environmental data [15,51]. For example, near-optimal static sensor placements with a mutual information criterion in Gaussian processes were proposed in [32]. A distributed kriged Kalman filter for spatial estimation based on mobile sensor networks was developed in [13]. Multi-agent systems that are versatile for various tasks by exploiting predictive posterior statistics of Gaussian processes were developed in [8, 9]. The significant computational complexity in Gaussian process regression due to the growing number of observations (and hence the size of covariance matrix) has been tackled in different ways. In [67], the authors analyzed the conditions under which near-optimal prediction can be achieved using only truncated observations. This motivates the usage of sparse Gaussian process proposed in [45]. However, they both assumed the covariance function is known a priori, which is unrealistic in practice. On the other hand, unknown parameters in 76 the covariance function can be estimated by the maximum likelihood (ML) estimator. Such ML estimates may be regarded as the true parameters and then used in the prediction [65]. However, the point estimate itself needs to be identified using sufficient amount of measurements. Instead, a maximum a posterior (MAP) estimate can use the prior to provide the point estimate with a small number of measurements. However, it fails to incorporate the uncertainty in the estimate into the prediction. The advantage of a fully Bayesian approach, which will be adopted in this work, is that the uncertainty in the model parameters are incorporated in the prediction [5]. In [22], Gaudard et al. presented a Bayesian method that uses importance sampling for analyzing spatial data sampled from a Gaussian random field whose covariance function was unknown. 
However, the assumptions made in [22], such as noiseless observations and time-invariance of the field, limit the applicability of the approach on mobile sensors in practice. The computational complexity of a fully Bayesian prediction algorithm has been the main hurdle for applications in resource-constrained robots. In [25], an iterative prediction algorithm without resorting to Markov Chain Monte Carlo (MCMC) methods has been developed based on analytical closed-form solutions from results in [22], by assuming that the covariance function of the spatio-temporal Gaussian random field is known up to a constant. Our work builds on such Bayesian approaches used in [22,25] and explores new ways to synthesize practical algorithms for mobile sensor networks under more relaxed conditions. In Section Section 5.1, we provide fully Bayesian approaches for spatio-temporal Gaussian process regression under more practical conditions such as measurement noise and the unknown covariance function . In Section 5.2, using discrete prior probabilities and compactly supported kernels, we provide a way to design sequential Bayesian prediction algorithms in 77 which the exact predictive distributions can be computed in constant time as the number of observations increases. In particular, a centralized sequential Bayesian prediction algorithm is developed in Section 5.2.1, and its distributed implementation among sensor groups is provided for a special case in Section 5.2.2. An adaptive sampling strategy for mobile sensors, utilizing the maximum a posteriori (MAP) estimation of the parameters, is proposed to minimize the prediction error variances in Section 5.2.3. In Section 5.3, the proposed sequential Bayesian prediction algorithms and the adaptive sampling strategy are tested under practical conditions for spatio-temporal Gaussian processes. 5.1 Fully Bayesian Prediction Approach In this chapter, we consider a spatio-temporal Gaussian process denoted by 2 z(x) ∼ GP µ(x), σf C(x, x ; θ) , where z(x) ∈ R and x := (sT , t)T ∈ Q × Z>0 contains the sampling location s ∈ Q ⊂ RD and the sampling time t ∈ Z>0 . The mean function is assumed to be µ(x) = f (x)T β, where f (x) := (f1 (x), · · · , fp (x))T ∈ Rp is a known regression function, and β ∈ Rp is an unknown vector of regression coefficients. The correlation between z(x) and z(x ) is taken as C(x, x ; θ) = Cs s−s σs 78 Ct t−t σt , (5.1) which is governed by spatial and temporal distance functions Cs (·) and Ct (·). We assume that Cs (·) and Ct (·) are decreasing kernel functions over space and time, respectively, so that the correlation between two inputs decreases as the distance between spatial locations (respectively, time indices) increases. The decreasing rate depends on the spatial bandwidth σs (respectively, the time bandwidth σt ) for given fixed time indices (respectively, spatial 2 locations). The signal variance σf gives the overall vertical scale relative to the mean of the Gaussian process in the output space. We define θ := (σs , σt )T ∈ R2 for notational simplicity. Given the collection of noise corrupted observations from mobile sensing agents up to time t, we want to predict z(s∗ , t∗ ) at a prespecified location s∗ ∈ S ⊂ Q and current (or future) time t∗ . To do this, suppose we have a collection of n observations D = (x(i) , y (i) ) | i = 1, . . . , n from N mobile sensing agents up to time t. 
Here x(i) denotes the i-th input vector of dimension D + 1 (i.e., the sampling position and time of the i-th observation) and y (i) denotes the i-th noise corrupted measurement. If all observations are considered, we have n = N t. Notice that the number of observations n grows with the time t. For notational simplicity, let y := (y (1) , · · · , y (n) )T ∈ Rn denote the collection of noise corrupted observations. Based on the spatio-temporal Gaussian process, the distribution of 2 the observations given the parameters β, σf , and θ is Gaussian, i.e., 2 2 y|β, σf , θ ∼ N (F β, σf C) 79 with F and C defined as F := f (x(1) ), · · · , f (x(n) ) C := Corr(y, y|θ) = T ∈ Rn×p , C(x(i) , x(j) ; θ) + 1 δ ∈ Rn×n , γ ij (5.2) where δij is the Kronecker delta which equals to one when i = j, and zero, otherwise. 5.1.1 Prior selection 2 To infer the unknown parameters β, σf , and θ in a Bayesian framework, the collection of them is considered to be a random vector with a prior distribution reflecting the a priori belief of uncertainty for them. In this chapter, we use the prior distribution given by 2 2 2 π(β, σf , θ) = π(β|σf )π(σf )π(θ), (5.3) where 2 2 β|σf ∼ N (β0 , σf T ). 2 The prior for π(σf ) is taken to be the inverse gamma distribution, chosen to guarantee 2 2 positiveness of σf and a closed-form expression for the posterior distribution of σf for com- putational ease of the proposed algorithms. To cope with the case where no prior knowledge on β is available, which is often the case in practice, we propose to use a noninformative prior. In particular, we take β0 = 0, T = αI, and subsequently, let α → ∞. Any proper prior π(θ) that correctly reflects the priori knowledge of θ can be used. 80 5.1.2 MCMC-based approach 2 According to the Bayes rule, the posterior distribution of β, σf , and θ is given by 2 π(β, σf , θ|y) = 2 2 π(y|β, σf , θ)π(β, σf , θ) 2 2 2 π(y|β, σf , θ)π(β, σf , θ)dβdσf dθ . (5.4) When a proper prior is used, the posterior distribution can be written as 2 2 2 π(β, σf , θ|y) ∝ π(y|β, σf , θ)π(β, σf , θ). 2 The inference on β, σf , and θ can be carried out by sampling from the posterior distribution in (5.4) via the Gibbs sampler. Table 5.1 gives the steps based on the following proposition. Table 5.1: Gibbs sampler. 2 (1) Input: initial samples β (1) , σf , and θ(1) m 2 (i) 2 Output: samples β (i) , σf , θ(i) from joint distribution π(β, σf , θ|y) i=1 (1) 2 1: initialize β (1) , σf , θ(1) 2: for i = 1 to m do 2 (i) 3: sample β (i+1) from π(β|σf , θ(i) , y) 4: 2 (i+1) from π(σ 2 |β (i+1) , θ (i) , y) sample σf f 2 (i+1) , y) 5: sample θ(i+1) from π(θ|β (i+1) , σf 6: end for Proposition 5.1.1. For a prior distribution given in (5.3) with the noninformative prior on β, the conditional posteriors are given by 81 2 ˆ 2 ˆ 1. β|σf , θ, y ∼ N β, σf Σβ , where Σβ = (F T C −1 F )−1 , ˆ ˆ β = Σβ (F T C −1 y), ˆ 2 ¯ b 2. σf |β, θ, y ∼ IG a, ¯ , where n+p , 2 ¯ = b + 1 (y − F β)T C −1 (y − F β), b 2 a=a+ ¯ 3. and 2 π(θ|β, σf , y) ∝ det(C)−1/2 exp − (y − F β)T C −1 (y − F β) 2 2σf π(θ). Proof. Since the noninformative prior is chosen, the posterior distribution shall be computed with T = αI and then let α → ∞. 2 i) For given σf , θ, and y, we have 2 π(β|σf , θ, y) = lim α→∞ 2 2 π(y|β, σf , θ)π(β|σf ) 2 2 π(y|β, σf , θ)π(β|σf )dβ 82 . 
Let 2 2 num1 = π(y|β, σf , θ)π(β|σf ) = = exp − 12 (y − F β)T C −1 (y − F β) exp − 12 β T T −1 β 2σ f 2σ f 2 )p/2 det(T )1/2 (2πσf 2 (2πσf )n/2 det(C)1/2 exp − 12 RSS 2σ f 2 (2πσf )(n+p)/2 det(C)1/2 det(T )1/2 exp − 1 ˆ ˆ (β − β)T (F T C −1 F + T −1 )(β − β) , 2 2σf and den1 = = exp − 12 RSS 2σ f exp − 2 (2πσf )(n+p)/2 det(C)1/2 det(T )1/2 exp − 12 RSS 2σ f 2 (2πσf )(n+p)/2 det(C)1/2 det(T )1/2 1 ˆ ˆ (β − β)T (F T C −1 F + T −1 )(β − β) dβ 2 2σf 2 (2πσf )p/2 det(F T C −1 F + T −1 )−1/2 where RSS = y T C −1 − C −1 F (F T C −1 F + T −1 )−1 F T C −1 y = y T (C + F T F T )−1 y. 83 Then we have num1 α→∞ den1 2 π(β|σf , θ, y) = lim ˆ ˆ exp − 12 (β − β)T (F T C −1 F + T −1 )(β − β) 2σ = lim f 2 (2πσf )p/2 det(F T C −1 F α→∞ ˆ ˆ exp − 12 (β − β)T Σ−1 (β − β) ˆ 2σ = + T −1 )−1/2 β f 2 (2πσf )p/2 det(Σβ )1/2 ˆ . 2 ˆ 2 ˆ Therefore, we have β|σf , θ, y ∼ N (β, σf Σβ ). ii) For given β, θ, and y, we have 2 π(σf |β, θ, y) = lim α→∞ = lim α→∞ 2 2 π(y|β, σf , θ)π(σf |β) 2 2 2 π(y|β, σf , θ)π(σf |β)dσf 2 2 2 π(y|β, σf , θ)π(β|σf )π(σf ) 2 2 2 2 π(y|β, σf , θ)π(β|σf )π(σf )dσf . Let 2 2 2 num2 = π(y|β, σf , θ)π(β|σf )π(σf ) exp − 12 (y − F β)T C −1 (y − F β) 2σ = = f exp − 12 β T T −1 β 2σ ba exp − b2 σ f f 2 )p/2 det(T )1/2 Γ(a)(σ 2 )a+1 (2πσf f ¯ + 1 β T T −1 β b 2 1 exp − , 2 )a+1 2 ¯ σf Γ(a)(2π)a+1 det(C)1/2 det(T )1/2 (σf ¯ 2 (2πσf )n/2 det(C)1/2 ba 84 and den2 = = ¯ + 1 β T T −1 β b 2 1 exp − 2 ¯ 2 (σf )a+1 σf ba ¯ Γ(a)(2π)a+1 det(C)1/2 det(T )1/2 ba ¯ Γ(a)(2π)a+1 det(C)1/2 det(T )1/2 2 dσf Γ(¯)¯−¯ . ab a Then we have num2 α→∞ den2 2 π(σf |β, θ, y) = lim ¯+ ¯a b b¯ exp − 2 )a+1 α→∞ Γ(¯)(σ ¯ a f = lim = ¯a ¯ b¯ b exp − 2 Γ(¯)(σf )a+1 a 2 ¯ 2σf 1 β T T −1 β 2 2 2σf . 2 Therefore, we have σf |β, θ, y ∼ IG a, ¯ . ¯ b 2 iii) For given β, σf , and y, we have 2 π(θ|β, σf , y) = lim α→∞ 2 π(y|β, σf , θ)π(θ) 2 π(y|β, σf , θ)π(θ)dθ ∝ det(C)−1/2 exp − (y − F β)T C −1 (y − F β) 2 2σf π(θ). The posterior predictive distribution of z∗ := z(s∗ , t∗ ) at location s∗ and time t∗ can be obtained by π(z∗ |y) = 2 2 2 π(z∗ |y, β, σf , θ)π(β, σf , θ|y)dβdσf dθ, (5.5) 2 where in (5.5), the conditional distribution π(z∗ |β, σf , θ, y), is integrated with respect to the 2 posterior of β, σf , and θ given observations y. The conditional distribution of z∗ is Gaussian, 85 i.e., 2 ), z∗ |β, σf , θ, y ∼ N (µz |β,σ2 ,θ,y , σ 2 z∗ |β,σ 2 ,θ,y ∗ f f with 2 µz |β,σ2 ,θ,y = E(z∗ |β, σf , θ, y) = f (x∗ )T β + k T C −1 (y − F β), ∗ f σ2 z∗ |β,σ 2 ,θ,y f 2 2 = Var(z∗ |β, σf , θ, y) = σf (1 − k T C −1 k), where k := Corr(y, z∗ |θ) = [K(x(i) , x∗ ; θ)] ∈ Rn . To obtain numerical values of π(z∗ |y), we 2 draw m samples β (i) , σf (i) , θ(i) m i=1 2 from the posterior distribution π(β, σf , θ|y) using the Gibbs sampler presented in Table 5.1, and then obtain the predictive distribution in (5.5) by 1 π(z∗ |y) ≈ m m i=1 2 π(z∗ |y, β (i) , σf (i) , θ(i) ). It follows that the predictive mean and variance can be obtained numerically by µz∗ |y = E(z∗ |y) ≈ 2 σz∗ |y 1 m = Var(z∗ |y) ≈ m µ , (i) z |β (i) ,σ 2 ,θ(i) ,y f i=1 ∗ m 1 σ2 (i) 2 (i) (i) m i=1 z∗ |β ,σf ,θ ,y 1 + m 2 m µ i=1 (i) z∗ |β (i) ,σ 2 ,θ(i) ,y f − µz∗ |y . Remark 5.1.2. The Gibbs sampler presented in Table 5.1 may take long time to converge, which implies that the number of samples required could be quite large depending on the initial values. This convergence rate can be monitored from a trace plot (a plot of sampled values v.s. iterations for each variable in the chain). 
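To make the sampling steps in Table 5.1 and the conditionals of Proposition 5.1.1 concrete, a minimal sketch is given below (Python/NumPy). The helper build_C(theta), which assembles C = Corr(y, y | θ) + (1/γ)I as in (5.2), the finite grid theta_grid, and the categorical draw of θ on that grid (a discretized stand-in for the inverse-CDF step discussed next) are assumptions of the sketch, not the implementation used in this dissertation.

import numpy as np

def gibbs_sampler(y, F, theta_grid, build_C, a, b, m, seed=0):
    # One sweep per iteration of Table 5.1 using Proposition 5.1.1
    # (noninformative prior on beta).
    rng = np.random.default_rng(seed)
    n, p = F.shape
    theta = theta_grid[0]
    sigma2 = 1.0 / rng.gamma(a, 1.0 / b)          # initialize sigma_f^2 ~ IG(a, b)
    samples = []
    for _ in range(m):
        C = build_C(theta)
        CiF = np.linalg.solve(C, F)
        Ciy = np.linalg.solve(C, y)
        Sigma_beta = np.linalg.inv(F.T @ CiF)     # (F^T C^-1 F)^-1
        beta_hat = Sigma_beta @ (F.T @ Ciy)
        # 1) beta | sigma_f^2, theta, y ~ N(beta_hat, sigma_f^2 * Sigma_beta)
        beta = rng.multivariate_normal(beta_hat, sigma2 * Sigma_beta)
        # 2) sigma_f^2 | beta, theta, y ~ IG(a + (n+p)/2, b + r^T C^-1 r / 2)
        r = y - F @ beta
        b_bar = b + 0.5 * r @ np.linalg.solve(C, r)
        sigma2 = 1.0 / rng.gamma(a + 0.5 * (n + p), 1.0 / b_bar)
        # 3) theta | beta, sigma_f^2, y: evaluate the unnormalized conditional on
        #    the grid and draw one grid point (a discrete surrogate for inverse-CDF)
        logp = np.empty(len(theta_grid))
        for i, th in enumerate(theta_grid):
            Cth = build_C(th)
            logp[i] = (-0.5 * np.linalg.slogdet(Cth)[1]
                       - r @ np.linalg.solve(Cth, r) / (2.0 * sigma2))
        w = np.exp(logp - logp.max())
        theta = theta_grid[rng.choice(len(theta_grid), p=w / w.sum())]
        samples.append((beta, sigma2, theta))
    return samples

A trace plot of the stored samples, as mentioned above, is the simplest convergence diagnostic; the cost of step 3, which rebuilds and factors C for every grid point at every iteration, already suggests why the importance-sampling alternative of the next subsection is attractive.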
Moreover, since C is a complicated function of σs 2 and σt , sampling from π(θ|β, σf , y) in Proposition 5.1.1 is difficult. An inverse cumulative distribution function (CDF) method [19] needs to be used to generate samples, which requires 86 griding on a continuous parameter space. Therefore, high computational power is needed to implement the MCMC-based approach. In the next subsection, we present an alternative Bayesian approach which only requires drawing samples from the prior distribution π(θ) using a similar approach to one used in [22]. 5.1.3 Importance sampling approach The posterior predictive distribution of z∗ := z(s∗ , t∗ ) can be written as π(z∗ |y) = π(z∗ |θ, y)π(θ|y)dθ, (5.6) where π(θ|y) = π(y|θ)π(θ) , π(y|θ)π(θ)dθ 2 is the posterior distribution of θ, by integrating out analytically the parameters β and σf . We have the following proposition. Proposition 5.1.3. For a prior distribution given in (5.3) with the noninformative prior on β, we have 1. π(θ|y) ∝ w(θ|y)π(θ) with 1 1 log w(θ|y) = − log det(C) − log det(F T C −1 F ) − a log ˜ ˜ b, 2 2 87 (5.7) where n , 2 ˜ = b + 1 y T C −1 y − 1 (F T C −1 y)T (F T C −1 F )−1 (F T C −1 y). b 2 2 a=a+ ˜ 2. π(z∗ |θ, y) is a shifted student’s t-distribution with location parameter µ, scale parameter λ, and ν degrees of freedom, i.e., π(z∗ |θ, y) = Γ ν+1 2 Γ ν 2 1 λ 2 πν λ(z∗ − µ)2 1+ ν − ν+1 2 , (5.8) where ν = 2˜, and a µ = k T C −1 y + (f (x∗ ) − F T C −1 k)T (F T C −1 F )−1 (F T C −1 y), λ= ˜ b (1 − k T C −1 k) + (f (x∗ ) − F T C −1 k)T (F T C −1 F )−1 (f (x∗ ) − F T C −1 k) . a ˜ Proof. i) For given θ, we have π(y|θ) = = = = 2 2 2 π(y|β, σf , θ)π(β, σf )dβdσf 2 2 2 2 π(y|β, σf , θ)π(β|σf )π(σf )dβdσf b+ RSS 2 σ2 f 2 )n/2+a+1 (σf n+2a RSS − 2 exp − ba Γ(a)(2π)n/2 det(C)1/2 det(T )1/2 det(F T C −1 F + T −1 )1/2 Γ( n+2a )ba 2 Γ(a)(2π)n/2 det(C)1/2 det(T )1/2 det(F T C −1 F + T −1 )1/2 88 b+ 2 2 dσf where RSS = y T C −1 − C −1 F (F T C −1 F + T −1 )−1 F T C −1 y. As α → ∞, we have π(θ|y) = lim α→∞ ∝ π(y|θ)π(θ) π(y|θ)π(θ)dθ det(C)−1/2 det(F T C −1 F )−1/2 − n+2a 2 1 T b + y Σy , 2 where Σ = C −1 − C −1 F (F T C −1 F )−1 F T C −1 . ii) For given θ and y, we have π(z∗ |θ, y) = = 2 2 2 π(z∗ |y, β, σf , θ)π(β, σf |θ, y)dβdσf 2 2 2 2 π(z∗ |y, β, σf , θ)π(β|σf , θ, y)π(σf |θ, y)dβdσf , where 2 2 z∗ |y, β, σf , θ ∼ N f (x∗ )T β + k T C −1 (y − F β), σf (1 − k T C −1 k) , 2 ˆ 2 ˆ β|σf , θ, y ∼ N (β, σf Σβ ), 2 σf |θ, y ∼ IG a + RSS n ,b + 2 2 . Then, it can be shown that π(z∗ |θ, y) = Γ ν+1 2 Γ ν 2 1 λ 2 πν 89 λ(z∗ − µ)2 1+ ν − ν+1 2 , when α → ∞. The results in Proposition 5.1.3 are different from those obtained in [22] by using a 2 noninformative prior on β. For a special case where β and σf are known a priori, we have the following corollary which will be exploited to derive a distributed implementation among sensor groups in Section 5.2.2. 2 Corollary 5.1.4. In the case where β and σf are known a priori, (5.7) and (5.8) can be simplified as 1 1 log w(θ|y) = − log det(C) − (y − F β)T C −1 (y − F β), 2 2 2 z∗ |θ, y ∼ N f (x∗ )T β + k T C −1 (y − F β), σf (1 − k T C −1 k) . If we draw m samples θ(i) m i=1 from the prior distribution π(θ), the posterior predictive distribution in (5.6) can then be approximated by π(z∗ |y) ≈ w(θ(i) |y)π(z∗ |θ(i) , y) w(θ(i) |y) . 
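A minimal sketch of this importance-sampling predictor is given below (Python/NumPy), using only prior draws θ(i) and the closed-form quantities of Proposition 5.1.3. The helpers build_C(theta) and build_k(theta), returning C = Corr(y, y | θ) and k = Corr(y, z∗ | θ), are assumed names, and the weighted moments assembled at the end correspond to the predictive mean and variance formulas that follow.

import numpy as np

def log_weight(theta, y, F, a, b, build_C):
    # log w(theta | y) of (5.7), up to an additive constant, plus (a_tilde, b_tilde)
    n = len(y)
    C = build_C(theta)
    CiF = np.linalg.solve(C, F)
    Ciy = np.linalg.solve(C, y)
    G = F.T @ CiF                                    # F^T C^-1 F
    FCiy = F.T @ Ciy
    a_t = a + 0.5 * n
    b_t = b + 0.5 * (y @ Ciy) - 0.5 * FCiy @ np.linalg.solve(G, FCiy)
    lw = (-0.5 * np.linalg.slogdet(C)[1]
          - 0.5 * np.linalg.slogdet(G)[1] - a_t * np.log(b_t))
    return lw, a_t, b_t

def t_moments(theta, f_star, y, F, a, b, build_C, build_k):
    # location and variance of the student-t predictive (5.8) for one theta
    C, k = build_C(theta), build_k(theta)
    CiF = np.linalg.solve(C, F)
    Ciy = np.linalg.solve(C, y)
    Cik = np.linalg.solve(C, k)
    G = F.T @ CiF
    _, a_t, b_t = log_weight(theta, y, F, a, b, build_C)
    u = f_star - F.T @ Cik                           # f(x*) - F^T C^-1 k
    mu = k @ Ciy + u @ np.linalg.solve(G, F.T @ Ciy)
    lam = (b_t / a_t) * (1.0 - k @ Cik + u @ np.linalg.solve(G, u))
    return mu, a_t / (a_t - 1.0) * lam               # mean and variance of the t density

def bayes_predict(theta_samples, f_star, y, F, a, b, build_C, build_k):
    # weighted mixture over prior samples, as in the approximation displayed above
    logw, mus, vars_ = [], [], []
    for th in theta_samples:
        lw, _, _ = log_weight(th, y, F, a, b, build_C)
        mu, var = t_moments(th, f_star, y, F, a, b, build_C, build_k)
        logw.append(lw); mus.append(mu); vars_.append(var)
    w = np.exp(np.array(logw) - max(logw))
    w /= w.sum()
    mus = np.array(mus)
    mean = w @ mus
    return mean, w @ np.array(vars_) + w @ (mus - mean) ** 2

Because every quantity is available in closed form, the same functions apply unchanged when θ is restricted to a finite grid Θ with a discrete prior, as in Section 5.1.4.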
It follows that the predictive mean and variance can be obtained by µz∗ |y = E(z∗ |y) ≈ 2 σz∗ |y = Var(z∗ |y) ≈ w(θ(i) |y)µ z∗ |θ(i) ,y , w(θ(i) |y) w(θ(i) |y)σ 2 z∗ |θ(i) ,y w(θ(i) |y) 90 + w(θ(i) |y) µ − µz∗ |y z∗ |θ(i) ,y w(θ(i) |y) 2 , where the mean and variance of the student’s t-distribution π(z∗ |θ, y) are given by µz∗ |θ,y = E(z∗ |θ, y) = µ, 2 σz∗ |θ,y = Var(z∗ |θ, y) = 5.1.4 a ˜ λ. a−1 ˜ Discrete prior distribution To further reduce the computational demands from the Monte Carlo approach, we assign discrete uniform probability distributions to σs and σt as priors instead of continuous probability distributions. Assume that we know the range of parameters in θ, i.e., σs ∈ [σ s , σ s ] and σt ∈ [σ t , σ t ], where σ and σ denote the known lower-bound and upper-bound of the random variable σ, respectively. We constrain the possible choices of θ on a finite set of grid points denoted by Θ. Hence, π(θ) is now a probability mass function (i.e., θ∈Θ π(θ) = 1) as opposed to a probability density. The integration in (5.6) is reduced to the following summation π(z∗ |y) = θ∈Θ π(z∗ |θ, y)π(θ|y), (5.9) where the posterior distribution of θ is evaluated on the grid points in Θ by π(θ|y) = w(θ|y)π(θ) . θ∈Θ w(θ|y)π(θ) (5.10) In order to obtain the posterior predictive distribution in (5.9), the computation of π(z∗ |θ, y) and w(θ|y) for all θ ∈ Θ using the results from Proposition 5.1.3 (or Corollary 5.1.4 for a 91 special case) are necessary. Note that these quantities are available in closed-form which reduces the computational burden significantly. 5.2 Sequential Bayesian Prediction Although the aforementioned efforts in Sections 5.1.3 and 5.1.4 reduce the computational cost significantly, the number of observations (that mobile sensing agents collect) n increases with the time t. For each θ ∈ Θ, an n × n positive definite matrix C needs to be inverted which requires time O(n3 ) using standard methods. This motivates us to design scalable sequential Bayesian prediction algorithms by using subsets of observations. 5.2.1 Scalable Bayesian prediction algorithm The computation of π(z∗ |y1:t ) soon becomes infeasible as t increases. To overcome this drawback while maintaining the Bayesian framework, we propose to use subsets of all observations y1:t ∈ RN t . However, instead of using truncated local observations only as in [67], Bayesian inference will be drawn based on two sets of observations: • First, a set of local observations near target points y which will improve the quality of ˜ the prediction, and • second, a cumulative set of observations y which will minimize the uncertainty in the ¯ estimated parameters. Taken together, they improve the quality of prediction as the number of observations increases. We formulate this idea in detail in the following paragraph. For notational simplicity, we define y ∈ RN t as a subset of all observations y1:t which will be used for Bayesian 92 ¯ ˜ prediction. We partition y into two subsets, namely y and y . Let F and F be the coun¯ ˜ terparts of F defined in (5.2) for y and y , respectively. The following lemma provides the ¯ ˜ conditions under which any required function of y in Proposition 5.1.3 can be decoupled. ¯ ˜ Lemma 5.2.1. For a given θ ∈ Θ, let C = Corr(y, y|θ), C = Corr(¯, y |θ), C = Corr(˜, y |θ), y ¯ y ˜ ¯ ˜ k = Corr(y, z∗ |θ), k = Corr(¯, z∗ |θ), and k = Corr(˜, z∗ |θ). 
If the following conditions are y y satisfied C1: Corr(˜, y |θ) = 0, i.e., y and y are uncorrelated, and y ¯ ˜ ¯ C2: Corr(¯, z∗ |θ) = 0, i.e., y and z∗ are uncorrelated, y ¯ then we have the following results: ¯ ¯ ¯ ˜ ˜ ˜ F T C −1 F = F T C −1 F + F T C −1 F ∈ Rp×p , ¯ ¯ ¯ ˜ ˜ ˜ F T C −1 y = F T C −1 y + F T C −1 y ∈ Rp , y T C −1 y = y T C −1 y + y T C −1 y ∈ R, ¯ ¯ ¯ ˜ ˜ ˜ ¯ ˜ log det C = log det C + log det C ∈ R, ˜ ˜ ˜ F T C −1 k = F T C −1 k ∈ Rp , ˜ ˜ ˜ k T C −1 k = k T C −1 k ∈ R. Proof. The results follow by noting the correlation matrix C can be decoupled such that ¯ ¯ ˜ C = diag(C, C) and k = 0. Remark 5.2.2. In order to compute the posterior predictive distribution π(z∗ |y) (or the predictive mean and variance) in (5.9), π(z∗ |θ, y) and π(θ|y) for all θ ∈ Θ need to be calculated. Notice that the posterior distribution of θ can be obtained by computing w(θ|y) in 93 ¯ ¯ ¯ ¯ ¯ ¯ ¯ (5.7). Suppose F T C −1 F ∈ Rp×p , F T C −1 y ∈ Rp , y T C −1 y ∈ R, and log det C ∈ R are ¯ ¯ ¯ ˜ ˜ ˜ ˜ ˜ ˜ ˜ known for all θ ∈ Θ. If F T C −1 F ∈ Rp×p , F T C −1 y ∈ Rp , y T C −1 y ∈ R, and log det C ∈ R ˜ ˜ ˜ for all θ ∈ Θ have fixed computation times, then (5.7) and (5.8) can be computed in constant time due to decoupling results of Lemma 5.2.1. The following theorem provides a way to design scalable sequential Bayesian prediction algorithms. Theorem 5.2.3. Consider the discrete prior probability π(θ) and the compactly supported kernel function φt (·). If we select η ≥ σ t ∈ Z>0 , ∆ ∈ Z>0 and define ct := max t−∆ ,0 ∆+η ∈ R, ξj := y(j−1)(∆+η)+1:(j−1)(∆+η)+∆ ∈ R∆N , y := ¯ T (ξ1 , · · · T , ξct )T ∈ (5.11) R∆N ct , y := yt−∆+1:t ∈ R∆N , ˜ where · is the floor function defined by x := max {m ∈ Z | m ≤ x}, then the posterior predictive distribution in (5.9) can be computed in constant time (i.e., does not grow with the time t). Proof. By construction, conditions C1-2 in Lemma 5.2.1 are satisfied. Hence, it follows from Remark 5.2.2 that the posterior predictive distribution can be computed in constant time. Remark 5.2.4. In Theorem 5.2.3, η ≥ σ t guarantees the time distance between ξj and ξj+1 is large enough such that the conditions in Lemma 5.2.1 are satisfied. Notice that ∆ is a tuning parameter for users to control the trade-off between the prediction quality and 94 the computation efficiency. A large value for ∆ yields a small predictive variance but long computation time, and vice versa. An illustrative example with three agents sampling the spatio-temporal Gaussian process in 1-D space is shown in Fig. 5.1. σt space (s∗ , t∗ ) 1 2 ··· time 10 ··· 13 t = 15 Figure 5.1: Example with three agents sampling the spatio-temporal Gaussian process in 1-D space and performing Bayesian inference. In this example, σ t = 2.5, η = 2, ∆ = 3, T T t = 15, ct = 2, y = (y1:3 , y6:8 )T and y = y13:15 . ¯ ˜ Based on Theorem 5.2.3, we provide the centralized sequential Bayesian prediction algorithm as shown in Table 5.2. 5.2.2 Distributed implementation for a special case In this subsection, we will show a distributed way (among agent groups) to implement the 2 proposed algorithm for a special case in which β and σf are assumed to be known a priori. The assumption for this special case is the exact opposite of the one made in [25] where β 2 and σf are unknown and θ is known a priori. To develop a distributed scheme among agent groups for data fusion in Bayesian statistics, we exploit the compactly supported kernel for space. 
Let Cs (h) in (5.1) also be a compactly supported kernel function as Ct (h) so that the correlation vanishes when the spatial distance between two inputs is larger than σs , i.e., Cs (h) = 0, ∀h > 1. 95 Consider a case in which M groups of spatially distributed agents sample a spatiotemporal Gaussian process over a large region Q. Each group is in charge of its sub-region of Q. The identity of each group is indexed by V := {1, · · · , M }. Each agent in group i is indexed by I [i] := {1, · · · , N }. The leader of group i is referred to as leader i, which implements the centralized scheme to make prediction on its sub-region using local observations and the globally updated posterior distribution of θ. Therefore, the posterior distribution of θ shall be updated correctly using all observations from all groups (or agents) in a distributed fashion. Let G(t) := (V, E(t)) be an undirected communication graph such that an edge (i, j) ∈ E(t) if and only if leader i can communicate with leader j at time t. We define the neighborhood of leader i at time t by Ni (t) := {j ∈ V | (i, j) ∈ E(t), j = i}. Let a[i] denote the quantity as a in the centralized scheme for group i. We then have the following theorem. Theorem 5.2.5. Assume that y [i] and y [i] for leader i are selected accordingly to The¯ ˜ orem 5.2.3 in time-wise. Let y defined by y := ((˜[1] )T , · · · , (˜[M ] )T )T . If the following ˜ ˜ y y condition is satisfied [i] [j] C3: q (t) − qν (t ) ≥ σ s , ∀i = j, ∀ ∈ I [i] , ∀ν ∈ I [j] , in the spatial domain, then the weights w(θ|y), based on all observations from all agents, can be obtained from M log w(θ|˜[i] ). y log w(θ|y) = log w(θ|¯) + y (5.12) i=1 Proof. The result follows by noting Corr(˜[i] , y [j] |θ) = 0, ∀i = j, when the condition C3 is y ˜ satisfied. An exemplary configuration of agents which satisfies C3 is shown in Fig. 5.2. 96 σs σs σs Figure 5.2: Example with three group of agents sampling the spatio-temporal Gaussian process in 2-D space and performing Bayesian prediction. The symbol ‘o’ denotes the position of a leader for a group and the symbol ‘x’ denotes the position of an agent. Distance between any two sub-regions is enforced to be greater than σ s which enables the distributed Bayesian prediction. Suppose that the communication graph G(t) is connected for all time t. Then the aver1 age M M y [i] i=1 log w(θ|˜ ) can be achieved asymptotically via discrete-time average-consensus algorithm [48]: log w(θ|˜[i] ) ← log w(θ|˜[i] ) + y y with 0 < j∈Ni log w(θ|˜[j] ) − log w(θ|˜[i] ) , y y < 1/∆(G) that depends on the maximum node degree of the network ∆(G) = maxi |Ni |. 5.2.3 Adaptive sampling At time t, the goal of the navigation of agents is to improve the quality of prediction of the field Q at the next sampling time t + 1. Therefore, mobile agents should move to the most 97 informative sampling locations q(t + 1) = (q1 (t + 1)T , · · · , qN (t + 1)T )T at time t + 1 in order to reduce the prediction error [32]. Suppose at time t + 1, agents move to a new set of positions q = (˜1 , · · · , qN )T . The ˜ qT ˜T mean squared prediction error is defined as J(˜) = q s∈S E (z(s, t + 1) − z (s, t + 1))2 ds, ˆ (5.13) where z (s, t + 1) is obtained as in (5.9). Due to the fact that θ has a distribution, the ˆ evaluation of (5.13) becomes computationally prohibitive. To simplify the optimization, we ˆ propose to utilize a maximum a posteriori (MAP) estimate of θ at time t, denoted by θt , i.e., ˆ θt = arg max π(θ|y), θ∈Θ where y is the subset of all observations used up to time t. 
The next sampling positions can be obtained by solving the following optimization problem q(t + 1) = arg min qi ⊂Q s∈S ˜ ˆ Var(z(s, t + 1)|y, θt )ds. (5.14) This problem can be solved using standard constrained nonlinear optimization techniques (e.g., the conjugate gradient algorithm), possibly taking into account mobility constraints of mobile sensors. Remark 5.2.6. The proposed control algorithm in (5.14) is truly adaptive in the sense that the new sampling positions are functions of all collected observations. On the other hand, if all parameters are known, the optimization in (5.14) can be performed offline without taking 98 any measurements. 5.3 Simulation In this section, we apply the proposed sequential Bayesian prediction algorithms to spatiotemporal Gaussian processes with a correlation function in (5.1). The Gaussian process was numerically generated through circulant embedding of the covariance matrix for the simulation study [20]. This technique allows us to numerically generate a large number of realizations of the Gaussian process. 5.3.1 MCMC-based approach on a 1-D scenario We consider a scenario in which N = 5 agents sample the spatio-temporal Gaussian process in 1-D space and the central station performs Bayesian prediction. The surveillance region Q is given by Q = [0, 10]. We consider the squared exponential function 1 Cs (h) = exp(− h2 ), 2 for space correlation and a compactly supported correlation function [24] for time as Ct (h) =   (1−h) sin(2πh) 1−cos(2πh)  + π×2πh , 0 ≤ h ≤ 1, 2πh   0, (5.15) otherwise, The signal-to-noise ratio γ is set to be 26dB which corresponds to σw = 0.158. The true val2 ues for the parameters used in simulating the Gaussian process are given by (β, σf , σs , σt ) = (0, 1, 2, 8). Notice that the mean function is assumed to be an unknown random variable, 99 2 i.e., the dimension of the regression coefficient β is 1. We assume that β|σf has the nonin2 formative prior and σf ∼ IG(3, 20). The Gibbs sampler in Table 5.1 was used to generate samples from the posterior distribution of the parameters. A random sampling strategy was used in which agents make observations at random locations at each time t ∈ Z>0 . The prediction was evaluated at each time step for 51 uniform grid points within Q. The histograms of the samples at time t = 1 and t = 10 are shown in Fig. 5.3-(a) and Fig. 5.3-(b), respectively. It is clear that the distributions of the parameters are centered around the true values with 100 observations at time t = 20. The prediction results at time t = 1 and t = 20 are shown in Fig. 5.4-(a) and Fig. 5.4-(b), respectively. However, with only 100 observations, the running time using the full Bayesian approach is about several minutes which will soon become intractable. 400 600 400 200 0 400 0 β 150 0 5 200 200 200 5 300 0 2 2 σf 4 0 6 2 150 300 100 100 50 0 β 100 0 1.6 1.8 2 σs 0 2.2 4 6 8 σt 10 12 0 1.6 1.8 0 2 200 50 100 0 1 2 σf 2 3 400 200 2 σs (a) 0 2.2 4 6 8 σt 10 12 (b) 2 Figure 5.3: Posterior distribution of β, σf , σs , and σt at (a) t = 1, and (b) t = 20. 5.3.2 Centralized scheme on 1-D scenario We consider the same scenario in which N = 5 agents sample the spatio-temporal Gaussian process in 1-D space and the central station performs Bayesian prediction. The true values 100 2 2 1 1 0 -1 0 -2 0 2 4 s 6 8 10 (a) -1 0 2 4 s 6 8 10 (b) Figure 5.4: Prediction at (a) t = 1, and (b) t = 20 using the MCMC-based approach. The true fields are plotted in blue solid lines. The predicted fields are plotted in red dash-dotted lines. 
The area between red dotted lines indicates the 95% confidence interval. 2 for the parameters used in simulating the Gaussian process are given by (β, σf , σs , σt ) = (20, 10, 2, 8). Notice that the mean function is assumed to be an unknown random variable, 2 i.e., the dimension of the regression coefficient β is 1. We assume that β|σf has the non2 informative prior and σf ∼ IG(3, 20). We also assume the bounds of θ, viz. σs ∈ [1.6, 2.4] and σt ∈ [4, 12] are known. ∆ = 12 is used and η = 11 is selected satisfying the condition in Theorem 5.2.3. We use a discrete uniform probability distribution for π(θ) as shown in Fig. 5.6-(a). The adaptive sampling strategy was used in which agents make observations at each time t ∈ Z>0 . The prediction was evaluated at each time step for 51 uniform grid points within Q. Fig. 5.5 shows the comparison between predictions at time t = 1 using (a) the maximum likelihood (ML) based approach, and (b) the proposed fully Bayesian approach. The ML based approach first generates a point estimate of the hyperparameters and then uses them as true ones for computing the prediction and the prediction error variance. In this simulation, a poor point estimate on θ was achieved by maximizing the likelihood function. As a result, 101 the prediction and the associated prediction error variance are incorrect and are far from being accurate for a small number of observations. On the other hand, the fully Bayesian approach which incorporates the prior knowledge of θ and uncertainties in θ provides a more accurate prediction and an exact confidence interval. Using the proposed sequential Bayesian prediction algorithm along with the adaptive sampling strategy, the prior distribution was updated in a sequential manner. At time t = 100, the posterior distribution of θ is shown in Fig. 5.6-(b). With a larger number of observations, the support for the posterior distribution of θ becomes smaller and the peak gets closer to the true value. As shown in Fig. 5.7-(a), the quality of the prediction at time t = 100 is significantly improved. At time t = 300, the prior distribution was further updated which is shown in Fig. 5.6-(c). At this time, θ = (2, 8)T , which is the true value, has the highest probability. The prediction is also shown in Fig. 5.7-(b). This demonstrates the usefulness and correctness of our algorithm. The running time at each time step is fixed, which is around 12s using Matlab, R2008a (MathWorks) in a PC (2.4GHz Dual-Core Processor). 5.3.3 Distributed scheme on 2-D scenario Finally, we consider a scenario in which there are 4 groups, each of which contain 10 agents sampling the spatio-temporal Gaussian process in 2-D space. The surveillance region Q is given by Q = [0, 10] × [0, 10]. The parameter values used in simulating the Gaussian process 2 are given by θ = (σs , σt )T = (2, 8)T , β = 0, and σf = 1, last two values of which are assumed to be known a priori. To use the distributed scheme, we only consider compactly supported kernel functions for both space and time. In particular, we consider Cs (h) = Ct (h) as in 102 28 28 26 26 24 24 22 22 20 20 18 18 16 16 14 0 2 4 s 6 8 10 (a) 14 0 2 4 s 6 8 10 (b) Figure 5.5: Prediction at t = 1 using (a) the maximum likelihood based approach, and (b) the proposed fully Bayesian approach. The true fields are plotted in blue solid lines. The predicted fields are plotted in red dash-dotted lines. The area between red dotted lines indicates the 95% confidence interval. (5.15). 
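For reference, the two correlation functions used in these simulations, the squared-exponential spatial correlation of Section 5.3.1 and the compactly supported correlation (5.15) (used for time throughout, and for both space and time in this 2-D scenario), can be coded directly; the short sketch below (Python) only restates those formulas.

import numpy as np

def C_s_sqexp(h):
    # squared-exponential spatial correlation, h = ||s - s'|| / sigma_s
    return np.exp(-0.5 * np.asarray(h) ** 2)

def C_compact(h):
    # compactly supported correlation of (5.15): exactly zero for h > 1, so inputs
    # separated by more than one bandwidth are uncorrelated (the property used by
    # Lemma 5.2.1 and Theorems 5.2.3 and 5.2.5)
    h = np.atleast_1d(np.asarray(h, dtype=float))
    out = np.zeros_like(h)
    inside = (h > 0.0) & (h <= 1.0)
    hh = h[inside]
    out[inside] = ((1.0 - hh) * np.sin(2 * np.pi * hh) / (2 * np.pi * hh)
                   + (1.0 - np.cos(2 * np.pi * hh)) / (np.pi * 2 * np.pi * hh))
    out[h == 0.0] = 1.0
    return out

The value at h = 0 is 1 by continuity, and the support ending exactly at h = 1 is what allows blocks of observations separated by more than one bandwidth to decouple.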
We also assume the fact that σs ∈ [1.6, 2.4] and σt ∈ [4, 12] are known a priori. ∆ = 12 is used and η = 11 is selected satisfying the condition in Theorem 5.2.3. The region Q is divided into 4 square sub-regions with equal size areas as shown in Fig. 5.9-(a). Distance between any two sub-regions is enforced to be greater than σs = 2.4, satisfying the ¯ condition in Theorem 5.2.5, which enables the distributed Bayesian prediction. The same uniform prior distribution for θ as in the centralized version (see Fig. 5.6-(a)) is used. The globally updated posterior distribution of θ at time t100 is shown in Fig. 5.8. It has a peak near the true θ, which shows the correctness of the distributed algorithm. The predicted field compared with the true field at time t100 is shown in Fig. 5.9. Due to the construction of sub-regions, the interface areas between any of two sub-regions are not predicted. Notice that the prediction is not as good as in the 1-D scenario due to the effect of curse of dimensionality when we move from 1-D to 2-D spaces. The prediction quality can be improved by using more number of sensors at the cost of computational time. The running time of the distributed algorithm in this scenario is about several minutes due to 103 0.2 0.015 0.15 0.01 0.1 0.005 0.05 0 4 0 4 6 σt 8 10 12 1.6 1.8 2.2 2 6 2.4 σt σs 8 10 12 (a) 1.6 1.8 2.2 2 2.4 σs (b) 0.8 0.6 0.4 0.2 0 4 6 σt 8 10 12 1.6 1.8 2.2 2 2.4 σs (c) Figure 5.6: (a) Prior distribution θ, (b) posterior distribution of θ at time t = 100, (c) posterior distribution of θ at time t = 300. the complexity of the 2-D problem under the same computational environment as the one used for the 1-D scenario. However, thanks to our proposed sequential sampling schemes, the running time does not grow with the number of measurements. 104 24 24 23 23 22 22 21 21 20 20 19 19 18 18 17 17 16 0 2 4 s 6 8 16 10 0 2 4 (a) s 6 8 10 (b) Figure 5.7: Prediction at (a) t = 100, and (b) t = 300 using the centralized sequential Bayesian approach. The true fields are plotted in blue solid lines. The predicted fields are plotted in red dash-dotted lines. The area between red dotted lines indicates the 95% confidence interval. 0.4 0.3 0.2 0.1 0 4 6 σt 8 10 12 1.6 1.8 2 2.2 2.4 σs Figure 5.8: Posterior distribution of θ at time t = 100 using the distributed algorithm. 105 Table 5.2: Centralized Bayesian prediction algorithm. 
Input: 2 2 (1) prior distribution on σf , i.e., π(σf ) = IG(a, b) (2) prior distribution on θ ∈ Θ, i.e., π(θ) (3) tuning variables ∆ and η (4) number of agents N (5) M(θ).A = 0 ∈ Rp×p , M(θ).B = 0 ∈ R, M(θ).C = 0 ∈ Rp , M(θ).D = 0 ∈ R, M0 (θ) = M(θ), ∀θ ∈ Θ Output: (1) The predictive mean at location s∗ ∈ S and time t∗ = t, i.e., µz∗ |y 2 (2) The predictive variance at location s∗ ∈ S and time t∗ = t, i.e., σz |y ∗ At time t, the central station does: 1: receive observations yt from agents, set y = yt−∆+1:t and n = N ∆ ˜ ˜ = (f (˜(1) ), · · · , f (˜(n) ))T where x(i) is the input of the i-th element in y 2: compute F x x ˜ ˜ 3: for each θ ∈ Θ do ˜ 4: compute C = Corr(˜, y ) ∈ Rn×n y ˜ 5: compute the key values ˜ ˜ ˜ F T C −1 F = M(θ).A + F T C −1 F ∈ Rp×p , y T C −1 y = M(θ).B + y T C −1 y ∈ R, ˜ ˜ ˜ T C −1 y = M(θ).C + F T C −1 y ∈ Rp , log det C = M(θ).D + log det C ∈ R ˜ ˜ ˜ ˜ F 6: compute a = a + n and ˜ 2 ˜ = b + 1 y T C −1 y − 1 (F T C −1 y)T (F T C −1 F )−1 (F T C −1 y) b 2 2 7: update weights via log w(θ|y) = − 1 log det C − 1 log det(F T C −1 F ) − a log ˜ ˜ b 2 2 8: for each s∗ ∈ S do ˜ 9: compute f (x∗ ) ∈ Rp , k = Corr(˜, z∗ ) ∈ Rn y 10: compute predictive mean and variance for given θ ˜˜ ˜ ˜ ˜ ˜ µz∗ |θ,y = k C −1 y + (f (x∗ ) − F T C −1 k)T (F T C −1 F )−1 (F T C −1 y), 2 σz |θ,y = 11: 12: 13: ˜ b a−1 ˜ ∗ ˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜ (1 − k T C −1 k) + (f (x∗ ) − F T C −1 k)T (F T C −1 F )−1 (f (x∗ ) − F T C −1 k) end for if mod(t, ∆ + η) = ∆ then set M(θ) = M0 (θ), then M0 (θ).A = F T C −1 F , M0 (θ).B = y T C −1 y, M0 (θ).C = F T C −1 y, and M0 (θ).D = log det C 14: end if 15: end for 16: compute the posterior distribution π(θ|y) = w(θ|y)π(θ) θ w(θ|y)π(θ) 17: compute the predictive mean and variance µz∗ |y = θ µz∗ |θ,y π(θ|y), 2 σz |y = ∗ 2 θ σz∗ |θ,y π(θ|y) + 106 θ µz∗ |θ,y − µz∗ |y 2 π(θ|y). (a) (b) Figure 5.9: Comparison of (a) the true field at t = 100 and (b) the predicted field at t = 100 using the distributed algorithm. 107 Chapter 6 Gaussian Process with Built-in GMRF Recently, there have been efforts to find a way to fit a computationally efficient Gaussian Markov random field (GMRF) on a discrete lattice to a Gaussian random field on a continuum space [17,27,56]. Such methods have been developed using a fitting with a weighted L2 -type distance [56], using a conditional-mean least-squares fitting [17], and for dealing with large data by fast Kriging [27]. It has been demonstrated that GMRFs with small neighborhoods can approximate Gaussian fields surprisingly well [56]. This approximated GMRF and its regression are very attractive for the resource-constrained mobile sensor networks due to its computational efficiency and scalability [34] as compared to the standard Gaussian process and its regression, which is not scalable as the number of observations increases. Mobile sensing agents form an ad-hoc wireless communication network in which each agent usually operates under a short communication range, with limited memory and computational power. For resource-constrained mobile sensor networks, developing distributed 108 prediction algorithms for robotic sensors using only local information from local neighboring agents has been one of the most fundamental problems [4, 6, 10, 13, 25, 47]. In Section 6.1.1, a new class of Gaussian processes is proposed for resource-constrained mobile sensor networks. Such a Gaussian process builds on a GMRF [54] with respect to a proximity graph, e.g., the Delaunay graph of a set of vertices over a surveillance region. 
The formulas for predictive statistics are derived in Section 6.1.2. We propose a sequential prediction algorithm which is scalable to deal with sequentially sampled observations in Section 6.1.3. In Section 6.2, we develop a distributed and scalable statistical inference algorithm for a simple sampling scheme by applying the Jacobi over-relaxation and discretetime average consensus algorithms. Simulation and experimental study demonstrate the usefulness of the proposed model and algorithms in Section 6.3. 6.1 Spatial Prediction In this section, we first propose a new class of Gaussian random fields with built-in Gaussian Markov random fields (GMRF) [54]. Then we show how to compute the prediction at any point of interest based on Gaussian process regression, and provide a sequential field prediction algorithm for mobile sensor networks. 6.1.1 Spatial model based on GMRF Let γ := (γ(p1 ), · · · , γ(pm ))T ∼ N (0, Q−1 ) be a zero-mean GMRF [54] with respect to an undirected graph G = (V, E), where the location of vertex i is denoted by pi in the surveillance region Q. Such locations of vertices will be referred to as generating points. The inverse covariance matrix (precision matrix) Q 109 0 has the property (Q)ij = 0 ⇔ {i, j} ∈ E. If the graph G has small cardinalities of the neighbor sets, its precision matrix Q becomes sparse with many zeros in its entries. This plays a key role in computation efficiency of a GMRF which can be greatly exploited by the resource-constrained mobile sensor network. The spatial field is modeled by a Gaussian process with a built-in GMRF defined as m z(s) = µ(s) + λ(s, pj )γ(pj ), (6.1) j=1 where λ(·, ·) is a weighting function. The new class of Gaussian processes is capable of representing a wide range of non-stationary Gaussian fields, by selecting 1. different number of generating points m, 2. different locations of generating points pj | j = 1, · · · , m over Q, 3. a different structure of the precision matrix Q, and 4. different weighting functions λ(·, pj ) | j = 1, · · · , m . Remark 6.1.1. The number of generating points could be determined by a model selection criterion such as the Akaike information criterion [1]. Similar to hyperparameter estimation in the standard Gaussian process regression, one can estimate all other parameters using maximum likelihood (ML) optimization [51, 65]. This is non-convex optimization and so the initial conditions need to be chosen carefully to avoid local minima. In our approach, we use basic structures for weighting functions and the precision matrix, however, we make them as functions of the locations of generating points. Different spatial resolutions can be obtained by a suitable choice of locations of generating points. As an example shown in Fig. 6.1, higher resolution can be obtained by higher density of generating points (see lower left corner). In 110 this way, we only need to determine the locations of generating points. This approach will be demonstrated with real-world data in Section 6.3.4. 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 0.2 0.4 0.6 0.8 1 (a) (b) Figure 6.1: (a) Generating points in blue dots and the associated Delaunay graph with edges in red dotted lines. The Voronoi partition is also shown in blue solid lines. (b) Gaussian random field with a built-in GMRF with respect to the Delaunay graph in (a). 6.1.2 Gaussian process regression Suppose we have a collection of observations y := (y1 , · · · , yn )T whose entries are sampled at the corresponding points s1 , · · · , sn . 
The noise corrupted measurement yi ∈ R is given by yi = z(si ) + i , i.i.d. 2 where i ∼ N (0, σw ) is an independent and identically distributed (i.i.d.) Gaussian white noise. We then have the following results. Proposition 6.1.2. Let Λ ∈ Rn×m be a matrix obtained by (Λ)ij = λ(si , pj ) and let λ ∈ Rm be a vector obtained by (λ)i = λ(s0 , pi ), where s0 is a point of interest. Then the covariance 111 matrix of y and the covariance between y and z(s0 ) are given by 2 C := E[(y − E y)(y − E y)T ] = ΛQ−1 ΛT + σw I, k := E[(y − E y)z(s0 )] = ΛQ−1 λ, where Q ∈ Rm×m is the precision matrix of the GMRF γ ∈ Rm . Proof. The (i, j)-th element of the covariance matrix C, i.e., the covariance between yi and yj , can be obtained by 2 Cij = Cov(z(si ), z(sj )) + σw δij 2 = E(z(si ) − µ(si ))(z(sj ) − µ(sj )) + σw δij =E λ(si , pk )γ(pk ) λ(sj , pl )γ(pl )  k 2 + σw δij l  2 λ(si , pk )γ(pk )γ(pl )λ(sj , pl ) + σw δij = E k,l 2 λ(si , pk ) E(γ(pk )γ(pl ))λ(sj , pl ) + σw δij = k,l 2 λ(si , pk )(Q−1 )kl λ(sj , pl ) + σw δij . = k,l The i-th element of the covariance vector k, i.e., the covariance between yi and z(s0 ), 112 can be obtained by ki = Cov(z(si ), z(s0 )) = E(z(si ) − µ(si ))(z(s0 ) − µ(s0 )) =E λ(si , pk )γ(pk )  k = E λ(s0 , pl )γ(pj ) l  λ(si , pk )γ(pk )γ(pl )λ(s0 , pl ) k,l = λ(si , pk ) E(γ(pk )γ(pl ))λ(s0 , pl ) k,l λ(si , pk )(Q−1 )kl λ(s0 , pl ), = k,l whose matrix form completes the proof. By Propositions 6.1.2, we can make prediction at the point of interest s0 using Gaussian process regression [51]. This is summarized by the following theorem. Theorem 6.1.3. For given y, the prediction of z0 := z(s0 ) at any location s0 ∈ Q is given by the conditional distribution 2 z0 |y ∼ N µz |y , σz |y , 0 0 where the predictive mean and variance are obtained by ˆ ˆ µz |y = µ(s0 ) + λT Q−1 y , 0 2 σz |y 0 = ˆ λT Q−1 λ, 113 (6.2) with −2 ˆ Q = Q + σw ΛT Λ ∈ Rm×m , −2 y = σw ΛT (y − µ) ∈ Rm . ˆ Proof. By using the Woodbury matrix identity (see Appendix A.2.1), the prediction mean can be obtained by µz |y = µ(s0 ) + k T C −1 (y − µ) 0 2 = µ(s0 ) + (ΛQ−1 λ)T (ΛQ−1 ΛT + σw I)−1 (y − µ) 2 = µ(s0 ) + λT Q−1 ΛT (ΛQ−1 ΛT + σw I)−1 (y − µ) −2 = µ(s0 ) + λT Q−1 ΛT (σw I− −2 −2 −2 σw Λ(Q + σw ΛT Λ)−1 ΛT σw )(y − µ) −2 = µ(s0 ) + λT (σw Q−1 − −4 −2 σw Q−1 ΛT Λ(Q + σw ΛT Λ)−1 )ΛT (y − µ) = µ(s0 ) + λT ΞΛT (y − µ), 114 where −2 −4 −2 Ξ = σw Q−1 − σw Q−1 ΛT Λ(Q + σw ΛT Λ)−1 −2 −2 −2 = σw Q−1 (Q + σw ΛT Λ)(Q + σw ΛT Λ)−1 −4 −2 − σw Q−1 ΛT Λ(Q + σw ΛT Λ)−1 −2 −4 −2 = (σw I + σw Q−1 ΛT Λ)(Q + σw ΛT Λ)−1 −2 −4 − σw Q−1 ΛT Λ(Q + σw ΛT Λ)−1 −2 −2 = σw (Q + σw ΛT Λ)−1 . Similarly, the prediction error variance can be obtained by 2 σz |y = λT Q−1 λ − k T C −1 k 0 2 = λT Q−1 λ − (ΛQ−1 λ)T (ΛQ−1 ΛT + σw I)−1 (ΛQ−1 λ) 2 = λT Q−1 − Q−1 ΛT (ΛQ−1 ΛT + σw I)−1 ΛQ−1 λ −2 = λT (Q + σw ΛT Λ)−1 λ, where Cov(z(s0 ), z(s0 )) = λT Q−1 λ is obtained similarly as in Proposition 6.1.2. Remark 6.1.4. When the generating points {p1 , p2 , · · · , pm } are not known a priori, they can be estimated by maximizing the likelihood function. Given n observations y = (y1 , y2 , · · · , yn )T sampled at {s1 , s2 , · · · , sn }, the log likelihood of y is given by 1 1 n log π(y) = − (y − µ)T C −1 (y − µ) − log det C − log 2π, 2 2 2 2 where C = ΛQ−1 ΛT + σw I is the covariance matrix of y. the maximum likelihood estimate 115 of the generating points can be obtained via solving the following optimization problem. pML = arg max log π(y). ˆ p (6.3) Remark 6.1.5. 
Note that the number of generating points m is fixed and the number of observations n may grow in time, and so in general we consider m n. Theorem 6.1.3 shows −2 ˆ that only the inversion of an m × m matrix Q = Q + σw ΛT Λ is required in order to compute the predictive distribution of the field at any point. The computational complexity grows linearly with the number of observations, i.e., O(nm2 ), compare to the standard Gaussian process regression which requires O(n3 ). Moreover, it enables a scalable prediction algorithm for sequential measurements. In what follows, we present a sequential field prediction algorithm for sequential observations by exploiting the results of Theorem 6.1.3. 6.1.3 Sequential prediction algorithm Consider a sensor network consisting of N mobile sensing agents distributed in the surveillance region Q. The index of the robotic sensors is denoted by I := {1, · · · , N }. The sensing agents sample the environmental field at time t ∈ Z>0 and send the observations to a central station which is in charge of the data fusion. At time t, agent i makes an observation yi (t) at location si (t). Denote the collection of observations at time t by yt := (y1 (t), · · · , yN (t))T . We have the following proposition. Proposition 6.1.6. At time t ∈ Z>0 , the predictive mean and variance at any point of 116 interest can be obtained via (6.2) with −2 ˆ ˆ Qt = Qt−1 + σw ΛT Λt , t ˆ Q0 = Q −2 yt = yt−1 + σw ΛT (yt − µt ), ˆ ˆ t y0 = 0, ˆ where (Λt )ij = λ(si (t), sj (t)), and (µt )i = µ(si (t)). Proof. The result can be obtained easily by noting that AT A = AT A1 + AT A2 , where 2 1 A = (AT , AT )T . 2 1 Based on Proposition 6.1.6, we present a sequential field prediction algorithm using mobile sensor networks in Table 6.1. Table 6.1: Sequential algorithm for field prediction. Input: a set of target points S Output: (1) prediction mean {ˆ(s0 ) | s0 ∈ S} z (2) prediction error variance σ 2 (s0 ) | s0 ∈ S Assumption: (1) the central station knows p, Q, and λ(·, ·) ˆ (2) the central station initially has Q ← Q, y ← 0 ˆ At time t, agent i ∈ I in the network does: 1: take measurement yi from its current location si 2: send the measurement (si , yi ) to the central station At time t, the central station does: 1: obtain measurements {(s , y ) | ∀ ∈ I} from mobile sensors 2: compute Λ via (Λ)ij = λ(si , pj ) −2 ˆ ˆ 3: update Q ← Q + σw ΛT Λ −2 4: update y ← y + σw ΛT (y − µ), where µi = µ(si ) ˆ ˆ 5: for s0 ∈ S do 6: compute (λ)i via λ(s0 , pi ) ˆ ˆ 7: compute z (s0 ) = µ(s0 ) + λT Q−1 y ˆ 2 (s ) = λT Q−1 λ ˆ 8: compute σ 0 9: end for 117 6.2 Distributed Spatial Prediction In this section, we propose a distributed approach, in which robotic sensors exchange only local information between neighbors, to implement the field prediction effectively fusing all observations collected by all sensors correctly. This distributed approach can be implemented for a class of weighting functions λ(·, ·) in (6.1) that have compact supports. In particular, we consider the weighting function defined by λ(s, pj ) = λ( s − pj /r), (6.4) where λ(h) :=    (1 − h) cos(πh) + 1 sin(πh), h ≤ 1, π   0, otherwise. Notice that the weighting function λ(·, ·) in (6.4) has a compact support, i.e., λ(s, pj ) is non-zero if and only if the distance s − pj is less than the support r ∈ R>0 . 6.2.1 Distributed computation We first briefly introduce distributed algorithms for solving linear systems and computing the averages. They will be used as major tools for distributed implementation of field prediction. 
• Jacobi over-relaxation method: The Jacobi over-relaxation (JOR) [4] method provides an iterative solution of a linear system Ax = b, where A ∈ Rn×n is a nonsingular matrix and x, b ∈ Rn . If agent i knows the rowi (A) ∈ Rn and bi , and aij = (A)ij = 0 118 if agent i and agent j are not neighbors, then the recursion is given by  (k+1) xi (k) = (1 − h)xi +  h  bi − aii (k) aij xj  . (6.5) j∈Ni This JOR algorithm converges to the solution of Ax = b from any initial condition if h < 2/n [13]. At the end of the algorithm, agent i knows the i-th element of x = A−1 b. • Discrete-time average consensus: The Discrete-time average consensus (DAC) provides a way to compute the arithmetic mean of elements in the a vector c ∈ Rn . Assume the graph is connected. If agent i knows the i-th element of c, the network can compute the arithmetic mean via the following recursion [47] (k+1) xi (k) = xi (k) + aij (xj j∈Ni (k) − xi ), (6.6) with initial condition x(0) = c, where aij = 1 if j ∈ Ni and 0 otherwise, 0 < < 1/∆, and ∆ = maxi ( j=i aij ) is the maximum degree of the network. After the algorithm converges, all node in the network know the average of c, i.e., 6.2.2 n i=1 ci /n. Distributed prediction algorithm Consider a GMRF with respect to a proximity graph G = (V, E) that generates a Gaussian random field in (6.1). The index of the generating points is denoted by V := {1, · · · , n}. The location of the i-th generating point is pi . The edges of the graph are considered to be E := {i, j} | pi − pj ≤ R , where R is a constant that ensures the graph is connected. Consider a mobile sensor network consisting of N mobile sensing agents distributed in the surveillance region Q. For simplicity, we assume that the number of agents is equal to 119 the number of generating points, i.e., N = m. The index of the robotic sensors is denoted by I := {1, · · · , m}. The location of agent i is denoted by si . The assumptions made for the resource-constrained mobile sensor networks are listed as follows. A.1 Agent i is in charge of sampling at point si within a r-disk centered at pi , i.e., si −pi < r. A.2 r is the radius of the support of the weighting function in (6.4) and also satisfies that 0 < r < R. 2 A.3 Agent i can only locally communicate with neighbors in Ni := {j ∈ I | {i, j} ∈ E} defined by the connected proximity graph G = (V, E). A.4 Agent i knows rowi (Q), i.e., the i-th row of Q, where (Q)ij = 0 if and only if j ∈ {i}∪Ni . Remark 6.2.1. As in A.1, it is reasonable to have at least one agent collect measurements that are correlated with a random variable from a single generating point. This sampling rule may be modified such that a single agent dynamically samples for multiple generating points or more number of agents samples for a generating point depending on available resources. Since there is at least one agent in charge of a generating point by A.1, it is natural to have A.3 and A.4 taking advantage of the proximity graph for the GMRF. Notice that each agent only knows local information of Q as described in A.4. An illustration of agent sampling a measurement at point s in the intersection of the supports of the weighting functions of pi and pj is shown in Fig. 6.2. From A.1 and A.2, since R > 2r, we have λ(s , pi ) = 0 if ∈ Ni . Thus the matrix / −2 −2 ˆ Q = Q + σw ΛT Λ ∈ Rm×m and the vector y = σw ΛT (y − µ) ∈ Rm can be obtained in the ˆ 120 r pi s￿ pj pk Figure 6.2: Example of computing (ΛT Λ)ij = λ(s , pi )λ(s , pj ). following form. 
−2 ˆ (Q)ij = (Q)ij + σw λ(s , pi )λ(s , pj ), ∈{{i}∪Ni }∩ {j}∪Nj −2 (ˆ)i = σw y (6.7) λ(s , pi )(y − µ ). ∈{i}∪Ni ˆ Notice that Q has the same sparsity as Q. From (6.7), A.3 and A.4, agent i can compute ˆ ˆ rowi (Q) and (ˆ)i by using only local information from neighbors. Using rowi (Q) and (λ)i , y −2 ˆ agent i can obtain the i-th element in the vector Q−1 λ = (Q+σw ΛT Λ)−1 λ via JOR by using only local information. Finally, using (ˆ)i and (λ)i the prediction mean and variance can y be obtained via the discrete-time average consensus algorithm. Notice that the sequential ˆ update of Q and y for sequential observations proposed in Section 6.1.3 can be also applied ˆ to the distributed algorithm. The distributed algorithm for sequential field prediction under assumptions A.1-4 is summarized in Table 6.2. The number of robotic sensors and the sampling rule can be modified or optimized to maintain a better quality of the prediction and the corresponding distributed algorithm may 121 Table 6.2: Distributed algorithm for sequential field prediction. Input: (1) a set of target points S (2) the topology of sensor network G = (I, E) in which E := {i, j} | pi − pj ≤ R Output: (1) prediction mean µz |y | s0 ∈ S 0 2 (2) prediction error variance σz |y | s0 ∈ S 0 Assumption: (A1) agent i ∈ I is in charge of sampling at point si within a r-disk centered at pi , i.e., si − pi < r (A2) the radius of the support of the weighting function satisfies 0 < r < R 2 (A3) agent i ∈ I can only locally communicate with neighbors Ni := {j ∈ I | {i, j} ∈ E} defined by the connected graph G = (V, E) ˆ (A4) agent i ∈ I initially has rowi (Q) ← rowi (Q), (ˆ)i ← 0 y At time t, agent i ∈ I in the network does the following concurrently: 1: take measurement yi from its current location si −2 ˆ ˆ 2: update rowi (Q) ← rowi (Q) + rowi (σw ΛT Λ) by exchanging information from neighbors Ni −2 3: update (ˆ)i ← (ˆ)i + (σw ΛT (y − µ))i by exchanging information from neighbors Ni y y 4: for s0 ∈ S do 5: compute (λ)i = λ(s0 , pi ) ˆ 6: compute (Q−1 λ)i via JOR ˆ ˆ 7: compute µz |y = µ(s0 ) + λT Q−1 y via DAC 0 ˆ 8: compute σ 2 = λT Q−1 λ via DAC 9: end for z0 |y be derived in a same way accordingly. 6.3 Simulation and Experiment In this section, we apply the proposed schemes to both simulation and experimental study. 122 6.3.1 Simulation We first apply our proposed prediction algorithms to a numerically generated Gaussian random field z(·) based on a GMRF with respect to a graph G = (V, E) defined in (6.1). The mean function µ(·) is assumed to be constant and µ = 5 is used in the simulation. We assume the generating points of the GMRF, indexed by V = {1, · · · , n} where n = 30, are located at {p1 , · · · , pn } in a 2-D unit area Q. The edges of the graph are assumed to be E := {i, j} | pi − pj ≤ R , where R = 0.4. The GMRF γ = (γ(p1 ), · · · , γ(pn ))T has a zero-mean and the precision matrix Q is given by    |N (i)| + c0 , if j = i,     (Q)ij = if j ∈ N (i),  −1,      0, otherwise, where |N (i)| denotes the degree of node i, i.e., the number of connections it has to other nodes, c0 = 0.1 is used to ensure Q is positive definite since a Hermitian diagonally dominant matrix with real non-negative diagonal entries is positive semi-definite [54]. We use compactly supported weighting functions defined in (6.4) for both centralized and distributed schemes with different support r. The sensor noise level is given by σw = 0.5. 
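A compact sketch of the centralized scheme under exactly this simulation setup is given below (Python/NumPy): it builds the precision matrix Q from the proximity graph as defined above, uses the weighting function (6.4), and applies the sequential update of Proposition 6.1.6 followed by the prediction formulas of Theorem 6.1.3. The dense solves of the m × m system and the default support r = 0.2 are illustrative simplifications; a sparse factorization would normally be used.

import numpy as np

def build_Q(P, R=0.4, c0=0.1):
    # precision of the built-in GMRF: (Q)_ii = |N(i)| + c0, (Q)_ij = -1 for neighbors
    D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
    A = (D <= R) & ~np.eye(len(P), dtype=bool)
    return np.diag(A.sum(1) + c0) - A.astype(float)

def weight(S, P, r=0.2):
    # compactly supported weighting function lambda(s, p_j) of (6.4)
    h = np.linalg.norm(S[:, None, :] - P[None, :, :], axis=-1) / r
    return np.where(h <= 1.0, (1.0 - h) * np.cos(np.pi * h) + np.sin(np.pi * h) / np.pi, 0.0)

class SequentialGMRFPredictor:
    # centralized sequential field prediction (Table 6.1 / Proposition 6.1.6)
    def __init__(self, P, mu=5.0, sigma_w=0.5, r=0.2):
        self.P, self.mu, self.sw, self.r = P, mu, sigma_w, r
        self.Qhat = build_Q(P)                   # Qhat_0 = Q
        self.yhat = np.zeros(len(P))             # yhat_0 = 0

    def update(self, S_t, y_t):
        Lam = weight(S_t, self.P, self.r)
        self.Qhat += Lam.T @ Lam / self.sw ** 2
        self.yhat += Lam.T @ (y_t - self.mu) / self.sw ** 2

    def predict(self, S0):
        lam = weight(S0, self.P, self.r)         # one row per target point
        sol = np.linalg.solve(self.Qhat, lam.T)  # Qhat^-1 lambda
        mean = self.mu + lam @ np.linalg.solve(self.Qhat, self.yhat)
        var = np.sum(lam * sol.T, axis=1)        # lambda^T Qhat^-1 lambda per target
        return mean, var

At each time step the central station calls update() with the new sampling locations and measurements and then predict() on a grid of target points; since Q̂ stays m × m, the per-step cost does not grow with the number of collected observations, which is the fixed computational time reported below.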
Since the optimal sampling strategy is beyond the scope of this chapter, in the simulation we use a random sampling strategy in which robotic sensors sample at random locations at each time instance.

6.3.2 Centralized scheme

We first consider a scenario in which N = 5 agents take samples in the surveillance region Q at each time instance t ∈ Z>0 and send the observations to a central station in which the prediction of the field is made. The Gaussian random field z(·) is shown in Fig. 6.3-(a) with the n = 30 generating points of the built-in GMRF shown in black circles. The predicted fields at times t = 1, t = 5, and t = 20 are shown in Figs. 6.3-(b), (c), and (d), respectively. The sampling locations are shown in black crosses. Clearly, the predicted field gets closer to the true field as the number of observations increases. The computational time for field prediction at each time instance remains fixed due to the structure of the proposed Gaussian field in (6.1) and the consequent results of Theorem 6.1.3.

Figure 6.3: Simulation results for the centralized scheme. (a) The true field, (b) the predicted field at time t = 1, (c) the predicted field at time t = 5, (d) the predicted field at time t = 20. The generating points are shown in black circles, and the sampling locations are shown in black crosses.

6.3.3 Distributed scheme

Next, we consider a scenario in which prediction is implemented in a distributed fashion (Table 6.2) under assumptions A.1-4 for the resource-constrained mobile sensor network in Section 6.2.2. In particular, N = 30 robotic sensors are distributed according to the graph G = (V, E), which is connected. Agent i is in charge of the sampling within an r-disk centered at pi, where the support r = 0.2 is used. Agent i has a fixed neighborhood, i.e., N(i) = {j | {i, j} ∈ E}. In the simulation, h = 0.02 in (6.5) and ε = 0.02 in (6.6) are chosen to ensure the convergence of the JOR algorithm and the DAC algorithm. Fig. 6.4-(a) shows the underlying graph G = (V, E) for the GMRF with the generating points denoted by black circles and the edges in red lines. The sparsity of the precision matrix Q is shown in Fig. 6.4-(b). Notice that only 316 out of 900 elements in Q are nonzero, which enables the efficient distributed computation. The true and the predicted fields at time t = 5 are shown in Figs. 6.5-(a) and (b), respectively. The normalized RMS error computed over about 10000 grid points at time t = 5 is 7.8%. The computational time at each time instance remains fixed due to the structure of the proposed Gaussian field in (6.1) and the consequent results of Theorem 6.1.3.

Figure 6.4: (a) Graph G = (V, E). (b) Sparsity structure of the precision matrix Q.

Figure 6.5: Simulation results for the distributed scheme. (a) The true field, (b) the predicted field at time t = 5. The generating points are shown in circles, and the sampling locations are shown in crosses.

6.3.4 Experiment

In order to show the practical usefulness of the proposed approach, we apply the centralized scheme of Theorem 6.1.3 to experimentally obtained observations.
6.3.4 Experiment

To show the practical usefulness of the proposed approach, we apply the centralized scheme of Theorem 6.1.3 to experimentally obtained observations. We first measured the depth values of a terrain on grid points using a Microsoft Kinect sensor [12], as shown in Fig. 6.6-(a). As pointed out in Remark 6.1.1, we make the structures of the weighting functions and the precision matrix functions of the locations of the generating points. In particular, two generating points are neighbors if and only if their corresponding Voronoi cells intersect. Each weighting function takes the same form as in (6.4), and its support size r_i is selected to be the largest distance between generating point i and its neighbors. We then predict the field with our model using 20 estimated generating points given by the ML estimator in (6.3) and a subset of the experimental observations, i.e., 200 randomly sampled observations denoted by crosses in Fig. 6.6-(a). The estimated positions of the generating points along with the predicted field are shown in Fig. 6.6-(b). In this experiment, it is clear that our approach effectively produces a predicted field that is very close to the true field even when the generating points are unknown.

Figure 6.6: (a) True field on grid positions obtained by the Kinect sensor and randomly sampled positions indicated by black crosses. (b) The fitted Gaussian random field with a built-in GMRF with respect to the Delaunay graph.

Chapter 7

Bayesian Spatial Prediction Using GMRF

In this chapter, we consider the problem of predicting a large-scale spatial field using successive noisy measurements obtained by mobile sensing agents. The physical spatial field of interest is discretized and modeled by a Gaussian Markov random field (GMRF) with unknown hyperparameters. From a Bayesian perspective, we design a sequential prediction algorithm to exactly compute the predictive inference of the random field. The main advantages of the proposed algorithm are: (1) the computational efficiency due to the sparse structure of the precision matrix, and (2) the scalability as the number of measurements increases. Thus, the prediction algorithm correctly takes into account the uncertainty in the hyperparameters in a Bayesian way and is also scalable enough to be usable for mobile sensor networks with limited resources. An adaptive sampling strategy is also designed for mobile sensing agents to find the most informative locations for future measurements in order to minimize the prediction error and the uncertainty in the hyperparameters. The effectiveness of the proposed algorithms is illustrated by a numerical experiment.

In Chapter 5, we designed a sequential Bayesian prediction algorithm to deal with unknown bandwidths by using a compactly supported kernel and selecting a subset of the collected measurements. In this chapter, we instead seek a fully Bayesian approach over a discretized surveillance region such that the Bayesian spatial prediction utilizes all collected measurements in a scalable fashion. In Section 7.1, we model the physical spatial field as a GMRF with unknown hyperparameters and formulate the estimation problem from a Bayesian point of view. In Section 7.2, we design a sequential Bayesian estimation algorithm to effectively and efficiently compute the exact predictive inference of the spatial field.
The proposed algorithm often takes only seconds to run even for a very large spatial field, as will be demonstrated in this chapter. Moreover, the algorithm is scalable in the sense that the running time does not grow as the number of observations increases. In particular, the scalable prediction algorithm does not rely on a subset of the samples to obtain scalability (as was done in Chapter 5); it correctly fuses all collected measurements. In Section 7.4, an adaptive sampling strategy for mobile sensor networks is designed to improve the quality of prediction and to reduce the uncertainty in the hyperparameter estimation simultaneously. We demonstrate the effectiveness of the proposed algorithms through a simulation study in Section 7.5.

7.1 Problem Setup

In what follows, we specify the models for the spatial field and the mobile sensor network. Notice that in this chapter we slightly change notation for notational simplicity.

7.1.1 Spatial field model

Let Q_* ⊂ R^D denote the spatial field of interest. We discretize the field into n_* spatial sites S_* := {s_1, · · · , s_{n_*}} and let z_* = (z_1, · · · , z_{n_*})^T ∈ R^{n_*} be the value of the field (e.g., the temperature). Due to the possibly irregular shape of the spatial field, we extend the field such that n ≥ n_* sites, denoted by S := {s_1, · · · , s_n}, lie on a regular grid. The latent variable z_i := z(s_i) ∈ R is modeled by

z_i = µ(s_i) + η_i,  ∀ 1 ≤ i ≤ n,   (7.1)

where s_i ∈ S ⊂ R^D is the i-th site location. The mean function µ : R^D → R is defined as µ(s_i) = f(s_i)^T β, where f(s_i) = (f_1(s_i), · · · , f_p(s_i))^T ∈ R^p is a known regression function and β = (β_1, · · · , β_p)^T ∈ R^p is an unknown vector of regression coefficients. We define η = (η_1, · · · , η_n)^T ∈ R^n as a zero-mean Gaussian Markov random field (GMRF) [54], denoted by η ∼ N(0, Q_{η|θ}^{-1}), where the inverse covariance matrix (or precision matrix) Q_{η|θ} ∈ R^{n×n} is a function of a hyperparameter vector θ ∈ R^m.

There exist many different choices of the GMRF (i.e., of the precision matrix Q_{η|θ}) [54]. For instance, we can choose one with the full conditionals

E(η_i | η_{−i}, θ) = (1/(4 + a²)) ( 2a ∑_{j ∈ N_1(i)} η_j − 2 ∑_{j ∈ N_2(i)} η_j − ∑_{j ∈ N_3(i)} η_j ),
Var(η_i | η_{−i}, θ) = 1/((4 + a²)κ),   (7.2)

where N_1(i), N_2(i), and N_3(i) denote, respectively, the four nearest (axial) neighbors, the four diagonal neighbors, and the four second-order axial neighbors of site i on the regular lattice (with obvious notation as shown in [54]). Fig. 7.1 displays the elements of the precision matrix related to a single location that explain (7.2). The hyperparameter vector is defined as θ = (κ, α)^T ∈ R²_{>0}, where α = a − 4. The resulting GMRF accurately represents a Gaussian random field with the Matérn covariance function [36]

C(r) = σ_f² (2^{1−ν}/Γ(ν)) (√(2ν) r/ℓ)^ν K_ν(√(2ν) r/ℓ),

where K_ν(·) is a modified Bessel function [51], with order ν = 1, bandwidth ℓ = 1/√α, and vertical scale σ_f² = 1/(4πακ). The hyperparameter α > 0 guarantees the positive definiteness of the precision matrix Q_{η|θ}. In the case where α = 0, the resulting GMRF is a second-order polynomial intrinsic GMRF [54, 55]. Notice that the precision matrix is sparse, containing only a small number of non-zero elements. This property will be exploited for fast computation in the following sections.

Figure 7.1: Elements of the precision matrix Q related to a single location: 4 + a² at the center, −2a at the four nearest neighbors, 2 at the four diagonal neighbors, and 1 at the four second-order axial neighbors.

Example 7.1.1. Consider a spatial field of interest Q_* = [0, 100] × [0, 50]. We first divide the spatial field into a 100 × 50 regular grid with cells of unit area, which makes n_* = 5000. We then extend the field such that a 120 × 70 grid (i.e., n = 8400 sites) is constructed on the extended field Q = [−10, 110] × [−10, 60]. The precision matrix Q_{η|θ} introduced above is chosen with the regular lattice wrapped on a torus [54]. In this case, only 0.15% of the elements in the sparse matrix Q_{η|θ} are non-zero. The numerically generated fields with the mean function µ(s_i) = β = 20 and different values of the hyperparameter vector θ = (κ, α)^T are shown in Fig. 7.2.

Figure 7.2: Numerically generated spatial fields defined in (7.1) with µ(s_i) = β = 20 and Q_{η|θ} constructed using (7.2), with hyperparameters (a) θ = (4, 0.0025)^T, (b) θ = (1, 0.01)^T, and (c) θ = (0.25, 0.04)^T.
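As a concrete illustration of Example 7.1.1, the following sketch (not the authors' code; the row-major ordering of the lattice sites and the helper name are assumptions) assembles the torus-wrapped precision matrix from the stencil of Fig. 7.1 and confirms its sparsity:

import numpy as np
import scipy.sparse as sp

def torus_precision(nrow, ncol, kappa, alpha):
    """Precision matrix Q_{eta|theta} for a regular nrow x ncol lattice wrapped on a torus."""
    a = alpha + 4.0
    stencil = {(0, 0): 4.0 + a**2,                                                  # center
               (0, 1): -2.0*a, (0, -1): -2.0*a, (1, 0): -2.0*a, (-1, 0): -2.0*a,    # nearest
               (1, 1): 2.0, (1, -1): 2.0, (-1, 1): 2.0, (-1, -1): 2.0,              # diagonal
               (0, 2): 1.0, (0, -2): 1.0, (2, 0): 1.0, (-2, 0): 1.0}                # second-order axial
    rows, cols, vals = [], [], []
    for i in range(nrow):
        for j in range(ncol):
            k = i * ncol + j
            for (di, dj), w in stencil.items():
                k2 = ((i + di) % nrow) * ncol + (j + dj) % ncol                     # wrap around the torus
                rows.append(k); cols.append(k2); vals.append(kappa * w)
    n = nrow * ncol
    return sp.csr_matrix((vals, (rows, cols)), shape=(n, n))

Q = torus_precision(70, 120, kappa=1.0, alpha=0.01)     # n = 8400 sites
print(Q.nnz / Q.shape[0]**2)                            # about 0.0015, i.e., 0.15% non-zeros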
7.1.2 Mobile sensor network

Consider N spatially distributed mobile sensing agents, indexed by i ∈ I = {1, · · · , N}, sampling from the n_* spatial sites in S_*. The agents are equipped with identical sensors and sample at times t ∈ Z_{>0}. At time t, agent i takes a noise-corrupted measurement at its current location q_{t,i} ∈ S_*, i.e.,

y_{t,i} = z(q_{t,i}) + ε_{t,i},  ε_{t,i} i.i.d. ∼ N(0, σ_w²),

where the measurement errors are assumed to be independent and identically distributed (i.i.d.). The noise level σ_w² > 0 is assumed to be known. For notational simplicity, we denote all agents' locations at time t by q_t = (q_{t,1}^T, · · · , q_{t,N}^T)^T and the observations made by all agents at time t by y_t = (y_{t,1}, · · · , y_{t,N})^T. Furthermore, we denote the collection of agents' locations and the collective observations from time 1 to t by q_{1:t} = (q_1^T, · · · , q_t^T)^T and y_{1:t} = (y_1^T, · · · , y_t^T)^T, respectively.

7.2 Bayesian Predictive Inference

In this section, we propose a Bayesian inference approach to make predictive inferences of the spatial field z_* ∈ R^{n_*}. First, we assign the vector of regression coefficients β ∈ R^p a Gaussian prior, namely β ∼ N(0, T^{-1}), where the precision matrix T ∈ R^{p×p} is often chosen as a diagonal matrix with small diagonal elements when no prior information is available. Hence, the distribution of the latent variables z given β and the hyperparameter vector θ is Gaussian, i.e., z | β, θ ∼ N(Fβ, Q_{η|θ}^{-1}), where F = (f(s_1), · · · , f(s_n))^T ∈ R^{n×p}. For notational simplicity, we denote the full latent field of dimension n + p by x = (z^T, β^T)^T. Then, for a given hyperparameter vector θ, the distribution π(x|θ) is Gaussian and is obtained by

π(x|θ) = π(z|β, θ) π(β) ∝ exp( −(1/2)(z − Fβ)^T Q_{η|θ} (z − Fβ) − (1/2) β^T T β ) = exp( −(1/2) x^T Q_{x|θ} x ),

where the precision matrix Q_{x|θ} ∈ R^{(n+p)×(n+p)} is defined by

Q_{x|θ} = [ Q_{η|θ}, −Q_{η|θ} F ; −F^T Q_{η|θ}, F^T Q_{η|θ} F + T ].

By the matrix inversion lemma, the covariance matrix Σ_{x|θ} ∈ R^{(n+p)×(n+p)} can be obtained as

Σ_{x|θ} = Q_{x|θ}^{-1} = [ Q_{η|θ}^{-1} + F T^{-1} F^T, F T^{-1} ; (F T^{-1})^T, T^{-1} ].

At time t ∈ Z_{>0}, we have a collection of observational data y_{1:t} ∈ R^{Nt} obtained by the mobile sensing agents over time. Let A_{1:t} = (A_1, · · · , A_t) ∈ R^{(n+p)×Nt}, where A_τ ∈ R^{(n+p)×N} is defined by

(A_τ)_{ij} = 1, if s_i = q_{τ,j};  0, otherwise.

Then the covariance matrix of y_{1:t} can be obtained as R_{1:t} = A_{1:t}^T Σ_{x|θ} A_{1:t} + P_{1:t}, where P_{1:t} = σ_w² I ∈ R^{Nt×Nt}. By Gaussian process regression [51], the full conditional distribution of x is also Gaussian, i.e., x | θ, y_{1:t} ∼ N(µ_{x|θ,y_{1:t}}, Σ_{x|θ,y_{1:t}}), where

Σ_{x|θ,y_{1:t}} = Σ_{x|θ} − Σ_{x|θ} A_{1:t} R_{1:t}^{-1} A_{1:t}^T Σ_{x|θ},
µ_{x|θ,y_{1:t}} = Σ_{x|θ} A_{1:t} R_{1:t}^{-1} y_{1:t}.   (7.3)
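The block structure of Q_{x|θ} and Σ_{x|θ} above is straightforward to form and to check numerically. The following toy sketch (illustrative only; the randomly generated stand-in matrices are assumptions, not the model used in the simulations) assembles the joint precision and verifies the stated blocks of its inverse:

import numpy as np

def joint_precision(Q_eta, F, T):
    """Q_{x|theta} for x = (z^T, beta^T)^T with z | beta ~ N(F beta, Q_eta^{-1}) and beta ~ N(0, T^{-1})."""
    QF = Q_eta @ F
    return np.block([[Q_eta,            -QF],
                     [-QF.T, F.T @ QF + T]])

n, p = 6, 2
rng = np.random.default_rng(1)
A0 = rng.standard_normal((n, n))
Q_eta = A0 @ A0.T + n * np.eye(n)                  # a positive definite stand-in for Q_{eta|theta}
F = rng.standard_normal((n, p))
T = 1e-2 * np.eye(p)

Sigma = np.linalg.inv(joint_precision(Q_eta, F, T))
assert np.allclose(Sigma[:n, :n], np.linalg.inv(Q_eta) + F @ np.linalg.inv(T) @ F.T)
assert np.allclose(Sigma[:n, n:], F @ np.linalg.inv(T))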
The posterior distribution of the hyperparameter vector θ can be obtained via π(θ | y_{1:t}) ∝ π(y_{1:t} | θ) π(θ), where the log likelihood function is given by

log π(y_{1:t} | θ) = −(1/2) y_{1:t}^T R_{1:t}^{-1} y_{1:t} − (1/2) log det R_{1:t} − (Nt/2) log 2π.   (7.4)

If a discrete prior on the hyperparameter vector θ is chosen with support Θ = {θ_1, · · · , θ_L}, the posterior predictive distribution π(x | y_{1:t}) can be obtained by

π(x | y_{1:t}) = ∑_ℓ π(x | θ_ℓ, y_{1:t}) π(θ_ℓ | y_{1:t}).   (7.5)

The predictive mean and variance then follow as

µ_{x_i | y_{1:t}} = ∑_ℓ µ_{x_i | θ_ℓ, y_{1:t}} π(θ_ℓ | y_{1:t}),
σ²_{x_i | y_{1:t}} = ∑_ℓ ( σ²_{x_i | θ_ℓ, y_{1:t}} + (µ_{x_i | θ_ℓ, y_{1:t}} − µ_{x_i | y_{1:t}})² ) π(θ_ℓ | y_{1:t}),   (7.6)

where µ_{x_i | θ_ℓ, y_{1:t}} is the i-th element of µ_{x | θ_ℓ, y_{1:t}} and σ²_{x_i | θ_ℓ, y_{1:t}} is the i-th diagonal element of Σ_{x | θ_ℓ, y_{1:t}}.

Remark 7.2.1. The discrete prior π(θ) greatly reduces the computational complexity in that it allows the summation in (7.5) instead of the numerical integration that would be required with a continuous prior distribution. However, the computation of the full conditional distribution π(x | θ, y_{1:t}) in (7.3) and of the likelihood π(y_{1:t} | θ) in (7.4) requires the inversion of the covariance matrix R_{1:t}, whose size grows as time t increases. Thus, the running time grows quickly as new observations are collected, and the computation soon becomes intractable.

7.3 Sequential Bayesian Inference

In this section, we exploit the sparsity of the precision matrix and propose a sequential Bayesian prediction algorithm that can be performed in constant time and is fast enough even for a very large spatial field.

7.3.1 Update of the full conditional distribution

First, we rewrite the full conditional distribution π(x | θ, y_{1:t}) in terms of the sparse precision matrix Q_{x|θ} as x | θ, y_{1:t} ∼ N(µ_{x|θ,y_{1:t}}, Q_{x|θ,y_{1:t}}^{-1}), where

Q_{x|θ,y_{1:t}} = Q_{x|θ} + A_{1:t} P_{1:t}^{-1} A_{1:t}^T,
µ_{x|θ,y_{1:t}} = Q_{x|θ,y_{1:t}}^{-1} A_{1:t} P_{1:t}^{-1} y_{1:t}.   (7.7)

From here on, we write Q_{t|θ} = Q_{x|θ,y_{1:t}} and µ_{t|θ} = µ_{x|θ,y_{1:t}} for notational simplicity. Notice that (7.7) can be represented by the recursion

Q_{t|θ} = Q_{t−1|θ} + (1/σ_w²) ∑_{i=1}^{N} u_{t,i} u_{t,i}^T,
b_t = b_{t−1} + (1/σ_w²) ∑_{i=1}^{N} u_{t,i} y_{t,i},   (7.8)

where b_t = Q_{t|θ} µ_{t|θ}, with initial conditions Q_{0|θ} = Q_{x|θ,y_{1:0}} = Q_{x|θ} and b_0 = 0. In (7.8), we have defined u_{t,i} ∈ R^{n+p} by

(u_{t,i})_j = 1, if s_j = q_{t,i};  0, otherwise.

Lemma 7.3.1. For a given θ ∈ Θ, the full conditional mean and precision, i.e., µ_{t|θ} and Q_{t|θ}, can be updated in short constant time given Q_{t−1|θ} and b_{t−1}.

Proof. The updates of Q_{t|θ} and b_t in (7.8) can obviously be computed in constant time. The mean µ_{t|θ} can then be obtained by solving the linear system Q_{t|θ} µ_{t|θ} = b_t. Due to the sparse structure of Q_{t|θ}, this operation can be done in a very short time. Moreover, notice that Q_{t|θ} and Q_{t−1|θ} have the same sparsity structure, and hence the computational complexity remains fixed.

From Lemma 7.3.1, we can compute µ_{x_i | θ, y_{1:t}} in (7.6) in constant time. In order to find σ²_{x_i | θ, y_{1:t}} in (7.6), we need Σ_{x|θ,y_{1:t}}, which requires the inversion of Q_{t|θ}. The inversion of a large matrix (even a sparse one) is undesirable. However, notice that only the diagonal elements of Q_{t|θ}^{-1} are needed. Following the Sherman-Morrison formula (see Appendix A.2.2) and using (7.8), σ²_{x_i | θ, y_{1:t}} can be obtained exactly via

diag(Q_{t|θ}^{-1}) = diag( (Q_{t−1|θ} + (1/σ_w²) ∑_{i=1}^{N} u_{t,i} u_{t,i}^T)^{-1} )
                 = diag(Q_{t−1|θ}^{-1}) − ∑_{i=1}^{N} (h_{t,i|θ} ◦ h_{t,i|θ}) / (σ_w² + u_{t,i}^T h_{t,i|θ}),   (7.9)

where h_{t,i|θ} = B_{t,i−1|θ}^{-1} u_{t,i}, B_{t,i|θ} = Q_{t−1|θ} + (1/σ_w²) ∑_{j=1}^{i} u_{t,j} u_{t,j}^T with B_{t,0|θ} = Q_{t−1|θ}, and ◦ denotes the element-wise product. In this way, the computation can be done efficiently in constant time.
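A minimal sketch of one such per-measurement update (an illustration under the stated model, not the authors' code; SciPy's general sparse solver stands in for whatever sparse factorization is actually used) is:

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def measurement_update(Q, b, diag_Qinv, k, y, sigma_w):
    """One observation y at grid index k: update Q_{t|theta}, b_t, and diag(Q_{t|theta}^{-1})
    following (7.8)-(7.9).  Q is the sparse precision matrix in CSC format."""
    u = np.zeros(Q.shape[0]); u[k] = 1.0
    h = spsolve(Q, u)                                        # h = Q^{-1} u, one sparse solve
    diag_Qinv = diag_Qinv - (h * h) / (sigma_w**2 + h[k])    # Sherman-Morrison update of diag(Q^{-1})
    Q = (Q + sp.csc_matrix(([1.0 / sigma_w**2], ([k], [k])), shape=Q.shape)).tocsc()
    b = b + (y / sigma_w**2) * u
    return Q, b, diag_Qinv

# After the N measurements of time step t have been processed: mu_t = spsolve(Q, b).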
7.3.2 Update of the likelihood

Next, we derive the update rule for the log likelihood function. We have the following proposition.

Proposition 7.3.2. The log likelihood function log π(y_{1:t} | θ) in (7.4) can be obtained by

log π(y_{1:t} | θ) = c_t + g_{t|θ} + (1/2) b_t^T µ_{t|θ} − (Nt/2) log(2π σ_w²),   (7.10)

where

c_t = c_{t−1} − (1/(2σ_w²)) ∑_{i=1}^{N} y_{t,i}²,  c_0 = 0,
g_{t|θ} = g_{t−1|θ} − (1/2) ∑_{i=1}^{N} log(1 + (1/σ_w²) u_{t,i}^T h_{t,i|θ}),  g_{0|θ} = 0,

with h_{t,i|θ} defined in (7.9).

Proof. The inverse of the covariance matrix R_{1:t} can be obtained by

R_{1:t}^{-1} = (A_{1:t}^T Q_{0|θ}^{-1} A_{1:t} + P_{1:t})^{-1}
            = P_{1:t}^{-1} − P_{1:t}^{-1} A_{1:t}^T (Q_{0|θ} + A_{1:t} P_{1:t}^{-1} A_{1:t}^T)^{-1} A_{1:t} P_{1:t}^{-1}
            = P_{1:t}^{-1} − P_{1:t}^{-1} A_{1:t}^T Q_{t|θ}^{-1} A_{1:t} P_{1:t}^{-1}.

Similarly, the log determinant of the covariance matrix R_{1:t} can be obtained by

log det R_{1:t} = log det(A_{1:t}^T Q_{0|θ}^{-1} A_{1:t} + P_{1:t})
              = log det(I + (1/σ_w²) A_{1:t}^T Q_{0|θ}^{-1} A_{1:t}) + Nt log σ_w²
              = log det(Q_{0|θ} + (1/σ_w²) ∑_{τ=1}^{t} ∑_{i=1}^{N} u_{τ,i} u_{τ,i}^T) − log det(Q_{0|θ}) + Nt log σ_w²
              = ∑_{τ=1}^{t} ∑_{i=1}^{N} log(1 + (1/σ_w²) u_{τ,i}^T h_{τ,i|θ}) + Nt log σ_w².

Hence, we have

log π(y_{1:t} | θ) = −(1/2) y_{1:t}^T R_{1:t}^{-1} y_{1:t} − (1/2) log det R_{1:t} − (Nt/2) log 2π
                = −(1/2) y_{1:t}^T P_{1:t}^{-1} y_{1:t} + (1/2) b_t^T µ_{t|θ} − (1/2) ∑_{τ=1}^{t} ∑_{i=1}^{N} log(1 + (1/σ_w²) u_{τ,i}^T h_{τ,i|θ}) − (Nt/2) log(2π σ_w²).

Lemma 7.3.3. For a given θ ∈ Θ, the log likelihood function log π(y_{1:t} | θ) can be computed in short constant time.

Proof. The result follows directly from Proposition 7.3.2.

7.3.3 Update of the predictive distribution

Combining the results in Lemmas 7.3.1 and 7.3.3 with (7.5) and (7.6), we summarize our results in the following theorem.

Theorem 7.3.4. The predictive distribution in (7.5) (or the predictive mean and variance in (7.6)) can be obtained in constant time as time t increases.

We summarize the proposed sequential Bayesian prediction algorithm in Table 7.1.

7.4 Adaptive Sampling

In the previous section, we designed a sequential Bayesian prediction algorithm for estimating the scalar field at time t. In this section, we propose an adaptive sampling strategy for finding the most informative sampling locations at time t + 1 for the mobile sensing agents, in order to improve the quality of prediction and reduce the uncertainty in the hyperparameters simultaneously.

In our previous work [66], we proposed to use the conditional entropy H(z_* | θ = θ̂_t, y_{1:t+1}) as an optimality criterion, where θ̂_t = arg max_θ π(θ | y_{1:t}) is the maximum a posteriori (MAP) estimate based on the cumulative observations up to the current time t. Although this approach greatly simplifies the computation, it does not account for the uncertainty in estimating the hyperparameter vector θ. In this chapter, we propose instead to use the conditional entropy H(z_*, θ | y_{t+1}, y_{1:t}), which represents the uncertainty remaining in both z_* and θ given the future measurements y_{t+1}. Notice that the measurements y_{1:t} have already been observed and are treated as constants. The criterion can be decomposed as

H(z_*, θ | y_{t+1}, y_{1:t}) = H(z_* | θ, y_{t+1}, y_{1:t}) + H(θ | y_{t+1}, y_{1:t})
                          = H(z_* | θ, y_{t+1}, y_{1:t}) + H(y_{t+1} | θ, y_{1:t}) + H(θ | y_{1:t}) − H(y_{t+1} | y_{1:t}).

Notice that we have the following Gaussian distributions (the means will not be used and hence are not shown):

z_* | θ, y_{t+1}, y_{1:t} ∼ N(·, Σ_{x_*|θ, y_{1:t+1}}),
y_{t+1} | θ, y_{1:t} ∼ N(·, Σ_{y_{t+1}|θ, y_{1:t}} + σ_w² I),
y_{t+1} | y_{1:t} ≈ N(·, Σ_{y_{t+1}|y_{1:t}} + σ_w² I),

in which the last distribution is approximated using (7.6).
Notice that the approximation is used here to avoid a numerical integration over the random vector y_{t+1}, which would have to be carried out using Monte Carlo methods. Moreover, the entropy H(θ | y_{1:t}) = c is a constant since y_{1:t} is known. Since the entropy of a multivariate Gaussian distribution has a closed-form expression [14], we have

H(z_*, θ | y_{t+1}, y_{1:t}) = ∑_ℓ (1/2) log( (2πe)^{n_*} det(Σ_{x_*|θ_ℓ, y_{1:t+1}}) ) π(θ_ℓ | y_{1:t})
                           + ∑_ℓ (1/2) log( (2πe)^{N} det(Σ_{y_{t+1}|θ_ℓ, y_{1:t}} + σ_w² I) ) π(θ_ℓ | y_{1:t})
                           − (1/2) log( (2πe)^{N} det(Σ_{y_{t+1}|y_{1:t}} + σ_w² I) ) + c.

It can also be shown that

log det(Σ_{x_*|θ_ℓ, y_{1:t+1}}) = log det( (Q_{t+1|θ_ℓ}^{-1})(S_*) ) = log det( (Q_{t+1|θ_ℓ})(−S_*) ) − log det(Q_{t+1|θ_ℓ}),

where A(S_*) denotes the submatrix of A formed by the first n_* rows and columns (recall that S_* = {s_1, · · · , s_{n_*}}), and A(−S_*) denotes the submatrix obtained by removing those rows and columns. Notice that the term log det( (Q_{t+1|θ_ℓ})(−S_*) ) is a constant since the agents only sample at sites in S_*. Hence, the optimal sampling locations at time t + 1 can be determined by solving the following optimization problem:

q_{t+1}^* = arg min_{q_{t+1,i} ∈ R_{t,i}} H(z_*, θ | y_{t+1}, y_{1:t})
          = arg min_{q_{t+1,i} ∈ R_{t,i}} ∑_ℓ ( −log det(Q_{t+1|θ_ℓ}) + log det(Σ_{y_{t+1}|θ_ℓ, y_{1:t}} + σ_w² I) ) π(θ_ℓ | y_{1:t}) − log det(Σ_{y_{t+1}|y_{1:t}} + σ_w² I),

where R_{t,i} = {s | ‖s − q_{t,i}‖ ≤ r, s ∈ S_*} is the reachable set of agent i at time t, in which r ∈ R_{>0} is the maximum distance an agent can move between time instances. This combinatorial optimization problem can be solved using a greedy algorithm, i.e., by finding sub-optimal sampling locations for the agents in sequence.
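A sketch of this greedy selection (illustrative only, not the authors' code; entropy_objective is a hypothetical helper that evaluates the weighted log-determinant criterion above, up to its constant terms, for a candidate set of next sampling sites) is:

import numpy as np

def reachable_set(q_i, sites, r):
    """Indices of the sites of S_* within distance r of agent i's current location q_i."""
    return [s for s in range(len(sites)) if np.linalg.norm(sites[s] - q_i) <= r]

def greedy_sampling(q_now, sites, r, entropy_objective):
    """Pick next sampling sites for the agents one at a time (sub-optimal greedy minimization)."""
    chosen = []
    for q_i in q_now:                                   # agents are treated in sequence
        best_site, best_val = None, np.inf
        for s in reachable_set(q_i, sites, r):
            val = entropy_objective(chosen + [s])       # criterion with earlier choices held fixed
            if val < best_val:
                best_site, best_val = s, val
        chosen.append(best_site)
    return chosen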
7.5 Simulation

In this section, we demonstrate the effectiveness of the proposed sequential Bayesian inference algorithm and the adaptive sampling strategy through a numerical experiment. Consider the spatial field introduced in Example 7.1.1. The mean function is a constant β = 20. We choose the precision matrix Q_{η|θ} with hyperparameters α = 0.01, equivalent to a bandwidth ℓ = 1/√α = 10, and κ = 1, equivalent to a vertical scale σ_f² = 1/(4πακ) ≈ 8. The numerically generated field is shown in Fig. 7.2-(b). The precision matrix T of β is chosen to be 10^{-4}. The measurement noise level σ_w = 0.2 is assumed to be known. A discrete uniform prior distribution on θ is selected, with the support shown in Fig. 7.3. N = 5 mobile sensing agents take measurements at times t ∈ Z_{>0}, starting from the locations shown in Fig. 7.4-(b) (in white dots). The maximum distance each agent can travel between time instances is chosen to be r = 5.

Fig. 7.4 shows the predicted fields and the prediction error variances at times t = 1, 5, 10, 20. The trajectories of the agents are shown in white circles, with the current locations shown in white dots. It can be seen that the agents try to cover the field of interest as time evolves. The predicted field (the predictive mean) gets closer to the true field (see Fig. 7.2-(b)), and the prediction error variances become smaller as more observations are collected. Fig. 7.3 shows the posterior distribution of the hyperparameters in θ. Clearly, as more measurements are obtained, this posterior distribution becomes peaked at the true value (1, 0.01). Fig. 7.5-(a) shows the predictive distribution of the estimated mean β as time evolves. In Fig. 7.5-(b), we can see that the RMS error, computed via

rms(t) = sqrt( (1/n_*) ∑_{i=1}^{n_*} (µ_{z_i|y_{1:t}} − z_i)² ),

decreases as time increases, which shows the effectiveness of the proposed scheme. Most importantly, the computation time at each time step does not grow as the number of measurements increases. This fixed running time, using Matlab R2009b (MathWorks) on a Mac (2.4 GHz Intel Core 2 Duo processor), is about 10 seconds, which is fast enough for real-world implementation.

Table 7.1: Sequential Bayesian predictive inference.
Input: (1) prior distribution of θ ∈ Θ, i.e., π(θ)
Output: (1) predictive means {µ_{x_i|y_{1:t}}}_{i=1}^{n_*}; (2) predictive variances {σ²_{x_i|y_{1:t}}}_{i=1}^{n_*}
Initialization:
1: initialize b = 0, c = 0
2: for θ ∈ Θ do
3:   initialize Q_θ, g_θ = 0
4:   compute diag(Q_θ^{-1})
5: end for
At time t ∈ Z_{>0}, do:
1: for 1 ≤ i ≤ N do
2:   obtain the new observation y_{t,i} collected at the current location q_{t,i}
3:   find the index k corresponding to q_{t,i}, and set u = e_k
4:   update b = b + (y_{t,i}/σ_w²) u
5:   update c = c − (1/(2σ_w²)) y_{t,i}²
6:   for θ ∈ Θ do
7:     compute h_θ = Q_θ^{-1} u
8:     update diag(Q_θ^{-1}) = diag(Q_θ^{-1}) − (h_θ ◦ h_θ)/(σ_w² + u^T h_θ)
9:     update Q_θ = Q_θ + (1/σ_w²) u u^T
10:    update g_θ = g_θ − (1/2) log(1 + (1/σ_w²) u^T h_θ)
11:   end for
12: end for
13: for θ ∈ Θ do
14:   compute µ_θ = Q_θ^{-1} b
15:   compute the log likelihood via log π(y_{1:t} | θ) = c + g_θ + (1/2) b^T µ_θ (up to a constant that cancels in the normalization below)
16: end for
17: compute the posterior distribution via π(θ | y_{1:t}) ∝ π(y_{1:t} | θ) π(θ)
18: compute the predictive mean via µ_{x_i|y_{1:t}} = ∑_ℓ (µ_{θ_ℓ})_i π(θ_ℓ | y_{1:t})
19: compute the predictive variance via σ²_{x_i|y_{1:t}} = ∑_ℓ ( (diag(Q_{θ_ℓ}^{-1}))_i + ((µ_{θ_ℓ})_i − µ_{x_i|y_{1:t}})² ) π(θ_ℓ | y_{1:t})

Figure 7.3: Posterior distributions of θ, i.e., π(θ | y_{1:t}), at (a) t = 1, (b) t = 5, (c) t = 10, and (d) t = 20.

Figure 7.4: Predicted fields at (a) t = 1, (c) t = 5, (e) t = 10, and (g) t = 20. Prediction error variances at (b) t = 1, (d) t = 5, (f) t = 10, and (h) t = 20.

Figure 7.5: (a) Estimated β, and (b) root mean square error.

Chapter 8

Conclusion and Future Work

In this chapter, we briefly summarize the key contributions presented in this dissertation and propose some promising directions for future work.

8.1 Conclusion

In Chapter 3, we presented a novel class of self-organizing sensing agents that learn an anisotropic, spatio-temporal Gaussian process using noisy measurements and move in order to improve the quality of the estimated covariance function. The ML estimator was used to estimate the hyperparameters of the unknown covariance function, and the prediction of the field of interest was obtained based on the ML estimates. An optimal navigation strategy was proposed to minimize an information-theoretic cost function of the Fisher information matrix for the estimated hyperparameters. The proposed scheme was applied to both a spatio-temporal Gaussian process and a true advection-diffusion field.
Simulation studies indicated the effectiveness of the proposed scheme and its adaptability to time-varying covariance functions.

In Chapter 4, for spatio-temporal Gaussian processes, we justified prediction based on truncated observations for mobile sensor networks. In particular, we presented a theoretical foundation of Gaussian processes with truncated observations. Centralized and distributed navigation strategies were proposed to minimize the average of the prediction error variances at target points that can be arbitrarily chosen by a user. Simulation results demonstrated that mobile sensing agents under the distributed navigation strategy produce an emergent, collective behavior that maintains communication connectivity and are coordinated to improve the quality of the collective prediction capability.

In Chapter 5, we formulated a fully Bayesian approach for spatio-temporal Gaussian process regression under practical conditions. We designed sequential Bayesian prediction algorithms to compute exact predictive distributions in constant time as the number of observations increases. An adaptive sampling strategy was also provided to improve the quality of prediction. Simulation results showed the practical usefulness of the proposed theoretically-correct algorithms in the context of environmental monitoring by mobile sensor networks.

In Chapter 6, we introduced a new class of Gaussian processes with built-in GMRFs for modeling a wide range of environmental fields. Gaussian process regression for the predictive statistics at any point of interest was provided, and a sequential field prediction algorithm with fixed complexity was proposed to deal with sequentially sampled observations. For a special case with compactly supported weighting functions, we proposed a distributed field prediction algorithm in which the prediction can be computed via the Jacobi over-relaxation algorithm and discrete-time average consensus.

In Chapter 7, we discussed the problem of predicting a large-scale spatial field using successive noisy measurements obtained by a multi-agent system. We modeled the spatial field of interest using a GMRF and designed a sequential prediction algorithm for computing the exact predictive inference from a Bayesian point of view. The proposed algorithm is computationally efficient and scalable as the number of measurements increases. We also designed an adaptive sampling algorithm for the agents to find sub-optimal sampling locations in order to minimize the prediction error and reduce the uncertainty in the hyperparameters simultaneously.

8.2 Future Work

In the long term, we plan to expand our current work and explore the following directions:

• consider optimal sampling strategies for mobile sensing agents with complicated vehicle dynamics;

• consider the optimal coordination of the mobile sensor network subject to energy constraints;

• develop approximate Bayesian prediction algorithms for resource-constrained mobile robots, for example using integrated nested Laplace approximations;

• extend the work on spatial modeling using GMRFs to deal with more general spatio-temporal processes;

• consider the effects of localization error on the posterior predictive distribution;

• implement the developed algorithms in experiments using robotic boats currently under development.
Appendix A

Mathematical Background

A.1 Gaussian Identities

The multivariate Gaussian distribution of a random vector x ∈ R^n (i.e., x ∼ N(µ, Σ)) has the joint probability density function (pdf)

p(x; µ, Σ) = (2π)^{-n/2} |Σ|^{-1/2} exp( −(1/2)(x − µ)^T Σ^{-1} (x − µ) ),

where µ ∈ R^n is the mean vector and Σ ∈ R^{n×n} is the covariance matrix. Now, suppose x consists of two disjoint subsets x_a and x_b, i.e., x = (x_a^T, x_b^T)^T. The corresponding mean vector µ and covariance matrix Σ can be written as

µ = (µ_a^T, µ_b^T)^T,  Σ = [ Σ_aa, Σ_ab ; Σ_ba, Σ_bb ],

where Σ_ab = Σ_ba^T due to the symmetry of Σ. Then the marginal distribution of x_a is given by x_a ∼ N(µ_a, Σ_aa), and the conditional distribution of x_a given x_b is x_a | x_b ∼ N(µ_{a|b}, Σ_{a|b}), where

µ_{a|b} = µ_a + Σ_ab Σ_bb^{-1} (x_b − µ_b),
Σ_{a|b} = Σ_aa − Σ_ab Σ_bb^{-1} Σ_ba.

A.2 Matrix Inversion Lemma

Matrices can be inverted blockwise by using the following analytic inversion formula:

[ A, B ; B^T, C ]^{-1} = [ A^{-1} + A^{-1}B(C − B^T A^{-1} B)^{-1} B^T A^{-1},  −A^{-1}B(C − B^T A^{-1} B)^{-1} ;
                          −(C − B^T A^{-1} B)^{-1} B^T A^{-1},  (C − B^T A^{-1} B)^{-1} ],

where A, B, and C are matrix sub-blocks of compatible sizes. The matrices A and C − B^T A^{-1} B must be non-singular.

A.2.1 Woodbury identity

The Woodbury matrix identity is

(A + UCV)^{-1} = A^{-1} − A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1},

where A, U, C, and V denote matrices of appropriate sizes.

A.2.2 Sherman-Morrison formula

Suppose A ∈ R^{n×n} is invertible and u, v ∈ R^n are vectors. Assuming that 1 + v^T A^{-1} u ≠ 0, the Sherman-Morrison formula states that

(A + u v^T)^{-1} = A^{-1} − (A^{-1} u v^T A^{-1}) / (1 + v^T A^{-1} u).

A.3 Generating Gaussian processes

In order to implement the algorithms in simulation studies, we need to generate multivariate Gaussian samples from N(µ, Σ) with arbitrary mean µ and covariance matrix Σ. In what follows, we introduce two approaches.

A.3.1 Cholesky decomposition

Given an arbitrary mean µ and a positive definite covariance matrix Σ, the algorithm that generates multivariate Gaussian samples is shown in Table A.1.

Table A.1: Generating multivariate Gaussian samples by Cholesky decomposition.
1: compute the Cholesky decomposition of the positive definite symmetric covariance matrix, Σ = LL^T, where L is a lower triangular matrix
2: generate u ∼ N(0, I) by multiple separate calls to a scalar Gaussian generator
3: compute x = µ + Lu, which has the desired normal distribution with mean µ and covariance matrix L E[uu^T] L^T = LL^T = Σ
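As a minimal sketch of Table A.1 (illustrative; not code from this dissertation):

import numpy as np

def sample_mvn_cholesky(mu, Sigma, rng=None):
    """Draw one sample x ~ N(mu, Sigma) via the Cholesky factor of Sigma (Table A.1)."""
    rng = np.random.default_rng() if rng is None else rng
    L = np.linalg.cholesky(Sigma)         # Sigma = L L^T with L lower triangular
    u = rng.standard_normal(len(mu))      # u ~ N(0, I)
    return mu + L @ u                     # Cov(L u) = L L^T = Sigma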
A.3.2 Circulant embedding

Consider a 1-D zero-mean stationary Gaussian process z(x) with covariance function C(x, x′). The covariance matrix Σ of z(x) sampled on the equispaced grid Ω = {x^{(1)}, · · · , x^{(n)}} has entries (Σ)_{pq} = C(|x^{(p)} − x^{(q)}|). Notice that the covariance matrix Σ is a positive semi-definite symmetric Toeplitz matrix, which can be characterized by its first row r = row_1(Σ). The key idea behind the circulant embedding method is to construct a circulant matrix S that contains Σ as its upper-left submatrix. The reason for seeking a circulant embedding is that, being an m × m circulant matrix, S has the eigendecomposition S = (1/m) F Λ F^H, where F is the standard FFT matrix of size m with entries (F)_{pq} = exp(2πipq/m), F^H is the conjugate transpose of F, and Λ is a diagonal matrix whose diagonal entries form the vector s̃ = Fs (s being the first row of S). Given a positive semi-definite circulant extension S of Σ, the algorithm that generates realizations of z(x) sampled on Ω is shown in Table A.2. The extension to multidimensional cases can be found in [20].

Table A.2: Generating multivariate Gaussian samples by circulant embedding.
1: compute via the FFT the discrete Fourier transform s̃ = Fs and form the vector (s̃/m)^{1/2}
2: generate a vector ε = ε_1 + iε_2 of dimension m, with ε_1 ∼ N(0, I) and ε_2 ∼ N(0, I) being independent real random vectors
3: compute the vector ẽ = ε ◦ (s̃/m)^{1/2}
4: compute via the FFT the discrete Fourier transform e = Fẽ; the real and imaginary parts of the first n entries of e yield two independent realizations of z(x) on Ω

BIBLIOGRAPHY

[1] H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723, 1974.
[2] I. F. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci. Wireless sensor networks: a survey. Computer Networks, 38(4):393–422, 2002.
[3] D. P. Bertsekas, W. W. Hager, and O. L. Mangasarian. Nonlinear programming. Athena Scientific, Belmont, Mass., 1999.
[4] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and distributed computation: numerical methods. Prentice Hall, Englewood Cliffs, 1999.
[5] C. M. Bishop. Pattern recognition and machine learning. Springer, New York, 2006.
[6] F. Bullo, J. Cortés, and S. Martínez. Distributed control of robotic networks. Applied Mathematics Series. Princeton University Press, 2009.
[7] C. G. Cassandras and W. Li. Sensor networks and cooperative control. European Journal of Control, 11(4-5):436–463, 2005.
[8] J. Choi, J. Lee, and S. Oh. Biologically-inspired navigation strategies for swarm intelligence using spatial Gaussian processes. In Proceedings of the 17th International Federation of Automatic Control (IFAC) World Congress, 2008.
[9] J. Choi, J. Lee, and S. Oh. Swarm intelligence for achieving the global maximum using spatio-temporal Gaussian processes. In Proceedings of the 27th American Control Conference (ACC), 2008.
[10] J. Choi, S. Oh, and R. Horowitz. Distributed learning and cooperative control for multi-agent systems. Automatica, 45:2802–2814, 2009.
[11] V. N. Christopoulos and S. Roumeliotis. Adaptive sensing for instantaneous gas release parameter estimation. In Proceedings of the 2005 IEEE International Conference on Robotics and Automation, pages 4450–4456, 2005.
[12] Microsoft Corporation. Official website of Kinect for Xbox 360. http://www.xbox.com/en-US/kinect.
[13] J. Cortés. Distributed kriged Kalman filter for spatial estimation. IEEE Transactions on Automatic Control, 54(12):2816–2827, 2009.
[14] T. M. Cover and J. A. Thomas. Elements of information theory. John Wiley & Sons, Inc., Hoboken, New Jersey, 2nd edition, 2006.
[15] N. Cressie. Kriging nonstationary data. Journal of the American Statistical Association, 81(395):625–634, 1986.
[16] N. Cressie. Statistics for spatial data. A Wiley-Interscience Publication, John Wiley and Sons, Inc., 1991.
[17] N. Cressie and N. Verzelen. Conditional-mean least-squares fitting of Gaussian Markov random fields to Gaussian fields. Computational Statistics & Data Analysis, 52(5):2794–2807, 2008.
[18] N. Cressie and C. K. Wikle. Space-time Kalman filter. Encyclopedia of Environmetrics, 4:2045–2049, 2002.
[19] L. Devroye. Non-uniform random variate generation. Springer-Verlag, New York, 1986.
[20] C. R. Dietrich and G. N. Newsam. Fast and exact simulation of stationary Gaussian processes through circulant embedding of the covariance matrix. SIAM Journal on Scientific Computing, 18(4):1088–1107, 1997.
[21] A. F. Emery and A. V. Nenarokomov. Optimal experiment design. Measurement Science and Technology, 9(6):864–876, 1998.
[22] M. Gaudard, M. Karson, E. Linder, and D. Sinha. Bayesian spatial prediction. Environmental and Ecological Statistics, 6(2):147–171, 1999. 162 [23] M. Gibbs and D. J. C. MacKay. Efficient implementation of gaussian processes. Available electronically from http://www. cs. toronto. edu/mackay/gpros. ps. gz. Preprint, 1997. [24] T. Gneiting. Compactly supported correlation functions. Journal of Multivariate Analysis, 83(2):493–508, 2002. [25] R. Graham and J. Cort´s. Cooperative adaptive sampling of random fields with partially e known covariance. International Journal of Robust and Nonlinear Control, 1:1–2, 2009. [26] W. W. Hager and H. Zhang. A survey of nonlinear conjugate gradient methods. Pacific Journal of Optimization, 2(1):35–58, 2006. [27] L. Hartman and O. H¨ssjer. Fast kriging of large data sets with Gaussian Markov o random fields. Computational Statistics & Data Analysis, 52(5):2331–2349, 2008. [28] A. Jadbabaie, J. Lin, and A. S. Morse. Coordination of groups of mobile autonomous agents using nearest neighbor rules. IEEE Transactions on Automatic Control, 48(6):988–1001, 2003. [29] P. Kathirgamanathan, R. McKibbin, and R. I. McLachlan. Source term estimation of pollution from an instantaneous point source. Research Letters in the Information and Mathmatical Sciences, 3(1):59–67, 2002. [30] S. M. Kay. Fundamentals of statistical signal processing: estimation theory. Prentice Hall, Inc., 1993. [31] A. Krause, C. Guestrin, A. Gupta, and J. Kleinberg. Near-optimal sensor placements: maximizing information while minimizing communication cost. In Proceedings of the 5th international conference on Information processing in sensor networks, pages 2–10, 2006. [32] A. Krause, A. Singh, and C. Guestrin. Near-optimal sensor placements in Gaussian processes: theory, efficient algorithms and empirical studies. The Journal of Machine Learning Research, 9:235–284, 2008. [33] N Lawrence, M Seeger, and R Herbrich. Fast sparse gaussian process methods: The informative vector machine. Advances in neural information . . . , Jan 2003. [34] J. Le Ny and G.J. Pappas. On trajectory optimization for active sensing in Gaussian process models. In Decision and Control, 2009 held jointly with the 2009 28th Chinese 163 Control Conference. CDC/CCC 2009. Proceedings of the 48th IEEE Conference on, pages 6286–6292, 2010. [35] N. E. Leonard, D. A. Paley, F. Lekien, R. Sepulchre, D. M. Fratantoni, and R.E. Davis. Collective motion, sensor networks, and ocean sampling. Proceedings of the IEEE, 95(1):48–74, January 2007. [36] F. Lindgren, H. Rue, and J. Lindstr¨m. An explicit link between Gaussian fields and o Gaussian Markov random fields: the stochastic partial differential equation approach. Journal of the Royal Statistical Society: Series B, 73(4):423–498, 2011. [37] K. M. Lynch, I. B. Schwartz, P. Yang, and R. A. Freeman. Decentralized environmental modeling by mobile sensor networks. IEEE Transactions on Robotics, 24(3):710–724, June 2008. [38] D. J. C. MacKay. Introduction to gaussian processes. NATO ASI Series F Computer and Systems Sciences, 168:133–165, 1998. [39] M. Mandic and E. Franzzoli. Efficient sensor coverage for acoustic localization. In Proceedings of the 46th IEEE Conference on Decision and Control, pages 3597–3602, Dec 2007. [40] K. V. Mardia, C. Goodall, E. J. Redfern, and F. J. Alonso. The Kriged Kalman filter. Test, 7(2):217–282, 1998. [41] S. Mart´ ınez and F. Bullo. Optimal sensor placement and motion coordination for target tracking. Automatica, 42(4):661–668, 2006. [42] G. M. 
Mathews, H. Durrant-Whyte, and M. Prokopenko. Decentralised decision making in heterogeneous teams using anonymous optimisation. Robotics and Autonomous Systems, 57(3):310–320, 2009. [43] J. Nocedal and S. J. Wright. Numerical optimization. Springer, 1999. [44] D. J. Nott and W. T. M. Dunsmuir. Estimation of nonstationary spatial covariance structure. Biometrika, 89(4):819–829, 2002. [45] S. Oh, Y. Xu, and J. Choi. Explorative navigation of mobile sensor networks using sparse Gaussian processes. In Proceedings of the 49th IEEE Conference on Decision and Control (CDC), 2010. 164 [46] R. Olfati-Saber. Flocking for multi-agent dynamic systems: algorithms and theory. IEEE Transactions on Automatic Control, 51(3):401–420, 2006. [47] R. Olfati-Saber, J. A. Fax, and R. M. Murray. Consensus and cooperation in networked multi-agent systems. Proceedings of the IEEE, 95(1):215–233, January 2007. [48] R. Olfati-Saber, R. Franco, E. Frazzoli, and J. S. Shamma. Belief consensus and distributed hypothesis testing in sensor networks. Networked Embedded Sensing and Control, pages 169–182, 2006. [49] F. Pukelsheimi. Optimal design of experiments. New York: Wiley, 1993. [50] J. Qui˜onero-Candela and C. E. Rasmussen. A unifying view of sparse approximate n Gaussian process regression. The Journal of Machine Learning Research, 6:1939–1959, 2005. [51] C. E. Rasmussen and C. K. I. Williams. Gaussian processes for machine learning. The MIT Press, Cambridge, Massachusetts, London, England, 2006. [52] W. Ren and R. W. Beard. Consensus seeking in multiagent systems under dynamically changing interaction topologies. IEEE Transactions on Automatic Control, 50(5):655– 661, 2005. [53] W. Rudin. Principles of mathematical analysis. McGraw-Hill New York, 1976. [54] H. Rue and L. Held. Gaussian Markov random fields: theory and applications. Chapman & Hall, 2005. [55] H. Rue, S. Martino, and N. Chopin. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(2):319–392, 2009. [56] H. Rue and H. Tjelmeland. Fitting Gaussian Markov random fields to Gaussian fields. Scandinavian Journal of Statistics, 29(1):31–49, 2002. [57] M. Seeger. Bayesian Gaussian process models: PAC-Bayesian generalisation error bounds and sparse approximations. PhD thesis, School of Informatics, University of Edinburgh, 2003. [58] A Singh, A Krause, C Guestrin, and W Kaiser. Efficient informative sensing using multiple robots. Journal of Artificial Intelligence Research, 34(1):707–755, 2009. 165 [59] A. J. Smola and P. Bartlett. Sparse greedy gaussian process regression. In Advances in Neural Information Processing Systems 13, 2001. [60] P. Sollich and A. Halees. Learning curves for Gaussian process regression: approximations and bounds. Neural Computation, 14(6):1393–1428, 2002. [61] V. Tresp. A Bayesian committee machine. Neural Computation, 12(11):2719–2741, 2000. [62] C. K. I. Williams and C. E. Rasmussen. Gaussian processes for regression. Advances in Neural Information Processing Systems, 8:514–520, 1996. [63] C. K. I. Williams and M. Seeger. Using the Nystr¨m method to speed up kernel mao chines. In Advances in Neural Information Processing Systems 13, 2001. [64] C .K. I. Williams and F. Vivarelli. Upper and lower bounds on the learning curve for Gaussian processes. Machine Learning, 40(1):77–102, 2000. [65] Y. Xu and J. Choi. Adaptive sampling for learning Gaussian processes using mobile sensor networks. 
Sensors, 11(3):3051–3066, 2011.
[66] Y. Xu, J. Choi, S. Dass, and T. Maiti. Bayesian prediction and adaptive sampling algorithms for mobile sensor networks. In Proceedings of the 2011 American Control Conference (ACC), pages 4095–4200, 2011.
[67] Y. Xu, J. Choi, and S. Oh. Mobile sensor network navigation using Gaussian processes with truncated observations. IEEE Transactions on Robotics, 27(5):1–14, 2011. To appear.
[68] M. M. Zavlanos and G. J. Pappas. Distributed connectivity control of mobile networks. IEEE Transactions on Robotics, 24(6):1416–1428, December 2008.
[69] M. M. Zavlanos and G. J. Pappas. Dynamic assignment in distributed motion planning with local coordination. IEEE Transactions on Robotics, 24(1):232–242, 2008.