This is to certify that the thesis entitled A NONPARAMETRIC ADAPTIVE DETECTOR FOR UNKNOWN CHANNELS USING BOOSTING, presented by Long Ying, has been accepted towards fulfillment of the requirements for the Master's degree in Electrical Engineering.

A NONPARAMETRIC ADAPTIVE DETECTOR FOR UNKNOWN CHANNELS USING BOOSTING

By Long Ying

A THESIS submitted to Michigan State University in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE, Department of Electrical Engineering, 1998

ABSTRACT

A NONPARAMETRIC ADAPTIVE DETECTOR FOR UNKNOWN CHANNELS USING BOOSTING

By Long Ying

Boosting is a new and powerful decision-theoretic machine learning algorithm, which combines a number of simple decision rules into one highly accurate final rule. Without any prior knowledge of the source statistics, boosting is shown to be capable of learning a decision rule in a classification problem. The objective of this research is to design an adaptive detector that automatically adjusts to unknown environments using a reasonable amount of training data. A novel detector for Code Division Multiple Access (CDMA) communication signals is designed using the recently proposed learning algorithm known as boosting. The new detector performs favorably in comparison to other commonly used detectors, including the likelihood ratio detector. The likelihood ratio detector is an optimal detector that minimizes the probability of error, but it requires complete knowledge of the channel characteristics. In contrast, the new adaptive detector proposed here makes no assumptions on the channel. Hence, the new detector is potentially useful for realistic situations in which little is known about the communication channel.

Dedicated to Jane

ACKNOWLEDGEMENTS

This interesting research topic was suggested by my advisor Dr. Robert D. Nowak. I am most grateful for Dr. Nowak's help during this research and during the preparation of the thesis, and for his support and guidance throughout my studies for the Master's degree. I would also like to thank Dr. John Deller Jr. and Dr. Percy Pierre for their wisdom and many important suggestions for this thesis.

Contents

LIST OF FIGURES

1 INTRODUCTION

2 CDMA COMMUNICATIONS
2.1 The CDMA Signal Structure
2.2 Detection of CDMA Signals
2.2.1 Matched Filter
2.2.2 Sign Detector
2.2.3 Adaptive Detection

3 BOOSTING
3.1 Background on Machine Learning
3.2 Boosting and AdaBoost
3.3 Experiment with AdaBoost
3.4 Performance Analysis

4 DETECTION FOR UNKNOWN CHANNELS USING BOOSTING
4.1 Detecting CDMA Signals Using AdaBoost: Method 1
4.2 Detecting CDMA Signals Using AdaBoost: Method 2, the AdaBoost Detector

5 EXPERIMENTS AND DISCUSSION
5.1 AdaBoost Detector in Additive White Noise Channel
5.2 Discussion

6 CONCLUSION

A VC DIMENSION

REFERENCES
List of Figures

1 The CDMA signal structure.
2 Matched filter detection for CDMA signals.
3 Sign detection for CDMA signals.
4 A setup of machine learning.
5 Algorithm AdaBoost.
6 An experiment with AdaBoost.
7 AdaBoost detector with less training data.
8 Experiments with the AdaBoost detector.

1 INTRODUCTION

In digital communication systems, one of a finite number of symbols is sent during each bit period. Due to imperfections in the channel and in the receiver's front end, the received signal of each transmitted symbol is modeled as a random variable. An optimal detector, in the sense of minimizing the probability of error, is the likelihood ratio test. Throughout this thesis, the term "optimal" refers to the minimum probability of error criterion. The likelihood ratio test requires complete knowledge of the received symbol distributions. For some systems, the channel uncertainties can be modeled very well and the received signal distributions are known, so we are able to design fixed detectors that are optimal (e.g., matched filters for additive white Gaussian noise, or AWGN, channels [4]). More frequently, however, accurate models for the channels are not available due to unknown noise sources, unknown interference, and the fact that the channels can be time-varying. In such situations, an adaptive detector that adjusts to the changing environment is necessary.

Depending on the amount of knowledge about the channel, some adaptive detectors assume certain parametric forms for the received symbol distributions and estimate the unknown parameters from training data; others adopt nonparametric schemes that assume very little about the channel. For example, in a recent study [1], a type-based detector, which uses histograms built from training data to estimate the distributions, provides asymptotically optimal performance with no assumptions about the channel. In many wireless digital communication systems, the channel is very complex, making a priori modeling very difficult if not impossible. Therefore nonparametric adaptive detectors are highly desirable.

The general goal of this research is to design a nonparametric adaptive detector using a new and powerful decision-theoretic learning algorithm known as boosting. In boosting, a number of simple decision rules are first generated using training data alone, without any knowledge of the received signal distributions. Then these simple rules are combined (or "boosted") to produce an overall decision rule that approximates the optimal likelihood ratio test. This research results in a specific product: the AdaBoost detector. It is aimed at extracting code division multiple access (CDMA) communication signals from background noise without knowledge of the noise statistics (see Footnote 1).

This thesis will first explain the structure and the usual detection methods of CDMA signals. Then it will introduce the idea of boosting in detail, with a specific algorithm called AdaBoost as an example. It will also describe the development of the AdaBoost detector, as well as some experiments with this detector.

Footnote 1: Experiments conducted at the time of writing this thesis also showed that the detector can do better than others in rejecting certain interference encountered in wireless channels. However, time did not allow a comprehensive discussion of that subject to be included in this thesis.
2 CDMA COMMUNICATIONS

CDMA is a transmission scheme found in wireless mobile communications networks, wireless local area networks, global positioning systems, etc., and it is becoming more and more important with the rapid growth of the telecommunications industry [6]. The following is a brief introduction to CDMA communications. The reader is referred to the textbook by Proakis [4] or Ziemer and Peterson [5] for details.

2.1 The CDMA Signal Structure

The CDMA signal structure can be explained using Figure 1. This example shows antipodal binary CDMA signals. The signal level is either +1 or -1. Figure 1a shows two message symbols to be communicated at a rate of 1/T_b bits/sec (bps). The key to CDMA is that each message symbol must be multiplied ("spread") by a sequence of N pseudo-random bits. This sequence is called a code, and each bit in the code is termed a chip. Figure 1b shows two copies of an 8-chip code (see Footnote 2), each to be multiplied by a symbol to produce the chip sequence in Figure 1c. For each symbol of level m, this operation can be expressed as

    s[n] = m c[n],  n = 1, ..., N.                                   (1)

Therefore, instead of sending the message symbol directly, one transmits the final sequence (s[1], ..., s[N]).

Footnote 2: In reality, N is much larger than 8 (typically over 100).

[Figure 1: The CDMA signal structure. (a) Message symbols m at bit period T_b. (b) Code sequences c[n]. (c) Transmitted sequences s[n].]

2.2 Detection of CDMA Signals (see Footnote 3)

Footnote 3: Throughout the thesis, I will assume that the transmitter and the receiver are synchronized. A good introduction to synchronization can be found in [4].

2.2.1 Matched Filter

To recover the original message m from this type of signal, one needs to decide whether the received N-chip sequence is the result of transmitting the code or the negative of the code. It is easily shown that in AWGN channels the optimal decision is made by a matched filter, or equivalently, by correlating the received sequence with the original code, as shown in Figure 2.

[Figure 2: Matched filter detection for CDMA signals. The received sequence R[n] = s[n] + s'[n] + W[n] is correlated with the code c[n] to decide the estimate of m.]

To make subsequent discussions easier, let us use vectors (shown in bold face) to denote length-N sequences. For example, R denotes the sequence (R[1], ..., R[N]). I will also use capital letters to denote random variables throughout the thesis. Notice that in addition to the transmitted signal s and the noise W, the received sequence R also contains another sequence s'. The signal s' is intended for another receiver, containing a different message m', spread by a different code c'. The key is that the code c' is designed to be almost orthogonal to c, i.e.

    \sum_{n=1}^{N} c[n] c'[n] \approx 0.                             (2)

Therefore, the correlation output due to s' is close to 0. This means other users can transmit their messages on the same channel at the same time, which is the essence of CDMA. This matched filter is the backbone of the most widely used CDMA receiver, the Rake receiver [6].
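To make the spreading operation of equation (1) and the correlation decision of Figure 2 concrete, the following is a minimal Python sketch. The code length, noise level, and random seed are arbitrary choices for the example and do not correspond to the experiments reported later.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8                                    # chips per symbol (much larger in practice)
code = rng.choice([-1.0, 1.0], size=N)   # pseudo-random spreading code c[n]

def spread(m, code):
    """Spread one message symbol m in {-1, +1}: s[n] = m * c[n] (equation (1))."""
    return m * code

def matched_filter_detect(received, code):
    """Correlate the received chips with the code and decide by the sign (Figure 2)."""
    correlation = np.dot(received, code)
    return 1 if correlation >= 0 else -1

# Example: transmit m = +1 through an AWGN channel and detect it.
m = 1
received = spread(m, code) + 0.5 * rng.standard_normal(N)
print(matched_filter_detect(received, code))
```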
The Rake receiver is a series of matched filters, combined to rake in energy from multiple signal paths to achieve diversity (see Footnote 4). It is important to note that the use of a matched filter for every path is based on the assumption that the channel adds white Gaussian noise to the signal. Unfortunately, this is not always true, especially when other users' codes are not orthogonal to the receiver's or when intentional jamming is present. In such cases, it is almost impossible to accurately model the received signal distributions.

Footnote 4: For a more comprehensive discussion of the Rake receiver and the problem of multipath in communications, the reader is referred to the works of Price and Green [7] and Rappaport [6].

2.2.2 Sign Detector

According to [1], a sign detector does not assume an AWGN channel. In fact, it does not assume the noise distribution to be of any parametric form. It is the optimal fixed detector for zero-median noise. Figure 3 illustrates the operation of a sign detector. Each received chip R[n] is individually classified to m[n] according to its sign. Once all N chips are determined, the sequence is compared to both the code and the negative of the code to decide which message symbol was sent. In the case depicted in Figure 3, it is decided that the sequence R comes from transmitting the symbol +1, because the detected sequence differs from the positive code by only two chips.

[Figure 3: Sign detection for CDMA signals. The received chips R[n] are classified chip-wise by sign, and the resulting sequence is compared to both c[n] and -c[n].]

2.2.3 Adaptive Detection

In a recent series of papers [1, 2, 3], Johnson et al. have developed a universal adaptive detector called the type-based detector. A type, in information theory, is the histogram estimate of a discrete probability distribution [11]. The type-based detector uses the types constructed from training data to estimate the received signal distributions in order to asymptotically achieve optimal (minimum probability of error) performance. The rest of this section will explain the use of this detector for CDMA communications. Some of the operations of this detector are used in the implementation of the AdaBoost detector as well.

To construct types, the received signals must be quantized. Suppose the signal quantization levels are {a_t}_{t=1}^{T}, and that a set of quantized observations from one class is {[x_m]}_{m=1}^{M} (see Footnote 5). Then the type for that class (the histogram estimate of its distribution) is simply:

    \hat{P}_x(a_t) = \frac{1}{M} \sum_{m=1}^{M} I([x_m] = a_t),  t = 1, ..., T            (3)

where I() is the indicator function.

Footnote 5: I will defer the discussion of the quantization process to Section 5.2. The reader should refer to [2] for details. Also, I will use [x_m] to denote the quantized signals, to distinguish them from the unquantized x_m appearing in later sections.

In antipodal binary communications, there are two classes of chips for which types must be constructed. For the purpose of grouping chips into these classes, define the sets n^+ and n^- to be the positions of all positive chips and all negative chips, respectively, in the code. That is,

    n^+ = {n : c[n] = +1},   n^- = {n : c[n] = -1}.                   (4)

The receiver knows these sets. Therefore, by observing the received chips' positions, the detector can easily separate them into two classes, and construct the two types.
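As an illustration of equations (3) and (4), the sketch below groups the training chips by their positions in the code and forms the two types. The nearest-level quantization rule and the variable names are assumptions made only for this example.

```python
import numpy as np

def quantize(x, levels):
    """Map each received chip to the nearest quantization level a_t (one simple choice)."""
    x = np.asarray(x, dtype=float)
    levels = np.asarray(levels)
    return levels[np.argmin(np.abs(x[:, None] - levels[None, :]), axis=1)]

def build_types(training_seqs, training_labels, code, levels):
    """Group training chips by sign using n+ and n- (equation (4)) and form the
    histogram estimates (types) of equation (3) for the +1 and -1 chip classes."""
    n_plus = code == 1
    pos_chips, neg_chips = [], []
    for x, y in zip(training_seqs, training_labels):
        xq = quantize(x, levels)
        # If the symbol was +1, the chips at n+ were transmitted as +1, and vice versa.
        pos_chips.append(xq[n_plus] if y == 1 else xq[~n_plus])
        neg_chips.append(xq[~n_plus] if y == 1 else xq[n_plus])
    pos_chips = np.concatenate(pos_chips)
    neg_chips = np.concatenate(neg_chips)
    P_plus = np.array([(pos_chips == a).mean() for a in levels])
    P_minus = np.array([(neg_chips == a).mean() for a in levels])
    return P_plus, P_minus
```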
To be more specific, the chips in a sequence [x] are separated according to the following:

    x_1 = {[x[n]] : n \in n^+},   x_2 = {[x[n]] : n \in n^-}.         (5)

It is important to notice that if the sequence corresponds to a +1 symbol, the group x_1 will contain all the +1 chips, but if the sequence represents the symbol -1, then the group x_2 will contain the +1 chips instead. Therefore one can identify the chip groups only if the transmitted symbol is known (which is true in training), but one has to hypothesize the signs of the groups when unknown sequences are received.

In order to use a type-based detector, it is required that a labeled training set be transmitted periodically, ([x_1], y_1), ..., ([x_L], y_L), where the label y_l is the l-th symbol and the observation [x_l] is the received and quantized sequence of that symbol. As described, all the chips are separated into a group of +1s, x^+, and a group of -1s, x^-, each having M = (1/2)(L x N) chips (see Footnote 6). The types for the two classes of chips, P_{x^+} and P_{x^-}, are constructed according to (3). This completes the training phase of the type-based detector.

Footnote 6: The length N of a pseudo-random sequence is usually odd, but transmitting (1/2)L +1 symbols and (1/2)L -1 symbols can result in an equal number (M) of oppositely signed chips. This is not a requirement, but it is assumed for simplicity.

After the training, a sequence [R] corresponding to an unknown message symbol m is received (with the same quantization). At this point, if the chips are independent (the result of a memoryless channel), one only needs to estimate the likelihood ratio of each chip using the types. That is,

    \hat{\Lambda}_n = \frac{\hat{P}([R[n]] | s[n] = +1)}{\hat{P}([R[n]] | s[n] = -1)} = \frac{P_{x^+}([R[n]])}{P_{x^-}([R[n]])}.            (6)

Then the likelihood ratio of the whole sequence can be found:

    \hat{\Lambda} = \prod_{n=1}^{N} \frac{\hat{P}([R[n]] | m = +1)}{\hat{P}([R[n]] | m = -1)} = \frac{\prod_{n \in n^+} \hat{\Lambda}_n}{\prod_{n \in n^-} \hat{\Lambda}_n}.            (7)

Therefore the decision will be:

    \hat{m} = +1 if \hat{\Lambda} \ge 1, and \hat{m} = -1 otherwise.            (8)

However, according to Ziv [16], this structure cannot be generalized to accommodate channels with memory. It turns out that the following method is more versatile because it can be generalized to cope with interchip interference using higher order types. The theory behind it was developed by Gutman [15].

After training, the type-based detector also separates the unknown sequence [R] into two groups of chips, R^+ = {[R[n]] : n \in n^+} and R^- = {[R[n]] : n \in n^-}. Notice that the chips in the set R^+ are positive only if the sequence represents the symbol m = +1. In that case, the chips in this set and in the set x^+ (found in training) should come from the same distribution. Based on this idea, the type-based detector constructs four more types:

    P_{Z_1^+} = ((N/2) P_{R^+} + M P_{x^+}) / (N/2 + M)
    P_{Z_1^-} = ((N/2) P_{R^-} + M P_{x^-}) / (N/2 + M)
    P_{Z_0^+} = ((N/2) P_{R^-} + M P_{x^+}) / (N/2 + M)
    P_{Z_0^-} = ((N/2) P_{R^+} + M P_{x^-}) / (N/2 + M)                 (9)

where P_{R^+} and P_{R^-} are the types made from R^+ and R^-. These four types are formed from the sets constructed by pairing R^+ and R^- with x^+ and x^- in both ways. This allows the detector to hypothesize as follows:

    H_1 : m = +1 :  (R^+, x^+) ~ P_{Z_1^+},  (R^-, x^-) ~ P_{Z_1^-}
    H_0 : m = -1 :  (R^-, x^+) ~ P_{Z_0^+},  (R^+, x^-) ~ P_{Z_0^-}          (10)

It can be shown that the log likelihood of all the types given each hypothesis is:

    \Gamma_1 = -(N/2 + M) H(P_{Z_1^+}) + (N/2) H(P_{R^+}) + M H(P_{x^+})
               - (N/2 + M) H(P_{Z_1^-}) + (N/2) H(P_{R^-}) + M H(P_{x^-})

    \Gamma_0 = -(N/2 + M) H(P_{Z_0^+}) + (N/2) H(P_{R^-}) + M H(P_{x^+})
               - (N/2 + M) H(P_{Z_0^-}) + (N/2) H(P_{R^+}) + M H(P_{x^-})          (11)

where H() is the entropy function. Notice that these test statistics combine the training and testing data, which is a requirement for optimal detection in channels with memory. The decision rule based on these statistics is then simply:

    decide H_1 if \Gamma_1 \ge \Gamma_0, and H_0 otherwise.             (12)
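A minimal sketch of this decision, following equations (9) through (12), is given below. It reuses the quantize() helper from the earlier type-construction sketch, assumes the code contains N/2 chips of each sign (as in Footnote 6), and uses natural-logarithm entropy; these are illustrative assumptions rather than choices taken from the thesis.

```python
import numpy as np

def entropy(p):
    """Entropy (in nats) of a type (histogram) p; 0*log(0) is treated as 0."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def type_based_decide(R, code, levels, P_xp, P_xm, M):
    """Type-based decision of equations (9)-(12) for one received sequence R.
    P_xp and P_xm are the training types for +1 and -1 chips, each built from M chips."""
    n_plus = code == 1
    Rq = quantize(R, levels)                       # quantize() as sketched earlier
    hist = lambda chips: np.array([(chips == a).mean() for a in levels])
    P_Rp, P_Rm = hist(Rq[n_plus]), hist(Rq[~n_plus])
    half = len(code) / 2.0
    mix = lambda P_R, P_x: (half * P_R + M * P_x) / (half + M)    # equation (9)
    gamma = lambda P_Ra, P_Rb: (                                   # equation (11)
        -(half + M) * entropy(mix(P_Ra, P_xp)) + half * entropy(P_Ra) + M * entropy(P_xp)
        - (half + M) * entropy(mix(P_Rb, P_xm)) + half * entropy(P_Rb) + M * entropy(P_xm))
    gamma1 = gamma(P_Rp, P_Rm)     # hypothesis m = +1
    gamma0 = gamma(P_Rm, P_Rp)     # hypothesis m = -1
    return 1 if gamma1 >= gamma0 else -1           # equation (12)
```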
The effectiveness of the type-based detector will be demonstrated in Section 5, where the performance of all the detectors considered in this thesis is compared.

Since CDMA is widely used in wireless communications, it suffers from most of the wireless channel impairments (see Footnote 7). Very often, the channel does not fit the model for which the current Rake receiver works best. The type-based detector, in these situations, is very desirable because it can adapt to any model. The next section will describe an algorithm called boosting, which is also well suited for performing classification without any assumption. It is the purpose of this thesis to incorporate it into the detection of CDMA signals.

Footnote 7: CDMA itself also provides some resistance against some of these channel imperfections as part of its virtue.

3 BOOSTING

Roughly speaking, boosting is a learning algorithm that combines or "boosts" a number of simple classifiers to obtain an overall more powerful decision rule. Boosting is an algorithm developed in the area of machine learning. Therefore it is appropriate to explain it in that context.

3.1 Background on Machine Learning

One setup of a machine learning problem is illustrated in Figure 4. The objective is to adaptively learn about the function G that maps an input x to an output y. When the setup is used in classification problems, G is just a classifier that maps an observation into a class. Because of the nature of this research, I will only talk about using the learning machine to learn a classification rule, even though this setup is much more general (see Footnote 8). One of the resources for learning is a set of training data {x_m}_{m=1}^{M} with known labels (or classes) {y_m}_{m=1}^{M}.

[Figure 4: A setup of machine learning. The unknown mapping G produces labels y from inputs x; the learner selects a function f from a specified family to approximate G using the training data.]

Footnote 8: Other applications are found in the work of Vapnik [13].

The central component of this setup is the learning machine, henceforth called the learner. It is capable of implementing any function (or classifier) f from a specified family F. These functions can also map any x into a class \hat{y}. Thus when the learner selects one f in F to classify a training observation x_m, it can observe the output \hat{y}_m, compare it to the label y_m, and determine how good the classifier is. After the learner tries all the classifiers on all the training data, it picks the one that performs best, f_G, and the learning is finished. f_G will then be used to classify all future observations.

More specifically, the learning process is described as follows. The learner must be provided with a performance criterion in order to compare the performance of different classifiers. The performance criterion is a loss function denoted by L(y_m, \hat{y}_m(f)). In classification problems, this loss can be an indicator of the classification error of f on x (see Footnote 9), also called 0/1 error, i.e.

    L(y, \hat{y}(f)) = 0 if y = \hat{y}(f), and 1 if y \ne \hat{y}(f).            (13)

Footnote 9: To emphasize that L is the loss of the classifier f, x is eliminated in the following equations. It is understood that \hat{y}(f) means \hat{y}(x).

Ideally, we would like the average loss over all possible observations to be as small as possible. Therefore the learner should choose an f to minimize the expected risk,

    E(f) = \int L(y, \hat{y}(f)) \, dF(x, y)                          (14)

where F(x, y) is the underlying distribution over the space X x {+1, -1} for a binary classification problem. The expected risk is precisely the probability of error incurred using f as a classifier. The likelihood ratio test minimizes this error over all functions, and therefore it is called the minimum probability of error detector.
However, the likelihood ratio test requires complete knowledge of the distribution F, which is not practical. In reality, F is unknown and the learner has access only to the set of training data. Therefore one practical alternative is to adopt the empirical risk minimization (ERM) principle, which chooses the classifier f_e that minimizes the empirical risk. That is,

    f_e = \arg\min_{f \in F} E_{emp}(f) = \arg\min_{f \in F} \frac{1}{M} \sum_{m=1}^{M} L(y_m, \hat{y}_m(f)).             (15)

3.2 Boosting and AdaBoost

For this setup, it is essential to find a good set of functions F, since the learner's ability is limited by the best classifier in F. In order to ensure that F contains a member f that closely matches the desired mapping G, we either require considerable prior knowledge about G, or we need a very large and complicated set F. In many problems we have little or no information about G. Therefore, F must include a large number of (possibly very complicated) classifiers. This leads to two problems.

1. It may be difficult to perform the optimization in (15).

2. It may lead to overfitting and poor generalization characteristics when f_e is applied to unlabeled data outside the training set.

Fortunately, a change of mindset is provided by a class of learning algorithms that achieves high accuracy by voting according to the predictions of several classifiers [9]. One such method is called boosting.

In boosting, instead of finding a large and necessarily complex set of functions F containing the desired classifier, the algorithm only finds a number of weak classifiers (see Footnote 10) {h_t}_{t=1}^{T} that may perform just better than random guessing. For an observation x, each of these weak classifiers predicts its class, h_t(x). The algorithm then combines ("boosts") them into a more accurate final prediction, h_f(x), according to a weighted voting criterion.

An actual boosting algorithm will help to explain some details of this idea. Figure 5 shows one of the most successful boosting algorithms, called AdaBoost [8] (see Footnote 11). This algorithm starts by assigning an initial probability measure p^1 = (p_1^1, ..., p_M^1), usually uniform, over the training set (x_1, y_1), ..., (x_M, y_M). It then enters a loop (indexed by t), where it calls a subroutine called WeakLearn. WeakLearn finds a weak classifier h_t for that round, according to a modified ERM (MERM) principle:

    h_t = \arg\min_{h \in H} \left\{ E_{mod}(h) = \sum_{m=1}^{M} p_m^t L(y_m, h(x_m)) \right\}.             (16)

Notice that (16) differs from (15) in two regards. 1) The set of functions H, in which h_t is found, is much smaller than the set F; it is not based on much knowledge about the underlying classification rule to be learned. 2) Instead of weighting the loss of each training sample equally with a weight 1/M, WeakLearn averages the loss according to the probability measure p_m^t assigned by the boosting algorithm. The reason for this will become clear shortly.

Footnote 10: In the boosting literature, these are called weak hypotheses. To avoid confusion with the word hypotheses used in detection theory, weak classifier is used throughout this thesis.

Footnote 11: The original AdaBoost algorithm in [8], which classifies between 1 and 0, is slightly modified here in order to classify between 1 and -1.
Figure 5: Algorithm AdaBoost.

Input:
  - training set {(x_1, y_1), ..., (x_M, y_M)}
  - initial distribution p^1 = [p_1^1, ..., p_M^1] over the M samples (e.g. uniform)
Initialize: w^1 = p^1
Do for t = 1, 2, ..., T:
  1. Set p^t = w^t / \sum_{m=1}^{M} w_m^t
  2. Call WeakLearn, providing it with {(x_m, y_m, p_m^t)}_{m=1}^{M}; get back a weak decision rule h_t : X -> {+1, -1}
  3. Calculate \epsilon_t = \sum_{m=1}^{M} p_m^t I[y_m \ne h_t(x_m)]
  4. Set \beta_t = \epsilon_t / (1 - \epsilon_t)
  5. Set the new weights to be w_m^{t+1} = w_m^t \beta_t^{1 - I[y_m \ne h_t(x_m)]},  m = 1, 2, ..., M
Output: the final rule h_f(x) is
  h_f(x) = +1  if  \sum_{t=1}^{T} I[h_t(x) = +1] \log(1/\beta_t) \ge \sum_{t=1}^{T} I[h_t(x) = -1] \log(1/\beta_t),
  h_f(x) = -1  otherwise.

The minimized risk \epsilon_t = \min_{h \in H} E_{mod}(h) is a by-product of WeakLearn. Next the quantity \beta_t is calculated. This quantity increases from 0 to infinity as the risk of h_t increases from 0 to 1. It is an important quantity whose significance is clearly demonstrated in the final decision rule.

The last step of each round is a key to boosting. The algorithm updates the probability measure over the training set for the next round. Notice that w is used here instead of p. This is because w will be normalized to a true probability measure p at the beginning of the next round. The important feature of this updating step is that it relatively increases the weights of the training samples that h_t misclassified. When these weights are carried into the next WeakLearn subroutine, they tend to make WeakLearn produce a weak classifier that works better for these "hard" samples (because WeakLearn uses the MERM principle, taking p^t into account).

After T rounds, T weak classifiers {h_t}_{t=1}^{T} are generated, each associated with a quantity \beta_t indicating its risk. The final rule says: apply all {h_t}_{t=1}^{T} to an observation x, weight each vote h_t(x) with \log(1/\beta_t), and the decision is the class receiving the weighted majority vote. Obviously, the more risk h_t has, the less important its vote becomes.

3.3 Experiment with AdaBoost

The following binary classification experiment is used to demonstrate the effectiveness of AdaBoost.

The observation x is one dimensional. There are two classes, labeled y = +1 and y = -1. The pdf of the first class, p(x | y = +1), is a zero-mean, small-variance Gaussian. The class -1 has a large-variance Gaussian distribution with the same mean. Figure 6a shows these density functions. The optimal (minimum probability of error) classifier results from the maximum likelihood criterion,

    \frac{p(x | y = +1)}{p(x | y = -1)} \gtrless 1,                   (17)

which corresponds to the decision regions separated by the two thresholds (visualized by the two vertical bars in Figure 6a). Thus the decision rule to be learned should classify the observations between these two thresholds as +1s and those outside as -1s.

To design a learning algorithm for this problem using AdaBoost, a class of weak classifiers has to be found. The set chosen in this case is:

    H = { h(x; v, d) = sgn[(x - v) d] : v \in R, d \in {+1, -1} }.            (18)

This classifier is simply a directed threshold. For example, when the direction d = +1, any observation x that falls above the threshold v is classified as +1, and the -1 class is on the other side of v.

The size of the training set is chosen to be 100 for this experiment, and the number of rounds of boosting T is also 100. Therefore, the algorithm sequentially generates 100 of these directed thresholds, "remembering" the performance of each of them. Then a sequence of new observations is presented to the final decision rule. For each observation, the algorithm applies the weighted voting criterion described in Section 3.2 to make the final prediction.
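The following is a minimal Python sketch of the AdaBoost loop of Figure 5 with the directed-threshold weak classifiers of equation (18). It is only an illustration of the experiment described above; the class variances, sample sizes, and random seed are assumed values, not those used in the thesis.

```python
import numpy as np

def weak_learn(x, y, p):
    """WeakLearn for the directed-threshold family of equation (18): search thresholds v
    (midpoints of the sorted data) and directions d, and return the stump with the
    smallest weighted error under the current distribution p."""
    best = (np.inf, 0.0, 1)
    xs = np.sort(x)
    thresholds = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2.0))
    for v in thresholds:
        for d in (1, -1):
            pred = np.where((x - v) * d >= 0, 1, -1)
            err = p[pred != y].sum()
            if err < best[0]:
                best = (err, v, d)
    return best  # (epsilon_t, v, d)

def adaboost_train(x, y, T=100):
    """AdaBoost of Figure 5 with directed-threshold weak classifiers."""
    M = len(x)
    w = np.full(M, 1.0 / M)
    stumps, betas = [], []
    for _ in range(T):
        p = w / w.sum()                                   # step 1
        eps, v, d = weak_learn(x, y, p)                   # steps 2-3
        eps = min(max(eps, 1e-12), 1 - 1e-12)             # guard against 0 or 1 risk
        beta = eps / (1.0 - eps)                          # step 4
        pred = np.where((x - v) * d >= 0, 1, -1)
        w = w * beta ** (1.0 - (pred != y))               # step 5: shrink correct samples
        stumps.append((v, d))
        betas.append(beta)
    return stumps, np.array(betas)

def adaboost_predict(x_new, stumps, betas):
    """Weighted-vote final rule h_f of Figure 5."""
    votes = np.zeros_like(x_new, dtype=float)
    for (v, d), beta in zip(stumps, betas):
        votes += np.log(1.0 / beta) * np.where((x_new - v) * d >= 0, 1, -1)
    return np.where(votes >= 0, 1, -1)

# Reproduce the flavor of the Section 3.3 experiment (assumed variances): class +1 is a
# small-variance zero-mean Gaussian, class -1 a large-variance zero-mean Gaussian.
rng = np.random.default_rng(1)
y_train = rng.choice([-1, 1], size=100)
x_train = np.where(y_train == 1, 0.3 * rng.standard_normal(100), 2.0 * rng.standard_normal(100))
stumps, betas = adaboost_train(x_train, y_train, T=100)
print(adaboost_predict(np.array([-3.0, 0.0, 3.0]), stumps, betas))
```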
[Figure 6: An experiment with AdaBoost. (a) The class-conditional densities p(x | y = +1) and p(x | y = -1), with the two optimal thresholds marked. (b) AdaBoost classification results.]

As shown in Figure 6b, the observations in the middle are classified as +1s, and -1s lie on both sides. The double thresholds from the optimal classifier are also displayed in this graph. The final predictions are obviously very close to being optimal. Notice that it is very hard to foresee a double-threshold type of classifier without any knowledge of the distributions. But with boosting, one does not need such clairvoyance. In fact, even if the optimal decision rule has a more complex decision region (e.g. multiple thresholds), the same algorithm can learn it.

3.4 Performance Analysis

In the area of machine learning, there has been extensive research regarding the performance of learning machines. For a more comprehensive discussion, the reader is again referred to the book by Vapnik [13]. Here, let us review some of the performance bounds.

Recall that using the ERM principle, the learner can find a classifier f_e in F that is closest to the function G being learned. It was shown [13], with a simple constraint on the set F, that both the empirical risk and the actual risk of f_e converge in probability to the minimum possible risk provided by any function in F. The constraint is that the Vapnik-Chervonenkis (VC) dimension k of the set F be finite. The concept of VC dimension is introduced in the appendix.

Bounding the empirical risk is not a difficult task. The training error (the same as the empirical risk, see Footnote 12) of the final rule h_f in AdaBoost is bounded as follows [8]:

    E_{emp}(h_f) \le 2^T \prod_{t=1}^{T} \sqrt{\epsilon_t (1 - \epsilon_t)}.                   (19)

The generalization error (the actual risk, see Footnote 13) of h_f can also be bounded. Let \Theta_T(H) denote the class of final rules formed by combining T weak classifiers from H, and let k_\Theta be its VC dimension,

    k_\Theta \le 2 (k_H + 1)(T + 1) \log_2(e(T + 1)),                   (20)

where k_H is the VC dimension of H, which is assumed finite (see Footnote 14). Then, with high probability [8],

    E(h_f) \le E_{emp}(h_f) + O( \sqrt{ k_\Theta / M } ).                   (21)

Therefore the actual risk decays as 1/\sqrt{M} with high probability, where M is the size of the training set.

Footnote 12: Freund and Schapire [8] use "training error" to denote the empirical risk of the final classification rule. Therefore I will use these expressions interchangeably.

Footnote 13: The term "generalization error" is used by Freund and Schapire to mean the actual risk of h_f. Thus I will use these two expressions interchangeably as well.

Footnote 14: This is a reasonable assumption since H is simple. For example, the class defined in (18) has VC dimension 2.

Both of these bounds are distribution independent. A more appealing fact is that the second bound describes how the algorithm performs when the size of the training set is small. There is also a trade-off between the two bounds. As T increases, the training error in (19) vanishes; but the VC dimension of the class \Theta_T(H), k_\Theta, increases, causing the upper bound on the generalization error in (21) to increase as well. This trade-off between the training error and the generalization error can be used to guide the choice of T.

4 DETECTION FOR UNKNOWN CHANNELS USING BOOSTING

Since classification with boosting does not require assumptions about the source distributions, and since the performance is quantifiable even with limited training data, boosting is ideally suited for adaptive detection of digital signals in unknown communication channels such as wireless channels.

4.1 Detecting CDMA Signals Using AdaBoost: Method 1

A straightforward application of boosting in digital communications is the detection of binary signals. Here one of two possible symbols is sent during each bit period.
As a result of the channel impairments, the received bit R is modeled as a continuous random variable, which is governed by one of two probability distributions, p(R | s = +1) or p(R | s = -1), depending on the transmitted symbol s. Our experiment in Section 3.3 showed that the AdaBoost algorithm with the weak classifier family defined in (18) is capable of learning a classification rule in exactly this type of situation, without any assumption about those distributions. The only requirement is for the system to periodically transmit a set of labeled bits {(x_1, y_1), ..., (x_M, y_M)} as input training data to the receiver. This can easily be done during synchronization, for example.

In CDMA communications, however, a pseudo-random sequence of chips is transmitted in place of each message symbol, as explained in Section 2.1. Therefore, to best determine the symbol that was sent, one must observe the whole sequence of received chips R[n], n = 1, ..., N, during the bit interval of that symbol. A natural method is to first use AdaBoost's bit-level detection capability to classify each chip separately, and then compare the resulting binary sequence to the code and its negative. The decision is made according to the closest match. This method is very similar to that of the sign detector illustrated in Figure 3. The advantage of using AdaBoost here is its adaptability to many types of noise (see Footnote 15). In general, however, the method described above is suboptimal. For example, when the noise is zero-mean white Gaussian, its performance can only approach that of the sign detector at best, not the matched filter (see Footnote 16).

Footnote 15: However, this may be overshadowed by the sign detector's efficiency when the noise is zero-median.

Footnote 16: Experiments demonstrating this were performed, but are not included in the thesis.

4.2 Detecting CDMA Signals Using AdaBoost: Method 2, the AdaBoost Detector

This section derives the product of this research, the AdaBoost detector. As described in Sections 2.1 and 2.2.1, each message symbol m is spread by an N-chip code sequence c = (c[1], ..., c[N]). The resulting binary sequence s is transmitted through an unknown channel and received as a sequence of N continuous-valued chips R. (See Figures 1 and 2.) Frequently, a set of training symbols (y_1, ..., y_L), known to the detector, is sent. The corresponding received sequences are (x_1, ..., x_L). The AdaBoost detector must train itself with this set, and make classifications for future sequences.

First, let us look at the operation of the optimal (minimum probability of error) detector. The optimal detector for CDMA signals is based on knowledge of the likelihood functions of both positive and negative chips, p(R[n] | s[n] = +1) and p(R[n] | s[n] = -1). From these, the detector can calculate the likelihood ratio for each received chip:

    \Lambda_n = \frac{p(R[n] | s[n] = +1)}{p(R[n] | s[n] = -1)}.                   (22)

However, the optimal detector does not make decisions on the chips. Instead, it needs to calculate the likelihood ratio of the whole N-chip sequence over the bit interval. Assuming the chips are independent and identically distributed, the likelihood ratio is

    \Lambda = \prod_{n=1}^{N} \frac{p(R[n] | m = +1)}{p(R[n] | m = -1)}.                   (23)

With knowledge of the chip sets n^+ and n^- (see the definitions in (4) and Section 2.2.3), the above likelihood ratio can also be expressed as:

    \Lambda = \frac{\prod_{n \in n^+} \Lambda_n}{\prod_{n \in n^-} \Lambda_n}.                   (24)

The optimal detector makes a decision based on this likelihood ratio:

    \hat{m} = +1 if \Lambda \ge 1, and \hat{m} = -1 otherwise.                   (25)
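A minimal sketch of this clairvoyant decision rule, working in the log domain for numerical stability, is given below. The Gaussian chip likelihoods and noise standard deviation shown are only an assumed example of "known" likelihood functions.

```python
import numpy as np

def clairvoyant_detect(R, code, log_pdf_plus, log_pdf_minus):
    """Minimum-probability-of-error detection of equations (22)-(25), assuming the chip
    likelihoods p(r | s = +1) and p(r | s = -1) are known; computed in the log domain."""
    log_lr = log_pdf_plus(R) - log_pdf_minus(R)           # log of equation (22)
    n_plus = code == 1
    stat = log_lr[n_plus].sum() - log_lr[~n_plus].sum()   # log of equation (24)
    return 1 if stat >= 0 else -1                         # equation (25)

# Example chip likelihoods for an AWGN channel with an assumed noise standard deviation;
# the Gaussian normalization constant cancels in the ratio and is omitted.
sigma = 0.8
log_pdf_plus = lambda r: -(r - 1.0) ** 2 / (2 * sigma ** 2)
log_pdf_minus = lambda r: -(r + 1.0) ** 2 / (2 * sigma ** 2)
```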
The detector described above requires complete knowledge of the likelihood ratio, and is therefore impractical. However, a good estimate of the likelihood ratio made from empirical training data will also give a detector near-optimal performance [16]. The key contribution of this research is tailoring the AdaBoost algorithm to do just that.

First, the AdaBoost detector must be trained as if it were to be used for classifying chips. Recall from Section 2.2.3 that the receiver can easily separate all the training observations into a positive group of chips x^+ and a negative group x^-, based on the chips' positions. The receiver then labels each chip individually according to its group, giving AdaBoost a set of M = L x N labeled training data. Using the WeakLearn routine described in Section 3.3, AdaBoost can sequentially generate T weak classifiers {h_t}_{t=1}^{T} and compute their associated risks \epsilon_t.

The key to the AdaBoost detector is a modification of the final decision rule in AdaBoost. It is also one of the major contributions of this thesis. Instead of boosting all the h_t's predictions to determine each chip's class, the final rule is modified to give an estimate of the likelihood ratio of the chip:

    \hat{\Lambda}_n = \frac{\sum_{t=1}^{T} \log(1/\beta_t) \, I[h_t(R[n]) = +1]}{\sum_{t=1}^{T} \log(1/\beta_t) \, I[h_t(R[n]) = -1]}                   (26)

where I() is the indicator function. To classify the whole chip sequence, one simply uses this likelihood ratio estimate and proceeds as the optimal detector would with the true likelihood ratio, i.e.

    \hat{\Lambda} = \frac{\prod_{n \in n^+} \hat{\Lambda}_n}{\prod_{n \in n^-} \hat{\Lambda}_n},                   (27)

    \hat{m} = +1 if \hat{\Lambda} \ge 1, and \hat{m} = -1 otherwise.                   (28)

Notice that the quantity \log(1/\beta_t) is the weight that AdaBoost assigns to the prediction by h_t, so the expression in the numerator of (26) is simply the total weighted vote in favor of the chip being +1, and the total weighted vote in favor of -1 is the quantity in the denominator. Therefore the right-hand side of (26) is simply the ratio of the total votes for +1 to the total votes for -1. Intuitively, it is reasonable to use this quantity as an estimate of the likelihood ratio. The next section will show this with several experiments.

5 EXPERIMENTS AND DISCUSSION

5.1 AdaBoost Detector in Additive White Noise Channel

Experiments on the AdaBoost detector in additive white noise channels were conducted. In order to compare the results to those of Johnson et al. [1], the same experimental setup was used.

All digital symbols and chips came from antipodal binary classes +1 and -1. The code length N = 256 was chosen for most of the experiments. Four other detectors, along with the AdaBoost detector, were simulated. They were the matched filter (Section 2.2.1), the sign detector (Section 2.2.2), the type-based detector (Section 2.2.3), and the clairvoyant detector, which is the optimal detector described at the beginning of Section 4.2.

In each experiment, 512 N-chip testing sequences were generated from 512 random symbols, using a code. A particular type of noise with a certain power was simulated and added to those sequences. Then these sequences were given to the detectors, and all five detectors were asked to classify them into their corresponding symbols. Both the AdaBoost detector and the type-based detector were also given 32 labeled sequences (32 bits) for training. These sequences were contaminated with the same noise given to the testing data. At the end of each experiment, the bit error rate of each detector was determined.
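Before turning to the results, the following is a minimal sketch of the AdaBoost detector's decision step, equations (26) through (28), computed in the log domain. It assumes the stump-and-beta representation produced by the AdaBoost sketch of Section 3.3; the small guard constant is an implementation convenience, not part of the thesis's formulation.

```python
import numpy as np

def adaboost_chip_lr(r, stumps, betas):
    """Chip likelihood-ratio estimate of equation (26) from the trained weak classifiers."""
    votes_plus, votes_minus = 0.0, 0.0
    for (v, d), beta in zip(stumps, betas):
        weight = np.log(1.0 / beta)
        if (r - v) * d >= 0:
            votes_plus += weight       # weighted vote for the chip being +1
        else:
            votes_minus += weight      # weighted vote for the chip being -1
    eps = 1e-12                        # guard against an empty side of the vote
    return (votes_plus + eps) / (votes_minus + eps)

def adaboost_detect(R, code, stumps, betas):
    """Sequence decision of equations (27)-(28), computed in the log domain."""
    log_lr = np.array([np.log(adaboost_chip_lr(r, stumps, betas)) for r in R])
    n_plus = code == 1
    stat = log_lr[n_plus].sum() - log_lr[~n_plus].sum()
    return 1 if stat >= 0 else -1
```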
For each type of noise and for each noise power, the experiments were repeated 100 times, and the average bit error rate of each detector was recorded. This resulted in a plot of average bit error rate vs. signal-to-noise ratio for each detector and for each type of noise. These plots are shown in Figure 8. It is obvious that the performance of both the AdaBoost detector and the type-based detector is close to that of the clairvoyant detector in all three noisy cases, while the fixed detectors are only useful for certain types of noise. It is interesting to see that even though the sign detector is overall the best fixed nonparametric detector for zero-median noises (such as Laplacian noise), both the AdaBoost detector and the type-based detector are able to perform better at high SNR. The AdaBoost detector also performed slightly better than the type-based detector in two of the three cases.

However, the AdaBoost detector did not outperform the type-based detector when fewer training data were used. Figure 7 is the result of a simulation with only 1/4 of the training data used for Figure 8a (see Footnote 17).

Footnote 17: There were still L = 32 training symbols, but the code length of N = 64 effectively reduced the size of the training set M = L x N (Section 4.2).

[Figure 7: Average BER vs. SNR (dB) in Gaussian noise for the matched filter, sign detector, type-based detector, AdaBoost detector, and clairvoyant detector. This is almost the same comparison test as the one shown in Figure 8a, but a code length of N = 64 was used.]

[Figure 8: Experiments with the AdaBoost detector. Average BER vs. SNR (dB) for all five detectors in (a) Gaussian noise, (b) Laplacian noise, and (c) mixed Gaussian noise.]

A comparison between Figure 7 and Figure 8a indicates that the AdaBoost detector's performance improves faster as the training set becomes larger, apparently making better use of the training resource (relative to the type-based detector). This is also expected, since the second term in (21) says that the actual error should decay like 1/\sqrt{M}, where M is the size of the training set.

5.2 Discussion

The experiments have clearly demonstrated the AdaBoost detector's ability to perform near-optimal (close to minimum probability of error) detection of CDMA signals in several types of noise, with only one assumption about the channel: that it is memoryless. It is also shown to outperform another nonparametric adaptive detector, the type-based detector, in some cases. It has one other advantage over the type-based detector. Recall that the first step of type-based detection is to quantize all received signals. In fact, the theory behind type-based detection is based on discrete sources (see Gutman [15]). Without any knowledge of the channel, this quantization step is quite arbitrary, and it is difficult to optimize [2]. This is not a problem for the AdaBoost detector.

However, this research also revealed other open questions about the AdaBoost detector and about boosting in general. One fundamental question concerns the quality of the likelihood ratio estimate in (26). Although it is a key contribution of this research, this method of estimating the likelihood ratio has not been theoretically proven. Nevertheless, Schapire et al. have also considered the total votes on the two classes as the confidence that the algorithm has in them, and they have studied the difference between the votes (instead of the ratio).
Interested readers are referred to the works of Schapire et al. [9, 10]. If this method (or a variation of it) can be proven optimal in any sense, it will be useful for many other applications besides classification.

Another question is how to choose T. A good value can be found by the structural risk minimization principle, which involves a trade-off between the training error and the bound on the generalization error above it (see Section 3.4). This theory can be found in Vapnik's work [14]. It should be pointed out that the choice of T does not require any knowledge of the channel. There also exist other methods for choosing T, such as cross validation [8]. For the experiments in Section 5.1, the value of T was arbitrarily chosen to be 16.

Finally, the AdaBoost detector is not designed to be used in channels with memory. Therefore other methods of incorporating boosting must be investigated. One possible solution starts with considering each sequence x_l = (x_l[1], ..., x_l[N]) in the original training set as an N-dimensional vector observation. Recall that the AdaBoost detector breaks all the sequences into two groups of chips (Section 4.2), discarding all the dependency information among the chips. This alternative preserves the full sequence structure of the training set (x_1, y_1), ..., (x_L, y_L), thus preserving the possible correlation information. The weak classifiers, in this case, can be N-dimensional directed hyperplanes. Fairly efficient algorithms (see Footnote 18) to find such hyperplanes are available [12], and they can be employed in the WeakLearn subroutine. Once a number of these hyperplanes are found, they can be boosted to form complicated decision regions in the N-dimensional space. An unknown sequence R, also treated as a vector [R_1, ..., R_N], can then be classified. Since the chip dependencies are manifested in the a posteriori joint probability distributions p(R_1, ..., R_N | m = +1) and p(R_1, ..., R_N | m = -1), and since the boosting algorithm can perform detection without any assumptions about these distributions, this method may be able to handle interchip interference.

However, since N is usually large, a huge training set may be needed. If the training set is not large enough, a weak classifier (a high-dimensional hyperplane) can probably make classifications with no errors (see Footnote 19) and render boosting ineffective. On the other hand, a large training set will waste too much of the channel's time and require a long time to train, making the method impractical. Therefore one might choose some compromise, working in a relatively small vector space, e.g. 2 or 3 dimensions, dealing with length-2 or length-3 chip sequences at a time. This may capture interchip interference effects without becoming too unwieldy.

Footnote 18: These algorithms certainly require more computation than those used to find a single directed threshold.

Footnote 19: Depending on the WeakLearn, this hyperplane may be a good final classifier.
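One possible form of such a directed-hyperplane weak classifier is sketched below. The weighted perceptron-style fitting routine is only an illustrative stand-in for the algorithms of [12], and the learning rate and epoch count are assumed values.

```python
import numpy as np

def hyperplane_weak_learn(X, y, p, epochs=20, lr=0.1):
    """One possible WeakLearn for N-dimensional vector observations: fit a directed
    hyperplane h(x) = sgn(w.x + b) with a weighted perceptron-style update."""
    M, N = X.shape
    w, b = np.zeros(N), 0.0
    for _ in range(epochs):
        for m in range(M):
            if np.sign(X[m] @ w + b) != y[m]:
                w += lr * p[m] * y[m] * X[m]      # weight the update by p[m]
                b += lr * p[m] * y[m]
    pred = np.where(X @ w + b >= 0, 1, -1)
    eps = p[pred != y].sum()                       # weighted risk, as in equation (16)
    return (w, b), eps
```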
6 CONCLUSION

The modeling of a wireless communication channel is challenging due to unknown noise, interference, and the fact that the channel may be time-varying. It is therefore desirable to have adaptive detectors that automatically adjust to dynamic environments without making many assumptions.

This thesis first presented the idea of a new class of learning algorithms called boosting methods. Experiments with one of the algorithms, AdaBoost, showed its ability to learn near-optimal decision boundaries in a classification setting. Then the AdaBoost detector was developed for a wireless communication system, the CDMA system. This detector was designed to extract CDMA signals from background noise. It assumes a memoryless channel, and it adapts to the channel by learning the optimal decision rule with the help of a limited amount of training data. To examine the effectiveness of the AdaBoost detector, different types of noisy channels were simulated. The results showed that this detector could make near-optimal detections in all noisy situations.

Through this development, the research not only contributed to CDMA communications, but also demonstrated new potential of the boosting algorithms. However, the design also revealed two major questions that merit future investigation. First, the main modification made to the AdaBoost algorithm, which uses the ratio between the total weighted votes on the two chip classes as an estimate of the chip's likelihood ratio, was not rigorously examined. Therefore, even though the performance of the AdaBoost detector was excellent, solid theoretical analysis has yet to be done.

The other issue is how to handle channels with memory. The AdaBoost detector was not designed to accommodate dependencies among the chips in CDMA signals. An extended method of incorporating AdaBoost for CDMA detection was briefly described in Section 5.2. This method may solve the problem of interchip interference.

In conclusion, the theory of boosting is still relatively new. However, boosting algorithms have already established themselves as benchmarks for many practical classification problems [13, 17]. The AdaBoost detector developed in this research was also effective in extracting CDMA signals from background noise without knowing the noise type. Therefore, boosting methods are very promising for adaptive detection in unknown channels.

Appendix

A VC DIMENSION

The Vapnik-Chervonenkis (VC) dimension is a quantity that indicates the complexity of a set of functions. Referring to Figure 4, if the function to be learned, G, is a binary classification rule, then the set of functions F that the learner uses is probably a set of indicator functions. To understand the meaning of the VC dimension of such a set, consider k observations (x_1, ..., x_k). In a binary classification problem, each of them comes from one of the two classes; thus there are 2^k possible different labelings. If the set F is very rich, then for every labeling there may be some function f in F that separates the observations perfectly into their classes, and the set F is then said to shatter the k observations. On the other hand, when F is a small class, there may be certain labelings that cannot be separated by functions in F. The VC dimension of F is simply the maximum number k of observations that the functions in F can classify correctly no matter how the observations are labeled (that is, the maximum number of observations that can be shattered by F).

An example will make this clear. Consider the class H defined in (18). Each function is a directed threshold on the real line. Suppose there are two observations a < b. Then one can classify them with a directed threshold no matter which class each one comes from. For example, if a comes from +1 and b from -1, then a threshold between them directed towards a (d = -1) classifies them correctly. However, if there are three observations a < b < c, such classifiers cannot shatter them. To see this, let a and c come from one class and b from the other. This labeling cannot be realized by a single directed threshold. Therefore the class H has a VC dimension of k_H = 2.

To complete the above example, it takes a class of double thresholds with directions to shatter three observations. Obviously, this is a richer class, and it has a larger VC dimension as a consequence.
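As a small numerical illustration of this argument, the brute-force check below confirms that the directed thresholds of equation (18) shatter any two distinct points but cannot shatter three. The threshold grid and the convention sgn(0) = +1 are assumptions made only for the check.

```python
import numpy as np
from itertools import product

def stump_predict(x, v, d):
    """Directed threshold of equation (18): sgn[(x - v) d], with sgn(0) taken as +1."""
    return np.where((x - v) * d >= 0, 1, -1)

def shatters(points):
    """Brute-force check: can some directed threshold realize every labeling of points?"""
    points = np.asarray(points, dtype=float)
    thresholds = np.concatenate(([points.min() - 1.0], points + 1e-6))
    for labels in product([-1, 1], repeat=len(points)):
        if not any(np.array_equal(stump_predict(points, v, d), labels)
                   for v in thresholds for d in (-1, 1)):
            return False
    return True

print(shatters([0.0, 1.0]))        # True: two points can always be separated
print(shatters([0.0, 1.0, 2.0]))   # False: the class in equation (18) has VC dimension 2
```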
References

[1] D.H. Johnson, Y.K. Lee, O.E. Kelly, and J.L. Pistole, "Type-based detection for unknown channels," International Conference on Acoustics, Speech and Signal Processing, Atlanta, pages 2475-2478, 1996.

[2] L. Yue and D.H. Johnson, "Signal detection on wireless CDMA downlink," Vehicular Technology Conference, 1998.

[3] L. Yue and D.H. Johnson, "Type-based detection in macro-diversity reception for mobile radio signals," International Conference on Acoustics, Speech and Signal Processing, Munich, pages 4013-4016, 1997.

[4] John G. Proakis, Digital Communications, McGraw-Hill, 1989.

[5] Rodger E. Ziemer and Rodger L. Peterson, Digital Communications and Spread Spectrum Systems, Macmillan Publishing Company, 1985.

[6] Theodore S. Rappaport, Wireless Communications: Principles and Practice, Prentice Hall, 1996.

[7] R. Price and P.E. Green, Jr., "A communication technique for multipath channels," Proc. IRE, vol. 46, pages 555-570, March 1958.

[8] Yoav Freund and Robert E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Computational Learning Theory: Second European Conference, EuroCOLT '95, pages 23-37, Springer-Verlag, 1995.

[9] Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee, "Boosting the margin: A new explanation for the effectiveness of voting methods," Machine Learning: Proc. of the 14th International Conference, pages 322-330, 1997.

[10] Robert E. Schapire and Yoram Singer, "Improved boosting algorithms using confidence-rated predictions," Eleventh Annual Conference on Computational Learning Theory, 1998.

[11] T.M. Cover and J.A. Thomas, Elements of Information Theory, John Wiley and Sons, Inc., 1991.

[12] Pierre A. Devijver and Josef Kittler, Pattern Recognition: A Statistical Approach, Prentice Hall, London, 1982.

[13] Vladimir N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, 1995.

[14] Vladimir N. Vapnik, Estimation of Dependences Based on Empirical Data, Springer-Verlag, New York, 1982.

[15] Michael Gutman, "Asymptotically optimal classification for multiple tests with empirically observed statistics," IEEE Trans. Info. Th., 35:401-408, 1989.

[16] J. Ziv, "On classification with empirically observed statistics and universal data compression," IEEE Trans. Info. Th., 34:278-286, 1989.

[17] J.R. Quinlan, "Bagging, boosting, and C4.5," Thirteenth National Conference on Artificial Intelligence, pages 725-730, 1996.