This is to certify that the thesis entitled A NONPARAMETRIC ADAPTIVE DETECTOR FOR UNKNOWN CHANNELS USING BOOSTING, presented by Long Ying, has been accepted towards fulfillment of the requirements for the Master's degree in Electrical Engineering.

A NONPARAMETRIC ADAPTIVE DETECTOR FOR UNKNOWN CHANNELS USING BOOSTING

By Long Ying

A THESIS submitted to Michigan State University in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE, Department of Electrical Engineering, 1998

ABSTRACT

A NONPARAMETRIC ADAPTIVE DETECTOR FOR UNKNOWN CHANNELS USING BOOSTING

By Long Ying

Boosting is a new and powerful decision-theoretic machine learning algorithm, which combines a number of simple decision rules into one highly accurate final rule. Without any prior knowledge of the source statistics, boosting is shown to be capable of learning a decision rule in a classification problem. The objective of this research is to design an adaptive detector that automatically adjusts to unknown environments using a reasonable amount of training data. A novel detector for Code Division Multiple Access (CDMA) communication signals is designed using the recently proposed learning algorithm known as boosting. The new detector performs favorably in comparison to other commonly used detectors, including the likelihood ratio detector. The likelihood ratio detector is an optimal detector that minimizes the probability of error, but it requires complete knowledge of the channel characteristics. In contrast, the new adaptive detector proposed here makes no assumptions on the channel. Hence, the new detector is potentially useful for realistic situations in which little is known about the communication channel.

Dedicated to Jane

ACKNOWLEDGEMENTS

This interesting research topic was suggested by my advisor Dr. Robert D. Nowak. I am most grateful for Dr. Nowak's help during this research and during the preparation of the thesis, and for his support and guidance throughout my studies for the Master's degree. I would also like to thank Dr. John Deller Jr. and Dr. Percy Pierre for their wisdom and many important suggestions for this thesis.

Contents

LIST OF FIGURES

1 INTRODUCTION

2 CDMA COMMUNICATIONS
2.1 The CDMA Signal Structure
2.2 Detection of CDMA Signals
2.2.1 Matched Filter
2.2.2 Sign Detector
2.2.3 Adaptive Detection

3 BOOSTING
3.1 Background on Machine Learning
3.2 Boosting and AdaBoost
3.3 Experiment with AdaBoost
3.4 Performance Analysis

4 DETECTION FOR UNKNOWN CHANNELS USING BOOSTING
4.1 Detecting CDMA Signals Using AdaBoost: Method 1
4.2 Detecting CDMA Signals Using AdaBoost: Method 2, the AdaBoost Detector

5 EXPERIMENTS AND DISCUSSION
5.1 AdaBoost Detector in Additive White Noise Channel
5.2 Discussion

6 CONCLUSION

A VC DIMENSION

REFERENCES
List of Figures

1 The CDMA signal structure.
2 Matched filter detection for CDMA signals.
3 Sign detection for CDMA signals.
4 A setup of machine learning.
5 Algorithm AdaBoost.
6 An experiment with AdaBoost.
7 AdaBoost detector with less training data.
8 Experiments with the AdaBoost detector.

1 INTRODUCTION

In digital communication systems, one of a finite number of symbols is sent during each bit period. Due to imperfections in the channel and in the receiver's front end, the received signal of each transmitted symbol is modeled as a random variable. An optimal detector, in the sense of minimizing the probability of error, is the likelihood ratio test. Throughout this thesis, the term "optimal" refers to the minimum probability of error criterion. The likelihood ratio test requires complete knowledge of the received symbol distributions. For some systems, the channel uncertainties can be modeled very well and the received signal distributions are known, so we are able to design fixed detectors that are optimal (e.g., matched filters for additive white Gaussian noise, or AWGN, channels [4]). More frequently, however, accurate models for the channels are not available due to unknown noise sources, unknown interference, and the fact that the channels can be time-varying. In such situations, an adaptive detector that adjusts to the changing environment is necessary.

Depending on the amount of knowledge about the channel, some adaptive detectors assume certain parametric forms for the received symbol distributions and estimate the unknown parameters from training data; others adopt nonparametric schemes that assume very little about the channel. For example, in a recent study [1], a type-based detector, which uses histograms built from training data to estimate the distributions, provides asymptotically optimal performance with no assumptions about the channel. In many wireless digital communication systems, the channel is very complex, making a priori modeling very difficult if not impossible. Therefore nonparametric adaptive detectors are highly desirable.

The general goal of this research is to design a nonparametric adaptive detector using a new and powerful decision-theoretic learning algorithm known as boosting. In boosting, a number of simple decision rules are first generated using training data alone, without any knowledge of the received signal distributions. Then these simple rules are combined (or "boosted") to produce an overall decision rule that approximates the optimal likelihood ratio test. This research results in a specific product: the AdaBoost detector. It is aimed at extracting code division multiple access (CDMA) communication signals from background noise without knowledge of the noise statistics (see Footnote 1).

This thesis will first explain the structure and the usual detection methods of CDMA signals. Then it will introduce the idea of boosting in detail, with a specific algorithm called AdaBoost as an example. It will also describe the development of the AdaBoost detector, as well as some experiments with this detector.

Footnote 1: Experiments conducted at the time of writing this thesis also showed that the detector can do better than others in rejecting certain interference encountered in wireless channels. However, time did not allow a comprehensive discussion of that subject to be included in this thesis.
2 CDMA COMMUNICATIONS

CDMA is a transmission scheme found in wireless mobile communications networks, wireless local area networks, global positioning systems, etc., and it is becoming more and more important with the rapid growth of the telecommunications industry [6]. The following is a brief introduction to CDMA communications. The reader is referred to the textbook by Proakis [4] or Ziemer and Peterson [5] for details.

2.1 The CDMA Signal Structure

The CDMA signal structure can be explained using Figure 1. This example shows antipodal binary CDMA signals. The signal level is either +1 or -1. Figure 1a shows two message symbols to be communicated at a rate of 1/T_b bits/sec (bps). The key to CDMA is that each message symbol must be multiplied ("spread") by a sequence of N pseudo-random bits. This sequence is called a code, and each bit in the code is termed a chip. Figure 1b shows two copies of an 8-chip code (see Footnote 2), each to be multiplied by a symbol to produce the chip sequence in Figure 1c. For each symbol of level m, this operation can be expressed as

    s[n] = m c[n],  n = 1, ..., N.                                   (1)

Therefore, instead of sending the message symbol directly, one transmits the final sequence (s[1], ..., s[N]).

Footnote 2: In reality, N is much larger than 8 (typically over 100).

[Figure 1: The CDMA signal structure. (a) Message symbols m at bit period T_b. (b) Code sequences c[n]. (c) Transmitted sequences s[n].]

2.2 Detection of CDMA Signals (see Footnote 3)

Footnote 3: Throughout the thesis, I will assume that the transmitter and the receiver are synchronized. A good introduction to synchronization can be found in [4].

2.2.1 Matched Filter

To recover the original message m from this type of signal, one needs to decide whether the received N-chip sequence is the result of transmitting the code or the negative of the code. It is easily shown that in AWGN channels the optimal decision is made by a matched filter, or equivalently, by correlating the received sequence with the original code, as shown in Figure 2.

[Figure 2: Matched filter detection for CDMA signals. The received sequence R[n] = s[n] + s'[n] + W[n] is correlated with the code c[n] to decide the estimate of m.]

To make subsequent discussions easier, let us use vectors (shown in bold face) to denote length-N sequences. For example, R denotes the sequence (R[1], ..., R[N]). I will also use capital letters to denote random variables throughout the thesis. Notice that in addition to the transmitted signal s and the noise W, the received sequence R also contains another sequence s'. The signal s' is intended for another receiver, containing a different message m', spread by a different code c'. The key is that the code c' is designed to be almost orthogonal to c, i.e.

    \sum_{n=1}^{N} c[n] c'[n] \approx 0.                             (2)

Therefore, the correlation output due to s' is close to 0. This means other users can transmit their messages on the same channel at the same time, which is the essence of CDMA. This matched filter is the backbone of the most widely used CDMA receiver, the Rake receiver [6].
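To make the spreading operation of equation (1) and the correlation decision of Figure 2 concrete, the following is a minimal Python sketch. The code length, noise level, and random seed are arbitrary choices for the example and do not correspond to the experiments reported later.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8                                    # chips per symbol (much larger in practice)
code = rng.choice([-1.0, 1.0], size=N)   # pseudo-random spreading code c[n]

def spread(m, code):
    """Spread one message symbol m in {-1, +1}: s[n] = m * c[n] (equation (1))."""
    return m * code

def matched_filter_detect(received, code):
    """Correlate the received chips with the code and decide by the sign (Figure 2)."""
    correlation = np.dot(received, code)
    return 1 if correlation >= 0 else -1

# Example: transmit m = +1 through an AWGN channel and detect it.
m = 1
received = spread(m, code) + 0.5 * rng.standard_normal(N)
print(matched_filter_detect(received, code))
```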
The Rake receiver is a series of matched filters, combined to rake in energy from multiple signal paths to achieve diversity (see Footnote 4). It is important to note that the use of a matched filter for every path is based on the assumption that the channel adds white Gaussian noise to the signal. Unfortunately, this is not always true, especially when other users' codes are not orthogonal to the receiver's or when intentional jamming is present. In such cases, it is almost impossible to accurately model the received signal distributions.

Footnote 4: For a more comprehensive discussion of the Rake receiver and the problem of multipath in communications, the reader is referred to the works of Price and Green [7] and Rappaport [6].

2.2.2 Sign Detector

According to [1], a sign detector does not assume an AWGN channel. In fact, it does not assume the noise distribution to be of any parametric form. It is the optimal fixed detector for zero-median noise. Figure 3 illustrates the operation of a sign detector. Each received chip R[n] is individually classified to m[n] according to its sign. Once all N chips are determined, the sequence is compared to both the code and the negative of the code to decide which message symbol was sent. In the case depicted in Figure 3, it is decided that the sequence R comes from transmitting the symbol +1, because the detected sequence differs from the positive code by only two chips.

[Figure 3: Sign detection for CDMA signals. The received chips R[n] are classified chip-wise by sign, and the resulting sequence is compared to both c[n] and -c[n].]

2.2.3 Adaptive Detection

In a recent series of papers [1, 2, 3], Johnson et al. have developed a universal adaptive detector called the type-based detector. A type, in information theory, is the histogram estimate of a discrete probability distribution [11]. The type-based detector uses the types constructed from training data to estimate the received signal distributions in order to asymptotically achieve optimal (minimum probability of error) performance. The rest of this section will explain the use of this detector for CDMA communications. Some of the operations of this detector are used in the implementation of the AdaBoost detector as well.

To construct types, the received signals must be quantized. Suppose the signal quantization levels are {a_t}_{t=1}^{T}, and that a set of quantized observations from one class is {[x_m]}_{m=1}^{M} (see Footnote 5). Then the type for that class (the histogram estimate of its distribution) is simply:

    \hat{P}_x(a_t) = \frac{1}{M} \sum_{m=1}^{M} I([x_m] = a_t),  t = 1, ..., T            (3)

where I() is the indicator function.

Footnote 5: I will defer the discussion of the quantization process to Section 5.2. The reader should refer to [2] for details. Also, I will use [x_m] to denote the quantized signals, to distinguish them from the unquantized x_m appearing in later sections.

In antipodal binary communications, there are two classes of chips for which types must be constructed. For the purpose of grouping chips into these classes, define the sets n^+ and n^- to be the positions of all positive chips and all negative chips, respectively, in the code. That is,

    n^+ = {n : c[n] = +1},   n^- = {n : c[n] = -1}.                   (4)

The receiver knows these sets. Therefore, by observing the received chips' positions, the detector can easily separate them into two classes, and construct the two types.
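As an illustration of equations (3) and (4), the sketch below groups the training chips by their positions in the code and forms the two types. The nearest-level quantization rule and the variable names are assumptions made only for this example.

```python
import numpy as np

def quantize(x, levels):
    """Map each received chip to the nearest quantization level a_t (one simple choice)."""
    x = np.asarray(x, dtype=float)
    levels = np.asarray(levels)
    return levels[np.argmin(np.abs(x[:, None] - levels[None, :]), axis=1)]

def build_types(training_seqs, training_labels, code, levels):
    """Group training chips by sign using n+ and n- (equation (4)) and form the
    histogram estimates (types) of equation (3) for the +1 and -1 chip classes."""
    n_plus = code == 1
    pos_chips, neg_chips = [], []
    for x, y in zip(training_seqs, training_labels):
        xq = quantize(x, levels)
        # If the symbol was +1, the chips at n+ were transmitted as +1, and vice versa.
        pos_chips.append(xq[n_plus] if y == 1 else xq[~n_plus])
        neg_chips.append(xq[~n_plus] if y == 1 else xq[n_plus])
    pos_chips = np.concatenate(pos_chips)
    neg_chips = np.concatenate(neg_chips)
    P_plus = np.array([(pos_chips == a).mean() for a in levels])
    P_minus = np.array([(neg_chips == a).mean() for a in levels])
    return P_plus, P_minus
```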
To be more specific, the chips in a sequence [x] are separated according to the following:

    x_1 = {[x[n]] : n \in n^+},   x_2 = {[x[n]] : n \in n^-}.         (5)

It is important to notice that if the sequence corresponds to a +1 symbol, the group x_1 will contain all the +1 chips, but if the sequence represents the symbol -1, then the group x_2 will contain the +1 chips instead. Therefore one can identify the chip groups only if the transmitted symbol is known (which is true in training), but one has to hypothesize the signs of the groups when unknown sequences are received.

In order to use a type-based detector, it is required that a labeled training set be transmitted periodically, ([x_1], y_1), ..., ([x_L], y_L), where the label y_l is the l-th symbol and the observation [x_l] is the received and quantized sequence of that symbol. As described, all the chips are separated into a group of +1s, x^+, and a group of -1s, x^-, each having M = (1/2)(L x N) chips (see Footnote 6). The types for the two classes of chips, P_{x^+} and P_{x^-}, are constructed according to (3). This completes the training phase of the type-based detector.

Footnote 6: The length N of a pseudo-random sequence is usually odd, but transmitting (1/2)L +1 symbols and (1/2)L -1 symbols can result in an equal number (M) of oppositely signed chips. This is not a requirement, but it is assumed for simplicity.

After the training, a sequence [R] corresponding to an unknown message symbol m is received (with the same quantization). At this point, if the chips are independent (the result of a memoryless channel), one only needs to estimate the likelihood ratio of each chip using the types. That is,

    \hat{\Lambda}_n = \frac{\hat{P}([R[n]] | s[n] = +1)}{\hat{P}([R[n]] | s[n] = -1)} = \frac{P_{x^+}([R[n]])}{P_{x^-}([R[n]])}.            (6)

Then the likelihood ratio of the whole sequence can be found:

    \hat{\Lambda} = \prod_{n=1}^{N} \frac{\hat{P}([R[n]] | m = +1)}{\hat{P}([R[n]] | m = -1)} = \frac{\prod_{n \in n^+} \hat{\Lambda}_n}{\prod_{n \in n^-} \hat{\Lambda}_n}.            (7)

Therefore the decision will be:

    \hat{m} = +1 if \hat{\Lambda} \ge 1, and \hat{m} = -1 otherwise.            (8)

However, according to Ziv [16], this structure cannot be generalized to accommodate channels with memory. It turns out that the following method is more versatile because it can be generalized to cope with interchip interference using higher order types. The theory behind it was developed by Gutman [15].

After training, the type-based detector also separates the unknown sequence [R] into two groups of chips, R^+ = {[R[n]] : n \in n^+} and R^- = {[R[n]] : n \in n^-}. Notice that the chips in the set R^+ are positive only if the sequence represents the symbol m = +1. In that case, the chips in this set and in the set x^+ (found in training) should come from the same distribution. Based on this idea, the type-based detector constructs four more types:

    P_{Z_1^+} = ((N/2) P_{R^+} + M P_{x^+}) / (N/2 + M)
    P_{Z_1^-} = ((N/2) P_{R^-} + M P_{x^-}) / (N/2 + M)
    P_{Z_0^+} = ((N/2) P_{R^-} + M P_{x^+}) / (N/2 + M)
    P_{Z_0^-} = ((N/2) P_{R^+} + M P_{x^-}) / (N/2 + M)                 (9)

where P_{R^+} and P_{R^-} are the types made from R^+ and R^-. These four types are formed from the sets constructed by pairing R^+ and R^- with x^+ and x^- in both ways. This allows the detector to hypothesize as follows:

    H_1 : m = +1 :  (R^+, x^+) ~ P_{Z_1^+},  (R^-, x^-) ~ P_{Z_1^-}
    H_0 : m = -1 :  (R^-, x^+) ~ P_{Z_0^+},  (R^+, x^-) ~ P_{Z_0^-}          (10)

It can be shown that the log likelihood of all the types given each hypothesis is:

    \Gamma_1 = -(N/2 + M) H(P_{Z_1^+}) + (N/2) H(P_{R^+}) + M H(P_{x^+})
               - (N/2 + M) H(P_{Z_1^-}) + (N/2) H(P_{R^-}) + M H(P_{x^-})

    \Gamma_0 = -(N/2 + M) H(P_{Z_0^+}) + (N/2) H(P_{R^-}) + M H(P_{x^+})
               - (N/2 + M) H(P_{Z_0^-}) + (N/2) H(P_{R^+}) + M H(P_{x^-})          (11)

where H() is the entropy function. Notice that these test statistics combine the training and testing data, which is a requirement for optimal detection in channels with memory. The decision rule based on these statistics is then simply:

    decide H_1 if \Gamma_1 \ge \Gamma_0, and H_0 otherwise.             (12)
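A minimal sketch of this decision, following equations (9) through (12), is given below. It reuses the quantize() helper from the earlier type-construction sketch, assumes the code contains N/2 chips of each sign (as in Footnote 6), and uses natural-logarithm entropy; these are illustrative assumptions rather than choices taken from the thesis.

```python
import numpy as np

def entropy(p):
    """Entropy (in nats) of a type (histogram) p; 0*log(0) is treated as 0."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def type_based_decide(R, code, levels, P_xp, P_xm, M):
    """Type-based decision of equations (9)-(12) for one received sequence R.
    P_xp and P_xm are the training types for +1 and -1 chips, each built from M chips."""
    n_plus = code == 1
    Rq = quantize(R, levels)                       # quantize() as sketched earlier
    hist = lambda chips: np.array([(chips == a).mean() for a in levels])
    P_Rp, P_Rm = hist(Rq[n_plus]), hist(Rq[~n_plus])
    half = len(code) / 2.0
    mix = lambda P_R, P_x: (half * P_R + M * P_x) / (half + M)    # equation (9)
    gamma = lambda P_Ra, P_Rb: (                                   # equation (11)
        -(half + M) * entropy(mix(P_Ra, P_xp)) + half * entropy(P_Ra) + M * entropy(P_xp)
        - (half + M) * entropy(mix(P_Rb, P_xm)) + half * entropy(P_Rb) + M * entropy(P_xm))
    gamma1 = gamma(P_Rp, P_Rm)     # hypothesis m = +1
    gamma0 = gamma(P_Rm, P_Rp)     # hypothesis m = -1
    return 1 if gamma1 >= gamma0 else -1           # equation (12)
```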
The effectiveness of the type-based detector will be demonstrated in Section 5, where the performance of all the detectors considered in this thesis is compared.

Since CDMA is widely used in wireless communications, it suffers from most of the wireless channel impairments (see Footnote 7). Very often, the channel does not fit the model for which the current Rake receiver works best. The type-based detector, in these situations, is very desirable because it can adapt to any model. The next section will describe an algorithm called boosting, which is also well suited for performing classification without any assumption. It is the purpose of this thesis to incorporate it into the detection of CDMA signals.

Footnote 7: CDMA itself also provides some resistance against some of these channel imperfections as part of its virtue.

3 BOOSTING

Roughly speaking, boosting is a learning algorithm that combines or "boosts" a number of simple classifiers to obtain an overall more powerful decision rule. Boosting is an algorithm developed in the area of machine learning. Therefore it is appropriate to explain it in that context.

3.1 Background on Machine Learning

One setup of a machine learning problem is illustrated in Figure 4. The objective is to adaptively learn about the function G that maps an input x to an output y. When the setup is used in classification problems, G is just a classifier that maps an observation into a class. Because of the nature of this research, I will only talk about using the learning machine to learn a classification rule, even though this setup is much more general (see Footnote 8). One of the resources for learning is a set of training data {x_m}_{m=1}^{M} with known labels (or classes) {y_m}_{m=1}^{M}.

[Figure 4: A setup of machine learning. The unknown mapping G produces labels y from inputs x; the learner selects a function f from a specified family to approximate G using the training data.]

Footnote 8: Other applications are found in the work of Vapnik [13].

The central component of this setup is the learning machine, henceforth called the learner. It is capable of implementing any function (or classifier) f from a specified family F. These functions can also map any x into a class \hat{y}. Thus when the learner selects one f in F to classify a training observation x_m, it can observe the output \hat{y}_m, compare it to the label y_m, and determine how good the classifier is. After the learner tries all the classifiers on all the training data, it picks the one that performs best, f_G, and the learning is finished. f_G will then be used to classify all future observations.

More specifically, the learning process is described as follows. The learner must be provided with a performance criterion in order to compare the performance of different classifiers. The performance criterion is a loss function denoted by L(y_m, \hat{y}_m(f)). In classification problems, this loss can be an indicator of the classification error of f on x (see Footnote 9), also called 0/1 error, i.e.

    L(y, \hat{y}(f)) = 0 if y = \hat{y}(f), and 1 if y \ne \hat{y}(f).            (13)

Footnote 9: To emphasize that L is the loss of the classifier f, x is eliminated in the following equations. It is understood that \hat{y}(f) means \hat{y}(x).

Ideally, we would like the average loss over all possible observations to be as small as possible. Therefore the learner should choose an f to minimize the expected risk,

    E(f) = \int L(y, \hat{y}(f)) \, dF(x, y)                          (14)

where F(x, y) is the underlying distribution over the space X x {+1, -1} for a binary classification problem. The expected risk is precisely the probability of error incurred using f as a classifier. The likelihood ratio test minimizes this error over all functions, and therefore it is called the minimum probability of error detector.
However, the likelihood ratio test requires complete knowledge of the distribution F, which is not practical. In reality, F is unknown and the learner has access only to the set of training data. Therefore one practical alternative is to adopt the empirical risk minimization (ERM) principle, which chooses the classifier f_e that minimizes the empirical risk. That is,

    f_e = \arg\min_{f \in F} E_{emp}(f) = \arg\min_{f \in F} \frac{1}{M} \sum_{m=1}^{M} L(y_m, \hat{y}_m(f)).             (15)

3.2 Boosting and AdaBoost

For this setup, it is essential to find a good set of functions F, since the learner's ability is limited by the best classifier in F. In order to ensure that F contains a member f that closely matches the desired mapping G, we either require considerable prior knowledge about G, or we need a very large and complicated set F. In many problems we have little or no information about G. Therefore, F must include a large number of (possibly very complicated) classifiers. This leads to two problems.

1. It may be difficult to perform the optimization in (15).

2. It may lead to overfitting and poor generalization characteristics when f_e is applied to unlabeled data outside the training set.

Fortunately, a change of mindset is provided by a class of learning algorithms that achieves high accuracy by voting according to the predictions of several classifiers [9]. One such method is called boosting.

In boosting, instead of finding a large and necessarily complex set of functions F containing the desired classifier, the algorithm only finds a number of weak classifiers (see Footnote 10) {h_t}_{t=1}^{T} that may perform just better than random guessing. For an observation x, each of these weak classifiers predicts its class, h_t(x). The algorithm then combines ("boosts") them into a more accurate final prediction, h_f(x), according to a weighted voting criterion.

An actual boosting algorithm will help to explain some details of this idea. Figure 5 shows one of the most successful boosting algorithms, called AdaBoost [8] (see Footnote 11). This algorithm starts by assigning an initial probability measure p^1 = (p_1^1, ..., p_M^1), usually uniform, over the training set (x_1, y_1), ..., (x_M, y_M). It then enters a loop (indexed by t), where it calls a subroutine called WeakLearn. WeakLearn finds a weak classifier h_t for that round, according to a modified ERM (MERM) principle:

    h_t = \arg\min_{h \in H} \left\{ E_{mod}(h) = \sum_{m=1}^{M} p_m^t L(y_m, h(x_m)) \right\}.             (16)

Notice that (16) differs from (15) in two regards. 1) The set of functions H, in which h_t is found, is much smaller than the set F; it is not based on much knowledge about the underlying classification rule to be learned. 2) Instead of weighting the loss of each training sample equally with a weight 1/M, WeakLearn averages the loss according to the probability measure p_m^t assigned by the boosting algorithm. The reason for this will become clear shortly.

Footnote 10: In the boosting literature, these are called weak hypotheses. To avoid confusion with the word hypotheses used in detection theory, weak classifier is used throughout this thesis.

Footnote 11: The original AdaBoost algorithm in [8], which classifies between 1 and 0, is slightly modified here in order to classify between 1 and -1.
Figure 5: Algorithm AdaBoost.

Input:
  - training set {(x_1, y_1), ..., (x_M, y_M)}
  - initial distribution p^1 = [p_1^1, ..., p_M^1] over the M samples (e.g. uniform)
Initialize: w^1 = p^1
Do for t = 1, 2, ..., T:
  1. Set p^t = w^t / \sum_{m=1}^{M} w_m^t
  2. Call WeakLearn, providing it with {(x_m, y_m, p_m^t)}_{m=1}^{M}; get back a weak decision rule h_t : X -> {+1, -1}
  3. Calculate \epsilon_t = \sum_{m=1}^{M} p_m^t I[y_m \ne h_t(x_m)]
  4. Set \beta_t = \epsilon_t / (1 - \epsilon_t)
  5. Set the new weights to be w_m^{t+1} = w_m^t \beta_t^{1 - I[y_m \ne h_t(x_m)]},  m = 1, 2, ..., M
Output: the final rule h_f(x) is
  h_f(x) = +1  if  \sum_{t=1}^{T} I[h_t(x) = +1] \log(1/\beta_t) \ge \sum_{t=1}^{T} I[h_t(x) = -1] \log(1/\beta_t),
  h_f(x) = -1  otherwise.

The minimized risk \epsilon_t = \min_{h \in H} E_{mod}(h) is a by-product of WeakLearn. Next the quantity \beta_t is calculated. This quantity increases from 0 to infinity as the risk of h_t increases from 0 to 1. It is an important quantity whose significance is clearly demonstrated in the final decision rule.

The last step of each round is a key to boosting. The algorithm updates the probability measure over the training set for the next round. Notice that w is used here instead of p. This is because w will be normalized to a true probability measure p at the beginning of the next round. The important feature of this updating step is that it relatively increases the weights of the training samples that h_t misclassified. When these weights are carried into the next WeakLearn subroutine, they tend to make WeakLearn produce a weak classifier that works better for these "hard" samples (because WeakLearn uses the MERM principle, taking p^t into account).

After T rounds, T weak classifiers {h_t}_{t=1}^{T} are generated, each associated with a quantity \beta_t indicating its risk. The final rule says: apply all {h_t}_{t=1}^{T} to an observation x, weight each vote h_t(x) with \log(1/\beta_t), and the decision is the class receiving the weighted majority vote. Obviously, the more risk h_t has, the less important its vote becomes.

3.3 Experiment with AdaBoost

The following binary classification experiment is used to demonstrate the effectiveness of AdaBoost.

The observation x is one dimensional. There are two classes, labeled y = +1 and y = -1. The pdf of the first class, p(x | y = +1), is a zero-mean, small-variance Gaussian. The class -1 has a large-variance Gaussian distribution with the same mean. Figure 6a shows these density functions. The optimal (minimum probability of error) classifier results from the maximum likelihood criterion,

    \frac{p(x | y = +1)}{p(x | y = -1)} \gtrless 1,                   (17)

which corresponds to the decision regions separated by the two thresholds (visualized by the two vertical bars in Figure 6a). Thus the decision rule to be learned should classify the observations between these two thresholds as +1s and those outside as -1s.

To design a learning algorithm for this problem using AdaBoost, a class of weak classifiers has to be found. The set chosen in this case is:

    H = { h(x; v, d) = sgn[(x - v) d] : v \in R, d \in {+1, -1} }.            (18)

This classifier is simply a directed threshold. For example, when the direction d = +1, any observation x that falls above the threshold v is classified as +1, and the -1 class is on the other side of v.

The size of the training set is chosen to be 100 for this experiment, and the number of rounds of boosting T is also 100. Therefore, the algorithm sequentially generates 100 of these directed thresholds, "remembering" the performance of each of them. Then a sequence of new observations is presented to the final decision rule. For each observation, the algorithm applies the weighted voting criterion described in Section 3.2 to make the final prediction.
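The following is a minimal Python sketch of the AdaBoost loop of Figure 5 with the directed-threshold weak classifiers of equation (18). It is only an illustration of the experiment described above; the class variances, sample sizes, and random seed are assumed values, not those used in the thesis.

```python
import numpy as np

def weak_learn(x, y, p):
    """WeakLearn for the directed-threshold family of equation (18): search thresholds v
    (midpoints of the sorted data) and directions d, and return the stump with the
    smallest weighted error under the current distribution p."""
    best = (np.inf, 0.0, 1)
    xs = np.sort(x)
    thresholds = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2.0))
    for v in thresholds:
        for d in (1, -1):
            pred = np.where((x - v) * d >= 0, 1, -1)
            err = p[pred != y].sum()
            if err < best[0]:
                best = (err, v, d)
    return best  # (epsilon_t, v, d)

def adaboost_train(x, y, T=100):
    """AdaBoost of Figure 5 with directed-threshold weak classifiers."""
    M = len(x)
    w = np.full(M, 1.0 / M)
    stumps, betas = [], []
    for _ in range(T):
        p = w / w.sum()                                   # step 1
        eps, v, d = weak_learn(x, y, p)                   # steps 2-3
        eps = min(max(eps, 1e-12), 1 - 1e-12)             # guard against 0 or 1 risk
        beta = eps / (1.0 - eps)                          # step 4
        pred = np.where((x - v) * d >= 0, 1, -1)
        w = w * beta ** (1.0 - (pred != y))               # step 5: shrink correct samples
        stumps.append((v, d))
        betas.append(beta)
    return stumps, np.array(betas)

def adaboost_predict(x_new, stumps, betas):
    """Weighted-vote final rule h_f of Figure 5."""
    votes = np.zeros_like(x_new, dtype=float)
    for (v, d), beta in zip(stumps, betas):
        votes += np.log(1.0 / beta) * np.where((x_new - v) * d >= 0, 1, -1)
    return np.where(votes >= 0, 1, -1)

# Reproduce the flavor of the Section 3.3 experiment (assumed variances): class +1 is a
# small-variance zero-mean Gaussian, class -1 a large-variance zero-mean Gaussian.
rng = np.random.default_rng(1)
y_train = rng.choice([-1, 1], size=100)
x_train = np.where(y_train == 1, 0.3 * rng.standard_normal(100), 2.0 * rng.standard_normal(100))
stumps, betas = adaboost_train(x_train, y_train, T=100)
print(adaboost_predict(np.array([-3.0, 0.0, 3.0]), stumps, betas))
```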
[Figure 6: An experiment with AdaBoost. (a) The class-conditional densities p(x | y = +1) and p(x | y = -1), with the two optimal thresholds marked. (b) AdaBoost classification results.]

As shown in Figure 6b, the observations in the middle are classified as +1s, and -1s lie on both sides. The double thresholds from the optimal classifier are also displayed in this graph. The final predictions are obviously very close to being optimal. Notice that it is very hard to foresee a double-threshold type of classifier without any knowledge of the distributions. But with boosting, one does not need such clairvoyance. In fact, even if the optimal decision rule has a more complex decision region (e.g. multiple thresholds), the same algorithm can learn it.

3.4 Performance Analysis

In the area of machine learning, there has been extensive research regarding the performance of learning machines. For a more comprehensive discussion, the reader is again referred to the book by Vapnik [13]. Here, let us review some of the performance bounds.

Recall that using the ERM principle, the learner can find a classifier f_e in F that is closest to the function G being learned. It was shown [13], with a simple constraint on the set F, that both the empirical risk and the actual risk of f_e converge in probability to the minimum possible risk provided by any function in F. The constraint is that the Vapnik-Chervonenkis (VC) dimension k of the set F be finite. The concept of VC dimension is introduced in the appendix.

Bounding the empirical risk is not a difficult task. The training error (the same as the empirical risk, see Footnote 12) of the final rule h_f in AdaBoost is bounded as follows [8]:

    E_{emp}(h_f) \le 2^T \prod_{t=1}^{T} \sqrt{\epsilon_t (1 - \epsilon_t)}.                   (19)

The generalization error (the actual risk, see Footnote 13) of h_f can also be bounded. Let \Theta_T(H) denote the class of final rules formed by combining T weak classifiers from H, and let k_\Theta be its VC dimension,

    k_\Theta \le 2 (k_H + 1)(T + 1) \log_2(e(T + 1)),                   (20)

where k_H is the VC dimension of H, which is assumed finite (see Footnote 14). Then, with high probability [8],

    E(h_f) \le E_{emp}(h_f) + O( \sqrt{ k_\Theta / M } ).                   (21)

Therefore the actual risk decays as 1/\sqrt{M} with high probability, where M is the size of the training set.

Footnote 12: Freund and Schapire [8] use "training error" to denote the empirical risk of the final classification rule. Therefore I will use these expressions interchangeably.

Footnote 13: The term "generalization error" is used by Freund and Schapire to mean the actual risk of h_f. Thus I will use these two expressions interchangeably as well.

Footnote 14: This is a reasonable assumption since H is simple. For example, the class defined in (18) has VC dimension 2.

Both of these bounds are distribution independent. A more appealing fact is that the second bound describes how the algorithm performs when the size of the training set is small. There is also a trade-off between the two bounds. As T increases, the training error in (19) vanishes; but the VC dimension of the class \Theta_T(H), k_\Theta, increases, causing the upper bound on the generalization error in (21) to increase as well. This trade-off between the training error and the generalization error can be used to guide the choice of T.

4 DETECTION FOR UNKNOWN CHANNELS USING BOOSTING

Since classification with boosting does not require assumptions about the source distributions, and since the performance is quantifiable even with limited training data, boosting is ideally suited for adaptive detection of digital signals in unknown communication channels such as wireless channels.

4.1 Detecting CDMA Signals Using AdaBoost: Method 1

A straightforward application of boosting in digital communications is the detection of binary signals. Here one of two possible symbols is sent during each bit period.
As a result of the channel impairments, the received bit R is modeled as a continuous random variable, which is governed by one of two probability distributions, p(R | s = +1) or p(R | s = -1), depending on the transmitted symbol s. Our experiment in Section 3.3 showed that the AdaBoost algorithm with the weak classifier family defined in (18) is capable of learning a classification rule in exactly this type of situation, without any assumption about those distributions. The only requirement is for the system to periodically transmit a set of labeled bits {(x_1, y_1), ..., (x_M, y_M)} as input training data to the receiver. This can easily be done during synchronization, for example.

In CDMA communications, however, a pseudo-random sequence of chips is transmitted in place of each message symbol, as explained in Section 2.1. Therefore, to best determine the symbol that was sent, one must observe the whole sequence of received chips R[n], n = 1, ..., N, during the bit interval of that symbol. A natural method is to first use AdaBoost's bit-level detection capability to classify each chip separately, and then compare the resulting binary sequence to the code and its negative. The decision is made according to the closest match. This method is very similar to that of the sign detector illustrated in Figure 3. The advantage of using AdaBoost here is its adaptability to many types of noise (see Footnote 15). In general, however, the method described above is suboptimal. For example, when the noise is zero-mean white Gaussian, its performance can only approach that of the sign detector at best, not the matched filter (see Footnote 16).

Footnote 15: However, this may be overshadowed by the sign detector's efficiency when the noise is zero-median.

Footnote 16: Experiments demonstrating this were performed, but are not included in the thesis.

4.2 Detecting CDMA Signals Using AdaBoost: Method 2, the AdaBoost Detector

This section derives the product of this research, the AdaBoost detector. As described in Sections 2.1 and 2.2.1, each message symbol m is spread by an N-chip code sequence c = (c[1], ..., c[N]). The resulting binary sequence s is transmitted through an unknown channel and received as a sequence of N continuous-valued chips R. (See Figures 1 and 2.) Frequently, a set of training symbols (y_1, ..., y_L), known to the detector, is sent. The corresponding received sequences are (x_1, ..., x_L). The AdaBoost detector must train itself with this set, and make classifications for future sequences.

First, let us look at the operation of the optimal (minimum probability of error) detector. The optimal detector for CDMA signals is based on knowledge of the likelihood functions of both positive and negative chips, p(R[n] | s[n] = +1) and p(R[n] | s[n] = -1). From these, the detector can calculate the likelihood ratio for each received chip:

    \Lambda_n = \frac{p(R[n] | s[n] = +1)}{p(R[n] | s[n] = -1)}.                   (22)

However, the optimal detector does not make decisions on the chips. Instead, it needs to calculate the likelihood ratio of the whole N-chip sequence over the bit interval. Assuming the chips are independent and identically distributed, the likelihood ratio is

    \Lambda = \prod_{n=1}^{N} \frac{p(R[n] | m = +1)}{p(R[n] | m = -1)}.                   (23)

With knowledge of the chip sets n^+ and n^- (see the definitions in (4) and Section 2.2.3), the above likelihood ratio can also be expressed as:

    \Lambda = \frac{\prod_{n \in n^+} \Lambda_n}{\prod_{n \in n^-} \Lambda_n}.                   (24)

The optimal detector makes a decision based on this likelihood ratio:

    \hat{m} = +1 if \Lambda \ge 1, and \hat{m} = -1 otherwise.                   (25)
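A minimal sketch of this clairvoyant decision rule, working in the log domain for numerical stability, is given below. The Gaussian chip likelihoods and noise standard deviation shown are only an assumed example of "known" likelihood functions.

```python
import numpy as np

def clairvoyant_detect(R, code, log_pdf_plus, log_pdf_minus):
    """Minimum-probability-of-error detection of equations (22)-(25), assuming the chip
    likelihoods p(r | s = +1) and p(r | s = -1) are known; computed in the log domain."""
    log_lr = log_pdf_plus(R) - log_pdf_minus(R)           # log of equation (22)
    n_plus = code == 1
    stat = log_lr[n_plus].sum() - log_lr[~n_plus].sum()   # log of equation (24)
    return 1 if stat >= 0 else -1                         # equation (25)

# Example chip likelihoods for an AWGN channel with an assumed noise standard deviation;
# the Gaussian normalization constant cancels in the ratio and is omitted.
sigma = 0.8
log_pdf_plus = lambda r: -(r - 1.0) ** 2 / (2 * sigma ** 2)
log_pdf_minus = lambda r: -(r + 1.0) ** 2 / (2 * sigma ** 2)
```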
The detector described above requires complete knowledge of the likelihood ratio, and is therefore impractical. However, a good estimate of the likelihood ratio made from empirical training data will also give a detector near-optimal performance [16]. The key contribution of this research is tailoring the AdaBoost algorithm to do just that.

First, the AdaBoost detector must be trained as if it were to be used for classifying chips. Recall from Section 2.2.3 that the receiver can easily separate all the training observations into a positive group of chips x^+ and a negative group x^-, based on the chips' positions. The receiver then labels each chip individually according to its group, giving AdaBoost a set of M = L x N labeled training data. Using the WeakLearn routine described in Section 3.3, AdaBoost can sequentially generate T weak classifiers {h_t}_{t=1}^{T} and compute their associated risks \epsilon_t.

The key to the AdaBoost detector is a modification of the final decision rule in AdaBoost. It is also one of the major contributions of this thesis. Instead of boosting all the h_t's predictions to determine each chip's class, the final rule is modified to give an estimate of the likelihood ratio of the chip:

    \hat{\Lambda}_n = \frac{\sum_{t=1}^{T} \log(1/\beta_t) \, I[h_t(R[n]) = +1]}{\sum_{t=1}^{T} \log(1/\beta_t) \, I[h_t(R[n]) = -1]}                   (26)

where I() is the indicator function. To classify the whole chip sequence, one simply uses this likelihood ratio estimate and proceeds as the optimal detector would with the true likelihood ratio, i.e.

    \hat{\Lambda} = \frac{\prod_{n \in n^+} \hat{\Lambda}_n}{\prod_{n \in n^-} \hat{\Lambda}_n},                   (27)

    \hat{m} = +1 if \hat{\Lambda} \ge 1, and \hat{m} = -1 otherwise.                   (28)

Notice that the quantity \log(1/\beta_t) is the weight that AdaBoost assigns to the prediction by h_t, so the expression in the numerator of (26) is simply the total weighted vote in favor of the chip being +1, and the total weighted vote in favor of -1 is the quantity in the denominator. Therefore the right-hand side of (26) is simply the ratio of the total votes for +1 to the total votes for -1. Intuitively, it is reasonable to use this quantity as an estimate of the likelihood ratio. The next section will show this with several experiments.

5 EXPERIMENTS AND DISCUSSION

5.1 AdaBoost Detector in Additive White Noise Channel

Experiments on the AdaBoost detector in additive white noise channels were conducted. In order to compare the results to those of Johnson et al. [1], the same experimental setup was used.

All digital symbols and chips came from antipodal binary classes +1 and -1. The code length N = 256 was chosen for most of the experiments. Four other detectors, along with the AdaBoost detector, were simulated. They were the matched filter (Section 2.2.1), the sign detector (Section 2.2.2), the type-based detector (Section 2.2.3), and the clairvoyant detector, which is the optimal detector described at the beginning of Section 4.2.

In each experiment, 512 N-chip testing sequences were generated from 512 random symbols, using a code. A particular type of noise with a certain power was simulated and added to those sequences. Then these sequences were given to the detectors, and all five detectors were asked to classify them into their corresponding symbols. Both the AdaBoost detector and the type-based detector were also given 32 labeled sequences (32 bits) for training. These sequences were contaminated with the same noise given to the testing data. At the end of each experiment, the bit error rate of each detector was determined.
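Before turning to the results, the following is a minimal sketch of the AdaBoost detector's decision step, equations (26) through (28), computed in the log domain. It assumes the stump-and-beta representation produced by the AdaBoost sketch of Section 3.3; the small guard constant is an implementation convenience, not part of the thesis's formulation.

```python
import numpy as np

def adaboost_chip_lr(r, stumps, betas):
    """Chip likelihood-ratio estimate of equation (26) from the trained weak classifiers."""
    votes_plus, votes_minus = 0.0, 0.0
    for (v, d), beta in zip(stumps, betas):
        weight = np.log(1.0 / beta)
        if (r - v) * d >= 0:
            votes_plus += weight       # weighted vote for the chip being +1
        else:
            votes_minus += weight      # weighted vote for the chip being -1
    eps = 1e-12                        # guard against an empty side of the vote
    return (votes_plus + eps) / (votes_minus + eps)

def adaboost_detect(R, code, stumps, betas):
    """Sequence decision of equations (27)-(28), computed in the log domain."""
    log_lr = np.array([np.log(adaboost_chip_lr(r, stumps, betas)) for r in R])
    n_plus = code == 1
    stat = log_lr[n_plus].sum() - log_lr[~n_plus].sum()
    return 1 if stat >= 0 else -1
```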
For each type of noise and for each noise power, the experiments were repeated 100 times, and the average bit error rate of each detector was recorded. This resulted in a plot of average bit error rate vs. signal-to-noise ratio for each detector and for each type of noise. These plots are shown in Figure 8. It is obvious that the performance of both the AdaBoost detector and the type-based detector is close to that of the clairvoyant detector in all three noisy cases, while the fixed detectors are only useful for certain types of noise. It is interesting to see that even though the sign detector is overall the best fixed nonparametric detector for zero-median noises (such as Laplacian noise), both the AdaBoost detector and the type-based detector are able to perform better at high SNR. The AdaBoost detector also performed slightly better than the type-based detector in two of the three cases.

However, the AdaBoost detector did not outperform the type-based detector when fewer training data were used. Figure 7 is the result of a simulation with only 1/4 of the training data used for Figure 8a (see Footnote 17).

Footnote 17: There were still L = 32 training symbols, but the code length of N = 64 effectively reduced the size of the training set M = L x N (Section 4.2).

[Figure 7: Average BER vs. SNR (dB) in Gaussian noise for the matched filter, sign detector, type-based detector, AdaBoost detector, and clairvoyant detector. This is almost the same comparison test as the one shown in Figure 8a, but a code length of N = 64 was used.]

[Figure 8: Experiments with the AdaBoost detector. Average BER vs. SNR (dB) for all five detectors in (a) Gaussian noise, (b) Laplacian noise, and (c) mixed Gaussian noise.]

A comparison between Figure 7 and Figure 8a indicates that the AdaBoost detector's performance improves faster as the training set becomes larger, apparently making better use of the training resource (relative to the type-based detector). This is also expected, since the second term in (21) says that the actual error should decay like 1/\sqrt{M}, where M is the size of the training set.

5.2 Discussion

The experiments have clearly demonstrated the AdaBoost detector's ability to perform near-optimal (close to minimum probability of error) detection of CDMA signals in several types of noise, with only one assumption about the channel: that it is memoryless. It is also shown to outperform another nonparametric adaptive detector, the type-based detector, in some cases. It has one other advantage over the type-based detector. Recall that the first step of type-based detection is to quantize all received signals. In fact, the theory behind type-based detection is based on discrete sources (see Gutman [15]). Without any knowledge of the channel, this quantization step is quite arbitrary, and it is difficult to optimize [2]. This is not a problem for the AdaBoost detector.

However, this research also revealed other open questions about the AdaBoost detector and about boosting in general. One fundamental question concerns the quality of the likelihood ratio estimate in (26). Although it is a key contribution of this research, this method of estimating the likelihood ratio has not been theoretically proven. Nevertheless, Schapire et al. have also considered the total votes on the two classes as the confidence that the algorithm has in them, and they have studied the difference between the votes (instead of the ratio).
Interested readers are referred to the works of Schapire et al. [9, 10]. If this method (or a variation of it) can be proven optimal in any sense, it will be useful for many other applications besides classification.

Another question is how to choose T. A good value can be found by the structural risk minimization principle, which involves a trade-off between the training error and the bound on the generalization error above it (see Section 3.4). This theory can be found in Vapnik's work [14]. It should be pointed out that the choice of T does not require any knowledge of the channel. There also exist other methods for choosing T, such as cross validation [8]. For the experiments in Section 5.1, the value of T was arbitrarily chosen to be 16.

Finally, the AdaBoost detector is not designed to be used in channels with memory. Therefore other methods of incorporating boosting must be investigated. One possible solution starts with considering each sequence x_l = (x_l[1], ..., x_l[N]) in the original training set as an N-dimensional vector observation. Recall that the AdaBoost detector breaks all the sequences into two groups of chips (Section 4.2), discarding all the dependency information among the chips. This alternative preserves the full sequence structure of the training set (x_1, y_1), ..., (x_L, y_L), thus preserving the possible correlation information. The weak classifiers, in this case, can be N-dimensional directed hyperplanes. Fairly efficient algorithms (see Footnote 18) to find such hyperplanes are available [12], and they can be employed in the WeakLearn subroutine. Once a number of these hyperplanes are found, they can be boosted to form complicated decision regions in the N-dimensional space. An unknown sequence R, also treated as a vector [R_1, ..., R_N], can then be classified. Since the chip dependencies are manifested in the a posteriori joint probability distributions p(R_1, ..., R_N | m = +1) and p(R_1, ..., R_N | m = -1), and since the boosting algorithm can perform detection without any assumptions about these distributions, this method may be able to handle interchip interference.

However, since N is usually large, a huge training set may be needed. If the training set is not large enough, a weak classifier (a high-dimensional hyperplane) can probably make classifications with no errors (see Footnote 19) and render boosting ineffective. On the other hand, a large training set will waste too much of the channel's time and require a long time to train, making the method impractical. Therefore one might choose some compromise, working in a relatively small vector space, e.g. 2 or 3 dimensions, dealing with length-2 or length-3 chip sequences at a time. This may capture interchip interference effects without becoming too unwieldy.

Footnote 18: These algorithms certainly require more computation than those used to find a single directed threshold.

Footnote 19: Depending on the WeakLearn, this hyperplane may be a good final classifier.
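One possible form of such a directed-hyperplane weak classifier is sketched below. The weighted perceptron-style fitting routine is only an illustrative stand-in for the algorithms of [12], and the learning rate and epoch count are assumed values.

```python
import numpy as np

def hyperplane_weak_learn(X, y, p, epochs=20, lr=0.1):
    """One possible WeakLearn for N-dimensional vector observations: fit a directed
    hyperplane h(x) = sgn(w.x + b) with a weighted perceptron-style update."""
    M, N = X.shape
    w, b = np.zeros(N), 0.0
    for _ in range(epochs):
        for m in range(M):
            if np.sign(X[m] @ w + b) != y[m]:
                w += lr * p[m] * y[m] * X[m]      # weight the update by p[m]
                b += lr * p[m] * y[m]
    pred = np.where(X @ w + b >= 0, 1, -1)
    eps = p[pred != y].sum()                       # weighted risk, as in equation (16)
    return (w, b), eps
```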
6 CONCLUSION

The modeling of a wireless communication channel is challenging due to unknown noise, interference, and the fact that the channel may be time-varying. It is therefore desirable to have adaptive detectors that automatically adjust to dynamic environments without making many assumptions.

This thesis first presented the idea of a new class of learning algorithms called boosting methods. Experiments with one of the algorithms, AdaBoost, showed its ability to learn near-optimal decision boundaries in a classification setting. Then the AdaBoost detector was developed for a wireless communication system, the CDMA system. This detector was designed to extract CDMA signals from background noise. It assumes a memoryless channel, and it adapts to the channel by learning the optimal decision rule with the help of a limited amount of training data. To examine the effectiveness of the AdaBoost detector, different types of noisy channels were simulated. The results showed that this detector could make near-optimal detections in all noisy situations.

Through this development, the research not only contributed to CDMA communications, but also demonstrated new potential of the boosting algorithms. However, the design also revealed two major questions that merit future investigation. First, the main modification made to the AdaBoost algorithm, which uses the ratio between the total weighted votes on the two chip classes as an estimate of the chip's likelihood ratio, was not rigorously examined. Therefore, even though the performance of the AdaBoost detector was excellent, solid theoretical analysis has yet to be done.

The other issue is how to handle channels with memory. The AdaBoost detector was not designed to accommodate dependencies among the chips in CDMA signals. An extended method of incorporating AdaBoost for CDMA detection was briefly described in Section 5.2. This method may solve the problem of interchip interference.

In conclusion, the theory of boosting is still relatively new. However, boosting algorithms have already established themselves as benchmarks for many practical classification problems [13, 17]. The AdaBoost detector developed in this research was also effective in extracting CDMA signals from background noise without knowing the noise type. Therefore, boosting methods are very promising for adaptive detection in unknown channels.

Appendix

A VC DIMENSION

The Vapnik-Chervonenkis (VC) dimension is a quantity that indicates the complexity of a set of functions. Referring to Figure 4, if the function to be learned, G, is a binary classification rule, then the set of functions F that the learner uses is probably a set of indicator functions. To understand the meaning of the VC dimension of such a set, consider k observations (x_1, ..., x_k). In a binary classification problem, each of them comes from one of the two classes; thus there are 2^k possible different labelings. If the set F is very rich, then for every labeling there may be some function f in F that separates the observations perfectly into their classes, and the set F is then said to shatter the k observations. On the other hand, when F is a small class, there may be certain labelings that cannot be separated by functions in F. The VC dimension of F is simply the maximum number k of observations that the functions in F can classify correctly no matter how the observations are labeled (that is, the maximum number of observations that can be shattered by F).

An example will make this clear. Consider the class H defined in (18). Each function is a directed threshold on the real line. Suppose there are two observations a < b. Then one can classify them with a directed threshold no matter which class each one comes from. For example, if a comes from +1 and b from -1, then a threshold between them directed towards a (d = -1) classifies them correctly. However, if there are three observations a < b < c, such classifiers cannot shatter them. To see this, let a and c come from one class and b from the other. This labeling cannot be realized by a single directed threshold. Therefore the class H has a VC dimension of k_H = 2.

To complete the above example, it takes a class of double thresholds with directions to shatter three observations. Obviously, this is a richer class, and it has a larger VC dimension as a consequence.
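As a small numerical illustration of this argument, the brute-force check below confirms that the directed thresholds of equation (18) shatter any two distinct points but cannot shatter three. The threshold grid and the convention sgn(0) = +1 are assumptions made only for the check.

```python
import numpy as np
from itertools import product

def stump_predict(x, v, d):
    """Directed threshold of equation (18): sgn[(x - v) d], with sgn(0) taken as +1."""
    return np.where((x - v) * d >= 0, 1, -1)

def shatters(points):
    """Brute-force check: can some directed threshold realize every labeling of points?"""
    points = np.asarray(points, dtype=float)
    thresholds = np.concatenate(([points.min() - 1.0], points + 1e-6))
    for labels in product([-1, 1], repeat=len(points)):
        if not any(np.array_equal(stump_predict(points, v, d), labels)
                   for v in thresholds for d in (-1, 1)):
            return False
    return True

print(shatters([0.0, 1.0]))        # True: two points can always be separated
print(shatters([0.0, 1.0, 2.0]))   # False: the class in equation (18) has VC dimension 2
```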
References

[1] D.H. Johnson, Y.K. Lee, O.E. Kelly, and J.L. Pistole, "Type-based detection for unknown channels," International Conference on Acoustics, Speech and Signal Processing, Atlanta, pages 2475-2478, 1996.

[2] L. Yue and D.H. Johnson, "Signal detection on wireless CDMA downlink," Vehicular Technology Conference, 1998.

[3] L. Yue and D.H. Johnson, "Type-based detection in macro-diversity reception for mobile radio signals," International Conference on Acoustics, Speech and Signal Processing, Munich, pages 4013-4016, 1997.

[4] John G. Proakis, Digital Communications, McGraw-Hill, 1989.

[5] Rodger E. Ziemer and Rodger L. Peterson, Digital Communications and Spread Spectrum Systems, Macmillan Publishing Company, 1985.

[6] Theodore S. Rappaport, Wireless Communications: Principles and Practice, Prentice Hall, 1996.

[7] R. Price and P.E. Green, Jr., "A communication technique for multipath channels," Proc. IRE, vol. 46, pages 555-570, March 1958.

[8] Yoav Freund and Robert E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Computational Learning Theory: Second European Conference, EuroCOLT '95, pages 23-37, Springer-Verlag, 1995.

[9] Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee, "Boosting the margin: A new explanation for the effectiveness of voting methods," Machine Learning: Proc. of the 14th International Conference, pages 322-330, 1997.

[10] Robert E. Schapire and Yoram Singer, "Improved boosting algorithms using confidence-rated predictions," Eleventh Annual Conference on Computational Learning Theory, 1998.

[11] T.M. Cover and J.A. Thomas, Elements of Information Theory, John Wiley and Sons, Inc., 1991.

[12] Pierre A. Devijver and Josef Kittler, Pattern Recognition: A Statistical Approach, Prentice Hall, London, 1982.

[13] Vladimir N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, 1995.

[14] Vladimir N. Vapnik, Estimation of Dependences Based on Empirical Data, Springer-Verlag, New York, 1982.

[15] Michael Gutman, "Asymptotically optimal classification for multiple tests with empirically observed statistics," IEEE Trans. Info. Th., 35:401-408, 1989.

[16] J. Ziv, "On classification with empirically observed statistics and universal data compression," IEEE Trans. Info. Th., 34:278-286, 1989.

[17] J.R. Quinlan, "Bagging, boosting, and C4.5," Thirteenth National Conference on Artificial Intelligence, pages 725-730, 1996.