THREE DIMENSIONAL LOCALIZATION AND TRACKING FOR SITE SAFETY USING FUSION OF COMPUTER VISION AND RFID

By

Rana Hammad Raza

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Electrical Engineering - Doctor of Philosophy

2013

ABSTRACT

THREE DIMENSIONAL LOCALIZATION AND TRACKING FOR SITE SAFETY USING FUSION OF COMPUTER VISION AND RFID

By

Rana Hammad Raza

We propose a state-of-the-art fusion framework of Computer Vision (CV) and Radio Frequency Identification (RFID) to support object recognition and tracking in a three dimensional space. Fusion can significantly improve performance in applications of autonomous vision and navigation and site monitoring, especially in outdoor environments. Increasing safety in construction zones and enhancing security in airports are important problems that involve understanding interactions between objects, machines and material, and they can be solved using sensor fusion and activity analysis. Identifying objects solely via vision is computationally costly, error prone, limited by occlusion, and sometimes impossible in practice. RFID can reliably identify tagged objects and can even localize targets at coarse spatial resolution. Conversely, CV can increase the performance of RFID by fine-tuning the location information and providing fuzzy features to avoid cloning or deception.

We implemented stereo using commodity cameras, used a commercial RFID-based Real Time Location System (RTLS) for our experiments, and achieved encouraging results. The performance of both modalities was evaluated separately and in fused mode. In our outdoor stereo experiments we obtained an RMS accuracy within ~7.6 in (19.3 cm) for objects up to 80 ft (24.4 m) away from the cameras. For real-time trajectories, RTLS provided 2 m to ~2.6 m location accuracy for dynamic tagged objects in a cell of 40×40 m with four readers.

We propose a fusion-based tracking algorithm, and our research demonstrates the benefits obtained when most objects are cooperative and tagged. We abstract the information structures in order to support a Site Safety System (S-3) with diverse information sources, constraints, and processes that may not have knowledge of each other. We have used relaxation to control the integration of information from CV, RFID, and naïve physics in tracking. The label elimination approach readily represents the ambiguity occurring in real-life applications. The key to reducing the computational requirements is to eliminate many labels at each filtering step while keeping those labels compatible with the observations. As a post-processing step to labeling, we have used total track smoothness as an optimization criterion to update computed tracks and increase system tracking reliability. Work site analysis can proceed even when information from one sensor or information source is unavailable at some time instances. We have shown with simulations and real data that fusion can greatly increase tracking performance and can reduce computational cost and the combination search space by up to 99% in some cases. Test cases showed how fusion can solve some difficult tracking problems outdoors. We assessed tracking performance using track error, i.e., the fraction of wrong trajectory point assignments. For some object trajectories outdoors, the fused system reduced the track error from 0.53 to 0.13. The likelihood of producing correct object trajectories in regions partially or fully occluded to CV is also increased.
We conclude that significant real-time decision-making should be possible if the S-3 system can integrate information effectively between the sensor level and the activity understanding level. Engineering faster RFID updates will likely reduce the number of objects that can be sensed; however, this should be a favorable tradeoff on a construction site. Employing knowledge-based constraints and systematically analyzing object track initiation and termination are possible near-term extensions of this research.

Copyright
RANA HAMMAD RAZA
2013

DEDICATION

To Maheen, Hamna, Sadaf, my brother, my mother and father

ACKNOWLEDGEMENTS

I would like to extend my gratefulness to my research advisor George Stockman, for his years of guidance and support. His research insight and sheer persistence have made this a thoughtful and rewarding journey. I owe him a debt of gratitude for his countless hours of reflecting, encouraging and guiding me through the entire process and believing in me. He has always been by my side like a father. I cannot find words grand enough to thank him. Likewise, I want to acknowledge my other advisor Lalita Udpa. Her advice and support have continually been valuable to me over the years. I am also grateful to my committee members Mohamed El-Gafy and Subir Biswas for being available in times when I needed their guidance. I would like to recognize the unwavering love, inspiration and constant support of my wife, Sadaf, through all stages of the process and beyond. She is my pillar, my joy and my guiding light. I really owe my beautiful daughters Hamna and Maheen for always cheering me up and even sometimes compromising on their daddy-daughter time. I am grateful and highly indebted to my mother and brother Hakim for their constant love, continued encouragement and prayers, when I needed them most. I cannot thank them enough for giving me the foundation of where I am today. I would like to keep in remembrance my late father, aunt Tasneem and grandparents, who have always been with me in spirit. I would also like to show appreciation towards my brother's family, in-laws and extended relatives and friends for their prayers and faith in me. Special thanks to Kathy and Adnan for making our stay at East Lansing an incredible experience. Finally, and most importantly, praises and thanks to Allah for all His blessings.
TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER 1 Introduction
1.1 Research problem and area of focus
1.2 Basic functionality requirement and significance of research
1.3 Motivating application: monitoring activities in kindergarten
1.4 Capabilities and limitations of CV
1.5 Capabilities and limitations of RFID
1.6 Dissertation outline

CHAPTER 2 Background
2.1 The functionality and limitations of RFID
2.1.1 RFID operation
2.1.2 Types of RFID tags
2.1.3 Radio frequency
2.1.4 Limitations and needed improvements
2.1.4.1 Standardization and cost
2.1.4.2 Read accuracy
2.1.4.3 Anti-collision
2.1.4.4 Security
2.1.4.5 Size and power
2.1.4.6 Miscellaneous limitations
2.1.5 Positional information
2.1.6 Smart objects, networks, location-aware computing
2.2 Computer Vision functionality and limitations
2.2.1 High image processing and search cost
2.2.2 Optical sensing problems
2.2.3 Object modeling
2.2.4 Object tracking
2.2.4.1 Object recognition
2.2.4.2 Object location and pose estimation
2.2.4.3 Sensing and relating 3D object points
2.3 Research projects based on fusion of CV and RFID
2.3.1 Model based recognition
2.3.2 Human-object interaction and activity analysis
2.3.3 Mobile robot localization
2.3.3.1 Map based navigation
2.3.3.2 Obstacle recognition
2.3.4 Miscellaneous
2.3.5 Natural/outdoor site management

CHAPTER 3 Proposed solution and research methodology
3.1 Tokens code observations from images and RFID
3.2 Object tracks
3.3 Obtaining 3D object location (x, y, z)
3.4 Heuristics from naïve physics
3.5 Fusion platform
3.5.1 Labeling with relational constraints
3.6 Sensor arrangement
3.6.1 Vision infrastructure
3.6.2 RFID infrastructure
3.7 Goals and related research parameters
3.7.1 Calibrating the cameras
3.7.2 Defining the ground truth
3.7.3 Structural stereo approach
3.7.4 Object sensing using multiple RFID readers
3.7.5 Integration of multiple CV sensors
3.7.6 Fusion of RFID and cameras
3.7.7 Smoothness of trajectories
3.7.8 Looming detection
3.7.9 Object inventory

CHAPTER 4 Localization of objects and scene points
4.1 Object detection and blob analysis using vision
4.1.1 Elliptical shape features for head detection
4.2 Stereo vision using ray-ray combination
4.2.1 Stereo configuration
4.2.2 Computing shortest line segment connecting two rays
4.3 3D location estimation results using stereo vision
4.3.1 Computing residual error using jig
4.3.2 Components of the stereo system used
4.3.3 Indoor stereo computation using a wireframe workspace
4.3.4 Indoor stereo computation using a surveyed lab area
4.3.5 Outdoor stereo computation
4.4 Active RFID based Real Time Location System
4.4.1 RTLS infrastructure
4.4.2 Cell architecture
4.5 RTLS based location estimation
4.5.1 Indoor location sensing
4.5.2 Outdoor location sensing
4.6 Summary discussion

CHAPTER 5 Fusion dynamics and analysis
5.1 The fusion process approach
5.2 Multiple sensor configuration
5.3 Building block for multiple sensor fusion
5.4 Benefits of fusing RFID and CV
5.5 Test cases to explain fusion and its analysis
5.5.1 Test case I - Same colored objects
5.5.2 Test case II - Different colored objects
5.5.3 Test case III - Different colored objects w/ intermittent RFID/CV feeds
5.6 Summary discussion

CHAPTER 6 Tracking using fusion of CV and RFID
6.1 Object tracking model using sensor fusion
6.2 Labeling via iterative processing
6.2.1 Sensor process
6.2.2 Combination process
6.2.3 Tracking process
6.2.4 Relaxation labeling algorithm
6.2.5 Test cases to analyze discrete relaxation labeling
6.2.5.1 Test case IV - Two same colored objects w/ simple dynamics
6.2.5.2 Test case V - Four same colored objects w/ increased complexity
6.3 Optimization process
6.3.1 Smoothness of trajectories
6.3.2 Tracking algorithm description
6.4 Summary discussion

CHAPTER 7 Experiments, results and analysis
7.1 Generating stereo trajectories
7.1.1 Real indoor trajectories from wireframe workspace
7.1.2 Mathematical trajectories
7.1.3 Real stereo trajectories from indoors lab area
7.1.4 Real stereo trajectories from outdoors
7.2 Real outdoor trajectories using RTLS
7.3 Metrics for evaluation of performance
7.4 Real-time tracking: indoor with RFID feed simulated using color
7.5 Indoor stereo live demo results
7.6 Stereo error analysis in x, y, z dimension versus distance from the cameras
7.7 Least squares analysis on real outdoor stereo trajectories
7.8 RTLS signal availability
7.9 Simulations of object tracking
7.10 Tracking efficiency using fusion
7.10.1 Simulated scenario: two persons and two briefcases
7.10.2 Real outdoor scenario
7.10.3 Outdoor scenarios with varying fusion information
7.11 Object color variations
7.12 Sensor error and synchronization problems
7.13 Summary discussion

CHAPTER 8 Conclusions and future work
8.1 Background survey
8.2 Evaluation of RFID, CV, and fused sensing
8.2.1 Demonstrated performance, potential and parameters for outdoor applications
8.3 Modeling fusion and its benefits
8.4 Data integration and filtering using relaxation labeling
8.5 3D object tracking algorithm
8.6 Future work and limitations

APPENDICES
APPENDIX A Fundamental experiments on looming
APPENDIX B Stereo concepts and calibration procedure
APPENDIX C Site survey details
APPENDIX D Wireless location sensing

REFERENCES

LIST OF TABLES

Table 1.1 Advantages and disadvantages of CV and RFID.
Table 2.1 RFID tag types.
Table 2.2 RFID typical frequencies in use and respective ranges.
Table 2.3 RFID frequency attributes.
Table 2.4 Obstacle recognition results. See Cerrada et al. [59].
Table 2.5 Activity and object recognition rates. See Jianxin et al. [69].
Table 2.6 Obstacle localization with different antenna settings. Data extracted from Songmin et al. [82].
Table 4.1 Left and right camera transformation matrices.
Table 4.2 2D residuals for left and right image of jig - scale is in pixels.
Table 4.3 3D stereo residuals for some points in wireframe workspace - scale is in inches.
Table 4.4 Comparison between ground and RFID observations in xy-plane - scale is in inches.
Table 7.1 3D ground truth and computed data for analyzing outdoor stereo error - scale is in inches.
Table 7.2 Ball projectile observed using stereo and fitted data along yz-plane - scale is in inches.
Table 7.3 Location error for RTLS tag trajectory in Figure 7.1 - scale is in inches.
Table 7.4 Track error with different points density and block length m. Track error is a fraction of wrong trajectory point assignments.
Table C.1 3D coordinates of some important landmarks in MSU Engineering courtyard - scale is in inches.
Table D.1 Location accuracies of cellular radiolocation technologies. See Kos et al. [138].

LIST OF FIGURES

Figure 1.1 Fusion based Kindergarten students' online observing system. "For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation."
Figure 2.1 General infrastructure of an RFID system. The RFID tag may be many meters distant from the reader.
Figure 2.2 Relationship of camera C with pyramid A and cube B in workspace W.
Figure 2.3 Work robot seeing a section of the Alaska Pipeline via a) an intensity image and b) its Laplacian. A few corner points are evident that can enable the robot to get oriented to inspect and operate on the pipeline. See Shapiro et al. [31].
Figure 2.4 The Perspective 3-Point problem: a camera computes its orientation relative to three points seen from a known object.
Figure 2.5 Still camera monitoring a workspace: a) workspaces can be monitored by staring cameras b) object motion can be detected by changes in region statistics.
Figure 2.6 Pose in 3D simplified by sensing 3D points - Two cameras C1 and C2 knowing their relation to the space W can triangulate to compute the coordinates of a point in that space. That point must lie on the intersection of the two imaging rays [31] Ch 13.
Figure 2.7 General architecture of RFID and vision fusion.
Figure 2.8 3D object recognition algorithm. Recreated from Figures 3 and 5 of Cerrada et al. [57].
Figure 2.9 Map based navigation scheme. Recreated from Figures 1 and 3 of Weiguo et al. [79].
Figure 3.1 Sensor error volumes: (a) rays intersection with error cones (b) intersection of error cone with RF lobe.
Figure 3.2 Block diagram of the Site Safety System using fusion of RFID and CV.
Figure 3.3 Construction scene example [98] with aspect ratio of persons and the field of view.
Figure 3.4 Schematic of dense placement of reference RFID tags.
Figure 3.5 CSL RFID based Real Time Location System: (a) active tag (b) master reader.
Figure 4.1 Flow diagram of specific RGB color detection and connected components blob analysis.
Figure 4.2 Blob detection outdoors: (a) original image (b) blobs of blue and yellow balls.
Figure 4.3 Ellipse geometry showing basic parameters to define an ellipse.
Figure 4.4 Implementation steps for elliptical shape feature detection.
Figure 4.5 Results of head detection using elliptical shape features: (a) input image 640 × 480 (b) cropped image with best two ellipses (c) cropped edge image with best two ellipses.
Figure 4.6 Error cones obtained from projecting 2D imaging error back into 3D.
Figure 4.7 General stereo configuration with two cameras viewing a 3D object in a 3D workspace W.
Figure 4.8 Shortest line segment connecting the two skew rays.
Figure 4.9 Jig images with easily recognizable feature points: (a) left image (b) right image.
Figure 4.10 Procedure for calculating 2D residuals of jig images.
Figure 4.11 Wireframe workspace experimental setup for testing stereo localization indoors - red object with spiral trajectory.
Figure 4.12 Procedure for 3D stereo location estimation in wireframe workspace indoors - flow diagram.
Figure 4.13 Computed sphere trajectory in wireframe workspace indoors.
Figure 4.14 3D survey data and model of indoors lab area with stereo cameras '●' (green). Calibration points are represented by '■' (yellow).
Figure 4.15 Indoors lab area trajectory for a person moving randomly with error estimation: (a) object tracks in 3D (b) histogram of error along z-axis.
Figure 4.16 Outdoor test site with calibration points shown by '■' (yellow): (a) left image (b) right image.
Figure 4.17 3D view of the outdoor test area with object tracks shown by '+' (red) using stereo.
Figure 4.18 Top view of the outdoor test area showing person locations computed using stereo and the ground truth '•' of outliers beyond x = 960 in (24.4 m).
Figure 4.19 CSL RFID based RTLS system: Tripods hold the readers.
Figure 4.20 RTLS location sensing indoors - setup.
Figure 4.21 RTLS location sensing indoors - floor map.
Figure 4.22 RTLS location sensing outdoors - setup: (a) master reader on tripod (b) reader pointing towards test site center (c) reference tag placed in test site.
Figure 4.23 Top view of the outdoor test area showing person locations computed using RTLS.
Figure 4.24 Computed locations of the person using RTLS and error estimation - error circles represent difference between ground truth and RTLS.
Figure 5.1 Multi-sensor configurations: (a) cooperative (b) cooperative (c) complementary (d) competitive.
Figure 5.2 Basic fusion node architecture.
Figure 5.3 CV and RFID supplementing each other - Case I: (a) same colored object tracks (b) label assignments.
Figure 5.4 CV and RFID supplementing each other - Case II: (a) different colored object tracks (b) label assignments.
Figure 5.5 CV and RFID supplementing each other - Case III: (a) considering visual occlusion and intermittent RFID, different colored object tracks (b) label assignments.
Figure 6.1 Schematic diagram of object tracking using sensor fusion and relaxation labeling.
Figure 6.2 Correct object tracks with possible compatible labels at each block of time frames.
Figure 6.3 Test case showing relaxation labeling: (a) correct trajectories of persons and balls (b) correct balls and incorrect persons' trajectories (c) left matrix - general pattern of four relaxation constraint passes and final compatible label/s; right matrix - RFID location information.
Figure 6.4 Label matrix updating steps for same colored objects at each time frame for Figure 6.3(a) tracks.
Figure 6.5 Block diagram of the smoothness algorithm.
Figure 6.6 Step by step results of an example with smoothness algorithm applied: (a) input data (b) nearest neighbor assignment (c) smoothed trajectories.
Figure 7.1 Example of stereo trajectories generated from wireframe workspace indoors.
Figure 7.2 Mathematically generated trajectories using dataset generator with T=11 and N=7.
Figure 7.3 Indoors lab area real stereo trajectory where a person moved over predefined points.
Figure 7.4 Five outdoor real stereo trajectories: (a) 3D display of site (b) zoomed in top view of trajectories.
Figure 7.5 Outdoors 2D RTLS trajectories of a tagged person (green paths).
Figure 7.6 3D stereo tracking in wireframe workspace indoors - live demo.
Figure 7.7 3D stereo tracking in lab area indoors - live demo: (a) orange cursive writing sample (b) heart shapes using orange and yellow color (c) yellow used as initializing marker to track orange.
Figure 7.8 Selected 2D corresponding points in left and right camera image for analyzing outdoor stereo error.
Figure 7.9 Stereo RMS error in x, y and z direction versus distance from the camera.
Figure 7.10 Left camera images showing projectile trajectory of a ball tossed upward.
Figure 7.11 Two different views of 3D points showing ball projectile trajectory computed using stereo.
Figure 7.12 Computed ball projectile trajectory (solid blue) with parabolic fitting (dotted magenta).
Figure 7.13 Actual trajectory (dashed blue) with linearly increasing error along y-axis and corrected trajectory (solid green) using parabolic curve fitting.
Figure 7.14 Piecewise line fitting (solid green) on RTLS tag trajectory (dotted blue).
Figure 7.15 RTLS tag location error circles.
Figure 7.16 RTLS trajectory analysis: (a) RTLS single tag trajectory (green path) (b) RTLS tag location signal availability over time.
Figure 7.17 Reduction in combination volume with probability of random ID information availability.
Figure 7.18 Possible combination volume with N objects and probability P of object ID in bursts of four tokens.
Figure 7.19 Testing fusion using simulated scenario of two persons exchanging briefcases: (a) wrong interpretation of trajectories with CV alone (b) correct interpretation of trajectories with CV & RFID fusion.
Figure 7.20 3D view of test site with calculated real trajectories.
Figure 7.21 Outdoor scenario to test fusion: (a) left camera view of test site (b) computed correct 3D ball trajectories using fusion.
Figure 7.22 Outdoor RTLS trajectories of two tagged persons with varying fusion information - persons split on sides at the center of test site.
Figure 7.23 Change in illumination observed in left camera video feed when: (a) sunny (b) shady. Color histograms: (c) blue ball in sunlight, (d) blue ball in shade, (e) yellow ball in sunlight, (f) yellow ball in shade.
Figure 7.24 Sample images for different weather and illumination conditions to study blue and yellow color consistency.
Figure 7.25 Analyzing blue and yellow ball color consistency in HSV color space under different weather and illumination conditions.
Figure 8.1 Occlusion management flow during stereo tracking.
Figure A.1 Looming image dataset at different distances: (a) object at ten feet (b) object at two feet.
Figure A.2 Graph of bounding box width and height relationship for training dataset.
Figure A.3 Graph of bounding box area vs looming distance relationship for training dataset.
Figure A.4 Graph of bounding box area vs looming distance relationship for two real-time datasets.
Figure A.5 Indoor lab platform consisting of NXT robotics kit and the iPhone 3GS to test collision avoidance using optical flow.
Figure A.6 Motion vectors computation during real-time lab demo for collision avoidance: (a) test frame k-1 (b) test frame k (c) motion vectors using Horn-Schunck (d) motion vectors using Lucas-Kanade.
Figure A.7 Real-time indoor collision avoidance experiment: (a) robot approaching the obstacle (b) robot detected the obstacle.
Figure B.1 Loss of depth information in 2D - caused by projection of 3D points on same viewing line onto 2D image.
Figure B.2 Recovering 3D point coordinates using stereo vision.
Figure B.3 Stereo correspondence problem: which points in Image 1 actually correspond to points P and Q in Image 2?
Figure B.4 Epipolar geometry.
Figure B.5 3D reconstruction using 2D image points.
Figure B.6 Coordinate system used in camera calibration: (a) 3-D world (b) camera.
Figure B.7 Shortest line segment connecting the two skew rays.
Figure C.1 MSU Engineering building satellite view.
Figure C.2 Different views of the courtyard.
Figure C.3 Total station surveying equipment - image extracted from [117].
Figure C.4 Top view of the outdoor test site with legend showing equipment position and the coordinate system.
Figure C.5 3D view of the outdoor test site with sensor configuration - scale is in inches.
Figure C.6 Outdoor test site with calibration points shown by '■' (yellow): (a) left image (b) right image.
Figure D.1 Triangulation geometry.

_____________________________________________________________________________________________

CHAPTER 1

Introduction

_____________________________________________________________________________________________

Site safety and security, which requires understanding the interaction of persons, objects and machines, is an important problem that can be solved using sensor fusion and activity analysis. Identifying objects solely via computer vision (CV) is computationally costly, error prone, limited by occlusion, and sometimes impossible in practice. Radio Frequency Identification (RFID) can reliably identify tagged objects and can even localize targets at coarse spatial resolution. Our research, which focuses on construction sites, demonstrates the benefits obtained when most objects are "cooperative" by being RFID tagged. We do not assume a controlled environment, but we do assume that a survey of the terrain exists, including benchmark locations. This partial control is needed since tracking, especially in an outdoor environment, presents difficulties with varying lighting, rain, smoke, dust, and noise, and occasional unexpected agents or objects. Real-time decision-making, which is needed for safety and security applications, should be possible if the overall system can integrate information effectively between the sensor level and the activity understanding level.

Tracking multiple objects is a fundamental problem with wide application and a rich literature. We are interested in application problems in site monitoring, security, and activity analysis. Examples are tracking workers, materials, and machines in construction sites; baggage and people in airports; or patients and care workers in medical care facilities. There are many other important applications in this domain, such as analysis of social or workplace interactions, analysis of games or shopping, asset management, old-age home monitoring, and assisting persons with disabilities.

1.1 Research problem and area of focus

In most of our experimental work we have focused on construction site safety in an off-highway outdoor environment. The approach we present is conceptually analogous and valid for other applications. Construction sites are planned areas that consist of resources such as personnel, equipment and materials involved in active work tasks. These resources are continuously dynamic and follow uncontrolled and sometimes arbitrary trajectories during the construction work. Also, construction sites are generally constrained and crowded areas. Any spatial interference in such dynamic construction sites can cause accidents involving collisions. Each year, more than 100 workers are killed and over 20,000 are injured in the construction industry [1]. One of the distinct safety problems has been identified as the proximity of workers-on-foot to heavy construction equipment and other vehicles [2].
Fatigue, work pressure, repetition of work [1], lack of awareness of existing specific risk factors, and blind spots [3] are among the major causes of such work fatalities. Moreover, many commercially available proximity warning sensors and systems can be rendered useless in the construction environment when covered with mud, ice, snow, ore, rock, and other material. Several proximity warning technologies are available to help eliminate blind spots and the associated accidents involving large off-highway equipment [4]. These include Radio Detection and Ranging (RADAR), the Global Positioning System (GPS), RFID tags, cameras, 3D laser scanners, and combinations of these technologies. Each of these comes with limitations, such as operating range, signal availability, size and weight, cost, susceptibility to false alarms, and applicability to the construction environment. These systems can be implemented successfully only if their shortcomings are properly understood and anticipated. To avoid spatial conflicts of construction resources, time-lapse photography and cameras are often used to analyze daily safety procedures [5]. Due to computational complexity in an uncontrolled environment, vision-based techniques lack the capability for real-time alerts. Therefore, using Computer Vision (CV) alone for object recognition, localization and tracking tasks is still challenging. Some of the limiting factors are blind spots, arbitrary trajectories, changes in pose, scale, lighting, occlusion, visual data quality, volume of data, and uncertainty.

1.2 Basic functionality requirement and significance of research

Construction site safety and similar applications of interest require a Site Safety System (S-3) that provides some or all of the following basic functions.

a. Detection of the presence of objects of interest (persons, machines, materials, vehicles…).
b. Identification of the objects (by class or by a unique object instance).
c. Object location in workspace coordinates or by designated areas.
d. Object track, if the object is moving.
e. Important object properties, such as shape, color, weight, speed, ownership, supplier, etc.
f. A memory representation of space and time including location of objects, trajectories, and behaviors.
g. Application specific processes that manage objects and control their behavior (such as collision avoidance or creation).

Fusion of computer vision and RFID can provide this functionality in many cases. While humans rely on visual sensing for such problems, an automatic system will perform faster and more reliably with fused input than with optical input alone. Higher level problem-specific analysis can then be applied on top of functionality (a) to (e) to create a dynamic inventory of a workspace, infer what agents or objects are doing (function f), manage interactions, and define and summarize events (function g). In the construction site domain, fusion can be used to design a real-time three dimensional localization and tracking system that better automates work safety technologies. It is possible to accurately locate and track static and moving objects using three dimensional fused data obtained from RFID and CV. The S-3 system will have video cameras, RFID tags/readers, a spatial database, local and global warning systems, and wireless networking to synchronize and transmit information to displays for remote control and alerting workers/operators.
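To make functions (a) through (f) concrete, the following minimal sketch shows one possible in-memory record for a tagged object maintained by such a system. The class and field names are illustrative assumptions used for discussion, not the data structures of the implemented S-3 system.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Observation:
    """One fused observation of an object at a time instant."""
    t: float                                 # timestamp in seconds
    xyz: Tuple[float, float, float]          # 3D location in site coordinates (function c)
    source: str                              # "stereo", "rtls", or "fused"

@dataclass
class SiteObject:
    """Dynamic inventory entry for one object of interest (functions a, b, d, e, f)."""
    tag_id: Optional[str]                    # RFID identity, None if untagged (function b)
    obj_class: str                           # e.g. "worker", "loader", "pallet" (function a)
    properties: dict = field(default_factory=dict)          # color, weight, owner, ... (function e)
    track: List[Observation] = field(default_factory=list)  # time-ordered trajectory (functions d, f)

    def last_position(self) -> Optional[Tuple[float, float, float]]:
        return self.track[-1].xyz if self.track else None
```

Application-specific processes (function g), such as a proximity alarm between a worker's track and a vehicle's track, would operate over a collection of such records.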
Such a system with effective management practices has sufficient potential to significantly improve safety at construction sites. In a construction site environment, it is safe to assume that the objects are mostly cooperative and tagged, wearing distinctive clothing, and that 3D survey data of the test site exists. The significance of this research is to explore integration of S-3 system capabilities as follows:

a. Integrating cost-effective sensors for fusion into a real-time construction site environment.
b. Enhancing system detection and recognition capabilities.
c. Understanding interactions between objects and materials in an outdoor environment.
d. Augmenting three dimensional localization and tracking to enable safety spheres for construction personnel in the work envelope of construction vehicles and heavy equipment.
e. Creating a dynamic inventory of a workspace to maximize resource efficiency. This will help improve construction site management by utilizing site information to better account for equipment, materials, personnel, and activities.

1.3 Motivating application: monitoring activities in kindergarten

Consider a motivating application of "analyzing" what is going on in a kindergarten: Figure 1.1 (see S. Nakagawa et al. [6]). The major output is video. Parents can view their child's activity on the Internet: they can see what their child does and what other children or toys he/she plays with.

Figure 1.1 Fusion based Kindergarten students' online observing system. "For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation."

Placing video cameras throughout the environment is easy, but selecting the right cameras and time slices is difficult. RFID tags can be placed on the play objects and on the children so that readers within the play space can locate and identify them as they move about. Appropriate cameras can be selected for good views of selected children and/or objects. Alarms can be implemented for interaction between designated pairs of objects. Doing summarization automatically requires automatic identification of the children and toys, and perhaps their interactions. How much time does the child spend with toys versus other children? With which children did the child interact during the day? There is much current CV research on this, but use of RFID to identify children and toys can create a working system today. Motion and activity analysis can then be used to classify and select video segments. Although automatic activity analysis may not be robust, at least the parents will be watching the right child. There are other applications requiring similar functionality – for example, monitoring assisted living facilities or studying how shoppers examine items for sale in a store. Extension of this functionality to outdoor site management can enable some control of real-time operations.

1.4 Capabilities and limitations of CV

Computer Vision has been very successful in controlled indoor environments, but is challenged in uncontrolled outdoor environments. The CV literature contains thousands of reports on object detection and recognition, tracking, and motion analysis. Image sensors are passive, cheap, can be far-seeing, and can collect a good deal of information about a scene. The output images are conveniently interpreted by humans. Commodity cameras easily produce frame rates usable for most human motion analysis.
Detections and relationships in a 2D image can often be mapped to the real 3D scene. Using multiple-camera stereo, objects can be located in 3D. Or, special active sensors can yield range/depth images. Images are useful for sensing the extent and pose of an object, its relationship to other objects, and its motion trajectory and behavior. Vision requires a clear line of sight, and the object may have to be in a suitable pose (position and orientation) for detection and identification. Object identification via CV, function 1.2(a) above, is often difficult and is usually based on sensed features 1.2(e). Even accurate features may not precisely ID an object, e.g. who is that person or what year is that Chevy Cruze. ID by features, if possible and reliable, can be computationally costly given the necessary signal processing and the variation of appearance over many possible 3D poses. So, while person identification using carefully imaged biometrics might yield ID accuracies of over 98%, more general object recognition via color camera images is far less accurate. Acquiring quality images under occlusion and variations in lighting also causes serious problems in CV applications. Uncontrolled outdoor environments might be dark, dusty, have rain or snow, and have both static and dynamic objects occluding the sensor view. Finally, CV may sometimes give too much information; for example, humans do not want to be imaged in private spaces, and may resent being watched in work places.

1.5 Capabilities and limitations of RFID

RFID applies generally in industry and business for automatic identification of items. Due to its improved functionality compared with barcodes, it is now replacing those ancestors wherever feasible. The major advantage of RFID is that its operation is independent of line of sight between reader and tag. The tags come with read and write capability when there is onboard memory. RFID can also be combined with sensors to make sensory tags. RFID can easily and reliably provide a unique object ID by transmitting a digital signal to a reader. Such reliable identification is often difficult using vision. With enough power, RFID can also transmit non-visual features of a person, such as name, weight, height, etc., and, in the case of objects, their ownership, contents, or even the full surface description of a 3D object. RFID can operate in smoke and darkness; thus it is widely used in sales and inventory systems and is replacing bar coding when cost permits. Object ID approaches 100% accuracy in commercial applications where objects are close to (presented to) a reader in a controlled environment. Objects with RFID tags can actually transmit their own physical description to an automated system or a security person. In robotics or material handling, the description might send a CAD model to a CV system to teach it how to recognize the object. RFID technology offers wide variation in terms of cost, size, sensing distance, memory and processing power, and security [7]. Higher RFID frequencies support higher data rates. The higher data rate, with an appropriate anti-collision algorithm, can enable a single reader to read a large population of tags. RFID can even be used to locate objects. Although the inexpensive nature of passive RFID allows large-scale utilization, cheap passive tags have the limitation that the tag can only provide its identity; to acquire location information, complex infrastructure is required. Some applications use active tags for localization.
The Real Time Location System [8] that we have used for our RFID sensing is discussed in Chapter 4. RFID codes cannot be sensed by humans and hence can yield less overt ID and might be tolerated in private human spaces. RFID requires that an object be physically tagged, thus changing the object itself and requiring that the object be "cooperative". Although passive RFID tags can be tiny and do not require their own energy source, they operate over a limited range and have limited memory. Highly functional RFID tags require an energy source for communication, memory, and processing. An RFID tag is a proxy for an actual object, so it or its communication can be counterfeited, thus making an object appear to be what it is not. Simple examples of this would be one driver stealing an EZ-Pass device from another driver, a shopper moving a tag from a cheap article to an expensive one, or two airline passengers swapping their RFID boarding passes. Thus RFID tags in critical security applications need to use encryption and secure operating system principles. Table 1.1 highlights the relative strengths and weaknesses that will be discussed in Chapter 2 and referenced throughout this thesis.

Table 1.1 Advantages and disadvantages of CV and RFID.

Computer Vision - Advantages:
- Passive feature-based sensing
- 2D array represents many properties of the 3D world
- Cheap commodity sensors
- Similar to human vision
- Can recognize object and pose
- Yields human-useable displays

Computer Vision - Disadvantages:
- Occlusion prevents sensing
- Variations in lighting and object appearance
- Variations in 3D object pose
- Some object properties are not observable
- Object/background separation

RFID - Advantages:
- Object oriented sensing
- Can provide arbitrary symbolic object properties
- Occlusion not a problem in several cases
- Object recognition not a problem
- Coarse object location possible

RFID - Disadvantages:
- Active sensing recommended for distant location sensing
- Sensing distance and angle can be complex
- Greater distance requires greater power
- Tag content must be written
- Tags can be cloned or lost

1.6 Dissertation outline

The rest of this dissertation is organized as follows. Chapter 2 provides a summary of the related literature reviewed in developing the presented work. Chapter 3 provides the problem formulation and the approach to its solution. Chapter 4 highlights location sensing with respect to stereo CV and RFID. Chapter 5 explains the fusion model. Chapter 6 provides details on tracking using fusion. Testing methodology and experimental results are given in Chapter 7, along with a discussion of limitations and problems. Finally, Chapter 8 presents the research conclusions and the need for future work.

_____________________________________________________________________________________________

CHAPTER 2

Background

_____________________________________________________________________________________________

This chapter provides the necessary background material for understanding the work and discussion in the rest of this dissertation. Where required, a brief background on the topic under discussion will also appear in other chapters. This chapter discusses the functionality of computer vision and RFID and the constraints that arise when these techniques are used separately. As the discussion progresses, state-of-the-art projects that have used fusion for detection, identification, location and tracking are explained; some single-mode applications are also discussed to show how fusion can improve performance.
We have also explained our work in the natural outdoor environment. These tag-based vision projects have been categorized in the context of application-defined functionalities.

Animal vision solves recognition and navigation problems in a 3D world. An organism needs to know what objects and activities are in its environment and what consequences these have. Visual sensing is often integrated with auditory or other sensing -- and also with memory -- for an organism to make decisions. A dog is attacking, a car is passing, a trash bin is heavy. The invention of radio frequency identification tags now provides the capability for an object to notify a nearby agent of its presence and properties; and it may be that neither the presence nor the properties are observable via vision. For example, a buried steel drum can transmit a description of its contents, an unseen vehicle can warn of its approach, and a store item can tell that it has not been purchased. We start with the functionality and limitations of RFID.

2.1 The functionality and limitations of RFID

RFID is a mature and economically successful technology that can provide reliable symbolic identification of tagged objects, including features of the object that cannot be observed via vision. RFID technology has revolutionized manufacturing, distribution, sales, and transportation by providing non-contact, non-optical object ID. Cheap low-end RFID tags are substitutes for a printed bar code. Higher-end RFID tags are communicating devices with a power source, memory and processing power, and are much like a wireless computing device. Moreover, RFID technology can not only provide object ID, but can also be used to provide object location. RFID enables all individual railroad cars [9] and many trucks in the US [10] to be recognized by a nearby reader and tracked via a network. The RFID-based automatic vehicle identification system from TransCore provides control and tracking of commercial ground transportation vehicles at the busiest US airports [11]. Other tracking solutions integrating RFID with GPS [12] and cell phones are currently available in world markets. In this chapter, we do not concentrate on how RFID can replace optical sensing or CV, but instead on how RFID together with CV can enable new or more capable systems.

Automatic Identification (Auto-ID) deals with information control and material flow problems. Contactless identification is an integral part of automatic identification. It is an independent research area that combines varied fields such as telecommunication, semiconductors, cryptography, security, data protection and handling. Barcodes, magnetic inks, optical character recognition (OCR), smart cards, biometrics and Radio Frequency Identification (RFID) are automatic identification methods. RFID is a wireless technology since it communicates using radio waves. Want [13] describes the benefits of RFID technology, which include more reliable scanning, better tracking, integrated metadata management, reduced back-end communication, efficient label management, and wireless sensing. In addition to its obvious application in business architectures, RFID has also been integrated into robotics and artificial intelligence applications. In the subsections below, we consider several aspects of RFID, such as cost, the wavelengths that are used, the power that is needed, and the distances and angles between object and reader.
Different types of engineered solutions are available for different applications.

2.1.1 RFID operation

The major components of the RFID infrastructure are the tag, reader, database and a host application: Figure 2.1. The host application manages the RFID reader. It allows the reader, via radio frequency through its antenna, to activate and/or directly communicate with a transponder on a tag attached to an object. For example, access to a parking lot via an RFID tag presented a few inches from a reader at the gate allows the reader to remotely read and/or write data to the tag. The tag modifies the received signal and transmits back a modulated signal. The reader, through its antenna, receives this modulated signal and decodes the tag ID. The data related to the tag ID in the database is then used by the host application. For example, when a drum of antifreeze is loaded onto a truck, the tag on the drum and the tag on the truck are read in order to complete shipping and inventory records. RFID read distance ranges from a few inches to more than 100 m, and RFID antenna read angle ranges from a pencil beam to 360°.

Figure 2.1 General infrastructure of an RFID system. The RFID tag may be many meters distant from the reader.

2.1.2 Types of RFID tags

RFID tags can be classified in different ways; however, we use the two groupings commonly used in industry [14] – by power and by reading distance. First, depending upon the chip's power requirement, RFID tags can be categorized into passive, semi-passive (sometimes called semi-active), and active tags. Moreover, the type of tag dictates its memory capacity. Passive tags consist of a semiconductor chip and an antenna. Electromagnetic energy received from the reader is used to power the response of the passive tag back to the reader. The read distance of passive tags is short due to the limited induced voltage and reflected signal. Semi-passive tags contain a battery for powering only the logic in the tag; however, their communication principle is just like that of passive tags. Due to the onboard battery they can operate at longer ranges than passive tags, since they do not have to rely only on induced voltage. An active tag utilizes an onboard battery for communicating and for powering the tag logic, and operates at ranges greater than passive and semi-passive tags. Unlike passive and semi-passive tags, the added benefit of an active tag is that it can initiate, as well as respond to, communication with the reader. The broad categories of RFID tags and their functionalities are listed in Table 2.1.

Table 2.1 RFID tag types.

Passive tags:
- Functionality: purely passive; can vary from read only to read and write
- Power: from the reader
- Life span: indefinite
- Security: ranges from zero to low security
- Communication: response only

Semi-passive tags:
- Functionality: integrated sensing circuitry and onboard battery power to supplement received energy
- Power: onboard battery
- Life span: depends upon battery life
- Security: minimal to highly secured
- Communication: response only

Active tags:
- Functionality: onboard battery, complex protocols and communication with active tags
- Power: onboard battery
- Life span: depends upon battery life
- Security: highly secured
- Communication: respond or initiate

A second classification of tags depends upon the field of operation, categorizing the tags as near field or far field. The near field tag's operating principle is based on Faraday's law of magnetic induction. The reader, through its antenna coil, induces voltage in the tag's coil.
This induced voltage is then varied by the tag's onboard circuitry by changing the applied load, thereby encoding a unique tag ID. This varying induced voltage is then sensed at the reader. This method of sending data in near field operation is called "load modulation". The far field tag operates on the principle of radio waves. The reader propagates electromagnetic waves using a dipole antenna. Some of this energy is reflected back from the tag dipole antenna due to impedance mismatch. Changing the antenna impedance over time causes variation in the reflected signal strength. This pattern is used to encode the tag ID, which is then decoded at the reader. This way of sending data to the reader is called "back scattering". The physics of these designs determine their cost and performance.

2.1.3 Radio frequency

RFID uses radio waves with frequencies from 125 KHz to approximately 3.1 GHz. The lack of worldwide uniformity of frequency regulation is hampering international standards for RFID systems. Though there has not been an international consensus on the frequency bands for RFID, typical RFID frequencies with a read range comparison and respective costs are given in Table 2.2. A summary of the commonly used RFID frequencies with their different attributes is given in Table 2.3.

Table 2.2 RFID typical frequencies in use and respective ranges.

Passive tags:
- LF (band 30 - 300 KHz; RFID frequency 125 - 134 KHz): read range 0.2 to 0.5 m; cost $0.2-$0.8
- HF (band 3 - 30 MHz; RFID frequency 13.56 MHz): read range 1.5 to 3 m; cost $0.2-$2
- UHF (band 0.3 - 3 GHz; RFID frequency 865 - 956 MHz): read range 0.5 to 7 m; cost $0.2-$0.8

Semi-passive tags:
- Available in LF, HF and UHF bands: read range 20 to 100 m; cost $4-$20

Active tags:
- UHF (band 0.3 - 3 GHz; RFID frequency 433 MHz): read range ≥ 100 m; cost $5-$50
- MW (band 2 - 30 GHz; RFID frequency 2.45 GHz): read range ≥ 100 m; cost $5-$50
- UWB (band 2 - 30 GHz; RFID frequency 3.1 - 10.6 GHz): read range ≥ 100 m; cost $5-$50

Table 2.3 RFID frequency attributes.

Low Frequency (125 - 134 KHz):
- Data rate: slow
- Range: close contact
- Penetration: penetrates water and tissue
- Tag size: relatively bigger
- Moisture effect: no effect

High Frequency (13.56 MHz):
- Data rate: moderate
- Range: about 3 ft
- Penetration: not good near metal; penetrates some materials
- Tag size: thin construction but relatively bigger tag size
- Moisture effect: negative effect

Ultra High Frequency (865 - 956 MHz):
- Data rate: high
- Range: about 10 to 30 ft
- Penetration: does not penetrate metals
- Tag size: small
- Moisture effect: negative effect

Microwave Frequency (2.45 GHz):
- Data rate: very high
- Range: more than 100 ft
- Penetration: does not penetrate metals
- Tag size: small
- Moisture effect: no effect

2.1.4 Limitations and needed improvements

For maximum exploitation of RFID technology there is still a need for technical and security advancement. The following are some of the major limitations.

2.1.4.1 Standardization and cost

Business demands have reduced RFID cost so that it can replace other labeling technologies. However, two standards are generally followed: EPC, governed by the Auto-ID Center, and the ISO-specified set of standards. Therefore, a definition of mutual commonality is required for global application and interpretation of RFID coding.

2.1.4.2 Read accuracy

Read accuracy is important for a specific application and environment. It is affected by false-negative readings, i.e. missing a tag, and false-positive readings, i.e. detection of a tag that is not in range. Read accuracy is influenced by RFID reader design, reader and/or tag orientation, and obstructions. A typical E-ZPass installation at a highway toll station operates at UHF (900 MHz) with flawless detection of vehicles from low-lying sports cars to large trucks moving at 5 MPH [15].
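To relate the frequencies, powers, and read ranges above, the sketch below estimates the forward-link-limited read range of a passive UHF tag using the standard Friis free-space equation. All parameter values (reader power, antenna gains, tag wake-up threshold) are illustrative assumptions, and real ranges are further reduced by multipath, tag detuning, and obstructions.

```python
import math

def passive_read_range(freq_hz, p_tx_w, g_tx_dbi, g_tag_dbi, p_threshold_w):
    """Free-space (Friis) estimate of the distance at which the tag still
    harvests enough power to wake up: Pr = Pt*Gt*Gr*(lambda/(4*pi*d))^2."""
    wavelength = 3e8 / freq_hz
    g_tx = 10 ** (g_tx_dbi / 10.0)      # dBi -> linear gain
    g_tag = 10 ** (g_tag_dbi / 10.0)
    return (wavelength / (4 * math.pi)) * math.sqrt(p_tx_w * g_tx * g_tag / p_threshold_w)

# Illustrative numbers only: 915 MHz reader at 1 W with a 6 dBi antenna,
# 2 dBi tag antenna, and a 50 microwatt tag wake-up threshold.
d = passive_read_range(915e6, 1.0, 6.0, 2.0, 50e-6)
print(f"Estimated forward-link read range: {d:.1f} m")
```

With these assumed numbers the estimate is on the order of 10 m, roughly consistent with the upper end of the passive UHF ranges listed in Table 2.2.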
2.1.4.3 Anti-collision

Anti-collision algorithms are applied to overcome signal collision and data loss when several tags are scanned at the same time. The reader as well as the tags can adopt an anti-collision algorithm [7]. Presently, anti-collision technologies allow simultaneous communication between a reader and up to 2000 tags in the reading area.

2.1.4.4 Security

An RFID system should ensure security at all interfaces to prevent unauthorized processes from reading or writing tag data. For example, a business must prevent a shopper from changing the price of an item, and the toll authority must prevent driver A from charging an EZ-Pass toll to driver B's account. Higher level security features result in increased tag cost. Common security features are write lock, password protection, authentication, stream encryption and crypto processors [16]. Note that interfering with an RFID transaction is similar to disguising an object or incorrect feature detection in the optical domain.

2.1.4.5 Size and power

Nanotechnology is exploited to reduce the size of RFID tags. A few of the industry's lowest-power microprocessor chips, such as the Phoenix, are reported to require only 30 pW in sleep mode. For comparison, some passive tags have power requirements ranging from 5.1 µW to 25 µW depending upon the frequency and read range. For passive tags, nano-particles are used to produce printable RFID transponders. Semi-passive and active tags incorporate a battery, and thus tag size and cost increase; temperature limits as well as time of service must also be considered. Toward a design without a battery, current research focuses on zero-energy RFID tags powered by thermal, vibration (piezoelectric), or solar energy.

2.1.4.6 Miscellaneous limitations

Having stated the main strengths of RFID for fusion, we mention the weaknesses relative to our overall goals of object recognition and tracking. RFID tags do not reveal the appearance of an object – size, shape, color, etc. – unless it is symbolically encoded in the tag memory. CV can address this issue by providing visual analysis of the scene. RFID tags do not reveal object orientation unless multiple antennas [17] or a lattice of tags is affixed, which demands a composite arrangement. In cases where an initial estimate of the pose can be obtained from RFID, CV can then be used for refined pose estimation and tracking. Finally, more power is required to supply more information to a reader and at greater distances – that is, active RFID is needed. Apart from indoor inventory management and similar long-range settings, such a setup is generally required in outdoor environments, where CV is mostly constrained by lighting conditions and other weather effects.

2.1.5 Positional information

Within business applications the early focus for RFID use was on identification tasks only. However, with the expansion of RFID to location and tracking problems, position information comes into play; for example, where objects must be moved or grasped by robots or transfer systems. For positioning in outdoor environments GPS is often used; however, its reliability in indoor scenes is poor. RFID identification systems generally lack positional information, thereby not providing direct information on the tagged object location. A network of RFID readers can be created, where the readers are used as artificial landmarks. An object can be located by being near to a known reader at a known location.
For example, using only RFID, it is easy to determine what cars are in a parking lot that has a reader at the gate, or in which hospital room a tagged doctor is, thus giving "symbolic" location. Coordinate information will be available according to the accuracy of the known reader location. Methods that use signal strength or triangulation from multiple readers to compute general object coordinates are discussed further under Wireless Location Sensing (WLS) in Appendix D. For completeness here, we highlight a few reported results. In [18] the Average Error Distance using an active RFID tag infrastructure working at 433 MHz in an outdoor environment is reported to be better than 7 m within a range of 100~500 m. In [19] the accuracy using a 96-bit UHF passive RFID infrastructure to localize objects in an indoor environment was 15 cm within a 2 × 3 m area. In [8] CSL Technologies provides an economical off-the-shelf Real Time Location System (RTLS) using active RFID infrastructure at 2.4 GHz in an outdoor environment with an accuracy of about 1~2 m within a range of 200 m. The system is in use for tracking elephants in the Dallas zoo [20] and has been independently evaluated in [21]. RFID localization will depend on the number of objects and the amount of environmental clutter in the application. Some other RTLS equipment providers are Ubisense [22] and AeroScout [23].

2.1.6 Smart objects, networks, location-aware computing

The technology of wireless and mobile computing and communication is vast and changing rapidly. Functionally, a cell phone is much like an active RFID tag, the major difference being that the cell phone is designed to connect a human to a network rather than to some other object. Cell phones can have large memories and exchange arbitrary data across networks. They can act as GPS receivers and provide location information – hence a new field called "location-aware computing". They can also contain accelerometers that provide information on movement. Commodity pricing brings impressive power to cell phones at moderate cost. By using multiple local radio signals from known locations, a device or tagged object can locate itself within a few centimeters in a work area one kilometer across [24]. This supports precision agriculture, where precise location supports faster field operations, correspondence of work site points to aerial imagery or other sensor input, and more efficient application of fertilizer or pesticides. Wireless location sensing has provided automation to indoor and outdoor systems. Outdoor location sensing systems are generally based on line-of-sight technologies, e.g. GPS and cellular, while indoor systems use local positioning systems based on WLAN, Bluetooth, sensor networks, RFID, infrared, ultrasonic, etc., or combinations of these. There is a rich literature available on local sensing [25], [26], [27], [28], [29], [30]. We have briefly highlighted wireless location sensing in Appendix D.
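The coordinate-level localization referred to in Sections 2.1.5 and 2.1.6 usually reduces to trilateration: each reader converts signal strength or time of flight into a range estimate, and the tag position is the point that best agrees with all ranges. The following is a minimal least-squares sketch of that idea; the reader positions, the simulated ranges, and the use of NumPy are illustrative assumptions, not the algorithm used by the commercial RTLS described later.

```python
import numpy as np

def trilaterate(readers, ranges):
    """Least-squares 2D position from >= 3 reader positions and range estimates.
    Subtracting the first range equation from the others linearizes the problem:
    ||p - r_i||^2 - ||p - r_0||^2 = d_i^2 - d_0^2  ->  A p = b."""
    r0, d0 = readers[0], ranges[0]
    A = 2.0 * (readers[1:] - r0)
    b = (d0**2 - ranges[1:]**2) + np.sum(readers[1:]**2, axis=1) - np.sum(r0**2)
    p, *_ = np.linalg.lstsq(A, b, rcond=None)
    return p

# Four hypothetical readers at the corners of a 40 m x 40 m cell (meters).
readers = np.array([[0.0, 0.0], [40.0, 0.0], [0.0, 40.0], [40.0, 40.0]])
true_pos = np.array([12.0, 27.0])
ranges = np.linalg.norm(readers - true_pos, axis=1) + np.random.normal(0, 1.0, 4)  # ~1 m range noise
print(trilaterate(readers, ranges))   # close to (12, 27)
```

With roughly 1 m of range noise, the recovered position errs by about a meter or two, which is in line with the RTLS accuracies quoted above.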
A recent survey of pedestrian detection from single video frames from an on board 24 automobile camera concluded that humans are skilled at detections whereas current algorithms perform poorly [32]. Since the vehicle safety application is of great importance, research will continue, and along multiple lines including both monocular imagery and fused input. CV is related to both image processing and artificial intelligence, depending upon whether the (“lower level”) image processing is emphasized or how sensed features are related to memory models in recognizing objects (“higher level”) [33], [34]. Object recognition and visual motion analysis are two difficult problems for CV that generally involve several steps, each of which may be difficult. At the lower level, imagery needs to be normalized or interpreted relative to lighting and background so that image regions or boundaries corresponding to objects of interest can be extracted. At the higher level the features or extent of the object regions must be matched in some way to models of learned objects in memory. Matching of 3D objects can be complex due to unknown scale between the real object and sensed image, the large number of possible viewpoints (poses) and the effects of occlusion that prevents observation of some object features. Objects can be deformable or come in varied sizes and shapes, such as the human body. Moreover, even when there exists a workable CV solution, the computational costs can be high due to the large amount of image data processed and the large number of possible matches to memory. So, clearly there will be many applications that will benefit from using RFID for object recognition. Figure 2.2 shows a camera C observing a workspace with coordinate system W containing two objects, pyramid A and cube B, each with their own local coordinate systems. There are several methods that C can use to calibrate its relationship to W by observing points of W [31] Ch 25 13, which would be necessary to understand the activities of A and B in the space. The shape of each object can be defined in terms of its own object-centered coordinate system. The relationship between objects B and C can be represented in terms of coordinate system W. The camera has its own 3D coordinate system, and its pose defines a projection from 3D space W to the 2D image plane I. The relationship between camera C and an object can be computed using three 3D points of that object and the 2D images of those points. Figure 2.2 Relationship of camera C with pyramid A and cube B in workspace W. 26 Figure 2.3 shows an intensity image from underneath the Alaskan Pipeline. Because of the shadowing, the image intensities were stretched in order to show detail in the shadows. Meaningful image regions are difficult to identify; for example, the surface of the pipe appears striated by corrosion. By applying a Laplacian operator, points of high intensity change are highlighted. Some of these show object structures and corner points, which can be identified in the image using higher level image processing operations [31] Ch 10. If an inspection robot knows its global location and has some model of what it expects to see, it can compute its orientation relative to the pipeline structure in order to perform its work. (b) (a) Figure 2.3 Work robot seeing a section of the Alaska Pipeline via a) an intensity image and b) its Laplacian. A few corner points are evident that can enable the robot to get oriented to inspect and operate on the pipeline. See Shapiro et al. 
[31]. 27 Figure 2.4 illustrates the “P3P problem”: how does a known camera platform compute its orientation relative to a known object using the coordinates of three points of that object and a perspective projection of those three points. Fischler and Bolles [35] treat this problem, give a solution for P3P, and discuss more robust solutions when more points are known. The human head is a much-studied particular object that often has three observed points that can be used for computing head pose and possibly computing a normalized frontal view using that pose [36], [37]. Figure 2.4 The Perspective 3-Point problem: a camera computes its orientation relative to three points seen from a known object. Some areas of CV are not of concern here: for example, automated object inspection and measurement operate on precise representations of a known object and are not helped by RFID. 28 Image enhancement, restoration, coding, etc. whose outputs are again images are also of no concern. On the other hand, adding symbolic tags to raw video indicating what objects or actions occur within a video segment can indeed benefit from RFID. For image or scene understanding, an autonomous agent must know what the objects are and where they are in its environment. Some environments are controlled, meaning that possible objects, background, and lighting are known. Uncontrolled environments make trouble for autonomous vision, since objects, backgrounds, and lighting are uncontrolled or unknown. Environments can be “in between”, for example, a soccer field or parking lot has mixed properties. There are major challenges to CV in uncontrolled scenes, some of which can be practically solved using RFID as noted below. 2.2.1 High image processing and search cost A 2D image array records many relations in the 3D world. Although cameras can be cheap, processing an array of pixels can be costly. Even if viable object recognition algorithms exist, they may be expensive in time and memory due to both the large number of pixels and the large number of operations on those pixels. In addition, recognition implies stored object representations and a matching algorithm, which imply memory and computational cost [38]. Computational cost rises with the number of possible objects. RFID clearly can alleviate these problems by having objects declare their presence to a reader. The observer/reader can then request an object model from the tagged object, or from a network using the object ID as key. 29 2.2.2 Optical sensing problems Several distortions are present in viewing the 3D world via a 2D image, which may interfere with object extraction or matching for object identification, e.g. lens distortion, lighting variation, digitization noise, and object surface variation. Computation may be required to restore proper object features. Extra computation is needed for partial matching when a perfect object representation cannot be extracted from an image. Excessive computation and the uncertainty inherent in matching partial representations can be avoided if RFID can reliably provide object identification, and possibly even object coarse location. 2.2.3 Object modeling Irving Biederman [39] stated that a six-year old child might recognize 30,000 different objects while having a verbal vocabulary of only a few thousand words. The variety of objects, both man-made and natural, makes general object modeling extremely difficult. Imagine a junk yard robot tasked with sorting all the refuse of society! 
A grocery store or airline lobby is also very complex. Different types of object models have been proposed for different types of objects and applications. Learning or teaching of objects and the recognition algorithms vary with the type of object model [40]. Three common types of object models are appearance-based, featurebased, and geometric-based. Appearance-based models represent an object, or object part, based on the sensor representation of it [41], [42]. Feature-based models typically represent an object as a fixed length vector [list] of features computed from that object [43]. Geometric-based models typically represent an object as an aggregate of vertices, creases, surface patches, etc. in 30 an object coordinate system. The type of model determines how it is created or learned and how it is used in recognition. Some models may have to be changed even while in use – for example, an object tracker using an appearance-based model has to continuously update the model during tracking as the lighting and image shape changes. If an object can transmit its own model information via active RFID, both the search of the observer’s model memory and the combinatorics of matching model to observations can be greatly reduced. 2.2.4 Object tracking Recognition and tracking an object in an image sequence is one fundamental problem of computer vision. The goal is usually to recognize a moving object and analyze what it is doing. If the observer is in motion, objects that are stationary in 3D will yield apparent motion to the sensor complicating analysis, as in the case of a moving car and moving pedestrians. Figure 2.5(a) shows an image from a staring still camera monitoring a workspace. Entry of a person is detected by a change in region statistics over a few video frames. Simple change detection greatly simplifies segmentation; however, object detection and ID remain as problems. Related applications are motion-based recognition, automated surveillance, video indexing, humancomputer interaction (HCI), traffic monitoring and vehicle navigation. Modeling object appearance and movement in vision-based methods are both computationally complex. Statistical and area processing methods of CV might be replaced by an engineered RFID solution. Khan and Shah [44] survey background on various image processing approaches to tracking objects and present a novel method for tracking people moving on a plane by combining information from multiple single image viewpoints. While their viewpoint combination method 31 performs well on the tests, it can be stymied by strong occlusion and by intersecting object tracks. They describe additional appearance-based methods that might help remove these ambiguities, but probably not as well as can fusion of RFID. (a) (b) Figure 2.5 Still camera monitoring a workspace: a) workspaces can be monitored by staring cameras. b) object motion can be detected by changes in region statistics. 2.2.4.1 Object recognition Performance of recognition and tracking systems strongly depends on their ability to detect and identify objects in some environment. The motion of the object may be necessary for its identification. The detection of an object might be performed in the first frame or in all frames. The complexity increases due to false alarms and false dismissals and also due to objects actually entering or leaving the observed space. 
Some of the CV processes used in detecting objects include feature point detection, background subtraction, supervised learning, and segmentation 32 [32]. Due to the projection of 3D to 2D and due to articulation of some objects, whatever model that the observer is using has to change over time, adding more complexity to the tracking task. 2.2.4.2 Object location and pose estimation Pose estimation is the process that estimates the position and orientation of an object in some coordinate system. Mathematically, we need to determine the three angles, or orientation parameters, and the three position parameters orienting and locating an object in the 3D coordinate space. Pose is used by a mobile platform for collision avoidance or interaction with the object. Some previous related work is given in [45], [46]. Figure 2.2 shows a camera viewing two objects in a workspace W. The application may need to compute the pose of each object relative to the workspace coordinates, as in the case of surveillance; relative to the observer, e.g. in the case of a robot operating on the objects; or relative to each other, as in the case of activity recognition. Computing the pose of the observer relative to an object has already been introduced and sketched in Figure 2.4. 2.2.4.3 Sensing and relating 3D object points Pose in 3D is simplified by sensing 3D points on the object rather than just 2D points in an image. Stereo can be used to do this, as shown in Figure 2.6. If two cameras with known pose in 3D space W observe the same object point P, then P can be located in space W by intersecting the two camera rays in space. (Due to approximation errors, the closest approach of two rays is actually computed. See [31] Ch 13. If more than two cameras are used, then robust analysis can be used on a set of approximate ray intersections.) 3D sensors have been constructed by 33 packaging multiple cameras. Or, if the camera C2 in Figure 2.6 is replaced by a laser beam or sheet of laser light, then a “structured-light” device is created. LIDAR scanners can compute range to a 3D point of an object surface by comparing the phase difference of a modulated light beam sent to and reflected from that surface. Thus there exist unit “range sensors” that can sense an entire scene as a set, perhaps a dense set, of 3D surface points [47], [48], [49], [50] and some current cars have these for collision avoidance [13]. Having 3D points greatly helps in scene segmentation and object shape analysis relative to having only 2D image features; however, it does not make segmentation and recognition easy. Once a set of 3D points is available from an object surface, they can be matched to a model surface using the general Iterated Closest Point Algorithm [51] to compute relative pose as well as quality of match. Approximate pose is required as a starting point. Point or surface matching is extended to a moving platform with SLAM (Simultaneous Localization and Mapping) [52]. By matching in 3D, we have seen that a sensor can compute its pose relative to 3D object/scene points. The sensor can then move and compute its new pose; moreover, it can compute the pose of newly observed scene points relative to formerly observed scene points and thus grow a map of a scene being explored. 34 Figure 2.6 Pose in 3D simplified by sensing 3D points - Two cameras C1 and C2 knowing their relation to the space W can triangulate to compute the coordinates of a point in that space. That point must lie on the intersection of the two imaging rays [31] Ch 13. 
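To make the ray-intersection idea of Figure 2.6 concrete, the following Python/NumPy sketch computes the closest approach of two sensing rays and returns the midpoint of the shortest connecting segment; the function name and the example camera geometry are our own illustrative choices, not the exact routine used later in this dissertation.

import numpy as np

def triangulate_two_rays(p1, d1, p2, d2):
    # Each ray is origin p + lambda * direction d, expressed in workspace coordinates W.
    # Returns the midpoint of the shortest connecting segment and the segment length,
    # which serves as a triangulation-error estimate.
    d1 = d1 / np.linalg.norm(d1)
    d2 = d2 / np.linalg.norm(d2)
    w0 = p1 - p2
    a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
    d, e = d1 @ w0, d2 @ w0
    denom = a * c - b * b                 # near zero when the rays are (almost) parallel
    s = (b * e - c * d) / denom           # parameter along ray 1
    t = (a * e - b * d) / denom           # parameter along ray 2
    q1 = p1 + s * d1                      # closest point on ray 1
    q2 = p2 + t * d2                      # closest point on ray 2
    return 0.5 * (q1 + q2), np.linalg.norm(q1 - q2)

# Example: two cameras 10 m apart along X, both sighting a point near (5, 0, 20).
point, gap = triangulate_two_rays(np.array([0.0, 0.0, 0.0]), np.array([5.0, 0.0, 20.0]),
                                  np.array([10.0, 0.0, 0.0]), np.array([-5.0, 0.1, 20.0]))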
2.3 Research projects based on fusion of CV and RFID Various methods for fusing the visual and tag sensing data have been proposed. These revolve around the basic functions of detection, identification, location and tracking. The 35 categories have been made in the context of fusion-oriented applications. Within the context of identification and tracking, the general architecture of tag-based fusion is given in Figure 2.7. Figure 2.7 General architecture of RFID and vision fusion. 36 2.3.1 Model based recognition One of the earliest approaches using model based recognition and RFID is proposed in [53]. The algorithm identifies an object using RFID and then recognizes it in the scene using an appearance model stored in the object tag. The algorithm compares the observed model with the stored model from the tag and recognizes the object if both models match. If the object is not recognized then it is considered to be an occurrence of a new object with no prior appearance model. The system acquires the object model and saves it in the tag. Using edge data, a model is generated and stored. To accumulate many models, Eigen space analysis is used. Eigen space is updated every time an object with model is observed. Only fixed rigid objects were used for the experiments. On the same lines Boukraa and Ando [54] have reported a 3D scene analysis architecture for polyhedral shaped objects. The object identification is performed using 2.54 GHz passive tags with a read range of 1.2 m. The unique tag ID is received from the tagged object. Using that tag ID, the object model is then located in a model database, through a network. The vision system then detects lines and edges and projective matching [55] is used for registration. Therefore, the object recognition task is reduced to registering the object model to the observed image and the recognition part is independent of the number of models in the database. In [54], [56] Boukraa and Ando used their knowledge-based recognition algorithm only for single object scenes with polyhedral shapes; whereas natural scenes are filled with free form objects. Cerrada et al. [57] approached object recognition and localization for free form static objects in complex scenes using fusion. 3D information of the objects is generated using range sensors. For vision only based techniques, recognition and localization are costly computational 37 algorithms due to the uncertainty of the objects in the current complex scene. The fusion approach reduces the original database to a number of objects in the current view. Their scheme presents comparative results with and without RFID. The RFID reader identifies the objects in the tagged environment with a list of read tags in the read range but does not provide location information. The information is fed to the Weighted Cone Curvature [58] stage from which an initial partial view estimate is acquired. Reduction in the original database is achieved by carrying out comparison of principal components and partial views. Finally, for object recognition and localization, the difference between two clouds of points is minimized by the Iterative Closest Point algorithm [51]. Figure 2.8 shows the block diagram of the proposed 3D recognition method. The validity of the object recognition algorithm in [57] was constrained by having always the same number of objects in the scene. The authors further generalized the methodology in [59] by allowing the number of objects in the scene to range from 4 to 20. 
The authors used RFID in the segmentation stage since RFID can easily give the number of objects in the scene along with their description. Their approach deals with the paradigm of object recognition for complex 3D scenes having medium and large databases. The statistical analysis provided in Table 2.4 estimates a linear regression model relating the number of objects in the scene to the recognition time reduction. Recognition percentage increases by only 6% using RFID, but the computation time is reduced tremendously, by 74% on average.

Figure 2.8 3D object recognition algorithm. Recreated from Figures 3 and 5 of Cerrada et al. [57].

Table 2.4 Obstacle recognition results. See Cerrada et al. [59].

No of scenes        No of objects   % Recognition           Total time (sec)
                    in scene        RFID      w/o RFID      RFID      w/o RFID
12                  4               91.7      86.1          2.45      9.44
9                   3               86.1      83.3          2.96      9.19
Average                             89.3      84.9          2.67      9.33
Standard Deviation                  16.3      17.0          0.47      0.3

The CAD model concept is also used by Hontani et al. [60], [61]. Some CAD based models for computer vision are summarized in [62]. In [60] the proposed system uses a combination of visual retro-reflector tags and RFID tags. After obtaining the object ID from the RFID tag, the system gets an initial estimate of pose from the visual tag and then visually tracks objects by a model-based tracking method. The tracking algorithm updates object orientation and position by detecting edge movement in two consecutive frames. For building and learning the models in real time, vision algorithms for model-based recognition require user interaction. In [61] the system identifies the tagged object using RFID. The tagged object CAD model is then retrieved over the internet through a URL server. After determining a shape component in the captured image, the system selects an algorithm for an initial estimate of pose relative to the camera. Thereafter the system starts tracking the object in front of the camera. The tracking is based on the difference between the visible edges and the edges of the projected model. Similarly, for a test-bed having a robotic manipulator arm to clear dishes from a table, the authors in [63] used RFID passive tags placed on static objects to identify and retrieve the object model and information from the database. The reader is placed on the robotic arm. The tag system is used for object recognition, vision for object localization and the robotic arm for object handling. Using pre-stored template images the ceiling-mounted vision system provides location information transformed into robot coordinates. This information is then used to position the robotic arm and execute predefined commands to interact with the objects. Here RFID increased the accuracy and speed of the vision system. The group is also working on calculating the orientation of the object based on received signal strength from the static tags on the objects. In another recent approach, Kim et al. [64] used a robot manipulator system for object recognition. The proposed infrastructure uses self-fabricated smart tags having an active landmark (IRED) and a data structure consisting of geometrical, physical and semantic information. The IRED is activated as soon as the tagged object comes into read range. The robot then searches for the shimmering light pattern produced by the IRED within the scene. Subsequently the object's depth, size and pose are calculated using model-based vision from its stereovision on a pan-tilt mechanism. The manipulator can then interact with the object.
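As a minimal illustration of the pattern shared by the systems in this section, the hypothetical Python sketch below uses the list of read tag IDs to prune the model database before any image matching is attempted; the model store, the match_score function and the 0.6 acceptance threshold are invented placeholders, not details of the cited systems.

# Hypothetical sketch: RFID reads prune the model database before visual matching.
MODEL_DB = {
    "tag_017": {"name": "toolbox",  "model": "toolbox_edges.ply"},
    "tag_042": {"name": "hard_hat", "model": "hard_hat_mesh.ply"},
    # ... one entry per tagged object known to the site database
}

def recognize(image_region, detected_tag_ids, match_score):
    # Compare the image region only against models of objects whose tags were read.
    # match_score(image_region, model_record) -> similarity in [0, 1].
    candidates = [MODEL_DB[t] for t in detected_tag_ids if t in MODEL_DB]
    best = max(candidates, key=lambda m: match_score(image_region, m), default=None)
    if best is not None and match_score(image_region, best) > 0.6:
        return best["name"]
    return None   # otherwise treat the region as a new, unmodeled object

# Example: a constant scorer just to exercise the control flow.
print(recognize(image_region=None, detected_tag_ids=["tag_042"],
                match_score=lambda region, model: 0.9))   # -> "hard_hat"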
2.3.2 Human-object interaction and activity analysis Within successful applications of vision-based approaches, human-object interaction and behavior and activity analysis are broadening. The problem domain ranges from tracking moving humans and objects to ubiquitous learning environments. However, based only on vision and image processing algorithms, the development of reliable solutions is still very difficult. A survey on human action recognition methods using vision is provided by Poppe [65]. The author has discussed limitations of vision state of art techniques. For data acquisition and to identify the human, biometric methods in speech recognition, image processing and computer vision are required. The RFID system, like other sensory devices, can provide such data. Like other fusion techniques in behavior and activity analysis, the key is to get the proper combination that can make better decisions and produce higher classification accuracy. Combinatorial fusion analysis is a growing research domain for analyzing data fusion methods from multiple scoring systems [66]; however, we focus only on fusion of RFID and vision. Hsu et al. [67] have proposed a layout for learning behavior monitoring. All the objects on a desk [books, stationary etc.] are tagged with an RFID reader under the table. A camera is used to monitor behaviors such as 41 away, sleeping, studying a particular subject, or doing homework. The system is designed to help the learner improve his/her learning efficiency compared to a planned schedule. N. Krahnstoever et al. [68] proposed a robust real time tracking system for analyzing human and object interactions and did prototype experimentation on a shelf holding objects with varying size and shape. To get the important actions and interactions between the humans and the tagged objects, the system combines stereo based articulated motion tracking and RFID based tracking. The RFID module provides the presence and orientation information and the vision system tracks the human body parts such as head and hand. The object pose is estimated using tag orientation. The orientation and angle of the tag relative to the antenna can be approximated using received signal strength. With three RFID antennas tag orientation can be determined accurately. The vision system target model is a low dimensional approximation of the human upper body. The authors claim that the system can easily detect which item the user is interacting with, which would be a challenging task for a vision-only system. Also it can recognize that the user is probably reading the label on the item, which would likewise be difficult for an RFIDonly system, since it can only estimate the orientation of items and has limited range. This research adds to the application areas that require user tracking and interactions. Another related contribution in activity recognition is given by Wu et al. [69]. Most of the objects are tagged with the user wearing an RFID bracelet. Their proposed system uses a Dynamic Bayesian Network (DBN) framework for learning object models by modeling the correlation between events presented by the RFID and video data. Without any explicit human supervision, the method automatically acquires object models from video and provides the most 42 likely activity along with common-sense knowledge about which objects are likely to be involved. Additionally, the untagged objects in the vicinity are also identified intuitively. They have used skin color models and do segmentation using change detection. 
For object model representation, the Scale Invariant Feature Transform (SIFT) [70] is used to extract feature descriptors with maximum likelihood estimates for learning. The experimental setup handles 16 different household activities with 33 objects in the tracked environment. The unsupervised approach is used to learn the object models. The learned models are then used to infer the activity and object labels for the same video. Table 2.5 shows the experimental results from [69]. It is atypical to see performance with vision only that is comparable to the outcome from both vision and RFID. This indicates that the Bayesian framework has utilized all the useful information in RFID and, after the object models are learned, RFID is no longer useful. Their system lacks view independence due to a single camera and lacks the learned motion information required for human-object tracking.

Table 2.5 Activity and object recognition rates. See Jianxin et al. [69].

Common sense used   Testing sensors   Activity   Object
Yes                 RFID only         64%        63%
Yes                 RFID+Vision       81%        81%
Yes                 Vision only       72%        73%
No                  RFID+Vision       61%        63%
No                  Vision only       75%        75%

Park and Kautz [71] extended the approach in [69] and addressed the above limitations. The proposed method deals with incorporating human and object models and also building a DBN that recognizes the human-object activities of daily living in a smart home. To obtain view-independent recognition, they used multiple cameras to attain a multi-view vision system. The vision system performs track-level and body-level analysis to attain coarse and fine level recognition. The RFID module, having a hand-worn reader (iBracelet from Intel) and tags, is used for learning temporal segmentation of motion and for object identification. The detection range of the iBracelet is 10 to 15 cm. As the person's hand approaches a tagged object, the reader detects the tag and wirelessly transmits the time-stamped ID information to the PC-based activity recognition system. To generate the activity model, six coarse-level actions are coded for investigation. Reported activity recognition results show that different sensing modules better indicate different activities, i.e., activities such as "walking around" are better recognized with vision, while activities such as "preparing cereal, drinking water" are better recognized by RFID. Continuing their work on human-object interaction in [17], [72], Deyle et al. [73] have recently presented an approach for constructing RF signal strength images from RFID to be used as a distinct sensing modality. Low cost UHF tags are used in their system. By measuring the signal strength at each bearing, RSSI images are generated by panning and tilting the readers. RSSI images provide ID-specific features of the object in range. The RSSI image and the 3D point laser range data are transformed into the 2D camera images. By using a probabilistic framework, these RSSI images are fused with visual and laser data to generate a maximum likelihood 3D point estimate of the tagged object. This information is in turn used by the autonomous mobile manipulator to approach the identified object. The interaction with an object is then remotely specified by the user using the context-aware remote user interface. Over several test trials, the algorithm was validated in an unobstructed indoor scene, achieving reliable localization. Nemmaluri et al. [74] present a system named Sherlock to automatically recognize, locate and index tagged objects.
This system is built on their previous work with Ferret in [75]. The primary difference is that Ferret used hand-held readers, whereas the Sherlock infrastructure has fixed readers capable of controlling their own movement using steerable antennas and cameras. Sherlock scans in three different ways, i.e., fast, coarse and localize. While providing little position information, the fast scan determines all objects in the environment and provides regions. The coarse scan gives a rough estimate of the objects present in a particular region, while the localize scan provides position information for a desired individual object. The core component of the system is the fine-grain RFID localization subsystem that can precisely recognize and locate objects. The overall cost of the equipment is high. The RFID reader used can control four independent antennas. In addition, the system has an integrated pan-tilt-zoom camera mounted at a fixed location. It provides an interface for query and a visual system for user interaction and display. This system is used to help people interact with their movable belongings in a realistic office environment. Sherlock can localize 90% of objects to within a volume of 0.55 m³ and can localize 100 objects having passive tags in approximately 12 minutes. Antenna movement determines the worst case scan time. Tracking humans in cluttered scenes by a mobile robot also requires effective interaction with the surrounding world. Some literature suggests utilizing multiple onboard sensors or visual measurements. In [76] Germa et al. provide adequate initial results for human tracking using a mobile robot. They have modified a mobile platform with additional onboard capabilities such as a monocular pan-tilt camera, WiFi, a gyroscope and an RFID reader with eight antennas for omni-directionality. Their multi-modal tracking algorithm applies a particle filter for the heterogeneous data fusion in a stochastic framework. They have utilized vision to provide closed loop position control for a robot end effector by designing three vision-based PID controllers. This control strategy is helpful in providing the required feedback control during visual information loss. The proposed infrastructure also roughly estimates collision detection in high-risk areas. Validation of the whole infrastructure is performed in an indoor environment with sporadic occlusions when tracking a tagged individual. As an extension to this work, the authors suggest future research on multi-person tracking with better collision avoidance while focusing on overcrowded scenes and coarse measurements by RFID-based distance evaluation. To check the system's robustness to occlusions and target loss, the authors considered the ratio of the frames in which the user is in view to the total number of frames. When tracking one to four persons, this ratio decreased from 0.22 to 0.19 with a vision-only system, as compared with 0.93 to 0.85 in the fused system, thereby demonstrating the effectiveness of fusion. Blind people make use of tactile and haptic perception (the process of recognizing objects through touch and kinematics). The blind-assistance devices available in the market require strategies for efficient understanding of the unseen environment. T. McDaniel et al. [77] have suggested a framework for integrating RFID and computer vision in enabling devices for remote object perception. Seeking a wearable device, they used vision and touch features, which can be classified at the perceptual level.
The vision module provides users with only relevant information found through RFID identification. Their algorithm efficiently deals with RFID data 46 rate overload by particular content selection using vision and applies to untagged environments as well by gathering tactile features (shape, size, texture, material, etc.) from visual data. As future work the authors have suggested experimentation for usability of the proposed system in a real environment. 2.3.3 Mobile robot localization For a mobile robot to accurately obtain a target pose it needs guided input from the navigation module, which in turn depends on a localization algorithm. The robot actual position and target position varies due to problems such as wheel slippage. These errors accumulate over time. Dead reckoning alone is inadequate in this scenario and additional sensory input is needed. Multi mechanical sensors, RFID, and vision techniques have been proposed to solve this problem. Some fusion based methods are discussed below. 2.3.3.1 Map- based navigation Chae et al. [78] proposed a global to fine localization algorithm for a mobile robot in an indoor environment. They used 915 MHz active RFID tags with 6 m detection range as landmarks for achieving global localization. A mobile robot with camera onboard was used. The mobile robot movement area is divided into tagged regions and a visual map is built with known position of each RFID tag. For global localization, the algorithm assigns appropriate weights to each of the detected tags depending upon the distance of the found tag from the boundary of the region. To detect and describe local features in images, the authors have used SIFT features [70] 47 due to their stability against pose and lighting variation. This feature descriptor is used in a specific region to fine tune localization of the mobile robot, and its comparison with the visual map gives the current view angle of the robot. Given the tagged surrounding and the feature descriptors, the robot localization problem is narrowed down to feature-matching. In a work space of 6.2  7.8 m mean localization error of 0.23 m is reported. Weiguo et al. [79] have used RFID tags placed on the ceiling with a camera onboard the mobile platform to get relative positioning. RFID tags are used as visual landmarks. A topological map of indoor surroundings is built using adjacency relationships of the tags in the surroundings. The camera distance from the ceiling is kept constant thereby simplifying relative position and orientation calculation. The mobile platform traveling in the center uses direction and heading angle information from the identified node in the field of view of the camera as shown in Figure 2.9. The path planning is then post-processed as per the current heading and direction. Another object localization scheme for a mobile robot in a home environment using ceiling cameras is reported by Kamol et al. in [80]. The feature information for each object is also stored in each tag. The algorithm uses RFID to get rough location of the tagged object. For a precise estimate, the system recognizes the object features using a hue-color histogram with a subsequent location estimate using a particle filter. 48 Figure 2.9 Map based navigation scheme. Recreated from Figures 1 and 3 of Weiguo et al. [79]. 2.3.3.2 Obstacle recognition S. Jia et al. [81] built a mobile cart, with RFID reader and a Bumblebee stereo vision camera. Passive tags are placed on the objects and the path. 
Obstacle detection and avoidance is handled by the RFID module. The presence of an identified object is further validated probabilistically using Bayes Rule. The obstacle/object direction with respect to the robot trajectory and the probability of the map in which that tag may exist are updated. The center position of the maximum probability is considered to be the position of the tagged object. Experimental results achieved coarse object localization of 0.26 m² with RFID alone. The stereovision cameras are used to fine tune the localization results. The camera platform recognizes the tagged objects as landmarks and gets pose estimates with the help of object (obstacle) tag information such as size, color and shape. Once the pose of the tagged object is identified, the robot determines the avoidance route. As a continuation of the ongoing work [82], the localization of obstacles was further enhanced by using three RFID antennas instead of one. The research has been extended by the same group [83], [84] for human recognition, in which they have used the same infrastructure along with multiple RFID antennas. Table 2.6 shows some of the results from experiments with different settings.

Table 2.6 Obstacle localization with different antenna settings. Data extracted from Songmin et al. [82].

Antenna configuration               Actual position of obstacle (cm)     Simulation results (cm)
Using 1 antenna                     (220, 400) (280, 400) (360, 400)     (268, 320) (296, 276) (316, 296)
Using 3 antennas in parallel        (240, 400) (300, 400) (340, 400)     (284, 304) (300, 280) (320, 284)
Using 3 antennas with 45° setting   (160, 400) (320, 400) (360, 400)     (180, 400) (324, 384) (365, 404)

2.3.4 Miscellaneous
Tracking in the domain of augmented and virtual reality has been researched for some time. The tracking devices are built-in components of virtual reality systems. Tracking with the camera in virtual reality incorporates sensor-based or vision-based techniques. Gear such as head mounts is used for tracking under prepared, calibrated environments. Such tracking gear often uses active sensor-based solutions, including electromagnetic, acoustic, optical, radio, mechanical and inertial systems. However, due to certain issues such as wired power, jittering, and computational complexity, they sometimes do not provide viable solutions. In vision-based tracking, the camera orientation and location are tracked using pairwise fundamental matrices, fiducial markers and feature points. However, limitations such as high data rate, computational complexity and feature extraction make the use of passive devices very complex. Thus, implementation of certain processes such as color keying or chroma key compositing becomes very challenging. Color keying generally involves segmenting objects from the background by using color cues, which is challenging for vision-only modules. A common example is the meteorologist presenting weather updates with background weather clips. The utility of RFID is being considered by some researchers in virtual environments to supplement some of the application-specific vision limitations. Po et al. [85] proposed an RFID passive tag infrastructure for the 2D camera tracking problem in a virtual studio environment. Passive RFID tags were distributed randomly over the virtual studio area. An algorithm reads each RFID tag and then calculates the orientation and velocity of the camera. The scheme reduces camera position estimation error by comparing the actual position of the camera and the estimated camera position obtained using RFID.
Therefore estimation error is directly related to the distances between RFID tags. As a validation platform, simulations have been performed in Maya 3D view to generate an avatar. For the experiments, avatar position and camera position are known. By analyzing the experimental images, the authors suggest that while distributing RFID tags, a distance between tags over 12 cm is not suitable for a virtual studio environment. Also, small tag distance of 5 cm to 10 cm is difficult for identification by the human eye. Moreover, use of a triangular tag distribution pattern reduces camera position error. 51 The amount of visual information available has been rapidly increasing across various fields, especially in video surveillance applications. There has been research in this area so that video information can be accessed more efficiently by retrieving important segments or highlights from lengthy video. This demands summarization of a large amount of visual data. Generating efficient video digests requires detecting interesting video portions, merging them into a digest and ignoring mundane parts. This research area can produce dependable outcomes by using fusion techniques. In [86] the digest generation method for Kindergarten surveillance uses a nonparametric approach to structure location information from the RFID channel and visual features from a video feed. (Parents are able to view what their children are doing during the day.) Essential video chunks are kept while discarding other portions. The technique computes pose estimation and object detection using background subtraction and motion information using inter frame differencing. Video is divided into different segments forming clusters. As the cluster members are continuous temporally, each cluster is coarsely treated as a single event. The experimental setup consists of two cameras and RFID system with active tags placed in the pockets of students. 63 hrs of video with a resolution of 320  240 resulted in a digest of 20 mins with a processing time of 2 hrs. 2.3.5 Natural/outdoor site management Resource and personnel tracking is a critical requirement in settings such as construction areas, hospitals and airports. This task is difficult due to a large number of sporadic interactions of objects and persons. Occlusions make tracking problems challenging. Radio frequency based tracking technologies have emerged as promising solutions in the market, including GPS, RFID, 52 Bluetooth and Wireless fidelity (Wi-Fi, Ultra-Wideband, etc). Several outdoor tracking and management approaches have been reported that use RFID and computer vision as separate entities. RFID alone is mainly used on sites for asset management [87], while CV has been used for personnel tracking [88]. The fusion of these technologies on sites and in similar applications such as airports [tagged boarding passes] and hospitals [tagged patient bracelets] should improve safety and security. The effectiveness of RFID on construction sites is examined in [89]. The authors checked performance of different tags in the lab and on site. The important findings of the paper show that passive tags do not perform well at long distances though they can be used for tracking tool loss and theft. On the other hand, active tags had 100% read accuracy at any tag orientation with distances of 25 ft or less. These results prove applicability of the fusion approach in an outdoor environment such as a construction site. 
Airport security and construction site safety and management require understanding the interaction of men, machines, materials, and terrain - other applications have similar requirements. For safety purposes the trajectory for desired objects on site should be updated in real time. We have presented a method to examine the effect of partial object information, via RFID or special visual features, on the performance of object tracking, while solving the trajectory point correspondence problem in 3D space [90], [91]. In the initial work, the RFID feed is simulated and is used for reliable identification and locating target objects at coarse spatial resolution. Vision is used to provide finer spatial resolution for identified tagged objects. We extend geometry-based tracking so that intermittent information on object ID with location can be used in computing the overall quality of a set of paths of N objects over T time steps. We show that 53 partial object information can both reduce computation time and increase the likelihood of producing correct trajectories. Location sensing based on GPS and local beacons is currently used in the outdoor environment for precision agriculture in farm management [24]. Highly precise results of one inch year-to-year and pass-to-pass accuracy also depend on real time kinematics. Such a real time location system relates to our discussion of natural outdoor environments. The ability to manage a large farm and register points of land to all kinds of maps and aerial images is analogous to managing and tracking a work site via real time RFID and CV tracking while recording state information in a database that supports other dynamic analysis. The GRASP Lab from University of Pennsylvania and their colleagues at MIT reported research work on cooperative manipulation and transportation with multiple flying quad-rotors [92]. The quad-rotors perform a number of maneuvers with less than three inches of clearance on all sides. The angular velocity of the quad-rotors is measured with onboard Inertial Measurement Units (IMUs). For the dynamics and control they have used Vicon [93] motion capturing technology. Each quad-rotor is affixed with four passive optical markers that are tracked by multiple infrared cameras, which in turn gives the 3D position in a track volume of 5  5  5 m. The group aims to have computer driven UAV flights that can be used for search and rescue in emergency situations such as earthquakes and fires. The published results have been reported in a lab environment and as future work they plan to validate these in natural outdoor settings. 54 _____________________________________________________________________________________________ CHAPTER 3 Proposed solution and research methodology _____________________________________________________________________________________________ The introduction of fusion in the previous chapters showed that the problems of localization and tracking can be solved. Radio Frequency Identification (RFID) can be used to reliably identify target objects and can even locate targets at coarse spatial resolution, while CV provides fuzzy features for target ID at finer resolution. Our parameterization focuses on the site safety environment. We assume the agents are mostly cooperative and tagged, wearing distinctive clothing, and that 3D survey data of the test site exists. Fusion provides a method to simplify the correspondence problem in 3D space. 
A Site Safety System (S-3) can query for unique object ID as well as tag ID information, such as target height, texture, shape and color, which can greatly enhance scene analysis. We extend geometry-based tracking so that intermittent information on ID and location can be used in determining a set of trajectories of N targets over T time steps. Our model provides a design for stages of future improvements. The first section of this chapter formulates the problem and discusses the necessary steps. Next, an introduction to the sensor infrastructure used is provided. Finally, the goals and possible research issues are explained.
The research problem is to detect, identify, locate and generate real-time tracks of N objects moving within a known 3D workspace within a global view. Observations from diverse sensors are combined into object locations, and possible IDs, at discrete time steps, which must be aggregated into N trajectories. Motion analysis will be triggered by daemons that monitor conditions in the data – e.g. nearness of certain objects. We abstract the information structures in order to support a system with diverse information sources and constraints and processes that may not have knowledge of each other. Without loss of generality, we sometimes ground our discussion using the site safety application. In order to study the global tracking problem and to provide a solution that is independent of a specific application, we abstract the problem as follows.
3.1 Tokens code observations from images and RFID
Consider a database that is to be built from observations from RFID readers and/or a sensor network infrastructure together with networked stereo vision sensors. Sensor observations and their combination yield tokens <x, y, z, t, L, v>, each recording that an object with ID (name) L and feature vector v is at location (x, y, z) at time t. Some tokens will have incomplete or partial information: for example, ID L may be absent from CV observations and visual features may be absent from RFID observations. 3D coordinates may be absent for an observation from a single camera image or a single RFID reader. Two or more of these tokens can be combined in the processing to get refined 3D coordinates. To keep the model simple at this point, we treat measurement accuracy and confidence values in a general heuristic manner and not as a component of a token. Higher level motion analysis will use this data and be triggered by daemons that monitor conditions in the data – e.g. nearness of objects of class C1 and class C2. Higher level activity analysis is thus based on the real-time object track data.
3.2 Object tracks {<x, y, z, t, L, v>}
The Site Safety System (S-3) needs to identify and locate all significant objects in the workspace within a few frames k of real time observation. S-3 may know L = f(x, y, z, t) from sensor subsystems that use RFID or visual features. When such information is unavailable, the system can use "tracking" to determine L = f(x, y, z, t) using prior records {<x, y, z, t, L, v>}, or perhaps even forward records. If object ID L is known, other object features w = f(L) may be available from an RFID tag, such as object mass, or even a CAD model. Finally, we note that if sensors supply object speed or acceleration we consider these as components of v along with the color, texture, elongation, etc. of its image. An object track is k or more tokens in time sequence with consistent object ID and features that also satisfy constraints for motion in space.
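A minimal data-structure sketch of the token and track just defined is given below in Python; the field names, the optional (None) entries for missing information and the use of a label list to record ambiguity are our own illustrative choices, not a prescribed implementation.

from dataclasses import dataclass, field
from typing import Optional, List, Tuple

@dataclass
class Token:
    # One observation: an object with (possibly unknown) ID L and features v
    # at location (x, y, z) at time t.  Missing entries are None.
    t: float
    x: Optional[float] = None
    y: Optional[float] = None
    z: Optional[float] = None
    L: Optional[str] = None                 # object ID, e.g. an RFID tag number
    v: Tuple[float, ...] = ()               # visual features: color, texture, speed, ...

@dataclass
class Track:
    labels: List[str]                       # ambiguity: more than one label may remain
    tokens: List[Token] = field(default_factory=list)   # k or more tokens in time order

# A CV-only detection (no ID) and an RFID-only detection (no visual features).
cv_obs   = Token(t=12.0, x=3.1, y=7.4, z=0.9, v=(0.2, 0.7))
rfid_obs = Token(t=12.0, x=3.0, y=7.0, L="tag_017")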
Tracking is an important concern of this dissertation, and is a low level of motion understanding that uses naïve physics to aggregate observations over time. Heuristics from naïve physics enable aggregation of individual tokens into a sequence or track, one for each moving object. As objects move through the workspace they may be occluded at any instant from either cameras or RFID readers so there may not be multiple tokens to fuse. Smoothness constraints, or motion applied over multiple time steps can be used to interpolate. 57 As we will see, it is not possible to assign unique object IDs to every token at every time instance. Consider, for example, the popular shell game where a bean is placed under one of three shells that look alike [94]. When the shells are shuffled quickly in space, most people cannot track the shell containing the bean. If the shells are of distinct colors, then the problem of picking the final shell is easy. If the shells are identical in appearance, but the bean is an RFID tag, RFID readers are unlikely to be able to distinguish the tagged shell in space when the shells are close to each other. Consider three workers with hard hats each with a tag and close together; if the hats are the same color, Real Time Location System (RTLS) cannot distinguish them due to read accuracy; if we know which colors contain which tags and the hats are of different colors, the system can solve the matching problem and locate each hat within the CV distance error. In order to model ambiguity, we will have to allow multiple labels L in the tokens of an object track: these labels record the ambiguity of ID at this point in time and space. 3.3 Obtaining 3D object location (x, y, z) One fundamental sensing concept is that a sensor observes an object along a ray in the 3D space and all sensors are calibrated to the same 3D workspace. If we model sensor error, the object lies in a cone formed by projecting the error at the [2D] sensor into 3D as shown in Figure 3.1(a). Locating an object in 3D space is done by intersecting two (or more) rays [or error cones/lobes]. See [31] Ch 13. This can possibly be done by using two cameras as in the standard stereo solution or a structured light solution, or two RFID readers, or one RFID reader and one camera. Figure 3.1(b) shows the error volume in gray where the RFID reader directional antenna lobe intersects the camera error cone. 58 Ray RF lobe (a) (b) Figure 3.1 Sensor error volumes: (a) rays intersection with error cones (b) intersection of error cone with RF lobe. The sensor fusion algorithm we use computes the shortest line segment connecting two rays [95]. Given a site survey, it is simple to intersect a ray with the ground surface or with a lofted ground surface if we know the object height. The underlying geometry is angle-side-angle, where the side is the known 3D baseline between the two sensors. A second fundamental sensing concept is where the sensor observes an object at some distance d. If the object transmission is observed by three such sensors it can be 59 located by trilateration, intersection of three spheres with radii equal to the sensed distances. An object can also be located by intersection of the ray/cone determined by an image observation and the spherical shell determined by distance d sensed by a single RFID reader. The commercial RTLS system encapsulates multiple RFID readers and yields a token with unique object ID and (x, y) coordinates on the ground plane of the workspace. 
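To illustrate how such a combination of range readings can yield ground-plane coordinates, the sketch below linearizes the trilateration equations (subtracting the first circle equation from the others so the quadratic terms cancel) and solves the resulting system in the least-squares sense; the reader positions and ranges in the example are invented numbers for a 40 × 40 m cell, not measurements from the RTLS.

import numpy as np

def trilaterate_xy(readers, ranges):
    # Estimate (x, y) on the ground plane from three or more known reader positions
    # and the sensed distances to a tag.
    readers = np.asarray(readers, dtype=float)
    ranges = np.asarray(ranges, dtype=float)
    c0, r0 = readers[0], ranges[0]
    # Subtract the first circle equation from the others: 2*(ci - c0) . x = b_i
    A = 2.0 * (readers[1:] - c0)
    b = (r0**2 - ranges[1:]**2
         + np.sum(readers[1:]**2, axis=1) - np.sum(c0**2))
    xy, *_ = np.linalg.lstsq(A, b, rcond=None)
    return xy

# Invented 40 x 40 m cell with readers at three corners; true tag position (12, 25).
readers = [(0.0, 0.0), (40.0, 0.0), (0.0, 40.0)]
true = np.array([12.0, 25.0])
ranges = [np.linalg.norm(true - np.array(r)) for r in readers]
print(trilaterate_xy(readers, ranges))   # ~[12. 25.]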
The principle is similar if there is an encapsulated stereo vision system. Finally, fusion by ray intersection can alleviate the stereo correspondence problem since ID and features may be available from RFID-tagged objects.
3.4 Heuristics from naïve physics
Tracking is a lower level of motion understanding that uses naïve physics to aggregate observations of N objects moving over T time frames. Heuristics from naïve physics enable aggregation of individual observations into a sequence or track, one for each moving object. Naïve physics constraints are used to filter out unlikely labels for objects at time t based on the recent history of objects continuing from the k previous time steps. Our goal is to create a smart tracking algorithm based on the heuristics above, which will provide the means for safer activities and more efficient site management. Examples are as follows:
a. An object n must be at one and only one place at time t.
b. A location (x, y, z) can accommodate at most one object at time t.
c. Object n is likely to have consistent form and visual features.
d. Observations of object n must be consistent with its identity, if known.
e. The motion of object n is likely to have smooth direction.
f. The motion of object n is likely to have smooth velocity [makes the problem more complex].
g. Constraints e and f are likely to be violated only when object n is in close proximity to another object m.
h. Known objects are likely to move in a known terrain in predictable ways.
j. Some objects are known at some locations and time instants.
k. Objects do not enter or exit the workspace [our assumption].
l. Noise may be added to input trajectory points during stereo calculation.
These constraints are an extension of those used by Sethi-Jain [96] and Veenman et al. [97] and, unfortunately, none are hard constraints. For example, it may be that constraint (b) is violated as one object "consumes" another. Perhaps a machine consumes a worker, which S-3 should prevent! Perhaps a driver enters a vehicle: should S-3 prevent this?
3.5 Fusion platform
We define fusion as the combination of different sensor tokens to obtain tokens containing information from the different sensors or with new information computed from the tokens from the different sensors. Most importantly, RFID and CV tokens will be fused to combine object ID with object features and to provide or to refine object location.
3.5.1 Labeling with relational constraints
To manage the complexity of the diverse information being fused and to provide a flexible experimental platform, we propose discrete relaxation to create the tracks of the N objects and to update the time tokens comprising each track. Using relaxation, different sensors and sources of information can be turned on or off for experimentation or for practical reasons at a site. Fusion by relaxation is sketched as follows; the fusion processes operate on a blackboard containing the set of tokens.
/* discrete relaxation labeling for objects 1...N moving over time */
a. Calibrate sensors to the 3D site.
b. Initialize representation for N objects × T time steps × N labels.
c. For all time steps t = 0 ... T:
   1. Run sensor processes to create [partial] tokens for detections.
   2. Run combination processes to merge and complete tokens.
   3. Run processes to eliminate impossible labels for object k at time t.
   4. Run tracking process to apply constraints and remove unlikely labels.
   5. Daemon processes possibly invoke higher-level analysis processes.
d. Output object tracks as an N × T label matrix.
e. Store object tracks for further analysis.
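A minimal Python sketch of steps c.3 and c.4 of the loop above is given below; for every detection at time t it keeps only the labels that survive a simple reachability test derived from constraints (e) and (f) and the uniqueness rule of constraints (a) and (b). The data layout and the v_max speed bound are our own illustrative assumptions, not the full S-3 implementation.

import numpy as np

def relax_labels(detections, n_objects, v_max, dt):
    # detections[t] is a list of (position, label_set) pairs; an empty label_set
    # means the sensor gave no ID (e.g. a CV-only detection).
    # Returns, per time step, the list of (position, surviving label set) pairs.
    all_labels = set(range(n_objects))
    history = []
    for obs in detections:
        step = []
        for pos, labels in obs:
            cand = set(labels) if labels else set(all_labels)
            if history:
                # Constraints e/f: keep only labels that were possible at a previous
                # detection within reach of pos given the speed bound v_max.
                reachable = {L for (p_prev, ls_prev) in history[-1]
                             for L in ls_prev
                             if np.linalg.norm(np.asarray(pos) - np.asarray(p_prev))
                                <= v_max * dt}
                cand &= reachable
            step.append((pos, cand))
        # Constraints a/b: a label held unambiguously by one detection is removed
        # from every other detection at the same time step.
        unique = {next(iter(ls)) for _, ls in step if len(ls) == 1}
        step = [(p, ls if len(ls) == 1 else ls - unique) for p, ls in step]
        history.append(step)
    return history

# Two objects; RFID supplies IDs only at t = 0, yet both labels stay resolved at t = 1.
dets = [[((0.0, 0.0, 0.0), {0}), ((5.0, 0.0, 0.0), {1})],
        [((0.5, 0.0, 0.0), set()), ((5.5, 0.0, 0.0), set())]]
print(relax_labels(dets, n_objects=2, v_max=2.0, dt=1.0))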
3.6 Sensor arrangement
The sensor setup includes networked static cameras for visual coverage of the site. We have also considered installing cameras on the moving targets, which in turn will be helpful for looming detection. The details of preliminary looming experiments are discussed in Appendix A. The RFID readers are placed in known locations and most of the objects and personnel in the workspace are considered to be tagged. A GPS feed can also be used for validation. Figure 3.2 shows the Site Safety System (S-3) block diagram. The sections below explain the basic sensor infrastructure in detail.
Figure 3.2 Block diagram of the Site Safety System using fusion of RFID and CV.
3.6.1 Vision infrastructure
Our system proposes the use of static cameras. Rather than costly cameras, these can be commercially available commodity cameras of good resolution. The static cameras will be positioned in stereo pairs at fixed places to provide a global three dimensional field of view (FOV) of the work space. For looming detection, moving objects and personnel are proposed to be equipped with wireless cameras for local operation. The site area is covered using a network of static cameras. The distance of the cameras from the tracking area is governed by factors such as camera focal length, frame resolution and the desired size of the moving target in images. Using low cost fixed focus equipment, each camera can be positioned up to an approximate distance of about 100 ft (30.5 m) from the site. In a general sense, with a frame resolution of 800 × 600 pixels the minimal desired target size is approximately 50 × 30 pixels. This target size provides sufficient information (such as distinctive clothing, e.g. colorful hats, green and orange safety jackets, etc.) for real-time video tracking. The scene extracted from [98] in Figure 3.3 shows bounding boxes on some of the construction workers to give an idea of the minimal aspect ratio required of the moving objects with respect to an 800 × 600 pixel field of view. The image in Figure 3.3 also illustrates the construction scenario for which the proposed fusion technique is well suited. The static cameras can be used to process stereo tracking of the moving objects whose presence in the view is also validated by the RFID feed. The preliminary experiments done on stereo tracking are explained in Chapter 6. Looming object detection can be sensed using the local dynamic cameras with motion detection techniques such as optical flow, background subtraction and frame differencing. We have performed some preliminary tests on looming detection to study its feasibility; see Appendix A. The 3D survey data of the site is given as input for processing structural stereo. The overall system will monitor the safety of multiple tracked targets and will generate a proximity warning about a possible collision threat to the tracked object; for example, a worker on foot or a vehicle backing up. It is known that processing is more complicated when the camera is on a moving platform, where platform motion will cause optical flow even in the background. This flow can be cancelled out, since object motion, direction and velocity information can be accessed from the time-stamped trajectories; however, doing so is computationally expensive in real time.
Figure 3.3 Construction scene example [98] with aspect ratio of persons and the field of view.
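The 50 × 30 pixel target figure above follows from simple pinhole-projection arithmetic, sketched below; the focal length, sensor width and worker dimensions in the example are illustrative assumptions, not the parameters of the cameras used in our experiments.

def target_size_in_pixels(obj_height_m, obj_width_m, distance_m,
                          focal_mm, sensor_width_mm, image_width_px):
    # Pinhole projection: size on the sensor (mm) = focal length * object size / distance;
    # dividing by the pixel pitch (mm per pixel) converts this to pixels.
    pixel_pitch_mm = sensor_width_mm / image_width_px
    h_px = (focal_mm * obj_height_m / distance_m) / pixel_pitch_mm
    w_px = (focal_mm * obj_width_m / distance_m) / pixel_pitch_mm
    return h_px, w_px

# Illustrative numbers: a 1.8 m x 0.6 m worker seen 30 m away with a 6 mm lens,
# a 4.8 mm wide sensor and an 800 pixel wide frame.
print(target_size_in_pixels(1.8, 0.6, 30.0, 6.0, 4.8, 800))   # ~ (60, 20) pixels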
65 3.6.2 RFID infrastructure Although GPS is widely used today for personal and commercial outdoor applications in open areas, it does not perform satisfactorily in indoor areas. Also it is not cost effective to equip every moving target with a GPS device. Since RFID works on wireless protocols, identification of the tagged object to be tracked can be conveyed to the visual feed for validation. The RFID location information, however coarse in nature, can supplement the visual info. RFID localization can be achieved using schemes such as lateration with distance estimation. The scene analysis can be enhanced with the deployment of extra reference tags, however, with RFID alone; target location in real time will not be as accurate as when using cameras. Refer to Section 2.1.5 for highlights on RFID positional accuracy. Each RFID localization approach and equipment has its own strengths and weaknesses. Most targets to be tracked are considered to be equipped with an active tag, if possible. Keeping in view the outdoor dynamics the optimum locations of the readers can be analyzed by performing different trials. Also using readers with different read ranges will provide interesting results [99]. By properly placing the readers in known locations, the whole region can be divided into number of sub-regions called cells, where each sub-region can be uniquely identified by the subset of readers that cover that cell. Given an RFID tag, based on the subset of readers that can detect it, the system should be able to associate that tag with a known sub-region. The accuracy of this approach depends upon the number of readers, the placement of these readers, and the range and power level of each reader. In order to increase accuracy without placing more readers, the system might use extra fixed location reference tags to help location calibration. 66 These reference tags serve as reference points in the system (like landmarks in our daily life). This approach shown in Figure 3.4 helps offset many environmental factors that contribute to the variations in detected range because the reference tags are subject to the same effect in the environment as the tags to be located. In our experiments we have used and evaluated the commercially available RFID-based Real Time Location System (RTLS) from CSL (Convergence Systems Ltd.) [8] for its accuracy and reliability of object detection and location. The RFID based RTLS development kit used has six readers (one master and five slaves). They can be used in different settings to form a cell. The system uses time-of-arrival (TOA) concept where the distance between the tag and the readers is calculated by the roundtrip time. Figure 3.4 Schematic of dense placement of reference RFID tags. 67 The tags communicate with the readers using time-division-multiplexing (TDMA). Each o o reader has a beam width of 80 in portrait and 30 in landscape orientation. To perform location sensing each reader has to be pointed towards the center of the test site. To cover a larger area more readers can be installed thereby generating more cells to enhance the location accuracy in difficult configurations. The tags used are 2.4 GHz active tags with up to 200 m of read range in an open outdoor space. The tags run on 3  AAA batteries and are approximately of the size of an iphone 4S and weigh less than the phone with batteries installed. Figure 3.5 shows the readers and the tags used in the RFID RTLS system. 
(a) (b) Figure 3.5 CSL RFID based Real Time Location System: (a) active tag (b) master reader. 68 3.7 Goals and related research parameters Using fused information in the multi-object tracking scenario, the approach envisions evolving according to the general steps and related research parameters discussed below. 3.7.1 Calibrating the cameras The static cameras are to be attached at fixed positions in the work space so as to provide the global three dimensional Field of View (FOV). They will be used to provide stereo tracking of the moving objects. The cameras are calibrated using the affine calibration method. A 3D global workspace coordinate system [X,Y,Z] is created. Place some fixed visual markers or identify structural landmarks in the scene to provide calibration points. The system can have apriori survey information about the 3D world coordinates of these points. Synchronize these cameras in time and space. Finally a transformation matrix for each fixed camera is obtained for stereo calculation. Details of the calibration process is provided in Chapter 4 and Appendix B. 3.7.2 Defining the ground truth Ground truth data is required to evaluate the system performance. To define ground truth measurements a mesh can be created which represents the surface of the ground upon which work will be done. Modern surveying instruments can be used for this task. In a lab environment this can be done by projecting a grid through a projector and later recording the surface detail. The ground can contain some fixed visual markers so that cameras can monitor and validate their 69 movement due to any undesirable factors, such as wind or vibration. Also the mobile objects are known to move on the ground, which provides constraints on their location. This helps formulate ground truth data by moving objects on predefined paths. 3.7.3 Structural stereo approach Solving correspondence and camera calibration in stereo are key issues. This requires autonomous relative camera orientation and stereo matching that uses the relationship between the correspondence problem and the camera pose estimation problem. Each camera needs to be calibrated to the workspace to obtain its camera matrix. Each camera should be able to see some moving objects A, B, C and a, b, c, etc. Object matching can be done using color blob detection and/or object ID. The color detection performance can also be analyzed in different lighting conditions. In case of visual occlusion some objects might be identified by RFID alone which can help in the matching process. For each object region in the image of camera 1, compatible matching regions are to be found in the image of camera 2. The camera matrices obtained from the calibration process will be used to compute 3D location [x,y,z] for each possible object match. The consistency of the object can be checked with the ground plane. One to one mapping should be created for each object in camera 1 and camera 2. IDs provided by the RFID system can be used to make mapping more efficient. Stereo using a single camera and a single RFID reader can also be examined. To analyze and evaluate the accuracy of the stereo system, it needs to be established how accurate can this matching be using visual and/or RFID information and how accurate motion trajectories are obtained after comparison with the ground truth. 70 3.7.4 Object sensing using multiple RFID readers For the tagged objects, detection and spatial accuracy is to be analyzed in RFID sensing mode. 
The object detection and read accuracy will vary in different settings. Similarly the object location accuracy will change if the object is directly visible to more readers or otherwise or the readers configuration gets changed. 3.7.5 Integration of multiple CV sensors To cover the view of the workspace from all directions multiple cameras can be networked together. However, selecting the right cameras and time slices is difficult. The ID of each object by RFID can help choose the right camera for further calculations. 3.7.6 Fusion of RFID and cameras The fusion algorithm is the back bone of the S-3 system. Also the RFID and vision data being diverse in nature needs to be transformed in an appropriate format so that it can be compared and fused into refined tokens. All the sensors are to be calibrated to the same coordinate system. Synchronization within cameras or between cameras and RFID will be a complicated task. The number of RFID readers required to fully cover the desired area and their distribution needs to be worked out. Provision of any additional information through the database that can be used by daemon processes will help reduce runtime complexity. 71 3.7.7 Smoothness of trajectories The dynamic scene can contain multiple independently moving objects. The objects are permanent; except for new objects entering or old objects leaving the workspace. Track initiation and termination are to be carefully examined as it might be that a tracked object reappeared after occlusion for a short time or a new object has entered the workspace. Smoothness is a global operation which can help high level processes to define and/or correct computed object tracks. Smoothness of trajectories requires a burst of time frames for reliable results. Defining appropriate naive physics constraints will help identify correct objects. It is safe to assume here that typically objects move along the surveyed ground plane. 3.7.8 Looming detection Motion and looming detection will apply toward the cameras placed on the head gear and moving vehicles. An optical flow algorithm can be used to detect the motion and looming phenomenon. Lucas-Kanade and Horn-Schunck are widely used methods for optical flow estimation. If the object is not fully available in the Field of View (FOV) then RFID input can help. Motion vectors can be extracted (possibly 3D) for one or more moving objects in the FOV. Looming object may or may not be a “smart object”. It can be analyzed how accurately the distance from smart object to sensing object (camera platform) can be computed. Depth measurement accuracy and motion measured parallel to the image plane can also be studied. 72 3.7.9 Object Inventory The work safety system monitors 6-tuples in real-time and outputs safety controls. Other applications can analyze these attributes offline for work efficiency, materials inventory and person tracking, etc. 73 _____________________________________________________________________________________________ CHAPTER 4 Localization of objects and scene points _____________________________________________________________________________________________ Localization involves computing object location with respect to some external frame using sensory data. Pose is an umbrella term that defines object location and orientation relative to the global reference frame. Visual image data has embedded information such as color, texture, and shape etc. which can be used to ascertain object pose. 
However, the errors due to vision system constraints and lighting conditions may propagate during computation. Data fusion from other active sensors such as RFID can help with object detection, identification and coarse location estimation. We introduce and discuss methods and configurations used for location estimation by both RFID and stereo vision and show accuracy that could be achieved by each of these modalities. In outdoor experiments stereo provided location accuracy within 7.6 in (19.3 cm) whereas for RFID was up to 1.5 m for a tagged person being stationary for few seconds at predefined points. T he ability to detect, identify and locate objects in an environment are some tasks that determine the performance of a tracking system. Object appearance, features, orientation and motion are some characteristics that are used in the recognition and tracking processes. Over the past decade, fusion of RFID and CV has also being used in indoor mobile and industrial robotics to support tasks such as autonomous recognition, localization and tracking. RFID alone 74 has also been researched widely in this quarter. Passive stereo vision can locate detected objects in a 3D volume provided the image of the same object can be identified in two or more cameras. An RFID reader can be used to ID an object observed in some 2D image, thus aiding stereo; or, a network of RFID readers can provide coarse 3D location without cameras. Thus RFID can help with object localization in multiple ways. RFID technology also enables smart objects to communicate information about themselves not available to optical sensors; for example object weight, container content, etc. A tagged rigid object can even help provide an optical observer with a network downloaded CAD model of itself to be used for pose computation by the observer. The focus of this chapter is to analyze object location procedures and accuracy using vision and RFID as single modalities. 4.1 Object detection and blob analysis using vision The site safety system acquires video frames from cameras observing the work site. Simultaneous frames from two cameras can be used as a stereo pair for detecting desired objects and locating them in 3D. To locate an object, detection and identification are initial tasks that are complex processes using vision as a single modality. RFID supports vision in this step. In our approach we locate objects in all frames. Working toward an automatic system, we currently have some manual steps in the research methods. Our vision based detection stage uses color and blob analysis for real-time processing and also allows user interactivity at times for outdoor data to avoid detection failure. We have also explored Hough transform based elliptical shape features for head detection and have used it for offline processing. We have used MATLAB® 2009a to acquire video from the cameras. The video frames are extracted using the image acquisition 75 toolbox. For color based detection, the desired color is extracted from the RGB image. Figure 4.1 shows steps involved in the RGB color detection process. We have also used HSV space for detecting colors, which will be covered in the coming paragraphs. HSV space is capable of separating color components from intensity and is more robust towards lighting changes and shaded regions. The blob analysis steps in our approach are the same in RGB and HSV based color detection. Figure 4.1 Flow diagram of specific RGB color detection and connected components blob analysis. 
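The individual steps of Figure 4.1 are detailed in the paragraphs that follow. As an illustration only, a minimal Python sketch of the color-threshold and connected-components blob stage is given here; our implementation was done in MATLAB, and the HSV thresholds used below are the indoor yellow values quoted later in this section. The sketch returns the blob center of mass, whereas our processing uses the bounding-box center.

```python
# Minimal sketch of the color-threshold + connected-components blob stage of Figure 4.1.
# The dissertation's implementation used MATLAB; this is an equivalent Python/NumPy
# illustration using the indoor yellow HSV thresholds quoted in Section 4.1.
import numpy as np
from matplotlib.colors import rgb_to_hsv
from scipy import ndimage

def yellow_blob_centers(rgb, min_area=50):
    """Return (row, col) centers of yellow blobs in an RGB image with values in [0, 1]."""
    hsv = rgb_to_hsv(rgb)
    h, s, v = hsv[..., 0], hsv[..., 1], hsv[..., 2]
    # Per-band masks (thresholds from Section 4.1), combined where all are true.
    mask = (h >= 0.11) & (h <= 0.19) & (s >= 0.39) & (v >= 0.39)
    mask = ndimage.binary_fill_holes(mask)          # fill holes inside candidate blobs
    labels, n = ndimage.label(mask)                 # connected components -> blobs
    centers = []
    for blob_id in range(1, n + 1):
        area = np.sum(labels == blob_id)
        if area >= min_area:                        # filter out small spurious blobs
            centers.append(ndimage.center_of_mass(labels == blob_id))
    return centers                                  # blob centers feed the stereo module
```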
The input image is converted to a grayscale image. The desired color is then subtracted from the grayscale image and the resultant image is converted to binary. Through connected components analysis, the algorithm merges object pixels that are close to each other so as to create blobs. For example, pixels that represent yellow are a portion of a person's head gear and are grouped together. Next, in the blob analysis the properties of each region are extracted and bounding boxes of these blobs are calculated. Based on these blob properties the individual bounding boxes may be merged together so that each tracked object-part (e.g. safety helmet) is enclosed by a single bounding box. The center of each bounding box in both images is then considered to be the object center point for performing stereo correspondence. An object will have little displacement between two consecutive frames; therefore, the center of the blob provides a strong and useful feature for locating and tracking objects. Figure 4.2 shows desired blobs detected in an input image in an outdoor environment.

Figure 4.2 Blob detection outdoors: (a) original image (b) blobs of blue and yellow balls.

For HSV based color detection the input RGB image is converted to the HSV space and the H, S and V images are extracted individually. A histogram of each of these color bands is then calculated. Based on the color to be detected, minimum and maximum thresholds are defined for the hue, saturation and value images. For example, we have used the following threshold values for detecting the yellow color in our indoor experiments:

Hue threshold low = 0.11, Hue threshold high = 0.19
Saturation threshold low = 0.39, Saturation threshold high = 1
Value threshold low = 0.39, Value threshold high = 1

The defined thresholds are then applied to each color band and individual masks are generated. The masks are then combined to find where all of them are true for the color to be detected. With a little iteration the threshold values of H, S and V can be adjusted according to the test environment. Based upon the size of the detected object, the smaller objects are then filtered out. Holes are then filled in the individual color band images. The desired color mask is then obtained to mask out the desired color from the RGB image.

4.1.1 Elliptical shape features for head detection
The human head can be approximated well with elliptical shape features. We have used the Hough transform based ellipse shape detection given in [100]. This method takes advantage of the major axis of an ellipse to find the ellipse parameters quickly and efficiently. For an arbitrary ellipse, there are five unknown parameters as shown in Figure 4.3. These are the orientation α, center (p0, q0), major axis 2m and minor axis 2n. Their relationships are shown in the equations below, where (p1, q1) and (p2, q2) are two candidate edge pixels assumed to be the vertices of the major axis, and a third edge pixel lies at distance b from the center and distance d from a vertex. The algorithm is initiated by giving a [user specified] range for the major axis, which is then used to find the minor axis of the ellipse. Since only a one-dimensional accumulator array is required to accumulate the length of the minor axis, this step of the transformation is very efficient.

p_0 = (p_1 + p_2)/2    (4.1)

q_0 = (q_1 + q_2)/2    (4.2)

m = \frac{1}{2}\sqrt{(p_2 - p_1)^2 + (q_2 - q_1)^2}    (4.3)

\alpha = \tan^{-1}\!\left(\frac{q_2 - q_1}{p_2 - p_1}\right)    (4.4)

n^2 = \frac{m^2 b^2 \sin^2\tau}{m^2 - b^2 \cos^2\tau}    (4.5)

\cos\tau = \frac{m^2 + b^2 - d^2}{2mb}    (4.6)

Here τ is the angle at the ellipse center between the major axis and the line to the third edge pixel.

Figure 4.3 Ellipse geometry showing basic parameters to define an ellipse.

The background subtraction technique is applied on the incoming frame and the cropped image of the desired person is generated.
This step helps reduce the edge pixel space required in the next steps. Edge detection is performed on the R, G and B channels of the cropped image and a union edge image is acquired. Binary image dilation using linear structuring elements is then performed to acquire boundary pixels. Each pair of edge pixels is considered as a candidate for the two vertices on the major axis of an ellipse. Using these two candidate pixels the four parameters are calculated. Another arbitrary edge point is used to find the half-length of the minor axis n. A voting process is then initiated to acquire the desired n using a one-dimensional accumulator array. Figure 4.4 shows the implementation steps.

Figure 4.4 Implementation steps for elliptical shape feature detection.

Figure 4.5 shows the results of head detection using elliptical shape features. Figure 4.5(a) shows the input 640 × 480 image. The cropped image with the best two ellipses [red and yellow] is shown in Figure 4.5(b). Figure 4.5(c) shows the cropped edge image with the same two best-fit ellipses. For this case, an eccentricity ratio of 0.4 with an orientation angle range between 50° and 90° was used.

Figure 4.5 Results of head detection using elliptical shape features: (a) input image 640 × 480 (b) cropped image with best two ellipses (c) cropped edge image with best two ellipses.

4.2 Stereo vision using ray-ray combination
Stereo vision based on the principle of ray-ray intersection uses cameras that have slightly different pose in space. Unlike various other stereo approaches, the cameras do not need to be specially configured relative to each other; however, they should have an effective and common field of view. Each camera is calibrated to the 3D workspace. The 3D point is calculated by solving the correspondence problem, which entails observing the same feature point in two or more 2D images from these cameras. Some basic concepts about stereo vision and 3D reconstruction are provided in Appendix B. Due to factors such as lens distortion, digitization noise, small camera vibrations and subpixel differences in correlating corresponding points, errors are generated and propagate in the stereo system. If we model camera errors and project the error at the cameras into 3D, then due to the propagating nature of these errors, rays are transformed into cones. The object in 3D lies in the overlapping volume of these cones, referred to as the error volume. Apart from the factors mentioned, the error volume also varies with the selected pose of the cameras. The variation in the error volume with change in camera pose is demonstrated in Figure 4.6.

Figure 4.6 Error cones obtained from projecting 2D imaging error back into 3D.

4.2.1 Stereo configuration
In our setup, the vision system consists of two cameras separately calibrated to the 3D work space. The known 3D points in the work space are used for camera calibration. Figure 4.7 shows the perspective model having two cameras viewing the same 3D workspace. A right hand coordinate system is used. The points farther from the camera have more positive depth coordinates in the camera coordinate system. The site area is considered as the 3D world with its own global coordinate system. A 3D world point is represented as ^W P = [ ^W P_x, ^W P_y, ^W P_z ]^t. The intersection of the two imaging rays ^W P O_1 and ^W P O_2 determines the location of the 3D point ^W P.
We have adopted a general stereo approach [31], [101], [102] where the same feature point ^I P_i observed in two or more calibrated cameras is used to calculate the 3D world point ^W P.

Figure 4.7 General stereo configuration with two cameras viewing a 3D object in a 3D workspace W.

For any required computation, the pose of camera 1 and camera 2 in the 3D world coordinate system W and the camera intrinsic parameters, such as the focal length, must be known through calibration. This information is defined in the camera matrix obtained by calibration, known as the affine method; see [31] Ch 13 and [103] Ch 12. The calibration procedure does not model radial distortion and we do not rectify the images. This method provides a more general form of camera parameterization in which the exterior and interior parameters are combined in the elements C_ij of the camera matrix. Fewer parameters means fewer required calibration points. The affine camera matrix calibration procedure used is explained in Appendix B. For stereo processing, the correspondence between a set of 2D and 3D points needs to be recognized.

4.2.2 Computing the shortest line segment connecting two rays
In practice, two camera rays will not intersect in 3D space. The main causes are approximation errors in the camera models and errors in image point location. Such errors can occur even due to sub-pixel inaccuracy in the image points. Once generated, this error amplifies as the ray propagates in space. To get a reasonable 3D location estimate, the approach of the shortest line segment connecting two rays [31] Ch 13, [95] Ch 10 is used, as shown in Figure 4.8. The coordinate system symbols are dropped from the notation hereafter. The center of this line segment will represent the 3D point. So the smaller the segment, the better the correspondence of the image points, and vice versa. We have also used this segment length criterion as a constraint to solve the correspondence problem. Epipolar constraints are also used in conjunction for robustness. Refer to Appendix B for background on the epipolar constraint.

Figure 4.8 Shortest line segment connecting the two skew rays.

P1 and P2 are points on the ray originating from camera optical center O1 and passing through image point I1, while Q1 and Q2 are points on the ray originating from camera optical center O2 and passing through image point I2. If the optical centers of the cameras are not known, then the camera ray points can be computed using the two equations in Equation A.9 by choosing an arbitrary value of P_z = z. If the computed ray is parallel to the z-axis, the same procedure can be repeated for another coordinate, choosing P_x = x or P_y = y, and so on. u1 and u2 are the unit vectors along these rays, respectively. The shortest line segment is represented by the vector V, which is orthogonal to both u1 and u2 and is given as:

V = (P_1 + a_1 u_1) - (Q_1 + a_2 u_2)    (4.7)

The variables a1 and a2 can be computed using the following set of linear equations.
Here '·' represents the dot product:

[(P_1 + a_1 u_1) - (Q_1 + a_2 u_2)] · u_1 = 0    (4.8)

[(P_1 + a_1 u_1) - (Q_1 + a_2 u_2)] · u_2 = 0    (4.9)

Rearranging Equations 4.8 and 4.9:

[(P_1 - Q_1) + (a_1 u_1 - a_2 u_2)] · u_1 = 0
[(P_1 - Q_1) + (a_1 u_1 - a_2 u_2)] · u_2 = 0

(P_1 - Q_1) · u_1 + (a_1 u_1 - a_2 u_2) · u_1 = 0
(P_1 - Q_1) · u_2 + (a_1 u_1 - a_2 u_2) · u_2 = 0    (4.10)

(P_1 - Q_1) · u_1 + a_1 - a_2 (u_2 · u_1) = 0
(P_1 - Q_1) · u_2 + a_1 (u_1 · u_2) - a_2 = 0    (4.11)

since u_1 · u_1 = u_2 · u_2 = 1. This gives

a_1 - a_2 (u_1 · u_2) = -(P_1 - Q_1) · u_1    (4.12)

a_1 (u_1 · u_2) - a_2 = -(P_1 - Q_1) · u_2    (4.13)

Solving Equations 4.12 and 4.13 for a1 and a2: multiply Equation 4.13 by (u_1 · u_2) and subtract it from Equation 4.12:

a_1 [1 - (u_1 · u_2)^2] = (Q_1 - P_1) · u_1 - [(Q_1 - P_1) · u_2](u_1 · u_2)    (4.14)

a_1 = \frac{(Q_1 - P_1) · u_1 - [(Q_1 - P_1) · u_2](u_1 · u_2)}{1 - (u_1 · u_2)^2}    (4.15)

Similarly, multiply Equation 4.12 by (u_1 · u_2) and subtract it from Equation 4.13:

a_2 [(u_1 · u_2)^2 - 1] = (Q_1 - P_1) · u_2 - [(Q_1 - P_1) · u_1](u_1 · u_2)    (4.16)

a_2 = \frac{[(Q_1 - P_1) · u_1](u_1 · u_2) - (Q_1 - P_1) · u_2}{1 - (u_1 · u_2)^2}    (4.17)

If the magnitude of the vector V is less than a desired threshold, then the 3D world coordinates x, y, z of the point ^W P are given by the midpoint of the segment V:

^W P = \frac{1}{2}[(P_1 + a_1 u_1) + (Q_1 + a_2 u_2)]    (4.18)

4.3 3D location estimation results using stereo vision
We provide here the details and results obtained from our stereo experiments. We have used commodity cameras in an indoor and an outdoor scenario. The indoor test site was surveyed using a laser range finder and tape measurements. The outdoor test site was surveyed using a total station [surveying equipment], a laser range finder and tape measurements. Details about the site survey procedures are provided in Appendix C.

4.3.1 Computing residual error using a jig
The cameras were calibrated using the affine transformation procedure in Appendix B. Test stereo image pairs of a jig were used for further analysis. The jig is a physical object with precise and easily recognizable feature points (yellow) as shown in Figure 4.9. The feature points/corners are then used as calibration points to compute the transformation matrix of each camera. For ground truth, the 3D dimensions of the jig assembly are known. A typical camera matrix C is represented as follows:

C = \begin{bmatrix} c_{11} & c_{12} & c_{13} & c_{14} \\ c_{21} & c_{22} & c_{23} & c_{24} \\ c_{31} & c_{32} & c_{33} & 1 \end{bmatrix}    (4.19)

To do a fundamental experiment we used test images. These images contain the jig with a video tape box. We have used ten corresponding calibration points, i.e. i = 1,...,10 in Equation 4.20, to calculate the transformation matrix.

\begin{bmatrix} ^W P_{x_i} & ^W P_{y_i} & ^W P_{z_i} & 1 & 0 & 0 & 0 & 0 & -^W P_{x_i}\,^I P_{r_i} & -^W P_{y_i}\,^I P_{r_i} & -^W P_{z_i}\,^I P_{r_i} \\ 0 & 0 & 0 & 0 & ^W P_{x_i} & ^W P_{y_i} & ^W P_{z_i} & 1 & -^W P_{x_i}\,^I P_{c_i} & -^W P_{y_i}\,^I P_{c_i} & -^W P_{z_i}\,^I P_{c_i} \end{bmatrix} \begin{bmatrix} c_{11} \\ c_{12} \\ c_{13} \\ c_{14} \\ c_{21} \\ c_{22} \\ c_{23} \\ c_{24} \\ c_{31} \\ c_{32} \\ c_{33} \end{bmatrix} = \begin{bmatrix} ^I P_{r_i} \\ ^I P_{c_i} \end{bmatrix}    (4.20)

where ( ^I P_{r_i}, ^I P_{c_i} ) are the 2D image coordinates of calibration point i, and the two rows are stacked for all calibration points. The calibration points (A, B, C, D, E, F, K, L, N, P) are shown in Figure 4.9(a) and (b). The lower bound on the number of calibration points to be used is eight.

Figure 4.9 Jig images with easily recognizable feature points: (a) left image (b) right image.

The camera matrices obtained from the right and left images are shown in Table 4.1.
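Before presenting the resulting matrices and residuals, the sketch below outlines the preceding computations in NumPy form: solving the stacked linear system of Equation 4.20 for the eleven camera-matrix elements, projecting world points to obtain 2D residuals as in Table 4.2, and recovering a 3D point as the midpoint of the shortest segment between two rays (Equations 4.15-4.18). This is an equivalent outline for illustration, not the MATLAB code used in the experiments.

```python
# Illustrative NumPy sketch of Equations 4.15-4.18 and 4.20; the dissertation's own
# computations were done in MATLAB, so this is an equivalent outline, not that code.
import numpy as np

def calibrate_affine(world_pts, image_pts):
    """Solve Eq. 4.20 in least squares for the 11 unknown elements of camera matrix C."""
    A, b = [], []
    for (x, y, z), (r, c) in zip(world_pts, image_pts):
        A.append([x, y, z, 1, 0, 0, 0, 0, -x * r, -y * r, -z * r])
        A.append([0, 0, 0, 0, x, y, z, 1, -x * c, -y * c, -z * c])
        b.extend([r, c])
    sol, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return np.append(sol, 1.0).reshape(3, 4)        # c34 = 1 as in Eq. 4.19

def project(C, world_pt):
    """Project a 3D world point with camera matrix C (used for the residuals of Table 4.2)."""
    h = C @ np.append(world_pt, 1.0)
    return h[:2] / h[2]

def triangulate_midpoint(P1, u1, Q1, u2):
    """Midpoint of the shortest segment between rays P1 + a1*u1 and Q1 + a2*u2 (Eqs. 4.15-4.18)."""
    d = Q1 - P1
    c = float(u1 @ u2)
    a1 = (d @ u1 - (d @ u2) * c) / (1.0 - c * c)    # Eq. 4.15
    a2 = ((d @ u1) * c - d @ u2) / (1.0 - c * c)    # Eq. 4.17
    V = (P1 + a1 * u1) - (Q1 + a2 * u2)             # Eq. 4.7; small |V| indicates a good match
    return 0.5 * ((P1 + a1 * u1) + (Q1 + a2 * u2)), np.linalg.norm(V)
```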
Table 4.1 Left and right camera transformation matrices.

Right camera:
[ 146.1    114.2    -24.9     525.7  ]
[  35.6    -73.6   -168.4     911.3  ]
[  -0.008    0.01    -0.01      1    ]

Left camera:
[ 183.6   -154.9    -22.7    1206.6  ]
[ -89.9    -80.9   -206.9    1721.2  ]
[   0.01     0.01    -0.01      1    ]

For quantitative analysis, Table 4.2 shows the 2D residuals (absolute difference between original and computed 2D points) in pixels for the left and right images of the jig with tape. Residuals greater than one pixel can be seen for several points. The RMS error of the left image is X^L_rms = 0.9 pixel, Y^L_rms = 0.9 pixel, and of the right image X^R_rms = 0.7 pixel, Y^R_rms = 1 pixel.

Table 4.2 2D residuals for left and right image of jig - scale is in pixels.

Point A, world (0, 0, 0): left original (1208, 1721), computed (1206.6, 1721.2), residual (1.4, 0.2); right original (527, 910), computed (525.7, 911.2), residual (1.3, 1.2)
Point B, world (0, 6, 0): left original (254, 1142), computed (256.3, 1143.2), residual (2.3, 1.2); right original (1121, 434), computed (1120.9, 434.9), residual (0.1, 0.9)
Point C, world (11, 6, 0): left original (1859, 200), computed (1858.8, 199.7), residual (0.2, 0.3); right original (2854, 871), computed (2853.7, 872.3), residual (0.3, 1.3)
Point D, world (11, 0, 0): left original (2792, 635), computed (2791.5, 633.6), residual (0.5, 1.4); right original (2349, 1437), computed (2350.9, 1436.1), residual (1.9, 0.9)
Point E, world (8.25, 0, -4.5): left original (2389, 1617), computed (2390.2, 1617.3), residual (1.2, 0.3); right original (1878, 2002), computed (1878.5, 2000.6), residual (0.5, 1.4)
Point F, world (2.75, 0, -4.5): left original (1644, 2179), computed (1643.2, 2179.1), residual (0.8, 0.1); right original (1012, 1717), computed (1011.6, 1718.7), residual (0.4, 1.7)
Point K, world (2, 0, 0): left original (1528, 1500), computed (1530.8, 1499.1), residual (2.8, 0.9); right original (833, 1000), computed (831.9, 999.4), residual (1.1, 0.6)
Point L, world (2, 6, 0): left original (583, 952), computed (581.9, 951.9), residual (1.1, 0.1); right original (1413, 511), computed (1413.4, 508.8), residual (0.4, 2.2)
Point N, world (9, 0, 0): left original (2537, 810), computed (2537.8, 809.1), residual (0.8, 0.9); right original (1992, 1333), computed (1991.8, 1332.9), residual (0.2, 0.1)
Point P, world (2.75, 0, -1.81): left original (1646, 1740), computed (1645.9, 1738.3), residual (0.1, 1.7); right original (976, 1323), computed (975.3, 1320.1), residual (0.7, 2.9)

Figure 4.10 shows the procedure to compute 2D points of the jig that are not used in the calibration. The 2D points are recomputed by projecting the known 3D points into 2D using the transformation matrices.

Figure 4.10 Procedure for calculating 2D residuals of jig images.

4.3.2 Components of the stereo system used
To perform the stereo experiments the platform used is a Core i5 580M with 4 GB RAM. Two Logitech C210 fixed focus cameras were used for the stereo trials. The focal length of these cameras is ~4 mm. The frame rate is 15 fps at 640×480 resolution. All the computation is done using MATLAB® 2009a.

4.3.3 Indoor stereo computation using a wireframe workspace
Next, for our stereo experiments in an indoor lab environment, we used a wireframe workspace as our test volume. Object detection and identification is achieved using symbolic color [representing RFID]. To avoid illumination and pose problems, the experiment is performed in a controlled indoor scenario using a red ball. Since the ball has a spherical shape, it will have the same projection regardless of viewpoint/pose. Both cameras were connected to the same laptop. Since the cameras acquire image frames one at a time, there is a very small lag in synchronization, which can be disregarded at this stage. To calculate the relative accuracy of the stereo results compared to the ground truth, we initially did experiments based on a dataset of images with known 3D points, instead of live feed frames. The cameras are calibrated using the affine transformation calibration procedure explained in Appendix B. The wireframe workspace volume x × y × z is 27 × 31.75 × 24 in (68.6 × 80.6 × 61 cm). The camera pair used were 27.5 in (69.9 cm) apart. As explained before, our stereo approach does not require the camera poses to be specially configured relative to each other; the distance between them is provided here just to present the layout of the test environment. The cameras were 59.25 in (150.5 cm) away from the wireframe.
Figure 4.11 shows the experimental setup for stereo testing and the spiral track for the red sphere. For reference, point 1 in Figure 4.11 is considered to be the origin. Both cameras locate the red sphere in RGB color space. Runs have also been conducted using the HSV space. The color detection module gives the center of the sphere to the stereo module for every frame captured by the two cameras. The system computes the 3D world coordinates by observing the same feature point (the center of the red sphere) in two 2D images from cameras that are calibrated to the 3D workspace.

Figure 4.11 Wireframe workspace experimental setup for testing stereo localization indoors - red object with spiral trajectory.

With this procedure, whenever two or more cameras are calibrated, the user can then use the camera models to compute the 3D locations of any identifiable 2D feature point set that has not been used for calibration. Ground truth data for several 3D points in the wireframe workspace were acquired. Some of these points were used for calibration and the rest were used to test the 3D sensing accuracy. Experiments show that generally eight calibration points in this setup can provide sufficient accuracy for further location estimation. These eight points can be chosen in a way that creates a volumetric space of interest as shown in Figure 4.11(b) and (c). During the error analysis, the 3D residuals are computed. For points whose ground truth coordinates are known, the residuals are the difference between the ground truth coordinates and those computed via stereo. For other points, the residuals are an estimate of the standard deviation of the computed estimate of the coordinates. Table 4.3 shows the 3D residuals for some points in the wireframe workspace. The maximum error is only 0.34 in (8.6 mm) in a tracking space volume of 27 × 31.75 × 24 in (68.6 × 80.6 × 61 cm).

Table 4.3 3D stereo residuals for some points in wireframe workspace - scale is in inches.

Point 1: actual (13.2, -30.7, -20.7), calculated (13.54, -30.61, -20.99), residuals (rX, rY, rZ) = (0.34, 0.09, 0.29)
Point 2: actual (15.9, -14.6, -15.9), calculated (16.18, -14.29, -15.81), residuals (rX, rY, rZ) = (0.28, 0.31, 0.09)
Point 3: actual (8.2, -10.3, -0.5), calculated (8.46, -10.04, -0.35), residuals (rX, rY, rZ) = (0.26, 0.26, 0.15)

Actual and calculated coordinates are (^W Px, ^W Py, ^W Pz). The RMS error in the three directions is calculated as XRMS = 0.3 in (0.76 cm), YRMS = 0.24 in (0.61 cm), ZRMS = 0.2 in (0.51 cm). The maximum error is ~1.3% of the X-axis of the calibrated volume. Scaled to a construction space of 40 × 40 × 40 m, this error corresponds to XRMS = 44 cm, YRMS = 30 cm, ZRMS = 33 cm. Keeping in view a 5 m radius circle of safety for moving personnel to avoid any collision in the construction environment, a localization error of less than an arm's length in 3D space using a vision-only method validates its real-time applicability. Figure 4.12 shows the 3D stereo location estimation procedure. Figure 4.13 shows the computed trajectory of the sphere.

Figure 4.12 Procedure for 3D stereo location estimation in wireframe workspace indoors - Flow diagram.

Figure 4.13 Computed sphere trajectory in wireframe workspace indoors.

4.3.4 Indoor stereo computation at surveyed lab area
To test stereo performance at a room scale, we surveyed an indoor lab area 280 × 476 × 105 in (7.1 × 12.1 × 2.7 m). The cameras were calibrated using the affine calibration explained in Appendix B. The 11 calibration points chosen are marked with '■' (yellow) in Figure 4.14.
The lab area and the 3D model generated from the survey data along with the position of the stereo cameras represented with '●' (green) are also shown in Figure 4.14. 98 Figure 4.14 3D survey data and model of indoors lab area with stereo cameras '●' (green). Calibration points are represented by '■' (yellow). The stereo cameras were placed 22.5 in (57.2 cm) apart. The area was divided into 4  2 grid with center points marked as shown in Figure 4.14 top image. Each grid cell measured 119  140 in (3.02  3.6 m). The person moved in the common view area of both cameras while wearing a colored hat. The center of the hat was located in both cameras to perform stereo. Color detection was performed in both RGB and HSV space. To calculate the relative accuracy of the stereo results as compared to the ground truth, the experiments were performed with the person standing at the center of each grid cell ~15 ft (4.6 m) to 34 ft (10.4 m) from the cameras. Various other trajectories were also observed to analyze the stereo error. The RMS error observed was 99 XRMS =4.1 in (10.4 cm), YRMS = 6.4 in (16.3 cm), ZRMS = 2.7 in (6.9 cm). It is to be noted that the Y-axis here is along the camera viewing direction. Another example of a trajectory of a person moving randomly is shown in Figure 4.15(a). Since the height of the person is known i.e ~72 in (1.8 m), therefore it is easy to compare the zdimension error here. Figure 4.15(b) shows the histogram of error in z-direction with error mostly accumulated in 3.82 in histogram bin. (a) (b) Figure 4.15 Indoors lab area trajectory for a person moving randomly with error estimation: (a) object tracks in 3D (b) histogram of error along z-axis. 4.3.5 Outdoor stereo computation We installed our cameras in a 40×40 m outdoor test area. The cameras were placed on 4.5 ft (1.37 m) high pillars. The separation between the pillars is ~10 ft (3.05 m). The persons were wearing distinctive color clothing and head gear. We located the center of the head of the 100 persons who moved on predefined points without stopping or, in some runs the 3D data was obtained with the persons being stationary for few seconds at those points. The analysis was done offline over several trajectories to compare with the ground truth. Figure 4.16 shows the left and right image with eight 2D calibration points shown by '■' (yellow). These points were selected in nearby and distant parts of the image. Choosing appropriate calibration points covering the near and far field of the scene is necessary for results with minimal error. (a) (b) Figure 4.16 Outdoor test site with 2D calibration points shown by '■' (yellow): (a) left image (b) right image. The 3D view of the outdoor test site with object tracks is shown in Figure 4.17. '●' (red) shows the center of the test area. The 3D location of the person's head is represented by '+'. Analyzing individually two persons' trajectories moving within 10 ft (3.05 m) to 80 ft (24.4 m) distance from the stereo system, the RMS error observed was XRMS =7.6 in (19.3 cm), YRMS = 101 5.2 in (13.2 cm), ZRMS = 3.8 in (9.7 cm) as compared to the scaled error of XRMS = 44 cm, YRMS = 30 cm, ZRMS = 33 cm, from our lab experiments indoors. Figure 4.17 3D view of the outdoor test area with object tracks shown by '+' (red) using stereo. As shown in Figure 4.18 the observations beyond 80 ft (24.4 m) distance show the outliers with '•' (black) representing the ground. The stereo cameras are represented by '▲' (grey). 
The actual height of the person in Figure 4.18 is 72 in (1.83 m). The detected height of the person on average within 10 ft (3.05 m) to 80 ft (24.4 m) from the stereo system is 70 ± 4 in (10.16 cm). 102 Figure 4.18 Top view of the outdoor test area showing person locations computed using stereo and the ground truth '•'of outliers beyond x = 960 in (24.4 m). 4.4 Active RFID based Real Time Location System A Real Time Location Systems (RTLS) typically refers to a collection of sensors that work together to automatically identify and track the location of objects (including people) in real time. 103 4.4.1 RTLS infrastructure We have used the Convergence Systems Limited RTLS [8] development kit. The kit includes one narrow beam width master RFID reader (CS5113TD) with ethernet support, five narrow beam width slave readers (CS5111TD) and ten active RFID tags (CS3151TC). Figure 4.19 shows four readers and tripods. The master and slave readers come in different beam width configurations. Due to our test area size and narrow beam width readers we have used one master and three slave readers as specified by OEM. Figure 4.19 CSL RFID based RTLS system: Tripods hold the readers. 104 The equipment operating frequency is 2.4 GHz and it uses Time-Of-Arrival (TOA) for location determination, where the distance between the tag and the readers is calculated by the roundtrip time. The equipment works on Non Line of Sight (NLoS) communication. As compared to Received Signal Strength Indicator (RSSI) methods the TOA based location estimation and 2.4 GHz frequency makes the system more robust towards RF energy absorption o o by water and dynamic environments. Each reader has a beam width of 80 in portrait and 30 in landscape orientation. As per OEM instructions, for a cell with a square or near square shape, all the readers should be set up in a portrait manner. For a cell with a highly rectangular shape where one dimension is much longer than the other, all the readers should be set up in a landscape orientation. The active tags run on 3×AAA batteries and are of the size of an iphone 4S and weigh less than the phone with batteries installed. The tag read range is up to 200 m in an open space outdoors. Triangulation is based on the geometric principle of triangles: if one side and two angles of a triangle are known then the other two sides of the triangle can be calculated . Trilateration determines the object position in 2D by measuring its distance simultaneously from three known locations using the geometry of circles. With the TOA scheme, location can be estimated using triangulation or trilateration. Triangulation can be used by antennas that search over a range of angles for best signal strength. Trilateration requires raw data from at least three readers. Intersection of the three circles around the readers can yield the location of the tagged object on the ground plane. However, the problem becomes more complex with issues such as clock synchronization, software delays and multiple paths that results in degraded position accuracy. Beep-Beep is another localization technique reported in [104] which avoids sources of 105 inaccuracy found in typical TOA schemes. The proposed beep-beep system uses cell phones commercially available and provides around two centimeters accuracy within a range of ten meters. However, sound level in the environment can limit the range of the beep-beep system accuracy. Like all other electromagnetic waves, radio waves travel at the speed of light. 
The CSL RTLS basic principle of operation is that the total time of radio waves travel between the tag and the reader is multiplied by the speed of light to calculate the total distance of travel [21]. This number is divided by two to determine one way distance between the reader and tag. The CSL system further refines the estimate by accounting for hardware latency and uses probabilistic positioning algorithms that apply Bayesian statistics. As per CSL policy, details of location estimation algorithm were not provided, therefore analysis at the algorithm level is not possible. The overall location accuracy varies depending on how often the tag transmits, the number of readers used, whether the tag is stationary or moving and the type of structures in the environment. The OEM reports that the system can provide average accuracy of one meter outdoors and two meters indoors for stationary objects. The RTLS tag trajectories acquired in real time provide two dimensional data in the 3D workspace xy-plane due to limitation in CSL software provided with the equipment. Subsequently the readers placement are in the xy-plane of the workspace. The z-plane information is however considered constant and requires user interaction by defining tag height during the run. This assists in estimating correct location of the tag in the 3D workspace. The 106 RTLS location in the xy-plane of the 3D workspace carries sufficient information for supporting the 3D stereo location data and should not be confused with the 2D imagery data. Acquiring real time height data from RTLS is not limited theoretically and can be obtained by updating the software and deploying the readers in a 3D space. 4.4.2 Cell architecture The system allows defining geographical areas, called cells, where tagged objects are localized and tracked. Each cell is made up of at least four readers, in most cases six, and up to eight readers in challenging environments such as an indoor warehouse. The maximum cell size can be 100  100 m. Larger areas can be segregated into multiple cells. To achieve good outdoor location accuracy, the tags should preferably be in the cell area where antenna beams of at least three readers intersect without any solid obstruction that can reflect RF energy. Moreover to increase system accuracy, software optimization tools can also be used. 4.5 RTLS based location estimation We deployed the active RFID based location sensing equipment in indoors and outdoors respectively to analyze its location performance. We provide here the details and results obtained from our experiments. 107 4.5.1 Indoor location sensing We placed our four RTLS readers (one master and three slaves) indoors in a hallway 27×12 ft (8.23  3.66 m). The readers were placed on the four corners to form a single cell. The reference tag was placed at the center of the cell and other points to check for the location accuracy. Figure 4.20 shows the indoor setup of RTLS. The average location accuracy for static tags was ~1.9 m. The performance for dynamic tags varied a lot due to multipath effects in a compact space. Figure 4.20 RTLS location sensing indoors - setup. Figure 4.21 shows the floor map of RTLS indoors. It shows one active cell with the location of four readers and one tag in active state. The master reader is represented here as M1, and the slave readers are represented as S1, S2 and S3. 108 Figure 4.21 RTLS location sensing indoors - Floor map. 
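The CSL positioning algorithm itself is proprietary, but the generic TOA principle it builds on can be illustrated: the round-trip time gives a range to each reader, and ranges from three or more readers at known positions can be combined, for example by linearized least squares, to estimate the tag position in the xy-plane. In the sketch below the reader coordinates, timing values and latency term are assumed placeholders, and the vendor's latency correction and Bayesian filtering are omitted.

```python
# Generic TOA trilateration sketch; CSL's actual RTLS algorithm is proprietary and
# also applies latency correction and Bayesian filtering, which are omitted here.
# Reader coordinates and timing values are assumed placeholders.
import numpy as np

C_LIGHT = 299_792_458.0  # speed of light, m/s

def range_from_roundtrip(t_roundtrip_s, latency_s=0.0):
    """One-way reader-to-tag distance from the round-trip time (Section 4.4.1)."""
    return 0.5 * C_LIGHT * (t_roundtrip_s - latency_s)

def trilaterate_xy(readers_xy, ranges_m):
    """Least-squares tag position in the xy-plane from three or more reader ranges."""
    readers = np.asarray(readers_xy, dtype=float)
    d = np.asarray(ranges_m, dtype=float)
    x0, y0 = readers[0]
    # Subtracting the first circle equation from the others linearizes the problem.
    A = 2.0 * (readers[1:] - readers[0])
    b = (d[0] ** 2 - d[1:] ** 2
         + np.sum(readers[1:] ** 2, axis=1) - (x0 ** 2 + y0 ** 2))
    xy, *_ = np.linalg.lstsq(A, b, rcond=None)
    return xy

# Example: four readers on the corners of a 40 m x 40 m cell (placeholder ranges).
readers = [(0, 0), (40, 0), (40, 40), (0, 40)]
print(trilaterate_xy(readers, [28.3, 28.3, 28.3, 28.3]))  # approximately the cell center (20, 20)
```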
4.5.2 Outdoor location sensing
To check RTLS outdoors we installed the four readers at the four corners of the same test site (40×40 m) where we previously tested the stereo system. A single cell was configured using the four readers. Each reader was placed in a portrait orientation to provide a beam width of 80°. Figure 4.22 shows the outdoor setup of RTLS.

Figure 4.22 RTLS location sensing outdoors - setup: (a) master reader on tripod (b) reader pointing towards test site center (c) reference tag placed in test site.

As per the operator manual, a 15° spatial offset was applied to the readers to direct the antenna beam towards the center of the test area. Reference tags were also placed at specific positions to estimate location accuracy. A tagged person followed exactly the same predefined eight path points used in the stereo runs, with the person being stationary for a few seconds at those points. The tag location was recorded when the person reached the desired points. This approach provided the RTLS system accuracy in the workable area of the test site. Figure 4.23 shows the person locations ('♦') computed using RTLS at the eight points. Four green squares ('■') show the locations of the four readers.

Figure 4.23 Top view of the outdoor test area showing person locations computed using RTLS.

Table 4.4 shows the location data in the xy-plane and the difference between the ground truth observations and the locations obtained using RTLS. The average error obtained along the x-axis is Xavg diff = 60.1 in (1.53 m) and similarly along the y-axis is Yavg diff = 54.1 in (1.37 m).

Table 4.4 Comparison between ground and RFID observations in xy-plane - scale is in inches.

Point 1: ground (120, -1.2), RFID (53.02, 50.4), Xdiff = 67.0, Ydiff = 51.6
Point 2: ground (250, 0.7), RFID (162.5, -36.5), Xdiff = 87.5, Ydiff = 37.2
Point 3: ground (370, -2.9), RFID (333.1, 89.2), Xdiff = 36.9, Ydiff = 92.1
Point 4: ground (534, -4.1), RFID (473.9, 63.7), Xdiff = 60.1, Ydiff = 67.8
Point 5: ground (700, 0.4), RFID (636.4, -45.4), Xdiff = 63.6, Ydiff = 45.8
Point 6: ground (810, -1.3), RFID (781, 53.4), Xdiff = 29, Ydiff = 54.7
Point 7: ground (950, -3.2), RFID (912.8, -22.9), Xdiff = 37.2, Ydiff = 26.1
Point 8: ground (1070, 2.1), RFID (970.3, -55.2), Xdiff = 99.7, Ydiff = 57.3

Figure 4.24 shows the tagged person's locations computed using RFID ('♦') and the ground truth ('○'). The radius of each circle [shown with a dotted blue line] represents the location error between the ground truth and the RFID observation. The center of each circle is at the ground truth observation. The circles are shown in different colors for easy representation. Since the test site center has solid structures, the location error increases as the tagged person moves towards the center of the test site due to factors such as multipath and partial occlusion. In the next chapters we will explain how this coarse location can help fusion.

Figure 4.24 Computed locations of the person using RTLS and error estimation - error circles represent difference between ground truth and RTLS.

The average accuracy for the person's location while stationary at the desired points was observed to be ~1.5 m. The readers' locations need to be accurately provided in the software, otherwise the tag location accuracy will be affected. It is also noted that the RTLS system has a refresh interval of at least ~2 s with four tags in the area. The more tags in the cell, the greater the time required to update the next location of each tag.

4.6 Summary discussion
In this chapter we have evaluated the localization performance of stereo vision and RFID as single modalities. Working towards an automatic system, we currently have some steps in the research methods that require user interactivity.
We have studied and implemented the ray-ray stereo scheme [31] for 3D localization in an indoor lab environment using commodity cameras and have reported an RMS accuracy of ~0.34 in (8.64 mm). Later the efficacy of the stereo approach was tested outdoors in a surveyed test site. In our analysis we have obtained RMS location accuracy within ~7.6 in (19.3 cm) which offers potential for future researchers to examine automated three dimensional tracking outdoors using economical vision sensors. Later an RFID based location system was used for analyzing location performance. We have assessed that the system can achieve ~1.5 m accuracy outdoors for tagged persons following predefined paths and being stationary for few seconds at selected points. The acquired RTLS system presently provides real time location in the xy-plane of the 3D workspace with approximately two seconds minimum system latency due to OEM software constraint and no provision of copyrighted location processing details to the customers. The z dimension, though available comes with planar restriction defined by the tag height. Getting interpolated location estimates and varying tag height data is however, not limited in theory. In general we have shown how proper sensor placement can support localization and tracking. This enables the system to spend more resources on tracking anomalies -"unauthorized" objects, machines, or materials in a site safety environment. 114 _____________________________________________________________________________________________ CHAPTER 5 Fusion dynamics and analysis _____________________________________________________________________________________________ In performing data fusion, our aim is to combine and enhance the sensor information - so that it is better than would be possible if the sensor observations were used individually. In this chapter we define and characterize our sensor relationship w.r.t recognized standards and explain the basic building block of fusion. Next, general fusion benefits in a tracking environment are listed. Finally, particular examples of fusion using CV and RFID are explained to highlight how fusion may address the weaknesses of single sensing modalities. W ith a single sensing modality the sensor will cover a limited region of the environment and provide measurement data of only local events, aspects, or attributes. Frequency of measurement or the refresh rate and the accuracy/precision of the basic sensing element in the sensor are some other constraints worth considering while choosing sensors for a tracking application. Fusion of sensor data is a dynamic process that involves association, correlation, and combination of data derived from multiple sensors resulting in a fused product with a common representational format, which is more complete and accurate. Employing more than one sensor 115 in the workspace may enhance the synergistic effect in several ways, including: increased spatial and temporal coverage, increased robustness to sensor and algorithmic failures, better noise suppression and increased estimation accuracy. Data from multiple sensors could be of the same type or of different types. For the fusion process this data has to be represented in a common format that is meaningful in order to estimate or predict some aspect of an observed scene. In the paragraphs below we characterize our sensor setup and fusion approach in light of recognized standards. 5.1 The fusion process approach Boudjemaa et al. 
[105] categorized fusion as across sensors, across attributes, across domains or across time. In fusion across sensors a same property is measured by a number of sensors versus the across attributes category where sensors measure different properties associated within the same workspace. In across domain the sensors measure the same attribute over varying domains. Lastly the fusion is categorized as across time when current measurements are fused with prior information. Our sensor infrastructure as single modalities exploit a fusion scheme across sensors. This is because two or more cameras are used to observe the same workspace for stereo vision; however independent observations can be utilized when calculating depth information from single camera and single RFID reader. Also four or more RFID readers provide object ID and location information. Common representation of the location information x,y,z from RFID and vision as single modalities, into tokens  depicts fusion across sensors. However, the 116 different feature set v of these modalities provided to resolve ambiguities between objects exemplifies the use of fusion across attributes. Our S-3 system can also use prior records {}, or perhaps even forward records {} to determine L = f () to update the object tracks presenting fusion across time. 5.2 Multiple sensor configuration Durrant-Whyte [106] classifies a multiple sensor data fusion system according to three basic sensor configurations. They are described as complementary, competitive and cooperative. In complementary configuration the sensors may not have a direct dependency relationship, however they can be combined to provide a comprehensive image of the phenomenon under observation. It is exemplified by the use of multiple cameras, each observing different parts of a workspace, to provide a complete view of the scene. Complementary sensors help resolve the problem of incompleteness. Fusion of complementary data is relatively easy because the data from independent sensors can be appended to each other. A competitive relationship among sensors is described as independent measurement of the same property by each sensor. This configuration allows combining competing sensors that are not necessarily identical. Calculating refined object location by RFID and vision location estimate uses competitive relationship. It provides robustness and fault-tolerance because comparison with another competitive sensor can be used to reduce the effects of uncertain and erroneous observations. Cooperative type data provided by two independent sensors are used to derive information that would not be available from a single sensor. The resulting data will be sensitive to the inaccuracies in all the individual 117 sensors. Our stereo approach where two independent cameras are used to compute the 3D pose of an object comes under cooperative sensor configuration. In terms of usage, these sensor configurations are not mutually exclusive because more than one of these categories can be used in most cases. Figure 5.1(c) illustrates complementary configuration where cameras are networked to cover most of the workspace area and increase the workspace visual coverage. Figure 5.1(d) explains competitive sensor configuration where both RFID are CV are used to obtain ID and refined 3D location of the tagged object. (a) (b) (c) (d) Figure 5.1 Multi-sensor configurations: (a) cooperative (b) cooperative (c) complementary (d) competitive. 
5.3 Building block for multiple sensor fusion
The basic building block for multiple sensor fusion is called a fusion node. An overall system can have a distributed network of these nodes. The sensor observations ^s O_{i,j} are received as single or group inputs to the fusion node:

s ∈ {CV, RFID, group},  i ∈ {1, 2, 3, ...},  j ∈ {1, 2, 3, ...}

^s O_{i,j} = \langle\, ^s x_{i,j},\ ^s y_{i,j},\ ^s z_{i,j},\ ^s t_{i,j} \,\rangle    (5.1)

The single inputs, for example, can be 2D image points in the case of stereo calculation. The sensor group input is provided when a 3D location of the object is provided by the cameras or RFID. The ^s z_{i,j} component is not applicable for 2D calculations. The feature set ^s v_{i,j} is provided to the node as auxiliary information from the sensor. This includes information such as color, texture, ID, height, dimensions, predefined labels, etc., that is generated dynamically or accessed from the database. The sensor process in the node processes the observations and forms tokens of the form

\langle\, ^s x_{i,j},\ ^s y_{i,j},\ ^s z_{i,j},\ ^s t_{i,j},\ ^s v_{i,j},\ ^s L_{i,j} \,\rangle    (5.2)

The node receives data in a common representational format and initiates data association, estimation and filtering processes. Data association and estimation can be based on hard decision methods, such as nearest neighbor, Euclidean distance, etc., whereas Kalman filtering or particle filtering can be used for a probabilistic approach. We have used a hard decision method to correlate sensor observations, which involves discrete relaxation labeling. All the data received by the node is combined to produce a fused token. The fused output token may, or may not, differ in position, time, value and uncertainty from the input observations. The fused output is also provided back to the node for fusion across time, which might be required to generate the object tracks. Figure 5.2 shows a graphical representation of a fusion node [107], which can be integrated into applications such as temporal tracking of objects in a given environment.

Figure 5.2 Basic fusion node architecture.
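As an illustration of how the tokens of Equations 5.1 and 5.2 might be represented in software, the sketch below defines observation and token structures and a simple competitive-fusion step that merges a CV observation and an RFID observation of the same object. The field names and the merge policy are assumptions made for this sketch; the dissertation does not prescribe a particular data layout.

```python
# Illustrative token structures for Equations 5.1 and 5.2; field names and the merge
# policy are assumptions for the sketch, not a data layout prescribed by the dissertation.
from dataclasses import dataclass, field
from typing import Optional, Set

@dataclass
class Observation:                    # Eq. 5.1: <x, y, z, t> from sensor s for object i at step j
    sensor: str                       # 'CV', 'RFID', or 'group'
    x: float
    y: float
    z: Optional[float]                # not applicable for 2D calculations
    t: float
    features: dict = field(default_factory=dict)   # auxiliary feature set v (color, ID, ...)

@dataclass
class Token:                          # Eq. 5.2: <x, y, z, t, v, L>
    x: float
    y: float
    z: Optional[float]
    t: float
    v: dict
    labels: Set[int]                  # candidate object labels still compatible

def fuse(cv: Observation, rfid: Observation, all_labels: Set[int]) -> Token:
    """Competitive fusion: keep the finer CV location, take ID/features from RFID,
    and keep only the labels compatible with the RFID identity, if one is present."""
    v = {**rfid.features, **cv.features}
    labels = {v["id"]} if "id" in v else set(all_labels)
    return Token(cv.x, cv.y, cv.z, cv.t, v, labels)
```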
A customer, or “sweetheart” clerk, can cheat by changing the tag from one item to another, or perhaps using a counterfeit tag. Some CV capability at the checkout station can guard against this: visual features of the item can be sensed and compared to symbolic features stored on the tag. IBM’s Veggie Vision system [43] can recognize 350 121 produce items using color, texture, shape, and size features -- produce items are usually not tagged with bar codes unless packaged. This technology can be extended to thousands of other supermarket items as a check on bar code or RFID recognition. A related application of RFID is the EZ Pass highway toll-collection system [15]. Typically, a video camera is included in the system to take frames of cars that pass through without a legal transaction so that fines may be levied based on license plate ID. CV could be used here as a check that the car with the EZ Pass transponder has visual features corresponding to those stored in the active RFID tag. This adds security against theft or cloning. CV can be used similarly for person ID when credit cards or “smart cards” are used. Symbolic features on the card can be compared to an image of the live person using the card - either by a clerk or automatically. The state of the art of person verification is good enough to support this operation. There are many possible applications based on fusion of RFID and vision. They can range from reliable object and person recognition, to assisting persons with disabilities, making shipping more efficient, and enhancing construction site safety. Specific examples are already provided in Chapter 2. Some important general characteristics of the fusion are as follows: a. Improved object recognition accuracy - Uncertainty depends on the object being observed and arises in case of occlusions and limited sensor measurement accuracy. Since the primary purpose of RFID is object identification, object detection and identification/recognition will be much more accurate in the fused infrastructure compared to a CV only approach thereby decreasing uncertainty. 122 b. View independent tracking - The fused system has the main benefit of view independent tracking. This is only possible due to the powerful tool of wireless identification in RFID. c. Easy object representation - Representation of the objects and/or humans is a complicated operation when using only 3D CV. The fusion approach provides direct symbolic representation. d. Robust segmentation - With the fusion approach, object identification in a scene becomes easy and accurate since object model information can be retrieved from the object tag. Fusion can provide robust segmentation of images in real time. e. Training free learning - The fused system can train itself on the fly for the tagged objects in the environment. By directly sensing object ID, the system can efficiently learn object appearance models using the CV component without human intervention. New objects teach the system when they arrive. f. Compatibility of RFID passive tags and CV - RFID passive tags are of very small size. Applying them to objects does not change their appearance or pose. Therefore their use does not hamper operation of any vision based techniques. Active tags though come in bigger sizes. 123 g. Efficient handling of occlusion - The fused system can be used for handling the occlusion problem without increasing the computational and hardware complexity if multiple cameras are used. h. 
Supplemental object information - The onboard memory of an RFID tag can carry a variety of information about the tagged object. For example, an object can transmit its color and size, which supplements the visual feed. This element is essential in designing autonomous systems that require real time interaction. Moreover, RFID tags can transmit important non-visual information, such as object weight or chemical composition. i. Increased spatial and temporal coverage - The networked RFID and cameras can increase the workspace spatial and visual coverage. j. Improved resolution - When multiple independent observations from similar sensors of the same property are fused, the resolution of the resulting value can be better than a single sensor’s observation To benefit from sensor fusion, there is a need to combine the strengths of RFID and CV to avoid problems of each mode. Fusion must be designed so that problems such as failure in untagged environments, high data rate, and lack of positional information do not defeat its principles. 124 5.5 Test cases to explain fusion and its analysis We illustrate in this section how CV and RFID in a competitive configuration supplement each other in critical test cases. For clarity we first assume that for case I and II, both CV and RFID feeds are continuously available and the objects are not occluded by each other or by the background and are moving with approximately the same velocities. We briefly mention relaxation labeling based filtering which eliminates the incompatible labels iteratively. To maintain sequential flow of concepts, relaxation labeling will be explained in Chapter 6. The cases mentioned below are symbolic versions of our outdoor test trials. 5.5.1 Test case I - Same colored objects For case I, consider two objects represented as '▲' and '■' with 3D data at each instance over nine time frames. For better visualization the tracks in Figure 5.3(a) are displayed in the xyplane. The objects are converging from north to south towards each other, and intersect at time frame 5 and thereafter follow their direction of motion. Even if the objects were moving in a straight line, the points would appear to be scattered along the true path due to propagating location errors and distortions in a 3D space. CV and RFID location accuracies are shown with circles. The inner circle around every point shows the localization error of CV and the outer circle represents that of RFID. Figure 5.3(a) and (b) shows case I where both the objects are of the same color. Figure 5.3(a) shows the object tracks and Figure 5.3(b) represents label assignments. The CV system can correctly assign labels to '▲' as label 1 and '■' as label 2 up until time frame t = 4. Thereafter there is a probability of no ID assignment, which based on 125 relaxation labeling means no wrong label elimination and is represented here as keeping both possible labels for both points. On the other hand, RFID provides correct label assignments other then at t = 5 due to fully overlapping localization error of point '▲' and '■'. In this case RFID helps CV to generate correct object tracks. However, no label elimination in the intersection area is possible. CV contributes by refining location. (a) (b) Figure 5.3 CV and RFID supplementing each other - Case I: (a) same colored object tracks (b) label assignments. 5.5.2 Test case II - Different colored objects Figure 5.4(a) and (b) shows case II where objects are of different color. 
Due to no occlusion CV will be able to provide correct label assignments. However, RFID will have no label assignments at t = 5. Fusing both feeds, CV supplements RFID here and the label assignment at t = 5 is obtained. 126 (a) (b) Figure 5.4 CV and RFID supplementing each other - Case II: (a) different colored object tracks (b) label assignments. For both case I and II, CV support can also be clearly appreciated when RFID location error is maximum (i.e on outer circle boundary) for two points having overlapping localization error in consecutive frames. 5.5.3 Test case III - Different colored objects with intermittent RFID/CV feeds Figure 5.5(a) and (b) shows case III where some of the objects are occluded by the background and the RFID feed is intermittent. This is represented here as missing vision and/or RFID location accuracy circles. The dash symbol shows non-availability of observation for that time instance. Comparing Figure 5.5(a) and (b) it is obvious that CV and RFID supplement each other at missing spots and fusion of these generates correct label assignments. 127 (a) (b) Figure 5.5 CV and RFID supplementing each other - Case III: (a) considering visual occlusion and intermittent RFID, different colored object tracks (b) label assignments. 5.6 Summary discussion This chapter presented a multi-sensor fusion configuration that combines information from vision and RFID and generates tokens. We have presented the attributes of our fusion scheme to explain its adaptability towards established fusion standards. It explains the competitive and cooperative relationships of our sensors and the fusion building block needed to develop the fused system. The potential benefits of fusion focusing on localization and tracking tasks are also highlighted. To practically elaborate the potential in fusion, we have demonstrated test cases where fusion disambiguates object tracks and combines the strengths of RFID and vision to avoid problems of each mode. 128 _____________________________________________________________________________________________ CHAPTER 6 Tracking using fusion of CV and RFID _____________________________________________________________________________________________ Object tracking comprises estimation of the current position and orientation of a tracked object and its motion, usually based on noisy measurements that generate uncertainties especially for dynamic objects. This chapter covers the algorithm development procedure for object tracking using fusion. We introduce a relaxation labeling scheme that can implement object tracking. The constraint satisfaction process is based on fusion of CV and RFID. Further, we have used smoothing for optimization, which is a high level technique that operates globally, to update computed tracks for increasing tracking reliability. A n object track is k or more tokens in time sequence with consistent object ID and features that also satisfy constraints for motion in space. The complexity of the tracking problem increases for multiple object tracks. Observations from multiple sensors helps decrease the uncertainty. 6.1 Object tracking model using sensor fusion In the process of object tracking, object tracks are updated by correlating sensor tokens with the existing tracks or by initiating new tracks using tokens from different sensors. An object 129 track here can be defined as a temporal sequence of assigned tokens with consistent features and label that also satisfy constraints for motion in space. 
Token association, which provides tokentoken or token-track correlation, facilitates the constraint satisfaction iterative steps. The process of relaxation labeling filter out incompatible objects leaving behind compatible candidates for track updates by the optimization process. We have used smoothness as an optimization process. Details of the smoothness algorithm will be provided towards the end of this chapter. Figure 6.1 describes tracking using sensor fusion. Figure 6.1 Schematic diagram of object tracking using sensor fusion and relaxation labeling. 6.2 Labeling via iterative processing Relaxation labeling is an attractive technique because it is highly parallel, involving the propagation of local information via iterative processing. Suppose there are N objects detected at 130 a particular time instance t. We use discrete relaxation to create the tracks of these N objects and to update the time tokens comprising each track. Using relaxation, different sensors and sources of information can be turned on or off for experimentation or for practical reasons at a site. Fusion processes operate on a blackboard containing the set of tokens. When an observation is made, its initial label set is the set of all possible N known objects. Filtering processes are then applied to eliminate labels inconsistent with constraints. Sensing continues over the T time steps and naïve physics processes aggregate object consistent tracks. For clarity, suppose that N objects are detected at time t=1 and that we arbitrarily label these objects 1,…,N. At time t=2, we have another N observations and we want to label each of those with the labels from time t=1. A label possible for a token at time t=2 will be consistent in color, motion, and RFID ID with the tokens at time t=1. Initially, a new observation may detect any of the known objects, so all labels L are possible. A totally new object entering the site could be given a new unknown label. Most of these labels are filtered out quickly by failing constraint satisfaction criteria. For example, suppose five orange hard hats are detected at t=1 and these have initial labels 3,4,6,8,9. For any token for time t=2 that is not orange, labels 3,4,6,8,9 will be deleted from its possible label set. Filtering can be done by space as well as by color. If any token at time t=2 is unreasonably far from a token m at time t=1, then label m should be deleted from its label set. 131 6.2.1 Sensor process A CV sensor takes a video frame, segments it into special regions of color, and provides two points on each of the imaging rays that are presented as sensor observations. Object region features are provided in the feature vector v. Using this information, the sensor process then generates tokens, one for each segment detected. Object label L is initially unknown. An RFID reading produces a similar token, except that an object label L is known in almost all cases. 6.2.2 Combination process Fusion processes take the sensor observations and generate tokens and possibly merge information using ray intersection, ray-surface intersection, etc., whichever applies, and outputs a token with refined 3D location or label information. Filtering processes eliminate unlikely token labels by comparing tokens and by looking at feature vectors over time. The current software implementation of relaxation inputs combined tokens that have been pre-computed from stereo correspondence. Similarly, RFID tokens have 3D information from the encapsulated RTLS system. 
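As a concrete illustration of the filtering described above, here is a minimal sketch (Python) of discrete label elimination using a color constraint and a spatial-proximity constraint between consecutive frames. The token fields, the distance threshold, and the function name are illustrative assumptions, not the S-3 implementation.

```python
import math

def eliminate_labels(prev_tokens, cur_token, max_move=2.0):
    """Discrete relaxation step: start from all labels seen in the previous
    frame and delete those inconsistent with color or with a distance threshold."""
    candidates = set()
    for p in prev_tokens:                 # prev_tokens: dicts with x, y, z, color, label
        # Color constraint: a non-matching color eliminates that label.
        if p["color"] != cur_token["color"]:
            continue
        # Spatial constraint: keep the label only if the object could have
        # moved this far in one time step (threshold from speed/accuracy).
        dist = math.dist((p["x"], p["y"], p["z"]),
                         (cur_token["x"], cur_token["y"], cur_token["z"]))
        if dist <= max_move:
            candidates.add(p["label"])
    return candidates                     # remaining compatible labels

if __name__ == "__main__":
    prev = [{"x": 0, "y": 0, "z": 0, "color": "orange", "label": 3},
            {"x": 5, "y": 5, "z": 0, "color": "orange", "label": 4},
            {"x": 1, "y": 0, "z": 0, "color": "yellow", "label": 6}]
    cur = {"x": 0.5, "y": 0.4, "z": 0, "color": "orange"}
    print(eliminate_labels(prev, cur))    # {3}: label 4 too far, label 6 wrong color
```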
6.2.3 Tracking process

Naïve physics constraints are used to filter out highly unlikely labels for objects at time t based on the recent history of objects continuing from the k previous time steps. Our current results have used the current and two previous time steps.

6.2.4 Relaxation labeling algorithm

The processes and constraints described above are now formalized into an algorithm:

Output: object labels at L_k and 3D location XYZ_refined ∈ R^3 for each object
Input: object labels at L_{k-2} and L_{k-1} with color, RFID, and XYZ_RFID and XYZ_stereo ∈ R^3

FOR t = k : Kframes                          /* process all time frames */

  Detect
    Obtain color information, if any, for the XYZ_stereo observations from 2D histogram matching
    Sort colors into groups                  /* how many colored hats and balls, and which colors */
    n = number of XYZ_stereo observations detected      /* motion detection and color detection */
    m = number of XYZ_RFID observations detected        /* active RFID */
    Merge tokens and generate empty label matrices for p observations     /* p = max(n, m) */
    Assign p labels to all p observations and proceed to the next pass

  Identify                                   /* binary relationship criteria */
    Identify XYZ_stereo observations based on color information
    Identify XYZ_RFID observations based on ID
    Correlate identity information
    IF only one color group                  /* all XYZ_stereo observations have the same color */
      No label elimination; proceed to the next pass
    ELSEIF different color groups            /* some XYZ_stereo observations have different colors */
      Eliminate labels from the p label matrices based on the respective color groups and proceed to the next pass
    END IF

  Locate                                     /* binary relationship criteria */
    Set stereo and RFID location threshold values       /* thresholds are defined based on sensor location accuracy and object speed */
    Locate XYZ_stereo observations
    Locate XYZ_RFID observations with ID and location
    Correlate location information
    FOR i = 1 : p
      StereoObservationSet(i) = starting at t = k-1, use the stereo threshold and find near neighbors at t = k for the XYZ_stereo observation
      IF label is in StereoObservationSet(i)
        Keep label
      ELSE
        Eliminate label and proceed to the next pass
      END IF
    END

  Smooth
    Calculate the direction of flow/velocity for every XYZ_stereo observation at t = k relative to k-2 and k-1     /* the z dimension gives valuable information here */
    Correlate with RFID label(s), ID and location information from XYZ_RFID
    IF no difference in flow detected
      Labels kept
    ELSEIF difference in flow detected
      Eliminate unlikely labels              /* only compatible labels remain */
    END IF

  Compatible label(s) obtained               /* all possible labels for the specific object */
  Compatible label(s) provided to the optimization process to obtain XYZ_refined

END FOR

6.2.5 Test cases to analyze discrete relaxation labeling

To realize how the dynamics of relaxation labeling can fuse information, we describe here some related critical test cases.

6.2.5.1 Test case IV - Two same colored objects with simple dynamics

We start with simpler dynamics, as shown in Figure 6.2, where two objects having the same color features move from west to east, first converge, move side by side for some time, and then diverge. The possible compatible labels after fusion are shown below each block of time frames. Assuming there is no occlusion, the labeling algorithm generates unique compatible labels '▲', '■' before the object tracks intersect at t=4. Thereafter there will be no label elimination until t=7 due to the overlapping location errors of CV and RFID. The applied constraints can then categorize labels as inconsistent at t=8 and onwards.
Moving towards a more complex scenario in the next test case, the structure and efficacy of the relaxation scheme is explained step by step. 136 Figure 6.2 Correct object tracks with possible compatible labels at each block of time frames. 6.2.5.2 Test cases V - Four same colored objects with increased complexity Two persons '▲', '♦' carrying two balls '●', '■' move towards each other and meet at the center of the test area. They then exchange the balls and move back towards the direction of their starting positions. The return paths are separated for better illustration. Both persons and both balls are tagged. To test algorithm robustness and increase complexity we consider that the color of the balls and the persons head gear are the same. Figure 6.3(a) and (b) show the 3D points over consecutive time frames. Here Figure 6.3(a) represents the correct trajectories of the persons and the balls. If we assume that there is no occlusion and the stereo feed is continuously available then Figure 6.3(b) shows incorrect trajectories of the persons calculated by the stereo feed alone. 137 (a) (b) (c) Figure 6.3 Test case showing relaxation labeling: (a) correct trajectories of persons and balls. (b) correct balls and incorrect persons' trajectories (c) left matrix - General pattern of four relaxation constraint passes and final compatible label/s. Right matrix - RFID location information. 138 It is assumed that we have information about the feature set and 3D location of the object labels at time frame t=1 and 2. For subsequent time frames we correlate CV and RFID information and apply constraints in detect, identify, locate and smooth passes. Constraint based label elimination by these filtering passes update the label matrix for every observed point at every time instance t>2. Once all impossible labels are removed and no further elimination is possible then the remaining label/s is/are considered as compatible label/s. The label/s is/are then passed on to the post-processing optimization process for updating fused token feature vector v and the refined location XYZ and, where required, determining a possible unique label amongst the four compatible labels sets. The optimized labels acquired are then assigned to the observed points respectively. Figure 6.3(c) demonstrates a typical label matrix on the left that shows all four passes with the remaining compatible labels at the end. The RFID location information on the right is shown with each label matrix to provide evidence of objects presence. For the observations in Figure 6.3(a), a step-by-step explanation is provided on how the label matrices in Figure 6.4 are updated. For time frame t=3 and 4 in each label matrix the objects are detected in the detect pass based on motion, color and ID and subsequently all the possible labels are assigned to all the observed object points. In the identify pass the system identifies objects based on color groups and ID from RFID. Since the color information for the observed points is the same no label is eliminated at this pass. The color histogram similarity measures can be used for color based sub-grouping. In the locate pass the labels are eliminated based on near neighbors where thresholding is done using sensor location accuracy and object speed. This helps identify '▲','●' and '♦', '■' as consistent label pairs. The two inconsistent labels are then eliminated from the respective label matrices. 
The smooth pass correlates labels with RFID and deletes one further label with unlikely motion according to local (3-point or 2-point) smoothness and object height constraints. This completes relaxation labeling. Global tracking is done as post-processing. Note that at t = 5 the process of label elimination is complex due to the overlapping location errors of stereo and RFID, and therefore label elimination is not possible in the detect, identify and locate passes. During the smooth pass, RFID provides no label elimination information, showing all four labels '▲','♦','●','■' as valid; however, the system identifies '●','■' and '▲','♦' as possible label set pairs based on the object height and velocity constraints and subsequently outputs two compatible labels.

Figure 6.4 Label matrix updating steps for same colored objects at each time frame for the Figure 6.3(a) tracks.

The compatible labels from relaxation are fed to the post-processing optimization process to identify the optimal label for each observation. It is to be noted here that the system keeps one extra label as part of the possible compatible label set. This reflects a tradeoff between the increased post-processing computation of keeping a wrong label and the cost of eliminating a correct label. Since the objects have the same color and are assumed to be moving with the same velocity, at t=6 the color and near-neighbor constraints will not provide valuable information for label elimination. In the smooth pass, based on height and direction of flow relative to the previous velocity vector direction, CV identifies '▲','●' and '♦','■' as compatible label sets for the respective observations. These two label pairs for each observation represent correct trajectories of the balls but incorrect trajectories of the persons, as shown in Figure 6.3(b). However, RFID provides '♦','●' and '▲','■' as possible label pairs. Correlating this information helps obtain one correct compatible label for each observation.

6.3 Optimization process

An optimization process is applied to the compatible labels obtained after relaxation labeling so that the tracks can be optimized. We explain here how smoothness, which is a global operator, can help optimize object tracks. The process only optimizes some token parameters without violating specified constraints. Smoothness of trajectories requires a burst of time frames. Our tracking version used a burst of four consecutive time frames to compute track smoothness, curvature, and acceleration.

6.3.1 Smoothness of trajectories

Our smoothing algorithm is motivated by the Sethi-Jain [96] and Veenman et al. [97] algorithms. It can work in either 3D or 2D and includes more information on some objects at some time instants [as is available from the RFID or vision sensors] than those previous algorithms. If a 2D algorithm is used, the constraint that two objects cannot be in the same location at the same time should be relaxed, since it may just be that one object is occluding the other at some instant. The general algorithm will have different specializations depending on the application and on how much sensor information and how many object constraints are available. For example, in the S-3 system, our cameras are calibrated to a surveyed 3D terrain, so if an image object is known, then an approximate 3D object location can be computed using the image from a single calibrated camera (we can just intersect a camera ray, or cone, with a surface in the 3D space).
Our goal is to optimize the tracking based on the heuristics explained in Section 3.4, which will provide the means for safer activities and more efficient site management. A set of T vectors of information for each time frame t = 1, 2, 3, …, T is provided as input. Each of these time frames represents a "frame vector" that contains N tuples ⟨x, y, z, t, v, L⟩, where label L may identify a known object (L = 1, 2, …, q or N) or it may be unknown (L = 0). The purpose of the algorithm is to assign (discover) labels L = 1, 2, 3, …, q or N to each position tuple at each time t.

6.3.2 Tracking algorithm description

The algorithm takes input tokens containing N observations of 3D points over T time instants; thus NT tuples in total are grouped into T time frames. It extracts a smoothest set of paths through these points, observed in frames 1 to T; all tuples are then grouped into N tracks. The object ID and location provide labeling information with the 3D points when available, i.e., L = 1, 2, 3, …, N. The number of such "tagged points" must be less than or equal to T for any of the objects n. For object track n, the trajectory consists of 3D points at each time frame t = 1, 2, 3, …, T. The trajectory with label L is represented as:

$C_L = [\, P_{n,1}, P_{n,2}, \ldots, P_{n,T} \,]; \qquad P_{n,t} = \langle x, y, z, t, v, L \rangle$   (5.3)

As in Sethi-Jain [96], the path difference between two consecutive 3D points is defined as:

$D_{n,t} = P_{n,i} - P_{n,j}; \qquad i > j$, with $i, j$ the consecutive time indices adjacent to $t$   (5.4)

Smoothness at a current point $P_{n,t}$ is calculated using the previous point $P_{n,t-1}$ and the next point $P_{n,t+1}$. $D_{n,t-1}$ is the path difference between the current and previous point and $D_{n,t+1}$ is the path difference between the current and next point. The smoothness value $S_{n,t}$ of a 3D point is then defined as follows:

$S_{n,t} = w\,\dfrac{D_{n,t-1} \cdot D_{n,t+1}}{\lVert D_{n,t-1}\rVert\,\lVert D_{n,t+1}\rVert} \;+\; (1 - w)\,\dfrac{2\,\sqrt{\lVert D_{n,t-1}\rVert\,\lVert D_{n,t+1}\rVert}}{\lVert D_{n,t-1}\rVert + \lVert D_{n,t+1}\rVert}$   (5.5)

To yield $0 \le S_{n,t} \le 1$, a weight factor $w$ is used such that $0 \le w \le 1$. The initial points of the N object tracks are assigned arbitrarily. The total sum of smoothness over the T time frames for a single object track n with assigned label L is then given as:

$S^{L}_{Total} = \displaystyle\sum_{t=2}^{T-1} S_{n,t}$   (5.6)

One can then define total smoothness for all tracks by summing smoothness over all N tracks. For efficient implementation, the algorithm uses a burst or block set concept. A real-time algorithm must make decisions within, say, a fifth of a second, or six video frames. This limits the amount of look-ahead that can be used. A burst or block set, denoted by B, is defined as a sequence of N-tuples for a fixed length of time frames m. The size of B is then N × m. From RFID properties, it is reasonable to assume that object IDs and their respective locations persist or are absent for multiple frames. In reality, these bursts can have arbitrary length and start time; however, in our simulations and analysis we assume more regularity. Step by step implementation of the algorithm is as follows:

Input: N tokens of 3D points over T time instants      /* each time frame, having N rows, is considered a frame vector */
Output: Smoothed trajectories C_N

  At t=1, for all N object tracks, assign labels n = 1, 2, 3, …, N arbitrarily to the frame vector.   /* initialize labeling */
  Define burst length m.                                /* m = 3, 4, 5, 6 */
  Assign k = 0.                                         /* initialize k */

FOR t = 2 : m-1 : T-1                                   /* loop over T-1 time frames with increment of m-1 */
  k = k + 1                                             /* increment k */
  Label L may identify a known object (L = 1, 2, …, q or N) or it may be unknown (L = 0).   /* availability of partial label information will reduce the number of
possible combinations */
  Optional: use nearest-neighbor assignments for the t-1 : t frames with N tracks.          /* if selected, this helps reduce the combination volume in the next step */
  Consider the t-1 : t-1+m time frames with N object tracks and form frame vector block set B_k having ^kC_N sub-trajectories.   /* generate a frame vector block set of length m starting at t=1 */
                                                        /* if nearest-neighbor assignment is selected, then consider the t : t+m-1 time frames */
  Compute all possible combinations |U| of the elements of block set B_k.
                                                        /* |U| = r, taken over the m+1 time frames of B_k (m frames with the nearest-neighbor option);
                                                           r is the product of the numbers of elements of the N trajectories in block set B_k */
  FOR j = 1 : r
    FOR d = 2 : m                                       /* d = 2 : m-1 for the nearest-neighbor option */
      Calculate the smoothness S_{j,d} at every instance in the m-1 time frames for the r combinations.
    END
    Calculate the total smoothness S^r_total over the m-1 time frames for each of the r combinations.
  END
  Sort the total smoothness of each combination in descending order.
  While indexing, choose the highest total smoothness of N pairs with combinations having different elements in each frame vector at an instance.
                                                        /* a point P_{n,k} in a frame vector is assigned once to a trajectory at one time instance and cannot be reassigned elsewhere */
  Exchange points and assign the ^kC_N smooth trajectories in B_k as a subset of the final smoothed trajectories.
  Save ^kC_N.
  Increment t.
END                                                     /* when the last positive integer value of t is reached */
  Correlate similar end points of the ^kC_N smoothed trajectories in B_k with similar initial points of the ^{k+1}C_N smoothed trajectories in B_{k+1}.
                                                        /* based on this similarity measure, rearrange the order/label of the ^{k+1}C_N smoothed trajectories */
  Combine similar-label subset trajectories from the processed frame blocks and generate the final smoothed trajectories C_N.

Figure 6.5 shows the block diagram of the smoothness algorithm. Tracking algorithms sometimes lose correct object tracks at ambiguous intersecting points of trajectories. Optimizing object tracks using trajectory smoothness helps the system interpret correct trajectories in such ambiguous areas. Real-time implementation is possible, as the availability of ID from RFID can reduce the search space by up to 99%.

Figure 6.5 Block diagram of the smoothness algorithm.

To show the effectiveness of the smoothness algorithm we demonstrate step-by-step results of an example. The input data are displayed in Figure 6.6(a) and consist of a sequence of six time frames with three trajectories having 3D data at each point. Each point of the trajectories is symbolized by '■', '●' or '►'. For better visualization the z dimension of all the input data is fixed. Figure 6.6(b) shows the trajectory assignments after nearest neighbor linking. In the next step the exchange candidates are decided using total smoothness. For this example a block set B_k of length m=6 is used. Figure 6.6(c) shows the smoothed trajectories with '■' as label 1, '●' as label 2 and '►' as label 3. The algorithm took only 0.102 sec with no assignment error.

Figure 6.6 Step by step results of an example with smoothness algorithm applied: (a) input data (b) nearest neighbor assignment (c) smoothed trajectories.
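The per-point smoothness measure of Eq. (5.5) and the per-track total of Eq. (5.6) can be illustrated with a minimal sketch (Python/NumPy here; the thesis simulations were run in MATLAB). The function names and the default weight w = 0.5 are illustrative assumptions rather than the actual tracker code.

```python
import numpy as np

def point_smoothness(p_prev, p_cur, p_next, w=0.5):
    """Smoothness S_{n,t} of a 3D point (Eq. 5.5): a weighted sum of a
    direction-coherence term (cosine of the turn angle) and a speed-coherence term."""
    d1 = np.asarray(p_cur, dtype=float) - np.asarray(p_prev, dtype=float)   # D_{n,t-1}
    d2 = np.asarray(p_next, dtype=float) - np.asarray(p_cur, dtype=float)   # D_{n,t+1}
    n1, n2 = np.linalg.norm(d1), np.linalg.norm(d2)
    if n1 == 0 or n2 == 0:                   # degenerate (stationary) case
        return 1.0
    direction = np.dot(d1, d2) / (n1 * n2)              # direction coherence
    speed = 2.0 * np.sqrt(n1 * n2) / (n1 + n2)          # speed coherence
    return w * direction + (1.0 - w) * speed

def track_smoothness(track, w=0.5):
    """Total smoothness of one trajectory (Eq. 5.6): sum over t = 2 .. T-1."""
    return sum(point_smoothness(track[t - 1], track[t], track[t + 1], w)
               for t in range(1, len(track) - 1))

if __name__ == "__main__":
    straight = [(t, 2.0 * t, 0.0) for t in range(6)]     # smooth, constant velocity
    zigzag   = [(t, (-1) ** t, 0.0) for t in range(6)]   # abrupt direction changes
    print(track_smoothness(straight), ">", track_smoothness(zigzag))
```

A smooth constant-velocity path scores higher than a zigzag one, which is exactly the property the exchange step exploits when choosing among candidate point assignments.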
6.4 Summary discussion

We have proposed a three dimensional object tracking scheme using fusion of vision and RFID. Data integration and filtering are important tasks in tracking. We used discrete relaxation to control the integration of information from CV, RFID, and naïve physics. We have provided the theoretical and practical understanding of our proposed relaxation filtering technique, which is based on constraint satisfaction. The label elimination approach readily represents the ambiguity occurring in real-life applications. As a post-processing step to labeling we have used smoothness for optimization to update computed tracks for increasing tracking reliability. We have shown how fusion can greatly increase tracking performance.

_____________________________________________________________________________________________
CHAPTER 7 Experiments, results and analysis
_____________________________________________________________________________________________

To study the value of fused sensor information in localizing and tracking multiple objects we report here experiments, results and analysis. Based on the defined performance metrics, analysis for CV and RFID as single and fused modalities is reported. First, we have evaluated the use of stereo vision both indoors and outdoors for 3D accuracy and reliability of object location. Secondly, we have evaluated the commercially available RFID-based Real Time Location System (RTLS) for its accuracy and reliability of object detection and location. Finally, we have explored via simulations, and also with the real data, how fusion can reduce the combinatorics of tracking.

To validate the localization and tracking approach provided in Chapters 4 and 6 we provide in this chapter detailed experiments, results and analysis. For the stereo experiments we have used Logitech C210 fixed-focus commodity cameras. The focal length of these cameras is ~4 mm with a frame rate of 15 fps at 640×480 resolution. For the RFID-based experiments we have used the CSL RTLS kit for localization and tracking. The computing platform used to run the simulations is a Core i5 580M with 4 GB RAM. All the simulations were done using MATLAB® 2009a.

7.1 Generating stereo trajectories

To evaluate the use of stereo vision both indoors and outdoors for 3D accuracy and reliability of object location we generated three types of trajectories. The first type of stereo data was extracted from an indoor lab bench using stereo vision. The second type was generated artificially using mathematical curves, and the third type was the real stereo trajectories obtained indoors and outdoors. Simulations allowed us to construct interesting test cases and to control the ground truth; however, gaps between simulation and real time may occur with respect to assumptions about the sensor capabilities, the natural environment, and the tracked objects' attributes and dynamics.

7.1.1 Real indoor trajectories from wireframe workspace

To generate real trajectories indoors, we used 3D stereo. A colored sphere on a stick was moved by hand along a specified trajectory within a wireframe workspace of 27 × 31.75 × 24 in (68.6 × 80.6 × 61 cm). The structure was used for calibrating the cameras. The trajectory of the sphere yielded T records for object track L at times 1, 2, …, T. The experimenter then repeated this using the stereo system to generate more trajectories until there were N of them, one for each object moving in the workspace: L = 1, 2, 3, …, N. Each of these N sequences was an "object track". If we had N object tracks, then there were 2^N subsets of these to choose for study. We generated multiple tracks by varying the path and velocity through the workspace and also took care to create some near collisions.
Figure 7.1 shows a set of a few trajectories generated using our stereo setup.

Figure 7.1 Example of stereo trajectories generated from wireframe workspace indoors.

7.1.2 Mathematical trajectories

We created a dataset generator that can randomly create smooth object tracks with various speeds and densities without collision. We generated N smooth paths of T time frames each in 3D space using a helix structure, which was randomly spread out for a selected number of time frames using pseudorandom values, as shown in Figure 7.2. The circular helix of radius a and pitch 2πb in 3D space can be parameterized with Cartesian coordinates as follows (a minimal code sketch of such a generator is given after the parameter list below):

x(t) = a cos(t)
y(t) = a sin(t)
z(t) = b t

Figure 7.2 Mathematically generated trajectories using dataset generator with T=11 and N=7.

To meet the constraints in Section 3.4, the generated data has the following parameters by default:

a. Object tracks N=10.
b. Time frames T=11.
c. Smooth velocity vectors.
d. Unique trajectory directions.
e. No chance of collision.
f. Randomly spread out trajectories in a 3D space of 1 m³.
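As referenced above, here is a minimal sketch (Python/NumPy; the thesis used MATLAB) of a helix-based dataset generator in the spirit of Section 7.1.2. The phase and offset ranges, the scaling toward a roughly 1 m³ volume, and the seed handling are illustrative assumptions, and an explicit collision check would still be needed to guarantee parameter (e).

```python
import numpy as np

def generate_tracks(N=10, T=11, a=0.3, b=0.05, seed=0):
    """Generate N smooth 3D trajectories of T frames each, built on a
    circular helix x = a*cos(t), y = a*sin(t), z = b*t, then randomly
    shifted so the tracks are spread out within a small workspace."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 2.0 * np.pi, T)
    tracks = []
    for _ in range(N):
        phase = rng.uniform(0.0, 2.0 * np.pi)      # unique direction per track
        offset = rng.uniform(0.0, 0.4, size=3)     # random spread in the workspace
        x = a * np.cos(t + phase) + offset[0]
        y = a * np.sin(t + phase) + offset[1]
        z = b * t + offset[2]
        tracks.append(np.column_stack([x, y, z]))  # shape (T, 3)
    return tracks

if __name__ == "__main__":
    data = generate_tracks()
    print(len(data), data[0].shape)   # 10 tracks, each T x 3
```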
Below is a sample XML parent-child node structure generated by the RTLS. The parent node is accessed by the tag name. The child nodes under the parent tree then contains the desired location information for each time instance. EE4CBB6A6223 EE4CBB6A6223 - - 14.504 9.519 48 157 RFID Reader Tag location Figure 7.5 Outdoors 2D RTLS trajectories of a tagged person (green paths). 7.3 Metrics for evaluation of performance In this section we provide information on performance evaluation metrics used. We have evaluated fusion of CV and RFID as well as their effectiveness as single modalities for localization and tracking. Following are the details of the evaluation metrics used: 158 a. Location accuracy - The location accuracy is defined as the difference between location observations obtained by sensing and the corresponding ground truth data. We have primarily used statistical measures such as root mean square (RMS) error to express the location error in x,y and z direction. b. Least squares error - We have used least squares and polynomial fitting schemes for evaluating the error in the real data obtained in our experiments for which acquiring ground truth data was complex. We have also utilized a piecewise line fitting scheme to analyze the RTLS trajectories to have more meaningful results. c. Probability and percentage of observation availability - The RTLS readers’ communication with the tags varies from one place to another. Also the system refresh rate changes with different number of tags present in the test site. We have measured probability of tag location-signal availability in respective runs. This information is useful in analyzing the reduction in the combination search space of the tracking algorithm. For real-time tracking the object recognition and tracking needs to be automated. There is a fair chance of missed observations due to lack of object detection/identification, object exiting the field of view of the sensor, or occlusion caused by another object or background. We express missed observation quantity in terms of percentage. 159 d. Track error - We assessed performance of tracking using smoothness constraints in terms of track error. Track error is defined as a fraction of wrong trajectory point assignments by the tracking algorithm. At present we have assumed that the objects do not enter or exit; therefore a pair of wrong label assignments between points Pj,t and Pk,t (where j≠k ) is considered as one error. The final track error is then averaged over the number of simulation runs. Alternatively, track error can be more fairly defined in terms of point sensing tolerance; an object label assigned to a sensed point is considered correct if the sensed point is within measurement tolerance of the ground truth sensed point. This alternative does not penalize switching the labels of observations that are very close together in space. We assessed performance of tracking using track error for different burst/block set lengths ‘m’ defined in Section 6.3.2. Varying the burst length and the density of points over time affects the track error performance. Increasing the block length decreases the error but increases the number of combinations. Fusing ID information from another source such as RTLS helps reduce this combination space. e. Similarity of object color - For studying color based object detection outdoors we have correlated histograms of the segmented objects under different illumination conditions, i.e sun and shade. 
For comparing these we have used Euclidean distance to characterize color histogram variations. 160 7.4 Real-time tracking performance: indoor with RFID feed simulated using color The stereo approach was tested in the real time indoor environment while tracking two balls in the wireframe workspace of 27×31.75×24 in (68.6  80.6  61 cm). Since it is impossible for us to gather the number of cases needed using real data, RFID is simulated using same and different color (red and blue) -- in extracting trajectories, object location and ID are sometimes randomly provided to the tracking algorithm for some of the observations where possible. The tracking algorithm was applied to the observations to segment them into separate object trajectories. Later, to assess the performance real-time trajectories were then compared with ground truth data were available. The tracks were generated and displayed in parallel. We show that recognition of some objects during some time intervals can greatly speed up and make more reliable the organization of time frame information into the tracks of separate objects. For one of our tests the system took 58.6 sec to acquire 1000 frames from both cameras while executing and displaying the input and the output. The tracking algorithm had less than 1.4% missed observations with zero track error when both the colored balls were moving and being tracked. It was established that the proposed approach has an ability to track the object while generating its tracks on the fly. 7.5 Indoor stereo live demo results To analyze the stereo live demo results visually we generated trajectories that can be interpreted easily. Details of the stereo tracking live demo are provided below and shown in Figure 7.6. 161 a. Blue object kept stationary while red object is moving. The red cursive writing sample in the yz-plane while a red smiling face in the xz-plane is the red ball trajectory. The blue object was stationary and is shown by blue dots. b. Blue and red object moving. The heart shapes in blue and red are the respective trajectories of blue and red balls being tracked. The shapes were made in the yz-plane and its view from the xz-plane is shown for better understanding. 162 Figure 7.6 3D stereo tracking in wireframe workspace indoors - live demo. Next the system was tested at room scale and real-time trajectories were generated in the lab 280  476  105 in (7.1  12.1  2.7 m) using one (orange) and two color (orange and yellow) combination. As above for better visual analysis Figure 7.7(a) and (b) shows cursive text and heart shape trajectories using single and two colors tracking respectively. Figure 7.7(c) shows the 163 two color detection where yellow color is used as an initializing marker for tracking orange. The image also shows 2D bounding boxes over colored hats. (a) (b) (c) Figure 7.7 3D stereo tracking in lab area indoors - live demo: (a) orange cursive writing sample (b) heart shapes using orange and yellow color (c) yellow used as initializing marker to track orange. 7.6 Stereo error analysis in x,y,z dimensions versus distance from the cameras Slight inaccuracies in selecting the calibration points in the image generates error in the camera matrix while error in image point location of objects will yield error in the imaging rays 164 used in stereo computations. Also factors such as lens distortion, lighting variation, digitization noise, and object surface variation contribute to image point error and hence the stereo error. 
The stereo error in the x, y and z dimensions is observed by computing the 3D location of six selected ground truth points not chosen for the calibration procedure. The 3D points were selected on the basis of their distances from the cameras. The run was repeated eight times. Figure 7.8 shows the selected 2D corresponding points in the image pair.

Figure 7.8 Selected 2D corresponding points in left and right camera image for analyzing outdoor stereo error.

We have used RMS error to represent the stereo location accuracy in all three dimensions. Table 7.1 tabulates the 3D ground truth and the 3D computed data from eight runs.

Table 7.1 3D ground truth and computed data for analyzing outdoor stereo error - scale is in inches.

      Ground truth   Computed 3D data from eight runs                                 RMS error
X1     242      237.9   237.9   238.6   237.1   237.9   237.9   237.9   238.6          4.06
Y1      60       61.2    61.2    61.5    61.2    61.2    61.2    61.2    61.5          0.17
Z1      36       31.6    31.3    31.3    31.7    31.6    31.6    31.3    31.3          0.19
X2     357      355.5   357.1   357.1   356.9   356.9   357.1   358.5   358.5          0.92
Y2      60       60      60      60      60.5    60.5    60      60.5    60.5          0.33
Z2      36       36.7    36.3    36.6    36.6    36.6    36.6    36.6    36.6          0.59
X3     429      434     433.9   433.9   431.7   436     433.7   431.7   433.8          4.77
Y3      72       73.6    73.6    73.6    73.6    74.2    74.2    73.6    74.2          1.84
Z3      41       43.5    43.9    43.7    44      43.5    43.5    43.5    43.3          2.63
X4     540      542.2   542.1   538.9   545.3   542.2   545.4   542.1   538.8          3.13
Y4       0        1       1.8     0.6     1.4     1       1.4     1       1.4          1.27
Z4       0        2.4     2.5     2.9     2.4     2.4     2.2     2.1     2.4          2.43
X5     953      960.4   965.4   951     960.8   960     959.6   951     960.4          7.28
Y5       0        1.6     1.5     1.5     1.5     0.9     0.9     1.5     1.5          1.39
Z5       0        2       3       3.9     2.6     2.9     2       3.9     3            2.99
X6    1190     1210    1194.8  1210.5  1194.9  1195.6  1196.1  1195.6  1194.6         11.09
Y6       0        3.4     2.5     3.4     2.5     2.5     4.2     2.5     4.2          3.25
Z6       0        4.8     6.5     4.3     6       4.8     3.2     4.8     6.1          5.17

The RMS error in all three dimensions was within 7.6 in (19.3 cm) for 3D points 20 ft to 80 ft from the cameras. The RMS error behaves nearly linearly with distance from the cameras, as shown graphically in Figure 7.9, and supports the error cone concept. The magnitude of the error in the x dimension is greater than in the y and z dimensions. The x dimension here is in line with the camera viewing direction.

Figure 7.9 Stereo RMS error in x, y and z direction versus distance from the camera.
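As a small illustration of the RMS figures reported in Table 7.1, the following sketch (Python/NumPy) computes the RMS error of repeated measurements against a ground-truth coordinate; the function name is an assumption, but the sample values are the X1 row of Table 7.1 and reproduce its tabulated 4.06 in entry up to rounding.

```python
import numpy as np

def rms_error(ground_truth, measurements):
    """Root-mean-square error of repeated measurements of one coordinate."""
    m = np.asarray(measurements, dtype=float)
    return float(np.sqrt(np.mean((m - float(ground_truth)) ** 2)))

if __name__ == "__main__":
    # X1 row of Table 7.1 (inches): ground truth and eight computed values.
    gt = 242.0
    runs = [237.9, 237.9, 238.6, 237.1, 237.9, 237.9, 237.9, 238.6]
    print(f"RMS error: {rms_error(gt, runs):.2f} in")   # ~4.06 in
```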
Figure 7.12 Computed ball projectile trajectory (solid blue) with parabolic fitting (dotted magenta). We use three position estimates here that includes initial estimated thrower position, ball position in frame 3, ball position while touching the ground. These points can define the projectile trajectory of the ball. Using a three parameter model the parabolic curve can be fitted to these three points as shown by solid (green) line in Figure 7.13. 170 Figure 7.13 Actual trajectory (dashed blue) with linearly increasing error along y-axis and corrected trajectory (solid green) using parabolic curve fitting. It is noted that the ball's trajectory has gradual increase in stereo error (right to left) along the y-axis away from the cameras and beyond the site center (~84 ft from cameras). Table 7.2 shows comparison of the observed and corrected [after curve fitting] observations. This started decreasing as the projectile crossed the site center towards the cameras. Modeling error estimates along y-axis with corrected curve fitting can help calculate intermediate trajectory points. 171 Table 7.2 Ball projectile observed using stereo and fitted data along yz-plane - scale is in inches. Point Yobserved Zobserved Yfitting Zfitting Ydiff Zdiff 1 2 4 -181.6 -79.7 348.4 76.7 155.9 93.9 -166.3 -67.1 345.8 71.5 149.4 90.2 15.3 12.6 2.6 5.2 6.5 3.7 The problem of increasing stereo error was explained here to give an idea about the upper bound using Logitech C210 cameras. This can be addressed in practice by using multiple cameras in the test site and naive physics constraints. The combined 3D observations from two or more stereo pairs can be smartly combined to cover an overlapping area. For example as soon as an object trajectory enters the 70+ ft range from one stereo pair then another set of stereo pairs can be engaged for which the object is within the desired tracking range. To analyze RTLS location performance we consider the tagged person trajectory (without stopping) given in Figure 7.5. Piecewise line fitting on the tag trajectory is shown in Figure 7.14. To find the least squares error, we have evaluated the sum of the squares of the differences between the line fit and the tag trajectory. The least squares error here is 22.1 in. 172 Figure 7.14 Piecewise line fitting (solid green) on RTLS tag trajectory (dotted blue). Next we compared this RTLS tag linear trajectory [dashed green] with the ground truth [dotted red] for checking location accuracy. For reader understanding, the location error circles are drawn at every instant around the ground truth observations as shown in Figure 7.15. The radius of the circle represents the spatial difference between the ground truth and RTLS location observations. The tag observations are represented by '■' [green] and that of ground truth by '♦' [111]. 173 Figure 7.15 RTLS tag location error circles. As shown in Table 7.3 the location error increases as the tag moves towards the arch structure. This is likely due to the multipath effect as our test site represents a semi-indoor environment. Similarly the location error decreases as the tag moves away. The RMS location error for the tag trajectory is 80.7 in (2.05 m). Table 7.3 Location error for RTLS tag trajectory in Figure 7.1 - scale is in inches. 1 2 3 4 5 6 7 8 29.4 43.1 11.9 91.5 113.8 110.6 92.9 123.3 9 10 11 12 13 14 RMS 21.4 17.9 65.6 80.7 102.2 94.6 80.7 174 7.8 RTLS signal availability Having RFID feed available significantly reduces CV tasks for tagged object detection and identification. 
The RFID feed availability depends on the refresh cycle and the number of tags. Also, missing tag information is another key factor to be considered. Figure 7.16(a) shows the RTLS real trajectory of a single dynamic tag obtained when there were three other tags actively transmitting and present in the test site. The trajectory was designed to cover most of the test area. The RFID signal for a single tag was available on average after every 2.5 sec. Note that in Figure 7.16 there are some visible gaps between two consecutive tag locations, which represent missing observations. Missed observation means non-availability of location information when the RTLS signal is expected to be there. This can occur due to tag orientation, miss reads by the reader, or some direct occlusions which resulted in read failure. For quantitative analysis there were ~25 missed observations in addition to 136 times signal was availabe i.e ( 25 ×100 ) 15.5% missed observations. 136 + 25 175 Figure 7.16 RTLS trajectory analysis: (a) RTLS single tag trajectory (green path) (b) RTLS tag location signal availability over time. 7.9 Simulations of object tracking Prior to working on real fused outdoor data, we performed many simulations in order to assess how effective RFID labels could be in tracking under smoothness constraints – using observations of location but not color. For simulations we used ten subsets of real indoor stereo trajectories explained in Section 7.1.1. N observations over the time steps 1...T were selected and presented to our tracking algorithm to see what tracks would be aggregated using the naïve 176 physics constraints. Smoothness of trajectories requires a burst of time frames for reliable results. Below, we have used burst length of four time frames to compute track smoothness, curvature, and acceleration. If we consider n objects and a burst of m time frames then the number of possible paths will m be (n) . Assume that T is divisible by m. If there is no ID information available then the number m of combinations for T time frames will be (T/m)×(n) . Depending upon the probability P for an observation ID being available for the burst, the combination volume may be reduced accordingly. It is considered that the ID when present is available for the whole burst. For example n=3, m=4 and T = 60 the total number of track combinations will be 1215. Different frequency of ID availability across bursts will have different impact in reduction of combinations. As shown in Figure 7.17, for P=0.267, ID availability settings represented by a solid red line reduces the possible combinations to 435; however for P=0.267 the dotted blue line setting reduces it to 631. This shows that for the same probability, an object ID can be available in many configurations. In this case 435 also explains the upper bound in combination reduction with the lower bound ranging to 891. Therefore, the more the RFID signal availability is spread across time, more volume reduction is possible. 177 Figure 7.17 Reduction in combination volume - with probability of random ID information availability. The simulations were done using MATLAB®2009 on a Core i5 M580 2.67 GHz platform. Simulations were conducted for N=5, 6 and 10 object tracks and T=60 time steps. Using probability P, the ground truth ID was provided in the token. Figure 7.18 shows results for possible reduction in combination volume with increase in probability P of object ID being in the token. 
Computation time is also shown at marked places to realize the reduction in volume. With respect to our outdoor experiments, the probability P represents the time percentage for which the RFID feed for a tag was available for each object. The algorithm was run with frame burst length, m=4. ID was assumed to be randomly available [across bursts] over time steps T. Figure 7.18 shows that while tracking ten objects the combination volume can be decreased up to 99.9% with the partial ID feed thereby significantly reducing computation time. The effect of having some ID in the tokens increases as the number of object tracks N increases. Also, object location and ID info increase the accuracy of calculated trajectories. This data shows the difficulty faced by tracking algorithms that only use motion of image points to aggregate object tracks. Without any object ID, quantifying motion over several time steps leads to too many possible tracks. Although color, shape and texture features can be used by a passive CV system, the reliability of 178 unique labels from RFID can yield correct tracks with far less computation. Thus we were motivated to implement an actual Site Safety System using fusion of CV and RFID. Figure 7.18 Possible combination volume with N objects and probability P of object ID in bursts of four tokens. Table 7.4 shows the behavior of the algorithm in terms of track error performance. The experiments were conducted while randomizing observations of three real time 3D trajectories acquired from the stereo system. The simulations were run twenty times with different burst length m. Different values of time frames T were used. The outputs were then averaged to generate the results. The results were also compared to the ground truth. It is clear from Table 7.4 that varying m and density of points over time affects the track error performance. Since the 179 indoor stereo system readings generate an error of up to 0.34 in (~8.64 mm), this error tolerance can be used in comparison to ground truth in determining track error. The last column of Table 7.4 shows the results with error tolerance applied while using m=6. Increasing m decreases the error, however, the number of possible combinations also increases so it affects computation time. These combinations as explained above can be reduced if we have partial knowledge of the trajectory points. Table 7.4 Track error with different points density and block length m. Track error is a fraction of wrong trajectory point assignments. T m=3 m=4 m=5 m=6 m=6 w/ error tolerance 10+ 20+ 30+ 40+ 50+ 60+ 0.121 0.196 0.239 0.251 0.262 0.266 0.095 0.096 0.103 0.134 0.159 0.167 0.043 0.054 0.067 0.083 0.091 0.098 0.027 0.042 0.058 0.066 0.071 0.075 0.0 0.004 0.021 0.023 0.036 0.038 7.10 Tracking efficiency using fusion We have explored via simulations and real scenarios how fusion can reduce the combinatorics of tracking. These cases can help reveal our fusion approach behavior. Although we have acquired both RFID and CV data from an outdoor site, it is impossible to explore the many possible parameterizations using real data. Moreover, simulations allow us to construct interesting test cases and to control their ground truth. 180 7.10.1 Simulated scenario: two persons and two briefcases We consider two persons who walk toward each other, exchange brief cases, and then move to different final positions. 
Due to smoothness constraints, the geometric data will produce incorrect tracks with the persons continuing with their original briefcases to different final positions. However, reliable location and ID of either person, from RFID or CV, enables the correct interpretation to be extracted. This case is simulated by generating two trajectories (N=2 for T=70 time frames) with ID information randomly provided and with RFID simulated using color. The point of intersection occurs at frame t=43. The mean velocity of both object tracks is kept the same. As shown in Figure 7.19(a), using the smoothness criteria alone with no labeling information produced wrong trajectories with a track error of 0.39. However, once ID labels with location information are provided near the intersection, the tracking algorithm interprets the correct object tracks as shown in Figure 7.19(b). Therefore, to avoid ambiguities in the vicinity of collision points, some localization and object ID information is necessary outside the area of collision.

Figure 7.19 Testing fusion using simulated scenario of two persons exchanging briefcases: (a) wrong interpretation of trajectories with CV alone (b) correct interpretation of trajectories with CV & RFID fusion.

7.10.2 Real outdoor scenario

We generated a scenario to track the activity in the test site outdoors. The cameras were placed on tripods at a height of 9 ft with a baseline of 10 ft. The center of the test site was 84 ft from the cameras. The area of activity was 47 ft to 93 ft from the cameras. The persons wore bright colored clothing, which was helpful for generating a good quality feature vector. Four RTLS readers (one master and three slaves) were installed in the test site of 40×40 m to generate a single cell. Two tagged persons wearing distinctively colored clothing and helmets slowly move forward towards each other and meet at the center of the test site. They exchange RFID tagged colored balls and then backtrack to their starting positions. All the movement was done on predefined paths to compare the acquired data with the ground truth trajectories. The run was carefully conducted so that the exchange interaction over some of the time frames is either fully or partially not visible to the cameras. If a ball's feature vector does not contain color, then CV detection using shape/size might assign wrong 3D labels to the ball's track. This could make it appear as if the persons took the balls back towards their starting positions. Even if the color information were available, there still would be uncertainty or loss of trajectory points because the interaction was occluded from the cameras at the center of the test site. This might again result in wrong labeling or lost tracks. Additionally, in the case of person tracking, the smoothness constraints would fail to provide correct trajectories (see Section 3.4 for details). Figure 7.20 shows the outdoor arrangement as well as the 3D map of the site with the computed real trajectories.

Figure 7.20 3D view of test site with calculated real trajectories.

The experimental runs were conducted while tags were placed on the objects or in the pockets of the persons. The reported experiment was conducted on a partly cloudy day with considerable variation in illumination; moreover, 30 mph gusts of wind typically shook the cameras and RFID readers at some point during each data collection trial.
We also placed some reference tags in the area during the experiments. We observed that the location accuracy of the stationary tags, when visible to all four readers, was within 1.5 m. This ranged up to 4.3 m at some points where the tags were visible to only two readers. The test site was selected such that it has some indoor properties -- brick structures and trees -- that cause obstruction and generate a multipath effect. These obstacles also provided occlusion for CV, which helped our study. We also note that the RFID system software initially lost track of all the tags due to the operating system protection scheme intended to counter hacking attacks. Once registered, the system on average kept track of five RFID tags 79% of the time. Figure 7.21 shows the left camera view of the test site and the correct 3D trajectories using fusion.

Figure 7.21 Outdoor scenario to test fusion: (a) left camera view of test site (b) computed correct 3D ball trajectories using fusion.

For this scenario we assessed the performance of tracking using track error. To sync the video frames and the RFID refresh cycle, the test was conducted by tracking the balls and the persons with thirty non-consecutive time frames (i.e., T=30) approximately three seconds apart and a burst length of four (i.e., m=4). While person 1 was holding ball 1, the stereo track error for the ball 1 and person 1 trajectory was 0.06. For the time person 2 was holding ball 2, the stereo track error for the ball 2 and person 2 trajectory was 0.53. The track error for ball 2 and person 2 was much larger due to fewer distant camera calibration points in the image, which resulted in larger 3D stereo error. Also, due to strong winds the camera positions were not stable, which likely increased stereo error. The linear increase of stereo error with distance is another reason. Therefore the trajectories beyond the site center point (84 ft from the cameras) had larger stereo location error, which subsequently resulted in greater track error. The greater stereo error, however, generated favorable conditions to investigate fusion efficacy. Due to occlusion, the vision system often lost track of the balls at the center of the test site (84 ft from cameras) while the persons were exchanging the balls. The tracking algorithm assigned correct labels to both person trajectories up to the site center. Thereafter the labels were assigned incorrectly due to the change in direction of the persons as they backtracked to their starting positions. The interaction at the test site center was clearly detected by the readers since the tags were directly visible to all four readers in that area. The RFID location information, when fused around these points, helped reassign correct tokens, resulting in correct labels and trajectories for both persons and balls. When the RFID location information was applied, the computation time in this area was reduced due to the reduction in the combination space. Also, the track error for the initial trajectory of ball 2 and person 2 decreased from 0.53 to 0.13. Discarding the outliers, the RFID location accuracy for dynamic tags averaged 2.6 m, which is comparable to that reported by the OEM [21]. To achieve this location performance it is noted that the tags should be in the cell area where the antenna beams of at least three readers intersect without any solid obstruction.
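For reference, the track error metric quoted above (and in Table 7.4) can be expressed as a short MATLAB function; this is our own illustrative sketch with hypothetical argument names, not the evaluation code used in the experiments.

```matlab
% Illustrative track error: the fraction of wrongly assigned trajectory
% points for one object track (hypothetical sketch, not the actual code).
% computedPts, truePts : 3-by-T matrices of computed and ground truth 3D
%                        points for the same time frames
% tol                  : allowed location tolerance, e.g. the stereo RMS
%                        error (set tol = 0 for no tolerance)
function err = trackError(computedPts, truePts, tol)
    dist  = sqrt(sum((computedPts - truePts).^2, 1));  % per-frame distance
    wrong = dist > tol;                                % wrongly assigned points
    err   = sum(wrong) / size(truePts, 2);
end
```

Setting tol to the stereo RMS error (0.34 in indoors, 7.6 in outdoors) mirrors the tolerance-based comparison to ground truth described above.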
7.10.3 Outdoor scenarios with varying fusion information

To analyze further, we generated other ground truth trajectories with varying sensor information in different configurations. The persons are assumed to be moving with the same mean velocities. We generated two other cases where two tagged persons start from opposite directions towards each other and:
a. Keep going without any direction change.
b. Split to the sides at the center of the test site as shown in Figure 7.22.

Figure 7.22 Outdoor RTLS trajectories of two tagged persons with varying fusion information - persons split on sides at the center of test site.

For site safety and security, we can assume that a high percentage of objects are tagged and cooperative. We have shown how proper sensor placement can support tracking. This enables the system to spend more resources on tracking anomalies -- "unauthorized" animals, machines, or materials. The location accuracy and compute time of a central algorithm, although good, are insufficient to handle collision avoidance, so we recommend that moving objects use local collision avoidance -- perhaps based on looming (CV) or locally shared kinematics. Fundamental tests on looming detection are reported in Appendix A. Cell phone and sensor network technology are advancing rapidly and probably will soon provide such functions [104].

7.11 Object color variations

Figure 7.23 shows four color histograms of segmented objects extracted from the left camera video feed under different outdoor conditions at different time instances. The histograms were computed using the HSV color space. There are two objects, a blue ball and a yellow ball, and two different illumination conditions, sun and shade. The irregular outdoor illumination variations and abrupt changes of brightness are evident in Figure 7.23(a) and (b). If color is to be used by CV to help tag and distinguish objects, then the objects must be for the most part distinguishable in the video images. In many cases workers will be wearing hard hats or vests of special coloring. The S-3 should be able to take advantage of these distinctive colors by exploiting color consistency for reliable color clustering. Despite the variation in illumination, the histograms (in sun and shade) of the balls (blue or yellow) in Figure 7.23 show noticeable association within each color group. The experiments reported in the previous sections of this chapter did not use automatic color similarity computations to distinguish the class of object color: instead, a symbolic color was assigned to the token.

Figure 7.23 Change in illumination observed in left camera video feed when: (a) sunny (b) shady. Color histograms: (c) blue ball in sunlight, (d) blue ball in shade, (e) yellow ball in sunlight, (f) yellow ball in shade.

We collected various samples and analyzed HSV color space consistency for the blue and yellow balls in different weather (winter and summer) and illumination (sun and shade) conditions as shown in Figure 7.24.

Figure 7.24 Sample images for different weather and illumination conditions to study blue and yellow color consistency.

The results shown in Figure 7.25 demonstrate the color consistency of the blue and yellow balls for reliable color clustering. The points shown are the average pixel values of the ball area taken from different frames. The yellow marker 'o' represents the yellow ball HSV value and the blue marker '+' represents the blue ball HSV value.
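The per-frame average HSV values plotted in Figure 7.25 can be computed along the following lines; this is a minimal sketch of our own, with a hypothetical file name and a placeholder segmentation mask standing in for the actual ball segmentation.

```matlab
% Illustrative computation of the average HSV value of a segmented ball
% region in one frame (hypothetical file name and mask, not the actual code).
rgb = double(imread('leftFrame.png')) / 255;   % left camera frame
hsv = rgb2hsv(rgb);                            % convert to HSV color space

% Placeholder mask of the ball area (in practice this comes from the
% color/blob segmentation used in the experiments).
mask = hsv(:,:,2) > 0.4 & hsv(:,:,3) > 0.3;

h = hsv(:,:,1); s = hsv(:,:,2); v = hsv(:,:,3);
meanHSV = [mean(h(mask)), mean(s(mask)), mean(v(mask))];  % one point in Fig. 7.25
```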
The color clusters are clearly separated along the hue axis, which supports the usefulness of CV in helping to distinguish objects based on color in an outdoor environment.

Figure 7.25 Analyzing blue and yellow ball color consistency in HSV color space under different weather and illumination conditions.

7.12 Sensor error and synchronization problems

In an outdoor environment we evaluated positional accuracy and reliability for the RFID based Real Time Location System (RTLS) and for the stereo computation obtained using commodity cameras. The calibration of the stereo system and the RTLS system was done on the same test site. An adequate number of distant points (in the background) and nearby points (in the foreground) were acquired to serve as calibration markers for the stereo computation. The stereo infrastructure provided RMS positional accuracy within 7.6 in (19.3 cm) for the x, y and z directions. The reported location accuracy for the RTLS system is ~1.5 m for static tags and ~2.6 m for dynamic tags. The RMS error does not include the occasional outliers that are possible from incorrect stereo correspondence or multipath effects in RFID. The RTLS location accuracy, however, can be increased by deploying additional readers in the test site.

One significant practical problem for fusion is the different sampling rates of the sensors, or the extended time needed to smooth data or to make decisions about the motion of an object. Due to time division multiplexing, our RFID system provides data on all objects every two to three seconds, while our stereo implementation could produce ten updates per second for a few objects. In our experiments, we typically force a common sampling time for RFID and CV and look back two time samples to estimate motion. The uncertainty of location for RFID is much larger than for CV for static objects, and even larger for moving objects due to under-sampling. Interpolation using CV locations can be used with sparse RFID samples having reliable ID – another benefit of fusion. Finally, it is possible that an object is invisible at some time steps to either or both CV and RFID due to occlusion, and higher level processes are left to interpret what is happening.

7.13 Summary discussion

In this chapter we have reported experiments and results that evaluate the use of stereo vision and the commercially available RTLS system in single and fused modes. To test the 3D accuracy and reliability of object location using stereo, we generated three types of trajectories: mathematical, real indoor, and real outdoor. We acquired ground truth data for various predefined points in the surveyed test site with which the RTLS and stereo data can be compared. We have discussed the performance metric criteria used to evaluate localization and tracking schemes. We assessed the performance of tracking using smoothness constraints in terms of track error. The stereo approach, with RFID simulated using color, was tested in real time on an indoor lab bench. The tracking algorithm there had less than 1.4% missed observations with zero track error. For visual analysis we showed trajectories both in the wireframe volume and in the lab area. We showed that the stereo error grows approximately linearly as the distance between the observed object and the stereo setup increases. Least squares analysis was also performed to assess the error in location using vision and RFID. The stereo provided accuracy within 7.6 in (19.3 cm), while for RTLS it varied between ~2 m and ~2.6 m for moving tagged objects.
To achieve this location performance it is noted that the tags should be in the RTLS cell area where the antenna beams of at least three readers intersect without any solid obstruction. With a refresh rate of two to three seconds, the RTLS hardware provided 79% to 84.5% signal availability, which can significantly reduce the combination space. Also, in the fused system, the availability of RFID information at ambiguous instants in tracking could reduce runtime by up to 99%. The likelihood of producing correct object trajectories in regions partially or fully occluded to CV is also increased. Lastly, we have studied object color variation in different illumination conditions and found noticeable association within each object color group that can help object detection outdoors. We have shown how fusion can improve identification, localization and tracking results while also reducing computational cost. In general, fusion of RFID and CV is better than using either mode alone and, where costs are justified, will produce systems that are better than those using only one modality.

_____________________________________________________________________________________________

CHAPTER 8 Conclusions and future work

_____________________________________________________________________________________________

In this dissertation we have presented a generalized framework for the fusion of Computer Vision (CV) and Radio Frequency Identification (RFID) that can produce more accurate object localization and tracking in a three dimensional space and do so using more efficient computation. The important components of a fused system have been implemented and tested, and the results obtained support the premise that fusion can improve performance in various applications over the use of RFID or CV alone. The basic rationale is that RFID can provide highly reliable unique object identification, although with coarse object location, while CV can provide more accurate object location along with confirming visual features and can also help avoid cloning of tags and decrease counterfeiting. Below we provide the concluding discussion highlighting our contributions and the research expansions that are possible as future work.

8.1 Background Survey

Research and development using fusion of RFID and computer vision has been thriving over the past decade. Almost all work has been done in indoor environments. During our background research we have presented a collection of these schemes and have related the research to actual applications and installations. Dozens of publications were found that directly used the fusion approach. The work has generally been from the areas of recognition, localization, and tracking, where RFID was mostly used at the initial stages of object detection and/or identification. Moreover, a few dozen more publications in either RFID or CV showed clear potential for improvement via fusion. Most reported work was done in an indoor controlled environment at small scales and using passive tags, which require close read range. We also learned that RFID can be very useful outdoors in a number of applications, such as construction site safety, when tagging is possible. Moreover, formal linkage of RFID based RTLS with vision is a new and expanding research area with great potential. All these factors gave us motivation towards exploring these modalities for real time localization and tracking.
We believe that the survey of fusion approaches that we have provided in this dissertation will offer great support to researchers interested in this area.

8.2 Evaluation of RFID, CV, and fused sensing

For RFID we have used a commercially available Real Time Location System [8], and we developed our own stereo system with a laptop, MATLAB, and two commodity color cameras. Our total hardware cost was only about US$5500 for both the RTLS and a single stereo pair of cameras. High level performance would require more cameras and more RFID readers in the workspace than we have used. We have defined our performance metrics and have done error analysis for location estimation by these modalities. We have evaluated the use of stereo vision as a single modality both indoors and outdoors for 3D accuracy and reliability of object location. We have studied and implemented the ray-ray stereo scheme [31] for 3D localization in an indoor environment and have reported an RMS accuracy of ~0.34 in within a wireframe workspace and ~6.4 in (16.3 cm) at room level. The average RTLS location accuracy indoors for static tags was ~1.9 m. The performance for dynamic tags varied considerably due to multipath effects in a compact space. However, using optimization tools, RTLS accuracy indoors can be improved.

8.2.1 Demonstrated performance, potential and parameters for outdoor applications

We analyzed the efficacy of the stereo approach outdoors in a surveyed test site. In our analysis we obtained RMS location accuracy within ~7.6 in (19.3 cm) in x, y, and z for trajectories within a range of 30 ft to 70 ft from the cameras in a workspace that is 40×40 m. Choosing appropriate calibration points covering the near and far field of the scene is necessary for results with minimal error. This offers potential for future researchers to examine automated three dimensional tracking outdoors using economical vision sensors. We established that the location accuracy for RFID in the same outdoor test area was ~1.5 m in x and y ground coordinates for static objects, but ~2 m to ~2.6 m for dynamic objects, which we attribute to the location update frequency, i.e., one RTLS observation approximately every two seconds. Deploying more readers in the area can substantially improve the location accuracy for moving objects. The z dimension, though available, comes with a planar restriction defined by the tag height. Obtaining interpolated location estimates and handling varying tag heights is, however, not limited in theory. We have shown with simulations and real data that fused sensing increases the likelihood of producing correct object trajectories in regions partially or fully occluded to CV. As the stereo system readings generate an error of up to 7.6 in (19.3 cm), this error tolerance can be helpful in the comparison to ground truth when determining track error. We have also studied object color variation in different illumination conditions and found noticeable association within each object color group that can help object detection outdoors. In general we have shown how proper sensor placement can support localization and tracking. This enables the system to spend more resources on tracking anomalies -- "unauthorized" objects, machines, or materials in a site safety environment.

8.3 Modeling fusion and its benefits

We have presented our fusion model and have described the competitive and cooperative relationship of our sensors and the fusion building blocks needed to develop the fused system.
The features and characteristics of our fusion scheme are also provided to explain its adaptability to established fusion standards. With a focus on localization and tracking applications, we have presented the potential benefits achievable from the fused system and have documented examples where fusion disambiguates object tracks and combines the strengths of RFID and vision to avoid the problems of each mode.

8.4 Data integration and filtering using relaxation labeling

To manage the complexity and integration of the diverse information being fused, and to provide a flexible experimental platform, we proposed and demonstrated an algorithm based on discrete relaxation. Discrete relaxation was chosen to control tracking so that we could easily experiment by switching sources of information on or off and develop our software in a modular way. Moreover, the label elimination approach easily represents the ambiguity occurring in real-life applications. If there are N objects and N labels, the computational complexity of tracking is potentially of the order N^2 across just two time steps. The key to reducing the computational requirements by using relaxation is to eliminate many labels at each filtering step while keeping those labels compatible with observation. The output labels from the relaxation labeling process can be optimized further by the post processing operation.

8.5 3D object tracking algorithm

We have proposed a three dimensional object tracking scheme using fusion of vision and RFID. As explained above, relaxation was used for filtering out incompatible labels in the tracking algorithm. As a post processing step to relaxation labeling, we have used total track smoothness for optimization to update computed tracks, increasing system tracking reliability. We assessed the performance of tracking using smoothness constraints in terms of track error. With simulations we have shown how fusion can greatly increase tracking performance while also reducing computational cost and the combination search space by up to 99% in some cases. Test cases show how fusion can solve some difficult tracking problems outdoors. For some object trajectories outdoors, the fused system reduced the track error from 0.53 to 0.13. We have demonstrated cases where fusion disambiguates object tracks, and we have also given cases where disambiguation is impossible, as in the well-known shell game. We demonstrated how uncooperative objects can cheat the system. However, in general, fusion of RFID and CV is better than using either mode alone and, where costs are justified, will produce systems that are better than those using only one modality. Moreover, an automatic system detects the ambiguities and can cue the attention of higher level or longer lived processes, including the attention of human security personnel. Simulations of tracking over many ground truth paths demonstrate how knowledge of unique object ID for some time instances can significantly improve correct tracking as well as reduce the computation time needed to produce the tracks. Thus, many more objects can be tracked in practice if fused sensing is available compared to tracking by CV alone. A fast tracking implementation would be active – it could plan more efficient work, warn of possible collisions, or detect illegal operations. Finally, it is clear that the global workspace view we have used is too imprecise for detailed object interactions, such as cooperation compared to collision, or handing off carried objects.
Object-borne touch or looming sensors would be needed for some applications. Our current work shows that pursuit of these extensions should be fruitful.

8.6 Future work and limitations

One significant problem in fusing the RFID and CV feeds is the difference in sensing frequency. Commodity cameras are designed to represent human motion well and produce upwards of ten video images per second, whereas our RTLS system produced tokens for all tags at approximately two second intervals. Engineering faster RFID updates will likely reduce the number of objects that can be sensed; however, this should be a favorable tradeoff in a construction site. It may also be good design to have a hierarchy of RFID sensing, with a slow system for asset/material inventory and a fast system for critical objects such as workers and moving machinery.

We need to continue to develop our system to perform the lower level token combination and to test it fully using a set of objects with some typical behavior. We will also make the revisions that model objects that appear in and disappear from the surveyed workspace. Also, the constraints and heuristics used in the tracking algorithm should be further studied and improved. There are many knowledge based constraints that we have not yet applied. Much of what has been discussed assumed objects were single independently tracked points. Clearly, some objects would be a rigid aggregate of points. For example, a truck might have a single RFID tag and perhaps four or eight visual markers that would reduce combinatorics and enable rigid motion analysis. Such planar rigid structures and symmetries are also helpful for tracking moving objects over wide variations in position and orientation.

Considering that most construction sites involve collaborative work, interactions between workers, and between workers and machines, will happen frequently. The interactions will introduce static or dynamic occlusion, which causes difficulties for visual tracking. In the case of short-duration, partial object-object and background-object occlusions, the vision system should continue to track the object. RFID can help CV here, but it is also important to deal with the rare cases in which an occlusion coincides with an RFID feed failure. Knowledge based information can be used to inform the tracking processes to handle occlusion. Also, by tracking each object, it is possible to use global 3D information to tackle occlusion with predictive trajectories in our optimization process. The occlusion management framework to be worked on is shown in Figure 8.1.

Figure 8.1 Occlusion management flow during stereo tracking.

The location accuracy and computing time of a central algorithm, although good, are insufficient to handle collision avoidance, so there is a need for local object-to-object communication. We recommend that moving objects use local collision avoidance -- perhaps based on looming (CV) or locally shared kinematics. Cell phone and sensor network technology are advancing rapidly and probably will soon provide such functions [104]. As part of our future work we have presented some of the fundamental experiments on looming in Appendix A. The results we have presented apply to tracking in either 3D space or in a 2D image of that space. Our conceptual model is ahead of our current implementation, and thus provides a design for stages of future improvements.
Incorporating object track initiation and termination, more cameras and readers, more constraints, and more sensed features, such as object velocity, are on our list for future work.

Historically, most security and surveillance systems have used video input fed to human monitors. Automated methods in CV have been developed to replace or augment the human recognition duties with varying success. In controlled areas, such as airports, hospitals, workplaces, and construction sites, many objects can be tagged, including cooperative humans. Thus RFID based detection and location can be available for integration with the video data. Tracking of tagged objects using RFID can drastically reduce the computational load of a vision only approach as well as increase its performance. The CV component would only need to process exceptions and might be able to pass some of them to a human monitor. This fused tracking information will increase the safety or security of those being tracked and their activities. Current applications include individual and group recreational and commercial sports.

We finish this dissertation with the conviction that work on fusion of RFID and CV will further the cause of mankind, e.g., more secure and safer public transportation and human management (such as at airports), as well as more efficient and economical resource management. Looking to the future, it can be applied to disaster/rescue management, as in an air crash. Currently, a downed plane can be localized, but the search and rescue of unfortunate passengers is still dependent on human sight. Imagine the possibility of human movement tracking through cell phones or medium tolerant RFID bracelets worn by passengers. This could be achieved by deploying preprogrammed flying robots equipped with cameras, RFID tag readers and tag libraries, which would relay real time and processed crash site information to the main rescue vehicle for timely action, possibly making the difference between life and death.

APPENDICES

_____________________________________________________________________________________________

APPENDIX A Fundamental experiments on looming

_____________________________________________________________________________________________

Looming is present in animal vision and is vital for collision avoidance and alighting. In a Site Safety System (S-3), realizing collision avoidance requires understanding local interactions between workers and machines. For understanding looming concepts, fundamental experiments were carried out. We have studied the relationship between rigid object area and looming distance. Later we employ and analyze the significance of looming information for collision avoidance in real time. Details of our indoor test platform, which uses an optical flow algorithm for object and looming detection, are also provided. Possible future developments using other sensors onboard smart phones are also discussed at the end.

A.1 Generating looming dataset

The initial experiment was conducted offline to generate a training dataset for looming in a controlled indoor environment. It included a ball placed on a LEGO NXT robot as shown in Figure A.1. A red colored ball was used due to its rigid spherical shape and easy detection. A lightly textured background also helped in object detection. The assembly start position was at ten feet and the stop position was at two feet from the camera. Distance markers were placed on the floor every two inches to provide distance information.
The robot was programmed to move towards the camera in a straight line while stopping for five seconds at the predefined points. Images were captured when the robot was stationary, and a dataset of images with distance stamps was generated. The camera used was a low cost Logitech C210 at a resolution of 640×480 pixels. The simulation was done using MATLAB® 2009a.

Figure A.1 Looming image dataset at different distances: (a) object at ten feet (b) object at two feet.

As explained in Section 4.1, we used color and blob analysis to detect the ball. The algorithm generated an approximately square bounding box around the ball as shown in Figure A.1(b). Ideally, after object detection the algorithm should be able to produce an approximate square around the ball. However, in practice it becomes challenging to obtain precise object edge information due to factors such as low camera resolution and varying illumination conditions. Since our further analysis depends upon the area of the bounding box, we compared the bounding box height and width to test their degree of uniformity. Figure A.2 shows the linear relationship between the bounding box parameters (width and height in pixels) relative to the distance from the camera. The maximum error observed due to noise was within 4% of the linear theoretical range.

Figure A.2 Graph of bounding box width and height relationship for training dataset.

Figure A.3 shows the relationship between the normalized bounding box area and distance. The graph shows that the area of the rigid square object changes quadratically with the change in looming distance from the camera.

Figure A.3 Graph of bounding box area vs looming distance relationship for training dataset.

A.2 Object distance measurement in real-time

Next, the same run was conducted in real time for ten trials in a controlled environment. Following a supervised learning approach based on the training dataset, the looming algorithm computed the normalized area of the ball (in pixels) and estimated the distance of the ball from the camera. Figure A.4 shows the results for two real-time trials compared to the offline results generated from the training dataset in Section A.1. The variation in the graph represents variation in the bounding box width and height. This mainly occurred due to pixelation effects, varying illumination, JPEG compression of the acquired camera images, and the system's inherent noise, which in turn affect the color and blob detection output. The RMS error in the area for dataset 1 compared to the training set was 22.6 square pixels, and for dataset 2 it was 68.7 square pixels. This experiment presented a basic understanding of how local workspace agents can learn and acquire knowledge about looming and distance for rigid objects in a controlled setting.

Figure A.4 Graph of bounding box area vs looming distance relationship for two real time datasets.
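A minimal sketch of the distance-from-area estimation used in Section A.2 is given below. It assumes a simple inverse-square model (area ≈ k/d²), which is only one plausible way to encode the training relationship of Figure A.3; the training values and variable names are hypothetical.

```matlab
% Illustrative distance estimate from bounding box area, assuming the
% inverse-square model area ~ k / d^2 (hypothetical training values).
trainDist = [120 96 72 48 24];          % training distances (inches)
trainArea = [350 540 980 2200 8700];    % corresponding normalized areas (pixels)

% Least squares fit of the single constant k in area = k / d^2.
k = (1 ./ trainDist(:).^2) \ trainArea(:);

% Estimate the distance for a newly measured bounding box area.
newArea = 1500;
estDist = sqrt(k / newArea);
fprintf('Estimated distance: %.1f inches\n', estDist);
```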
A.3 Real-time lab demo for collision avoidance

Next, to study looming detection for collision avoidance in an indoor lab environment, we designed our own test platform with a LEGO NXT robotics kit and an onboard wireless camera. The collision avoidance algorithm is based on optical flow. The wireless video was obtained by installing an iPhone 3GS on the NXT robot as shown in Figure A.5.

Figure A.5 Indoor lab platform consisting of NXT robotics kit and the iPhone 3GS to test collision avoidance using optical flow.

The iPhone camera video was accessed as an IP camera over a local WiFi network using the IP Cam application [112]. Since all the simulations were conducted in MATLAB® 2009a, an m-file routine was written to acquire the iPhone video feed using the MATLAB Image Acquisition Toolbox. For acquiring motion vectors using optical flow we used both the Horn-Schunck [113] and Lucas-Kanade [114] methods separately. The optical flow algorithm converts the acquired RGB image into a gray scale image. Figure A.6 shows the motion vectors computed between a test image pair (gray scale) using the Horn-Schunck and Lucas-Kanade algorithms. The performance of both algorithms varied by changing their controlling parameters. A shadow effect at the base of the ball is visible in both motion vector images.

Figure A.6 Motion vectors computation during real-time lab demo for collision avoidance: (a) test frame k-1 (b) test frame k (c) motion vectors using Horn-Schunck (d) motion vectors using Lucas-Kanade.

We used the RWTH Aachen University NXT MATLAB toolbox [115] to interface the NXT with MATLAB. The NXT communicates wirelessly with the PC using the Bluetooth protocol. The optical flow algorithm was used to calculate the motion vector magnitudes in both halves of each consecutive frame acquired. If the sum of the magnitudes at a particular time instance reached a certain threshold, an obstacle was considered present and the robot changed its path. The direction in which the robot turned was governed by the sums of the motion vector magnitudes in the two image halves. If the sum of the magnitudes in the left half was smaller than that in the right half, then the robot turned left, and vice versa. Figure A.7(a) shows a motion vector frame when the robot was approaching the object. The clip on the left shows the actual image acquired by the system. Figure A.7(b) shows the motion vector image when the robot detected the object and changed its trajectory. Due to the inherent noise and factors explained above, a small number of motion vectors were detected even when the robot was stationary.

Figure A.7 Real-time indoor collision avoidance experiment: (a) robot approaching the obstacle (b) robot detected the obstacle.

We are also interested in accessing the sensors in the iPhone using the User Datagram Protocol (UDP). These can be useful in acquiring object pose and trajectory in real time to support the collision avoidance structure. The iPhone 3GS, along with the camera, has a digital compass, accelerometers and GPS onboard. The latest smart-phone versions also carry a gyroscope and secondary cameras, which can be of additional value. Presently we have been able to access the sensor data using the SensorData application [116]. We have written our own m-file routine to access this buffered sensor data in MATLAB through the UDP port. Typical data accessed through the iPhone carries information in the following format:

Timestamp,Accel_X,Accel_Y,Accel_Z,MagHeading,TrueHeading,HeadingAccuracy,MagX,MagY,MagZ,Lat,Long,LocAccuracy,Course,Speed,Altitude

Though GPS information cannot be utilized indoors, the digital compass and accelerometers can be used to calculate a machine's or person's course, speed, altitude and pose. Such data from workspace agents, indoors as well as outdoors, will also be helpful for the local as well as global processes in the Site Safety System (S-3) to make appropriate decisions and generate system alarms.
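The left/right steering rule described above can be summarized in a short sketch. This is our own illustration and assumes a recent MATLAB Computer Vision Toolbox (opticalFlowHS/estimateFlow) rather than the original 2009a implementation; the frame file names and threshold are hypothetical.

```matlab
% Illustrative left/right steering rule from Section A.3 (assumes the
% Computer Vision Toolbox's opticalFlowHS; not the original implementation).
opticFlow = opticalFlowHS;                  % Horn-Schunck optical flow object
threshold = 50;                             % hypothetical obstacle threshold

prevGray = rgb2gray(imread('frame1.png'));  % hypothetical frame files
estimateFlow(opticFlow, prevGray);          % prime the estimator with frame k-1

currGray = rgb2gray(imread('frame2.png'));
flow = estimateFlow(opticFlow, currGray);   % motion vectors for frame k

mag  = flow.Magnitude;
mid  = floor(size(mag, 2) / 2);
sumL = sum(sum(mag(:, 1:mid)));             % flow magnitude in the left half
sumR = sum(sum(mag(:, mid+1:end)));         % flow magnitude in the right half

if sumL + sumR > threshold                  % an obstacle is looming
    if sumL < sumR
        disp('Turn left');                  % less flow on the left side
    else
        disp('Turn right');
    end
end
```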
_____________________________________________________________________________________________

APPENDIX B Stereo concepts and calibration procedure

_____________________________________________________________________________________________

We have provided basic stereo concepts here that are helpful for understanding the stereo approach used in this dissertation. It is explained how stereo can be used to recover 3D information from 2D images and what the correspondence problem is. Later we provide the camera calibration procedure used in our experiments. The method of computing 3D from 2D using the shortest line segment approach is also highlighted.

B.1 Basic stereo vision principles

3D world points on the same viewing line have the same 2D point on the image. Therefore the inverse process in general will be unable to recover all 3D point coordinates from 2D image coordinates, which results in loss of depth information as shown in Figure B.1. ^W X, ^W Y, ^W Z represent world coordinates, and camera coordinates are represented by ^C X, ^C Y, ^C Z. Both 3D world points ^W P(^W P_x, ^W P_y, ^W P_z) and ^W Q(^W Q_x, ^W Q_y, ^W Q_z) project into the same image point ^I P(^I P_r, ^I P_c) = ^I Q(^I Q_r, ^I Q_c), which makes it impossible to recover ^W P and ^W Q from ^I P = ^I Q.

Figure B.1 Loss of depth information in 2D - caused by projection of 3D points on the same viewing line onto the 2D image.

The 3D information can be fully recovered using two 2D images of the same scene with slightly different views. Figure B.2 shows how ^W P and ^W Q can be recovered when the same points in Figure B.1 are viewed by another camera at a slightly different position. This setting represents stereo vision. O_1 and O_2 are the optical centers of the two cameras, and the relative pose of each camera is independent of the other.

Figure B.2 Recovering 3D point coordinates using stereo vision.

Now consider that points ^W P and ^W Q are not on the same viewing line of camera 1 and camera 2. To perform the stereo computation we need to identify the 2D projections of ^W P and ^W Q in image 1 (^I P_1, ^I Q_1) that correspond to the same points in image 2 (^I P_2, ^I Q_2). This is known as the correspondence problem, as shown in Figure B.3. The geometric relationship between the 3D world points and the 2D projections is known as the epipolar geometry and is explained below. Epipolar constraints can be used to solve the correspondence problem.

Figure B.3 Stereo correspondence problem: which points in Image 1 actually correspond to points ^W P and ^W Q in Image 2?

The projection of one camera's optical center into the image of the other camera is called the epipole. Figure B.4 shows the epipolar geometry. ^W P here represents the point of interest for both cameras. Points ^I P_1 and ^I P_2 are the projections of point ^W P onto the left and right image planes respectively. The projection of O_2 onto the image 1 plane is the left epipole e_1; similarly, the projection of O_1 onto image 2 is the right epipole e_2. The plane defined by ^W P, O_1, O_2 is known as the epipolar plane. The ray O_1 ^W P is seen by camera 1 as a point because it is directly in line with that camera's optical center. However, camera 2 sees this ray as a line in its image plane. That line e_2 ^I P_2 in camera 2 is called an epipolar line. In other words, the intersection of the epipolar plane with the image plane represents the epipolar line. It is a property of the system that all epipolar lines go through the camera's epipole.

Figure B.4 Epipolar geometry.
Given an image point ^I P_1, ^W P can lie anywhere on the ray from O_1 through ^W P. To establish the epipolar constraint, the correct match of ^I P_1 must lie on the corresponding epipolar line in the right image. The search for correspondences is thus reduced to a 1D problem. This makes the epipolar constraint effective in rejecting false matches due to occlusion. Conjugate points along corresponding epipolar lines have the same order in each image. However, ordering is not a hard constraint, because corresponding points may not have the same order if they lie on the same epipolar plane and are imaged from different sides. Once the correspondence problem is solved, the 3D coordinates can be recovered using the camera transformation matrices and the object model can be reconstructed as shown in Figure B.5.

Figure B.5 3D reconstruction using 2D image points.

The camera transformation matrices are obtained by calibrating the cameras with respect to the 3D world. The section below explains the camera calibration procedure that we have used.

B.2 Camera calibration

The coordinate system used in the affine camera calibration procedure [31] is shown below in Figure B.6.

Figure B.6 Coordinate system used in camera calibration: (a) 3-D world (b) camera.

We define here the transformation formula to project every point of the 3D model to camera image coordinates. ^I P represents the image coordinates, C is the calibration matrix and ^W P is the world point.

\[
{}^{I}P = {}^{I}_{W}C\; {}^{W}P \tag{B.1}
\]

\[
\begin{bmatrix} s\,{}^{I}P_r \\ s\,{}^{I}P_c \\ s \end{bmatrix}
=
\begin{bmatrix} c_{11} & c_{12} & c_{13} & c_{14} \\ c_{21} & c_{22} & c_{23} & c_{24} \\ c_{31} & c_{32} & c_{33} & 1 \end{bmatrix}
\begin{bmatrix} {}^{W}P_x \\ {}^{W}P_y \\ {}^{W}P_z \\ 1 \end{bmatrix} \tag{B.2}
\]

In the matrix above, the parameter s is the scale factor which is used to adjust the pixel position according to the unit difference. To calculate the 11 parameters of the transformation matrix, the following derived formula is used with the set of corresponding points from the 3D world coordinates and the camera images of the stereo pair. Repeated experiments show that at least eight (i.e., i ≥ 8) 3D calibration points are required in most cases to provide good results.

\[
\begin{bmatrix}
{}^{W}P^x_i & {}^{W}P^y_i & {}^{W}P^z_i & 1 & 0 & 0 & 0 & 0 & -{}^{W}P^x_i\,{}^{I}P^r_i & -{}^{W}P^y_i\,{}^{I}P^r_i & -{}^{W}P^z_i\,{}^{I}P^r_i \\
0 & 0 & 0 & 0 & {}^{W}P^x_i & {}^{W}P^y_i & {}^{W}P^z_i & 1 & -{}^{W}P^x_i\,{}^{I}P^c_i & -{}^{W}P^y_i\,{}^{I}P^c_i & -{}^{W}P^z_i\,{}^{I}P^c_i
\end{bmatrix}
\begin{bmatrix} c_{11} \\ c_{12} \\ c_{13} \\ c_{14} \\ c_{21} \\ c_{22} \\ c_{23} \\ c_{24} \\ c_{31} \\ c_{32} \\ c_{33} \end{bmatrix}
=
\begin{bmatrix} {}^{I}P^r_i \\ {}^{I}P^c_i \end{bmatrix} \tag{B.3}
\]

Each input pair of points contributes two equations (two rows of size 1×11), so the stacked system can be solved with a least squares fit to calculate the 11 parameters:

\[
A_{2n\times 11}\, X_{11\times 1} = B_{2n\times 1} \tag{B.4}
\]

\[
X_{11\times 1} = \left(A^{T} A\right) \backslash \left(A^{T} B\right) \tag{B.5}
\]

The X transform vector, once calculated, can be recomposed as follows to get the camera calibration matrix for each camera:

\[
\begin{bmatrix} c_{11} \\ c_{12} \\ \vdots \\ c_{33} \end{bmatrix}
\;\longrightarrow\;
\begin{bmatrix} c_{11} & c_{12} & c_{13} & c_{14} \\ c_{21} & c_{22} & c_{23} & c_{24} \\ c_{31} & c_{32} & c_{33} & 1 \end{bmatrix} \tag{B.6}
\]
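A minimal MATLAB sketch of this least squares calibration step is shown below (our own illustration; worldPts and imgPts are hypothetical names for the surveyed 3D calibration points and their measured image coordinates).

```matlab
% Illustrative least squares solution for the 11 camera parameters
% (Equations B.3-B.6). worldPts is n-by-3 (X,Y,Z), imgPts is n-by-2 (r,c),
% with n >= 8 calibration points; the names are hypothetical.
function C = calibrateCamera(worldPts, imgPts)
    n = size(worldPts, 1);
    A = zeros(2*n, 11);
    B = zeros(2*n, 1);
    for i = 1:n
        X = worldPts(i,1); Y = worldPts(i,2); Z = worldPts(i,3);
        r = imgPts(i,1);   c = imgPts(i,2);
        A(2*i-1,:) = [X Y Z 1 0 0 0 0 -X*r -Y*r -Z*r];   % row equation of (B.3)
        A(2*i,  :) = [0 0 0 0 X Y Z 1 -X*c -Y*c -Z*c];   % column equation of (B.3)
        B(2*i-1) = r;
        B(2*i)   = c;
    end
    x = (A' * A) \ (A' * B);            % normal equations, as in (B.5)
    C = [x(1:4)'; x(5:8)'; x(9:11)' 1]; % recompose the 3x4 matrix (B.6)
end
```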
W P x  s I Pr    1  b  11 b12 b13 b14   W y   I c   P   s P  = b21 b22 b23 b24   1  W z   b 31 b32 b33 1   P    s        1    224 (B.7) W P x  t I P r    2  c11 c12 c13 c14   W y   I c    P  t P2  = c21 c22 c23 c24      c31 c32 c33 1   W P z     t       1    (B.8) Eliminating the homogeneous coordinates s and t following 4 equations and 3 unknowns are obtained:       I Pc = b - b I Pc W P x + b - b I Pc W P y + b - b I Pc W P z + b 24  22 32 1   23 33 1  1  21 31 1  I Pr = c - c I Pr W P x + c - c I Pr W P y + c - c I Pr W P z + c 14  12 32 2   13 33 2  2  11 31 2  I Pc = c - c I Pc W P x + c - c I Pc W P y + c - c I Pc W P z + c 24  22 32 2   23 33 2  2  21 31 2  I Pr = b - b I Pr W P x + b - b I Pr W P y + b - b I Pr W P z + b 11 31 1 12 32 1 13 33 1 14 1  I Pr  1 I c  P1   I Pr  2 I c  P2       b -b - b14   11 31     b21 - b31 - b24   =  - c14   c11 - c31     c -c - c24    21 31    W P x      W y    P =     W P z        b11 - b31 I P1r  b21 - b31 I P1c   c11 - c31 I P2r   c21 - c31 I P2c   I Pc 1 I Pr 2 I Pc 2 I Pr 1 b12 - b32 I P1r  b22 - b32 I P1c  c12 - c32 I P2r  c22 - c32 I P2c  b12 - b32 I P1r  b22 - b32 I P1c  c12 - c32 I P2r  c22 - c32 I P2c  225 (B.9) b13 - b33 I P1r   W x   P  b23 - b33 I P c     1  W y  P  (B.10)    c13 - c33 I P r     2 W P z    I Pc   c23 - c33 2   b13 - b33 I P1r   I P1r - b14     c b23 - b33 I P1    I P c - b24   1 \   I Pr   I Pr - c  c13 - c33 2    2 14  c23 - c33 I P2c    I P2c - c24  (B.11) Any 3 of these 4 equations can be solved to obtain the 3D world point  W P x , W P y , W P z  ,     however, due to approximation errors in the camera model and image points, each subset of three W equations will yield slightly different coordinates for P. These inaccuracies explained above once generated, amplifies as the ray propagates in space. Therefore we have used a more robust shortest line segment approach [95] as shown in Figure B.7. We have dropped the coordinate system symbols from the notation. The center of this line segment will represent the 3D point. So the smaller the segment better is the correspondence of image points and vice versa. We have also used this segment length criterion as a constraint to solve the correspondence problem. Epipolar constraints are also used in conjunction for robustness. Figure B.7 Shortest line segment connecting the two skew rays. 226 P1 and P2 are the points on the ray originating from camera optical center O1 and passing through image point I1 while Q1 and Q2 are the points on the ray originating from camera optical center O2 passing through image point I2. If the optical center of the cameras is not known then camera 1 ray points can be computed using the two equations in Equation B.13 while choosing an arbitrary value of W z P = z. If the computed ray is parallel with the z-axis then the same procedure can be repeated for y and z while W z P =x and so on. u1 and u2 are the unit vectors along these rays respectively. The shortest line segment is represented by vector V and is orthogonal to both u1 and u2 and is given as: V = (P + a1u1 ) - (Q1 + a2u2 ) 1 (B.12) The variables a1 and a2 can be computed using the following set of linear equations. 
Here '·' represents the dot product:

\[
\left[(P_1 + a_1 u_1) - (Q_1 + a_2 u_2)\right] \cdot u_1 = 0 \tag{B.13}
\]

\[
\left[(P_1 + a_1 u_1) - (Q_1 + a_2 u_2)\right] \cdot u_2 = 0 \tag{B.14}
\]

Rearranging Equations B.13 and B.14:

\[
\left[(P_1 - Q_1) + (a_1 u_1 - a_2 u_2)\right] \cdot u_1 = 0, \qquad
\left[(P_1 - Q_1) + (a_1 u_1 - a_2 u_2)\right] \cdot u_2 = 0 \tag{B.15}
\]

\[
(P_1 - Q_1)\cdot u_1 + a_1 - a_2 (u_2 \cdot u_1) = 0, \qquad
(P_1 - Q_1)\cdot u_2 + a_1 (u_1 \cdot u_2) - a_2 = 0 \tag{B.16}
\]

since u_1·u_1 = u_2·u_2 = 1. Therefore:

\[
a_1 - a_2 (u_1 \cdot u_2) = -(P_1 - Q_1)\cdot u_1 \tag{B.17}
\]

\[
a_2 = a_1 (u_1 \cdot u_2) + (P_1 - Q_1)\cdot u_2 \tag{B.18}
\]

Solving Equations B.17 and B.18 further gives a_1 and a_2. Multiplying Equation B.18 by (u_1·u_2) and substituting into Equation B.17:

\[
a_1\left[1 - (u_1 \cdot u_2)^2\right] = (Q_1 - P_1)\cdot u_1 - \left[(Q_1 - P_1)\cdot u_2\right](u_1 \cdot u_2) \tag{B.19}
\]

\[
a_1 = \frac{(Q_1 - P_1)\cdot u_1 - \left[(Q_1 - P_1)\cdot u_2\right](u_1 \cdot u_2)}{1 - (u_1 \cdot u_2)^2} \tag{B.20}
\]

Multiplying Equation B.17 by (u_1·u_2) and combining with Equation B.18:

\[
a_2\left[1 - (u_1 \cdot u_2)^2\right] = \left[(Q_1 - P_1)\cdot u_1\right](u_1 \cdot u_2) - (Q_1 - P_1)\cdot u_2 \tag{B.21}
\]

\[
a_2 = \frac{\left[(Q_1 - P_1)\cdot u_1\right](u_1 \cdot u_2) - (Q_1 - P_1)\cdot u_2}{1 - (u_1 \cdot u_2)^2} \tag{B.22}
\]

If the magnitude of vector V is less than a desired threshold, then the 3D world coordinates (^W P_x, ^W P_y, ^W P_z) of the point ^W P are given as the midpoint of V:

\[
{}^{W}P = \tfrac{1}{2}\left[(P_1 + a_1 u_1) + (Q_1 + a_2 u_2)\right] \tag{B.23}
\]
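The closed-form solution above translates directly into a few lines of MATLAB. The sketch below is our own illustration of Equations B.12-B.23; the argument names are hypothetical, and the rays are assumed not to be parallel.

```matlab
% Illustrative midpoint triangulation from two skew rays (Eqs. B.12-B.23).
% P1, Q1 : 3-by-1 base points of the rays from cameras 1 and 2
% u1, u2 : 3-by-1 unit direction vectors of the rays
function [Pw, segLen] = rayMidpoint(P1, u1, Q1, u2)
    d  = Q1 - P1;
    c  = dot(u1, u2);                       % assumes c^2 ~= 1 (non-parallel rays)
    a1 = (dot(d, u1) - dot(d, u2) * c) / (1 - c^2);   % Equation B.20
    a2 = (dot(d, u1) * c - dot(d, u2)) / (1 - c^2);   % Equation B.22
    p  = P1 + a1 * u1;                      % closest point on ray 1
    q  = Q1 + a2 * u2;                      % closest point on ray 2
    segLen = norm(p - q);                   % length of the shortest segment V
    Pw = 0.5 * (p + q);                     % Equation B.23: midpoint estimate
end
```

If segLen exceeds the correspondence threshold, the candidate image point pair can be rejected, as described above.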
_____________________________________________________________________________________________

APPENDIX C Site survey details

_____________________________________________________________________________________________

This appendix provides information about the outdoor test site that we have used in our experiments. Some extra pictures of the test site are also provided to illustrate the local dynamics of the test environment. We have also briefly explained the survey procedure used to acquire the survey data.

We have selected the MSU Engineering building courtyard as our outdoor test site. Figure C.1 shows the satellite view of the Engineering building obtained from Google Earth.

Figure C.1 MSU Engineering building satellite view.

The Engineering building latitude and longitude are 42.72477 and -84.481594 respectively. For object localization and tracking, the courtyard possesses complex semi-indoor features due to surrounding walls, trees, different sized pillars and an arch structure in the middle of the courtyard. For our experiments the ground was considered to be level. Figure C.2(a) shows the aerial view of the courtyard. To give better insight into the local structures, Figure C.2(b) to Figure C.2(e) provide local images. The approximate positions and directions from which these pictures were taken are highlighted in Figure C.2(a).

Figure C.2 Different views of the courtyard.

We started the survey of the site using a total station provided courtesy of the MSU Civil Engineering department. Figure C.3 shows equipment similar to that used in the survey. The location where we placed the total station was selected carefully so that most of the points were in its direct line of sight. The equipment location was selected as the origin of the 3D coordinate system.

Figure C.3 Total station surveying equipment - image extracted from [117].

We have used a right-handed coordinate system as shown in Figure C.4. We started the survey by leveling the equipment. The height of the equipment, once leveled, was adjusted to 4.8 ft. The goal was to obtain the coordinate information for all the possible corners that would be helpful in designing a simulated 3D model of the test environment. The angles and distances of the predefined points from the total station were acquired. The relative positions of the points from the origin were then calculated using trigonometry. We also used laser meters, tape measurements and ranging poles to obtain detail for the points not visible to the total station. This was also helpful to validate the data obtained from the total station. Later the acquired survey data was imported into MATLAB to make a scaled model of the test site that was then used in simulations and experiments. The data was obtained in feet. To remain in sync with the unit system used in the lab experiments, we later converted it to inches during the simulation.

Figure C.4 Top view of the outdoor test site with legend showing equipment position and the coordinate system.

The RFID readers were placed in a 40×40 m space. In our experiments, to assess the performance of the stereo system we selected two different camera positions, which provided a varying choice of near and far field calibration points. The landmarks for the camera positions, RFID reader positions, origin, test site center and the coordinate system are shown in Figure C.4. Figure C.5 shows the simulated 3D view of the outdoor test site in the same orientation as Figure C.4. The locations of the RFID readers and the cameras in position 2 are also shown.

Figure C.5 3D view of the outdoor test site with sensor configuration - scale is in inches.

Figure C.6 shows labeled calibration points (shown by '■' in yellow) used during camera calibration.

Figure C.6 Outdoor test site with 2D calibration points shown by '■' (yellow): (a) left image (b) right image.

Table C.1 provides the coordinates of some of the important 3D landmarks in inches. The 3D ground truth points corresponding to the 2D calibration points shown in Figure C.6 are also listed.

Table C.1 3D coordinates of calibration points and some important landmarks in MSU Engineering courtyard - scale is in inches.

Landmark                   X        Y        Z
Origin                     0        0        0
Site center                524      0        0
Master reader              -534     0        0
Slave reader 1             524      -1061    0
Slave reader 2             1585     0        0
Slave reader 3             524      1061     0
Left camera position 2     0        72       54
Right camera position 2    0        -72      54
Point 1                    242      60       36
Point 2                    242      -60      36
Point 3                    242      60       0
Point 4                    242      -60      0
Point 5                    429      84       140
Point 6                    429      -84      140
Point 7                    945.5    72       30
Point 8                    945.5    -72      30

_____________________________________________________________________________________________

APPENDIX D Wireless location sensing

_____________________________________________________________________________________________

Wireless location sensing (WLS) methods have provided a new layer of automation to many indoor and outdoor location systems that need to know the physical location of objects and persons, either relative to a known location or within a coordinate system. The process of WLS estimates the node location, where a node can be a smart phone, GPS receiver, wireless sensor, tag or cellular base station. In this appendix we briefly explain location system topologies, principles and methods, with a summary of recent updates in location sensing technology.

Wireless location sensing is an umbrella term used for many types of indoor and outdoor location providing schemes and is used interchangeably with wireless positioning systems (WPS). The variety of location systems can broadly be categorized into global positioning systems (GPS) and local positioning systems (LPS) [30]. GPS provides global position information whereas an LPS provides relative position information.
Some of the outdoor systems use cellular or satellite based positioning, which are mostly line of sight based, while indoor schemes use local positioning technologies such as Wireless Local Area Network (WLAN), cameras, Bluetooth, sensor networks, Radio Frequency Identification (RFID), infrared and ultrasound. Wireless-assisted Global Positioning System (GPS) and cellular base stations linked with an indoor mobile client can also be used for indoor localization. For example, locating an airport might best be done using GPS, but analyzing the behavior of a group of people waiting at an airport gate can be done with an LPS for just that gate or for just that airport. Several surveys covering broader local sensing aspects are available on the topic [25], [28], [29], [118], [119], and a handbook of location estimation was recently published by Zekavat and Buehrer [30]. In the ensuing paragraphs we provide basic information about WLS systems along with recent advances and trends.

D.1 Location system topologies

An LPS can have the following four system topologies [120].

a. Self positioning - Such a system has the receiver and measuring unit onboard the mobile object. It communicates with several geographically distributed transmitters at known locations and calculates its position accordingly. An inertial navigation system (INS) works on this principle.

b. Indirect remote positioning - A self-positioning system sending its position information to a remote unit via a wireless link.

c. Remote positioning - The mobile transmitter onboard the tracked object communicates with fixed receivers. Based on the received signals, the position of the mobile object is measured in a central unit.

d. Indirect self positioning - A remote positioning system sending position information to the mobile unit via a wireless link.

D.2 Location system principles

In a broader spectrum, location sensing has two general principles, i.e., triangulation and trilateration.

a. Triangulation - With triangulation, multiple sensor nodes observe some other node. Triangulation can be done in 2D if two network nodes fixed in space can compute the heading to the moving receiver node (angle-side-angle). Figure D.1 shows the triangulation concept. The same concept applies in 3D, where three fixed nodes form a tetrahedron with the mobile object.

Figure D.1 Triangulation geometry.

The location of the receiver in Figure D.1 can be calculated as:

\[
L_1^2 = L_2^2 + L_3^2 - 2 L_2 L_3 \cos\gamma, \qquad \gamma = 180^{\circ} - \alpha - \beta
\]

where the side lengths and angles are as labeled in Figure D.1.

b. Trilateration - Using trilateration, a mobile node locates itself relative to other transmitting base nodes that are at known locations. The distance from each base node is computed from signal strength or timing. In a 2D space, the mobile node locates itself at the intersection of three circles whose radii are the sensed distances, and in 3D at the intersection of four spheres. Using more than the minimum number of base nodes enables more robust computation of location in real, noisy environments.

D.3 Location methods

The location methods utilized in WLS are either geometric or time based. The geometric based techniques are angle of arrival (AOA), also called direction of arrival (DOA), received signal strength indicator (RSSI), and phase of arrival (POA); the propagation time based methods are time of arrival (TOA), also called time of flight (TOF), time difference of arrival (TDOA), and round trip time of flight (RTOF) [25], [118]. AOA estimates the signal direction/angle at the desired point from at least two known reference points.
RSSI compares the transmitted and received signal strength; the resulting attenuation factor is then used to estimate range. The POA method estimates the received signal phase difference to obtain range estimates. TOA/TOF measures the one-way signal travel time from a transmitter to a receiver; for a position estimate, time measurements from at least three reference points are required. TDOA, on the other hand, estimates the difference in the times at which the same signal arrives at multiple receivers instead of the absolute arrival time. RTOF measures the two-way signal travel time between the transmitter and receiver. Propagation time based systems are sensitive to the availability of line of sight (LOS) [121] and do not work well in mountainous terrain or around skyscrapers. However, a non-line-of-sight (NLOS) method such as RSSI is only slightly affected by the lack of LOS. To improve position and tracking performance, location sensing technologies also use parameter estimators such as Kalman filters, particle filters and Bayesian estimation.

D.4 Some location sensing system descriptions

A number of different implementation approaches exist for the above systems. Some of them are given below.

D.4.1 Received Signal Strength based localization

An electronic fingerprint makes it possible to identify a wireless device by its unique radio transmission characteristics. Using spectrum analyzers, the RF location fingerprints of the scene are initially calculated. To estimate the position of an object, the observed measurements are compared with the fingerprint database. In a Wi-Fi environment, the RSS based algorithms mostly use Wireless Local Area Network (WLAN) signatures for indoor localization. RADAR [122] was the first indoor location and tracking system of this kind. Depending on the environment and application, RADAR and similar WLAN based location sensing systems provide tracking accuracy of one to three meters [123]. The accuracy of other typical WLAN positioning systems is approximately 3 to 30 m [25]. Some of the recent work on WLAN based localization is presented in [124], [125], [126]. RSS based localization is classified as RF fingerprinting, model based or kernel based.

a. RSS fingerprinting - These methods [127] work in two steps, i.e., offline and online. RF signatures are captured in the offline mode and the database is generated. In the online mode the location is estimated based on database matching. Fingerprinting location techniques do not rely on LOS geometric assumptions.

b. RSS Model based - Model based RSS methods use a statistical model to describe the relationship between RSS and distance [128], [129].

c. RSS Kernel based - Kernel based methods are statistical algorithms which provide the relationship between RSS and physical location using kernel functions [130].

D.4.2 Radiolocation using cellular signals

Position can also be estimated by cellular phones using measurements of the signal between different signal towers and the phone. Location sensing using cellular signals has the benefit that existing mobile hardware can be used and the system can potentially provide location estimates anywhere wireless service is available. Radiolocation through cellular telephony mainly includes techniques such as cell identification, AOA, TOA/TDOA, Assisted GPS (AGPS) [131] and Enhanced Observed Time Difference (E-OTD). AGPS combines mobile technology and GPS. E-OTD measures the difference in arrival times at the handset of signals transmitted from a minimum of three synchronized base towers.
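To make the time-difference idea concrete, the following minimal sketch (Python with NumPy/SciPy; the tower layout, target position and the generic least-squares solver are illustrative assumptions, not a reconstruction of any cited E-OTD or O-TDOA implementation) estimates a 2D position from arrival-time differences relative to a reference tower:

import numpy as np
from scipy.optimize import least_squares

C = 3.0e8  # assumed radio propagation speed (m/s)

def tdoa_solve(towers, tdoas, x0):
    """Estimate a 2D position from time differences of arrival.

    towers : (N, 2) positions of synchronized base stations
    tdoas  : (N-1,) arrival time at tower i minus arrival time at tower 0 (s)
    x0     : initial guess, e.g. the centroid of the towers
    """
    towers = np.asarray(towers, float)

    def residuals(x):
        d = np.linalg.norm(towers - x, axis=1)          # distance to each tower
        return (d[1:] - d[0]) - C * np.asarray(tdoas)   # range-difference mismatch

    return least_squares(residuals, x0).x

# Illustrative example: four towers on a 1 km square, target at (400, 250)
towers = np.array([[0.0, 0.0], [1000.0, 0.0], [1000.0, 1000.0], [0.0, 1000.0]])
true_pos = np.array([400.0, 250.0])
d = np.linalg.norm(towers - true_pos, axis=1)
print(tdoa_solve(towers, (d[1:] - d[0]) / C, x0=towers.mean(axis=0)))

With noise-free differences the solver recovers the assumed position; with real measurements the same residual form is fed noisy TDOAs and, ideally, more than the minimum number of towers.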
For reliability, subscriber safety and quicker response, the United States Federal Communications Commission (FCC) directed wireless carriers to provide automatic location identification [132] for 911 emergency calls. Consequently, there has been a wave of exploration in this area by the cellular service providers. For instance, wireless providers based on 2G GSM (Global System for Mobile communication) using E-OTD and on 3G HSDPA (High Speed Downlink Packet Access) have integrated the FCC positioning accuracy requirements into their systems. The Universal Mobile Telecommunication System (UMTS) is a third generation technology for GSM networks. The observed TDOA (O-TDOA) is considered the UMTS version of E-OTD. CDMA (Code Division Multiple Access) based networks utilize TDOA and AGPS techniques for location based services. Several solutions have been reported for positioning using cellular phones [133], [134], [135], [136], [137], [138]. The large variety of smartphones in the market has also played an important role. Radiolocation from the cellular infrastructure can be achieved by handset based methods (upgraded handsets with GPS-based technology), network based methods, SIM based methods or hybrid methods.

a. Handset based - These methods require client software running on the phones. Development of such client software with a multi-OS interface and the need for a cooperative mobile subscriber are some of the main concerns in this approach.

b. Network based - Network based cellular localization requires additions only in the provider's infrastructure. Its accuracy varies with the concentration of the signal towers and the timing method being used.

c. SIM based - Using the SIM it is possible to get the cell ID, RSS and RTOF measurements.

d. Hybrid - Hybrid systems use a mixture of techniques; for example, a network based technique can use a GPS feed for validating location information.

As cell sizes vary from tens of meters in crowded urban areas to thousands of meters in rural areas (having clear LOS), the location accuracy obtained using the cell ID varies accordingly. Fusing techniques such as TOA and TDOA with cell IDs can increase the accuracy. 2G GSM networks mostly use TDOA techniques; however, for more accuracy AT&T now also utilizes a GPS feed for position estimation, just like CDMA based networks. The FCC has specified that position estimation for 67% of emergency calls should be within 50 m for handset based methods and within 100 m for network based methods. Table D.1 from [138] compares the location accuracies of cellular phones using the above mentioned technologies.

Table D.1 Location accuracies of cellular radiolocation technologies. See Kos et al. [138].

Type      Rural      Suburban    Urban        Indoor
Cell ID   1-35 km    1-10 km     150-500 m    10-50 m
E-OTD     -          50-150 m    50-150 m     good
AGPS      10 m       10-20 m     10-100 m     variable

D.4.3 Localization using smart phone sensors

The increasing number of embedded sensors in cell phones, such as Wi-Fi radio, cellular radio, accelerometer, gyroscope, compass, cameras, magnetometer, microphone, speakers and GPS, presents new opportunities for logical localization. Using phone embedded hardware to determine RSSI fingerprints, Martin et al. have reported localization accuracy of 1.5 m [139]. The accuracy is better in regions having more Wi-Fi radios in range. The authors in [136] have used smart phone compasses and accelerometers for localization without relying on Wi-Fi wireless networks, with a reported accuracy of around 11 m.
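As a concrete illustration of the fingerprint matching used by the RSS based schemes above, the following minimal sketch (Python; the surveyed positions, access-point RSS values and the function knn_fingerprint are invented for illustration, not taken from any cited system) estimates a position as the average of the k database entries whose stored RSS vectors best match an online measurement:

import numpy as np

# Hypothetical offline fingerprint database: each entry is a surveyed (x, y)
# position and the RSS values (dBm) observed there from three Wi-Fi access points.
fingerprints = {
    (0.0, 0.0): [-45, -70, -80],
    (5.0, 0.0): [-55, -60, -78],
    (0.0, 5.0): [-52, -72, -65],
    (5.0, 5.0): [-60, -62, -60],
}

def knn_fingerprint(rss_online, k=3):
    """Estimate position as the average of the k database entries whose stored
    RSS vectors are closest (Euclidean distance) to the online measurement."""
    locs = np.array(list(fingerprints.keys()))
    db = np.array(list(fingerprints.values()), float)
    dist = np.linalg.norm(db - np.asarray(rss_online, float), axis=1)
    nearest = np.argsort(dist)[:k]
    return locs[nearest].mean(axis=0)

print(knn_fingerprint([-54, -63, -70], k=2))

Real deployments use far denser survey grids and often weight the neighbors by similarity, but the offline/online split is the one described in Section D.4.1.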
Location estimation can also be done using photo-acoustic signatures such as sound, light and color (from the microphone and camera) and user motion (from accelerometers) [140]. An overview of the accuracy and cost benefits of AGPS is given in [131]. Peng et al. [104] provide an acoustic based ranging system using only the phone's microphone and speaker. The software based algorithm relies on two-way sensing, self recording and sample counting to estimate the location. Their system provides one to two centimeter accuracy over a range of about 10 m. A similar approach in 3D without any infrastructure support has been reported in [141]. Their acoustic signatures are based on time of arrival and power level. Their system can provide localization accuracy of 13.9 cm for 90% of estimates when the phones are several meters apart. Kessel and Werner [142] evaluated location based services using deterministic 802.11 RSS fingerprinting and a digital compass on a smart phone. The reported position accuracy is 2.74 m over an area of 250 m².

D.4.4 Sensor networks

Advancement in micro-electro-mechanical systems (MEMS) has made small size, low cost, low power wireless sensors possible. Sensor networks are generally used to monitor the environment, but they can also be combined with location based services and GPS positional information. Location estimation using sensor networks faces challenges such as the lack of a central control system, limited computational capability, limited wireless bandwidth and high data traffic. In [143] RSSI based location tracking of an object in a sensor network was simulated by the cooperation of sensors through an election process and the initiation of a mobile tracking agent. A mobile software agent is an intelligent program that follows an automated sequence of actions to track the target object. The system has prior knowledge of the global and relative position information of each sensor. The mobile agent monitors the object by choosing the sensor closest to the object, i.e., inviting nearby sensors and inhibiting irrelevant sensors. Each object is marked with its unique ID code by interpreting signal strengths from different sensors. The data overload issue was addressed by forwarding tracking histories to a location server. Recent research indicates that using low cost wireless sensors is an acceptable approach to scalable target tracking applications such as smart homes, fleet monitoring, air traffic control and security. In an indoor sensor network setup, RMS location errors of 1.2 m for TOA and 2.2 m for RSSI are reported [144]. Some of the background work on sensor network localization methods is given in [145], [146], [147], [148].

D.4.5 Infrared positioning

Infrared positioning systems do not have reflection problems and are widely used for high accuracy applications such as virtual reality, games and computer graphics in the movie industry. One popular infrared camera based motion tracking system is provided by Vicon [93]. Its results, operating range and accuracy vary over different applications and environments, mainly due to camera placement, lighting conditions and effects of location within the capture volume. The Vicon system is also being used by the group at the University of Pennsylvania for controlling highly accurate maneuvers of cooperating flying robots [149]. Also, IR motion trackers are used for medical applications such as surgical navigation [150].
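Both the acoustic phone ranging of Section D.4.3 and the ultrasonic trackers described next reduce to converting a measured propagation time, often obtained by counting audio samples, into a distance via the speed of sound. A minimal sketch in Python (the sample rate, temperature model and sample count are illustrative assumptions, not parameters of any cited system):

def speed_of_sound(temp_c):
    """Approximate speed of sound in air (m/s) at a given temperature (deg C)."""
    return 331.3 + 0.606 * temp_c

def tof_distance(sample_count, sample_rate_hz=44100, temp_c=20.0, two_way=False):
    """Convert a counted number of samples between emission and detection of a
    chirp into a distance estimate; halve it for a round-trip (echo) measurement."""
    tof = sample_count / sample_rate_hz   # elapsed time in seconds
    d = speed_of_sound(temp_c) * tof      # one-way distance in meters
    return d / 2.0 if two_way else d

# 1286 samples at 44.1 kHz is roughly 29 ms, i.e. about 10 m one-way at 20 deg C
print(round(tof_distance(1286), 2))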
D.4.6 Ultrasonic trackers

Due to ultrasonic noise interference, ultrasonic trackers are more suitable for sound controlled areas such as indoor environments (offices, hospitals, labs, etc.). Their low cost and good accuracy over small distances leverage their use in human movement analysis [151] and in robot collision avoidance and distance measurement [152], [153].

D.4.7 Laser range finders

Laser range finders (LRF) are also being used in position estimation systems, especially in the field of robot navigation. They provide an estimate of how far the closest obstruction is from the robot. Wall mounted LRFs are also used in a human tracking system in which people wear infrared tags [154]. The system merges multi-sensor information using a Bayesian filter and performs identity estimation. The efficiency of LRFs is independent of the lighting conditions, and they provide accuracy within centimeters in controlled indoor environments [155], [156].

D.4.8 Magnetic motion trackers

Position and orientation information can also be obtained by magnetic motion trackers [157]. In these systems the transmitter generates magnetic pulses, which are then observed and reported by the mounted magnetic receiver. These sensors do not require LOS and are small and lightweight, but they have high cost and a small range of operation (within ~3 m of the transmitter). Since the sensors can be affected by ferrous material and electricity [158], [159], the highest accuracy is ensured in a controlled indoor environment where there is minimal magnetic distortion. The wide range trakSTAR system estimates X, Y, Z positional coordinates and orientation angles within a 2.1 m range from the transmitter with a single sensor static accuracy of 3.8 mm. The system is used for human motion and activity capture and analysis [160], biomechanics, simulations and computer graphics. The typical accuracy of a magnetic tracking system is less than 10 mm [160]. Unlike other position sensors, magnetic fields permeate human tissue, which allows tracking objects inside the human body; therefore these sensors are used to track surgical equipment and drug delivery inside the human body [161].

D.4.9 Ultra Wide Band

Another localization technique uses Ultra Wide Band (UWB) signals. UWB wireless technology uses a frequency spectrum larger than 500 MHz. UWB trackers have wall penetration capability, typical accuracy between 30 and 50 cm within a 10 m working range (better than narrowband RF) and require low transmission power. However, the system itself is costly. A commercially available UWB based tracking system is provided by Ubisense [22]. The system has a range of tens of meters and estimates the 3D location of moving UWB tags. The company claims that the system provides 15 cm accuracy 95% of the time. UWB use in military applications and systems is given in [162].

D.4.10 Bluetooth

Bluetooth wireless networking technology can be used for location sensing; however, due to fewer transmitters and a low scan/refresh rate it is not an ideal choice for real time location systems (RTLS). If sufficient transmitting beacons are available, then Bluetooth can typically provide up to 10 m accuracy [163]. The authors in [163] used a combination of WLAN and Bluetooth technologies to improve the location accuracy. Purely Bluetooth RSSI based indoor position estimation is reported in [164].

D.4.11 Inertial navigation systems (INS)

For outdoor localization and tracking, INS are used in airplanes, submarines, shuttles, spacecraft and unmanned aerial vehicles (UAVs).
By virtue of micro-electro-mechanical systems (MEMS), smaller versions of these systems have now made their place as position and orientation estimators in object location and tracking. Inertial Measurement Units (IMU) are the main component of an INS. IMUs consist of gyroscopes and accelerometers, and they provide position accuracy of around 10 m without the requirement of LOS. An autonomous positioning system having an IMU as a system component is explained in [165]; it can locate and track a firefighter's position during rescue operations. An IMU integrated with GPS allows position estimates to continue in case of GPS signal loss. Integrating information from IMUs and marker based video tracking, a system for 3D indoor location tracking is provided in [166].

D.4.12 Miscellaneous

One example of an RTOF based position estimation system is the Siemens local positioning radar [167]. The system is claimed to provide an accuracy of a few centimeters. To locate the position of an object, earlier systems such as the Active Badge system [168] and Cyberguide [169] used infrared, while CricketNav [170] and ActiveBat [171] used ultrasound. The Active Badge system and CricketNav estimate the room or the portion of a room where the device is located. The ActiveBat system provides accuracy of 9 cm 95% of the time in a 100 m² area. These technologies, however, suffer from LOS restrictions and require a large amount of extra hardware to be installed.

REFERENCES

[1] S. G. Pratt, D. E. Frosbroke, and S. M. Marsh, "Building safer highway work zones: measures to prevent worker injuries from vehicles and equipment," Center for Disease Control and Prevention, 2001.
[2] J. Teizer, M. Venugopal, and A. Walia, "Ultrawideband for Automated Real-Time Three-Dimensional Location Sensing for Workforce, Equipment, and Material Positioning and Tracking," Transportation Research Record: Journal of the Transportation Research Board, vol. 2081, pp. 56-64, 2008.
[3] D. E. Fosbroke, "Studies on heavy equipment blind spots and internal traffic control," in Roadway Work Zone Safety & Health Conference, Baltimore, MD, 2004.
[4] T. M. Ruff, "Monitoring blind spots - a major concern for haul trucks," Engineering and Mining Journal, vol. 202, pp. 17–26, 2001.
[5] J. Bohn and J. Teizer, "Benefits and Barriers of Construction Project Monitoring Using High-Resolution Automated Cameras," Journal of Construction Engineering and Management, vol. 136, pp. 632-640, 2009.
[6] S. I. Nakagawa, K. I. Soh, S. I. Mine, and H. Saito, "Image systems using RFID tag positioning information," NTT Technical Review Journal, vol. 1, pp. 79-83, 2003.
[7] J. Banks, RFID applied: Wiley.com, 2007.
[8] "Convergence System Ltd. RTLS development kit," 2013.
[9] RFID and Rail: Advanced Tracking Technology; An interview with RFID pioneer J. Landt [Online]. Available: http://www.railway-technology.com/features/feature1684/
[10] J. Crabtree. (1995) Advantage I-75 - Electronic Clearance Test Project. Public Roads.
[11] Airport RFID Services. Available: http://www.transcore.com/rfid#963767e05df8b6f90c45c200b5ef4fde
[12] Track and locate. Available: http://www.rfidc.com/
[13] R. Want, RFID Explained: A Primer on Radio Frequency Identification Technologies, 2008.
[14] R. Want, "An introduction to RFID technology," Pervasive Computing, IEEE, vol. 5, pp. 25-33, 2006.
[15] K. Bonsor. How E-Zpass works. Available: http://auto.howstuffworks.com/e-zpass1.htm
[16] S. Ahson and M. Ilyas, RFID handbook: applications, technology, security, and privacy. Boca Raton: CRC Press, 2008.
[17] T. Deyle, C. C. Kemp, and M. S. Reynolds, "Probabilistic UHF RFID tag pose estimation with multiple antennas and a multipath RF propagation model," in Intelligent Robots and Systems, 2008. IROS 2008. IEEE/RSJ International Conference on, 2008, pp. 1379-1384.
[18] H. Xin, R. Janaswamy, and A. Ganz, "Scout: Outdoor Localization Using Active RFID Technology," in Broadband Communications, Networks and Systems, 2006. BROADNETS 2006. 3rd International Conference on, 2006, pp. 1-10.
[19] K. Chawla, G. Robins, and Z. Liuyi, "Object localization using RFID," in Wireless Pervasive Computing (ISWPC), 2010 5th IEEE International Symposium on, 2010, pp. 301-306.
[20] Dallas Zoo Tracks Elephants Using CSL Real Time Location System. Available: http://rfid.net/news/399-dallas-zoo-track-elephants-real-time-location-system
[21] How to Install a Real Time Location System – RTLS. Available: http://rfid.net/basics/rtls/241-how-to-install-a-real-time-location-system-rtls
[22] Ubisense research and development packages. Available: http://www.ubisense.net/en/rtls-solutions/research-packages.html
[23] AeroScout: Technology overview. Available: http://www.aeroscout.com/technology
[24] R. Buik, "GPS guidance and automated steering renew interest in precision farming technique," Trimble Navigation Limited, July, pp. 1-10, 2006.
[25] M. Vossiek, L. Wiebking, P. Gulden, J. Wieghardt, C. Hoffmann, and P. Heide, "Wireless local positioning," Microwave Magazine, IEEE, vol. 4, pp. 77-86, 2003.
[26] S. Gezici, "A survey on wireless position estimation," Wireless Personal Communications, vol. 44, pp. 263-282, 2008.
[27] H. Liu, H. Darabi, P. Banerjee, and J. Liu, "Survey of wireless indoor positioning techniques and systems," Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on, vol. 37, pp. 1067-1080, 2007.
[28] T. Teixeira, G. Dublon, and A. Savvides, "A survey of human-sensing: Methods for detecting presence, count, location, track, and identity," ACM Computing Surveys, vol. 5, 2010.
[29] J. Raper, G. Gartner, H. Karimi, and C. Rizos, "Applications of location-based services: a selected review," Journal of Location Based Services, vol. 1, pp. 89-111, 2007.
[30] R. Zekavat and R. M. Buehrer, Handbook of position location: Theory, practice and advances, vol. 27: Wiley.com, 2011.
[31] L. G. Shapiro and G. Stockman, Computer Vision. Upper Saddle River, NJ: Prentice Hall PTR, 2001.
[32] P. Dollar, C. Wojek, B. Schiele, and P. Perona, "Pedestrian Detection: An Evaluation of the State of the Art," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 34, pp. 743-761, 2012.
[33] E. Charniak, Introduction to artificial intelligence: Pearson Education India, 1985.
[34] S. L. Tanimoto, The elements of artificial intelligence using Common Lisp: WH Freeman & Co., 1993.
[35] M. A. Fischler and R. C. Bolles, "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography," Communications of the ACM, vol. 24, pp. 381-395, 1981.
[36] K. Ohmura, A. Tomono, and Y. Kobayashi, "Method Of Detecting Face Direction Using Image Processing For Human Interface," 1988, pp. 625-632.
[37] D. Colbry, G. Stockman, and A. Jain, "Detection of Anchor Points for 3D Face Verification," in Computer Vision and Pattern Recognition - Workshops, 2005. CVPR Workshops. IEEE Computer Society Conference on, 2005, pp. 118-118.
Stockman, "Object representation for recognition-by-alignment," in Object Representation in Computer Vision, ed: Springer, 1995, pp. 77-87. [39] I. Biederman, "Recognition by components: a theory of human image understanding," Psychological review, vol. 94, p. 115, 1987. [40] G. Stockman, "Object Recognition, in Interpretation of Range Images," ed: R. Jain and A. Jain (Eds), 1989. [41] H. Murase and S. K. Nayar, "Visual learning and recognition of 3-D objects from appearance," International journal of computer vision, vol. 14, pp. 5-24, 1995. [42] M. A. Turk and A. P. Pentland, "Face recognition using eigenfaces," in Computer Vision and Pattern Recognition, 1991. Proceedings CVPR '91., IEEE Computer Society Conference on, 1991, pp. 586-591. 254 [43] R. M. Bolle, J. H. Connell, N. Haas, R. Mohan, and G. Taubin, "VeggieVision: a produce recognition system," in Applications of Computer Vision, 1996. WACV '96., Proceedings 3rd IEEE Workshop on, 1996, pp. 244-251. [44] S. M. Khan and M. Shah, "Tracking Multiple Occluding People by Localizing on Multiple Scene Planes," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 31, pp. 505-519, 2009. [45] C. Jin-Long and G. C. Stockman, "Determining pose of 3D objects with curved surfaces," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 18, pp. 52-57, 1996. [46] G. Stockman, "Object recognition and localization via pose clustering," Computer Vision, Graphics, and Image Processing, vol. 40, pp. 361-387, 1987. [47] D. F. Huber and M. Hebert, "A new approach to 3-D terrain mapping," in Intelligent Robots and Systems, 1999. IROS '99. Proceedings. 1999 IEEE/RSJ International Conference on, 1999, pp. 1121-1127 vol.2. [48] Flying robots equipped with 3D gear. Available: http://www.homelandsecuritynewswire.com/dr20120507-flying-robots-equipped-with3d-gear-better-surveillance-on-the-cheap [49] F. Goulette, F. Nashashibi, I. Abuhadrous, S. Ammoun, and C. Laurgeau, "An integrated on-board laser range sensing system for on-the-way city and road modelling," The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. 34, 2006. [50] J. Teizer, "3D range imaging camera sensing for active safety in construction," Electron. J. Inf. Technol. Constr., vol. 13, pp. 103-17, 2008. [51] P. J. Besl and N. D. McKay, "Method for registration of 3-D shapes," in Robotics-DL tentative, 1992, pp. 586-606. [52] S. Thrun, Y. Liu, D. Koller, A. Y. Ng, Z. Ghahramani, and H. Durrant-Whyte, "Simultaneous localization and mapping with sparse extended information filters," The International Journal of Robotics Research, vol. 23, pp. 693-716, 2004. [53] Y. Mae, T. Umetani, T. Arai, and K. Inoue, "Object recognition using appearance models accumulated into environment," in Pattern Recognition, 2000. Proceedings. 15th International Conference on, 2000, pp. 845-848 vol.4. [54] M. Boukraa and S. Ando, "Tag-based vision: assisting 3D scene analysis with radiofrequency tags," in Image Processing. 2002. Proceedings. 2002 International Conference on, 2002, pp. I-269-I-272 vol.1. 255 [55] I. Weiss and M. Ray, "Model-based recognition of 3D objects from single images," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 23, pp. 116-128, 2001. [56] M. Boukraa and S. Ando, "A computer vision system for knowledge-based 3D scene analysis using radio-frequency tags," in Multimedia and Expo, 2002. ICME '02. Proceedings. 2002 IEEE International Conference on, 2002, pp. 245-248 vol.2. [57] C. Cerrada, S. 
[57] C. Cerrada, S. Salamanca, E. Perez, J. A. Cerrada, and I. Abad, "Fusion of 3D Vision Techniques and RFID Technology for Object Recognition in Complex Scenes," in Intelligent Signal Processing, 2007. WISP 2007. IEEE International Symposium on, 2007, pp. 1-6.
[58] M. Adan, A. Adan, C. Cerrada, P. Merchan, and S. Salamanca, "Weighted cone-curvature: applications for 3D shapes similarity," in 3-D Digital Imaging and Modeling, 2003. 3DIM 2003. Proceedings. Fourth International Conference on, 2003, pp. 458-465.
[59] C. Cerrada, S. Salamanca, A. Adan, E. Perez, J. A. Cerrada, and I. Abad, "Improved Method for Object Recognition in Complex Scenes by Fusioning 3-D Information and RFID Technology," Instrumentation and Measurement, IEEE Transactions on, vol. 58, pp. 3473-3480, 2009.
[60] H. Hontani, K. Baba, T. Kugimiya, K. Sato, and M. Nakagawa, "Visual tracking system using an ID-tag and the network," in SICE 2003 Annual Conference, 2003, pp. 2375-2380 Vol.3.
[61] H. Hontani, M. Nakagawa, T. Kugimiya, K. Baba, and M. Sato, "A visual tracking system using an RFID-tag," in SICE 2004 Annual Conference, 2004, pp. 2720-2723 vol. 3.
[62] O. Camps, P. J. Flynn, and G. C. Stockman, "Recent progress in CAD-based computer vision: an introduction to the special issue," Computer Vision and Image Understanding, vol. 69, pp. 251-252, 1998.
[63] C. Nak Young, H. Hongu, M. Miyazaki, K. Takemura, K. Ohara, K. Ohba, et al., "Robots on self-organizing knowledge networks," in Robotics and Automation, 2004. Proceedings. ICRA '04. 2004 IEEE International Conference on, 2004, pp. 3494-3499 Vol.4.
[64] J.-Y. Kim, C.-J. Im, S.-W. Lee, and H.-G. Lee, "Object recognition using smart tag and stereo vision system on pan-tilt mechanism," in Proceedings of International Conference on Computer Applications in Shipbuilding, 2005, pp. 2379-2384.
[65] R. Poppe, "A survey on vision-based human action recognition," Image and Vision Computing, vol. 28, pp. 976-990, 2010.
[66] D. F. Hsu, Y.-S. Chung, and B. S. Kristal, "Combinatorial Fusion Analysis: Methods and Practices of Combining Multiple Scoring Systems," in Advanced Data Mining Technologies in Bioinformatics, ed: IGI Global, 2006, pp. 32-62.
[67] H.-H. Hsu, Z. Cheng, T. Huang, and Q. Han, "Behavior analysis with combined RFID and video information," presented at the Proceedings of the Third international conference on Ubiquitous Intelligence and Computing, Wuhan, China, 2006.
[68] N. Krahnstoever, J. Rittscher, P. Tu, K. Chean, and T. Tomlinson, "Activity Recognition using Visual Tracking and RFID," in Application of Computer Vision, 2005. WACV/MOTIONS '05 Volume 1. Seventh IEEE Workshops on, 2005, pp. 494-500.
[69] W. Jianxin, A. Osuntogun, T. Choudhury, M. Philipose, and J. M. Rehg, "A Scalable Approach to Activity Recognition based on Object Use," in Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, 2007, pp. 1-8.
[70] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International journal of computer vision, vol. 60, pp. 91-110, 2004.
[71] P. Sangho and H. Kautz, "Hierarchical recognition of activities of daily living using multi-scale, multi-perspective vision and RFID," in Intelligent Environments, 2008 IET 4th International Conference on, 2008, pp. 1-4.
[72] T. Deyle, C. Anderson, C. C. Kemp, and M. S. Reynolds, "A foveated passive UHF RFID system for mobile manipulation," in Intelligent Robots and Systems, 2008. IROS 2008. IEEE/RSJ International Conference on, 2008, pp. 3711-3716.
Kemp, "RF vision: RFID receive signal strength indicator (RSSI) images for sensor fusion and mobile manipulation," in Intelligent Robots and Systems, 2009. IROS 2009. IEEE/RSJ International Conference on, 2009, pp. 5553-5560. [74] A. Nemmaluri, M. D. Corner, and P. Shenoy, "Sherlock: automatically locating objects for humans," presented at the Proceedings of the 6th international conference on Mobile systems, applications, and services, Breckenridge, CO, USA, 2008. [75] X. Liu, M. D. Corner, and P. Shenoy, "Ferret: RFID localization for pervasive multimedia," presented at the Proceedings of the 8th international conference on Ubiquitous Computing, Orange County, CA, 2006. [76] T. Germa, F. Lerasle, N. Ouadah, V. Cadenat, and M. Devy, "Vision and RFID-based person tracking in crowds from a mobile robot," in Intelligent Robots and Systems, 2009. IROS 2009. IEEE/RSJ International Conference on, 2009, pp. 5591-5596. [77] T. L. McDaniel, K. Kahol, D. Villanueva, and S. Panchanathan, "Integration of RFID and computer vision for remote object perception for individuals who are blind," presented at 257 the Proceedings of the 2008 Ambi-Sys workshop on Haptic user interfaces in ambient media systems, Quebec City, Canada, 2008. [78] C. Heesung and H. Kyuseo, "Combination of RFID and Vision for Mobile Robot Localization," in Intelligent Sensors, Sensor Networks and Information Processing Conference, 2005. Proceedings of the 2005 International Conference on, 2005, pp. 7580. [79] L. Weiguo, J. Songmin, Y. Fei, and K. Takase, "Topological navigation of mobile robot using ID tag and WEB camera," in Intelligent Mechatronics and Automation, 2004. Proceedings. 2004 International Conference on, 2004, pp. 644-649. [80] P. Kamol, S. Nikolaidis, R. Ueda, and T. Arai, "RFID Based Object Localization System Using Ceiling Cameras with Particle Filter," in Future Generation Communication and Networking (FGCN 2007), 2007, pp. 37-42. [81] J. Songmin, S. Erzhe, T. Abe, and K. Takase, "Localization of Mobile Robot with RFID Technology and Stereo Vision," in Mechatronics and Automation, Proceedings of the 2006 IEEE International Conference on, 2006, pp. 508-513. [82] J. Songmin, S. Jinbuo, and K. Takase, "Obstacle recognition for a service mobile robot based on RFID with multi-antenna and stereo vision," in Information and Automation, 2008. ICIA 2008. International Conference on, 2008, pp. 125-130. [83] J. Songmin, S. Jibuo, D. Chugo, and K. Takase, "Human recognition using RFID technology and sterero vision," in Robotics and Biomimetics, 2007. ROBIO 2007. IEEE International Conference on, 2007, pp. 1488-1493. [84] J. Songmin, S. Jinbuo, and K. Takase, "Human recognition using RFID system with multi-antenna," in Advanced Intelligent Mechatronics, 2008. AIM 2008. IEEE/ASME International Conference on, 2008, pp. 1213-1218. [85] Y. Po, W. Wenyan, M. Moniri, and C. C. Chibelushi, "RFID tag infrastructures for camera tracking in virtual studio environment," in Visual Media Production, 2007. IETCVMP. 4th European Conference on, 2007, pp. 1-8. [86] W. Yu, J. Kato, Z. Wei, and S. Yokoi, "Digest Generation of Kindergarten Surveillance Video with Location Information and Visual Features," in Innovative Computing, Information and Control (ICICIC), 2009 Fourth International Conference on, 2009, pp. 768-771. [87] F. Zoega, "Review of the Current State of Radio Frequency Identification (RFID) Technology, Its Use and Potential Future Use in Construction," 2006. 258 [88] J. Yang, O. Arif, P. A. Vela, J. Teizer, and Z. 
Shi, "Tracking multiple workers on construction sites using video cameras," Advanced Engineering Informatics, vol. 24, pp. 428-434, 2010. [89] E. C. Jones, K. Kopocis, T. Wentz, R. Franca, and T. L. Stentz, "Measuring the Effectiveness of RFID on Mechanical Contracting Jobsites: A Practical Evaluation," University of Nebraska, LincolnNovember 28, 2007. [90] R. H. Raza and G. C. Stockman, "Target tracking and surveillance by fusing stereo and RFID information," in Proc. of SPIE Vol, 2012, pp. 83921J-1. [91] R. H. Raza and G. C. Stockman, "Fusion of stereo vision and RFID for site safety," in Proceedings of 25th International conference on Computer Applications in Industry and Engineering, New Orleans, Louisiana USA, 2012. [92] N. Michael, J. Fink, and V. Kumar, "Cooperative manipulation and transportation with aerial robots," Autonomous Robots, pp. 1-14, 2009. [93] Vicon systems Available: http://www.vicon.com [94] Shell game. Available: http://en.wikipedia.org/wiki/Shell_game [95] R. O. Duda and P. E. Hart, Pattern classification and scene analysis. New York,: Wiley, 1973. [96] I. K. Sethi and R. Jain, "Finding Trajectories of Feature Points in a Monocular Image Sequence," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. PAMI-9, pp. 56-73, 1987. [97] C. J. Veenman, M. J. Reinders, and E. Backer, "Motion tracking as a constrained optimization problem," Pattern Recognition, vol. 36, pp. 2049-2067, 2003. [98] Queen Victoria building construction site, Melbourne. Available: http://commons.wikimedia.org/wiki/File:QV_Building_construction_site,_Melbourne__March_2002.jpg [99] N. Vaidya and S. R. Das, "Rfid-based networks: exploiting diversity and redundancy," ACM SIGMOBILE Mobile Computing and Communications Review, vol. 12, pp. 2-14, 2008. [100] X. Yonghong and J. Qiang, "A new efficient ellipse detection method," in Pattern Recognition, 2002. Proceedings. 16th International Conference on, 2002, pp. 957-960 vol.2. [101] S. T. Barnard and M. A. Fischler, "Computational stereo," ACM Computing Surveys (CSUR), vol. 14, pp. 553-572, 1982. 259 [102] R. M. Haralock and L. G. Shapiro, Computer and robot vision: Addison-Wesley Longman Publishing Co., Inc., 1991. [103] R. Jain, R. Kasturi, and B. G. Schunck, Machine vision. New York: McGraw-Hill, 1995. [104] C. Peng, G. Shen, Y. Zhang, Y. Li, and K. Tan, "Beepbeep: a high accuracy acoustic ranging system using cots mobile devices," in Proceedings of the 5th international conference on Embedded networked sensor systems, 2007, pp. 1-14. [105] R. Boudjemaa and A. B. Forbes, Parameter Estimation Methods for Data Fusion: National Physical Laboratory.Great Britain, Centre for Mathematics and Scientific Computing, 2004. [106] H. F. Durrant-Whyte, "Sensor models and multisensor integration," The International Journal of Robotics Research, vol. 7, pp. 97-113, 1988. [107] S. Houzelle and G. Giraudon, "Contribution to multisensor fusion formalization," Robotics and autonomous systems, vol. 13, pp. 69-85, 1994. [108] K. Lai, B. Liefeng, R. Xiaofeng, and D. Fox, "A large-scale hierarchical multi-view RGB-D object dataset," in Robotics and Automation (ICRA), 2011 IEEE International Conference on, 2011, pp. 1817-1824. [109] Detection, Inspection, and Enforcement. Available: http://www.nist.gov/mml/mmsd/security_technologies/detection.cfm [110] J. Walton and J. Crabtree, "A needs assessment and technology evaluation for roadside identification of commercial vehicles," SAE transactions, vol. 108, pp. 516-522, 1999. [111] M. Maeterlinck, A. L. 
[111] M. Maeterlinck, A. L. Teixeira de Mattos, and A. Sutro, Joyzelle. New York: Dodd, Mead and Company, 1907.
[112] IP Cam Viewer Pro Application. Available: https://itunes.apple.com/us/app/ip-camviewer-pro/id402656416
[113] B. K. P. Horn and B. G. Schunck, "Determining Optical Flow," Massachusetts Institute of Technology, 1980.
[114] B. D. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," presented at the Proceedings of the 7th international joint conference on Artificial intelligence - Volume 2, Vancouver, BC, Canada, 1981.
[115] RWTH - Mindstorms NXT Toolbox for MATLAB. Available: http://www.mindstorms.rwth-aachen.de/
[116] Sensor Data Application. Available: https://itunes.apple.com/us/app/sensordata/id397619802
[117] Total Stations. Available: http://www.brandt.ca/SiteCollectionImages/TotalStations/GTS-240NW/GTS-240NW.jpg
[118] L. Hui, H. Darabi, P. Banerjee, and L. Jing, "Survey of Wireless Indoor Positioning Techniques and Systems," Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on, vol. 37, pp. 1067-1080, 2007.
[119] S. Gezici, "A Survey on Wireless Position Estimation," Wireless Personal Communications, vol. 44, pp. 263-282, 2008.
[120] M. Vossiek, L. Wiebking, P. Gulden, J. Weighardt, and C. Hoffmann, "Wireless local positioning-concepts, solutions, applications," in Radio and Wireless Conference, 2003. RAWCON'03. Proceedings, 2003, pp. 219-224.
[121] W. Xu and S. Zekavat, "Spatially correlated multi-user channels: LOS vs. NLOS," in Digital Signal Processing Workshop and 5th IEEE Signal Processing Education Workshop, 2009. DSP/SPE 2009. IEEE 13th, 2009, pp. 308-313.
[122] P. Bahl and V. N. Padmanabhan, "RADAR: an in-building RF-based user location and tracking system," in INFOCOM 2000. Nineteenth Annual Joint Conference of the IEEE Computer and Communications Societies. Proceedings. IEEE, 2000, pp. 775-784 vol.2.
[123] A. M. Ladd, K. E. Bekris, A. Rudys, L. E. Kavraki, and D. S. Wallach, "Robotics-based location sensing using wireless ethernet," Wireless Networks, vol. 11, pp. 189-204, 2005.
[124] F. Shih-Hau, L. Tsung-Nan, and L. Kun-Chou, "A Novel Algorithm for Multipath Fingerprinting in Indoor WLAN Environments," Wireless Communications, IEEE Transactions on, vol. 7, pp. 3579-3588, 2008.
[125] U. Grossmann, M. Schauch, and S. Hakobyan, "RSSI based WLAN Indoor Positioning with Personal Digital Assistants," in Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications, 2007. IDAACS 2007. 4th IEEE Workshop on, 2007, pp. 653-656.
[126] X. Yubin, W. Yong, and M. Lin, "A Novel WLAN Indoor Positioning Algorithm Based on Positioning Characteristics Extraction," in Genetic and Evolutionary Computing (ICGEC), 2010 Fourth International Conference on, 2010, pp. 134-137.
[127] M. Brunato and C. Kiss Kallo, "Transparent location fingerprinting for wireless services," 2002.
[128] A. LaMarca, J. Hightower, I. Smith, and S. Consolvo, "Self-mapping in 802.11 location systems," in UbiComp 2005: Ubiquitous Computing, ed: Springer, 2005, pp. 87-104.
[129] Y. Ji, S. Biaz, S. Pandey, and P. Agrawal, "ARIADNE: a dynamic indoor signal map construction and localization system," in Proceedings of the 4th international conference on Mobile systems, applications and services, 2006, pp. 151-164.
[130] L. M. Ni, Y. Liu, Y. C. Lau, and A. P. Patil, "LANDMARC: indoor location sensing using active RFID," Wireless Networks, vol. 10, pp. 701-710, 2004.
Richton, "Geolocation and assisted GPS," Computer, vol. 34, pp. 123-125, 2001. [132] E911 Phase II Decision: Fact Sheet of FCC Wireless 911 Requirements. Available: http://transition.fcc.gov/pshs/services/911services/enhanced911/archives/factsheet_requirements_012001.pdf [133] J. J. Caffery and G. L. Stuber, "Overview of radiolocation in CDMA cellular systems," Communications Magazine, IEEE, vol. 36, pp. 38-45, 1998. [134] B. Ludden and L. Lopes, "Cellular based location technologies for UMTS: a comparison between IPDL and TA-IPDL," in Vehicular Technology Conference Proceedings, 2000. VTC 2000-Spring Tokyo. 2000 IEEE 51st, 2000, pp. 1348-1353 vol.2. [135] I. K. Adusei, K. Kyamakya, and K. Jobmann, "Mobile positioning technologies in cellular networks: an evaluation of their performance metrics," in MILCOM 2002. Proceedings, 2002, pp. 1239-1244 vol.2. [136] I. Constandache, R. R. Choudhury, and I. Rhee, "Towards Mobile Phone Localization without War-Driving," in INFOCOM, 2010 Proceedings IEEE, 2010, pp. 1-9. [137] I. Constandache, X. Bao, M. Azizyan, and R. R. Choudhury, "Did you see Bob?: human localization using mobile phones," in Proceedings of the sixteenth annual international conference on Mobile computing and networking, 2010, pp. 149-160. [138] T. Kos, M. Grgic, and J. Kitarovic, "Location Technologies for Mobile Networks," in Systems, Signals and Image Processing, 2007 and 6th EURASIP Conference focused on Speech and Image Processing, Multimedia Communications and Services. 14th International Workshop on, 2007, pp. 319-322. [139] E. Martin, O. Vinyals, G. Friedland, and R. Bajcsy, "Precise indoor localization using smart phones," in Proceedings of the international conference on Multimedia, 2010, pp. 787-790. [140] M. Azizyan, I. Constandache, and R. Roy Choudhury, "SurroundSense: mobile phone localization via ambience fingerprinting," in Proceedings of the 15th annual international conference on Mobile computing and networking, 2009, pp. 261-272. 262 [141] J. Qiu, D. Chu, X. Meng, and T. Moscibroda, "On the feasibility of real-time phone-tophone 3d localization," in Proceedings of the 9th ACM Conference on Embedded Networked Sensor Systems, 2011, pp. 190-203. [142] M. Kessel and M. Werner, "SMARTPOS: Accurate and precise indoor positioning on mobile phones," in MOBILITY 2011, The First International Conference on Mobile Services, Resources, and Users, 2011, pp. 158-163. [143] Y.-C. Tseng, S.-P. Kuo, H.-W. Lee, and C.-F. Huang, "Location tracking in a wireless sensor network by mobile agents and its data fusion strategies," The Computer Journal, vol. 47, pp. 448-460, 2004. [144] N. Patwari and A. O. Hero, "Location estimation accuracy in wireless sensor networks," in Signals, Systems and Computers, 2002. Conference Record of the Thirty-Sixth Asilomar Conference on, 2002, pp. 1523-1527. [145] S. Meguerdichian, F. Koushanfar, G. Qu, and M. Potkonjak, "Exposure in wireless adhoc sensor networks," in Proceedings of the 7th annual international conference on Mobile computing and networking, 2001, pp. 139-150. [146] M. Rudafshani and S. Datta, "Localization in wireless sensor networks," presented at the Proceedings of the 6th international conference on Information processing in sensor networks, Cambridge, Massachusetts, USA, 2007. [147] J. Ash and L. Potter, "Sensor network localization via received signal strength measurements with directional antennas," in Proceedings of the 2004 Allerton Conference on Communication, Control, and Computing, 2004, pp. 1861-1870. [148] A. Boukerche, H. 
[148] A. Boukerche, H. A. B. Oliveira, E. F. Nakamura, and A. A. F. Loureiro, "Localization systems for wireless sensor networks," Wireless Communications, IEEE, vol. 14, pp. 6-12, 2007.
[149] N. Michael, D. Mellinger, Q. Lindsey, and V. Kumar, "The GRASP Multiple Micro-UAV Testbed," Robotics & Automation Magazine, IEEE, vol. 17, pp. 56-65, 2010.
[150] Z. Ping, L. Yue, and W. Yongtian, "Multiple infrared markers based real-time stereo vision positioning system for surgical navigation," in Instrumentation and Measurement Technology Conference, 2009. I2MTC '09. IEEE, 2009, pp. 692-696.
[151] R. B. Huitema, A. L. Hof, and K. Postema, "Ultrasonic motion analysis system—measurement of temporal and spatial gait parameters," Journal of biomechanics, vol. 35, pp. 837-842, 2002.
[152] L. Choon-Young, C. Ho-Gun, P. Jun-Sik, P. Keun-Young, and L. Sang-Ryong, "Collision Avoidance by the Fusion of Different Beam-width Ultrasonic Sensors," in Sensors, 2007 IEEE, 2007, pp. 985-988.
[153] G. Hueber, T. Ostermann, T. Bauernfeind, R. Raschhofer, and R. Hagelauer, "New approach of ultrasonic distance measurement technique in robot applications," in Signal Processing Proceedings, 2000. WCCC-ICSP 2000. 5th International Conference on, 2000, pp. 2066-2069 vol.3.
[154] D. Fox, J. Hightower, L. Lin, D. Schulz, and G. Borriello, "Bayesian filtering for location estimation," Pervasive Computing, IEEE, vol. 2, pp. 24-33, 2003.
[155] D. Fox, W. Burgard, and S. Thrun, "Active markov localization for mobile robots," Robotics and autonomous systems, vol. 25, pp. 195-207, 1998.
[156] M. Montemerlo, S. Thrun, D. Koller, and B. Wegbreit, "FastSLAM: A factored solution to the simultaneous localization and mapping problem," in AAAI/IAAI, 2002, pp. 593-598.
[157] trakSTAR by Ascension Technology Corporation. Available: http://www.ascensiontech.com/realtime/RTtrakSTAR.php
[158] E. R. Bachmann, X. Yun, and A. Brumfield, "Limitations of Attitude Estimation Algorithms for Inertial/Magnetic Sensor Modules," Robotics & Automation Magazine, IEEE, vol. 14, pp. 76-87, 2007.
[159] J. Hummel, M. Figl, C. Kollmann, H. Bergmann, and W. Birkfellner, "Evaluation of a miniature electromagnetic position tracker," Medical physics, vol. 29, p. 2205, 2002.
[160] J. F. O'Brien, R. E. Bodenheimer Jr, G. J. Brostow, and J. K. Hodgins, "Automatic joint parameter estimation from magnetic motion capture data," 1999.
[161] C. Tercero, S. Ikeda, T. Uchiyama, T. Fukuda, F. Arai, Y. Okada, et al., "Autonomous catheter insertion system using magnetic motion capture sensor for endovascular surgery," The International Journal of Medical Robotics and Computer Assisted Surgery, vol. 3, pp. 52-58, 2007.
[162] R. J. Fontana, "Recent applications of ultra wideband radar and communications systems," in Ultra-Wideband, Short-Pulse Electromagnetics 5, ed: Springer, 2002, pp. 225-234.
[163] A. LaMarca, Y. Chawathe, S. Consolvo, J. Hightower, I. Smith, J. Scott, et al., "Place lab: Device positioning using radio beacons in the wild," in Pervasive Computing, ed: Springer, 2005, pp. 116-133.
[164] S. Feldmann, K. Kyamakya, A. Zapater, and Z. Lue, "An Indoor Bluetooth-Based Positioning System: Concept, Implementation and Experimental Evaluation," in International Conference on Wireless Networks, 2003, pp. 109-113.
[165] Y. Suh, "Development of an INS integrated autonomous positioning system for assisting effective fire-fighting activity," KSCE Journal of Civil Engineering, vol. 8, pp. 569-574, 2004.
Trommer, "Indoor 3D position estimation using lowcost inertial sensors and marker-based video-tracking," in Position Location and Navigation Symposium (PLANS), 2010 IEEE/ION, 2010, pp. 319-326. [167] L. Wiebking, M. Glanzer, D. Mastela, M. Christmann, and M. Vossiek, "Remote local positioning radar," in Radio and Wireless Conference, 2004 IEEE, 2004, pp. 191-194. [168] R. Want, A. Hopper, V. Falcão, and J. Gibbons, "The active badge location system," ACM Transactions on Information Systems (TOIS), vol. 10, pp. 91-102, 1992. [169] G. D. Abowd, C. G. Atkeson, J. Hong, S. Long, R. Kooper, and M. Pinkerton, "Cyberguide: A mobile context-aware tour guide," Wireless Networks, vol. 3, pp. 421433, 1997. [170] N. B. Priyantha, A. Chakraborty, and H. Balakrishnan, "The cricket location-support system," in Proceedings of the 6th annual international conference on Mobile computing and networking, 2000, pp. 32-43. [171] A. Harter, A. Hopper, P. Steggles, A. Ward, and P. Webster, "The anatomy of a contextaware application," Wireless Networks, vol. 8, pp. 187-197, 2002. 265