ENERGY EFFICIENT OBJECT DETECTION AND MEASUREMENT FOR SMART GLASSES

By

Jing Yang

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science - Master of Science

2014

ABSTRACT

ENERGY EFFICIENT OBJECT DETECTION AND MEASUREMENT FOR SMART GLASSES

By

Jing Yang

We design and implement a novel object detection and measurement system called Lockon for smart glasses. Lockon takes advantage of the mounting position of smart glasses, and provides users with two useful functions that can benefit a wide range of applications. Lockon can accurately locate the object of interest (OoI) in the view of the user and inform the user of the position of the OoI in real time, using the front-facing camera on the smart glasses and advanced computer vision and image processing techniques. To conserve energy, Lockon implements a motion trigger to intelligently activate the object detection process only when it is necessary. Lockon can also accurately measure the dimension of the object with a 3D ranging technique. This capability allows the user to remotely estimate the dimension of the object. We implement Lockon on Google Glass Explorer Edition, and evaluate the performance of Lockon using extensive experiments. Our results indicate that Lockon can achieve high detection accuracy (0.95 true positive rate and 4.3 × 10^{-7} false positive rate), low object dimension measurement error (3.3% when the distance is less than 1 m), and low delay (300 ms).

This thesis is dedicated to someone.

ACKNOWLEDGEMENTS

I would like to express my gratitude to my supervisor Prof. Guoliang Xing for the continuous support of my master's study and research, and for his patience, motivation, enthusiasm, and immense knowledge. His guidance helped me throughout the study, research, and writing of this thesis. Furthermore, I would like to thank the rest of my thesis committee, Prof. Richard J. Enbody and Prof. Xiaoming Liu, for their encouragement, insightful comments, and valuable discussions. Finally, and most importantly, I would like to thank my husband Ruogu Zhou. His support, encouragement, patience, and unwavering love were undeniably the bedrock upon which my life has been built. I would like to express my heartfelt gratitude to my parents, Jianhua Yang and Yuying Kang, for their faith in me and for allowing me to be as ambitious as I wanted. Also, I thank Ruogu's parents, Hongbin Zhou and Xiaojuan Jin, for providing me with unending encouragement and support. I thank my dear daughter Eva Zhou. She spent countless hours entertaining herself while I sat at the computer typing away. You are the best little girl. I love you.

TABLE OF CONTENTS

LIST OF FIGURES
CHAPTER 1 INTRODUCTION
  1.1 Related Work
CHAPTER 2 BACKGROUND
  2.1 Smart Devices and Wearable Devices
  2.2 Smart Glasses
  2.3 Object Detection
  2.4 OpenCV: Open Source Computer Vision
CHAPTER 3 CHALLENGES AND SYSTEM OVERVIEW
  3.1 Challenges
  3.2 System Overview
CHAPTER 4 ROBUST OBJECT DETECTION
  4.1 Viola-Jones Cascade Detector
  4.2 HOG: Histogram of Oriented Gradients
  4.3 Performance Optimization
  4.4 Detector Training
CHAPTER 5 MOTION TRIGGERED DETECTION
  5.1 Gravity Removal
CHAPTER 6 OBJECT DIMENSION MEASUREMENT
  6.1 Stereo Triangulation Based Dimension Measurement
  6.2 Head-tilting Scheme
  6.3 Discussion
CHAPTER 7 IMPLEMENTATION
CHAPTER 8 EXPERIMENTATION
  8.1 Detection Performance
    8.1.1 Detection Accuracy vs Number of Stages
    8.1.2 Detection Delay
  8.2 Accuracy of Distance Measurement
CHAPTER 9 CONCLUSION
BIBLIOGRAPHY

LIST OF FIGURES

Figure 2.1 Examples of smart devices.
Figure 2.2 Examples of wearable devices.
Figure 2.3 A Google Glass Explorer Edition and its components.
Figure 3.1 Architecture of Lockon on the Android platform.
Figure 3.2 User Interface of Lockon.
Figure 4.1 An illustration of a cascade detector.
Figure 4.2 An illustration of multiscale detection.
Figure 4.3 Visualization of HOG features.
Figure 4.4 Cascade detector training GUI provided by Matlab.
Figure 4.5 Training samples.
Figure 5.1 Linear acceleration measurement of a device in motion.
Figure 6.1 An illustration of a typical camera system.
Figure 6.2 The camera-object distance can be computed using stereo triangulation.
Figure 6.3 Illustration of the head-tilting scheme.
Figure 8.1 Detection accuracies of detectors using HOG and LBP feature descriptors.
Figure 8.2 Overall false positive rates and true positive rates.
Figure 8.3 Detection delay incurred on Google Glass.
Figure 8.4 Measurement error of the camera-object distance.
CHAPTER 1

INTRODUCTION

Recent years have witnessed the emergence of a new class of wearable devices, including smart watches, smart glasses, and smart bracelets for health tracking. A representative example of smart glasses is Google Glass [3]. It is estimated that sales of smart glasses will grow from 87,000 units in 2013 to at least 10 million by 2018 [2]. Almost all the major players in the mobile industry, including Google, Samsung, and Apple, have already released smart glasses products or are reported to be planning to release them in the near future. Integrating ubiquitous connectivity (e.g., cellular, WiFi, and Bluetooth), rich computing capability, and versatile sensing ability (e.g., camera, microphone, and accelerometer) into a pair of glasses that can be worn comfortably by users, smart glasses offer unique advantages over traditional smart devices like smartphones and tablets. The high mounting position (on the user's head) gives smart glasses an unobstructed view, allowing them to see what the users are looking at. As the screen of the smart glasses is mounted directly above or over the eyes of the user, it can provide the user with real-time information without requiring any cumbersome interaction between the user and the device. These traits of smart glasses open up a wide range of new applications, ranging from augmented reality and activity tracking to vision-based crowd sensing.

A fundamental functionality required by these exciting new applications is real-time object detection and measurement. Sharing the view of the user, smart glasses can help the user identify objects of interest in the view, and offer the user helpful information accordingly. By measuring the distance between the user and the object, as well as the dimension of the object, they can enable many exciting applications in areas such as medical care, tourist guiding, and education. However, existing object detection systems [18] [14] designed for general smart devices usually incur high computational and energy overhead, and thus cannot work well on smart glasses that are equipped with slower CPUs and significantly smaller batteries. Moreover, existing object detection systems are usually optimized for general-purpose smartphones and tablets, and thus are cumbersome to operate on smart glasses that lack a friendly user interface. Existing remote object dimension measurement approaches [20] [21] also cannot be readily applied to smart glasses, since their operation requires either special hardware that is not available on smart glasses, or stationary fixtures that are ill-suited for mobile applications.

We propose a new real-time object detection and measurement system called Lockon for smart glasses. Lockon adopts the Viola-Jones object detection framework to achieve high computational efficiency, and the HOG feature descriptor to offer robustness against illumination variations. We extensively optimize the detector to improve its robustness to object orientation variations while minimizing the detection delay. To reduce the energy consumption incurred by object detection, we design a motion trigger to intelligently activate the object detection process if object detection is needed, and deactivate the process otherwise. Lockon accurately measures the dimension of the object using the stereo triangulation technique. We design a scheme called head-tilting for users to conveniently measure objects at close distance (< 1 m). In summary, we make the following major contributions in this work.
1. We design a robust, low-complexity, and energy-efficient object detection system for smart glasses. This system can detect various types of objects with high accuracy and low delay. We extensively optimize the detector to improve robustness. To reduce system energy consumption, we also design a motion trigger that intelligently determines if object detection is needed, and activates/deactivates the object detection process accordingly.

2. We design a novel object dimension measurement system, which can measure objects accurately at close range (< 1 m) using 3D ranging techniques. We also design a scheme for smart glasses users to rapidly and conveniently measure objects by simply tilting their heads.

3. We implement a prototype of Lockon on Google Glass Explorer Edition, and evaluate its performance using extensive experiments. The results indicate that Lockon can achieve high detection accuracy and object dimension measurement accuracy while incurring only low detection delay.

1.1 Related Work

Real-time object detection has drawn extensive research interest for decades. D. M. Gavrila et al. proposed a method using distance transform based matching techniques to detect stop signs and pedestrians using a camera mounted on a vehicle [14]. A security surveillance system [18] proposed by A. Roy et al. detects moving objects in real time using background modeling techniques. Although these approaches show adequate performance in detecting certain objects under typical settings, they may perform poorly in detecting objects subject to large illumination and orientation variations. Moreover, these approaches require powerful computing hardware to achieve high accuracy and low delay, and thus are not applicable to resource-constrained mobile systems. Various modern digital cameras [6] can accurately detect faces present in the viewfinder with low delay. However, the detection techniques used on cameras have been extensively optimized for detecting only faces, and thus may not provide adequate performance in detecting other types of objects. Several mobile apps like LookTel [5] have been developed for general smart devices like smartphones and tablets, to detect objects of the user's interest. Unfortunately, as these apps are not designed for smart glasses, which have slower CPUs and less energy resources, they tend to perform poorly on smart glasses. Moreover, these apps require extensive manual interaction with the device, which is ill-suited for smart glasses that are usually cumbersome to physically interact with.

Remote dimension measurement has been well studied in applications like passive remote sensing [8], in which a satellite or an aircraft flies over an area and uses onboard cameras to take photos of the area. The terrain of the area can be reconstructed using the photos and advanced image processing techniques. Although this technique could be applied to measure the dimension of objects, it incurs significant computational overhead in finding the matching points of the objects in the photos, and thus cannot be used on smart glasses. Techniques [21] commonly used in 3D object modeling can accurately measure the dimension of the modeled objects. However, most of them involve using special hardware to generate a laser beam that slowly sweeps through the surface of the entire object. As a result, these techniques cannot be applied to dimension measurement in Lockon. T. Wang et al.
propose an alternative approach [20] to measure the 3D dimension of a remote object with only a single camera mounted on an adjustable tripod. The stereo triangulation technique is used to measure the object dimension. However, this technique requires using a stationary tripod, and thus cannot be employed by smart glasses, which are highly mobile.

CHAPTER 2

BACKGROUND

2.1 Smart Devices and Wearable Devices

Figure 2.1: Examples of smart devices. From left to right: a Google Nexus 7 tablet, a Nintendo 3DS game console, and an Apple iPhone 5.

Integrating ubiquitous connectivity, rich computing capability, and versatile sensing ability, smart devices can intelligently collect data and make decisions, enabling a wide range of novel applications. An example of such an application is Runkeeper [9], a fitness-tracking application for Android and iOS. It tracks users' physical activities, including running, walking, cycling, and hiking, using various sensors including GPS and the accelerometer. Smart devices are equipped with a diverse set of sensors, including camera, microphone, accelerometer, digital compass, gyroscope, GPS, and thermometer. Connectivity-wise, smart devices are commonly equipped with WiFi, Bluetooth, and 3G/4G cellular interfaces. Some devices also have built-in ZigBee and NFC support. As the large number of onboard sensors and the ubiquitous connectivity can generate a large volume of data in a short time, smart devices must have sufficient computational power to process the generated data. Over the last few years, the computational power of smart devices has increased rapidly, as indicated by the fast-growing CPU operating frequency and number of CPU cores.

Figure 2.2: Examples of wearable devices, including a watch, wristband, and glasses. Photo is from http://bits.blogs.nytimes.com.

The advance of smart devices has triggered the emergence of a class of wearable devices [11], which are integrated into clothing and accessories, such as watches, glasses, and headbands, and can be comfortably worn by users. Equipped with similar capabilities as general smart devices, wearable devices can perform not only tasks commonly found on general smart devices, but also specialized applications, thanks to the integration of special sensors and the unique mounting positions of wearable devices. For example, in medical care applications, a patient's heart rate can be monitored remotely by a bracelet-like wearable device attached to the wrist of the patient. It is estimated that 90 million wearable devices will be shipped worldwide in 2014.

2.2 Smart Glasses

Smart glasses are gaining significant momentum as the market for wearable devices expands rapidly. It is estimated that sales of smart glasses will grow from 87,000 units in 2013 to at least 10 million by 2018 [2]. A growing number of big players, including Google, Samsung, and Apple, have entered or plan to enter this market. Between the launch of the first smart glasses (Google Glass) and April 2014, over 20 glasses models were launched to the consumer market, with price tags ranging from $25 to a few thousand dollars. The fast growth of smart glasses' popularity is largely due to their potential to enable numerous new applications. For example, in medical care, smart glasses can be used to display real-time patient health data to doctors and nurses on the fly, who can then react to emergency health conditions immediately.

Figure 2.3: A Google Glass Explorer Edition and its components: battery, sensors, touchpad, display, and camera.
A representative example of smart glasses is Google Glass. It has a TI OMAP 4430 processor, 1 GB of RAM, and 16 GB of storage, which are powerful enough to run most applications found on general smart devices. For connectivity, Google Glass has 802.11b/g and Bluetooth radios. A small reflective surface above the eye position reflects the image projected from a small LCD inside the Glass, allowing a clear display without blocking the normal vision of the user. Besides the sensors commonly available on general smart devices (accelerometer, gyroscope, compass, etc.), it also has a wink sensor that can detect eye blinking of the user, which enables a wide range of new applications like fatigue measurement and provides a new user interface. A touchpad on the side of the Glass enables easy interaction, although the Glass can also be controlled by voice commands. The Glass is also equipped with a front-facing camera beside the reflective surface. Due to the unique mounting position (on the user's head), Google Glass always faces the same way as the user's head, and is almost always horizontally positioned. This trait of Google Glass significantly benefits certain applications such as real-time object tracking/detection, reality augmentation, and activity tracking.

2.3 Object Detection

Object detection is the task of finding the object of interest (OoI) in an image or video sequence, using object models that are known a priori. Usually, object detection algorithms use extracted features and learning algorithms to recognize instances of an object category. Although it has been studied for years, a robust, accurate, and fast object detection approach is still a great challenge today. Many factors can affect the detection performance, such as the amount of visual features of the target object, training image quality and quantity, and the characteristics of the detection algorithms. Moreover, objects may look significantly different under varying environmental factors, such as illumination, viewing perspective, and distance. Popular object detection algorithms fall into three basic categories. Geometry-based approaches employ 3D geometric models of the OoI to deal with the appearance variation caused by varying perspective and illumination. Appearance-based approaches utilize advanced feature descriptors and pattern recognition algorithms to find the shape and the texture of the OoI. Feature-based approaches discover interest points of the OoI that are insensitive to scale, perspective, and illumination changes.

Object detection is fundamental to many important applications, including industrial vision-based control (e.g., robot control on assembly lines) and human-computer interfaces (e.g., Microsoft Kinect). Recently, the increasing popularity of smart devices has promoted the wide adoption of object detection techniques in mobile app development. Two examples of such apps are diet tracking and camera-based vehicle adaptive cruise control. Due to the popularity of object detection techniques in smart device development, many computer vision toolboxes, such as OpenCV and FastCV, now offer smart device support to facilitate mobile app development. The object detection implemented in our system is based on OpenCV.

2.4 OpenCV: Open Source Computer Vision

Open Source Computer Vision (OpenCV) is an open-source computer vision library for real-time image processing.
It offers more than 2,500 computer vision related algorithms, and supports a wide range of operating systems, such as Windows, Linux, Mac OS X, Android, and iOS. It is widely adopted in real-time computer vision applications, including camera calibration, face recognition, gesture recognition, and motion tracking.

CHAPTER 3

CHALLENGES AND SYSTEM OVERVIEW

3.1 Challenges

As mentioned in Chapter 1, the mounting position of smart glasses provides them with several unique traits. Lockon takes advantage of these traits and provides users with two useful functions that can benefit a wide range of applications. Lockon can accurately locate the object of interest (OoI) in the view of the user and inform the user of the position of the OoI in real time, using the front-facing camera on the smart glasses and advanced computer vision and image processing techniques. Moreover, Lockon can also accurately measure the dimension of the object with a 3D ranging technique. This capability allows the user to remotely estimate the dimension of the object. Applications of Lockon can be found in areas such as medical care, augmented reality, tourist guiding, and education. For example, Lockon can be employed by automatic museum tour guide systems, which identify and locate the exhibits appearing in the view of the tourist, and offer useful information about the exhibits to the tourist. Lockon can also be used in wildlife observation, in which Lockon detects and identifies a wild animal present in the user's view, and measures the size of the animal as well as the distance to the animal.

Several challenges must be addressed in the design of Lockon. First, smart glasses usually have a very tight energy budget due to their small form factor. However, object detection and image processing are intrinsically compute-intensive. Processing images taken by the camera at a high rate in real time consumes a large amount of energy. Lockon must be able to minimize its energy consumption without sacrificing detection performance. Second, as the OoI can be presented to Lockon in a variety of illumination conditions and orientations in practice, Lockon must be robust to these variations. Although a complex detector and extensive image pre-processing could help improve the detector's robustness to these variations, they tend to incur a long detection delay. As a result, achieving robustness in object detection while incurring low delay is challenging. Third, to accurately measure the dimension of the object, the distance between the camera and the object must be measured first. However, the sensors on typical smart glasses cannot measure distance.

3.2 System Overview

Fig. 3.1 illustrates the system architecture of Lockon, which operates as an application on the glasses OS and interacts with the hardware components, such as sensors and display, via system API calls.

Figure 3.1: Architecture of Lockon on the Android platform. The object detection, motion trigger, and object dimension measurement modules sit on top of the OpenCV library and the smart device OS, which exposes the camera, accelerometer, and LCD.

As shown in Fig. 3.1, Lockon is composed of three major components, namely object detection, motion trigger, and object dimension measurement. The object detection module implements the fundamental detection function of Lockon using the OpenCV libraries. It employs the detectors trained offline to determine if the image frames fetched from the camera of the smart device contain the OoI.
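To make the detection path concrete, the following is a minimal sketch of how a detector trained offline can be driven through the OpenCV Java bindings that Lockon builds on. The class name, cascade file name, and frame handling are illustrative assumptions, not code taken from Lockon itself.

    import org.opencv.core.Mat;
    import org.opencv.core.MatOfRect;
    import org.opencv.core.Rect;
    import org.opencv.imgproc.Imgproc;
    import org.opencv.objdetect.CascadeClassifier;

    public class SocketDetector {
        private final CascadeClassifier detector;

        public SocketDetector(String cascadeXmlPath) {
            // Load a cascade trained offline, e.g. "socket_horizontal.xml" (hypothetical name).
            detector = new CascadeClassifier(cascadeXmlPath);
        }

        // Returns the bounding boxes of the OoIs found in one camera frame.
        public Rect[] detect(Mat rgbaFrame) {
            // The OpenCV cascade implementation expects a gray-scale input image.
            Mat gray = new Mat();
            Imgproc.cvtColor(rgbaFrame, gray, Imgproc.COLOR_RGBA2GRAY);

            MatOfRect hits = new MatOfRect();
            detector.detectMultiScale(gray, hits); // multi-scale sliding-window search
            return hits.toArray();
        }
    }

The returned rectangles correspond to the image segments that Lockon marks on the screen.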
Lockon adopts the Viola-Jones detection framework (cascade detection) to improve detection performance and reduce computational overhead. To provide detection robustness to illumination variations, Lockon employs the HOG (Histogram of Oriented Gradients) [13] feature descriptor, which performs localized image normalization to enhance image contrast. Robustness to object orientation change is achieved by utilizing multiple detectors that handle objects viewed from different perspectives. After detection, the image segments that contain the OoI are returned by the detector, and are marked on the screen of the smart device to indicate the locations.

Cascade detectors use a sliding window to scan over the image for the OoI at multiple scales. Consequently, processing images captured from the camera at 30 frames per second incurs significant computational overhead, and the object detection process should therefore be turned off when it is not being used, to conserve the energy of the smart glasses. The motion trigger module of Lockon serves as a switch that intelligently activates the object detection process only when it is necessary. Specifically, Lockon utilizes the accelerometer that is commonly available on smart glasses to monitor the motion of the user, and intelligently determines when to activate the object detection process. This motion triggered detection control scheme does not require users to physically interact with the device, which greatly benefits the usually diminutive wearable devices that are cumbersome to operate.

Remotely measuring the dimension of the OoI is very useful for applications like navigation and tourist guiding. Although the size of the detected segments contains useful information about the dimension of the OoI, accurately calculating it requires knowing the distance between the detected object and the smart device. Based on the detection results, Lockon utilizes a 3D ranging technique called stereo triangulation to accurately measure the distance, as well as the dimension of the detected object. Specifically, the detection results of two frames taken at different locations are analyzed, from which the distance to the detected object and the dimension of the object are computed.

We implemented Lockon on Google Glass Explorer Edition running Android 4.0.3. The UI of Lockon is shown in Fig. 3.2. The yellow boxes at the top of the UI indicate the locations of detected OoIs, with labels indicating the type of each OoI. The green progress bar at the bottom of the UI indicates the status of the motion trigger. The length of the bar is proportional to the duration that the device has been staying still since the last movement. A full-length progress bar indicates that object detection is ongoing.

Figure 3.2: User Interface of Lockon. The objects detected are two wrench sockets used for mechanical work.

CHAPTER 4

ROBUST OBJECT DETECTION

The most fundamental task of this work is to design a robust object detection system for smart glasses. We have a few design objectives. First, the system should be robust, i.e., achieve satisfactory performance (a low false positive rate and a high true positive rate) regardless of the environment (e.g., illumination) or object orientation. Second, the detection algorithm should have low computational complexity, since the object detection is performed on smart glasses, which have limited computational resources and a tight energy budget. Third, the detection delay should be within an acceptable range (< 1 s).
Achieving all of these objectives is challenging on resource-limited smart glasses, requiring careful design and optimization of the detector.

4.1 Viola-Jones Cascade Detector

Lockon adopts the Viola-Jones object detection framework ("Viola-Jones framework") [19] due to its high accuracy and low computational overhead. Proposed by Paul Viola and Michael Jones in 2001, it is the first object detection framework that could offer real-time object detection with favorable accuracy. Due to the use of cascade classifiers, the detection process can be performed with low latency, although it tends to incur high computational overhead during training. Another advantage of the Viola-Jones framework is its versatility in detecting various types of objects, although it was originally proposed for face detection problems.

Most commonly used object detection algorithms employ a window that slides through all the regions of the image that may contain the OoI. Typically, a sample is created from the window after each slide, and processed by the detector. Due to the high computational complexity, detection algorithms proposed prior to the Viola-Jones framework struggled to process video in real time with acceptable detection accuracy. The Viola-Jones framework utilizes a cascade detector to accelerate the detection, which is a type of ensemble learning process. Different from other multiexpert-based ensemble detectors (e.g., voting and stacking) that are constructed with strong detectors running in parallel, a cascade detector is constructed by concatenating weak detectors. The unique advantage of the cascade detector over multiexpert detectors is its low computational complexity. In multiexpert ensemble algorithms, each strong detector processes all the features of a sample to detect the presence of the object, and the final decision is made by voting or stacking. As the strong detectors are generally complex, multiexpert detectors incur high computational overhead and hence are usually slow. On the other hand, cascade detectors adopt weak detectors that only examine a subset of all features at every stage, and utilize the output from the previous stage to facilitate the detection of the next stage. Specifically, the samples that are classified as negative at previous stages are discarded and excluded from the samples to be processed in following stages. To further improve efficiency, the weak detectors are arranged in ascending order of complexity. As a result, negative samples that are easy to determine are quickly discarded by the weakest detectors, leaving only a few difficult samples to be processed by the stronger, more complex classifiers. This technique significantly increases the processing speed without harming the performance much.

Figure 4.1: An illustration of a cascade detector. Samples pass through stages 1 to N; a sample rejected (F) at any stage is classified as a non-object, and only samples accepted (T) by all stages are classified as the object.

We now show two key properties of cascade detectors. For a cascade detector with N stages, the overall true positive rate, D, and the overall false positive rate, F, can be expressed as:

D = \prod_{i=1}^{N} d_i    (4.1)

F = \prod_{i=1}^{N} f_i    (4.2)

where d_i and f_i are the true and false positive rates of stage i, respectively. From Equations 4.1 and 4.2, we can observe two interesting properties of cascade detectors. To achieve a low overall false positive rate F, each stage can have a poor false positive rate f_i. For example, if the per-stage false positive rate is 50%, then the overall false positive rate would be around 10^{-6} when N = 20, which is sufficiently good for most tasks.
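These two properties are easy to check numerically. The short sketch below (with illustrative per-stage rates, not values measured from Lockon) simply multiplies out Equations 4.1 and 4.2 for a 20-stage detector.

    public class CascadeRates {
        public static void main(String[] args) {
            int stages = 20;
            double perStageFp = 0.5;   // each weak stage passes half of the negatives
            double perStageTp = 0.994; // each stage must retain almost all positives

            // Equations 4.1 and 4.2: overall rates are products of per-stage rates.
            double overallFp = Math.pow(perStageFp, stages);
            double overallTp = Math.pow(perStageTp, stages);

            System.out.printf("overall FP = %.2e%n", overallFp); // ~9.5e-7
            System.out.printf("overall TP = %.3f%n", overallTp); // ~0.887
        }
    }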
However, to achieve an acceptable overall true positive rate D, the per-stage true positive rate must be sufficiently close to 1. For example, to achieve an overall true positive rate of 90% with 20 stages, each stage should have a true positive rate of at least 99.4%. To improve detection performance without incurring high computational overhead, cost-aware AdaBoost-based algorithms [15] are adopted as the weak detectors for every stage.

Figure 4.2: An illustration of multiscale detection. A fixed-size sliding window scans the image at several scales.

The Viola-Jones detection algorithm employs a technique called multi-scale object detection to deal with scaling [19], which incrementally scales the image after each round of detection. The size of the sliding window remains unchanged after scaling. The output of the detection process is a series of segments of the image that are determined to contain the OoI. Due to the use of the sliding window and the multi-scale detection, the OoI could be detected several times by the detector, causing the detector to return multiple overlapping segments. A combining algorithm is usually adopted to merge all the overlapping segments into a single segment.

The Viola-Jones framework was originally proposed using Haar-like features for detection. However, other types of features, such as LBP [16] and HOG [13], can also be employed for detection, depending on the application. We discuss the feature used in Lockon in the next section.

4.2 HOG: Histogram of Oriented Gradients

Lockon adopts a feature descriptor called Histogram of Oriented Gradients (HOG) [13], which is widely used for object detection. HOG utilizes the local intensity gradients of objects to characterize the appearance and the shape of the OoI. Specifically, it computes the histogram of the occurrences of gradient orientations in local image segments, and encodes the histogram to obtain the feature descriptor. HOG resembles techniques such as edge orientation histograms, scale-invariant feature transform (SIFT) descriptors, and shape contexts. However, a key difference that separates HOG from other approaches is that HOG utilizes a dense grid of uniformly spaced cells for computing the gradient orientations, and employs overlapping local contrast normalization to improve accuracy. Compared with other features (Haar, LBP, etc.) that are commonly used in conjunction with cascade detectors, HOG has several unique advantages. As the HOG descriptor is calculated based on local cells, it is robust against geometric and photometric transformations, which mainly occur in larger spatial regions. Moreover, as Dalal and Triggs discovered [13], coarse spatial sampling, fine orientation sampling, and strong local photometric normalization allow HOG to ignore the movements of minor parts of the objects, as long as the object maintains the same orientation. These traits make HOG highly suitable for Lockon to detect various types of objects.

Computing HOG starts with calculating the gradients. The entire image is first divided into small spatial regions called cells, which can be either rectangular or radial. The gradients and their orientations are then calculated in each cell. In order to account for contrast and brightness changes caused by illumination and shadowing, each cell is locally normalized within a block, which consists of multiple adjacent cells. As the blocks usually overlap with each other, each cell can be included in multiple blocks.
The blocks can be either rectangular (R-HOG blocks), which can be considered square grids, or circular (C-HOG blocks), which resemble SIFT descriptors. The HOG descriptor is then constructed as the vector of all the components of the normalized histograms of the blocks. The constructed descriptors can be used as features by object detectors.

Figure 4.3: Visualization of HOG features. (a) Original image. (b) Visualized HOG features of the image.

To help better understand HOG features, we visualize them using a feature visualization algorithm [1] developed by an MIT group. Fig. 4.3 (a) and (b) show the original image and the visualized HOG features of that image. Fig. 4.3 (b) depicts the dense grids of HOG, and the intensity and direction of the gradients of each grid. We can see that HOG can accurately capture the shape of the objects. Moreover, it can also capture the fine structure of an object even when it is poorly illuminated (too bright or too dark), thanks to the local normalization.

4.3 Performance Optimization

Achieving good performance with Viola-Jones cascade detectors heavily relies on the implementation and the parameter tuning of the detector. There are some general implementation and tuning guidelines for improving detection performance. Specifically, Lockon employs the following optimization methods.

Lockon adopts multiple detectors for detecting a single object, with each detector handling a limited range of orientation variations of the object. Cascade detectors are usually sensitive to rotation, especially to out-of-plane rotations that distort the aspect ratio of the object. As a result, using a single detector to handle all possible object orientations would not offer good performance. However, using too many detectors inevitably increases the computational complexity and incurs long delay. As a tradeoff, we use two detectors to handle different perspectives of each OoI. Interestingly, we find that the overall computational complexity does not increase drastically. This is because using multiple detectors decreases the orientation variations that each detector has to handle, which results in simpler detectors with lower computational complexity.

Figure 4.4: Cascade detector training GUI provided by Matlab.

There is also a design choice in the number of stages in a detector. Although detectors with more stages and with fewer stages can offer similar overall false and true positive rates if well trained, the computational complexity of a system with more stages is generally lower than that of a system with fewer stages. This is because the overall false positive rate decreases exponentially with each additional stage. For instance, given a per-stage false positive rate of 50%, the overall false positive rates of a two-stage detector and a three-stage detector are 25% and 12.5%, respectively; to reach the same overall rate, the two-stage detector would need significantly more complex stages. Nevertheless, more training data is required for a higher number of stages. Lockon maximizes the number of stages while ensuring that the detector can be trained sufficiently with the amount of available training data.

4.4 Detector Training

We employ the Computer Vision System Toolbox of MATLAB to train the cascade object detectors, which offers a user-friendly training GUI, as shown in Fig. 4.4. We implemented two detectors for detecting the wrench sockets (see Fig. 4.5) used in mechanical work in both horizontal and vertical directions.
For each detector, a set of 800 positive images and a set of 9,000 negative images are supplied to the training function. The negative images have diverse content, representing the common backgrounds in which the OoI may appear. We use the Cascade Training GUI provided by Matlab to mark all the positive objects. The training algorithm follows the standard Viola-Jones algorithm. The negative samples are automatically drawn by the algorithm for each training stage. We perform training on a PC equipped with an Intel i7 2600K CPU and 12 GB of RAM. Training a 20-stage detector with an overall false positive rate of 2 × 10^{-5} and a true positive rate of 95% takes about 1.5 hours.

Figure 4.5: Training samples.

CHAPTER 5

MOTION TRIGGERED DETECTION

Compared with other commonly used object detection approaches, the Viola-Jones framework adopted in Lockon is more computationally efficient. However, it still requires significant computational resources to process the images captured by the camera at 30 Hz. Leaving the object detection process running would rapidly drain the battery of the smart glasses. For example, a face detection algorithm implemented using the OpenCV library drains the battery of a fully charged Google Glass in merely 38 minutes [17]. Moreover, as the presence of the OoI in the camera view is usually considered a rare event, constantly running object detection is in fact unnecessary. As a result, the object detection process should be turned off when it is not being used, to conserve the energy of the smart glasses.

Lockon implements a trigger to control the activation of the object detection process, which is deactivated by default upon startup. There are several ways to design the trigger. For example, Lockon could implement a UI component such as a menu or a button, which allows the user to activate the object detection process manually. Lockon could also utilize physical input components (e.g., physical buttons) on the smart glasses to control the activation. However, these methods work well only on smart devices like smartphones and tablets that offer user-friendly interfaces. For smart glasses that are diminutive and usually lack such interfaces, controlling the activation of the detection process becomes cumbersome. Moreover, for applications that require extensive use of the user's hands, such as surgery, it is often impossible for the user to manually operate the smart glasses by hand. Lockon could also adopt voice commands issued by the user to activate the detection process. However, the effectiveness of voice commands relies on many factors, including the sensitivity of the microphone, the level of background noise, and the design of the audio signal processing circuit, which vary significantly across platforms, applications, and environments. As a result, the accuracy of voice command recognition varies significantly and cannot be assured for all scenarios. Furthermore, in some applications like medical care, the users are often required to keep quiet.

Clear images without significant blur are often required to properly recognize an object. To obtain clear images, the camera on the smart glasses must be staying still when capturing the image. Moreover, it is natural for humans to stay still while recognizing an object. This motivates us to design a motion trigger that activates the object detection process intelligently.
Specifically, the object detection process is only activated when the smart glasses have been staying still for a certain amount of time. To determine if the smart glasses are still, Lockon employs the accelerometer that is available on most smart glasses to detect motion. Accelerometers measure the acceleration of the smart glasses along three axes. Measurements from the accelerometer that are sufficiently close to zero indicate that the smart glasses are still. (Although zero acceleration could also be caused by uniform motion of the smart glasses, in practice the duration of uniform motion rarely exceeds a few seconds.) However, accelerometers implemented on smart glasses usually measure the proper acceleration [7] of the device, which is not exactly the rate of velocity change, i.e., the linear acceleration. Instead, it is the acceleration associated with the phenomenon of weight experienced by the accelerometer. As a result, the acceleration measurements when the device is staying still contain the components of gravity along one or more axes. For this reason, to properly detect if the device is still, the gravity components must be removed from the accelerometer measurements.

5.1 Gravity Removal

A few methods are available to remove gravity from accelerometer measurements and compute the linear acceleration. The first type of method utilizes the gyroscope or compass to measure or estimate the direction of gravity, from which the linear acceleration can be computed by fusing the accelerometer measurements and the direction of gravity. However, smart glasses may not be equipped with a gyroscope or compass. Moreover, invoking these sensors incurs additional power consumption. The second method, which is employed by Lockon, utilizes a high-pass filter to filter out the constant gravity components from the measurements. Specifically, an averaging window is adopted to smooth the measurements, and the linear acceleration is computed by subtracting the smoothed measurement from the instantaneous measurement. This method is less accurate than the sensor fusion approaches and introduces a minor delay; however, its performance is more than sufficient for Lockon, which does not require high motion measurement accuracy or a short measurement delay. A pseudo-code for the filter is given in Algorithm 1.

Fig. 5.1 depicts the measured linear acceleration of smart glasses worn by a user who is trying to take images using the onboard camera. There are two periods (0 s to 2.3 s, and 7.5 s to 10 s) during which the device is still while the user takes images. During the time between the two periods, the device is being moved by the user, who is adjusting the viewing perspective of the camera. It can be seen that the measured acceleration matches the motion of the device very well. This clearly illustrates the effectiveness of the motion trigger.

Figure 5.1: Linear acceleration measurement of a device in motion (linear acceleration in m/s² along the X, Y, and Z axes vs. time in seconds; the device is still from 0 s to 2.3 s and from 7.5 s to 10 s).

Lockon continuously monitors the linear acceleration after startup. The object detection process is only activated if the monitored linear acceleration stays below 0.1 m/s² for 1 s. The detection is immediately deactivated if any linear acceleration measurement is above 0.1 m/s². The pseudo-code of the motion trigger is given in Algorithm 2.
Algorithm 1 High-pass filter for removing gravity
Input: a_p: proper acceleration samples of a single axis; N: number of proper acceleration samples in the buffer; w: length of the window for computing gravity.
Output: a_l: computed linear acceleration for the axis.
1: for all i ∈ (1, N) do
2:   g = mean(a_p(max(1, i − w/2) : min(N, i + w/2)))
3:   a_l(i) = a_p(i) − g
4: end for
5: return a_l

Algorithm 2 Motion Trigger
Input: a_lx, a_ly, and a_lz: linear acceleration samples of the three axes; N: number of acceleration samples in the buffer per axis; s: accelerometer sampling rate (Sa/s).
Used sub-functions: ActivateDet: routine to activate object detection; DeactivateDet: routine to deactivate object detection.
1: count = 0
2: for all i ∈ (1, N) do
3:   a_l = \sqrt{a_lx(i)^2 + a_ly(i)^2 + a_lz(i)^2}
4:   if a_l ≥ 0.1 then
5:     count = 0
6:   else
7:     count = count + 1
8:   end if
9:   if count > s then
10:    ActivateDet()
11:  else
12:    DeactivateDet()
13:  end if
14: end for

CHAPTER 6

OBJECT DIMENSION MEASUREMENT

Remotely measuring the dimension of the object of interest is very useful for applications like navigation and tourist guiding. As Lockon can detect the object of interest, it can measure the dimension of the detected object as projected onto the image sensor of the camera. However, to convert this to the physical dimension of the object, Lockon has to know the distance between the camera on the smart device and the OoI (the "camera-object distance"). There are generally two methods to measure the camera-object distance on typical smart devices. The first method relies on the auto-focus function of the camera. To focus on an object, the distance between the lens and the image sensor is adjusted until the image of the object is clearly formed on the image sensor. Autofocus automatically adjusts the position of the lens using miniature motors that can be finely controlled. After the object is properly focused, the camera-object distance can be calculated using the camera focal length and the distance between the lens and the image sensor. Unfortunately, cameras on smart glasses may not support autofocus. Moreover, for systems equipped with autofocus cameras, the distance between the lens and the image sensor may not be exposed to apps. The second method, which is employed by Lockon, involves using stereo images to measure the camera-object distance with a technique called stereo triangulation [10]. Specifically, this method utilizes the position disparity of the OoI in images taken at different locations to calculate the camera-object distance. This approach does not require any specific hardware, and thus can work on most smart glasses.

6.1 Stereo Triangulation Based Dimension Measurement

A typical digital camera system consists of a lens (or a group of lenses), an image sensor, and a housing that blocks unwanted external light. Let the focal length of the camera be f, and the size of the image sensor be w by h; such a camera system is illustrated in Fig. 6.1.

Figure 6.1: An illustration of a typical camera system: an object P imaged through the lens onto a w-by-h image sensor inside the camera housing, with focal length f and the camera-object distance measured from the lens to P.

Assume that the OoI P is sufficiently far from the camera; the image plane is then roughly positioned at the focal point. In order to perform stereo triangulation, two images containing the OoI must be taken at different positions. Fig. 6.2 illustrates the scenario where the two images are taken at L and R, with the optical axes parallel to each other. The origin of the reference system lies at L, and the distance between L and R is d.
Assume that the camera only moves along the X axis, and let x_1 and x_2 be the X coordinates of the image of P in the photos taken at L and R, respectively. Then the Z coordinate of P, i.e., the camera-object distance, can be calculated using simple geometry. Specifically, it is computed as:

Z = \frac{d f}{x_1 - x_2}    (6.1)

x_1 and x_2 can be computed from the positions of the OoI in the two images using the following equation:

x = \left( \frac{x_{im}}{N_h} - 0.5 \right) L_h    (6.2)

where x_{im} is the horizontal pixel index of the OoI in the image, N_h is the horizontal resolution of the image, and L_h is the physical length of the image sensor along the X axis. x_{im} is obtained after Lockon successfully detects the object. d could be estimated by the user, although this would introduce large errors into the measurement of the camera-object distance and the dimension of the object. Some fixtures could be used to accurately move the camera by a given distance, which produces a known d.

Figure 6.2: The camera-object distance can be computed using stereo triangulation: two image sensors at L and R, separated by the baseline d, observe P at image coordinates x_1 and x_2.

d could also be estimated from the acceleration measured by the accelerometer, although we find that the accelerometers on most smart devices are not sufficiently sensitive to accurately measure d, which introduces a large estimation error into the object dimension. We describe a method for generating and measuring a fixed d in the next section. After Z is obtained, the dimensions of the object, D_x and D_y, can be estimated from the size of the image segments, represented as d_x and d_y, using the following equations:

D_x = \frac{d_x Z}{f}    (6.3)

D_y = \frac{d_y Z}{f}    (6.4)

Figure 6.3: Illustration of the head-tilting scheme. (a) The user holds his head upright, facing front, when taking the first image. (b) The user tilts his head when taking the second image.

6.2 Head-tilting Scheme

We designed a scheme called head-tilting for consistently producing d each time the user takes photos and performs stereo triangulation. In this scheme, when taking the first photo, the user holds his head upright, facing front. Before taking the second photo, the user tilts his head toward one shoulder as far as he can, while continuing to face forward. As humans can only tilt their heads by a certain degree, the resulting d is largely a constant for each user.

We also devise a method for Lockon to measure d, using the user's thumb and stereo triangulation. To use this method, the user creates the distance d using the head-tilting method. In addition, the user holds his left arm straight forward and sticks his thumb up. The thumb should be kept still during the measurement. Lockon approximates the camera-object distance, Z, as the length of the user's arm, which can be estimated from the user's height. Lockon has a built-in thumb detector that reports the locations of the thumb in the two images. d is then calculated using stereo triangulation. This method is illustrated in Fig. 6.3.

The head-tilting method could introduce some errors due to the rotation of the head about the Y axis during tilting. This could be compensated for using onboard sensors such as the gyroscope and compass, which can provide angular movement information. However, this is left for future work. Moreover, we show in Section 8.2 that, even without such compensation, our scheme can still achieve a mean estimation error of only 3.3% when the camera-object distance is smaller than 1 m.
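To show how Equations 6.1-6.4 fit together, the following is a minimal sketch of the triangulation arithmetic. The camera constants and method names are illustrative assumptions (real values would come from the calibration of the Glass camera), and the detector output is assumed to be the bounding box of the OoI in each image.

    import org.opencv.core.Rect;

    public class StereoRanging {
        // Illustrative camera constants, not Google Glass calibration data.
        static final double F   = 0.0046; // focal length f, in meters
        static final double L_H = 0.0036; // physical sensor width Lh, in meters
        static final int    N_H = 1280;   // horizontal resolution Nh, in pixels

        // Eq. 6.2: map a horizontal pixel index to a sensor-plane coordinate.
        static double toSensorX(double pixelX) {
            return (pixelX / N_H - 0.5) * L_H;
        }

        // Eq. 6.1: camera-object distance Z from the disparity of the two
        // detections, where d is the baseline produced by head-tilting.
        static double distance(Rect first, Rect second, double d) {
            double x1 = toSensorX(first.x + first.width / 2.0);
            double x2 = toSensorX(second.x + second.width / 2.0);
            return d * F / (x1 - x2);
        }

        // Eq. 6.3: physical object width Dx from the on-sensor width dx and Z.
        static double objectWidth(Rect box, double z) {
            double dx = (double) box.width / N_H * L_H; // detected width on the sensor
            return dx * z / F;
        }
    }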
6.3 Discussion

Currently, Lockon can only conduct dimension measurement when there is a single OoI in the view, since Lockon cannot differentiate multiple OoIs at different locations and associate the OoIs across the two images. When there are multiple OoIs present in the view of the smart glasses, the dimension measurement function does not work. Dimension measurement of multiple objects could be enabled by adding an object tracking function to Lockon, which would allow Lockon to track the OoIs during head-tilting. However, the implementation of this function is left for future work.

On some smart devices equipped with stereo cameras (two cameras facing the same direction), such as the HTC One (M8) [4], the dimension of objects can be measured without moving the camera. Specifically, the two cameras can take images simultaneously and the stereo triangulation can be performed using the two images. In this case, d is a constant value, and the optical axes of the cameras are always parallel. This results in highly accurate camera-object distance and dimension measurements. Unfortunately, we have yet to see such smart glasses appear.

CHAPTER 7

IMPLEMENTATION

We implemented Lockon on Google Glass Explorer Edition running Android 4.0.3, and installed the OpenCV 2.4.8 library on the Glass. Lockon is implemented as a standard Android application, written in Java. During the initialization phase, Lockon first loads the trained detectors (.xml files) from local folders specified in its configuration file. To detect types of objects other than those covered by the two built-in (socket) detectors, users can train additional detectors and load them into Lockon. Users can also specify in the configuration file which detectors should be loaded from the local detector collection. An accelerometer callback routine is registered with the Android OS, which processes the accelerometer measurements once they are generated. The camera and accelerometer are activated at the end of the initialization. The camera takes images at a rate of 30 frames per second. The sampling rate of the accelerometer is configured to 50 Hz to achieve low delay on the motion trigger. To help the user determine the status of the motion trigger, Lockon implements a progress bar which is shown on the screen. The length of the bar is proportional to the duration that the device has been staying still since the last movement. A full-length progress bar indicates that the device has been staying still for at least 1 s, and that object detection is ongoing.

When the object detection process is activated, Lockon retrieves the image from the camera buffer once a frame is captured. As the cascade detector implementation in OpenCV only accepts gray-scale images, the retrieved image is converted to a gray-scale image with 8-bit depth. The commonly used image equalization process is omitted in Lockon, since the HOG feature adopted by Lockon performs normalization on the image locally. A collection of image segments that are determined to contain the OoI is returned by the detectors. The segments that contain the same object are combined and labeled. The processed image, with the bounding box indicating the OoI, is then displayed on the screen.

The dimension measurement function is automatically invoked after the object detection is activated. Lockon assumes that the initial detection is done with the head of the user upright. After the object is initially detected, its location on the image is recorded.
If the user wishes to measure the dimension of the object, he must touch the touchpad on the Glass, which prevents the motion trigger from deactivating the object detection process during the subsequent head-tilting. The motion of the device is monitored using the accelerometer to determine when the user finishes head-tilting. The detection results before and after head-tilting are fed to the stereo triangulation algorithm to calculate the distance and the dimension of the object. The measurement result is displayed alongside the bounding box of the object. We note that Google Glass supports the Glassware API, which allows computationally intensive tasks to be conducted in the cloud. We will implement part of Lockon with this API in the future. The total ROM footprint of Lockon is about 10 KB, and the RAM usage is about 1 MB.

The training program is implemented using Matlab 2013 and the Computer Vision System Toolbox. We trained two detectors for detecting the wrench sockets used in mechanical work in both horizontal and vertical directions. The sockets we used for training are shown in Fig. 4.5.

CHAPTER 8

EXPERIMENTATION

8.1 Detection Performance

In this section we evaluate the detection performance of Lockon. We adopt six sockets of different sizes as the objects to be detected. We use Google Glass to take about 3,000 photos of these sockets lying either horizontally or vertically, at a fixed resolution of 1280 by 720, under different illumination conditions. We randomly pick 1,600 photos for training the two detectors that handle the two orientations separately. The rest of the photos are left for testing. We use Matlab 2013b and the Computer Vision System Toolbox to train the detectors and test the detection accuracy. We set the scaling factor adopted in the multiscale detection algorithm to 1.1.

8.1.1 Detection Accuracy vs Number of Stages

We first evaluate the performance of the object detection subsystem of Lockon. We train detectors to detect sockets using the HOG feature descriptor, with the number of stages ranging from 10 to 20. We set the per-stage detector parameters (false positive rate and true positive rate) the same for all detectors. To compare the performance of detectors that use different feature descriptors, we also train detectors using the LBP feature descriptor with the same setup. We then test the trained detectors to calculate the overall false positive rate (FP rate) and true positive rate (TP rate) associated with each detector. The false positive rate is calculated as the ratio of the number of negative samples (containing no OoI) that are incorrectly classified as positive (containing the OoI) to the total number of samples tested. We calculate the total number of tested samples using the size of the testing image, the size of the sliding window, and the scaling factor.

Figure 8.1: Detection accuracies of detectors using HOG and LBP feature descriptors. (a) False positive rates of HOG and LBP detectors vs. stages. (b) True positive rates of HOG and LBP detectors vs. stages.

The results are shown in Fig. 8.1. It can be seen from Fig. 8.1 (a) that the overall FP rates of all curves decrease when the number of stages increases.
When the number of stages is 20, the overall FP rates achieved by the detectors using HOG and LBP features are 4.3 × 10−7 and 1.26 × 10−5, respectively. We observe that the HOG feature descriptor incurs significantly lower overall FP rates than the LBP feature descriptor: at 20 stages, detectors using LBP incur an overall FP rate nearly 30× that of detectors using HOG. From Fig. 8.1 (b) we observe that the overall TP rates generally decrease as the number of stages increases, which is also consistent with our analysis in Section 4.1. The maximum and minimum average overall TP rates of the detectors using HOG are 0.9883 (10 stages) and 0.9474 (20 stages), respectively; for the detectors using LBP, they are 0.9784 (10 stages) and 0.8902 (20 stages). As with the overall FP rates, detectors using HOG features generally achieve better performance than those using LBP features. We note that this performance difference does not indicate that the LBP feature is inferior to the HOG feature, as the two descriptors favor different applications. When detecting other objects, such as human faces, it is entirely possible for detectors using LBP to outperform those using HOG. However, we believe that for the general object detection tasks Lockon targets, HOG-based detectors will generally outperform LBP-based ones.

8.1.2 Detection Delay

In this section we evaluate the detection delay of Lockon on Google Glass. We train detectors using the HOG feature descriptor with stage numbers from 10 to 20. To fairly compare the delay incurred by these detectors, we must ensure that they achieve similar accuracy. We therefore adjust the per-stage FP and TP rates so that the expected overall FP and TP rates of each detector are 0.00002% and 98%, respectively. We test these detectors and plot the results in Fig. 8.2. All measured FP and TP rates are about 0.00002% and 98%, respectively, with only minor variations.

Figure 8.2: Measured overall false positive rates and true positive rates vs. number of stages. (a) False positive rates; (b) true positive rates.

Having verified that the detectors with different numbers of stages achieve roughly the same accuracy, we load them onto the Google Glass and measure the average detection delay of each detector. We timestamp the start time and the finish time of the detection process for each image, and compute the average detection delay of each detector. Fig. 8.3 shows the results. The delay generally decreases as the number of stages increases, although the difference is small (< 10%); this is because detectors with more stages tend to be simpler at each stage. This finding confirms the effectiveness of our optimization method described in Section 4.3. Moreover, even with only 10 stages, our detector achieves a delay of less than 350 ms on Google Glass, which translates to a processing rate of roughly 3 Hz. This speed is sufficient for most applications.
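The timestamping procedure is straightforward. The sketch below shows one way to measure the per-image delay with the OpenCV 2.4.8 Java bindings used by Lockon; the minimum window size and frame handling are assumptions for illustration, not taken from the Lockon source.

```java
import java.util.List;
import org.opencv.core.Mat;
import org.opencv.core.MatOfRect;
import org.opencv.core.Size;
import org.opencv.imgproc.Imgproc;
import org.opencv.objdetect.CascadeClassifier;

public class DelayBenchmark {
    // Returns the mean per-image detection delay in milliseconds.
    public static double averageDelayMs(CascadeClassifier detector, List<Mat> frames) {
        Mat gray = new Mat();
        MatOfRect found = new MatOfRect();
        long totalNs = 0;
        for (Mat frame : frames) {
            // The OpenCV cascade detector operates on 8-bit gray-scale images.
            Imgproc.cvtColor(frame, gray, Imgproc.COLOR_RGBA2GRAY);
            long start = System.nanoTime();
            // Scaling factor 1.1 matches our multiscale setting; the 24x24
            // minimum window size is an assumed value.
            detector.detectMultiScale(gray, found, 1.1, 3, 0,
                    new Size(24, 24), new Size());
            totalNs += System.nanoTime() - start;
        }
        return totalNs / (frames.size() * 1e6);
    }
}
```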
Figure 8.3: Detection delay incurred on Google Glass (delay per image vs. number of stages).

It is worth mentioning that when multiple detectors are activated on the Glass, the detection delay can increase significantly. This issue could be resolved by running the detection process in the cloud. For example, Google provides a set of APIs, called the Glass API, for Google Glass to access Google cloud services. To reduce the delay caused by uploading images to the cloud, each image should be processed locally to extract its HOG feature descriptors, and feature compression techniques [12] can be adopted to further reduce the bandwidth required for transmission. After classification finishes on the cloud, the result is downloaded to the Glass. However, a cloud-based implementation of Lockon is left for future work.

8.2 Accuracy of Distance Measurement

The accuracy of the object dimension measurement largely depends on the accuracy of the camera-object distance measurement. We investigate the error of the camera-object distance measurement in this section. We install Lockon on a Google Glass Explorer Edition and ask a user to wear the Glass and conduct the experiment. The user uses the head-tilting scheme to measure the distance between the Glass and a socket hanging on the wall, and moves further from the socket after each round of measurements. At each camera-object distance, 10 measurements are taken and recorded. We then compute the errors associated with each camera-object distance and plot them in Fig. 8.4, along with the 95% confidence intervals to show the variation of the measurements.

Figure 8.4: Measurement error of the camera-object distance. (a) Distribution of the camera-object distance measurements; (b) relative error of the camera-object distance measurements.

Fig. 8.4 (a) shows the measurements at each distance with 95% confidence intervals. When the camera-object distance is smaller than 1 m, Lockon achieves good measurement accuracy. This observation is confirmed in Fig. 8.4 (b), which shows the relative error at each distance: Lockon achieves a maximum mean error of only 3.3% when the camera-object distance is smaller than 1 m, and 95% of the errors fall below 15%. These errors result from the head-tilting itself. Since the user cannot execute head-tilting in exactly the same way each time, small variations are introduced to d, i.e., the distance between the positions where the two images are taken, which in turn introduces error into the distance estimation. Furthermore, head-tilting may leave the optical axes of the camera non-parallel when the two images are taken, which also introduces measurement error, especially at longer distances. Nevertheless, for estimating the distance and dimension of close objects (< 1 m), the accuracy of Lockon is sufficient. We also observe large errors when the camera-object distance exceeds 1 m; for example, at 1.54 m, the mean error is about 16.56% with a high variance (20%).
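For reference, the sketch below shows the simplified parallel-axes triangulation underlying these measurements; it makes the error behavior easy to see, since the estimated distance grows with the baseline d and shrinks with the pixel disparity, so small perturbations of either have a larger effect on distant objects. The formulas assume a pinhole camera with parallel optical axes; all names and example numbers are illustrative, not taken from the Lockon source.

```java
public final class StereoRanging {
    // Camera-object distance Z = f * d / disparity, where f is the focal length
    // in pixels, d the baseline produced by head-tilting (meters), and the
    // disparity is the pixel shift of the object between the two images.
    public static double distance(double focalPx, double baselineM, double disparityPx) {
        return focalPx * baselineM / disparityPx;
    }

    // Physical width W = w_px * Z / f, from the pixel width of the bounding box.
    public static double dimension(double widthPx, double distanceM, double focalPx) {
        return widthPx * distanceM / focalPx;
    }

    public static void main(String[] args) {
        // Example: 1500 px focal length, 5 cm head-tilt baseline, 90 px disparity.
        double z = distance(1500, 0.05, 90);  // ~0.83 m camera-object distance
        double w = dimension(60, z, 1500);    // ~0.033 m (3.3 cm) object width
        System.out.printf("distance = %.2f m, width = %.3f m%n", z, w);
    }
}
```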
Indeed, at longer distances the measurement becomes much more sensitive to non-parallel optical axes: a small rotation of the optical axis generates a large error when the object is far away. A possible solution to this issue is to use additional sensors, such as the compass, to measure and compensate for the rotation. However, this function is left for future work.

CHAPTER 9 CONCLUSION

We design and implement a novel object detection and measurement system called Lockon for smart glasses. Lockon takes advantage of the mounting position of the smart glasses and provides users with two useful functions that can benefit a wide range of applications. Lockon can accurately locate the object of interest (OoI) in the view of the user and inform the user of the position of the OoI in real time, using the front-facing camera on the smart glasses and advanced computer vision and image processing techniques. To conserve energy, Lockon implements a motion trigger that intelligently activates the object detection process only when necessary. Lockon can also accurately measure the dimension of an object with a 3D ranging technique, allowing the user to remotely estimate the dimension of the object. We implement Lockon on Google Glass Explorer Edition and evaluate its performance with extensive experiments. Our results indicate that Lockon achieves high detection accuracy (0.95 true positive rate and 4.3 × 10−7 false positive rate), low object dimension measurement error (3.3% when the distance is less than 1 m), and low delay (300 ms).

BIBLIOGRAPHY

[1] HOGgles: Visualizing object detection features. http://web.mit.edu/vondrick/ihog, 2013.

[2] Smart glasses market prospects 2013-2018. http://www.juniperresearch.com/reports/smartglasses, 2013.

[3] Google Glass. http://www.google.com/glass/start/, 2014.

[4] HTC One (M8) product page. http://www.htc.com/us/smartphones/htc-one-m8/, 2014.

[5] LookTel. http://www.looktel.com/, 2014.

[6] Nikon D90 product page. http://imaging.nikon.com/lineup/microsite/d90/en/advanced-function/, 2014.

[7] Proper acceleration. http://en.wikipedia.org/wiki/Proper_acceleration, 2014.

[8] Remote sensing. http://en.wikipedia.org/wiki/Remote_sensing, 2014.

[9] RunKeeper. http://runkeeper.com/, 2014.

[10] Stereo triangulation. http://en.wikipedia.org/wiki/Triangulation_(computer_vision), 2014.

[11] Wearable devices. http://www.wearabledevices.com/what-is-a-wearable-device/, 2014.

[12] V. Chandrasekhar, G. Takacs, D. Chen, S. Tsai, R. Grzeszczuk, and B. Girod. CHoG: Compressed histogram of gradients, a low bit-rate feature descriptor. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 2504–2511, June 2009.

[13] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886–893, June 2005.

[14] D. M. Gavrila and V. Philomin. Real-time object detection for “smart” vehicles. In Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, volume 1, pages 87–93. IEEE, 1999.

[15] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 1998.

[16] T. Ojala, M. Pietikainen, and D. Harwood. Performance evaluation of texture measures with classification based on Kullback discrimination of distributions. In Pattern Recognition, 1994. Vol. 1 - Conference A: Computer Vision & Image Processing, Proceedings of the 12th IAPR International Conference on, volume 1, pages 582–585. IEEE, 1994.

[17] R. LiKamWa, Z. Wang, A. Carroll, F. X. Lin, and L. Zhong. Draining our Glass: An energy and heat characterization of Google Glass. Technical report, Rice University, 2014.

[18] A. Roy, S. Shinde, and K.-D. Kang. An approach for efficient real time moving object detection. In ESA, pages 157–162, 2010.

[19] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, volume 1, pages I-511–I-518, 2001.

[20] T.-H. Wang, C.-C. Hsu, C.-C. Chen, C.-W. Huang, and Y.-C. Lu. Three-dimensional measurement of a remote object with a single CCD camera. In Autonomous Robots and Agents, 2009. ICARA 2009. 4th International Conference on, pages 187–192. IEEE, 2009.

[21] H. Yano, Y. Miyamoto, and H. Iwata. Haptic interface for perceiving remote object using a laser range finder. In EuroHaptics conference, 2009 and Symposium on Haptic Interfaces for Virtual Environment and Teleoperator Systems. World Haptics 2009. Third Joint, pages 196–201. IEEE, 2009.