SIGN LANGUAGE RECOGNIZER FRAMEWORK BASED ON DEEP LEARNING ALGORITHMS

By

Atra Akandeh

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science – Doctor of Philosophy

2021

ABSTRACT

SIGN LANGUAGE RECOGNIZER FRAMEWORK BASED ON DEEP LEARNING ALGORITHMS

By

Atra Akandeh

According to the World Health Organization (WHO, 2017), 5% of the world's population have hearing loss. Most people with hearing disabilities communicate via sign language, which hearing people find extremely difficult to understand. To facilitate the communication of deaf and hard of hearing people, developing an efficient communication system is a necessity. There are many challenges associated with the Sign Language Recognition (SLR) task, namely lighting conditions, complex backgrounds, signer body postures, camera position, occlusion, complexity and large variations in hand posture, lack of word alignment, coarticulation, etc. Sign Language Recognition has been an active domain of research since the early 90s. However, due to computational resources and sensing technology constraints, limited advancement has been achieved over the years. Existing sign language translation systems can mostly translate only a single sign at a time, which makes them less effective in daily-life interaction. This work develops a novel sign language recognition framework using deep neural networks, which directly maps videos of sign language sentences to sequences of gloss labels by emphasizing critical characteristics of the signs and injecting domain-specific expert knowledge into the system. The proposed model also allows for combining data from various sources and hence combating the limited data resources in the SLR field.

Copyright by
ATRA AKANDEH
2021

ACKNOWLEDGEMENTS

It is a pleasure to thank those who made this thesis possible. I would like to thank my advisor Dr. Salem for his enlightening guidance and continued support throughout my entire Ph.D. program. In addition, I would like to thank my committee members Dr. Zhou, Dr. Aktulga, and Dr. Liu for their constructive guidance and valuable feedback. I would also like to express my gratitude to Dr. Torng, Dr. Esfahanian, and Dr. Kulkarni for their support. Finally, special thanks to my family and friends for their unconditional support and constant encouragement.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
CHAPTER 1  INTRODUCTION
  1.1 Objective
  1.2 Contribution
  1.3 Thesis Overview
CHAPTER 2  PRELIMINARIES
  2.1 Deep Learning
  2.2 CNNs
  2.3 RNNs
  2.4 Transfer Learning
  2.5 Connectionist Temporal Classification
  2.6 Video Classification
    2.6.1 Related Work
CHAPTER 3  SLIM LSTMS
  3.1 Introduction
  3.2 New Variants of the LSTM model
  3.3 Dataset
    3.3.1 MNIST
      3.3.1.1 The Tanh Activation
      3.3.1.2 The Sigmoid Activation
      3.3.1.3 The ReLU Activation
    3.3.2 IMDB
      3.3.2.1 The Sigmoid Activation
    3.3.3 20 Newsgroups
      3.3.3.1 The Tanh Activation
  3.4 Conclusion
CHAPTER 4  FEATURE EXTRACTION MODEL
  4.1 MediaPipe Hands
  4.2 MediaPipe Pose
CHAPTER 5  CHARACTER-LEVEL SLR
  5.1 Introduction
    5.1.1 Characteristics
    5.1.2 Challenges
    5.1.3 Related Work
  5.2 Proposed Models
    5.2.1 Model Overview
    5.2.2 Model Details
  5.3 Dataset
  5.4 Experimental Setup
  5.5 Experimental Results
  5.6 Discussion
  5.7 Summary
CHAPTER 6  WORD-LEVEL SLR
  6.1 Introduction
    6.1.1 Characteristics
    6.1.2 Challenges
    6.1.3 Related Work
  6.2 Proposed Models
    6.2.1 Model Overview
    6.2.2 Model Details
      6.2.2.1 RSign
      6.2.2.2 MCSign
    6.2.3 Design Choice
  6.3 Dataset
  6.4 Experimental Setup
  6.5 Experimental Results
  6.6 Discussion
  6.7 Summary
CHAPTER 7  SENTENCE-LEVEL SLR
  7.1 Introduction
    7.1.1 Characteristics
    7.1.2 Challenges
    7.1.3 Related Work
  7.2 Proposed Models
    7.2.1 Model Overview
    7.2.2 Model Details
      7.2.2.1 RSign-C
      7.2.2.2 MCSign-C
    7.2.3 Design Choice
  7.3 Dataset
  7.4 Experimental Setup
  7.5 Experimental Results
    7.5.1 MCSign-C & RSign-C
    7.5.2 MCSign-C-Slim
  7.6 Discussion
  7.7 Summary
CHAPTER 8  FUTURE ROADMAP
BIBLIOGRAPHY

LIST OF TABLES

Table 3.1: MNIST - Network specifications
Table 3.2: MNIST - Best results obtained using tanh
Table 3.3: MNIST - Best results obtained using sigmoid
Table 3.4: MNIST - Best results obtained using relu
Table 3.5: IMDB - Network specifications
Table 3.6: IMDB - Best results obtained using sigmoid
Table 3.7: News20 - Network specifications
Table 3.8: News20 - Best results obtained using tanh
Table 5.1: Network used to train Sign Language MNIST and ASL Fingerspelling A Skeleton
Table 6.1: RSign network specifications
Table 7.1: RWTH-PHOENIX setup statistics
Table 7.2: Comparison between various approaches on the RWTH-PHOENIX dataset
Table 7.3: Time per epoch for each MCSign-C-Slim model

LIST OF FIGURES

Figure 2.1: A typical CNN architecture
Figure 2.2: AlexNet architecture Krizhevsky et al. (2012)
Figure 2.3: HMDB51 Kuehne et al. (2011)
Figure 2.4: The video classifier architectures proposed by Karpathy et al. (2014)
Figure 2.5: Multiresolution CNN architecture Karpathy et al. (2014)
Figure 2.6: Two-stream architecture for video classification Simonyan & Zisserman (2014)
Figure 2.7: The video classifier architecture proposed by Ng et al. (2015)
Figure 2.8: Long-term Recurrent Convolutional Network Donahue et al. (2017)
Figure 2.9: Comparing 2D & 3D convolution kernels Tran et al. (2015)
Figure 2.10: C3D proposed by Tran et al. (2015)
Figure 2.11: TSN proposed by Wang et al. (2016)
Figure 2.12: The video classifier architecture proposed by Zhu et al. (2017)
Figure 3.1: MNIST - Training & Test accuracy, σ = tanh, η = 1e-4
Figure 3.2: MNIST - Training & Test accuracy, σ = tanh, η = 1e-4
Figure 3.3: MNIST - Training & Test accuracy, σ = tanh, η = 1e-3
Figure 3.4: MNIST - Training & Test accuracy, σ = sigmoid, η = 1e-4
Figure 3.5: MNIST - Training & Test accuracy, σ = sigmoid, η = 2e-3
Figure 3.6: MNIST - Training & Test accuracy, σ = sigmoid, η = 1e-4
Figure 3.7: MNIST - Training & Test accuracy, σ = sigmoid, η = 2e-3
Figure 3.8: MNIST - Training & Test accuracy, σ = sigmoid, η = 1e-3
Figure 3.9: MNIST - Training & Test accuracy, σ = relu, η = 1e-4
Figure 3.10: MNIST - Training & Test accuracy, σ = relu, η = 2e-3
Figure 3.11: MNIST - Training & Test accuracy, σ = relu, η = 1e-4
Figure 3.12: MNIST - Training & Test accuracy, σ = relu, η = 1e-3
Figure 3.13: MNIST - Training & Test accuracy for different η, lstm10, relu
Figure 3.14: IMDB - Training & Test accuracy, σ = sigmoid, η = 1e-4
Figure 3.15: IMDB - Training & Test accuracy, σ = sigmoid, η = 1e-4
Figure 3.16: IMDB - Training & Test accuracy, σ = sigmoid, η = 1.25e-5
Figure 3.17: IMDB - Training & Test accuracy, σ = sigmoid, η = 1e-5
Figure 3.18: News20 - Training & Test accuracy, σ = tanh, η = 1e-3
Figure 3.19: News20 - Training & Test accuracy, σ = tanh, η = 1e-3
Figure 3.20: News20 - Training & Test accuracy, σ = tanh, η = 1e-3
Figure 4.1: MediaPipe hands tracking module output example
Figure 4.2: MediaPipe pose module output example MediaPipe (2019)
Figure 4.3: The proportions of the human body according to Vitruvius MediaPipe (2019)
Figure 5.1: American Sign Language Alphabet Sign Language Club (2012)
Figure 5.2: The character-level sign recognizer architecture proposed by Li et al. (2015)
Figure 5.3: Feature extraction using PCANet Aly et al. (2016)
Figure 5.4: The character-level sign recognizer architecture proposed by Aly et al. (2016)
Figure 5.5: Successive binary depth images of the character 'G' Rioux-Maldague & Giguere (2014)
Figure 5.6: The character-level sign recognizer architecture proposed by Rioux-Maldague & Giguere (2014)
Figure 5.7: Unsegmented hand gesture samples Oyedotun & Khashman (2017)
Figure 5.8: Finger joints captured by RealSense Huang et al. (2015)
Figure 5.9: ASL alphabet skeleton created from 2D coordinates of the hands using MediaPipe
Figure 5.10: VGG-19 architecture Jaworek-Korjakowska et al. (2019)
Figure 5.11: Training & Validation accuracy for the SConv model on the Sign Language MNIST dataset
Figure 5.12: Training & Validation accuracy of the SVGG (left) & SConv (right) models on the Sign Language MNIST dataset
Figure 5.13: Confusion matrix of the SConv model on the ASL Fingerspelling A dataset
Figure 6.1: Color, depth and skeleton images Huang et al. (2015)
Figure 6.2: The word-level sign recognizer architecture proposed by Huang et al. (2015)
Figure 6.3: The word-level sign recognizer architecture proposed by Liu et al. (2016)
Figure 6.4: The word-level sign recognizer architecture proposed by Kumar et al. (2017)
Figure 6.5: MCSign model overview
Figure 6.6: RSign model overview
Figure 6.7: The skeleton joints of the ASL sign "Christmas"
Figure 6.8: MCSign model architecture
Figure 6.9: MCSign model summary
Figure 6.10: Optical flow field sample
Figure 6.11: Frames with emphasis on hands
Figure 6.12: ASLLVD Neidle et al. (2012)
Figure 6.13: 40 common hand shapes used in ASL
Figure 6.14: Training & Validation accuracy for the MCSign model on the ASLLV dataset
Figure 6.15: Training & Validation accuracy for the RSign model on the ASLLV dataset
Figure 7.1: The sentence-level sign recognizer architecture proposed by Koller et al. (2016a)
Figure 7.2: The sentence-level sign recognizer architecture proposed by Koller et al. (2016b)
Figure 7.3: A 2D black and white image sequence created from the PHOENIX dataset
Figure 7.4: MCSign-C model overview
Figure 7.5: RSign-C model overview
Figure 7.6: MCSign-C architecture
Figure 7.7: MCSign-C model summary
Figure 7.8: MCSign-C-Slim models comparison
CHAPTER 1

INTRODUCTION

Deaf and hard of hearing people use sign language to communicate. Sign languages employ the visual-manual modality to convey meaning, and they are challenging for hearing people to understand. Although making use of written communication or seeking help from a sign language interpreter can ease the situation, each has its drawbacks in terms of availability or convenience. To break the barrier between the hearing and the deaf, developing an efficient communication system is a necessity. The development of sign language translation technology dates back to the early 90s Fang et al. (2017). However, due to computational resources and sensing technology constraints, limited advancement has been achieved over the years.

Early Sign Language Recognition (SLR) systems performed data acquisition using sensor-based devices such as data gloves and accelerometers. These devices provide position, orientation, velocity, and other specifics of the hands. However, due to the prohibitive costs of such approaches, vision-based devices have been introduced. Microsoft Kinect, the Leap Motion Controller, and Google Tango provide RGB and depth-map information, which can be used as image-based input to a framework.

Sign language is far more than just a collection of well-specified gestures, and many factors need to be taken into account when performing the Sign Language Recognition task. In character-level SLR, dealing with lighting conditions, complex backgrounds, signer body postures, camera position, and subtle differences between letters is challenging. In word-level SLR, occlusion, complexity, large variations in hand posture, and variation in input length caused by signing speed need to be considered. In sentence-level SLR, there is no word alignment, and the temporal boundaries of a specific word are not clear. Moreover, signs are context-dependent, and coarticulation, in which a sign is affected by the preceding or following signs, plays an important role.

1.1 Objective

Sign language translation technology is still far from being practically useful. Developing successful sign language recognition systems requires expertise in a wide range of fields, including computer vision, natural language processing, linguistics, and deaf culture, which is often overlooked by computer scientists Bragg et al. (2019). This work aims to provide a thorough study of SLR at the character, word, and sentence levels, covering the characteristics of each level, the challenges associated with it, and how it has been approached previously. Existing sign language recognition systems can mostly recognize only a single sign at a time rather than performing sentence-level recognition, making them less effective in daily-life interaction. This work also develops a novel sign language recognition framework using deep neural networks, which directly maps videos of sign language sentences to sequences of gloss labels.

1.2 Contribution

The contribution of this thesis is threefold. First, it thoroughly studies sign language recognition at the character, word, and sentence levels. At all three levels, two common input scenarios, i.e., raw data and extracted features, have been explored and compared. Second, it proposes a novel framework for sentence-level recognition consisting of a feature extractor piped into a deep sequential network. The framework is capable of real-time translation of American Sign Language into text. Third, it combats limited data resources in the ASL field by combining different datasets from various sources.
Since the proposed network is insensitive to dataset type, combining different datasets boosts the network's performance tremendously.

1.3 Thesis Overview

In chapter 2, we review two significant architectures, namely the CNN and the RNN. We also describe transfer learning and connectionist temporal classification. We then present an overview of video classification problems and provide a literature review of the approaches taken by researchers in this domain. In chapter 3, to investigate and model the temporal dynamics of sign languages, we experiment with the LSTM and implement the slim variations of it proposed by Salem (2018). We perform an extensive empirical evaluation on different datasets comparing these variants with the standard LSTM and obtain promising results. We then apply what we learned to the SLR task. In chapter 4, we introduce MediaPipe, an ML solution for live and streaming media that we leveraged to extract discriminative information. In chapters 5, 6, and 7, character-level SLR, word-level SLR, and sentence-level SLR are investigated, respectively. Starting with the characteristics and challenges associated with each, we propose novel architectures that successfully perform the corresponding tasks. Finally, in chapter 8, we conclude this dissertation by addressing essential gaps in the context of sign languages and describe possible directions for future work.

CHAPTER 2

PRELIMINARIES

In this chapter, we provide an overview of the key concepts that the following chapters rely on. We first review deep learning in general and two major architectures, namely the CNN and the RNN. We also describe transfer learning and connectionist temporal classification. We then present an overview of video classification problems and provide a literature review of the approaches taken by researchers in this domain.

2.1 Deep Learning

The performance of discriminative models such as classifiers or regressors depends not only on the algorithm itself but also on the choice of data representation. Therefore, the first step in fulfilling a given task is to come up with a meaningful representation of the data and then feed it into a supervised predictor. Traditionally, people tried to extract discriminative information from the data by trial and error: they performed feature engineering and came up with hand-designed features. This approach needs specific domain knowledge, and it is application-dependent. That is why representation-learning algorithms play an essential role today. A representation-learning algorithm aims to perform feature extraction that is not task-specific and does not require prior human knowledge. There are many different approaches to representation learning, such as probabilistic models, auto-encoders, manifold learning, and deep networks.

In many data science tasks, we are dealing with high-dimensional data. Raw data is usually noisy and includes redundancy. As mentioned earlier, one way to capture critical aspects of the data and extract discriminative information is to explicitly pre-process the data and then pass the extracted features to the discriminative algorithm. However, new findings have shown that this is not how the neocortex, which is in charge of cognitive abilities, deals with high-dimensional data. The human brain receives a myriad of sensory data every second. Rather than explicitly pre-processing them, the brain sends the data through a hierarchy to learn to represent observations and overcome the curse of dimensionality Lee & Mumford (2003).
This key finding has inspired many people and resulted in the emergence of deep learning networks. The focus of this field is to extract relevant information through the different levels of a network. Instead of relying on human-engineered features, the network itself comes up with a representation at each layer (level). However, until 2006, deep networks were rarely used due to convergence problems. In 2006, Geoff Hinton initiated representation learning using deep networks Hinton et al. (2006). At the time, researchers were not able to train deep networks using common approaches like backpropagation. To train a deep model, Hinton used unsupervised feature extraction and obtained a new transformation of the data. Specifically, he used the Restricted Boltzmann Machine (RBM) to initialize the network parameters to avoid poor local minima. He intended to learn new representations of the data one level at a time. For example, for a given input, one can generate a new representation within the first layer. Then this new representation can be used as input to the second layer, and the procedure is repeated. The parameters associated with each layer can then be used as initial values. One may fine-tune the parameters later using backpropagation or any other training algorithm. This unsupervised pre-training learns a hierarchy of features one level at a time. Hinton's method was quickly followed up by Bengio et al. (2006), Ranzato et al. (2006), Lee et al. (2007), Bengio (2009), and many more later. It has been shown that layer-wise stacking of extracted features often yields better representations in terms of classification error Larochelle et al. (2009). However, today there is no need to use the Restricted Boltzmann Machine as the building block of a deep network.

There are many applications associated with representation learning, namely speech recognition and signal processing, object recognition, natural language processing, multitasking and transfer learning, domain adaptation, and so on Bengio et al. (2013). Although deep networks are promising, training a deep architecture is not challenge-free. The first challenge is to establish a clear objective function or target for training Bengio et al. (2013). The second important point is proper initialization. Layer-wise pre-training has been suggested by many authors to avoid poor local minima. The choice of nonlinearity is another critical factor. It has been shown that using a rectified linear unit as the nonlinearity speeds up convergence dramatically. The number of variables to be learned in deep networks is relatively large. Hence, deep networks may require more epochs to learn the parameters thoroughly. Efficient GPU training allows one to train longer Bengio et al. (2013). The choice of hyperparameters is also essential. For example, the learning rate can be adaptive instead of being fixed. Applying the dropout trick has also been suggested by Hinton et al. (2012) to obtain more robust results.

2.2 CNNs

CNNs LeCun et al. (1999), which are inspired by the visual cortex of the cat, are building blocks of deep networks. There are many advantages associated with CNNs over the multilayer perceptron networks introduced by Rosenblatt (1958). First, the number of trainable parameters is much smaller. Second, they offer invariance to shifting, scaling, and other forms of distortion. Third, the spatial structure of the input data is preserved. ConvNets consist of three different types of layers: convolution, pooling, and fully connected layers. In the convolutional layers, a kernel (local receptive field) is passed over the input to create a feature map for the next layer. By restricting the neural weights of one layer to a local receptive field, relevant features can be extracted automatically. Then, a pooling layer is applied to the feature map, followed by a fully connected layer. In deep networks, convolution and max-pooling layers are stacked on top of each other many times.

Figure 2.1: A typical CNN architecture

In 2012, Krizhevsky et al. (2012) introduced AlexNet, a new generation of convolutional networks, which won the ImageNet Large Scale Visual Recognition Challenge Deng et al. (2009). They showed that the depth of the network is essential for high performance. They also exploited graphics processing units (GPUs) during training to circumvent the computational cost. Later on, similar architectures such as Inception Szegedy et al. (2014), VGG-Net Simonyan & Zisserman (2015), ResNet He et al. (2015), etc. were introduced by others in the DL research community. The effectiveness of these networks makes convolutional networks the architecture of choice for image classification problems.

Figure 2.2: AlexNet architecture Krizhevsky et al. (2012)
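To make the layer stacking described above concrete, the following is a minimal Keras sketch of a small convolutional classifier that alternates convolution and max-pooling layers before a fully connected head. It is illustrative only, not an architecture used in this thesis; the layer counts and sizes are arbitrary choices for the example.

    from tensorflow.keras import layers, models

    # A minimal convolution / pooling / fully-connected stack for 28x28 single-channel inputs.
    # Layer counts and sizes are illustrative, not an architecture used in this thesis.
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),  # convolution over local receptive fields
        layers.MaxPooling2D((2, 2)),                                             # pooling: spatial down-sampling
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),                                    # fully connected layer
        layers.Dense(10, activation="softmax"),                                  # class probabilities
    ])
    model.compile(optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"])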
2.3 RNNs

Recurrent Neural Networks (RNNs) have been making an impact in sequence-to-sequence mappings, with particularly successful applications in speech recognition, music, language translation, and natural language processing, to name a few Greff et al. (2017); Zaremba (2015); Chung et al. (2014); Boulanger-Lewandowski et al. (2012); Johnson et al. (2016). By their structure, they possess a memory (or state) and include feedback or recurrence. The simple RNN (sRNN) is succinctly expressed, see e.g., Goodfellow et al. (2016):

    h_t = \sigma(W_{hx} x_t + W_{hh} h_{t-1} + b_h)                    (2.1)
    y_t = W_{hy} h_t + b_y

where x_t is the input sequence vector at (time) step t, h_t is the hidden (activation) unit vector at step t, while h_{t-1} is the hidden unit vector at the previous step t-1, and y_t is the output vector at step t. The parameters are the three matrices, namely W_{hx}, W_{hh}, and W_{hy}, and the vector b_h. This constitutes a discrete-step dynamic recurrent system with h_t acting as the state. The parameters are to be determined adaptively via training, mostly using various versions of backpropagation through time (BPTT), e.g., see Greff et al. (2017).

The LSTM RNNs introduce a cell memory and three gating signals to enable effective learning via BPTT Greff et al. (2017). The simple activation state has been replaced with a more involved activation with gating mechanisms. The LSTM RNN uses an additional memory cell (vector) and includes three gates: (i) an input gate i_t, (ii) an output gate o_t, and (iii) a forget gate f_t. These gates collectively control signaling. The standard LSTM is expressed mathematically as Greff et al. (2017); Goodfellow et al. (2016):

    i_t = \sigma_{in}(W_i x_t + U_i h_{t-1} + b_i)
    f_t = \sigma_{in}(W_f x_t + U_f h_{t-1} + b_f)
    o_t = \sigma_{in}(W_o x_t + U_o h_{t-1} + b_o)                    (2.2)
    \tilde{c}_t = \sigma(W_c x_t + U_c h_{t-1} + b_c)
    c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
    h_t = o_t \odot \sigma(c_t)

where the first four equations are a replica of the simple RNN (sRNN) above, with the first three equations serving as gating signals, and thus their nonlinear activation is set as a sigmoid function σ_in. The 4th equation's nonlinearity is arbitrary, typically sigmoid, hyperbolic tangent (tanh), or rectified linear unit (ReLU). This 4th equation is sometimes referred to as the input block. The last two equations entail the memory cell c_t and the activation hidden unit h_t, with the gating signals inserted via point-wise (Hadamard) multiplications (denoted by the symbol ⊙). This represents a discrete-step nonlinear dynamic system with recurrence. The distinct parameters are associated with each replica as W_*, U_*, and b_* in a straightforward fashion. The output layer of the LSTM model may be chosen to be a linear (more accurately, affine) map

    y_t = W_{hy} h_t + b_y                    (2.3)

where y_t is the output, W_{hy} is a matrix, and b_y is a bias vector. In other optional implementations, this layer may be followed by a softmax layer to render the output analogous to probability ranges.
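To make the update equations in (2.2) concrete, the following is a minimal NumPy sketch of a single LSTM step. It is a sketch only (the experiments in this thesis use Keras layers rather than this code), and the use of tanh for the input block and the output squashing is an assumption made for the example.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, p):
        """One step of the standard LSTM of Eq. (2.2).
        p is a dict holding the matrices W_*, U_* and bias vectors b_* for i, f, o, and c."""
        i_t = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])      # input gate
        f_t = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])      # forget gate
        o_t = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])      # output gate
        c_tilde = np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev + p["b_c"])  # input block
        c_t = f_t * c_prev + i_t * c_tilde                                # point-wise (Hadamard) products
        h_t = o_t * np.tanh(c_t)
        return h_t, c_t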
2.4 Transfer Learning

Transfer learning is a deep learning technique where models trained successfully on larger datasets are reused on more specific data. This is done by freezing the weights at deeper layers of the pre-trained model and fine-tuning the weights at shallower layers. The weights can also be used as the starting point for the training process and adapted to the new problem. The main advantages of such a technique are its lower time and data requirements. Many high-performing models have been developed so far (e.g., VGG, GoogLeNet, ResNet). These models are pre-trained on ImageNet Deng et al. (2009) and are widely used for transfer learning.

2.5 Connectionist Temporal Classification

Connectionist Temporal Classification (CTC), introduced by Graves et al. (2006), is a neural network layer and an associated scoring mechanism that tackles sequence-to-sequence problems when there is no alignment between input and output. In the CTC layer, a probability distribution over all labels is predicted at each time step. To calculate the probability of an output sequence, CTC sums over the probabilities of all possible alignments equivalent to a given label. Equivalent alignments of a given label are those in which collapsing repeated characters results in that label. For example, the label sequence 'ab' is defined as equivalent to the label sequences 'aaab', 'aabb', and 'abbb'. To account for repetition in the label sequence, CTC introduces the "blank" token. Blank tokens separate individual characters so that only repeated characters that are not separated by the blank are collapsed. For example, the label sequence 'aab' is defined as equivalent to the label sequences 'aa-ab', 'a-abb', 'a-aab', etc. CTC scores can then be used with the backpropagation algorithm to update the neural network weights.

The predicted probability distribution can also be used to infer a likely output. One intuitive solution could be taking the most likely alignment. However, this solution does not account for the fact that several alignments collapse to the same output. For example, the alignments 'aa-' and 'aaa' can individually have a lower probability than 'bbb', but the sum of their probabilities can be greater than that of 'bbb'. To address this issue, there are more advanced approaches such as beam-search decoding, prefix-search decoding, or token passing.
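The collapsing rule described above can be made concrete with a short sketch. The helper below is hypothetical (it is not part of any CTC library or of the thesis code); it maps an alignment to its label by first merging runs of repeated symbols and then removing blanks, so that the equivalent alignments listed above all reduce to 'aab'.

    BLANK = "-"  # assumed blank symbol for this example

    def collapse(alignment):
        """Collapse a CTC alignment: merge repeated symbols, then drop blank tokens."""
        merged = []
        for symbol in alignment:
            if not merged or symbol != merged[-1]:  # keep only the first symbol of each run
                merged.append(symbol)
        return "".join(s for s in merged if s != BLANK)

    # Equivalent alignments of the label 'aab':
    assert collapse("aa-ab") == "aab"
    assert collapse("a-abb") == "aab"
    assert collapse("a-aab") == "aab"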
2.6 Video Classification

After demonstrating outstanding results on image recognition problems, the next phase is finding an architecture that robustly classifies and analyzes the semantic content of videos. However, this is a harder task: videos have an additional temporal dimension, they are much larger in size, and they may contain different numbers of frames. An intuitive way to extend image-based CNN structures to the video domain is to perform classification on each frame independently and then conduct a late fusion, such as average scoring, to predict the action class of the video. In the following subsection, we examine the techniques and methods that different authors have employed to tackle these issues.

UCF101 Soomro et al. (2012), Sports-1M Karpathy et al. (2014), and HMDB51 Kuehne et al. (2011) have been largely used by researchers to evaluate their models. UCF101 is an action recognition dataset collected from YouTube. It contains 13,320 videos from 101 action categories, with super-categories of Human-Object Interaction, Body-Motion, Human-Human Interaction, Playing Musical Instruments, and Sports. Sports-1M is a large-scale dataset containing 1,133,158 video URLs that have been annotated automatically with 487 sports labels. HMDB51 contains 6,849 videos distributed over 51 action classes. It has been collected from various sources such as movies, the Prelinger Archive, YouTube, and Google videos. Action categories include general facial actions, facial actions with object manipulation, general body movements, body movements with object interaction, and body movements for human interaction.

Figure 2.3: HMDB51 Kuehne et al. (2011)

2.6.1 Related Work

Karpathy et al. (2014) empirically studied various CNN architectures on the Sports-1M and UCF-101 datasets. They proposed three different architectures to model temporal connectivity, namely late fusion, early fusion, and slow fusion networks (Figure 2.4). They first experimented with a single-frame model as a baseline to gauge the contribution of static information to the final predictions. The single-frame architecture is similar to an AlexNet model pre-trained on ImageNet. The late fusion model combined the information from two single-frame networks (up to the last convolution layer) with shared parameters. The two pre-trained nets are placed 15 frames apart, and the merged output is then fed into two successive fully connected layers. Therefore, the first fully connected layer can compute global motion characteristics Karpathy et al. (2014). In the early fusion model, the architecture is similar to the single-frame network, except that the first-layer filters are modified to be of size 11 × 11 × 3 × T. Therefore, the information is combined across an entire time window (clip) early on. They set T to 10 (there are ten frames in each clip) and trained the network from scratch. In the slow fusion model, first-level filters of size 11 × 11 × 3 × 4 with stride 2 generate four responses in time. The process is repeated in the second and third levels with filters of size 11 × 11 × 3 × 2 and stride 2. Thus, the third convolutional layer has access to information across all ten input frames.

Figure 2.4: The video classifier architectures proposed by Karpathy et al. (2014)

To speed up the training process, they adopted a multi-resolution architecture. A two-stream network with inputs of size 89 × 89 × 3, namely a context stream for low-resolution frames and a fovea stream for the center part of the image at the original resolution, is used (Figure 2.5). They took out the last pooling layer, concatenated the two streams' outputs, and fed them into the first fully connected layer. They trained their model on the Sports-1M dataset and then performed transfer learning on UCF-101. They also experimented with fine-tuning the top layer, the top 3 layers, and all layers.
They fed inputs of size 170 × 170 × 3 into the network and randomly sampled 20 clips. Each clip is then propagated through the network four times with different flips and crops. Softmax scores are then averaged to produce a clip-level prediction. To produce the video-level prediction, they averaged over all 20 clips. Using the slow fusion model, they were able to achieve 65.4% accuracy on UCF-101.

Figure 2.5: Multiresolution CNN architecture Karpathy et al. (2014)

Simonyan & Zisserman (2014) introduced a new approach to deal with videos. To model motion features, they proposed to leverage optical-flow-based input. They studied optical flow and trajectory stacking and found that stacking horizontal and vertical optical flow displacement fields between consecutive frames increases the network's performance by a great deal. Their architecture consists of a two-stream network: one for spatial and the other for temporal content (Figure 2.6). Both networks are mostly the CNN-M-2048 introduced by Jia et al. (2014). The input to the spatial net is of size 224 × 224 × 3, and the input to the temporal net is of size 224 × 224 × 2L. In the temporal network, the consecutive grayscale frames are treated as the channel dimension in the convolution layer.

Figure 2.6: Two-stream architecture for video classification Simonyan & Zisserman (2014)

In the training phase, to deal with the different time lengths of videos, a frame is randomly picked from each video. The frame is then fed into the spatial net. The input to the temporal net is a volume of size 224 × 224 × 2L. They found that setting L equal to 10 and placing the horizontal and vertical components of the flow alternately in the stack delivers the best results. To combine temporal and spatial prediction scores, they tried averaging and also training a multi-class SVM. In the testing phase, they sampled 25 frames at equal temporal spacing and propagated them through the two-stream networks. The final answer is then averaged across all samples. Due to the insufficient size of the dataset, to avoid overfitting, one can combine two datasets to increase the amount of training data. However, intersections between the two datasets may cause problems. To circumvent this issue, they performed multi-task learning by placing two softmax layers at the top. They were able to achieve an accuracy of 87% on UCF-101. Similar to the previous paper, to obtain the video-level prediction, they performed sampling and averaging. However, in this method, some samples may include a partial view or even no part of the actual action, and this may cause false labeling. Also, due to sampling, the task of long-range temporal information modeling was still unaccomplished.

Ng et al. (2015) explored the idea of extracting features using a CNN and then aggregating the frame information in a separate network. They proposed two aggregation methods. In the first method, they experimented with various pooling operations and found that max-pooling outperforms other pooling methods in terms of speed and accuracy. In the second method, they passed the extracted features to a separate LSTM. They experimented with various numbers of layers and memory cells and decided to use a five-layer architecture with 512 cell units. They also studied multiple strategies to combine LSTM frame-level predictions into a single video-level prediction. They fine-tuned a pre-trained GoogLeNet Szegedy et al. (2014) on optical flow and raw images and aggregated the results using the late fusion method.
Given the nature of the method presented in their paper, prediction can be made with no need to sample. However, they found that sampling and averaging improve the results. In the training phase, they sampled 300 frames per video. Frames are repeated from the start for shorter videos. For data augmentation, multiple examples per video are obtained by randomly selecting the position of the first frame. They evaluated their proposed architectures on the UCF101 and Sports-1M datasets and were able to achieve 88.6% and 73.1%, respectively.

Figure 2.7: The video classifier architecture proposed by Ng et al. (2015)

Donahue et al. (2017) explored three tasks in their paper: activity recognition, image captioning, and video description. In the first task, they combined temporal and spatial content into a single network by taking advantage of an encoder-decoder idea. In the LRCN network, a CNN model is applied to each frame as a feature extractor before the frames are fed into an LSTM (Figure 2.8). To combine temporal information, they took a late fusion approach: they averaged the per-frame LSTM outputs and then chose the most probable label. As can be seen, the convolution blocks serve as an encoder, and the LSTM blocks serve as a decoder. The CNN model they adopted was a combination of CaffeNet, AlexNet, and the network used by Zeiler & Fergus (2014). The network is pre-trained on ImageNet. The input to the network is RGB and optical flow field images. The final video-level prediction is then obtained by calculating a weighted average in favor of the flow network. They subsample each video to 16 frames, which prevents long-range temporal information modeling, as each video is generally on the order of 100 frames.

Figure 2.8: Long-term Recurrent Convolutional Network Donahue et al. (2017)

They evaluated their model on the UCF-101 dataset and compared it to a single-frame baseline model. As mentioned before, in a single-frame model, each frame is fed into a CNN and the predictions are then averaged across all frames. They were able to obtain 82.92% accuracy. They reported that the hidden unit size is the most effective factor in terms of accuracy, and that aggressive dropout is also beneficial.

Tran et al. (2015) explored the idea of using 3D convolution networks as a feature extractor and spatiotemporal feature learner (Figure 2.9). Their architecture consists of eight convolution, five max-pooling, and two fully connected layers, followed by a softmax function (Figure 2.10). They performed an extensive search to find a good architecture by concentrating on the first-layer kernel size. They kept the spatial receptive field fixed at 3 × 3 and experimented with the temporal depth of the 3D convolution kernels. They found that 3 × 3 × 3 is the best kernel choice.

Figure 2.9: Comparing 2D & 3D convolution kernels Tran et al. (2015)

To learn spatiotemporal features, they trained their model on the Sports-1M dataset. Each video is split into non-overlapping 16-frame clips in the training phase and used as input to the network. Jittering and random crops are used as data augmentation. In the testing phase, ten random clips from each video are fed to the network, and the video-level prediction is then calculated by averaging across all clips. They trained their network from scratch and also fine-tuned it from the 1380K model.

Figure 2.10: C3D proposed by Tran et al. (2015)

Next, they used C3D as a feature extractor on the UCF101 dataset and propagated the output to a multi-class linear SVM.
They observed that the C3D extractor outperforms iDT features. Also, combining C3D and iDT features boosts performance, as the high-level semantic information extractor C3D and the low-level gradient extractor iDT are highly complementary to each other. They were able to obtain 90.4% and 85.2% accuracy with and without iDT features, respectively. They also used a deconvolution layer to decode what C3D is learning internally. They found that the network initially concentrates on spatial information in the first few frames and on the salient motion in the subsequent frames.

Wang et al. (2016) addressed the inability of the two-stream network, mentioned earlier, to capture long-range temporal information. They introduced a temporal segment network that receives evenly distributed snippets rather than random clips. They modified the two-stream network by splitting videos into segments. A snippet is then chosen from each segment. A segmental consensus is then derived for each modality (RGB images, optical flow field, etc.). They empirically evaluated segmental consensus functions (even averaging, maximum, weighted averaging) and decided to use even averaging for their architecture. Probability scores from each modality are then weight-averaged to produce the video-level prediction (Figure 2.11).

Figure 2.11: TSN proposed by Wang et al. (2016)

They set k (the number of segments) equal to 3 according to previous works on temporal modeling by Gaidon et al. (2013) & Wang et al. (2014), and adopted the Inception model with Batch Normalization (BN-Inception) by Ioffe & Szegedy (2015), pre-trained on ImageNet, as the building block of the two-stream networks. To benefit from the pre-trained network on optical flow, they scaled up the input to the range 0-255, averaged the weights across the RGB channels, and replicated them by the number of channels in the temporal network input. They studied the effect of adding additional modalities such as RGB difference and warped optical flow to extract salient motion and mitigate the effect of camera movement. They also benefited from data augmentation, heavy dropout, and batch normalization. They evaluated their model on the HMDB51 and UCF101 datasets and achieved 69.4% and 94.2%, respectively.

Zhu et al. (2017) introduced an end-to-end two-stream-based network. To speed up the process and avoid extra storage, the network generates an optical flow field rather than being fed precomputed optical flow images. Optical flow generation can be regarded as an image reconstruction problem. For two adjacent frames f1 and f2, they generated the flow V and then reconstructed f2 from f1 and V using inverse warping, minimizing the reconstruction error. They stacked this network, called MotionNet, on top of a temporal stream, treating it as an off-the-shelf flow estimator (Figure 2.12). This modification resulted in a more than 20x speedup compared to the traditional two-stream approaches.

Figure 2.12: The video classifier architecture proposed by Zhu et al. (2017)

In the training phase, they evenly sampled 25 frames per video, performed 10x augmentation by flipping and cropping, and then averaged the prediction scores (before the softmax operation) over all crops of the samples. They evaluated their network on the UCF101 dataset and were able to obtain 89.2% accuracy. They improved their performance even further to 92.5% by fusing the spatial and temporal streams using the TSN method.
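A step that recurs in nearly all of the systems above is late fusion at inference time: frames or clips are sampled from the video, scored independently, and the class scores are averaged to obtain the video-level prediction. The sketch below illustrates only that averaging step; frame_scores is an assumed per-frame scoring function (e.g., the softmax output of a frame-level CNN) and is not taken from any of the cited codebases.

    import numpy as np

    def video_prediction(frames, frame_scores, num_samples=25):
        """Late fusion: sample frames evenly, score each one, and average the class scores."""
        idx = np.linspace(0, len(frames) - 1, num=num_samples, dtype=int)   # evenly spaced temporal samples
        scores = np.stack([frame_scores(frames[i]) for i in idx])           # per-frame class scores
        return int(np.argmax(scores.mean(axis=0)))                          # average scores, then pick the class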
CHAPTER 3

SLIM LSTMS

In this chapter, we implement and evaluate ten slim variants, proposed by Salem (2018), of the standard Long Short-Term Memory (LSTM) recurrent neural network. For simplicity, we refer to these models as LSTM1, LSTM2, LSTM3, LSTM4, LSTM5, LSTM4a, LSTM5a, LSTM6, LSTM10 & LSTM11. In the author's most recent work, the first eight models have been referred to as LSTM-1 RNN, LSTM-2 RNN, LSTM-3 RNN, LSTM-4 RNN, LSTM-5 RNN, LSTM-4i RNN, LSTM-5i RNN, and LSTM-6 RNN. These variants have been generated by uniformly reducing blocks of adaptive parameters in the gating mechanisms. Such parameter-reduced variants speed up training computations and are more suitable for implementation on constrained embedded platforms. We comparatively evaluate and verify these variant models on three classical datasets and demonstrate that they are comparable to a standard implementation of the LSTM model while using fewer parameters. The implementation of the models is publicly accessible through Akandeh (2019).

3.1 Introduction

As mentioned earlier, the standard LSTM is expressed mathematically as

    i_t = \sigma_{in}(W_i x_t + U_i h_{t-1} + b_i)
    f_t = \sigma_{in}(W_f x_t + U_f h_{t-1} + b_f)
    o_t = \sigma_{in}(W_o x_t + U_o h_{t-1} + b_o)                    (3.1)
    \tilde{c}_t = \sigma(W_c x_t + U_c h_{t-1} + b_c)
    c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
    h_t = o_t \odot \sigma(c_t)

where σ_in is the inner activation (logistic sigmoid) function, which is bounded between 0 and 1, and ⊙ denotes point-wise multiplication. The output layer of the LSTM model may be chosen to be a linear map, namely,

    y_t = W_{hy} h_t + b_y                    (3.2)

LSTMs can be viewed as composed of the cell network and its three gating networks. LSTMs are relatively slow due to the fact that they have four sets of "weights," of which three are involved in the gating mechanism. In the following sections, we describe and demonstrate the comparative performance of ten simplified LSTM variants obtained by removing select blocks of adaptive parameters from the gating mechanism, and we demonstrate that these variants are a competitive alternative to the original LSTM model while requiring less computational cost.

3.2 New Variants of the LSTM model

The LSTM uses a gating mechanism to control the signal flow. It possesses three gating signals driven by three main components, namely the external input signal, the previous state, and a bias. We have proposed ten variants of the LSTM model, aiming at reducing the number of (adaptive) parameters in each gate and thus reducing the computational cost Salem (7.11.2016). The first three models have been demonstrated previously in initial experiments in Lu & Salem (2017).

LSTM1

In this first model variant, the input signals and their corresponding weights, namely the terms W_i x_t, W_f x_t, W_o x_t, have been removed from the three corresponding gating signals. The resulting model becomes

    i_t = \sigma_{in}(U_i h_{t-1} + b_i)
    f_t = \sigma_{in}(U_f h_{t-1} + b_f)
    o_t = \sigma_{in}(U_o h_{t-1} + b_o)                    (3.3)
    \tilde{c}_t = \sigma(W_c x_t + U_c h_{t-1} + b_c)
    c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
    h_t = o_t \odot \sigma(c_t)

LSTM2

In this second model variant, the gates have no bias and no input signals W_i x_t, W_f x_t, W_o x_t. Only the state is used in the gating signals. This produces

    i_t = \sigma_{in}(U_i h_{t-1})
    f_t = \sigma_{in}(U_f h_{t-1})
    o_t = \sigma_{in}(U_o h_{t-1})                    (3.4)
    \tilde{c}_t = \sigma(W_c x_t + U_c h_{t-1} + b_c)
    c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
    h_t = o_t \odot \sigma(c_t)

LSTM3

In the third model variant, the only term in the gating signals is the (adaptive) bias. This model uses the fewest parameters among all the variants.

    i_t = \sigma_{in}(b_i)
    f_t = \sigma_{in}(b_f)
    o_t = \sigma_{in}(b_o)                    (3.5)
    \tilde{c}_t = \sigma(W_c x_t + U_c h_{t-1} + b_c)
    c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
    h_t = o_t \odot \sigma(c_t)
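To show how little computation remains in the gating path of LSTM3, here is a minimal NumPy sketch of a single LSTM3 step following Eq. (3.5). It is only an illustration of the update equations (the reported experiments use the Keras implementations released in Akandeh (2019)), and the tanh choices for the cell nonlinearities are assumptions made for the example.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm3_step(x_t, h_prev, c_prev, p):
        """One step of the LSTM3 variant (Eq. 3.5): gates are driven by adaptive biases only."""
        i_t = sigmoid(p["b_i"])   # input gate: bias only
        f_t = sigmoid(p["b_f"])   # forget gate: bias only
        o_t = sigmoid(p["b_o"])   # output gate: bias only
        c_tilde = np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev + p["b_c"])  # cell body is unchanged
        c_t = f_t * c_prev + i_t * c_tilde
        h_t = o_t * np.tanh(c_t)
        return h_t, c_t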
LSTM4

In the fourth model variant, the U_i, U_f, U_o matrices of LSTM2 have been replaced with the corresponding vectors u_i, u_f, u_o. The intent is to apply the state signal through a point-wise multiplication. Thus, one reduces parameters while retaining state feedback in the gating signals.

    i_t = \sigma_{in}(u_i \odot h_{t-1})
    f_t = \sigma_{in}(u_f \odot h_{t-1})
    o_t = \sigma_{in}(u_o \odot h_{t-1})                    (3.6)
    \tilde{c}_t = \sigma(W_c x_t + U_c h_{t-1} + b_c)
    c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
    h_t = o_t \odot \sigma(c_t)

LSTM5

In the fifth model variant, we revise LSTM1 so that the matrices U_i, U_f, U_o are replaced with the corresponding vectors denoted by small letters. Then, as in LSTM4, we obtain (Hadamard) point-wise multiplication in the state variables.

    i_t = \sigma_{in}(u_i \odot h_{t-1} + b_i)
    f_t = \sigma_{in}(u_f \odot h_{t-1} + b_f)
    o_t = \sigma_{in}(u_o \odot h_{t-1} + b_o)                    (3.7)
    \tilde{c}_t = \sigma(W_c x_t + U_c h_{t-1} + b_c)
    c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
    h_t = o_t \odot \sigma(c_t)

LSTM4a

In LSTM4a, a fixed real number with an absolute value less than one has been set for the forget gate in order to preserve bounded-input-bounded-output (BIBO) stability Salem (2016). Meanwhile, the output gate is set to 1 (which in practice eliminates this gate altogether).

    i_t = \sigma_{in}(u_i \odot h_{t-1})
    f_t = c
    o_t = 1.0                    (3.8)
    \tilde{c}_t = \sigma(W_c x_t + U_c h_{t-1} + b_c)
    c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
    h_t = o_t \odot \sigma(c_t)

LSTM5a

LSTM5a is similar to LSTM4a, but the bias term in the input gate equation is preserved.

    i_t = \sigma_{in}(u_i \odot h_{t-1} + b_i)
    f_t = c
    o_t = 1.0                    (3.9)
    \tilde{c}_t = \sigma(W_c x_t + U_c h_{t-1} + b_c)
    c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
    h_t = o_t \odot \sigma(c_t)

LSTM6

Finally, this is the most aggressive parameter reduction. Here all gating equations are replaced by an appropriate constant. For BIBO stability, we found that the forget gate must be set to f_t = 0.59 or below. The other two gates are set to 1 each (which practically eliminates them for computational efficiency purposes). In fact, this model variant now becomes equivalent to the so-called basic RNN model reported in Salem (2016).

    i_t = 1.0
    f_t = \alpha
    o_t = 1.0                    (3.10)
    \tilde{c}_t = \sigma(W_c x_t + U_c h_{t-1} + b_c)
    c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
    h_t = o_t \odot \sigma(c_t)

LSTM10

In this model, point-wise multiplication is applied to the hidden state and its corresponding weights in the cell-body equation as well. We apply this modification not only to the gating equations but also to the main equation, i.e., the matrix U_c is replaced with the vector u_c for the point-wise multiplication.

    i_t = \sigma_{in}(u_i \odot h_{t-1})
    f_t = \sigma_{in}(u_f \odot h_{t-1})
    o_t = \sigma_{in}(u_o \odot h_{t-1})                    (3.11)
    \tilde{c}_t = \sigma(W_c x_t + u_c \odot h_{t-1} + b_c)
    c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
    h_t = o_t \odot \sigma(c_t)

LSTM11

This variant is similar to LSTM10. However, it reinstates the biases in the gating signals. Mathematically, it is expressed as

    i_t = \sigma_{in}(u_i \odot h_{t-1} + b_i)
    f_t = \sigma_{in}(u_f \odot h_{t-1} + b_f)
    o_t = \sigma_{in}(u_o \odot h_{t-1} + b_o)                    (3.12)
    \tilde{c}_t = \sigma(W_c x_t + u_c \odot h_{t-1} + b_c)
    c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
    h_t = o_t \odot \sigma(c_t)

3.3 Dataset

We train and evaluate our models on three benchmark datasets, namely the MNIST, IMDB & 20 Newsgroups datasets. To avoid clutter in the figures, we plot performance in three sub-groups. The first group contains LSTM, LSTM1, LSTM2, LSTM3, LSTM4 & LSTM5. The second group depicts LSTM, LSTM4, LSTM5, LSTM4a & LSTM5a. The last group compares LSTM, LSTM6, LSTM10 & LSTM11.

3.3.1 MNIST

The MNIST dataset contains 28 × 28 handwritten digit images. Treating the images as row-wise sequences, each model reads one row at a time from top to bottom and produces its output after seeing all 28 rows. Table 3.1 provides the specification of the network used. Three different nonlinearities, i.e., tanh, sigmoid, and relu, have been employed in the first (LSTM) layer. For each case, we perform parameter tuning over different values of the learning parameter.

Table 3.1: MNIST - Network specifications

    Input dimension          28 × 28
    Number of hidden units   100
    Nonlinear function       tanh, sigmoid, relu
    Output dimension         10
    Nonlinear function       softmax
    Number of epochs         100
    Batch size               32
    Optimizer                RMSprop
    Loss function            categorical cross-entropy
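The following is a minimal Keras sketch of the configuration summarized in Table 3.1, treating each 28 × 28 image as a sequence of 28 row vectors of length 28. It is an illustrative reconstruction of the setup rather than the exact training script (the released code in Akandeh (2019) is the reference implementation), and the learning rate shown is just one of the η values explored.

    from tensorflow.keras import layers, models, optimizers

    # Each 28x28 MNIST image is read as a sequence of 28 rows with 28 features per step.
    model = models.Sequential([
        layers.LSTM(100, activation="tanh", input_shape=(28, 28)),  # tanh, sigmoid, and relu are compared in the text
        layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer=optimizers.RMSprop(learning_rate=1e-4),  # one of the eta values explored
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])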
3.3.1.1 The Tanh Activation

The tanh activation has been used as the nonlinearity of the first hidden layer. To improve the performance of the model, we search for an optimal learning parameter η. There is a small amount of fluctuation in the testing accuracy; however, all variants converge to above 98%. The general trend among all three η values is that LSTM1 and LSTM2 have the closest prediction to the standard LSTM. Then LSTM5 follows, and finally LSTM4 and LSTM3. As shown, setting η = 0.002 results in a test accuracy of 98.60% for LSTM3 (i.e., the fastest model with the fewest parameters), which is close to the best test score of the standard LSTM, i.e., 99.09%.

In training LSTM4a, LSTM5a & LSTM6, there is initially a gap among the different model variants. However, they all catch up with the base (standard) LSTM quickly within the 100 epochs. As one increases η, more fluctuation is observed. However, the models still sustain their performance levels. One advantage of the model variant LSTM6 is that it rises faster in comparison to the other variants. However, η = 0.002 causes it to decrease, indicating that this η value is relatively large. In all cases, the LSTM4a and LSTM5a performance curves overlap. With a little offset, LSTM10 and LSTM11 also perform reasonably. The best results obtained among all the epochs are shown in Table 3.2.

Figure 3.1: MNIST - Training & Test accuracy, σ = tanh, η = 1e-4

3.3.1.2 The Sigmoid Activation

Next, the sigmoid activation has been used as the nonlinearity of the first hidden layer. The same trend is observed using the sigmoid nonlinearity. In this case, one can clearly observe the training profile of each model. LSTM1, LSTM2, LSTM5, LSTM4, and LSTM3 have the closest prediction to the base LSTM, in that order. Again, a larger η results in better test accuracy and more fluctuation. It is observed that setting η = 0.002 results in a test score of 98.34% for LSTM3, which is close to the test score of the base LSTM, 98.86%.

For the second group, the only difference is that the sigmoid activation with η = 1e-4 progresses slowly towards its maximal performance. After performing parameter tuning, the typical test accuracy of the base LSTM model is 99%, the test accuracy of variants LSTM4 and LSTM5 is about 98%, and the test accuracy of variants LSTM4a and LSTM5a is about 97%. As shown in the table and associated plots, η = 1e-4 seems relatively small when using the sigmoid activation. As evidence, LSTM6 attains only a training and test accuracy of about 74% after 100 epochs, while
3.3.1.3 The ReLU activation Using relu activation as the nonlinearity of the first hidden layer, it is observed that the performance of LSTM, LSTM1, and LSTM2 drops after a number of epochs; however, this is not the case for LSTM3, LSTM4, and LSTM5. These latter models are sustained for all three choices of η. Also, LSTM3, the fastest model with the least number of parameters, shows the best performance among all five variants! With the relu as nonlinearity, the models fluctuate for larger η, which is not within the tolerance range of the model. Setting η = 0.002 results in test score of 99.00% for LSTM3 which beat the best test score of the base LSTM, i.e., 98.43%. The best results obtained for all 28 Figure 3.3: MNIST - Training & Test accuracy, σ = tanh, η = 1e−3 models are summarized in table 3.4. In the second group, all models perform well with η = 1e − 4. In this case, LSTM4a and LSTM5a do not overlap. For η = 1e − 3, the base LSTM does not sustain its performance and drastically drops. One may need to leverage an early stopping strategy to avoid this problem. In this case, the LSTM model begins to fall around epoch=50. LSTM6 also drops in performance. Model LSTM4, LSTM5, LSTM4a and LSTM5a show sustained accuracy performance as for η = 1e − 4. For η = 2e − 3 model, LSTM4 and LSTM5 are still sustaining their performances. In the third group, LSTM6 and LSTM11 perform well, and LSTM10 does not rise. We have explored a range of η for LSTM10 with relu activation to improve its performance. However, the effort was not successful. Increasing η from 2e − 6 to 1e − 5 leads to an increase in accuracy with value of 53.13%. For η less than 1e−5, the plots have an increasing trend. However, after this point, the accuracy starts to drop after several epochs depending on the value of η. 29 Table 3.2: MNIST - Best results obtained using tanh. η = 1e−4 η = 1e−3 η = 2e−3 train 0.9995 1.0000 0.9994 LSTM test 0.9853 0.9909 0.9903 train 0.9993 0.9999 0.9996 LSTM1 test 0.9828 0.9906 0.9907 train 0.999 0.9997 0.9995 LSTM2 test 0.9849 0.9897 0.9897 train 0.9889 0.9977 0.9983 LSTM3 test 0.9781 0.9827 0.9860 train 0.9785 0.9975 0.9958 LSTM4 test 0.9734 0.9853 0.9834 train 0.9898 0.9985 0.9983 LSTM5 test 0.9774 0.9835 0.9859 train 0.9835 0.9957 0.9944 LSTM4a test 0.9698 0.9803 0.9792 train 0.9836 0.9977 0.998 LSTM5a test 0.9700 0.9820 0.9821 train 0.9948 0.9879 0.9657 LSTM6 test 0.9771 0.9792 0.9656 train - 0.9273 - LSTM10 test - 0.9225 - train - 0.9573 - LSTM11 test - 0.9514 - 3.3.2 IMDB The IMDB Datasets is a binary sentiment classification dataset. In our model, a dictionary size of 5000 has been used. Each review is truncated or padded to 500 words. The first layer is an embedding layer which is a simple multiplication that transforms words into their corresponding word embedding. The output is then passed to an LSTM layer following a dense layer. We used the Adam optimizer, a batch size of 32, and the binary cross-entropy loss function. The network specification, adopted from Keras 1.2 examples, is given in table 3.5. 3.3.2.1 The Sigmoid activation In this case, also LSTM1, LSTM2, LSTM3, LSTM5, and LSTM4 have the closest prediction to the base LSTM, respectively. Setting η = 0.0001 results in an almost fluctuation-free profile. It is 30 Figure 3.4: MNIST - Training & Test accuracy, σ = sigmoid, η = 1e−4 observed that setting η = 0.002 results in a test score of 86.86% in LSTM3, which is close to the test score of base LSTM 88.68%. 
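For reference, a minimal sketch of the IMDB setup (dictionary size 5000, reviews padded to 500 words, an embedding layer followed by a recurrent layer and a dense output, trained with Adam and binary cross-entropy) might look as follows; the 32-dimensional embedding is read off the input dimension listed in Table 3.5, while the single-sigmoid output head and the learning rate are assumptions.

```python
import tensorflow as tf

vocab_size, maxlen = 5000, 500
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data(num_words=vocab_size)
x_train = tf.keras.preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = tf.keras.preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)

model = tf.keras.Sequential([
    # Embedding layer: a simple lookup that maps word indices to 32-d vectors.
    tf.keras.layers.Embedding(vocab_size, 32, input_length=maxlen),
    # Recurrent layer: the standard LSTM here, or any of the slim variants.
    tf.keras.layers.LSTM(100, activation="sigmoid"),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # binary sentiment head, as in the Keras example
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=100, batch_size=32)
```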
LSTM5a, LSTM4a, LSTM5, LSTM4, and LSTM6 have the closest prediction to the base LSTM, in that order. In these cases, the LSTM4a and LSTM5a performance curves overlap. It is observed that setting η = 0.002 results in a test score of 89.12% for LSTM5a, which beats the test score of the base LSTM, 88.68%.

LSTM11 and LSTM6 have the closest prediction to the base LSTM. Here, as expected, LSTM11 shows better performance than LSTM10. Setting η = 0.0001 results in a very smooth profile with less fluctuation. The best results obtained over the 100 epochs are summarized in Table 3.6.

To compensate for the decreased number of parameters, the dimension of the hidden units has been increased along with different, smaller values of η (figure 3.16). As shown, higher dimensions need fewer epochs to reach a leveling-off profile. Setting η = 1.25e−5 creates an almost fluctuation-free profile. In this figure, 'lstm62' stands for LSTM6 using 200 hidden units.

Figure 3.5: MNIST - Training & Test accuracy, σ = sigmoid, η = 2e−3

The forget-gate constant value must be less than one in absolute value for bounded-input bounded-output (BIBO) stability, Salem (2016). Setting f > 0.59 did not work for the MNIST dataset and made the model unstable during training. We initially start with the same forget hyper-parameter value f on the IMDB dataset and then gradually increase it. We observe that the network retains BIBO-stable performance up to f_t = 0.96. In figure 3.17, 'lstm629' denotes LSTM6 using h = 200 and f = 0.99. Note that the accuracy profiles (figure 3.17) show an increasing trend and do not appear to level off after 100 epochs. Running the network training for 200 epochs, it is observed that LSTM6 surpasses the standard LSTM at around epoch 150.

3.3.3 20 Newsgroups

The 20 Newsgroups dataset is a collection of 20,000 documents spanning 20 different newsgroups. GloVe embeddings are used to pre-train the embedding layer. The network architecture is adapted from the Keras 1.2 examples. We used the RMSprop optimizer, a batch size of 128, and the categorical cross-entropy loss function. Table 3.7 provides the network specification. We have applied our variants in the bidirectional layer.

Figure 3.6: MNIST - Training & Test accuracy, σ = sigmoid, η = 1e−4

For this dataset, we evaluate our new model LSTM6a as well. In LSTM6a, the matrix U_c in the cell equation is replaced with a corresponding vector u_c, in order to render a point-wise multiplication instead. Thus, the variant equations become

$$
\begin{aligned}
&i_t = 1.0, \qquad f_t = f, \quad -1 < f < 1.0, \qquad o_t = 1.0 \\
&\tilde{c}_t = \sigma(W_c x_t + u_c \odot h_{t-1} + b_c) \\
&c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
&h_t = o_t \odot \sigma(c_t)
\end{aligned} \tag{3.13}
$$

Figure 3.7: MNIST - Training & Test accuracy, σ = sigmoid, η = 2e−3

3.3.3.1 The Tanh Activation

Due to a considerable amount of fluctuation when using sigmoid, we only depict results for the tanh nonlinearity. As shown, LSTM1, LSTM2, LSTM3, LSTM5, and LSTM6a show better performance than the standard LSTM. Also, LSTM4a, LSTM5a, and LSTM10 have a very close prediction to it. As shown, setting η = 0.001 results in a test accuracy of 81.33% for LSTM3, which outperforms the best test score of the standard LSTM, i.e., 77.75%.

Figure 3.8: MNIST - Training & Test accuracy, σ = sigmoid, η = 1e−3
Figure 3.9: MNIST - Training & Test accuracy, σ = relu, η = 1e−4

Table 3.3: MNIST - Best results obtained using sigmoid.
η = 1e−4 η = 1e−3 η = 2e−3 train 0.9751 0.9972 0.9978 LSTM test 0.9739 0.9880 0.9886 train 0.9584 0.9901 0.9905 LSTM1 test 0.9635 0.9863 0.9858 train 0.9636 0.9901 0.9907 LSTM2 test 0.9660 0.9856 0.9858 train 0.8721 0.9787 0.9828 LSTM3 test 0.8757 0.9796 0.9834 train 0.8439 0.9793 0.9839 LSTM4 test 0.8466 0.9781 0.9822 train 0.9438 0.9849 0.9879 LSTM5 test 0.9431 0.9829 0.9844 train 0.8728 0.9726 0.9778 LSTM4a test 0.8770 0.9720 0.9768 train 0.8742 0.9725 0.9789 LSTM5a test 0.8788 0.9707 0.9783 train 0.7373 0.9495 0.9636 LSTM6 test 0.7423 0.9513 0.9700 train - 0.9168 - LSTM10 test - 0.9184 - train - 0.9407 - LSTM11 test - 0.9403 - 36 Figure 3.10: MNIST - Training & Test accuracy, σ = relu, η = 2e−3 Figure 3.11: MNIST - Training & Test accuracy, σ = relu, η = 1e−4 37 Figure 3.12: MNIST - Training & Test accuracy, σ = relu, η = 1e−3 Figure 3.13: MNIST - Training & Test accuracy of different η, lstm10, relu 38 Table 3.4: MNIST - Best results obtained using relu. η = 1e−4 η = 1e−3 η = 2e−3 train 0.9932 0.9829 0.9787 LSTM test 0.9824 0.9843 0.9833 train 0.9926 0.9824 0.9758 LSTM1 test 0.9803 0.9832 0.9806 train 0.9896 0.9795 0.98 LSTM2 test 0.9802 0.9805 0.9836 train 0.9865 0.9967 0.9968 LSTM3 test 0.9808 0.9882 0.9900 train 0.9808 0.9916 0.9918 LSTM4 test 0.9796 0.9857 0.9847 train 0.987 0.9962 0.9964 LSTM5 test 0.9807 0.9885 0.9892 train 0.9906 0.9949 0.1124 LSTM4a test 0.9775 0.9878 0.1135 train 0.9904 0.996 0.1124 LSTM5a test 0.9769 0.9856 0.1135 train 0.9935 0.9719 0.09737 LSTM6 test 0.9761 0.9720 0.0982 train - 0.4018 - LSTM10 test - 0.5226 - train - 0.9597 - LSTM11 test - 0.9582 - Table 3.5: IMDB - Network specifications. Input dimension 32 × 500 Number of hidden units 100 Non-linear function tanh, sigmoid, tanh Output dimension 2 39 Figure 3.14: IMDB - Training & Test accuracy, σ = sigmoid, η = 1e−4 Figure 3.15: IMDB - Training & Test accuracy, σ = sigmoid, η = 1e−4 40 Table 3.6: IMDB - Best results obtained sigmoid. η = 1e−4 η = 1e−3 η = 2e−3 train 0.9906 1.0000 1.0000 LSTM test 0.8856 0.8868 0.8824 train 0.9744 0.9923 0.9968 LSTM1 test 0.8835 0.8840 0.8766 train 0.9746 0.9929 0.9954 LSTM2 test 0.8853 0.8804 0.8800 train 0.9542 0.9863 0.9955 LSTM3 test 0.8537 0.8426 0.8686 train 0.9346 0.9616 0.9920 LSTM4 test 0.8165 0.8204 0.8624 train 0.9637 0.9934 0.9982 LSTM5 test 0.8482 0.8702 0.8746 train 0.9781 0.9897 0.9955 LSTM4a test 0.8914 0.8790 0.8766 train 0.9782 0.9901 0.9955 LSTM5a test 0.8912 0.8790 0.8762 train 0.7993 0.9989 0.9909 LSTM10 test 0.7112 0.8512 0.8470 train 0.9104 0.9996 0.9969 LSTM11 test 0.8374 0.8564 0.8596 Figure 3.16: IMDB - Training & Test accuracy, σ = sigmoid, η = 1.25e−5 41 Figure 3.17: IMDB - Training & Test accuracy, σ = sigmoid, η = 1e−5 Table 3.7: News20 - Network specifications. Input dimension 1000 Embedding layer 1000 × 100 Conv1D(128, 5,’relu’) 996 × 128 Maxpooling1D(5) 199 × 128 Conv1D(128, 5,’relu’) 195 × 128 Maxpooling1D(5) 39 × 128 Conv1D(128, 5,’relu’) 35 × 128 Maxpooling1D(2) 17 × 128 Number of epochs 100 Bidirectional(lstmi) 256 Dense 128 Dense 6 42 Figure 3.18: News20 - Training & Test accuracy, σ = tanh, η = 1e−3 Figure 3.19: News20 - Training & Test accuracy, σ = tanh, η = 1e−3 43 Figure 3.20: News20 - Training & Test accuracy, σ = tanh, η = 1e−3 Table 3.8: News20 - Best results obtained using tanh. 
(each entry: train / test accuracy)
Model | η = 5e−4 | η = 1e−3 | η = 2e−3
LSTM | 0.9519 / 0.7592 | 0.9581 / 0.7750 | 0.9600 / 0.7775
LSTM1 | 0.9508 / 0.7608 | 0.9554 / 0.8033 | 0.9563 / 0.8058
LSTM2 | 0.9415 / 0.7517 | 0.9560 / 0.7883 | 0.9552 / 0.7942
LSTM3 | 0.9513 / 0.7917 | 0.9563 / 0.8133 | 0.9567 / 0.7925
LSTM5 | 0.9519 / 0.7783 | 0.9552 / 0.7800 | 0.9575 / 0.7967
LSTM4a | 0.9181 / 0.7558 | 0.9313 / 0.7775 | 0.9163 / 0.7708
LSTM5a | 0.9271 / 0.7558 | 0.9313 / 0.7600 | 0.9254 / 0.7700
LSTM6 | 0.8169 / 0.7158 | 0.8448 / 0.7200 | 0.7850 / 0.7100
LSTM6a | 0.9552 / 0.7792 | 0.9583 / 0.7942 | 0.9556 / 0.7842
LSTM10 | 0.9506 / 0.7250 | 0.9571 / 0.7650 | 0.9573 / 0.7758

3.4 Conclusion

Nine variants of the base LSTM model have been presented and evaluated. These models have been examined and evaluated using different nonlinearities and different learning rates on three classical datasets. It has been found that the new model variants are comparable to the standard LSTM model. Thus, these variant models may be suitably chosen in applications in order to benefit from their speed and/or reduced computational cost.

CHAPTER 4
FEATURE EXTRACTION MODEL

Early SLR systems performed data acquisition using sensor-based devices such as data gloves and accelerometers. These devices provide position, orientation, velocity, and other specifics of the hands. However, due to the prohibitive costs of such approaches, vision-based devices have been introduced. Microsoft Kinect, Leap Motion Controller, and Google Tango provide RGB and depth-map information, which can be used as image-based input to a network. In one of our solutions to the SLR task in the next chapters, we go one step further and propose to acquire data using a regular camera and then extract the key characteristics of each sign using the MediaPipe Hands and Pose modules.

MediaPipe is a graph-based framework for building multimodal (video, audio, and sensor) applied machine learning pipelines, MediaPipe (2019), introduced by Google. MediaPipe's use of graphs and calculators means that the work of one project can easily translate to the work of another. One advantage of MediaPipe is that it does not rely on powerful desktop environments for inference and achieves real-time performance on mobile phones, laptops, and even on the web. In addition, by employing GPU acceleration and multi-threading, MediaPipe is one of the fastest ML solutions available.

4.1 MediaPipe Hands

Detecting hands is a complex task. The model has to recognize a variety of hand sizes and also address occlusion issues. In addition, as there are not many distinguishing features in hands themselves, additional context such as the arm and body is required to locate the hands accurately. The MediaPipe Hands module addresses these challenges by first training a palm detector and then training a hand landmark model on the generated cropped hand images. The cropped images can also be generated based on the hand landmarks identified in the previous frame. The palm detection model is invoked only when the landmark model can no longer identify hand presence.

The palm detection model operates over the whole image and returns cropped hand images. MediaPipe proposes to employ a palm detection model instead of a hand detection one, since estimating bounding boxes of rigid objects like palms is significantly simpler than detecting hands with articulated fingers. Also, non-maximum suppression algorithms work well on small objects like palms. Moreover, since palms can be modeled using square bounding boxes, other aspect ratios can be ignored.
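As a rough illustration of how these modules are invoked from Python, the snippet below extracts hand and pose landmarks from a single frame; the confidence thresholds and the frame path are assumptions, and recent MediaPipe releases expose a full-body pose model rather than the 25-point upper-body model described in the next section.

```python
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands
mp_pose = mp.solutions.pose

frame_bgr = cv2.imread("frame_0001.png")             # hypothetical frame extracted from a sign video
frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)

with mp_hands.Hands(static_image_mode=True, max_num_hands=2,
                    min_detection_confidence=0.5) as hands, \
     mp_pose.Pose(static_image_mode=True) as pose:

    hand_results = hands.process(frame_rgb)
    pose_results = pose.process(frame_rgb)

    # 21 normalized (x, y) landmarks per detected hand.
    if hand_results.multi_hand_landmarks:
        for hand in hand_results.multi_hand_landmarks:
            coords = [(lm.x, lm.y) for lm in hand.landmark]
            print(len(coords), coords[9])             # landmark 9: base of the middle finger, near the palm centre

    # Body landmarks, later used to encode hand location relative to the head and torso.
    if pose_results.pose_landmarks:
        nose = pose_results.pose_landmarks.landmark[0]   # landmark 0 is the nose
        print(nose.x, nose.y)
```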
An average precision of 95.7% in the palm detection model has been reported. The hand landmark model performs keypoint localization of 21 3D coordinates on the cropped hand images generated by the palm detection model using regression (see figure 4.1). To perform supervised training, they have manually annotated 30K real-world images with 21 3D coordinates. They have also generated a synthetic hand model over various backgrounds and map it to the corresponding 3D coordinates to provide additional supervision. Figure 4.1: MediaPipe hands tracking module output example 4.2 MediaPipe Pose MediaPipe pose module infers 25 2D upper-body landmarks by first locating the pose region- of-interest (ROI) and then training a pose landmark model on the generated region. The ROI can also be generated based on the pose landmarks identified in the previous frame. The pose detection model is invoked only when the landmark model can no longer identify body pose presence (see figure 4.2). Inspired by Leonardo’s Vitruvian man Vitruvian Man (1490), the pose detection model predicts 47 Figure 4.2: MediaPipe pose module output example MediaPipe (2019) two virtual key-points, namely shoulder, and hip midpoints, that describe the circle circumscribing the person (see figure 4.3). The incline angle of the line connecting the shoulder and hip midpoints is also inferred. Similar to the hand landmark model, the pose landmark model performs keypoint localization of 25 upper-body coordinates on pose region-of-interest generated by the pose detection model. Figure 4.3: The proportions of the human body according to Vitruvius MediaPipe (2019) 48 CHAPTER 5 CHARACTER-LEVEL SLR To investigate character-level Sign Language Recognition, we study two different datasets with uniform and complex backgrounds. To examine grayscale Sign Language MNIST, we propose an 8-layers Convolutional neural network. To study ASL Fingerspelling A datasets, we propose two different approaches. In the first approach, we use VGG19 pre-trained with ImageNet and fine-tune it. In the second approach, we use MediaPipe to extract 2D landmarks of the hand, and then create new black and white images with uniform background and then feed them to the same 8-layer architecture proposed for Sign Language MNIST. The experiment results demonstrate the effectiveness of our approach on both datasets. We achieve 99.9% accuracy for the first dataset and 96% & 98% accuracy for the second dataset. The implementation of the models is publicly accessible through Akandeh (2021). 5.1 Introduction To perform a thorough study of Sign Language Recognition, we start by experimenting with the character-level SLR. A practical application for Character-level SLR is fingerspelling. Finger- spelling is used for the letter by letter signing of names, place names, age, numbers, date, year, and words that do not have pre-defined signs in the vocabulary Adithya & Rajesh (2020). In this chapter, we study image-based input to the SLR system. We can then apply our findings in world-level and sentence-level SLR, where we have to deal with video-based input. First, we experiment with Sign Language MNIST, a grayscale-based dataset with uniform background. Then we investigate ASL Fingerspelling A datasets, in which hands are presented in complex background conditions. To deal with complex background dataset, we use domain knowledge to transform the raw images into the most discriminative representation by which the model can detect and differentiate the patterns accurately. 
We also try a second approach in which we minimize prior knowledge and feed raw images to the network. 49 5.1.1 Characteristics Sign Languages use the hand gesture, facial expression and body pose to convey information. There are two types of static and dynamic hand gestures. Static hand gestures are formed by various shapes and orientations of hands without any motion involved. Dynamic hand gestures are formed by a sequence of hand postures with associated motion information Adithya & Rajesh (2020). In American sign language, all letters except ”j” and ”z” are represented by static hand gestures. Figure 5.1 illustrates ASL fingerspelling alphabet. Figure 5.1: American Sign Language Alphabet Sign Language Club (2012) 5.1.2 Challenges Among character-level, word-level, and sentence-level SLR, character-level is the easiest to deal with since it does not involve any motion, and inputs are in the form of images rather than videos. Also, American fingerspelling is single-handed, which removes the difficulties of hands occluding 50 one another. Still, character-level SLR is a challenging problem. Lighting conditions, complex background, signee body postures, and camera position can pose significant challenges on the task. Also, there are only subtle differences between different letters. For example, letters a, e, m, n, s, and t are all represented by a closed fist and differ only by the thumb position Pugeault & Bowden (2011). The thumb itself is also barely visible in m and n (figure 5.1). Moreover, signees can have their singing styles which leads to variability in the dataset. 5.1.3 Related Work Rioux-Maldague & Giguere (2014) applied Deep Belief Network on the American Sign Language fingerspelling dataset. They leveraged depth and intensity images obtained from the Microsoft Kinect sensor. The network consists of 3 Restricted Boltzmann Machines with the size of 1500, 700, and 400 units and one translation layer. Their architecture was capable of real-time sign classification, and it was also adaptive to any environment or lighting intensity. They were able to obtain 99% accuracy on multi-user with all known users and 77% accuracy on multi-user with unseen users. Li et al. (2015) performed feature learning using a sparse autoencoder (SAE) and principal component analysis on ASL fingerspelling by Pugeault & Bowden (2011). An auto-encoder is a type of feed-forward neural network, under the unsupervised setting, whose output is required to be equal to the input. First, they performed dimension reduction on RGB and depth images using PCA and fed the results into SAE with CNNs. Then, the learned features from both channels have been dimensionality reduced further via PCA and then are concatenated and fed into a softmax classifier. The proposed feature-learning model obtained a recognition rate of 99.05%, outperforming state- of-the-art models at the time. Huang et al. (2015) also adopted a three-layer Deep Belief Network, consisting of 500, 500, and 2000 hidden units. They collected two datasets containing 26 alphabet signs using Real-Sense and Kinect. Real-Sense is a camera device that can detect and track the location of hands. They recorded video clips from 3 different signers. Then, they extracted frames from recorded video 51 Figure 5.2: The character-level sign recognizer architecture proposed by Li et al. (2015) clips and collected 65,000 frame images in total. Each image is then converted to a 66-dimensional feature vector, including the 3D coordinates of 22 finger points. 
For the Kinect dataset, they produced the same amount of data. However, Kinect does not support a hands or fingers model. Instead, they made use of color and depth information and then segmented hand-shape from the background. The resolution of hand-shape images was 32 × 32. Real-Sense and Kinect DNN were able to achieve 97.8% and 98.9% accuracy, respectively. Aly et al. (2016) used an architecture called PCANet to extract local features from depth and intensity images using an unsupervised deep learning method on Arabic SL fingerspelling. The learned features are then fed into a linear support vector machine classifier. They were able to obtain an average accuracy of 99.5%. Oyedotun & Khashman (2017) also adopted three different sized convolutional neural networks and three stacked denoising autoencoders (SDAEs) on 24 hand gestures of the Thomas Moeslunds gesture database. They studied CNNs with different depth sizes of 2, 3, and 4 and SDAEs with 1 to 4 hidden layers. 5.2 Proposed Models To investigate character-level Sign Language Recognition, we study two different datasets with uniform and complex backgrounds. The Sign Language MNIST, which is patterned to match with the classic MNIST, includes images with uniform background, and the ASL Fingerspelling A 52 Figure 5.3: Feature extraction using PCANet Aly et al. (2016) Figure 5.4: The character-level sign recognizer architecture proposed by Aly et al. (2016) dataset has images with complex background. To examine grayscale Sign Language MNIST, we propose an 8-layers Convolutional neural network. To study ASL Fingerspelling A datasets, we propose two different approaches, one with using prior knowledge and one without. 53 Figure 5.5: Successive binary depth images character ’G’ Rioux-Maldague & Giguere (2014) Figure 5.6: The character-level sign recognizer architecture proposed by Rioux-Maldague & Giguere (2014) Figure 5.7: Unsegmented hand gesture samples Oyedotun & Khashman (2017) 5.2.1 Model Overview To analyze and infer on Sign Language MNIST, we proposed an eight-layer convolutional network. The first six are alternating convolutional and max-pooling layers, and the last two are fully- 54 Figure 5.8: Finger joints capture by Real-Sense Huang et al. (2015) connected layers. The first convolutional layer filters the 28 × 28 input image with 128 kernels of size 5 × 5 with a stride of 1. The second convolutional layer takes as input the response-normalized and pooled output of the first convolutional layer and filters it with 64 kernels of size 2 × 2. The third convolutional layer has 32 kernels of size 2 × 2, and finally, the fully-connected layers have 512 and 24 neurons, respectively. To examine the ASL Fingerspelling A dataset, we propose two different approaches. In the first approach, raw data is fed to VGG-19 pre-trained with ImageNet dataset Deng et al. (2009) and the model is fine-tuned. VGG-19 is a convolutional neural network that is 19 layers deep Simonyan & Zisserman (2015). In the second approach, we create new black and white skeleton images (see figure 5.9) from extracted 2D landmarks of the hands using MediaPipe and feed them to the same network proposed for the Sign Language MNIST. 5.2.2 Model Details A detailed description of the proposed model to train Sign Language MNIST and ASL Fingerspelling A Skeleton has been listed in table 5.1. To train ASL Fingerspelling A raw data we perform transfer learning. 
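For illustration, a minimal Keras sketch of the proposed eight-layer convolutional network of section 5.2.1 (referred to as SConv below), with layer sizes following the description above and table 5.1, is given here; the 'same' padding and the compile settings are assumptions.

```python
import tensorflow as tf

def build_sconv(num_classes=24):
    """Illustrative eight-layer network: three conv/max-pool pairs plus two dense layers."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28, 1)),
        tf.keras.layers.Conv2D(128, 5, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(2),                     # 14 x 14 x 128
        tf.keras.layers.Conv2D(64, 2, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(2),                     # 7 x 7 x 64
        tf.keras.layers.Conv2D(32, 2, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(2),                     # 3 x 3 x 32
        tf.keras.layers.Flatten(),                           # 288 features
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

model = build_sconv()
model.compile(optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"])
```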
To fine-tune VGG-19 (figure 5.10) we discard the fully-connected layer heads and add our classifier layers of size 512, 512 & 24. We call the first proposed model ’SConv’ and the second model ’SVGG’ for the sake of easy referring. 55 Figure 5.9: ASL alphabet skeleton created from 2D coordinates of the hands using MediaPipe Figure 5.10: VGG-19 architecture Jaworek-Korjakowska et al. (2019) 5.3 Dataset The Sign Language MNIST multi-signer Sign Language MNIST (2018), is a dataset for character-level sign language recognition created from the American Sign Language letter database Sign Language and Static gesture (2018) and patterned to match closely with the classic MNIST. This dataset contains 27455 images for training and 7172 images for testing. Each image is 28 × 28 with grayscale value and includes hands-only. Sign Language MNIST represents a multi-class problem with 24 classes of letters (excluding J and Z, which require motion) The second dataset is the American fingerspelling A dataset which contains 24 letters of the ASL alphabet excluding the letters “j” and “z” Rastgoo et al. (2018). The images of this set are 56 Table 5.1: Network used to train Sign Language MNIST and ASL Fingerspelling A Skeleton Input dimension 28 × 28 Conv2D(128, 5,’relu’) 28 × 28 × 128 Maxpooling2D(2) 14 × 14 × 128 Conv2D(64, 2,’relu’) 14 × 14 × 64 Maxpooling2D(2) 7 × 7 × 64 Conv2D(32, 2,’relu’) 7 × 7 × 32 Maxpooling2D(2) 3 × 3 × 32 Flatten 288 Dense (’relu’) 512 Dense (’softmax’) 24 captured in five different sessions, with five different users in similar lighting conditions in the presence of complex background objects. 5.4 Experimental Setup MediaPipe requires packets (any data structure with a timestamp) to be fed into the graph in order to work. To make use of MediaPipe, we used FFmpeg to create videos from images in each class. FFmpeg is an open-source project consisting of a vast software suite of libraries and programs for handling video, audio, and other multimedia files and streams. We then used generated 2D landmarks of the hands to create new black and white skeleton images. The skeleton images are constructed such that the center of the hand (point 9 in figure 4.1) is centered in the middle of the image, that is, both vertically and horizontally centered. 5.5 Experimental Results The proposed models are evaluated by accuracy and confusion matrix. The confusion matrix is a specific table layout that allows visualization of the performance of the classification model by class. SConv achieved 99.9% accuracy on Sign Language MNIST for seen signers (figures 5.11). There is almost no confusion between similar-looking hand-shapes. The overall performance of SConv and SVGG models trained on ASL Fingerspelling A Skeleton for seen users across 7 participants are recorded in figures 5.12. 57 Figure 5.11: Training & Validation Accuracy for the SConv model on Sign Language MNIST dataset. Figure 5.12: Training & Validation Accuracy of theSVGG (left) & SConv (right) models on Sign Language MNIST dataset. Confusion between similar-looking hand-shapes, such as ’m’ and ’n’, and also ’s’, ’t’ and ’e’ are noticeable in the confusion matrices (figures5.13). This is expected as these letters are signed as a fist that only differs by the thumb position, which is barely visible in some cases. 58 Figure 5.13: Confusion Matrix of SConv model on ASL Fingerspelling A datasets. 5.6 Discussion With our presented novel model, we are able to classify 24 characters with 98% accuracy. 
By creating black and white skeleton images using Mediapipe from raw data, we overcome lighting conditions, complex background, and camera position challenges. Experimental results indicate that our model achieves substantial improvements over mainstream methods in terms of training speed. The SConv model starts to converge at 100 epochs, and SVGG model starts to converge at 150 epochs (figure 5.12). Given its promising performance, as there are approximately 40 common handshapes used in ASL, including the 24 static letters, we can use this model along with its weights as a base for the word-level classification task. 5.7 Summary In this chapter, we experimented with different datasets with uniform and complex backgrounds. To examine grayscale Sign Language MNIST, we proposed an 8-layers Convolutional neural net- work. To study ASL Fingerspelling A datasets, we investigated two different approaches. In the 59 first approach, we use VGG19 pre-trained with ImageNet and fine-tune it. In the second approach, we use MediaPipe to extract 2D landmarks of the hand, and then create the skeleton of the hands with uniform background and then feed them to the same 8-layer architecture proposed for Sign Language MNIST. The experiment results demonstrated the effectiveness of our approach on both datasets. We achieve 99.9% accuracy for the first dataset and 96% & 98% accuracy for the second dataset. 60 CHAPTER 6 WORD-LEVEL SLR To investigate word-level SLR, we present two deep dynamic sign language recognition frameworks. In the first framework, to minimize prior knowledge, we feed raw frames into an 18-layers LRCN. LRCN is a model in which a CNN as a feature extractor is applied to each frame before feeding them into an LSTM. In the second framework, we leverage domain knowledge of sign languages to extract the key characteristics of each frame. We first extract 2D landmarks of the hands and then encode them in terms of handshape, hand movement, and hand location. These information are then fed to a Multi-Cue network. We evaluate our proposed models on 52 vocabularies ASLLVD. Each word is played by six signers, and each signer plays each word once. We perform an excessive search on model hyper-parameters such as the number of feature maps, input size, batch size, sequence length, LSTM memory cell, regularization, and dropout. The implementation of the models is publicly accessible through Akandeh (2021). 6.1 Introduction Words in sign languages are mostly expressed by movements of the hands. Some words do not incorporate motion; however, here, we merely refer to dynamic word-level SLR. In the previous chapter, character-level SLR has been studied. In character-level SLR, inputs are in the form of images. In this chapter, we study Dynamic Sign Language Recognition (DSLR), in which inputs are in the form of videos. Videos have an additional temporal dimension, they are much larger in size, and they may contain different numbers of frames. Although SLR can be viewed as a video classification task, it is much more challenging. SLR includes a large number of classes. Also, in classifying signs, more attention is required in hand regions than the general video classification task. Moreover, one word can be signed differently in different dialects. For example, in ASL, there are over 100 different ways to sign “Birthday” and “Picnic”. There exist many text-to-sign features that translate English text into ASL signs. 
However, we lack a technology that allows us 61 to look up corresponding written forms of a given sign. 6.1.1 Characteristics ASL is a complete and complex language that mainly employs signs made by moving the hands Hoza (2007). Each sign is characterized by manual elements such as shape, movement, and location of the hands. Many ASL signs involve very subtle differences in their three characteristics. For example, the main difference between two signs “small” and “big” is hand movement. Sign “small” involves hands moving toward each other, and sign “big” involves hands moving away from each other. Changing the location of a sign can also completely change the meaning of a sign even though the other parameters are the same. For example, the only parameter change between “summer”, “ugly”, and “dry” is the hand location. A competent SLR framework should be able to take into account those subtle differences and accurately learn the distinguishable information. 6.1.2 Challenges Word-level SLR shares challenges associated with character-level SLR. Environmental factors such as lighting sensitivity, complex background, occlusion, signee body postures, and camera position are issues that need to be considered at word-level as well. Moreover, in word-level SLR, there is input length variation caused by signing speed. Also, there are a large number of vocabularies that need to be learned. Furthermore, in word-level SLR, publicly available datasets are limited both in quantity and quality despite character-level SLR. 6.1.3 Related Work To investigate word-level ASL, Huang et al. (2015) proposed an eight layers 3D convolutional neural network. The first four layers were convolution layers followed by sub-sampling and then followed by two fully-connected layers. Color information, depth clue, and body joint positions were fed into the network. They generated their own dataset of 25 vocabularies using Kinect. Each word is played by nine signers, and each signer played each word three times. They stacked 62 image frames to form a cube for the five types of features (R, G, B, depth, and skeleton) to feed the network. Fifty kernels of the size of 7 × 7 × 5 were adopted in the convolution layers. They were able to achieve 94% classification result. Figure 6.1: Color, depth and skeleton images Huang et al. (2015) Figure 6.2: The word-level sign recognizer architecture proposed by Huang et al. (2015) Liu et al. (2016) proposed a seven layers LSTM-based sign language recognition system. They discarded color image and depth information and fed 3D coordinates of hands and elbows into the network. The Kinect sensor camera provides joints skeleton trajectories. The input is a 12- dimensional feature vector comprised of four 3D coordinates of two hands and two elbows. The next layer is an LSTM layer with 512 dimensions. Two fully connected layers with dimensions 512 and 100 come afterward. They evaluated their model on 100 isolated Chinese sign and 500 sign words. The first dataset was performed by five signers who each gestured five times, resulting 63 in 25,000 images. The second dataset was performed by 50 singers who each gestured five times, resulting in 125,000 images. They were able to achieve 64% and 86%, respectively. Figure 6.3: The word-level sign recognizer architecture proposed by Liu et al. (2016) Kumar et al. (2017) proposed a hybrid Hidden Markov Model and Bidirectional Long Short Term Memory on 7500 Indian Sign Language (ISL) gestures comprised of 50 different sign-words. 
The input was obtained using Leap Motion and the Kinect sensor. They fed horizontal and vertical movement of fingers and palm alongside depth images and were able to achieve 96.2% accuracy. 6.2 Proposed Models In this section, we discuss approaches we took to recognize dynamic isolated signs and propose two deep word-level sign language recognition frameworks. In the first framework, it is aimed to find a robust deep learning algorithm that requires minimum prior knowledge. In the second framework, domain knowledge of sign languages is leveraged to extract the key characteristics of each frame. The first model employs LRCN as its building block and the second model is a multi-cue-based network. We call the first proposed model RSign (R corresponds to raw input) and the second model Multi-Cue Sign (MCSign). 64 Figure 6.4: The word-level sign recognizer architecture proposed by Kumar et al. (2017) 6.2.1 Model Overview RSign is the sign language application of the LRCN network proposed by Donahue et al. (2017). LRCN is an end-to-end trainable network in which convolutional layers and long-range temporal recursion are combined (figure 2.8). MCSign, on the other hand, aims to capture and model the three main characteristics associated with each sign. In the first layer, a temporal sequence of 2D landmarks of hands and 2D landmarks of upper-body are captured during the signing. In the second layer, the critical characteristics of the signs, including shape, movement, and location of hands, are modeled. A new 2D black and white image (see figure 6.7) is created out of 2D coordinates of the skeleton joints of hands to model the hands’ shape. Hands movement is encoded by spatial displacement of the palm center between two consecutive frames. A one-hot vector corresponding to the relative location of hands to the head is also created to encode the hands’ location. In the third level, newly created spatio-temporal trajectories are fed into a CONV-LSTM model. The one-hot vector corresponding to the hand 65 locations is also fed into an LSTM model. Finally, at the top layer, MCSign adopts a softmax classifier layer. Figure 6.5 provides an overview of the proposed architecture. Figure 6.5: MCSign Model Overview 6.2.2 Model Details In this section, a detailed description of two proposed models, namely, RSign and MCSign is provided. 6.2.2.1 RSign The high level architecture of LRCN proposed by Donahue et al. (2017) (figure 2.8) has been used as the building block of RSign model. The more detailed architecture used in this study is given in figure 6.6. Our architecture consists of a default block (blue box) and four repetitions of non-default blocks (red box). We used the Time-Distributed layer in Keras to apply feature extraction (default and non-default blocks) to every temporal slice of the input. Also, we benefited from the LSTM3 model to speed up the training process. We used the RMSprop optimizer, a batch size of 32, and the categorical cross-entropy loss function. The network specification and also a list of parameters that have been used are given in 66 Figure 6.6: RSign model overview table 6.1. It is observed that even a small sequence length is sufficient to perform well due to the continuous nature of sign language. It is also observed that smaller batch sizes perform better. Table 6.1: RSign Network specifications. 
Input dimension 200 × 200 × 3 1st layer of default block Conv2D(32, 7×7)-Relu nd 2 layer of default block Conv2D(32, 3×3)-Relu Number of feature maps in non-default blocks 128-64-128-64 LSTM memroy cell 256 Sequence length None Dropout 0.6 Regularization 0.001 6.2.2.2 MCSign To emphasize on crucial characteristics of the signs and inject domain-specific expert knowledge into the system, we propose to extract and model each characteristic as three input modalities (cues) to the network. Multi-hands tracking and Pose module in Mediapipe has been used to extract this representative information (see figure 4.1 for the 2D landmarks of hands tracked by Mediapipe). Having access to 21 2D landmarks of the hands, we create new black and white images representing handshape in each frame. Figure 6.7 illustrates how the ASL signs of the word “Christmas” is precisely captured by the temporal sequence of skeleton joints of the fingers. We also extract hand movement information of both hands as the spatial displacement of the palm center between two consecutive frames. Hand location is also encoded in a one-hot vector based on hand closeness to eyes, mouth, or chest. The extracted characteristics result in six trajectories that 67 capture handshape, movement, and location. Figure 6.8 illustrates the architecture of the proposed model. The “None” in this figure refers to the number of frames, which can be variable and depends on the signee speed or length of the word. Figure 6.7: The skeleton joints of ASL sign “Christmas” Figure 6.8: MCSign Model Architecture As shown in figure 6.8, our proposed model consists of seven main layers. Within the first layer, 68 a block of CONV-LSTM is embedded. This CONV-LSTM model is also LRCN-based in which a CNN model as a feature extractor is applied to each frame before feeding them into an LSTM. For the CNN part, we employ the SConv model that has been proposed in the character-level chapter. In the first layer, the new skeleton images of each hand are fed to the CONV-LSTM block. Hand movement and hand location are also fed into two separate LSTMs. The new representation of each characteristic is then concatenated in the first fusion layer and fed into BLSTM to obtain an integrated representation of each hand. Two high-level hand results are then fused and fed into another BLSTM. Finally, The results of this process are fed into a fully connected layer with a so f tmax nonlinearity that drives the final classification decision. See figure 6.9, for a detailed model summary. Figure 6.9: MCSign model summary 69 6.2.3 Design Choice To decide on the building block of both proposed frameworks, we started by employing architecture that has been used by other video classification tasks, namely LRCN, Features+LSTM, C3D, and CONVLSTM. Based on our experiments, LRCN submitted the best performance among all. Also, to decide on the input form of the first proposed framework, we fed raw frames, optical flow field images (figure 6.10), and also frames with emphasis on hands (6.11) using skin detection techniques. We believe that optical flow field images are too vague for the network. Since signees have worn clothes with different skin exposure levels, the network with input in the form of frames with emphasis on hands failed in giving proper results. Figure 6.10: Optical flow field sample Despite character-level SLR, in word-level SLR, publicly available datasets are limited both in quantity and quality. 
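For illustration, a rough functional-API sketch of the fusion structure in figure 6.8 is given below; the per-frame encoder is a small stand-in for the SConv block, and the feature dimensions, the four-way location encoding, and the 28 × 28 skeleton-image size are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def hand_branch(name):
    """One hand: skeleton-image, movement, and location cues fused by a BLSTM."""
    skel = tf.keras.Input(shape=(None, 28, 28, 1), name=f"{name}_skeleton")   # skeleton image per frame
    move = tf.keras.Input(shape=(None, 2), name=f"{name}_movement")           # palm-centre displacement per frame
    loc = tf.keras.Input(shape=(None, 4), name=f"{name}_location")            # one-hot hand location (e.g. eyes/mouth/chest/other)

    # CONV-LSTM block: a per-frame CNN encoder applied with TimeDistributed, then an LSTM.
    frame_encoder = tf.keras.Sequential([
        layers.Conv2D(32, 3, activation="relu"), layers.MaxPooling2D(2),
        layers.Conv2D(64, 3, activation="relu"), layers.GlobalAveragePooling2D(),
    ])
    shape_feat = layers.LSTM(64, return_sequences=True)(layers.TimeDistributed(frame_encoder)(skel))

    move_feat = layers.LSTM(32, return_sequences=True)(move)
    loc_feat = layers.LSTM(32, return_sequences=True)(loc)

    # First fusion layer: integrated per-hand representation.
    fused = layers.Concatenate()([shape_feat, move_feat, loc_feat])
    hand_feat = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(fused)
    return [skel, move, loc], hand_feat

left_inputs, left_feat = hand_branch("left")
right_inputs, right_feat = hand_branch("right")

# Second fusion layer: combine both hands, then classify the isolated sign.
both = layers.Bidirectional(layers.LSTM(128))(layers.Concatenate()([left_feat, right_feat]))
outputs = layers.Dense(52, activation="softmax")(both)       # 52 ASLLVD signs

mcsign = tf.keras.Model(left_inputs + right_inputs, outputs)
```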
Deep networks require large amounts of labeled training data to demonstrate satisfactory performance. A model with limited parameters cannot learn the problem, and a model with too many parameters can learn it too well. Both scenarios lead to a model that does not generalize well. The complexity of the architecture is a controlling factor in how well a model generalizes. If a model is under-fitted, in which training accuracy is not high enough, one could increase the complexity of the model by enlarging the architecture. If it is overfitted, in which training accuracy is high enough, but testing accuracy is not, one may decrease the complexity. Recently, to reduce generalization error, researchers use a large model along with regularization to 70 Figure 6.11: Frames with emphasis on hands put a constraint on weights and avoid overfitting. It has been shown that this approach not only reduces overfitting but also leads to faster optimization. 6.3 Dataset The American Sign Language Lexicon Video Dataset (ASLLVD) Neidle et al. (2012) has been collected at Boston University. It is captured by four synchronized cameras, providing a side view, a close-up, a half-speed front view, and a full-resolution front view (figure 6.12). It consists of more than 3300 ASL signs. Six native signees perform each sign once for a total of almost 9,800 tokens. Gloss labels and sign start and end times are also included in linguistic annotations. In this study, we perform classification on 52 dynamic signs. Figure 6.12: ASLLVD Neidle et al. (2012) 71 6.4 Experimental Setup After downloading and placing videos, FFmpeg was used to extract frames from videos. FFmpeg is an open-source project consisting of a vast software suite of libraries and programs for handling video, audio, and other multimedia files and streams. We also used OpenCV to obtain optical flow images. OpenCV is a library of programming functions mainly aimed at computer vision tasks. The skin detection algorithm is also implemented in OpenCV. Also, since the number of samples per class was limited, we enlarged the dataset size by means of augmentation to a total of 5 per sample. To implement the CNN block of the CONV-LSTM network, we employed the SConv model that has been proposed in the character-level chapter with TimeDistributed layer in Keras. The only difference was to perform batch-normalization and then apply relu nonlinearity after each convolutional layer. As mentioned before, the MediaPipe palm detection model is invoked only when the landmark model can no longer identify hand presence. To increase the performance, we had to turn this feature off. There are approximately 40 common handshapes used in ASL, including the 24 static letters of the alphabet 6.13. Before training the proposed model on ASLLVD, we first created a dataset of those handshapes, enlarged the quantity through data augmentation, and then trained the shape- network part of the proposed model on these static hand shapes. Also, to train the shape part of the network on 40 common hand shapes, we initialized it with character-level model (introduced in the previous chapter) weights. 6.5 Experimental Results We used leave-one-subject-out cross-validation to examine the generalization capability of the models across different subjects by selecting five signees for training and left one for evaluation. Later we decided to change the protocol and test it across participants on unseen samples. 
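As a sketch of the leave-one-subject-out protocol mentioned above, one fold per signer can be generated as follows; `load_asllvd_samples` and `build_mcsign` are hypothetical helpers standing in for the actual data pipeline and model constructor.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# X: per-sample inputs, y: sign labels, groups: signer id for each sample.
# load_asllvd_samples() is a hypothetical helper returning these arrays.
X, y, groups = load_asllvd_samples()

logo = LeaveOneGroupOut()
scores = []
for train_idx, test_idx in logo.split(X, y, groups):
    model = build_mcsign()                          # hypothetical constructor for the MCSign network
    model.fit(X[train_idx], y[train_idx], epochs=50, batch_size=8, verbose=0)
    _, acc = model.evaluate(X[test_idx], y[test_idx], verbose=0)
    scores.append(acc)                              # accuracy on the held-out signer

print("mean accuracy over held-out signers:", np.mean(scores))
```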
Overall, MCSing and RSign achieved an average accuracy of 91% and 84% on translating 52 signs across 6 72 Figure 6.13: 40 common hand shapes used in ASL participants. This indicates that our proposed models are capable of capturing the key characteristics of the signs. Figures 6.14 and 6.15 show the training and validation accuracy across 6 participants for both proposed models. Figure 6.14: Training & Validation Accuracy for the MCSign model on ASLLV dataset. 73 Figure 6.15: Training & Validation Accuracy for the RSign model on ASLLV dataset. 6.6 Discussion With our presented end-to-end model, we are able to classify 52 sign words with 91% accuracy. By creating black and white skeleton images using Mediapipe from raw data, we not only overcome lighting conditions, complex background, and camera position challenges but also create a way to easily combine data from variant sources and hence combat limited data resources in the SLR field. Since the proposed network is insensitive to dataset type, combining different datasets boosts the performance of the network tremendously. 6.7 Summary In this chapter, we proposed two solutions to word-level SLR. In the first approach, no prior knowledge has been leveraged. Raw frames are fed into an 18-layer LRCN. LRCN is a model in which a CNN as a feature extractor is applied to each frame before feeding them into an LSTM. In the second approach, three main characteristics (hand shape, hand position, and hand movement information) associated with each sign have been extracted using Mediapipe. 2D landmarks of 74 handshape have been used to create the skeleton of the hands and then are fed to a CONV-LSTM block. Hand locations and hand positions as relative distance to head are fed to separate LSTMs. All three sources of information have been then integrated into a Multi-Cue Network. We evaluated the performance of proposed models on 52 vocabularies ASLLVD. Performing an excessive search on model hyper-parameters such as the number of feature maps, input size, batch size, sequence length, LSTM memory cell, regularization, and dropout, we were able to obtain promising results in this domain. 75 CHAPTER 7 SENTENCE-LEVEL SLR This chapter presents two deep continuous sign language recognition frameworks that map videos of sign language sentences to sequences of gloss labels. Connectionist Temporal Classification (CTC) has been used as the classifier level of both models. CTC is used to avoid pre-segmenting the sentences into individual words. The first model is an LRCN-based model, and the second model is a Multi-Cue network. LRCN is a model in which a CNN as a feature extractor is applied to each frame before feeding them into an LSTM. In the first framework, to minimize prior knowledge, we feed raw frames into an 18-layer LRCN with a CTC on top. In the second framework, we leverage domain knowledge of sign languages to extract the key characteristics of each frame. The features are then fed into a Multi-Cue model with a CTC classification layer. We evaluate proposed models on RWTH-PHOENIX-Weather multi-signer. This dataset contains 5672 sentences in German sign language for training with 65,227 signs and 799,006 frames in total. We perform an excessive search on model hyper-parameters such as the number of feature maps, input size, batch size, sequence length, LSTM memory cell, regularization, and dropout. The implementation of the models is publicly accessible through Akandeh (2021). 
7.1 Introduction In previous chapters, character-level and word-level sign language recognition have been studied. In character-level SLR, inputs are in the form of images, and in word-level SLR, inputs are in the form of videos. In this chapter, we study continuous sign language recognition (CSLR) in which temporal boundaries of the words in the sentences are not defined. Most existing sign language recognition systems can only recognize a single sign at a time and thus requires users to pause between signs, which is not very practical in daily-life interaction. To address this problem, the correspondence between video sequence and sign gloss sequence needs to be learned. Glossing corresponds to mapping signs word-for-word to another written language. Glosses differ from 76 translation as they only denote the meaning of each part in a sign language sentence and do not necessarily construct a grammatically correct sentence in the written language. In this chapter, we do not address linguistic structures and grammar unique to sign language. 7.1.1 Characteristics As mentioned in the previous chapter, each sign is characterized by manual elements such as shape, movement, and location of the hands. To structure sentences, non-manual elements like eye gaze, mouth shape, facial expression, and body pose are also involved. Facial expressions are used to prevent confusion or misunderstandings. The visible mouth shapes can add information to the meaning of a sign and make it distinguishable from a semantically related sign. Some ASL signs have a permanent mouth morpheme as part of their production. For example, the ASL word NOT-YET requires a mouth morpheme (TH), whereas LATE has no mouth morpheme. These two are the same signs but with a different non-manual signal. In American Sign Language, eye gazing serves a variety of functions. It can regulate turn-taking and mark constituent boundaries. Eye gazing is also frequently used to repair or monitor utterances and to direct the addressee’s attention. It is also engaged in indexing and in expressing object and subject agreement and definiteness versus indefiniteness (Thompson (2006)). In this chapter, non-manual elements, despite their importance, have not been addressed. In the next chapter, we have made some suggestions and introduced some references on how to involve those essential factors in recognition systems. 7.1.2 Challenges As mentioned in previous chapters, there are many challenges associated with the SLR task. Publicly available datasets are limited both in quantity and quality. Environmental factors such as lighting sensitivity, complex background, occlusion, signee body postures, and camera position are also challenging issues. In terms of sign linguistics, there are subtle differences between different signs, and there are a large number of vocabularies that need to be learned. Specifically, 77 in sentence-level SLR, there is no word alignment, and temporal boundaries of a specific word are not clear. Also, there is sentence length variation caused by the number of the word or by signing speed. Moreover, signs are context-dependent and, coarticulation in which sign is affected by the preceding or following signs also plays an important role. All these factors together make sentence-level SLR a very challenging task. 7.1.3 Related Work Koller et al. (2016a) employed a pre-trained 22-layer CNN model within an iterative EM algorithm on a sequence of data. 
The algorithm iteratively refined the frame-level annotation and subsequent training of the CNN. Three thousand manually labeled hand shape images of 60 different classes were employed to train the model. Figure 7.1: The sentence-level sign recognizer architecture proposed by Koller et al. (2016a) Koller et al. (2016b) embedded a CNN into an HMM. The outputs of the CNN were treated as Bayesian posteriors, and they trained the system in an end-to-end fashion. Their model was very similar to the LRCN mentioned earlier, which combined CNN and LSTM. They were able to improve the state-of-the-art accuracy on three challenging continuous sign language benchmarks (SIGNUM , RWTH-PHOENIX-Weather2012 , RWTH-PHOENIX-Weather Multi-signer ) by 15%, 38% and 13.3% respectively. Cui et al. (2017) used a CNN-LSTM model to learn the mapping of sign sequences to sequences of the gestures on RWTH-PHOENIX-Weather Multi-signer and achieved 61.3% accuracy. Their 78 Figure 7.2: The sentence-level sign recognizer architecture proposed by Koller et al. (2016b) architecture consists of a CNN with temporal convolution and spatial pooling, a bidirectional LSTM for global sequence learning, and a detection network. The detection network combines the temporal convolution operations on the spatio-temporal features. According to the authors, this combination acts like a sliding window along with the sign sequences. To perform Sign2Gloss task (SLR), Camgoz et al. (2020) utilized a pre-trained Inception model as spatial embeddings in a CNN+LSTM+HMM setup. They extracted frame level representations from sign videos and trained two-layered sign language transformers to learn CSLR (continuous sign language recognition) and SLT(sign language translation) jointly in an end-to-end manner. Niu & Mak (2020) also proposed to use the transformer encoder as the contextual model for CSLR to fine-tune the lower-level visual feature extractor during model training. To improve model robustness, they proposed dropping video frames stochastically (SFD) and randomly stopping the gradients of some frames during training (SGS). 7.2 Proposed Models We present the development and implementation of two deep sentence-level sign language recognition frameworks. In the first framework, to minimize prior knowledge, raw frames are fed 79 to a deep network. In the second framework, sign language linguistics is leveraged to extract the key characteristics of each frame. In both models, we adopt a probabilistic framework based on Connectionist Temporal Classification to overcome pre-segmentation requirements posed by the sequence learning problem associated with sentence-level SLR. We call the first proposed model RSign-C (R corresponds to raw input and C corresponds to CTC), and the second model Multi-Cue Sign-C (MCSign-C). 7.2.1 Model Overview In this section, we provide an overview of two proposed models, namely, RSign-C and MCSign-C. These models are based on models that have been proposed in the previous chapter for word-level SLR, except we modify the so f tmax layer, which calculates the probability of a single sign and adds a CTC layer that calculates the probabilities of a sequence of signs. To have a standalone chapter, we re-present the architecture of both models. RSign-C is an 18-layers LRCN with CTC layer on top. In the LRCN network, a CNN model as a feature extractor is applied to each frame before feeding them into an LSTM (2.8). 
MCSign-C, on the other hand, aims to capture and model the three main characteristics associ- ated with each sign. In the first layer, a temporal sequence of 2D landmarks of hands, as well as 2D landmarks of the upper-body, are captured during the signing. In the second layer, the critical characteristics of the signs, including shape, movement, and location of hands, are modeled. A new 2D black and white image (see figure 7.3) is created out of 2D coordinates of the skeleton joints of hands to model the handshapes. Hands movement is encoded by spatial displacement of the palm center between two consecutive frames. A one-hot vector corresponding to the relative location of hands to the head is also created to encode the hands’ location. In the third level, newly created spatio-temporal trajectories are fed into a CONV-LSTM model. The one-hot vector corresponding to the hand locations is also fed into an LSTM model. Finally, at the top layer, MCSign-C adopts a CTC-based approach to output sequences of gloss labels. Figure 7.4 provides an overview of the proposed architecture. 80 Figure 7.3: A 2D black and white image sequence created from PHOENIX dataset Figure 7.4: MCSign-C model overview 7.2.2 Model Details This section provides a detailed description of two proposed models, namely, RSign-C and MCSign- C. 7.2.2.1 RSign-C We modify the RSign model that has been proposed in the previous chapter such that it is capable of computing the probabilities of a sequence of signs. LRCN proposed by Donahue et al. (2017) (figure 2.8) has been used as the building block of RSign model. To produce a probability distribution over all labels at each time step, we modify the final LSTM to return all sequence corresponds 81 to each hidden state, then apply a so f tmax function and finally add a CTC layer. The more detailed architecture used in this study is given in figure 7.5. As mentioned earlier, we used the Time-Distributed layer in Keras to apply feature extraction to every temporal slice of the input. Figure 7.5: RSign-C model overview 7.2.2.2 MCSign-C To emphasize on crucial characteristics of the signs and inject domain-specific expert knowledge into the system, we propose to extract and model each characteristic as three input modalities (cues) to the network. Multi-hands tracking and pose modules in Mediapipe have been used to extract this representative information. (see figure 4.1 for the 2D landmarks of hands tracked by Mediapipe). Having access to 21 2D landmarks of the hands, we create new black and white images representing handshape in each frame. We also extract hand movement information of both hands as the spatial displacement of the palm center between two consecutive frames. Hand location is also encoded in a one-hot vector based on hand closeness to eyes, mouth, or chest. The extracted characteristics result in six trajectories that capture information related to each handshape, movement, and location. Figure 7.6 illustrates the architecture of the proposed model. As shown in figure 7.6, our proposed model consists of seven main layers. Within the first layer, a block of CONV-LSTM is embedded. This CONV-LSTM model is also LRCN-based in which a CNN model as a feature extractor is applied to each frame before feeding them into an LSTM. For the CNN part, we employ the SConv model that has been proposed in the character-level chapter. In the first layer, the new skeleton images of each hand are fed to the CONV-LSTM block. 
Figure 7.6 illustrates the architecture of the proposed model. As shown in figure 7.6, our proposed model consists of seven main layers. Within the first layer, a block of CONV-LSTM is embedded. This CONV-LSTM model is also LRCN-based, in which a CNN model is applied as a feature extractor to each frame before the frames are fed into an LSTM. For the CNN part, we employ the SConv model proposed in the character-level chapter. In the first layer, the new skeleton images of each hand are fed to the CONV-LSTM block. Hand movement and hand location are also fed into two separate LSTMs. The new representation of each characteristic is then concatenated in the first fusion layer and fed into a BLSTM to obtain an integrated representation of each hand. The two high-level hand representations are then fused and fed into another BLSTM. Finally, the result of this process is fed into a fully connected layer with a CTC loss that drives the final classification decision. See figure 7.7 for a detailed model summary. The "None" in this figure refers to the number of frames, which is variable and depends on the signee's speed and the length of the sentence.

Figure 7.6: MCSign-C architecture

In the CTC approach, the probabilities of all possible sentences formed by the words in the output domain are computed. This approach not only eliminates the need for word pre-segmentation and post-processing but also handles variable-length sequences Fang et al. (2017). To obtain the final gloss labels, as discussed in chapter 2, section 4, adjacent duplicates and all blank symbols in the inferred label sequence are removed.

Figure 7.7: MCSign-C model summary

7.2.3 Design Choice

Many different architectures have been proposed to address the SLR task. Here, we present the main reasons behind our specific design choices.

In the MCSign-C model, we leverage sign language linguistics to extract trajectories of representative information from the frame sequences during signing and develop models based on the extracted information. Handshape information of both the left and right hands is encoded by zero-centering the palm center of each hand and then scaling the given normalized 2D coordinates of the skeleton joints. Since there was no definitive way to lay out the coordinates, we decided to leave this decision to the network and proposed to regenerate each frame from the 2D landmarks. This method also handles the hand-size differences caused by varying distance to the camera.

As many signs share similar characteristics at the beginning of their trajectories, a bidirectional LSTM model has been adopted to avoid the confusion a traditional unidirectional LSTM would face (also suggested by Fang et al. (2017)). A bidirectional LSTM performs inference based on both past and future information and computes the hidden state sequence by combining the output sequences of LSTMs iterating forwards and backwards.

To perform sentence-level SLR, current technologies employ a framework that requires pre-segmenting the individual words within a sentence. This restriction requires users to pause between adjacent signs when signing sentences. To address this problem, some researchers have proposed sign boundary detection to detect when one sign ends and the next one begins. Attention mechanisms (encoder-decoder networks) and Connectionist Temporal Classification (CTC) have also been proposed by other researchers to compute the probability of the whole sentence directly. Since CTC is the key technique behind modern automatic speech recognition systems such as Amazon Alexa and Apple Siri Fang et al. (2017), we propose to adopt a framework based on CTC, which removes the requirement of pre-segmentation and can easily be built on top of MCSign.
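As a concrete illustration of the CTC decoding rule described in section 7.2.2.2 (merge adjacent duplicates, then drop blanks), a minimal greedy best-path decoder is sketched below. It assumes the Keras convention that the blank symbol is the last class; the function name is illustrative. In the experiments reported later, beam search is used instead of this greedy rule, which Keras exposes through tf.keras.backend.ctc_decode(..., greedy=False, beam_width=...).

    import numpy as np

    def greedy_ctc_decode(frame_probs):
        """Collapse per-frame gloss probabilities of shape (T, num_glosses + 1) into a
        gloss-label sequence: arg-max per frame, merge adjacent duplicates, drop blanks."""
        blank = frame_probs.shape[-1] - 1      # blank assumed to be the last class (Keras convention)
        best_path = np.argmax(frame_probs, axis=-1)
        decoded, prev = [], None
        for label in best_path:
            if label != prev and label != blank:
                decoded.append(int(label))
            prev = label
        return decoded

    # e.g. a best path [blank, 7, 7, blank, 3, 3] collapses to the gloss sequence [7, 3].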
7.3 Dataset

The RWTH-PHOENIX-Weather multi-signer corpus, Forster et al. (2012), is a dataset for continuous sign language recognition. It contains 5672 German sign language training sentences with 65,227 signs and 799,006 frames in total. The videos are performed by nine signers. Table 7.1 summarizes the setup statistics of RWTH-PHOENIX-Weather (2012).

Table 7.1: RWTH-PHOENIX Setup Statistics

                                     Glosses    German
  Train
    number of sentences              2612       2612
    number of running words          20713      26585
    vocabulary size                  768        1389
    singletons/vocabulary size       32.4%      36.4%
  Development
    number of sentences              250        250
    number of running words          2573       3293
    out-of-vocabulary words          1.4%       1.9%
  Test
    number of sentences              228        228
    number of running words          2163       2980
    out-of-vocabulary words          1.0%       1.5%

7.4 Experimental Setup

To implement the CNN block of the CONV-LSTM network, we employed the SConv model proposed in the character-level chapter, wrapped in a TimeDistributed layer in Keras. The only difference is that batch normalization is performed and a ReLU nonlinearity is applied after each convolutional layer. As mentioned before, in the CTC layer a probability distribution over all labels is predicted at each time step. This can be implemented by feeding a 2D input (e.g., the output of an LSTM layer with return_sequences set to true) to a softmax function and then employing the CTC loss function. To infer a likely output, the beam search algorithm has been used, in which only a predetermined number of best partial hypotheses (the beam size) are kept as candidates. Determining the optimum beam size requires error analysis: if the probability of the true label y* for a given input is lower than that of the generated output ỹ, i.e., p(y*) < p(ỹ), then the network is at fault and more training is required; if p(y*) > p(ỹ), then the beam search is at fault and the beam size needs to be increased.

7.5 Experimental Results

To evaluate the performance of the two proposed models, we use word error rate (WER) as the evaluation metric. WER is a standard metric of the performance of a sentence-level translation or speech recognition system and is computed as the minimum number of word insertions, substitutions, and deletions required to get from the reference to the hypothesis, divided by the number of words in the reference.
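A minimal sketch of this metric is given below: the standard word-level edit-distance dynamic program, normalized by the reference length. The example strings at the end are purely illustrative glosses.

    def word_error_rate(reference, hypothesis):
        """WER = (substitutions + insertions + deletions) / number of reference words."""
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j] = edit distance between the first i reference words and the first j hypothesis words
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,          # deletion
                              d[i][j - 1] + 1,          # insertion
                              d[i - 1][j - 1] + cost)   # substitution or match
        return d[len(ref)][len(hyp)] / max(len(ref), 1)

    # word_error_rate("REGEN IM NORDEN", "REGEN NORDEN") -> 1/3 (one deleted word)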
To further investigate the Slim LSTM models proposed in chapter 3, we also run multiple experiments with MCSign-C-Slim models and compare them against the baseline MCSign-C.

7.5.1 MCSign-C & RSign-C

MCSign-C achieves an average word error rate of 35.2% on the Dev sentences and 35.3% on the Test sentences. Table 7.2 lists other approaches and the WER they report on the RWTH-PHOENIX-Weather dataset.

Table 7.2: Comparison between various approaches on the RWTH-PHOENIX dataset

  Method                                  Dev (WER)   Test (WER)
  CMLLR Koller et al. (2015)              55.0        53.0
  1-Mio-Hands Koller et al. (2016b)       45.1        47.1
  CNN-Hybrid Koller et al. (2016b)        38.3        38.8
  SubUNets Camgoz et al. (2017)           40.8        40.7
  Staged-Opt Cui et al. (2017)            39.4        38.7
  Re-sign Koller et al. (2017)            27.1        26.8
  LS-HAN Huang et al. (2018)              −           38.3
  Dilated Pu et al. (2018)                38.0        37.3
  Hybrid CNN-HMM Koller et al. (2018)     31.6        32.5
  IAN Pu et al. (2019)                    37.1        36.7
  DenseTCN Guo et al. (2019)              35.9        36.5
  DNF Cui et al. (2019)                   23.8        24.4
  DNF Cui et al. (2019)                   23.1        22.9
  CNN-LSTM-HMM Koller et al. (2020)       26.0        26.0
  RSign                                   45.1        45.1
  MCSign                                  35.2        35.3

7.5.2 MCSign-C-Slim

Ten Slim LSTM models were introduced and investigated in chapter 3. Due to their promising results, we decided to evaluate them further on the RWTH-PHOENIX-Weather dataset. We train the MCSign-C-Slim model with three configurations, LSTM1, LSTM3, and LSTM6, and compare their performance to the baseline.

To train each model, we perform parameter selection for the learning rate, the number of epochs, and the beam width. For all three models, we only replaced the LSTM units with the corresponding Slim variants. The networks were trained with the RMSprop optimizer and a batch size of 32. We first investigated the effect of the beam width: it was varied from 1 to 6 and then set to the optimum value of 4, so that only the 4 most probable hypotheses are stored. Similarly, the number of training epochs was varied from 100 to 1000. We then investigated how LSTM1, LSTM3, and LSTM6 affect the WER. Figure 7.8 shows that the Slim models and the baseline model have nearly the same word error rates, which indicates that the proposed models, although lightweight, do not harm performance.

Figure 7.8: MCSign-C-Slim models comparison (WER on the Development Set versus training epoch for MCSign-C, MCSign-C-Slim1, MCSign-C-Slim3, and MCSign-C-Slim6)

In Table 7.3, we list the time per epoch for each model. As the table indicates, the Slim models provide significant improvements over the baseline model in terms of training cost. The models are implemented with the TensorFlow Keras deep-learning platform and trained on an Nvidia K80 with 4.1 TFLOPS as the GPU resource (Google Colab setting).

Table 7.3: Time per epoch for each MCSign-C-Slim model

  Model             Time per Epoch (s)
  MCSign-C-Slim1    36
  MCSign-C-Slim3    30
  MCSign-C-Slim6    29
  MCSign-C          40

7.6 Discussion

With the presented end-to-end model, we are able to achieve roughly 35% WER on the RWTH-PHOENIX-Weather dataset. The model's somewhat lower accuracy compared with the best reported systems is compensated by greater robustness to environmental circumstances, such as lighting. By creating black-and-white skeleton images from the raw data using MediaPipe, we not only overcome the challenges of lighting conditions, complex backgrounds, and camera position, but also create a way to easily combine data from different sources and hence combat the limited data resources in the SLR field. Since the proposed network is insensitive to dataset type, combining different datasets can boost the network's performance considerably. Experimental results also indicate that our model achieves substantial improvements over mainstream methods in terms of training speed.

7.7 Summary

In this chapter, we proposed two solutions to sentence-level SLR, which requires mapping videos of sign language sentences to sequences of gloss labels. Connectionist Temporal Classification (CTC) has been used as the classifier level of both models to avoid pre-segmenting the sentences into individual words. The first model is an LRCN-based model, and the second model is a Multi-Cue network. LRCN is a model in which a CNN is applied as a feature extractor to each frame before the frames are fed into an LSTM. In the first approach, no prior knowledge has been leveraged: raw frames are fed into an 18-layer LRCN with a CTC layer on top. In the second approach, three main characteristics associated with each sign (handshape, hand location, and hand movement) have been extracted using MediaPipe. The 2D landmarks of each hand have been used to create skeleton images, which are fed to a CONV-LSTM model, while hand movements and hand locations (as relative distance to the head) are fed to separate LSTMs. All three sources of information have then been integrated into a Multi-Cue network with a CTC classification layer. We evaluated the performance of the proposed models on RWTH-PHOENIX-Weather and were able to achieve roughly 35% WER.
CHAPTER 8

FUTURE ROADMAP

Although this study aimed to address many aspects of SLR that computer scientists have overlooked, significant gaps remain. In particular, non-manual elements, including eye gaze, mouth shape, and facial expression, require much more attention in SLR. The Face Mesh and Iris tracking modules in MediaPipe can be beneficial for pursuing this direction. To extend the SLR task, one may also look into SLT. SLT differs from SLR in that the latter merely detects a sequence of signs without taking into account the linguistic structures and grammar unique to sign language Yin (2020). There have been limited studies on mapping glosses to spoken language. To map the detected glosses into a proper sentence in the target language, Neural Machine Translation (NMT) has been introduced. An encoder-decoder architecture, also known as a sequence-to-sequence model, is primarily employed in recent NMT approaches. However, sequence-to-sequence networks are unable to model long-term dependencies in long input sentences. To address this issue, attention mechanisms were introduced Bahdanau et al. (2015). The Transformer Vaswani et al. (2017) is an encoder-decoder network in which self-attention layers are used in place of recurrent networks. The next phase in translating sign language to written or spoken language can be piping the output of an SLR system into such a sequence-to-sequence model. RWTH-PHOENIX-Weather is currently the only publicly available dataset with both gloss labels and spoken language translations.

BIBLIOGRAPHY

Adithya, V & R Rajesh. 2020. A deep convolutional neural network approach for static hand gesture recognition. Procedia Computer Science 171. 2353–2361.
Akandeh, Atra. 2019. Slim lstm models. https://github.com/atrakriv/Slim-LSTMs.
Akandeh, Atra. 2021. Sign language recognition. https://github.com/atrakriv/SLR.
Aly, Saleh, Basma Osman, Walaa Aly & Mahmoud Saber. 2016. Arabic sign language fingerspelling recognition from depth and intensity images. In 2016 12th international computer engineering conference (icenco), vol. 2016, 104.
Bahdanau, Dzmitry, Kyunghyun Cho & Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Iclr 2015: International conference on learning representations 2015.
Bengio, Y., A. Courville & P. Vincent. 2013. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(8). 1798–1828.
Bengio, Yoshua. 2009. Learning deep architectures for ai.
Bengio, Yoshua, Pascal Lamblin, Dan Popovici & Hugo Larochelle. 2006. Greedy layer-wise training of deep networks. In Advances in neural information processing systems 19, 153–160.
Boulanger-Lewandowski, Nicolas, Yoshua Bengio & Pascal Vincent. 2012. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. arXiv preprint arXiv:1206.6392.
Bragg, Danielle, Oscar Koller, Mary Bellard, Larwan Berke, Patrick Boudreault, Annelies Braffort, Naomi Caselli, Matt Huenerfauth, Hernisa Kacorri, Tessa Verhoef, Christian Vogler & Meredith Ringel Morris. 2019. Sign language recognition, generation, and translation: An interdisciplinary perspective. In The 21st international acm sigaccess conference on computers and accessibility, 16–31.
Camgoz, Necati Cihan, Simon Hadfield, Oscar Koller & Richard Bowden. 2017. Subunets: End-to-end hand shape and continuous sign language recognition. In 2017 ieee international conference on computer vision (iccv), 3075–3084.
Camgoz, Necati Cihan, Oscar Koller, Simon Hadfield & Richard Bowden. 2020. Sign language transformers: Joint end-to-end sign language recognition and translation. In 2020 ieee/cvf conference on computer vision and pattern recognition (cvpr), 10023–10033.
Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho & Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
Cui, Runpeng, Hu Liu & Changshui Zhang. 2017. Recurrent convolutional neural networks for continuous sign language recognition by staged optimization. In 2017 ieee conference on computer vision and pattern recognition (cvpr), 1610–1618.
Cui, Runpeng, Hu Liu & Changshui Zhang. 2019. A deep neural framework for continuous sign language recognition by iterative training. IEEE Transactions on Multimedia 21(7). 1880–1891.
Deng, J., W. Dong, R. Socher, L.-J. Li, K. Li & L. Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In Cvpr09.
Donahue, Jeff, Lisa Anne Hendricks, Marcus Rohrbach, Subhashini Venugopalan, Sergio Guadarrama, Kate Saenko & Trevor Darrell. 2017. Long-Term Recurrent Convolutional Networks for Visual Recognition and Description. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(4). 677–691.
Fang, Biyi, Jillian Co & Mi Zhang. 2017. Deepasl: Enabling ubiquitous and non-intrusive word and sentence-level sign language translation. In Proceedings of the 15th acm conference on embedded network sensor systems, 5.
Forster, Jens, Christoph Schmidt, Thomas Hoyoux, Oscar Koller, Uwe Zelle, Justus Piater & Hermann Ney. 2012. Rwth-phoenix-weather: A large vocabulary sign language recognition and translation corpus. In Proceedings of the eighth international conference on language resources and evaluation (lrec-2012), 3785–3789.
Gaidon, Adrien, Zaid Harchaoui & Cordelia Schmid. 2013. Temporal localization of actions with actoms. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(11). 2782–2795.
Goodfellow, Ian, Yoshua Bengio & Aaron Courville. 2016. Deep learning. http://www.deeplearningbook.org.
Graves, Alex, Santiago Fernandez, Faustino Gomez & Jurgen Schmidhuber. 2006. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on machine learning, 369–376.
Greff, Klaus, Rupesh K. Srivastava, Jan Koutnik, Bas R. Steunebrink & Jurgen Schmidhuber. 2017. Lstm: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems 28(10). 2222–2232.
Guo, Dan, Shuo Wang, Qi Tian & Meng Wang. 2019. Dense temporal convolution network for sign language translation. In Proceedings of the twenty-eighth international joint conference on artificial intelligence, 744–750.
He, Kaiming, Xiangyu Zhang, Shaoqing Ren & Jian Sun. 2015. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385.
Hinton, Geoffrey E., Simon Osindero & Yee-Whye Teh. 2006. A fast learning algorithm for deep belief nets. Neural Computation 18(7). 1527–1554.
Hinton, Geoffrey E., Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever & Ruslan Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
Hoza, Jack. 2007. It's not what you sign, it's how you sign it: Politeness in american sign language.
Huang, Jie, Wengang Zhou, Houqiang Li & Weiping Li. 2015. Sign language recognition using real-sense. In 2015 ieee china summit and international conference on signal and information processing (chinasip), 166–170.
Huang, Jie, Wengang Zhou, Qilin Zhang, Houqiang Li & Weiping Li. 2018. Video-based sign language recognition without temporal segmentation. In Aaai, 2257–2264.
Ioffe, Sergey & Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd international conference on machine learning, 448–456.
Jaworek-Korjakowska, Joanna, Pawel Kleczek & Marek Gorgon. 2019. Melanoma thickness prediction based on convolutional neural network with vgg-19 model transfer learning. In 2019 ieee/cvf conference on computer vision and pattern recognition workshops (cvprw), 0–0.
Jia, Yangqing, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross B. Girshick, Sergio Guadarrama & Trevor Darrell. 2014. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd acm international conference on multimedia, 675–678.
Johnson, Melvin, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda B. Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes & Jeffrey Dean. 2016. Google's multilingual neural machine translation system: Enabling zero-shot translation. http://arxiv.org/abs/1611.04558.
Karpathy, Andrej, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar & Li Fei-Fei. 2014. Large-scale video classification with convolutional neural networks. IEEE Conference on Computer Vision and Pattern Recognition 1725–1732.
Koller, Oscar, Necati Cihan Camgoz, Hermann Ney & Richard Bowden. 2020. Weakly supervised learning with multi-stream cnn-lstm-hmms to discover sequential parallelism in sign language videos. IEEE Transactions on Pattern Analysis and Machine Intelligence 42(9). 2306–2320.
Koller, Oscar, Jens Forster & Hermann Ney. 2015. Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers. Computer Vision and Image Understanding 141. 108–125.
Koller, Oscar, Hermann Ney & Richard Bowden. 2016a. Deep hand: How to train a cnn on 1 million hand images when your data is continuous and weakly labelled. In 2016 ieee conference on computer vision and pattern recognition (cvpr), 3793–3802.
Koller, Oscar, Sepehr Zargaran & Hermann Ney. 2017. Re-sign: Re-aligned end-to-end sequence modelling with deep recurrent cnn-hmms. In 2017 ieee conference on computer vision and pattern recognition (cvpr), 3416–3424.
Koller, Oscar, Sepehr Zargaran, Hermann Ney & Richard Bowden. 2018. Deep sign: Enabling robust statistical continuous sign language recognition via hybrid cnn-hmms. International Journal of Computer Vision 126(12). 1311–1325.
Koller, Oscar Tobias Anatol, Sepehr Zargaran, Hermann Ney & Richard Bowden. 2016b. Deep sign: Hybrid cnn-hmm for continuous sign language recognition. In British machine vision conference 2016.
Krizhevsky, Alex, Ilya Sutskever & Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. Neural Information Processing Systems.
Kuehne, H., H. Jhuang, E. Garrote, T. Poggio & T. Serre. 2011. Hmdb: A large video database for human motion recognition. In 2011 international conference on computer vision, 2556–2563.
Kumar, Pradeep, Himaanshu Gauba, Partha Pratim Roy & Debi Prosad Dogra. 2017. A multimodal framework for sensor based sign language recognition. Neurocomputing 259. 21–38.
Larochelle, Hugo, Yoshua Bengio, Jerome Louradour & Pascal Lamblin. 2009. Exploring strategies for training deep neural networks. Journal of Machine Learning Research 10. 1–40.
LeCun, Yann, Patrick Haffner, Leon Bottou & Yoshua Bengio. 1999. Object recognition with gradient-based learning. Shape, Contour and Grouping in Computer Vision 319–345.
Lee, Honglak, Chaitanya Ekanadham & Andrew Y. Ng. 2007. Sparse deep belief net model for visual area v2. In Advances in neural information processing systems 20, 873–880.
Lee, Tai Sing & David Bryant Mumford. 2003. Hierarchical bayesian inference in the visual cortex. Journal of The Optical Society of America A-optics Image Science and Vision 20(7). 1434–1448.
Li, Shao-Zi, Bin Yu, Wei Wu, Song-Zhi Su & Rong-Rong Ji. 2015. Feature learning based on sae pca network for human gesture recognition in rgbd images. Neurocomputing 151. 565–573.
Liu, Tao, Wengang Zhou & Houqiang Li. 2016. Sign language recognition with long short-term memory. In 2016 ieee international conference on image processing (icip), 2871–2875.
Lu, Y. & F. Salem. 2017. Simplified gating in long short-term memory (lstm) recurrent neural networks. arXiv:1701.03441.
MediaPipe. 2019. MediaPipe Dev. https://mediapipe.dev.
Neidle, Carol, Ashwin Thangali & Stan Sclaroff. 2012. Challenges in development of the american sign language lexicon video dataset (asllvd) corpus.
Ng, Joe Yue-Hei, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga & George Toderici. 2015. Beyond short snippets: Deep networks for video classification. In 2015 ieee conference on computer vision and pattern recognition (cvpr), 4694–4702.
Niu, Zhe & Brian Mak. 2020. Stochastic fine-grained labeling of multi-state sign glosses for continuous sign language recognition. In European conference on computer vision, 172–186.
Oyedotun, Oyebade K. & Adnan Khashman. 2017. Deep learning in vision-based static hand gesture recognition. Neural Computing and Applications 28(12). 3941–3951.
Pu, Junfu, Wengang Zhou & Houqiang Li. 2018. Dilated convolutional network with iterative optimization for continuous sign language recognition. In Ijcai'18 proceedings of the 27th international joint conference on artificial intelligence, 885–891.
Pu, Junfu, Wengang Zhou & Houqiang Li. 2019. Iterative alignment network for continuous sign language recognition. In 2019 ieee/cvf conference on computer vision and pattern recognition (cvpr), 4165–4174.
Pugeault, Nicolas & Richard Bowden. 2011. Spelling it out: Real-time asl fingerspelling recognition. In 2011 ieee international conference on computer vision workshops (iccv workshops), 1114–1119.
Ranzato, Marc'aurelio, Christopher Poultney, Sumit Chopra & Yann L. Cun. 2006. Efficient learning of sparse representations with an energy-based model. In Advances in neural information processing systems 19, 1137–1144.
Rastgoo, Razieh, Kourosh Kiani & Sergio Escalera. 2018. Multi-modal deep hand sign language recognition in still images using restricted boltzmann machine. Entropy 20(11). 809.
Rioux-Maldague, Lucas & Philippe Giguere. 2014. Sign language fingerspelling classification from depth and color images using a deep belief network. In 2014 canadian conference on computer and robot vision, 92–97.
Rosenblatt, F. 1958. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review 65(6). 386–408.
RWTH-PHOENIX-Weather. 2012. The RWTH-PHOENIX-Weather Database of German Sign Language. https://www-i6.informatik.rwth-aachen.de/ forster/database-rwth-phoenix.php.
Salem, Fathi M. 2016. A basic recurrent neural network model. arXiv preprint arXiv:1612.09022.
Salem, Fathi M. 2018. Slim lstms. arXiv: Neural and Evolutionary Computing.
Salem, Fathi M. 7.11.2016. Reduced parameterization of gated recurrent neural networks. MSU Memorandum.
Sign Language and Static gesture. 2018. Sign Language and Static-Gesture Recognition using scikit-learn. https://github.com/mon95/Sign-Language-and-Static-gesture-recognition-using-sklearn.
Sign Language Club. 2012. American Sign Language Alphabet. https://www.cchsvoice.org/join-sign-language-club/.
Sign Language MNIST. 2018. Drop-In Replacement for MNIST for Hand Gesture Recognition Tasks. https://www.kaggle.com/datamunge/sign-language-mnist.
Simonyan, Karen & Andrew Zisserman. 2014. Two-Stream Convolutional Networks for Action Recognition in Videos. Neural Information Processing Systems 27. 568–576.
Simonyan, Karen & Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In Iclr 2015: International conference on learning representations 2015.
Soomro, Khurram, Amir Roshan Zamir & Mubarak Shah. 2012. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke & Andrew Rabinovich. 2014. Going deeper with convolutions. arXiv preprint arXiv:1409.4842.
Thompson, Robin L. 2006. Eye gaze in american sign language: linguistic functions for verbs and pronoun: dissertation.
Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani & Manohar Paluri. 2015. Learning Spatiotemporal Features with 3D Convolutional Networks. IEEE International Conference on Computer Vision 4489–4497.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser & Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st international conference on neural information processing systems, vol. 30, 5998–6008.
Vitruvian Man. 1490. The proportions of the human body according to Vitruvius. https://en.wikipedia.org/wiki/Vitruvian_Man.
Wang, Limin, Yu Qiao & Xiaoou Tang. 2014. Latent hierarchical model of temporal structure for complex activity classification. IEEE Transactions on Image Processing 23(2). 810–822.
Wang, Limin, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang & Luc van Gool. 2016. Temporal segment networks: Towards good practices for deep action recognition. In European conference on computer vision, vol. 9912 LNCS, 20–36.
Yin, Kayo. 2020. Sign language translation with transformers. arXiv preprint arXiv:2004.00588.
Zaremba, Wojciech. 2015. An empirical exploration of recurrent network architectures.
Zeiler, Matthew D. & Rob Fergus. 2014. Visualizing and understanding convolutional networks. In 13th european conference on computer vision, eccv 2014, 818–833.
Zhu, Yi, Zhenzhong Lan, Shawn Newsam & Alexander G. Hauptmann. 2017. Hidden two-stream convolutional networks for action recognition. arXiv preprint arXiv:1704.00389.