LARYNGEAL MECHANISMS AND VOCAL FOLDS FUNCTION IN ADDUCTOR LARYNGEAL DYSTONIA DURING CONNECTED SPEECH

By

Ahmed Yousef

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Communicative Sciences and Disorders – Doctor of Philosophy
Mechanical Engineering – Dual Major

2023

ABSTRACT

Adductor laryngeal dystonia (AdLD) is a neurological voice disorder that disrupts laryngeal muscle control during running speech. Diagnosis of AdLD is challenging because of the limited scientific consensus on accurate diagnostic criteria and because AdLD can mimic the voice features of other voice disorders. Laryngeal high-speed videoendoscopy (HSV), a powerful tool for capturing detailed vocal fold (VF) vibrations, has rarely been used to study AdLD, and its use has been limited to sustained phonation rather than the connected speech in which AdLD symptoms manifest. The present dissertation aims to address this gap in the literature using HSV and to provide, for the first time, quantitative analyses of the impaired vocal function in AdLD during connected speech. To accomplish this, HSV recordings were collected from vocally normal adults and AdLD patients during connected speech. Five studies were implemented to analyze and extract clinically relevant information from these recordings.

The first study investigated the differences between AdLD patients and normal controls by evaluating the durations of running speech in HSV over which the VFs were visually obstructed by excessive movements of laryngeal tissues. To facilitate these analyses, a deep learning tool was developed to automatically classify HSV frames by detecting visual obstructions of the VF image. The second study provided a new image segmentation tool for detecting VF edges during running speech in HSV. This tool was developed using a unique combination of the active contour modeling method and a machine-learning-based method (k-means clustering) to segment VF edges in HSV kymograms. The third study developed a quantitative representation of VF dynamics in AdLD during running speech using HSV. A deep learning technique, built on the tool developed in the second study, was used to segment the glottal area/edges and extract the glottal area waveform from the HSV recordings for analysis. The fourth study analyzed the pathological vocal function of AdLD during phonation onset and offset in connected speech using HSV. An automated approach was developed, and validated against manual analysis, to measure and compare the glottal attack and offset times between the AdLD group and normal controls. The fifth study presented a one-mass lumped model that can estimate the glottal area waveform and biomechanical characteristics of the VFs based on HSV data.

The results of the first study showed accurate detection of visual obstruction in the VF frames, facilitating the study of laryngeal activities in AdLD. The findings revealed that the AdLD group exhibited longer durations of obstruction, making this measure a potential candidate for AdLD assessment. Also, identifying the parts of connected speech that provide an unobstructed view of the VFs allows for the development of optimal passages for precise HSV examination and disorder-specific clinical voice assessment protocols. The second and third studies demonstrated promising performance of the proposed automated tools for detecting VF edges and analyzing glottal area waveforms.
These accurate techniques overcame the challenges involved in HSV analysis, including the poor image quality during running speech and the excessive laryngeal maneuvers of AdLD. Future research should benefit from these newly developed automated tools for HSV analysis of VF vibrations in running speech to explore diagnostically relevant information in both vocally normal adults and AdLD patients. The findings of the fourth study showed that the developed automated technique measured the glottal attack and offset times accurately. The measurements revealed significantly longer attack times in AdLD and greater variability of the attack and offset times in AdLD, attributable to the irregularity of the VF vibratory behavior in this disorder. The results of this study also agreed with previous findings in the literature. Accordingly, glottal attack time might be a compelling measure of AdLD severity, which can be investigated further in the future using the developed tool with larger sample sizes, and even for different voice disorders. Obtaining such measures in running speech opens up new lines of research to explore their clinical significance and to address the diagnostic challenges in AdLD. In the last study, on modeling, the results showed the successful optimization of the developed one-mass model to closely capture the characteristics of VF vibrations observed in the HSV running speech sample. The study uncovered the potential of this simplified model to estimate biomechanical properties of the VFs non-invasively with minimal computational cost, paving the way for future research to utilize this model for analyzing connected speech samples and studying the impaired VF dynamics in AdLD.

This dissertation is dedicated to my beloved wife, Noura, my unwavering source of happiness, motivation, and strength. Her limitless support and patience during the pandemic's challenges that kept us apart for years were exceptional. Without her by my side, completing this PhD would not have been possible. I would also like to dedicate this dissertation to my Mum, Laila, my Dad, Dr. Mokhtar Yousef, my brother, Mohamed, my sister, Mai, and all my family members for their unconditional support, care, and belief in me.

ACKNOWLEDGEMENTS

I would like to express my deep appreciation to the individuals who have made unique contributions to my academic growth. Above all, I am profoundly indebted to my PhD advisor, Dr. Maryam Naghibolhosseini, who offered unwavering support and invaluable guidance throughout my PhD training and made my shift from engineering to science smooth. I truly appreciate the countless hours she invested in offering enlightening research ideas, reviewing my work, providing thought-provoking feedback, and encouraging me to excel. Without her patience, dedication, and exceptional mentorship, this research would not have been possible. I would also like to express my sincere gratitude to Dr. Mohsen Zayernouri for his significant contribution to this dissertation. His in-depth knowledge and extensive expertise in Mechanical Engineering have considerably improved the quality of this research. I am genuinely thankful for the opportunity to work alongside such a knowledgeable mentor. I would like to extend my utmost appreciation to Dr. Dimitar Deliyski for his consistent support throughout my PhD journey and his commitment to providing an excellent research environment for my professional development. Furthermore, I am wholeheartedly grateful to my committee members, Dr.
Eric Hunter and Dr. Jeff Searl, for their valuable guidance and persistent help throughout my PhD.

TABLE OF CONTENTS

LIST OF SYMBOLS AND ABBREVIATIONS
CHAPTER 1: INTRODUCTION
1.1. Voice Production and Assessment
1.2. Adductor Laryngeal Dystonia (AdLD)
1.3. High-speed Videoendoscopy (HSV)
1.4. Biomechanical Characteristics of Vocal Folds in AdLD
1.5. Automated HSV Analysis
1.6. Research Gaps, Questions and Hypotheses
1.7. Dissertation Structure
CHAPTER 2: METHODOLOGICAL APPROACH
2.1. Research Design
2.2. Study Subjects
2.3. Data Acquisition
2.4. Study I: Automated Detection of Vocal Fold Image Obstructions
2.5. Study II: Image Segmentation of Vocal Fold Edges
2.6. Study III: Deep-Learning-based Representation of Vocal Fold Dynamics
2.7. Study IV: Automated Measurements of Glottal Attack and Offset Time
2.8. Study V: Lumped-Element Modeling
CHAPTER 3: RESULTS
3.1. Study I: Automated Detection of Vocal Fold Image Obstructions
3.2. Study II: Image Segmentation of Vocal Fold Edges
3.3. Study III: Deep-Learning-based Representation of Vocal Fold Dynamics
3.4. Study IV: Automated Measurements of Glottal Attack and Offset Time
3.5. Study V: Lumped Modeling
CHAPTER 4: DISCUSSION
4.1. Study I: Automated Detection of Vocal Fold Image Obstructions
4.2. Study II: Image Segmentation of Vocal Fold Edges
4.3. Study III: Deep-Learning-based Representation of Vocal Fold Dynamics
4.4. Study IV: Automated Measurements of Glottal Attack and Offset Time
4.5. Study V: Lumped Modeling and Optimization of Vocal Fold Vibration
4.6. Limitations and Directions for Future Studies
CHAPTER 5: CONCLUSION
BIBLIOGRAPHY

LIST OF SYMBOLS AND ABBREVIATIONS

AdLD  Adductor Laryngeal Dystonia
AbLD  Abductor Laryngeal Dystonia
LD  Laryngeal Dystonia
MTD  Muscle Tension Dysphonia
ET  Essential Tremor
CAPE-V  Consensus Auditory Perceptual Evaluation of Voice
HSV  High-Speed Videoendoscopy
EGG  Electroglottography
VF  Vocal Fold
GAT  Glottal Attack Time
GOT  Glottal Offset Time
ACM  Active Contour Modeling
DNN  Deep Neural Networks
Q  Research Question
H  Hypothesis
GAW  Glottal Area Waveform
SLP  Speech-Language Pathologist
fps  Frames Per Second
CNN  Convolutional Neural Network
ReLU  Rectified Linear Unit
TP  True Positive
TN  True Negative
FP  False Positive
FN  False Negative
ck  K-Means Cluster Centroid
D  K-Means Euclidean Distance
I  Image Intensity
E  Active Contour Energy Function
Eint  Active Contour Internal Energy Function
Eimage  Active Contour External Image Energy Function
γ  Active Contour Elasticity Weight
β  Active Contour Rigidity Weight
∇I  Image Gradient
Kw  Kymogram Image Width
Kh  Kymogram Image Height
IoU  Intersection Over Union
DC  Dice Coefficient
F1  Boundary-F1 Score
ML  Machine Learning
m  Vocal Fold Mass
k  Vocal Fold Elasticity
c  Vocal Fold Damping Coefficient
AUC  Area Under The Curve
Ps  Subglottal Pressure
P1  Inlet Glottis Pressure
P2  Outlet Glottis Pressure
Qg  Glottal Air Flowrate
d  Vocal Fold Thickness
l  Vocal Fold Length
w  Vocal Fold Width
ℱ  Net Vocal Fold Force
F  External Vocal Fold Force
PB  Bernoulli Pressure
ρ  Air Density
µ  Coefficient of Air Viscosity
Ag  Glottal Area
Ag0  Initial Glottal Area
Xc  Critical Vocal Fold Displacement
P̄s  Typical Subglottal Pressure
PSmax  Maximum Built-Up Pressure
tc  Vocal Fold Closure Time
c'  Vocal Fold Closure Damping Coefficient
Δt  Time Step
K1  Initial Slope Estimate
K2  Second Slope Estimate
K3  Third Slope Estimate
K4  Fourth Slope Estimate
x0  Initial Displacement
V0  Initial Velocity
α  Scaling Factor
Obj  Objective Function
AModel  Simulated Glottal Area
AHSV  Experimental Glottal Area
q  Optimizing Parameters Vector
PSO  Particle Swarm Optimization
N  Number of Swarm Particles
J  Total Iteration Number
*  Optimum Value
vi  Swarm Particle's Velocity
qi  Swarm Particle's Position
pb  Best Swarm Particle Position
gb  Best Swarm Global Position
W  Swarm Particle Inertia Weight
Z1  Swarm Particle Cognitive Parameter
Z2  Swarm Particle Social Parameter

CHAPTER 1: INTRODUCTION

1.1. Voice Production and Assessment

Unraveling the mastery behind speech production has long been a desire of scientists. This desire emerged about a century ago [1], when scientists aimed to understand the governing physics of phonation and voice production. Human voice production works through an energy conversion: the aerodynamic energy generated by the lungs is converted into acoustic energy and sound in the vocal tract. This conversion happens when the vocal folds (VFs) vibrate and appropriately modulate the glottal airstream [2, 3]. Different theories have been proposed to better understand the voice production mechanisms and interpret the complex interaction between the aerodynamics of glottal airflow, the vibration of the VFs, and the acoustic output of the vocal tract [4, 5, 6, 2, 7].
One of the well-established theories is the Aerodynamic-Myoelastic Theory, which offers a foundation for understanding human voice production. It states that the vibratory motion of the VFs during phonation is produced by a combination of the aerodynamic forces of the airflow and the VF tissue dynamics [2, 3]. Other theories have been developed more recently, such as the nonlinear source-filter theory proposed by Titze [8]. The subglottal system below the larynx was defined as a sound source (source of energy), which helps sustain the VF vibration in phonation [9]. The vocal tract was considered as a filter that convolves with the source to generate the sound [10, 11, 12, 13]. Understanding these underlying mechanisms of voice production, and particularly the vibratory behavior of the VFs as a vital component of the larynx, helps in providing better healthcare, medical diagnosis, and treatment for individuals who suffer from voice problems and degraded voice quality. This can not only enhance individuals' quality of life and social well-being but also improve their work productivity and reduce healthcare costs. Therefore, several tools and methods have been developed to obtain a better assessment of the VFs and the overall voice quality.

One assessment approach is analyzing the output aerodynamic signal (glottal airflow) from the phonatory system. Several measures can be obtained from the change in the glottal airflow due to the vibratory abduction-adduction movement of the VFs [14]. Open quotient is an example of the aerodynamic measures and is defined as the portion of the vibratory cycle with an open glottis. A large open quotient corresponds to a breathier voice quality, whereas a small value is associated with a more pressed quality [15].

Analyzing the output acoustic signal is another method for voice assessment. This can be done through either objective acoustic measurements or subjective perceptual assessments. Acoustic measures are generated using signal processing methods and can provide a quantitative tool for the assessment of voice quality. These objective measures can be divided into three categories: (1) perturbation measures such as jitter and shimmer [16]; (2) noise measures such as signal-to-noise ratio and harmonics-to-noise ratio [17]; and (3) spectral/cepstral measures such as cepstral peak prominence and Mel-frequency cepstral coefficients [18]. (A minimal computational sketch of such cycle-level measures is given at the end of this subsection.) The second way of analyzing the acoustic signal is through auditory-perceptual evaluation, which is the most commonly used approach in voice clinics and depends mainly on the level of expertise of the evaluator. Evaluation tools have been developed as standardized scales in order to reduce possible variability and inconsistency in the perceptual evaluation of voice disorders. One of the most recent effective standard scales is the Consensus Auditory Perceptual Evaluation of Voice (CAPE-V) [19]. CAPE-V enables the analysis and assessment of different voice features, namely, severity, roughness, breathiness, strain, pitch, and loudness. The CAPE-V rating is done using a visual analog scale on a 100-mm line and has standard vocal tasks to assess voice quality. Additionally, the CAPE-V form includes an ordinal scale of mild, moderate, and severe to make a perceptual judgment of the voice. This rating procedure has been found to be consistent and reliable among raters [20].
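To make the cycle-level measures mentioned above concrete, the following is a minimal sketch, not the dissertation's implementation, of how local jitter, local shimmer, and open quotient could be computed once per-cycle periods, peak amplitudes, and open-phase durations have been extracted; all variable names and numbers are illustrative only.

```python
import numpy as np

def jitter_local(periods):
    """Mean absolute difference between consecutive cycle periods,
    normalized by the mean period (reported as a percentage)."""
    p = np.asarray(periods, dtype=float)
    return 100.0 * np.mean(np.abs(np.diff(p))) / np.mean(p)

def shimmer_local(amplitudes):
    """The same perturbation measure applied to per-cycle peak amplitudes."""
    a = np.asarray(amplitudes, dtype=float)
    return 100.0 * np.mean(np.abs(np.diff(a))) / np.mean(a)

def open_quotient(open_durations, periods):
    """Portion of each vibratory cycle during which the glottis is open."""
    return float(np.mean(np.asarray(open_durations) / np.asarray(periods)))

# Illustrative per-cycle data for a roughly 200 Hz phonation (5 ms periods)
periods = [5.0e-3, 5.1e-3, 4.9e-3, 5.0e-3, 5.2e-3]   # seconds
amps = [1.00, 0.97, 1.03, 0.99, 1.01]                # arbitrary units
open_dur = [3.0e-3, 3.1e-3, 2.9e-3, 3.0e-3, 3.1e-3]  # seconds
print(jitter_local(periods), shimmer_local(amps), open_quotient(open_dur, periods))
```

The hard part in practice is the upstream step, reliably delimiting the cycles; the quotient arithmetic itself is simple, as the sketch shows.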
Another assessment approach uses imaging techniques to directly visualize the activities and the different configurations of the larynx, and particularly the VFs, for a reliable assessment of the voice. The most common modalities for laryngeal imaging are electroglottography (EGG), videostroboscopy, and high-speed videoendoscopy (HSV). EGG is a voice assessment technique used to analyze the contact of the VFs at high sampling rates [21]. The EGG principle depends on the variation in electrical conductivity between tissue and air. That is, two or more electrodes are placed on the sides of the larynx, and a high-frequency, low-voltage electric current is fed between the electrodes. During VF vibration, the contact area between the VFs changes and, hence, the electrical impedance between the electrodes varies. This variation in impedance is reflected in the EGG output [21]. Several characteristic points of VF vibration can be obtained from EGG, such as the beginning of the opening of the upper VF margins, the complete closure of the lower margins, and full VF contact. Different measures can be generated based on these extracted characteristic points, such as the contact quotient (ratio between the contact phase and the total time of the vibratory cycle), open quotient (ratio between the open phase and the total time), and speed quotient (ratio between opening and closing time) [22]. As an example of how these measures relate to the assessment of voice quality, a breathy voice quality is associated with a smaller contact quotient [22].

The laryngeal imaging technique currently in wide clinical use for voice assessment is videostroboscopy [23, 24, 25]. Using an endoscope coupled with a stroboscopic light, video recordings of the laryngeal structures can be obtained, which allows visual assessment of laryngeal tissue health and VF vibrations [23, 24, 25]. Videostroboscopy can only capture stationary phonation events during periodic VF vibrations. Although videostroboscopy is used during connected speech, where most voice disorders reveal themselves, it can only capture gross laryngeal adjustments there. That is, the functional assessment of VF vibration using videostroboscopy is limited to sustained vocalizations [26, 27, 28, 29]. Due to the low sampling rate of the camera (resulting in low temporal resolution), videostroboscopy is incapable of capturing the cycle-to-cycle and intra-cycle details of VF vibration, which is critical when those vibrations are aperiodic, a common occurrence in voice disorders [30, 31]. The recent advancement of coupling flexible fiberoptic endoscopes with laryngeal high-speed cameras serves to overcome these limitations of stroboscopy by offering high recording frame rates (thousands of frames per second) and capturing the true VF vibrations [30, 31, 32, 33]. HSV allows the visualization and analysis of detailed pathological phonatory events in voice disorders during running speech [34, 35, 36, 37, 38, 39, 40], such as the true (cycle-to-cycle) VF oscillations [41, 42, 43, 44], phonation onsets and offsets [45, 46, 47, 48], voice breaks [36], and singing [33]. HSV will be revisited and discussed in detail later in this chapter. Researchers have been using these different assessment modalities and approaches to study and analyze different voice disorders.
Among these modalities, imaging techniques (particularly the advanced HSV) can provide accurate information regarding the underlying mechanisms of voice production, vocal function, and their dynamics. The high capabilities of imaging techniques allow researchers to study laryngeal dynamics and VF function in dysphonic voices [49, 50, 51, 32]. However, there is a large gap in the literature in terms of studying voice disorders using HSV, especially neurological voice disorders whose symptoms mostly appear during connected speech. One of these neurological voice disorders that has not been well documented in the literature is laryngeal dystonia (LD), which is discussed in detail in the following subsection.

1.2. Adductor Laryngeal Dystonia (AdLD)

LD is a neurogenic, chronic voice disorder that causes the intrinsic laryngeal muscles to contract, or spasm, involuntarily during phonation [52]. LD affects an estimated 1 per 100,000 people (with a prevalence of 35,000-50,000 cases in the United States) [53]. There is a female predominance (79% of the patients are women) [53]. The average age at onset of LD ranges from 40 to 50 years [54]. Patients typically report a sudden onset of symptoms, which gradually progress until becoming severe within a few months to a few years [55]. As a chronic voice disorder, LD affects patients' daily communication and leads to social isolation and occupational disability [56]. The etiology of LD remains elusive. However, recent studies have demonstrated some association between LD development and genetic, environmental, and familial factors [55, 57]. Although most LD cases are focal laryngeal dystonia [54], LD symptoms may appear in patients with other neuromuscular disorders; for example, around 25% of patients with essential tremor suffer from LD [58]. Other scholars have hypothesized that the pathophysiology of LD may arise from an increase in brain plasticity, sensory abnormalities, and reduced inhibition of intracortical processes [59]. LD is characterized as a task-specific dystonia: it occurs only during connected speech, and its severity depends on the demands of the vocal task [57]. It has been reported that LD signs are more likely to appear during connected speech than during sustained/prolonged vowels [60]. This is due to the increased motor complexity of running speech compared to sustained phonation, which provokes more severe laryngeal spasms and higher strain. The complexity stems from the rapid transitions during running speech, which require switching between voiced and unvoiced sounds, whereas no such transitions exist in sustained vowels [60].

LD is typically divided into three subtypes: adductor LD (AdLD), abductor LD (AbLD), and mixed. Patients with AdLD experience spasmodic overclosure of the VFs during phonation, particularly when the VFs are approximating, leading to excessive phonatory breaks and a strained voice quality with cessation of airflow [55]. In contrast, patients with AbLD exhibit excessive involuntary opening of the VFs during phonation, leading to a transient breathy voice quality with excessive escape of airflow [52, 61]. Additionally, some clinicians recognize patients who suffer from both conditions, mixed LD [61]. Since AdLD is the most common form of LD, accounting for 80% of all LD patients [55], it is the form investigated in this dissertation.

Diagnosis of AdLD is challenging because AdLD can coexist with other neuromuscular disorders that have similar voice symptoms [62].
Although the current diagnosis of AdLD mostly relies on auditory-perceptual features [63], other functional voice disorders such as muscle tension dysphonia (MTD) can mimic the voice characteristics of AdLD, resulting in diagnostic confusion [64]. MTD patients can have hypercontraction of the laryngeal muscles and, hence, a strained voice. Further, MTD cases may cough, cry, and sing normally, similar to AdLD [62]. There are no diagnostic criteria in current clinical practice to differentiate between AdLD and MTD, even for experienced clinicians. Given that the treatments of MTD and LD are completely different, misdiagnosis can lead to inappropriate or needless surgical or medical interventions [65]. Hence, researchers have tried to differentiate between the two disorders. In this regard, they found that AdLD is a "task-dependent" dystonia (less severe during sustained vowels than running speech), whereas MTD is not (equally severe regardless of the vocal task) [66]. Essential tremor (ET) might also be mistaken for AdLD, as more than 25% of ET patients suffer from laryngeal tremor. This is because laryngeal muscular tremors can mimic glottic stops as in AdLD. Misdiagnosis can be avoided, however, by noting that the tremor in ET is present during sustained phonation, while it is not in AdLD [62].

The challenges in diagnosing AdLD carry over to its treatment. The difficulty of differential diagnosis may lead to needless treatments and surgical interventions, for example, when LD is misdiagnosed as MTD, since the treatments of MTD and LD are completely different [67, 68]. The main treatment for AdLD is botulinum toxin injection into the affected muscle(s): thyroarytenoid, interarytenoid, and lateral cricoarytenoid [62]. The injection is effective and provides temporary relief; it is usually repeated every 3-4 months. Studies have shown that this treatment improves the acoustic/aerodynamic measurements and the voice quality of AdLD patients [69]. However, side effects such as incomplete glottic closure may occur after the injection. Another treatment option for AdLD is surgery, which includes denervation and reinnervation of the recurrent laryngeal nerve, recurrent laryngeal nerve sectioning, type II thyroplasty, and thyroarytenoid muscle neuromyectomy [70]. Voice therapy, provided by speech-language pathologists, is also considered a complementary treatment that can help mitigate AdLD symptoms. For example, some studies show that when voice therapy is provided after the Botox injection, it gives patients a longer period of alleviated voice symptoms before the injection needs to be repeated [71].

1.3. High-speed Videoendoscopy (HSV)

Voice production in AdLD has been studied using different assessment tools discussed earlier in this document, such as acoustic analysis [72, 73], fiberoptic laryngoscopy [74], and aerodynamic measurements [64]; yet the pathophysiology and differential diagnosis of AdLD are still not fully understood. Despite the use of these different assessment tools, the use of HSV has not been well investigated in the literature. Laryngeal imaging tools can be used to observe and diagnose the impaired voice production function in AdLD during connected speech. This is because AdLD symptoms mostly reveal themselves during running speech [75, 76, 77, 78]. HSV is a powerful tool that can offer high frame rates and temporal resolution [30, 31, 32, 33].
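To put these frame rates in perspective, consider an illustrative calculation (the rate and fundamental frequency are assumed for illustration, not taken from this dissertation's recording protocol). At an HSV rate of 4,000 frames per second and a phonation with a 200 Hz fundamental frequency,

frames per cycle = frame rate / fundamental frequency = 4,000 fps / 200 Hz = 20.

Twenty images per vibratory cycle are enough to resolve intra-cycle detail, whereas a conventional video camera at roughly 30 fps captures less than one frame per cycle and must rely on stroboscopic sampling across many cycles, which is only valid when the vibration is periodic.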
The main advantage of HSV resides in its ability to visualize both periodic and aperiodic VF movements that would otherwise not be observable with videostroboscopy [49, 41]. Such capability makes HSV viable for examining the variations between and within vibratory cycles of the VFs, which are associated with their aperiodic motions in AdLD during running speech [30]. This capability provides the opportunity to develop new tools to objectively analyze the entire vibratory cycles during phonation in AdLD. Hence, the potential of HSV has been investigated in previous studies as a promising tool to understand the underlying voice production mechanisms in dysphonic voices [49, 79, 32].

The clinical assessment of VF vibration using videoendoscopic images is commonly performed subjectively, with visual inspection of the data. However, employing efficient quantitative methods for voice analysis using HSV would be valuable for clinical voice examination. Extracting useful, quantitative measurements of the dynamic motion of the VFs in HSV recordings could allow clinicians to obtain clinically relevant characteristics of the VF oscillations during connected speech. The measurements and features of the VF vibrations during either sustained phonation or running speech can be extracted by visually analyzing HSV recordings. During sustained phonation (e.g., the production of the vowel /i/) and steady-state VF oscillations, features such as periodicity, VF symmetry, glottal closure, and information about the mucosal wave and its aggregation can be obtained from HSV [32]. In addition, HSV is a unique tool to study and analyze aperiodic speech and asymmetric vibrations of the VFs, which are common in voice disorders that cause perturbed periodicity in VF vibrations. This aperiodicity cannot be analyzed using videostroboscopy. HSV is a powerful approach to this problem, as it can be used to assess the most transient VF vibratory behaviors regardless of the periodicity of the vibrations [50]. Thus, visual information about phonatory breaks, laryngeal spasms, onset and offset of phonation, and any other laryngeal movements that involve rapid maneuvers can be obtained from HSV data for future clinical examinations.

However, using HSV in voice clinics remains a daunting task for clinicians, since the data obtained through HSV are difficult to analyze by visual inspection; a short HSV recording can yield thousands of frames needing to be assessed, which is a time-consuming process [80, 81]. Semiautomated and fully automated objective analysis and measurements of HSV can overcome this challenge. Some of these measures include the closed/open quotient, left-right phase and amplitude asymmetry, axis shifts during closure, period/glottal width irregularity, glottal attack time (GAT), and glottal offset time (GOT) [32, 82]. Among these objective measures, GAT and GOT are discussed here, as they are included in the present dissertation's analyses. GAT is closely related to the onset of VF vibration and sound generation, while GOT is associated with the offset of VF vibration and the end of a phonation. Both measures are critical factors for studying the pathophysiology of voice disorders. Previously, these measures were manually extracted through visual analysis of HSV data from vocally normal individuals and patients with neurogenic voice disorders (LD and unilateral VF paralysis) during connected speech [83].
This work included a small sample size but emphasized the importance of measuring GAT and GOT and how critical they are for the voice characterization of neurogenic voice disorders. However, these two measures were extracted manually by visual raters, a time-consuming process that limited the number of participants analyzed. This emphasizes the importance of obtaining these measures with automated techniques, which is undertaken in this dissertation.

1.4. Biomechanical Characteristics of Vocal Folds in AdLD

The previous section discussed the high capability of an advanced imaging technique, the HSV technology, and how it can be used to obtain a variety of useful clinical measures through analyzing the video recordings. In this section, a different set of indirect measures that can also be obtained from HSV is discussed. These indirect measures cannot be directly obtained from visually analyzing the HSV frames and videos, because they require a model to be coupled with the HSV analysis. They are closely associated with the biomechanical properties and behavior of VF movement, and they can be obtained by designing biomechanical models based on HSV. Examples of these indirect features include the VF masses, the stiffness and elasticity properties of the VF tissues, and information about the subglottal pressure. The main advantage of these models is that they can provide measures that cannot be directly extracted from HSV or any other traditional recording approach such as EGG [84, 85, 86, 87].

These indirect biomechanical parameters are essential to understanding the underlying physics of phonation [88, 89, 90, 91]. The parameters can be estimated with biomechanical models through an inverse problem, inferring the VF tissue properties non-invasively. Inverse problems for predicting the parameters of VF vibration during phonation were first introduced around 20 years ago [92]. In an inverse problem, the model parameters are optimized so that the model generates a behavior similar to the experimental observations obtained from, e.g., HSV video data. If the optimization succeeds, the model can infer the biomechanical parameters of the VFs [93]. Lumped-element models are commonly used in this inverse analysis [94] to obtain the biomechanical characteristics of the VFs, because they are simple and can simulate different VF vibration behaviors with a few parameters. These models are designed such that the VF tissues are described by a small number of discrete, rigid masses. The neighboring masses are coupled by springs and interact with the external aerodynamic and/or acoustical loading [95]. The model parameters are specified based on each model component: masses (VF tissue), springs (to impose the elasticity effect), and dampers (to damp the motion). These models can be designed in different configurations: one- [96, 97, 98, 99], two- [94], three- [100], and multi-mass models [96, 97, 98]. The main advantage of the one-mass model is its simple structure and minimal number of control parameters. Despite its simplicity, the model can still capture the characteristic vibratory features of the self-sustained oscillation of the VFs. These features make this simple model a compelling choice for reducing the computational cost, particularly in real-time tasks [96, 97, 98, 99].
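To make this model class concrete, a generic one-mass formulation (a sketch consistent with the symbols defined in the front matter, not necessarily the exact equations of Study V, which are given in Chapter 2) treats each VF as a single mass m restrained by a spring of stiffness k and a damper with coefficient c, driven by an external aerodynamic force F(t):

m·x″(t) + c·x′(t) + k·x(t) = F(t),

where x(t) is the lateral displacement of the mass. Under symmetric motion of the two folds, the glottal area follows as Ag(t) = Ag0 + 2·l·x(t), with l the VF length. The K1-K4 slope estimates in the symbol list are consistent with a fourth-order Runge-Kutta time integration of this equation; a minimal sketch, with illustrative parameter values only:

```python
import numpy as np

def rk4_one_mass(m, k, c, force, x0, v0, dt, n_steps):
    """Integrate m*x'' + c*x' + k*x = force(t) with classical RK4.

    k1..k4 below are the four slope estimates of the state (x, v)."""
    def deriv(x, v, t):
        return v, (force(t) - c * v - k * x) / m

    x, v = x0, v0
    xs = [x0]
    for i in range(n_steps):
        t = i * dt
        k1x, k1v = deriv(x, v, t)
        k2x, k2v = deriv(x + 0.5 * dt * k1x, v + 0.5 * dt * k1v, t + 0.5 * dt)
        k3x, k3v = deriv(x + 0.5 * dt * k2x, v + 0.5 * dt * k2v, t + 0.5 * dt)
        k4x, k4v = deriv(x + dt * k3x, v + dt * k3v, t + dt)
        x += dt / 6.0 * (k1x + 2 * k2x + 2 * k3x + k4x)
        v += dt / 6.0 * (k1v + 2 * k2v + 2 * k3v + k4v)
        xs.append(x)
    return np.array(xs)

# Illustrative values only: mass in kg, stiffness in N/m, damping in N*s/m
x = rk4_one_mass(m=1e-4, k=40.0, c=2e-3,
                 force=lambda t: 1e-2 * np.sin(2 * np.pi * 200 * t),
                 x0=0.0, v0=0.0, dt=1e-5, n_steps=2000)
```

In an inverse-analysis setting such as that of Study V, (m, k, c) and the driving pressure form the parameter vector q that is tuned (e.g., via PSO) so that the simulated area AModel matches the HSV-extracted area AHSV.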
The two-mass model is the most widely used model that can mimic the underlying mechanisms of VF dynamics, such as phase shifts of the upper and lower VF edges. This model has frequently been used in inverse analysis of sustained phonation, as it captures self-sustained vibrations, asymmetric VF vibrations, and nonlinear VF dynamics [101, 102].

Several modeling works have used HSV as experimental data to build models through inverse analysis in humans and predict different measures. One of these models was developed by Döllinger et al. [92] using the two-mass model. They used the Nelder-Mead algorithm to optimize the model with HSV of two human subjects. From the HSV, the medial VF edges were extracted over time using image processing techniques. The extracted trajectory of the VF edges (the change in the spatial location of the VF edges across the recording time) was used to optimize and estimate the model parameters that yielded a trajectory close to the experimental data. They predicted the VF masses/stiffnesses and the subglottal pressure for each subject. Several studies have tackled the inverse problem to quantify the asymmetry in both VF properties and VF vibration patterns [93, 103, 104, 105, 84, 106, 92]. These studies used HSV data from vocally normal participants [92] and patients with unilateral VF paralysis [104] and functional dysphonia [92], as well as from animals [93, 103]. In these studies, the VF edges were extracted from the HSV data at the medial cross-section during vibration [84]. Only one study [106] used both HSV and EGG; it utilized HSV to extract the VF edges, whereas EGG was used to obtain three characteristic points of VF vibration: the beginning of the opening of the upper VF margins, the complete closure of the lower margins, and full VF contact. The prior studies used the same two-mass model with slight variations, such as using vertical coupling between masses [104], utilizing variable spring stiffnesses [93, 103], or including the collision force between the VFs [93, 103]. The commonly optimized parameters in these studies were the masses and their displacements, the stiffness of the springs, the stiffness during collision, the damping coefficients, the glottal length, the thickness of the masses, the areas between masses, and the subglottal pressure [93, 103, 104, 105, 84, 106, 92]. The particle swarm technique and genetic algorithms were most often utilized for the optimization [93, 103].

The biomechanical characteristics of the pathological phonation of different disorders, including LD, can be modeled using lumped-element models. For example, by including anterior-posterior variations in a 3D model, incomplete glottal closure can be simulated [107, 108]. The VF vibratory characteristics of LD associated with different voice qualities, ranging from pressed as in AdLD to breathy as in AbLD, can be modeled as well [109]. Unilateral laryngeal nerve paralysis produces a phase lag between the healthy and damaged VFs, owing to asymmetries in the VF tissues. This disorder can be simulated by decreasing the mass and increasing all the stiffnesses of the damaged VF [95, 110, 111]. Polyps and nodules, which are geometric abnormalities, can be mimicked too [112], by adding an extra mass to the affected VF, altering the stiffness/damping coefficients, modifying the collision force, and altering the subglottal pressure loading [113, 114].
Parkinson's disease, which causes breathiness, vocal tremor, and incomplete glottal closure [115, 116], is another disorder that lumped models can emulate [117, 118], through using time-varying model parameters and increasing the spring stiffnesses [95].

Despite the significant potential of inverse analysis, attempts to extract the biomechanical characteristics of impaired VF vibrations during running speech are almost nonexistent. That is, a major knowledge gap exists in analyzing connected speech based on the biomechanical characteristics of the VFs using biomechanical models. Prior models were not able to extract indirect biomechanical parameters during running speech; they focused only on sustained phonation. This is due to the lack of experimental data collected in connected speech, particularly the imaging (HSV) data, that are necessary to design and validate those models and generate the various biomechanical characteristics of VF vibrations. Another reason for the lack of previous models is the complexity of simulating running speech, since the phonatory events are transitory and convoluted. This high complexity would require biomechanical models with myriad parameters to be optimized, which is generally not favorable, as it considerably increases the computational cost, particularly when a complicated model is used for the inverse analysis problem. This difficulty has resulted in a lack of knowledge about connected speech and AdLD symptoms, since these symptoms are elicited only during connected speech which, as mentioned, is very complicated to simulate.

Therefore, as discussed previously, it is crucial to develop techniques to automatically analyze VF vibrations in order to generate the analyses mentioned earlier. Obtaining these measures automatically from the HSV video recordings requires the automated segmentation of the VF edges and the glottal area. These automated segmentations determine the location and shape of the vibrating VFs during speech. The objective representation of the VF edges then allows the development of the different HSV-based measures. In the next section, the different automated methods and approaches that have been implemented in the literature to spatially capture the VF vibrations are discussed.

1.5. Automated HSV Analysis

Considering the large number of images generated from HSV recordings, visual analysis is not a practical solution. This emphasizes the need for automated techniques that can analyze and process the HSV videos in order to obtain useful measurements and information about VF function during speech. Segmentation is a fundamental step in analyzing HSV video sequences and a building block needed to extract such measurements from the video data. Segmentation can be classified into three types: temporal segmentation, spatial segmentation, and spatiotemporal segmentation. Temporal segmentation is the process of dividing a video sequence into well-defined time segments (short sequences) in order to extract useful temporal information from the video. Examples of this type of segmentation would be identifying the instances in an HSV sequence during which the VFs are visually unobstructed, or the time segments during which phonation or VF vibration occurs. Another type of video analysis occurs spatially and is called spatial segmentation.
This technique identifies the region of interest (e.g., a moving object like the VFs) in the frames across the video sequence and provides its spatial location and structure for further analysis. For example, this technique can be used to identify the edges of the vibrating VFs and highlight the spatial location of the glottal area in the different HSV frames. This facilitates the downstream analysis of VF vibrations and the development of HSV-based measures. The third type, spatiotemporal segmentation, combines the previous two techniques such that the segmentation is performed on image sequences or video data in both the time and space domains.

In the literature, there are two main methods for segmenting HSV images: classical image processing algorithms and machine/deep learning techniques. The first method comprises the classical image processing techniques for extracting useful spatial and temporal information/features from images. In terms of temporal segmentation, a previous study on using HSV in connected speech developed an automated temporal segmentation method using a statistical image processing algorithm. This algorithm was able to extract the timestamps of the onsets and offsets of vocalizations and of epiglottic obstructions of the VFs [119]. For spatial segmentation, several studies developed and applied traditional image processing techniques for spatial segmentation of the VF edges in HSV during sustained phonation [120, 121, 41, 122, 123]. The main approaches for extracting the VF edges from HSV data are region growing [121, 124, 125], histogram thresholding [41, 126], level sets [127], and active contour modeling (ACM) [122, 123]. These image processing methods were used for HSV analysis of isolated sustained vowels. Most of the developed methods are not fully automated and require visual inspection of the data and some manual analysis [122, 124, 127, 128, 129]. Region growing and histogram thresholding are both vulnerable to image noise and intensity inhomogeneity [120]. The level set method can accurately estimate the glottal cycle only when the VFs are closed, and it is also prone to noise [127]. The ACM approach, however, is less sensitive to noise and intensity inhomogeneity in images, can be initialized anywhere, even across boundaries, and efficiently preserves global line shapes [130, 131]. Hence, this approach is a promising alternative technique for spatial segmentation of the glottal boundaries [132].

The ACM method can be used to dynamically locate the contour of desired image features, such as the edges of the glottis. The active contour (aka snake) is a spline that deforms based on certain energy minimization rules to capture the glottal edges (a standard formulation of this energy is sketched at the end of this subsection). The ACM approach has previously been employed to detect the glottal edges (i) spatially in each HSV frame, using closed-loop snakes for each individual HSV frame [121, 122]; (ii) temporally in HSV kymograms, using two open-curve snakes for detection of the right and left VFs [123, 133]; and (iii) spatiotemporally, where open-curve snakes are used for glottal edge detection in HSV kymograms across different cross-sections of the VFs, and the extracted edges are then registered back to each HSV frame [123]. The existing studies on spatial segmentation of glottal edges were performed on HSV data obtained during sustained vowels, not connected speech.
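For reference, the classical snake formulation (the standard Kass-Witkin-Terzopoulos form, consistent with the symbols defined in the front matter, though not necessarily the exact variant implemented in Study II) evolves a parametric contour v(s) = (x(s), y(s)), s ∈ [0, 1], to minimize the energy

E = ∫ [Eint(v(s)) + Eimage(v(s))] ds,

with

Eint = ½ (γ |v′(s)|² + β |v″(s)|²)  and  Eimage = −|∇I(v(s))|²,

where the elasticity weight γ penalizes stretching, the rigidity weight β penalizes bending, and the image term pulls the contour toward strong intensity gradients, such as the boundary between the dark glottal area and the brighter surrounding VF tissue.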
The second method used in the literature for both spatial and temporal segmentation of HSV data is machine and deep learning. Using this method, similar structures can be efficiently classified and clustered, and hidden patterns can be discovered in a sequence of image data at low computational cost. In the literature, deep learning has been used as a tool for temporal segmentation and, particularly, laryngoscopic image classification. Deep learning is a subcategory of machine learning and utilizes deep neural networks (DNN) that can learn features from known/labeled image data (training data) and make predictions on new image data based on the learned features [134]. Deep learning has shown promising performance in a variety of diagnostic tasks involving medical images [135], and there is immense potential for using deep learning techniques to analyze laryngeal images. Accordingly, several recent studies have applied deep learning to laryngoscopic videos to automatically select frames that display sufficient diagnostic information, allowing clinicians to find abnormalities in a timely manner; yet these models were vulnerable to overfitting due to the limited size of the training datasets (i.e., only a few hundred images) [136, 137, 138]. Others used deep learning as a classifier to recognize laryngeal pathology (such as polyps, leukoplakia, vocal nodules, and cancer) based on larger training datasets of thousands of laryngoscopic images [139, 140, 141]. However, none of the previous studies used HSV in connected speech for frame selection/classification, which is very important in studying voice disorders, as they mostly reveal themselves in running speech. Additionally, in terms of the type of voice disorder, this literature did not investigate AdLD. HSV in connected speech involves lower image quality than in sustained phonation and exhibits excessive laryngeal maneuvers across frames, which urges the development of even more efficient methods that can deal with these challenging conditions [81].

In addition to the temporal segmentation and image classification tasks, deep learning (specifically DNN) has also been utilized for the spatial segmentation task to capture the VF edges and glottal area in HSV data. Five recent deep learning techniques based on DNN have successfully segmented the glottis and VF edges in HSV recordings with satisfactory accuracy [142, 143, 144, 145, 146, 147]. The HSV datasets in these studies, however, were recorded during the production of sustained phonation using rigid HSV systems with high image quality. Also, because these approaches used DNN for the spatial segmentation task, they required manual labeling/annotation of the glottal edges/area in HSV frames to train the neural networks. Given that these previous spatial segmentation studies used only sustained phonation as their dataset, expanding this work to connected speech using a flexible HSV system is an important next step.

1.6. Research Gaps, Questions and Hypotheses

The current diagnosis of AdLD is predominantly based on auditory-perceptual assessment alone, which causes diagnostic confusion, as AdLD symptoms can mimic those of other disorders such as MTD [67, 68]. This is because the auditory-perceptual evaluation does not provide enough information to differentiate between the two disorders, leading to needless surgical interventions or treatments [148, 149].
Therefore, there is a need for other effective assessment tools, such as imaging techniques, which can provide more information about the impaired laryngeal mechanisms and vocal function in AdLD and, eventually, enhance its differential diagnosis [150]. Since previous studies found that the symptoms of AdLD typically appear in connected speech, not in a sustained phonation context [66, 151], the most appropriate and powerful imaging technique for studying AdLD is HSV. HSV allows the visualization and analysis of detailed VF vibrations, as well as the various phonation events in AdLD during connected speech [119, 35, 37]. This, in turn, could lead to a more accurate diagnosis of AdLD. However, most previous studies neither used HSV to study connected speech nor used HSV to investigate AdLD. Moreover, although the massive number of frames in HSV recordings demands automated analysis, there is a lack of effective automated tools (for temporal and spatial segmentation) for HSV analysis in connected speech. The reason is that using a transnasal flexible endoscope with a high-speed camera to record connected speech challenges the available automated image processing tools in terms of image quality and excessive laryngeal maneuvers. The present dissertation aims to fill these gaps in research.

Several measures can be directly extracted from HSV analysis, as mentioned before, such as GAT and GOT. These measures can lead to deeper insights into the physiological changes in the impaired voice production of AdLD patients. In addition, HSV can be utilized to construct individual-specific models by which different biomechanical measurements can be generated to study AdLD in connected speech, such as the elasticity and viscosity of the VF tissues [93]. These biomechanical measures can be generated using modeling techniques that have not been applied before in the literature to study vocal function in running speech. A summary of the research gaps, research questions (Q), and hypotheses (H) follows.

Research Gap 1: The current standard passages used for perceptual voice assessment of, e.g., AdLD make it difficult to clearly visualize the VFs when using HSV during connected speech. This calls for new speech tasks that would best allow for better HSV assessment in these populations. This can be done by introducing automated approaches to detect the visual obstructions of the VFs in HSV, with which new assessment passages can be tested and optimized. However, there is a lack of effective automated techniques for image/frame classification and temporal segmentation of HSV during connected speech. Several studies have proposed temporal segmentation techniques and image classification procedures to automatically detect laryngoscopic images that display sufficient diagnostic information. However, none of these studies used HSV in connected speech, which is very important in studying voice disorders such as AdLD, as they mostly reveal themselves in running speech. Only one study, by our research lab, addressed the temporal segmentation task, but it used a single HSV recording of a vocally normal individual in connected speech [119]. This emphasizes the need to develop reliable methods to temporally analyze connected speech in HSV.

Aim 1 is to build an automated tool to classify HSV frames by detecting the instances during which the image of the VFs is optically obstructed.
This tool will be able to automatically detect the time segments (HSV frames) that display an unobstructed, clear view of the VFs during a token of speech. This aim will address the following research questions.

Q1.1: Can a DNN accurately classify HSV frames in AdLD during connected speech regardless of the excessive laryngeal maneuvers?

H1.1: A DNN can accurately classify HSV frames based on whether they display an obstructed view of the VFs.

Q1.2: Does the presence of AdLD affect the durations over which the VFs are visually obstructed in HSV during running speech?

H1.2: The duration of visual obstruction of the VFs will be longer in AdLD than in normal controls during connected speech.

Research Gap 2: Another gap lies in developing automated spatial segmentation methods amenable to HSV analysis during connected speech. Successful spatial segmentation of HSV data would lead to precise detection/localization of the VF edges during vibration in connected speech, which can be used to assess and evaluate VF behavior during running speech. Yet the existing studies on spatial segmentation of VF edges were performed on HSV data obtained during sustained vowels. The image processing tools used in those studies are ineffective when applied to connected speech, with its more challenging conditions (poor image quality and excessive laryngeal movements).

Aim 2 is to spatially represent the VF edges in connected speech by developing a robust automated image segmentation tool that accounts for the poor image quality of transnasal HSV.

Q2: Can VF edges be accurately and robustly segmented in HSV data during running speech in the presence of image noise?

H2.1: The dark glottal area can be successfully silhouetted against the brighter surrounding VF tissues.

H2.2: ACM can accurately segment VF edges in HSV data with excessive image noise during VF vibrations.

H2.3: A clustering technique can be combined with ACM to build a hybrid method improving the edge segmentation accuracy of ACM during vocalization and when the VFs are not vibrating.

Research Gap 3: Although previous studies examined the AdLD disorder in running speech using acoustic analysis, laryngoscopy, and aerodynamic measurements [72, 73, 64, 74], the use of an advanced imaging technique like HSV has not been well investigated. Limited research was found on the use of HSV in studying AdLD, and it was mainly conducted on sustained phonation, not running speech. Therefore, there is a need for further investigation into analyzing HSV during connected speech tasks to gain a more complete understanding of the impaired vocal function in AdLD. To do so, it is critical to develop automated approaches that provide an analytical representation of the VF dynamics in AdLD during connected speech. Developing these approaches, though challenging due to the excessive laryngeal maneuvers present in AdLD, will represent a considerable contribution to the existing literature by offering a distinctive quantitative portrayal of the impaired behavior of VF vibrations. This can provide unique insights into the underlying VF dynamics in AdLD.

Aim 3 is to provide a quantitative representation of VF dynamics in AdLD during connected speech by extracting the glottal area waveform (GAW) and glottal edges from HSV.

Q3: Can the GAW be automatically extracted given the inferior image quality of fiberoptic HSV and the excessive laryngeal movements in AdLD during running speech?
H3.1: The hybrid method can be used as an automated labeling tool to train a robust DNN to detect the glottal area in HSV during running speech.

H3.2: This trained DNN can be successfully implemented for the automated extraction of the GAW in AdLD and normal controls despite the challenging image conditions.

H3.3: The glottal midline, along with the left and right VF edges, can be successfully captured based on the segmented glottal area.

Research Gap 4: Evaluation and investigation of the pathological vocal function during phonation onset and offset in AdLD has been almost nonexistent using HSV in running speech [152, 46]. Previous research showed that voicing onset/offset times might contribute to assessing AdLD symptoms using acoustic analysis in running speech [72, 73], not HSV in connected speech. Thus, developing an automated method for extracting quantitative measures of VF behavior at the onset and offset of phonation in HSV is important to understand the impaired vocal function in AdLD.

Aim 4 is to automatically analyze phonation onset and offset from HSV and measure the glottal attack and offset times in AdLD and normal controls. These two measures are critical factors for studying the pathophysiology of voice disorders and, particularly, AdLD [153]. This aim will answer the following research question.

Q4: Are the glottal attack and offset times different between AdLD and normal controls?

H4.1: An automated algorithm can be developed to measure GAT and GOT with accuracy comparable to visual measurements.

H4.2: GAT and GOT will be significantly higher in AdLD than in normal controls.

H4.3: GAT and GOT will show more variability in AdLD subjects.

Research Gap 5: A major knowledge gap exists in studying the biomechanical characteristics of the VFs in connected speech. Previous studies showed the successful development of biomechanical measures using inverse analysis of HSV data through model-based methods. Yet these studies examined the biomechanical behavior of the VFs neither in connected speech nor in AdLD. These modeling studies used HSV data recorded during prolonged vowels, which are impractical for analyzing pathological voice function, as it mainly appears in running speech. Moreover, complex lumped models with multiple masses have been used intensively in the literature. Given the complexity of analyzing connected speech, using simplified models becomes essential for optimizing simulations. However, simple models, such as the one-mass model, have not been explored before for inverse analysis of HSV data. This model offers a simple structure and, with minimal control parameters, can still represent self-sustained VF vibration. This advantage, together with its low computational demands, makes the one-mass model a compelling candidate for the present inverse analysis problem to capture the prominent features of the VFs.

Aim 5 is to develop a simplified model-based approach that can determine the biomechanical characteristics of the VFs, including VF mass, elasticity, and viscosity, based on an HSV running speech sample. The model parameters are obtained/optimized through inverse analysis of HSV data, using the dynamic vibration of the VFs extracted from the HSV data.

Q5: Can a simplified one-mass model be optimized to accurately match the vibratory behavior of the VFs extracted from HSV?

H5.1: A simplified one-mass model can successfully simulate both the vibratory and closure phases of VF motion.
H5.2: The particle swarm optimization technique will enable accurate optimization of the model to predict the experimental glottal area waveform.
H5.3: The optimized model parameters, obtained through inverse analysis of HSV data, can estimate the VF mass, elasticity, and viscosity indices.

1.7. Dissertation Structure

Chapter 2 focuses on developing the required methodology to answer the aforementioned research questions. A description of the study subjects and the clinical HSV data acquisition is included, covering two different datasets (color and monochrome HSV data). The methodology used to generate the HSV-based measures and the analyses conducted are elaborated in detail in this chapter. This methodology consists of five studies involving various techniques and analyses, each discussed in Chapter 2 under a separate section. A brief summary of the approaches and analyses performed in each study follows.

Study I: Obstruction detection of the VF view, a method for extracting the frames that display an obstructed view of the VFs; this method includes a deep learning framework developed to conduct this task.
Study II: Image segmentation of VF edges, an algorithm developed for capturing VF edges in HSV-based extracted kymograms.
Study III: A deep learning approach for analytical representation of the VF dynamics and glottal area, aimed at providing an analytical representation of the dynamic movement of the VFs.
Study IV: Automated measurements of GAT and GOT, which are extracted and analyzed from the video recordings to assess the impaired VF dynamics in AdLD during phonation onset and offset.
Study V: Development of a lumped modeling technique to simulate the oscillatory characteristics of the VFs, optimized using the experimental HSV data.

Chapter 3 presents and describes the results corresponding to each methodology and technique developed in the dissertation and is organized similarly to Chapter 2 for clarity. Chapter 4 discusses the results of the various developed methods and the analyses performed. Finally, Chapter 5 presents a conclusion for each study, including a summary of the findings.

CHAPTER 2: METHODOLOGICAL APPROACH

2.1. Research Design

This chapter provides information about the methodologies required to pursue the target aims of this dissertation, discussed in the previous chapter. The research design, methods, and analysis for each corresponding aim will be presented. In addition, information about the study subjects and the data acquisition is included at the beginning of this chapter. Two different sets of HSV data are used and analyzed in this dissertation study. Each dataset is utilized to pursue specific aims and address their related research questions and hypotheses. The first dataset was recorded during running speech from a vocally normal speaker using a flexible HSV system with a color high-speed camera. This recording had an inferior, noisy image quality with dim lighting. This challenging HSV data will be used to fulfill Aim 2 and part of Aim 3. The methods and image processing techniques related to these aims are mainly designed and implemented using this challenging recording in order to demonstrate the robustness of the introduced approaches.
The successful application of the proposed methods under such challenging image conditions facilitates their application to the less challenging monochromatic HSV images, since a monochrome camera provides higher sensitivity and dynamic range with better pixel representation. Monochromatic HSV recordings make up the second HSV dataset, which is utilized to execute the remaining research aims in this dissertation. The second set includes HSV recordings obtained from both vocally normal participants and patients with AdLD using a monochrome camera; each subject's HSV data were recorded during running speech.

This chapter includes a description of each methodology used to analyze the HSV data in order to address the research questions and aims of the present dissertation. Different methods and approaches are introduced for the data analysis and discussed under different sections. The first study was developed as a temporal (classification) technique that pursues Aim 1. In this approach, an effective tool was introduced using DNN to analyze the monochrome HSV dataset for both vocally normal adults and patients with AdLD. The tool was designed to automatically recognize and classify the time segments in HSV recordings that displayed an obstructed view of the VFs. Information regarding the DNN structure, training, and evaluation is included in this section. This temporal method makes it possible to analyze the differences between the normal controls and AdLD in terms of the durations of VF obstructions.

The second study aims at developing a spatial segmentation method that can successfully detect VF edges in HSV-based kymograms after a series of image processing steps to preprocess the color HSV data. These analyses fulfill Aim 2. The developed approaches introduced improved image processing techniques designed as a combination of the ACM method and a clustering technique. This hybrid method was developed based on the color HSV dataset. The design and implementation of this method will be discussed in detail in this chapter.

Aim 3 was implemented using another DNN method. The spatial segmentation technique developed for Aim 2 was used to train a robust DNN that can perform spatial segmentation to analyze the GAW and provide an analytic representation of the vibratory characteristics of the VFs for Aim 3. Information about how this DNN method for spatial segmentation was built, trained, and evaluated is included. This DNN tool was then applied to the monochrome HSV data to analyze the vibratory characteristics of the VFs in AdLD patients by automatically segmenting the glottal area/edges within the HSV recordings; this implementation is also discussed in detail in this chapter. The spatial segmentation DNN tool facilitated the development and analysis of the HSV-based measurements – particularly the GAT and GOT measures – which are discussed as the fourth study in this dissertation. In the fifth study, a lumped model was developed, simulated, and optimized in combination with the extracted results of the spatial segmentation of the VFs and the glottal area. This model pursued Aim 5, which is discussed in the last section of this chapter.

2.2. Study Subjects

The details of the two subject pools (subject pool one and subject pool two) included in the present dissertation are discussed in this section.
The main difference between the two subject pools is that subject pool one included a vocally normal speaker while subject pool two contained both vocally normal speakers and AdLD patients. Also, the two subject pools were recruited and examined at two different locations. Further information about each subject pool is included in the next two subsections.

2.2.1. Subject Pool One

The first HSV dataset (subject pool one) was obtained from a vocally normal female (38 years of age) who did not have any history of voice disorder. The following inclusion criteria were used for the vocally normal participant:
1) at least 18 years of age and proficiency in reading English written text;
2) no prior history of intubation injury or airway/laryngeal surgery;
3) normal hearing;
4) absence of structural abnormalities including lesions and/or VF paralysis/paresis;
5) absence of perceptual symptoms of the classical dysarthria;
6) cognitively intact and able to undergo the flexible HSV protocol.
The subject was recorded during running speech while reading the "Rainbow Passage." The examination was done at the Center for Pediatric Voice Disorders, Cincinnati Children's Hospital Medical Center and was approved by the Institutional Review Board. This subject pool is used to pursue Aim 2 and part of Aim 3.

2.2.2. Subject Pool Two

Monochrome data from 9 participants within the age range of 35–76 years were collected. The study population includes 4 vocally normal participants (3 female and 1 male) without history of voice disorder and 5 patients with AdLD (4 female and 1 male). The data collection was conducted at the Mayo Clinic in Scottsdale, AZ, and approved by the Institutional Review Board. The goal of this data collection is to use the resulting HSV dataset from the second subject pool to fulfill Aims 1, 3, and 4. The inclusion criteria for the normal participants were similar to the criteria mentioned in the previous subsection for the first dataset (subject pool one). The inclusion criteria for the AdLD subjects were [154, 155]:
1) at least 18 years of age and proficiency in reading English written text;
2) no prior history of intubation injury or airway/laryngeal surgery;
3) normal hearing;
4) absence of structural abnormalities including lesions and/or VF paralysis/paresis;
5) absence of perceptual symptoms of the classical dysarthria;
6) auditory-perceptual characteristics consistent with the disorder (evidence of phonatory breaks on voiced sounds and a strained-strangled quality, and no obvious tremor during phonation);
7) occasional moments of normal sounding voice;
8) improved voice for non-speech vocalizations;
9) improved voice quality for phonation at higher pitches;
10) cognitively intact and able to undergo the flexible HSV protocol;
11) not stimulable for voice change with facilitation techniques (e.g., distraction, improving breath and voice coordination and forward resonance, manual manipulation, etc.);
12) no recent Botox treatment or surgical treatment for AdLD.
A summary of the study inclusion criteria for both groups of subjects is included in Table 2.1.

Table 2.1. Study inclusion criteria for the non-pathological and AdLD groups.

Inclusion criterion | Non-pathological group | AdLD group
At least 18 years of age and proficiency in reading English written text. | X | X
No prior history of intubation injury or airway/laryngeal surgery. | X | X
Normal hearing. | X | X
Absence of structural abnormalities including lesions and/or VF paralysis/paresis. | X | X
Absence of perceptual symptoms of the classical dysarthria. | X | X
Cognitively intact and able to undergo the flexible HSV protocol. | X | X
Auditory-perceptual characteristics consistent with the disorder (evidence of phonatory breaks on voiced sounds and a strained-strangled quality, and no obvious tremor during phonation). | | X
Occasional moments of normal sounding voice. | | X
Improved voice for non-speech vocalizations. | | X
Improved voice quality for phonation at higher pitches. | | X
Not stimulable for voice change with facilitation techniques (e.g., distraction, improving breath and voice coordination and forward resonance, manual manipulation). | | X
No recent Botox treatment or surgical treatment for AdLD. | | X

Furthermore, all vocally normal participants were screened by a voice-specialized SLP (15+ years of experience), prior to consent, based on the above inclusion criteria. All participants with AdLD were diagnosed through the consensus of a voice-specialized SLP (5 years of experience) and a fellowship-trained laryngologist (15+ years of experience) based on the mentioned inclusion criteria. Also, in order to obtain an accurate diagnosis of AdLD, subjects were identified from the treatment-seeking population at the Mayo-AZ Otolaryngology – Head & Neck Clinic. All participants were seen by both a speech-language pathologist and a laryngologist. A voice evaluation and laryngoscopy were completed during sustained phonation and connected speech. Additionally, stimulability tasks were completed. Following the full voice evaluation, a diagnosis of AdLD was assigned via multidisciplinary consensus involving the laryngologist and one of three speech-language pathologists specialized in voice disorders. All participants with the diagnosis of AdLD had no evidence of tremor via consensus between the treating speech-language pathologists and laryngologist.

2.3. Data Acquisition

This section explains how the data acquisition was performed. The two subject pools included in the present work were examined and recorded differently. The difference mainly lay in how the experimental data were collected: different HSV systems and setups were used to obtain the video recordings from the participants. An HSV system with a color high-speed camera was utilized to collect recordings from subject pool one, whereas a monochrome high-speed camera was used to collect the HSV data from subject pool two. In addition, each camera had a different spatial resolution. A detailed description of each HSV setup is given in the following two subsections.

2.3.1. Subject Pool One (Color HSV System)

A custom-built color HSV system with 4,000 frames per second (fps) and 249 µs integration time was used for the data acquisition from subject pool one (a vocally normal speaker). The recording length was 29.14 s (116,543 frames in total) with an HSV image resolution of 256x256 pixels. The recording system included a FASTCAM SA-Z color high-speed camera (Photron Inc., San Diego, CA) coupled with a 3.6-mm Olympus ENF-GP Fiber Rhinolaryngoscope (Olympus Corporation, Tokyo, Japan), and a 300-W xenon light source, model 7152A (PENTAX Medical Company, Montvale, NJ). The camera had a 12-bit color image sensor with a sensitivity of ISO 20,000 and 64 GB of cache memory divided into two 32-GB partitions. The selected zoom lens adapter had a focal distance of 45 mm in order to provide the optimal pixel representation and dynamic range.
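As a quick consistency check on these recording parameters, the recording length follows directly from the frame count and the frame rate:

\( T = \frac{116{,}543\ \text{frames}}{4{,}000\ \text{fps}} \approx 29.14\ \text{s} \).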
The distance of the endoscope to the VFs was selected to ensure that, despite the active maneuvers of the larynx during connected speech, the VFs always fell within the field of view of the endoscope during the recording. The recorded HSV sequence was saved as an uncompressed 24-bit RGB AVI file and then analyzed. The camera used for the data collection had a native resolution of 1,024x1,024 pixels at 20,000 fps. However, for the purposes of the study, the resolution was set to 256x256 pixels, which at the chosen speed of 4,000 fps provided for up to 30 seconds per partition to record the reading of the Rainbow Passage. The selected frame rate was shown to be clinically acceptable for voice assessment [42]. Moreover, using 256x256 pixels provided the optimal balance between the image resolution, camera frame rate, duration of the recording, and the light sensitivity necessary for this data collection.

2.3.2. Subject Pool Two (Monochrome HSV)

A different HSV system was used to collect the HSV video data from the second subject pool. For this second HSV dataset, the experimental setup included a Photron FASTCAM mini AX200 high-speed monochrome camera (Photron Inc., San Diego, CA). The camera was coupled with a flexible nasolaryngoscope. The video recordings were collected at a spatial resolution of 256x224 pixels with a rate of 4,000 frames per second (fps). This spatial resolution was appropriate for the present dataset for the reasons mentioned for the previous HSV system (in the previous subsection). The recording procedure consisted of several connected speech samples. As such, in the same recording session, each subject was asked to read the six CAPE-V sentences and part of the "Rainbow Passage" (the first six sentences). The full length of the HSV recordings varied between 50 and 100 s among participants. The HSV files (up to 32 GB each) were stored on a computer connected to the HSV camera as mraw files (the file format of the Photron camera used in this study), then transferred to the laboratory data server after de-identification.

2.4. Study I: Automated Detection of Vocal Fold Image Obstructions

This study was intended to fulfill Aim 1 by addressing the following:
Q1.1: Can DNN accurately classify HSV frames in AdLD during connected speech regardless of the excessive laryngeal maneuvers?
H1.1: DNN can accurately classify HSV frames based on whether these frames display an obstructed view of the VFs.
Q1.2: Does the presence of AdLD affect the durations over which VFs are visually obstructed in HSV during running speech?
H1.2: The duration of the visual obstruction of the VFs will be longer in AdLD versus normal controls during connected speech.

This study investigates a new approach to temporally segment the HSV recordings. This approach was introduced to analyze the monochrome HSV dataset [152] and is discussed thoroughly in this section. The method was developed using a deep learning framework in which a DNN was designed as a classifier. The main objective of this temporal method was to automatically classify the video frames from the HSV recordings into either frames with a clear view of the VFs or frames with an obstructed view of the VFs. The proposed method fulfilled Aim 1 and its hypotheses using the vocally normal and AdLD subjects from the second subject pool (the monochrome dataset).
As this approach was designed using a classifying DNN, the structure, training, implementation, and evaluation of this classifying network are discussed in the following subsections. The results of implementing this approach are also discussed at the beginning of the results chapter.

2.4.1. Convolutional Neural Network (CNN) Architecture

The deep-learning method we propose for classification is based on convolutional neural networks (CNN), a well-known technique with promising performance for image classification [156, 157]. The network architecture is built using 64-bit MATLAB R2019b (MathWorks Inc., Natick, MA) as a powerful platform for CNN implementation. Figure 2.1 illustrates the main network architecture, which is mainly designed through 10 layers of 3x3 unpadded convolutions. Each convolution is followed by batch normalization to accelerate the training of the neural network and alleviate the sensitivity of the network to its initialization. A nonlinearity is then added by including a rectified linear unit (ReLU) to accelerate the computations and improve the network performance. ReLU sets values below zero in its input to zero while keeping values above zero unchanged. After each pair of successive convolution-batch normalization-ReLU layers, a downsampling step is applied using 2x2 max pooling with a stride of 2, where the number of feature maps is doubled. The dimensions of the feature maps corresponding to the different convolutional layers are illustrated in Figure 2.1 at each downsampling stage. The last layer is a sigmoid layer to classify whether the input frame contains the VFs or not (2 classes: unobstructed VF or obstructed VF). The inputs to the network-based classifier are the HSV frames (256x224 pixels), which are classified as images with an unobstructed or obstructed VF view at the classifier output.

Figure 2.1. A schematic diagram for the automated deep learning approach developed in this work. The HSV video frames serve as the input to the automated classifier. The detailed structure of the convolutional neural network is illustrated. The input frames are processed through several layers of 3x3 convolutions combined with rectified linear unit (ReLU) layers (in dark blue), followed by multiple 2x2 max pooling layers (in orange). The last layer includes a sigmoid layer (in green). The dimensions of the feature maps corresponding to the different convolutional layers are also included in the figure. The neural network classifies each frame into two classes as a classification output: either a frame with unobstructed VF(s) or a frame with obstructed VF(s).

2.4.2. Classification CNN Training

The training dataset is constructed using the monochrome HSV data (subject pool two). That is, a total of 11,800 HSV frames are manually selected and extracted from six monochrome HSV recordings of three normal participants and three AdLD patients – creating an image dataset by which the network is trained and validated. These chosen frames are labeled by a rater into two classes: 5,900 images with an unobstructed VF view and 5,900 images with an obstructed VF view. For the first class, the frames are selected such that all the different phonatory gestures that may occur during running speech are represented. For example, we select frames during sustained VF vibration, pre-phonatory adjustments, phonation onset/offset, and no VF vibration.
For the second class, we consider frames that displayed various types of VF obstructions due to, e.g., epiglottis movements, movements of the left/right arytenoid cartilages, laryngeal constriction, false VF movements, or a combination of any of these obstruction types. Generally, an obstruction is defined such that if more than 50% of the VF(s) is obstructed, the frame is classified as a frame with VF obstruction.

The total of 11,800 training frames is divided into two independent datasets, where 10,620 frames (90%) form a training dataset and 1,180 frames (10%) are assigned to a validation dataset. The validation dataset is created to tune the network parameters and avoid any overfitting that may occur during training. The Adam optimizer, a stochastic gradient descent optimizer, is used for training (refer to [158] for a complete description of the Adam optimizer). The training is performed with a batch size of 16 (16 images processed in each training iteration) for a maximum of 20 epochs (full passes through the training data). The trained network returns one of two labels for each frame in the video (either "Unobstructed VF" or "Obstructed VF").

2.4.3. Classification CNN Evaluation

An additional 2,250 HSV frames are manually extracted and labeled from the HSV recording of an AdLD participant (also selected from the second subject pool) who is not included in the training dataset. These labeled frames constitute a testing dataset, which is used to verify the generalization capability of the trained network on a set of new images. Also, the robustness and stability of the trained network are thoroughly evaluated through the comparison of automated and visual classifications for the entire HSV recordings of two participants. A rater visually analyzes the recordings of a vocally normal participant (264,400 frames) and an AdLD patient (399,384 frames) and determines the timestamps (frame numbers) of the beginning and ending of each obstruction.

We use the confusion matrix as a quantitative measurement to assess the performance of the trained network on the testing dataset and the two testing videos. The matrix demonstrates how accurately the trained network recognizes VF obstructions in the frames. From the confusion matrix, we use different metrics to evaluate the performance: class sensitivity, specificity, precision, F1-score, and the overall network accuracy. The sensitivity and specificity are measures of the network's ability to correctly predict the output based on the actual frame label, as follows:

\( \text{Sensitivity} = \frac{TP}{TP + FN} \),  (2.1)

\( \text{Specificity} = \frac{TN}{TN + FP} \).  (2.2)

TP (true positive) is the number of frames for which the network correctly predicts the positive class (frames with an unobstructed view). TN (true negative) is the number of frames for which the trained network correctly predicts the negative class (frames with an obstructed view). In contrast, FP (false positive) and FN (false negative) are the numbers of frames incorrectly classified as images with an unobstructed and an obstructed view, respectively. The sensitivity and specificity scores are utilized to obtain the receiver operating characteristic curve (refer to [152] for details about its generation). The area under this curve is calculated as an overall evaluation of the network performance: the larger the area, the better the accuracy of the network.
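To make the classifier and evaluation setup above concrete, the following is a minimal MATLAB sketch assuming the Deep Learning and Statistics toolboxes. The two-stage depth, layer widths, datastore variables (imdsTrain, imdsVal, imdsTest), and class name 'UnobstructedVF' are illustrative assumptions, not the exact ten-layer network of Figure 2.1:

    % Condensed sketch of the classification CNN and its evaluation (illustrative).
    layers = [
        imageInputLayer([256 224 1])             % monochrome HSV frame
        convolution2dLayer(3, 16)                % 3x3 unpadded convolution
        batchNormalizationLayer
        reluLayer
        convolution2dLayer(3, 16)
        batchNormalizationLayer
        reluLayer
        maxPooling2dLayer(2, 'Stride', 2)        % 2x2 max pooling with a stride of 2
        convolution2dLayer(3, 32)                % number of feature maps doubled
        batchNormalizationLayer
        reluLayer
        maxPooling2dLayer(2, 'Stride', 2)
        fullyConnectedLayer(2)                   % two classes
        softmaxLayer                             % stands in for the 2-class sigmoid output
        classificationLayer];
    opts = trainingOptions('adam', ...           % Adam optimizer, as in Section 2.4.2
        'MiniBatchSize', 16, 'MaxEpochs', 20, ...
        'ValidationData', imdsVal, 'Shuffle', 'every-epoch');
    net = trainNetwork(imdsTrain, layers, opts); % imds*: labeled image datastores
    % Test-set evaluation per equations (2.1)-(2.2); class ordering is assumed.
    pred   = classify(net, imdsTest);
    scores = predict(net, imdsTest);             % per-frame class probabilities
    C  = confusionmat(imdsTest.Labels, pred);    % rows: true class; columns: predicted
    TP = C(1,1); FN = C(1,2); FP = C(2,1); TN = C(2,2);
    sens = TP/(TP + FN);  spec = TN/(TN + FP);
    [~, ~, ~, auc] = perfcurve(imdsTest.Labels, scores(:,1), 'UnobstructedVF');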
The precision of correctly predicting each class is calculated from the following equation:

\( \text{Precision} = \frac{TP}{TP + FP} \).  (2.3)

The F1-score is computed based on the precision and sensitivity scores of each class as another way to measure accuracy, as shown in equation (2.4):

\( F1\ \text{Score} = 2 \times \frac{\text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}} = \frac{2TP}{2TP + FP + FN} \),  (2.4)

where a high F1-score means that the network has a low number of incorrect frame classifications for both classes. Also, the overall network accuracy when applied to the testing frames/videos is determined from equation (2.5):

\( \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \).  (2.5)

After the evaluation, the network is applied to the monochrome recordings of both the vocally normal subjects and the AdLD patients. The network identifies the durations and extracts the frames during which the VFs are visually obstructed. These durations are then compared in order to investigate any noticeable differences between the AdLD group and the normal controls – seeking to address H1.2.

2.5. Study II: Image Segmentation of Vocal Fold Edges

This study addresses Aim 2 by fulfilling the following:
Q2: Can VF edges be accurately and robustly segmented in HSV data during running speech in the presence of image noise?
H2.1: The dark glottal area can be successfully silhouetted against the brighter surrounding VF tissues.
H2.2: ACM can accurately segment VF edges in HSV data with excessive image noise during VF vibrations.
H2.3: A clustering technique can be combined with ACM to build a hybrid method improving the edge segmentation accuracy of ACM during vocalization and when VFs are not vibrating.

Spatial segmentation is the second study in the present dissertation. The goal of this study is to spatially segment the VF edges in HSV-based kymograms using the color HSV data (subject pool one). In order to achieve this, several image processing techniques are considered to facilitate the downstream spatial segmentation task. A preprocessing step including a temporal segmentation technique is required to provide the vocalized time segments in the HSV data that display only VF vibrations, to which the spatial technique can be effectively applied. The spatial segmentation approach is a combination of two image processing steps: a clustering technique and an ACM. This hybrid approach is applied to the color HSV recording to segment the VF edges after preprocessing the video. The proposed hybrid image segmentation method is aimed at fulfilling Aim 2. The details of each step in developing this technique are discussed under the following subsections.

The proposed image segmentation tool is developed to localize and segment the edges of the VFs across HSV frames in HSV-based kymograms. This tool consists of several integrated algorithms, which can be divided into two main stages, as shown in Figure 2.2: a data preprocessing step (including temporal segmentation, motion compensation, and kymogram extraction) and a machine-learning based image segmentation step (including a clustering technique and an ACM). Each step will be discussed in detail in the following subsections. All the algorithms are developed and implemented using 64-bit MATLAB (MathWorks Inc., Natick, MA).

Figure 2.2. Workflow chart of the spatial segmentation tool. The gray boxes indicate the input (HSV video data) and the output (HSV frames with segmented VF edges), the green boxes show the data preprocessing steps, and the blue boxes represent the image segmentation steps.
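Before detailing each stage, the workflow of Figure 2.2 can be summarized in the following MATLAB-style sketch; all helper functions here (readHSV, preprocessFrames, extractKymogram, kmeansInitContours, acmRefine) are hypothetical names for the steps described in the next subsections:

    % High-level sketch of the spatial segmentation tool (hypothetical helpers).
    video = readHSV('color_recording.avi');          % input: HSV video data
    [frames, segs] = preprocessFrames(video);        % temporal segmentation and motion
                                                     % compensation (Section 2.5.1)
    vfEdges = cell(numel(segs), 1);
    for s = 1:numel(segs)
        kymo = extractKymogram(frames, segs(s));     % kymogram at a VF cross-section
        init = kmeansInitContours(kymo);             % clustering -> initial contours
        vfEdges{s} = acmRefine(kymo, init);          % ACM -> segmented VF edges
    end                                              % output: frames with segmented edges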
2.5.1. Data Preprocessing

Several preprocessing steps are applied to the color HSV data before proceeding with the proposed spatial segmentation approach, as shown in Figure 2.2. These data preprocessing steps were developed in a previous study [119] and were applied to the present HSV recording. The main goal of this algorithm is to preprocess the HSV video to automatically extract the timestamps of the vocalized segments (the vibratory onsets/offsets), i.e., the time segments where the VFs are vibrating, before applying the image segmentation algorithm that provides an analytic representation of the VF edges, which will be discussed later. These preprocessing steps are essential as a temporal segmentation stage preceding the spatial segmentation: the temporal method must first determine where in time the vibrating VFs are present before the edges of the VFs can be localized and segmented.

A histogram analysis is first performed to enhance the contrast of the HSV frames by removing the saturated pixels. A gradient-based technique is then applied in order to eliminate the image noise and suppress any irrelevant movements in the image other than VF vibrations. This is applied to the images through a moving-average procedure to generate a motion window. The goal of the motion window is to capture the spatial location of the vibrating VFs in a window across the frames. The motion compensation is mainly performed to overcome the problem of VF misalignment due to the laryngeal maneuvers during running speech and the movements of the endoscopic tip relative to the laryngeal tissues over time. After detecting the location of the vibrating VFs across the frames, the HSV frames are cropped based on the location of the center of the motion window. The motion window is designed in an ellipse shape (see [37] for a complete description of the motion window). It is applied to each HSV frame in order to remove the irrelevant noise and tissues. This step is essential for extracting HSV kymograms inside a rectangular window that encloses the vibrating VFs in each frame.

The HSV kymograms are then extracted for the frames between 25 ms before the onset of vocalization and 25 ms after the vocalization offset for each vocalization, to ensure that the full pre-, post-, and peri-phonatory phases are included. For each vocalized segment, the first kymogram is extracted at the midline passing through the vibrating VFs inside the motion window. The spatial segmentation (explained in the upcoming section) is then applied to the kymogram to detect the glottal edges. The kymograms are extracted at different cross sections of the VFs between the anterior and posterior commissure. That is, the y-axis of the kymogram image represents the left-right dimension of the video frame, while the x-axis refers to time (frame number). The spatial segmentation algorithm, which will be discussed in the following subsection, is applied to each kymogram.
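As an illustration of this kymogram construction, the following minimal sketch stacks one cross-sectional pixel line per frame. The 4-D array 'frames' (height x width x 3 x N) of motion-compensated frames and the row index r crossing the vibrating VFs are assumptions:

    % Sketch: build a kymogram from motion-compensated frames (assumed array 'frames').
    N = size(frames, 4);
    kymo = zeros(size(frames, 2), N, 3, 'like', frames);
    for n = 1:N
        % y-axis: left-right dimension of the frame; x-axis: time (frame number)
        kymo(:, n, :) = permute(frames(r, :, :, n), [2 1 3]);
    end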
2.5.2. Machine-Learning Based Image Segmentation for Clustering

As mentioned earlier, after applying the data preprocessing step and extracting the kymograms of the HSV recording, a hybrid image segmentation method is applied to the kymogram images. This image segmentation step consists of a combination of two techniques, a clustering technique and ACM, as shown in Figure 2.2. This subsection discusses the development of the clustering technique.

Selection of the right features from the HSV kymogram images is an essential step toward the successful implementation of the clustering method. In this work, three features are extracted from the kymogram images. The pixel intensities of the red and green channels in the extracted kymograms are considered as two features. This is because the regions of interest in the kymogram (glottal areas) have lower intensities (darker) than the neighboring regions (the laryngeal tissues) – making pixel intensity a good predictor of the glottal area. In addition, given the contrast between the intensity of the glottis and the surrounding regions, the third feature is the kymogram image gradient. As such, the image gradients are computed along the x- and y-axes in the kymogram with a step size of 8 pixels. Then, the overall gradient magnitude is calculated by taking the square root of the sum of the squared pixel gradients in the two directions. The three extracted features are then normalized between 0 and 1 and input into a k-means clustering technique [159].

The k-means clustering algorithm is an unsupervised machine learning technique that is well known for image segmentation [159]. The k-means clustering technique is based on partitioning a dataset into a certain number of disjoint clusters (groups of data). This technique requires the initialization of the number of clusters (k) and the center of each cluster (centroid). Refer to [159, 160] for the full details of the implementation and the algorithm. In this study, for the kymogram images, the number of clusters is two (inside or outside the glottal area) and the initial centroids are chosen based on the k-means++ algorithm, which uses a heuristic to initialize centroid seeds for k-means clustering (see [160] for the full details of the algorithm). The clustering algorithm then computes the distance between the centroids and each pixel in the kymogram. The distance is calculated as the Euclidean distance:

\( d = \lVert I(x, y) - c_k \rVert \),  (2.6)

where \( d \) is the Euclidean distance, \( I(x, y) \) corresponds to the intensity of the kymogram, \( x \) and \( y \) refer to the pixel coordinates, \( c_k \) is the cluster centroid, and \( k \) is the cluster number. Each pixel in the kymogram image is assigned to the nearest centroid based on the calculated distance, leading to the formation of the initial clusters. Once the grouping is done, the algorithm recomputes the updated centroid of each cluster \( c_k \) as follows:

\( c_k = \frac{1}{N_k} \sum_{(x, y) \in c_k} I(x, y) \),  (2.7)

where \( N_k \) is the number of pixels assigned to cluster \( k \); this new centroid is the data point to which the summation of the distances from all the pixels located in that cluster is minimal. This process is repeated iteratively – reshaping the clusters in the image at each iteration – until convergence, when the distance between the new and original centroids no longer changes.

Instead of applying the clustering algorithm to the entire kymogram image for a specific vocalization, each kymogram is divided into smaller kymograms with a maximum of 50 frames to mitigate any possible impact of image noise on the clustering accuracy. For example, when part of the kymogram has extremely bright pixels (saturated or near-saturated pixels), the clustering technique might be misguided, particularly with a large number of frames. This might occur due to movements of the epiglottis and large reflections from its surface.
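A minimal sketch of this feature extraction and clustering step is given below, assuming the Statistics and Image Processing toolboxes; the gradient call simplifies the 8-pixel step size described above, and the variable 'kymo' (an RGB kymogram image) is an assumption:

    % Sketch: three-feature k-means segmentation of one (sub-)kymogram.
    R = double(kymo(:, :, 1));  G = double(kymo(:, :, 2));  % red/green intensities
    Gmag = imgradient(rgb2gray(kymo));                      % overall gradient magnitude
    nrm = @(A) (A - min(A(:))) ./ (max(A(:)) - min(A(:)));  % normalize features to 0-1
    X = [nrm(R(:)), nrm(G(:)), nrm(Gmag(:))];               % one feature row per pixel
    labels = kmeans(X, 2, 'Start', 'plus');                 % k = 2, k-means++ seeding
    mask = reshape(labels == 1, size(R));                   % candidate glottal cluster;
                                                            % verify which cluster is darker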
After applying the clustering algorithm to each kymogram, each pixel in the kymogram is assigned to either cluster 1 (a pixel belonging to the glottal area) or cluster 2 (a background pixel). All the pixels in the same cluster share the same label. Accordingly, a new binary labeled image of the kymogram is constructed, where each pixel has the binary value of its cluster number. The spatial location of the glottal edges in the kymograms – corresponding to the left and right VFs – is then determined by spatially annotating the boundary of the glottal area cluster (cluster 1), which serves as the initial contours for the ACM method.

2.5.3. Hybrid Method

The hybrid method for spatial segmentation is developed by combining the clustering technique, discussed above, with an ACM method. The active contour in the ACM method is a spline that deforms spatially based on an internal rule (depending on the rigidity and elasticity of the contour) and an external rule (depending on the gradient of the image) until the contour captures the glottal edges in the image. This deformation is performed through an energy optimization process, which aims to minimize the sum of the internal and external energy functions, corresponding to the contour shape and the image gradient, respectively [132].

The glottal boundary locus resulting from the clustering method is provided to the ACM technique as the initial active contours for the right and left VFs in the kymograms. These initialized contours, which lie close to the VF edges, facilitate an efficient ACM implementation in which the contours are accurately deformed to detect the exact locations of the VF edges in the kymograms. That is, the initial contours (also called snakes) are parametrized as a vector \( v(s) = [x(s), y(s)] \), where \( s \in [0, 1] \). The objective energy function that needs to be minimized is defined as [132, 161]:

\( E = \int_0^1 \left[ E_{int}(v(s)) + E_{image}(v(s)) \right] ds \),  (2.8)

where \( E_{int} \) is the internal energy function and \( E_{image} \) is the external image energy function. The internal energy function of the contours is computed by:

\( E_{int}(v(s)) = \frac{1}{2} \left[ \alpha(s) \, \lvert v'(s) \rvert^2 + \beta(s) \, \lvert v''(s) \rvert^2 \right] \),  (2.9)

where \( v'(s) \) and \( v''(s) \) are the contour's first and second derivatives, respectively; \( \alpha \) and \( \beta \) are two weights included to adjust the elasticity and rigidity of the snake, respectively, which control the snake shape. The two weights \( \alpha \) and \( \beta \) are set to 0.1 and 0.03, respectively. The image energy function \( E_{image} \) counterbalances the internal energy and is given by:

\( E_{image}(v(s)) = - \lvert \nabla I(x, y) \rvert^2 \),  (2.10)

where \( \nabla I(x, y) \) is the spatial gradient of the kymogram image. The solution of equation (2.8) is based on discretization of the energy function. The finite difference method is used to approximate the first and second derivatives in the internal energy function. After discretization, the internal energy function, given in equation (2.9), can be rewritten as:

\( E_{int}(v_i) = \frac{1}{2} \left[ \alpha \, \lvert v_i - v_{i-1} \rvert^2 + \beta \, \lvert v_{i+1} - 2v_i + v_{i-1} \rvert^2 \right] \),  (2.11)

where \( v_i \) refers to the ith snaxel; the snaxels are the vertices that make up the snake spline. The discretization of the image energy function, given in equation (2.10), yields:

\( E_{image}(v_i) = - \lvert \nabla I(v_i) \rvert^2 \).  (2.12)
By discretizing the total energy function, given in equation (2.8), the following equation can be derived, which is considered a dynamic-programming problem [161]:

\( E_{total} = \sum_{i=1}^{n} \left[ E_{int}(v_i) + E_{image}(v_i) \right] \),  (2.13)

where \( n \) is the total number of snaxels, corresponding to the total number of frames included in the kymogram. The solution to the above dynamic-programming problem generates a sequence of functions \( \{S_i\}_{i=1}^{n-1} \) with one unknown variable \( v_i \), where \( S_i \) is the optimum value function (see equation (2.14)). At each \( S_i \), a minimization is conducted over the variable \( v_i \), where \( v_i \) is a state variable that can be assigned \( m \) possible values. The value of \( m \) refers to the number of pixels in the neighborhood of the snaxel that the algorithm searches to find the optimal \( v_i \). In this study, the value of \( m \) is set to five.

\( S_1(v_1) = \min_{v_1} \left[ E_{int}(v_0, v_1, v_2) + E_{image}(v_1) \right] \),
\( S_2(v_2) = \min_{v_2} \left[ S_1(v_1) + E_{int}(v_1, v_2, v_3) + E_{image}(v_2) \right] \),  (2.14)
\( S_3(v_3) = \min_{v_3} \left[ S_2(v_2) + E_{int}(v_2, v_3, v_4) + E_{image}(v_3) \right] \),
  ⋮
\( S_n(v_n) = \min_{v_n} \left[ S_{n-1}(v_{n-1}) + E_{int}(v_n) + E_{image}(v_n) \right] \),

and in the general case,

\( S_i(v_i) = \min_{v_i} \left[ \sum_{j=1}^{i-2} E_{int}(v_j) + \sum_{j=1}^{i-1} E_{image}(v_j) + E_{int}(v_{i-1}) + E_{int}(v_i) + E_{image}(v_i) \right] \).  (2.15)

The image gradient \( \nabla I \) is calculated in the vertical direction along the left-right dimension of the kymogram image in each frame with a step size of 10 pixels. The gradient is computed for each kymogram to signify the glottal edges, where the intensity changes rapidly around the initialized contours. Accordingly, the movement of each snaxel for the left and right active contours is limited to the vertical direction. The positive and negative gradients are computed next. The positive gradient of the kymogram is obtained by assigning the negative-gradient pixels a value of zero; the negative gradient is obtained by assigning the positive-gradient pixels a value of zero. The positive gradient is used in the image energy function when searching for the left VF edges, while the negative gradient is used for the right VF.

To find the snakes that exactly capture the left and right VF edges, the discretized energy function in equation (2.15) is solved for i = 1, 2, 3, …, n. The minimization of the \( S_i \) functions is done recursively, and the snake is updated during each loop until the value of the snake remains almost unchanged through the minimization procedure. As such, during each loop and at each snaxel, the value of the \( S_i \) function is numerically calculated for the snaxel's five column-wise neighboring pixels. The neighboring pixel that leads to a minimum total energy is considered the updated value of \( v_i \). After all the \( v_i \) values are updated, the next loop starts, continuing until the algorithm converges and the optimum snake spline is found. The convergence of the algorithm depends on an error function, defined as the sum of squared differences between the snake calculated in the current loop and that in the previous loop. The algorithm converges when the error becomes smaller than or equal to 1. The successful execution of the optimization process leads to an accurate adjustment of the initialized contours resulting from k-means and to a smooth segmentation of the VF edges in the kymogram images. The hybrid method (k-means + ACM) is applied to all the kymograms at the different cross sections of the VFs for glottal edge detection.
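The following is a simplified sketch of one relaxation pass of this search; it uses a greedy per-snaxel update over the m = 5 column-wise neighbors rather than the full dynamic-programming recursion of equation (2.14), and the variables v (snaxel positions) and gradMag (kymogram gradient image) are assumptions:

    % Sketch: one greedy relaxation pass over the snaxels (boundary checks omitted).
    % v: n x 1 snaxel row positions (one per frame); gradMag: kymogram gradient image.
    alpha = 0.1;  beta = 0.03;                   % elasticity and rigidity weights
    for i = 2:n-1
        best = Inf;  vBest = v(i);
        for dy = -2:2                            % m = 5 column-wise neighboring pixels
            y  = v(i) + dy;
            Ei = 0.5*(alpha*(y - v(i-1))^2 + beta*(v(i+1) - 2*y + v(i-1))^2);
            Ee = -gradMag(y, i)^2;               % image energy, as in equation (2.12)
            if Ei + Ee < best, best = Ei + Ee; vBest = y; end
        end
        v(i) = vBest;                            % keep the minimum-energy neighbor
    end  % repeat passes until the sum of squared changes in v is <= 1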
The detected edges in the kymograms are then registered back onto the HSV frames for each vocalization to determine the spatial location of the VF edges with respect to the original HSV frames.

In order to demonstrate the advantage of combining ACM with k-means, the ACM method is also directly applied to detect VF edges in the kymogram images without incorporating the clustering technique. Hence, instead of using the clustering method to initialize the active contours for the ACM method, a horizontal line (an active contour) is initialized in the kymogram space, passing through the center of the glottis. The contour initialization is performed for each kymogram \( n_i \) using the first moment of inertia, denoted by \( M_1(y, n_i) \), calculated as follows [82]:

\( M_1(y, n_i) = \frac{\sum_{x=1}^{K_w} \sum_{y=1}^{K_h} I(x, y, n_i) \, y}{\sum_{x=1}^{K_w} \sum_{y=1}^{K_h} I(x, y, n_i)} \),  (2.16)

where \( x \) and \( y \) correspond to the spatial coordinates of a pixel, \( I(x, y, n_i) \) is the pixel intensity in the kymogram image, \( K_w \) is the kymogram image width (the number of frames in the kymogram), and \( K_h \) is the kymogram image height in pixels. Since the first moment of inertia determines the center of brightness, the kymogram is inverted to find the center of darkness (i.e., the centroid of the glottis). The resulting moment line of the kymogram image is considered the initialized line. The ACM is then applied to the kymograms using this initialized straight line, which is deformed upward and downward until it captures the left and right VF edges. The performance of implementing the ACM alone on the kymogram images is evaluated in order to reveal whether the ACM by itself is a suitable image segmentation method for localizing VF edges in HSV-based kymograms. In addition, the performance of the ACM is compared with that of the hybrid method by applying the two methods to a kymogram of decent quality and a kymogram with dim lighting and degraded quality. These different applications are selected to address the hypotheses related to Aim 2.

2.6. Study III: Deep-Learning-based Representation of Vocal Fold Dynamics

This study is conducted to pursue Aim 3 by addressing the following:
Q3: Can the GAW be automatically extracted given the inferior image quality in the fiberoptic HSV and the excessive laryngeal movements in AdLD during running speech?
H3.1: The hybrid method can be used as an automated labeling tool to train a robust DNN on detecting the glottal area in HSV during running speech.
H3.2: This trained DNN will be successfully implemented for the automated extraction of the GAW in AdLD and normal controls even under challenging image conditions.
H3.3: The glottal midline along with the left and right VF edges can be successfully captured based on the segmented glottal area.

The successful building of the hybrid approach to capture VF edges during vibration allows for the development of a more generalized, flexible approach that can automatically capture the VF dynamics. Since the hybrid technique can capture the glottal edges during sustained VF oscillations, we utilize it to build a deep learning model that can segment the glottal edges/area also during the nonstationary portions of running speech, such as prephonatory adjustments, onsets and offsets of vibration, and when there is no VF vibration [47, 45, 46]. As such, this model is designed based on a DNN and utilizes the hybrid method as an automated labeling tool for training the segmentation network in order to precisely represent VF edges in all phonatory events during connected speech.
The color HSV recording is considered to implement the proposed DNN approach. Below is a description of how the hybrid method serves as an automated labeling tool, along with an explanation of the design, training, and evaluation of the DNN for spatial segmentation. After the developed DNN model is built and applied to the color HSV data, it is retrained and applied to the monochrome HSV data, which is also discussed in this section. The development of this new DNN model and its application to the monochrome data pursue Aim 3.

2.6.1. Automated Labelling Tool

The hybrid image segmentation tool, previously discussed, is implemented to provide an adequate estimate of the glottal area in a set of HSV frames, selected from the color HSV video during VF vibration. This set of segmented frames forms a training dataset on which a DNN is trained to accurately segment the glottal area during connected speech in different phonatory events. Hence, instead of using manual labeling to create the training data, the hybrid image segmentation method is utilized as an automated labelling tool. This is done by applying the hybrid technique to the kymograms and locating the glottal edges at different cross sections along the VF length; these detected edges are then registered back onto the HSV frames, where the glottal area is identified. These segmented HSV frames during VF vibration are used as automatically labelled data for training a segmentation network (DNN).

2.6.2. Segmentation CNN Architecture

The U-Net architecture is used, which is a fully convolutional neural network architecture. U-Net was introduced by Ronneberger and colleagues in 2015 as an image segmentation tool, particularly in the biomedical imaging field (Ronneberger et al., 2015). This U-shaped network comprises two parts: an encoder and a decoder. Figure 2.3 illustrates a schematic diagram of the DNN used in this work, which shows the proposed U-Net architecture based on the work of Ronneberger et al. The network is implemented using 64-bit MATLAB (MathWorks Inc., Natick, MA). As shown, panel (a) illustrates the general encoder-decoder design of the U-Net, and panel (b) displays the detailed structure of the network. An example of a network input (HSV frame) and output (segmentation mask) is shown in the figure, in which the captured glottal area is automatically highlighted in white. The HSV frames serve as the input. The feature map dimensions are shown in the figure as well, such that the first and second entries represent the height and width of the image while the third entry is the number of channels. For example, convolution (256, 256, 64) is an image with a height and width of 256 pixels and 64 different channels.

Figure 2.3. Schematic diagram of the proposed deep neural network. Panel (a) shows the general encoder-decoder architecture of the U-Net. Panel (b) illustrates the detailed structure of the network. An example of a network input (HSV frame) and output (segmentation mask) is shown in the figure, in which the captured glottal area is shown in white. The input image is first processed in the encoder (highlighted in green) through several layers of 3x3 convolution-ReLU units. After each two layers of convolution-ReLU, a 2x2 max pooling layer is included (represented as red arrows).
The outcome features of the encoder are then upsampled and processed within the decoder part (highlighted in orange) by several layers of convolution-ReLU, followed by 2x2 transposed convolution-ReLU layers (the blue arrows) for the upsampling. The last layer includes a final convolution, followed by a soft-max layer. The residuals (shown as light gray arrows) at each downsampling stage are concatenated from the encoder to the decoder. The feature map dimensions are shown in the figure.

The input image is first processed in the encoder (in green), which encodes the image into feature representations by extracting its main spatial features and context. The image in this contracting path is downsampled through multiple layers of 3x3 convolutions along with ReLU layers. The feature maps are kept in memory for later concatenation with the decoder before each downsampling step (light gray arrows). After each two layers of convolution-ReLU, a 2x2 max pooling layer is included (red arrows) for downsampling, to reduce the input image size while preserving the most predominant image features. This is done by a sliding window with a size of 2x2 pixels, where only the pixel with the maximum value in the window is kept. The extracted features from the encoder are then propagated and upsampled along an expansive path (the decoder, in orange) by several layers of convolution-ReLU, followed by 2x2 transposed convolution-ReLU layers (the blue arrows) for upsampling. The decoder layers reconstruct/recover the dimensions of the feature maps to ultimately match the input image resolution. After four decoder stages of upsampling, the last layer includes a 1x1 final convolution followed by a soft-max layer to classify each image pixel into two classes based on the resulting image features: a glottal area pixel (a value of 1) or a background pixel (a value of 0).

2.6.3. Segmentation CNN Training

To train a DNN, an optimization technique is used to tune the network parameters to values that yield the minimum difference between the predicted outcome (by the developed network) and the expected outcome (from the ground-truth data). The Adam optimizer is used to train the network. The deep learning network (U-Net) is first trained on a training dataset created using the automated labelling tool (the hybrid method). The training dataset includes 2,050 automatically labeled/segmented frames selected from the color HSV recording. These segmented frames are evaluated through visual inspection to validate the accuracy of the hybrid labeling tool prior to training the neural network. 20% of the training data are used as a validation dataset to evaluate the performance of the network during the training process and, accordingly, tune the network parameters to enhance its performance.

Because running speech imposes excessive laryngeal movements and frequently alters the spatial location of the VFs across the HSV frames, data augmentation is applied to the training images before training. This allows for a more generalized, accurate model that can adapt to the variability in connected speech. Accordingly, the training images are translated with random shifts in both the vertical and horizontal directions, downscaled and upscaled, and randomly rotated. These modified training images are used to train the constructed network with a batch size of 10 for a maximum of 20 epochs.
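A compact MATLAB sketch of this setup, assuming the Deep Learning and Computer Vision toolboxes, is shown below; unetLayers is used as a stand-in for the hand-built architecture of Figure 2.3, and the datastore variables (imdsTrain, pxdsTrain, dsVal) and augmentation ranges are illustrative assumptions:

    % Sketch: U-Net construction, augmentation, and training (illustrative).
    lgraph = unetLayers([256 256 3], 2, 'EncoderDepth', 4);   % glottis vs. background
    aug = imageDataAugmenter( ...                             % random shifts/scales/rotations
        'RandXTranslation', [-10 10], 'RandYTranslation', [-10 10], ...
        'RandRotation', [-10 10], 'RandScale', [0.9 1.1]);
    ds = pixelLabelImageDatastore(imdsTrain, pxdsTrain, 'DataAugmentation', aug);
    opts = trainingOptions('adam', 'MiniBatchSize', 10, 'MaxEpochs', 20, ...
        'ValidationData', dsVal);
    segNet = trainNetwork(ds, lgraph, opts);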
The trained network output is a binary image (segmentation mask) with the same dimensions as the input frame, where only the pixels located in the glottal area are white while the other pixels are black.

2.6.4. Segmentation CNN Evaluation

Several networks are trained with different architectures and training parameters. Networks with different numbers of encoder-decoder stages are tested (ranging from 3 to 6 stages, which refers to the number of times the input frames are downsampled/upsampled). Different batch sizes of 4, 10, 16, and 32 are also considered during the training of these networks. The segmentation performance of each of the trained networks is evaluated against a testing dataset, and the best-performing network is determined based on the highest segmentation accuracy scores.

The testing dataset is comprised of manually labelled HSV frames. This dataset is created using 600 frames from different phonation events, including sustained VF vibration, onsets/offsets of phonation, and when VFs are not vibrating. These frames are different from the training images. The glottal edges in these frames are manually segmented by an expert to serve as ground truth. Four measures are used for assessing the performance of the trained networks on the test set: accuracy, Intersection over Union (IoU), Dice Coefficient (DC), and Boundary-F1 (F1) score. The first three metrics (accuracy, IoU, and DC) are used to assess the quality of segmenting the glottal area, while the F1 score is computed to assess the accuracy in detecting VF edges (glottal boundaries). The segmentation accuracy is calculated using the following equation:

\( \text{Accuracy} = \frac{TP}{TP + FN} \),  (2.17)

where TP refers to the true positive pixels, i.e., the number of correctly predicted pixels inside the glottal region; in contrast, FN is the number of pixels incorrectly predicted as background (not glottal area pixels). The IoU metric is determined along with the accuracy to provide a more robust performance evaluation, including the assessment of those pixels that are incorrectly classified. The IoU ranges between 0 and 1; a value of 0 refers to no overlap between the predicted glottal area and the ground truth, while a value of 1 refers to a perfect similarity between the estimated and ground-truth glottal areas. IoU is calculated according to the following equation:

\( IoU = \frac{TP}{TP + FP + FN} \).  (2.18)

DC is additionally computed to compare the segmentation results of this model against other models in the literature that used DC as an assessment measure [142]. The equation used to compute the DC score is as follows:

\( DC = \frac{2 \times TP}{2 \times TP + FP + FN} \).  (2.19)

In addition to the above evaluation metrics, the F1 score is calculated to assess the accuracy of the boundaries and edges of the glottal area and VFs. The F1 score allows us to measure how accurately the predicted glottal/VF edges match the ground-truth ones. The F1 score is calculated using the following equation:

\( F1 = \frac{TP}{TP + \frac{1}{2}(FP + FN)} \).  (2.20)

2.6.5. Segmentation CNN Application to Monochrome HSV Data

The best-performing trained network, with the highest segmentation accuracy scores on the vocally normal color HSV video, is applied to the monochrome HSV recordings from both vocally normal and AdLD individuals (subject pool two). To do so, the network with the same architecture is first retrained on these new monochrome images in order to adapt to the changed environment of the new frames.
Hence, the network is retrained on 4,500 monochrome HSV images selected from HSV recordings belonging to vocally normal and AdLD speakers. These frames are selected during VF vibration, phonation onsets/offsets, spasms of the VFs in AdLD patients, no voicing (no VF vibration), and even obstructed views of the VFs (due to, e.g., epiglottis or arytenoid cartilage motion). The glottal area in these training frames is manually segmented using a MATLAB labelling tool ("Image Labeler"), as can be seen in Figure 2.4. Panel (a) includes a screenshot of the tool showing several options for annotation and an example of a labeled image in which the glottal area pixels are highlighted in yellow. Panel (b) illustrates six manually segmented HSV frames (before and after segmentation, along with a zoomed-in view) resulting from the use of the labeling tool.

Figure 2.4. MATLAB labeling tool used for manual segmentation of HSV frames (panel (a)); results of applying the labeling tool to manually segment the glottal area (highlighted in yellow) are shown for six different frames (panel (b)). The first and second columns in panel (b) show the frames prior to and after the manual labeling; the third column shows the zoomed-in segmented frames.

The network is evaluated again to make sure that it adapts to the new monochrome HSV frames with adequate segmentation quality; hence, the retraining parameters (particularly batch size and number of epochs) are fine-tuned. This is done by testing the network on 1,000 manually segmented frames that are not part of the retraining process, using the four assessment measures discussed before (accuracy, IoU, DC, and F1 score). After fine-tuning and evaluating the network performance, it is implemented on the entire HSV videos of the participants (both vocally normal speakers and AdLD patients) to extract the change in the glottal area (the glottal area waveform) during phonation. The glottal area and its edges are estimated spatially in each frame of the HSV recordings based on the spatial segmentation resulting from applying the DNN. The integral of the area enclosed by the VF edges is considered an estimate of the glottal area. The GAW is then determined as the sequence of the estimated areas across the image frames in each recording.

To automatically detect the left and right VF edges in HSV frames based on the extracted glottal area, the first image moment of inertia is computed for each row of the detected glottal area pixels (glottis center) in each HSV frame. Finding the first image moment localizes the center of brightness in the segmentation mask. After obtaining the center of brightness (center of the glottal area) on each row of the image, a set of scattered points is attained and depicted on the original HSV frame. Figure 2.5 illustrates the sequential steps to obtain the glottal midline. The figure shows an original HSV frame along with its automatically detected segmentation mask (in black and white), which is obtained using the DNN tool. The segmentation mask is shown in a zoomed-in view for a better visualization of the segmented glottal area shape. The figure includes another image showing the original HSV frame overlaid with the detected glottal area, along with another copy of the image showing the detected points corresponding to the center of the glottal area.
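The following sketch illustrates these two steps on one segmentation mask, under the assumption that 'mask' is the binary DNN output for a single frame; for a binary mask, the per-row first moment reduces to the mean column index of the glottal pixels, and a second-order polynomial fit gives the midline:

    % Sketch: glottal area and midline from one binary segmentation mask.
    gawSample = sum(mask(:));                    % glottal area (in pixels) for this frame
    rows = find(any(mask, 2));                   % image rows containing glottal pixels
    xc = zeros(numel(rows), 1);
    for k = 1:numel(rows)
        xc(k) = mean(find(mask(rows(k), :)));    % per-row center of the glottal area
    end
    p = polyfit(double(rows), xc, 2);            % fitted second-order midline curve
    midline = polyval(p, double(rows));          % midline position on each row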
A schematic diagram of the required steps in detecting the glottal midline, including a sequence of images showing (from left to right) an original HSV frame, the segmentation mask (a zoomed-in view in black and white), the automatically segmented glottal area (in red), and the glottal midline (in cyan).

After detecting the midpoints of the glottal area, the corresponding midline is predicted. The midline is estimated as a fitted second-order curve in order to accurately represent the glottal center. Based on the detected glottal midline (the fitted curve), the spatial locations of the left and right VF edges are automatically determined with respect to the location of the midline.

2.7. Study IV: Automated Measurements of Glottal Attack and Offset Time

The purpose of this study is to fulfill Aim 4 by addressing the following:

Q4: Are the glottal attack and offset times different between AdLD and normal controls?
H4.1: An automated algorithm can be developed to measure GAT and GOT with comparable accuracy to visual measurements.
H4.2: GAT and GOT will be significantly higher in AdLD versus normal controls.
H4.3: GAT and GOT will show more variability in AdLD subjects.

The successful implementation of the previous studies for spatial segmentation and HSV analysis makes it possible to automatically capture the glottal area and VF edges in the HSV recordings during running speech. This facilitates the extraction of useful HSV-based measures associated with VF dynamics. The proposed HSV-based measures are chosen so that they portray the VF dynamics during phonation onset and offset. The extraction of these measures helps fulfill Aim 4 by determining GAT and GOT. Note that these HSV-based measurements are generated using the monochrome data (subject pool two). Using this dataset, the measures are extracted from the HSV recordings during running speech in order to address the research question related to Aim 4.

The measures include GAT and GOT. These two measures are automatically generated during the onset and offset of phonation from the extracted GAW and the segmented VF edges. GAT and GOT are important to study because they are critical factors in investigating the pathophysiology of voice disorders [153]. GAT and GOT are physiological measurements that correspond to VF vibrations. GAT is defined as the time difference between the first oscillation and the first contact of the VFs. GOT represents the time difference between the last contact and the last oscillation of the VFs. These two measures are computed mainly from the change in the spatial location of the VF edges during phonation onsets and offsets. Hence, they are directly calculated from the HSV recordings after the successful implementation of the spatial segmentation tools discussed before. A further aim is to generate these two measures automatically from the HSV videos and validate them against visual analysis.

GAT is measured as the time delay between the rise of the energy of VF oscillation and VF contact during the onset of phonation. To do so, the normalized GAW and the average medial glottal contact waveform are calculated from the GAW and the detected VF edges. The energy of each of the two waveforms is then computed using a sliding window with a size of 30 ms. The GAW is defined as the area, measured in pixels, between the left and right VF edges.
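As a minimal illustration of the GAW extraction (with assumed variable names and a synthetic stand-in for the DNN segmentation masks), the waveform reduces to the per-frame pixel count of the segmented glottal region:

    % Minimal sketch: the GAW as the per-frame pixel count of the segmented
    % glottal area. 'masks' stands in for the DNN output; a synthetic array
    % is used here so the sketch runs as-is.
    masks   = rand(64, 48, 500) > 0.9;          % H x W x N logical masks
    nFrames = size(masks, 3);
    gaw     = zeros(1, nFrames);
    for n = 1:nFrames
        gaw(n) = sum(masks(:,:,n), 'all');      % glottal area in pixels
    end
    fs = 4000;                                  % HSV frame rate, frames/s
    t  = (0:nFrames-1) / fs;                    % time axis, s
    plot(t, gaw); xlabel('Time (s)'); ylabel('Glottal area (pixels)');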
The second waveform, the average medial glottal contact waveform, is identified as the average number of points (pixels) along the VF length that are in contact. The energy of the GAW increases at the beginning of VF oscillation, and the energy of the average medial glottal contact waveform rises sharply with VF contact. Therefore, the delay between the two energy waveforms can determine the GAT. The GAT is computed as the unbiased delay between the first derivatives (with a time step of 25 ms) of the two energy contours using a cross-correlation technique. The main reason for selecting the cross-correlation method is that it provides an unbiased measurement of the time delay without any predefined thresholds/conditions, free of any operator intervention or bias (fully automated) [153].

GOT is determined using a similar procedure. GOT represents the time delay between the drop of the energy of VF oscillation and VF contact during the offset of phonation. The GAW and the average medial glottal contact waveform are computed in the same way as for the GAT, as discussed above. During phonation offset, however, the energy of the GAW drops and dissipates after the last VF contact since the VF oscillation is damped, and the energy of the average medial glottal contact waveform drops sharply since the VFs start to separate after the last VF contact. The GOT is also calculated as the unbiased delay between the first derivatives of the two energy waveforms (with the same time step as in GAT), again using the cross-correlation method.

In order to fulfill Aim 4 and address its hypotheses, GAT and GOT are automatically computed using the above approach for the monochrome HSV recordings (subject pool two) from both the vocally normal speakers and the AdLD patients. The measurements are obtained using both the proposed automated algorithm and visual analysis. In the visual analysis, three raters analyzed the HSV data from each participant using a camera playback software (Phantom Camera Control, PCC), in which they could adjust the playback speed, Gaussian filter setting, gain, brightness, and contrast of the video frames for better visualization and image quality [162]. The raters visually determined the timestamps corresponding to the first oscillation and contact frames as well as the last oscillation and contact frames for each vocalization [162]. Based on the differences between these timestamps, they computed the GAT and GOT in number of frames and in ms (dividing the number of frames by the 4,000 fps frame rate). The raters compared and reviewed the vocalizations that showed large inter-rater error (GAT and GOT measures that differed by more than 2.5 ms for the contacts and 5 ms for the oscillations). This was done to allow the raters to come to a consensus about their measurements. The visual and automated GAT and GOT analyses of the HSV recordings are compared to each other in order to validate the accuracy of the automated analysis in detecting the GAT and GOT. After validating the automated approach, the mean and standard deviation values of GAT and GOT are computed for each HSV recording during the reading of the six CAPE-V sentences as well as the Rainbow Passage. The goal of analyzing these measurements is to demonstrate how the GAT and GOT measures are affected by the impaired vocal function of AdLD during the onset and offset of phonation in running speech.
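Before turning to the statistical analysis, the delay computation at the core of both measures can be made concrete. The following minimal MATLAB sketch estimates the lag between two energy contours via cross-correlation on synthetic stand-in signals; the dissertation's implementation uses an unbiased cross-correlation estimate [153], a normalization omitted here for brevity:

    % Minimal sketch: delay between two short-time energy contours via
    % cross-correlation. Synthetic signals stand in for the GAW and the
    % average medial glottal contact waveform.
    fs   = 4000;                            % frames per second
    win  = round(0.030 * fs);               % 30 ms sliding energy window
    t    = 0:1/fs:0.5;
    gaw  = max(0, sin(2*pi*150*t));         % synthetic oscillation proxy
    lagT = 40;                              % ground-truth delay: 40 frames (10 ms)
    contact = [zeros(1, lagT), gaw(1:end-lagT)];   % delayed copy as stand-in

    eGaw     = conv(gaw.^2,     ones(1, win), 'same');   % short-time energies
    eContact = conv(contact.^2, ones(1, win), 'same');

    dGaw     = diff(eGaw);                  % first derivatives of the energies
    dContact = diff(eContact);

    xc = conv(dContact, fliplr(dGaw));      % cross-correlation via convolution
    [~, iMax] = max(xc);
    lag   = iMax - numel(dGaw);             % lag (in frames) at the maximum
    gatMs = lag / fs * 1000;                % estimated delay in ms (~10 here)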
A statistical analysis is then used to investigate whether there is any significant difference between the GAT/GOT measures of the AdLD group and those of the normal controls. The GAT and GOT measurements are treated as continuous dependent variables in the statistical model. The group type (vocally normal subjects vs. AdLD patients) is treated as a categorical independent variable. To test the proposed hypotheses of Aim 4, a t-test is conducted to compare the GAT/GOT between the non-pathological group and the AdLD group.

2.8. Study V: Lumped-Element Modeling

The purpose of this study is to fulfill Aim 5 by addressing the following:

Q5: Can a simplified one-mass model be optimized to accurately match the vibratory behavior of the VFs extracted from HSV?
H5.1: A simplified one-mass model can successfully simulate both the vibratory and closure phases of VF motion.
H5.2: The particle swarm optimization technique will enable accurate optimization of the model to predict the experimental glottal area waveform.
H5.3: The optimized model parameters, obtained through inverse analysis of HSV data, can estimate the VF mass, elasticity, and viscosity indices.

This section introduces the last study in this dissertation, which addresses Aim 5 and its related hypotheses. The study investigates the feasibility of generating biomechanical characteristics of VF vibrations. The dynamic vibration of the VFs (the experimental data) is obtained from the monochrome HSV data analysis. This is done by combining the extracted VF vibration with a model-based approach. Hence, biomechanical measures are generated from a vocalized segment during VF vibration in running speech as a proof of concept. In order to develop these measures, the HSV data are first processed using the spatial segmentation techniques proposed in the previous sections. The main goal of these video analysis techniques is to spatially capture the change in the glottal area during phonation across the video frames. Then, a one-mass lumped-element model is built [96, 97, 98, 99]. This model is designed such that each VF is described by one rigid mass coupled with a spring and a damper. The main model parameters associated with the VF properties are the mass, the stiffness of the spring, and the damping parameter [95]. The model is combined with the resulting HSV analysis such that the extracted glottal area is used to optimize the model parameters. The model is optimized so that it generates an oscillation behavior similar to the glottal area extracted from the HSV data during VF vibration.

In the present work, a one-mass model, rather than a multi-mass model, was considered sufficient to simulate the glottal area variations during the VF vibrations observed in the HSV data. That is, the VF motion extracted from the HSV images only represents a two-dimensional oscillation (the opening and closing along the VF length) and does not display the contact area between the VFs along their thickness during the closure phase. Accordingly, increasing the number of masses to mimic the contact area of the VF oscillation was not necessary in the present work, and using a one-mass model to simulate/optimize the VF movement is a reasonable assumption. In other words, simulating the first mode of vibration using a one-mass model is enough to capture the observed VF vibrations in HSV.
This mode of vibration represents a specific vibratory pattern and shape in which every small part of the VF tissue oscillates sinusoidally at the same frequency, namely the fundamental frequency of VF vibration [163]. For optimization of the model against the experimental HSV data, different techniques can be used, e.g., particle swarm optimization and the genetic algorithm [93, 103]. In this work, the particle swarm approach is used to optimize and approximate the model parameters. This optimization algorithm has high efficacy across many applications, particularly in the field of biomechanics [93]. It is simple and can determine the optimal model solution using a few parameters. After optimizing the model, the main model parameters are quantified in order to estimate the corresponding biomechanical measures of the vibrating VFs. Each of the aforementioned steps is discussed in detail in the following subsections. The first subsection (2.8.1) includes the model description, governing equations, mathematical representation of the parameters, and the time-integration method used for the numerical solution. The second subsection (2.8.2) describes the input parameter values and the initialization of the model parameters. The last subsection (2.8.3) provides a detailed description of the optimization procedure, including a description of the experimental data, the optimized parameters, the optimization steps, and the optimization output.

2.8.1. Biomechanical Modeling

In this study, a one-mass lumped model is implemented as the first step to simulate the VF vibrations. Figure 2.7 illustrates a schematic diagram of the lumped-element model utilized in the present work. As can be seen, the lumped model consists of one mass (representing one VF), a spring (representing VF elasticity), and a damping element (simulating VF viscosity). In the illustrated diagram, the mass of the VF is denoted by m, VF elasticity is represented by k, and the damping coefficient of the VF is given by c. The diagram includes several other parameters used to describe the mathematical model. The subglottal pressure (Ps), the inlet glottis pressure (P1), and the outlet glottis pressure (P2) are shown in the figure at their relative locations. The glottal air flowrate, directed from the lungs, through the glottal constriction, toward the vocal tract, is represented by Qg. This mathematical model allows us to simulate the movement of the VF mass in one dimension (x(t)). In this model, the VF thickness is indicated by d, the length of the VFs is denoted by l, and the width between the two VFs is denoted by w.

Figure 2.7. A schematic diagram of a one-mass lumped model to simulate VF vibration.

Below is a detailed explanation of each step in deriving the mathematical equations used in this work. The mathematical representation and the differential equation of motion of the above oscillating system are derived from Newton's second law of motion:

\[ \mathcal{F} = m\ddot{x}. \tag{2.21} \]

The acceleration (\(\ddot{x}\)) of the VF depends on two variables: the net force \(\mathcal{F}\) acting upon the VF and the mass m of the VF. From equation (2.21) and the schematic diagram shown above, the governing equation of the system can be derived as follows:

\[ \sum \mathcal{F} = -c(t, x(t))\,\dot{x} - kx + F(t, x(t)) = m\ddot{x}, \tag{2.22} \]

where F(t, x(t)) indicates the external forces that act upon the VF during vibration as a function of time t and displacement x(t). Equation (2.22) can be written as:

\[ m\ddot{x} + c(t, x(t))\,\dot{x} + kx = F(t, x(t)). \tag{2.23} \]
To express equation (2.23) per unit mass of the VF, both sides can be divided by the VF mass. As a result, the elasticity index and the viscosity index associated with the vibrating mass are obtained. These two indices are computed from the following formulas:

\[ \text{Elasticity Index} = \frac{k}{m} \quad \text{and} \tag{2.24} \]

\[ \text{Viscosity Index} = \frac{c}{m}. \tag{2.25} \]

These two parameters are considered biomechanical output measures of the model in the present study. The Elasticity Index is related to the displacement trajectory of the VFs in the HSV recordings. The Viscosity Index is another important measure for studying the biomechanics of VF oscillation. The viscosity of VF tissue is a biomechanical property that measures resistance to the velocity of VF tissue deformation [164]. With higher VF viscosity, the VF oscillations would be expected to be more damped, and a greater subglottal pressure with a larger air force would be required to maintain the same vibratory behavior of the VFs [164].

After defining the main model equation and its parameters, the only remaining term to define is the external force acting upon the VF. The external force F(t, x(t)) is obtained from the inlet glottal pressure (P1), taken right before the glottal airflow enters the glottis, and the outlet pressure (P2), taken right after the glottal airflow exits the glottis (see Figure 2.7). The following formula is used to obtain the external force as a function of the inlet and outlet glottal pressures P1 and P2 [165]:

\[ F(t, x(t)) = \frac{1}{2}\, l d \left( P_1(t, x(t)) + P_2(x(t)) \right). \tag{2.26} \]

Experimental measurements have shown that P1 and P2 can be obtained from equation (2.27) using PS and the Bernoulli pressure PB [166]:

\[ P_1(t, x(t)) = P_S(t, x(t)) - 1.37\, P_B(t, x(t)), \qquad P_2(x(t)) = -0.50\, P_B(t, x(t)), \tag{2.27} \]

where the sign of the 1.37 PB term follows from the constant e2 = 1.87 ld/2 in equation (2.32) below. PB in the above equation can be derived from the Bernoulli equation. That is, if the glottal airflow through an orifice (formed by the VFs) is ideal, lossless, and steady, the glottal inlet and outlet pressures are identical and equal to PB [167]. PB can then be defined as the kinetic energy per unit volume attributed to the glottal airflow (Qg) and computed from the following formula:

\[ P_B(t, x(t)) = \frac{\rho\, Q_g^2(x(t))}{2 A_g^2(x(t))}, \tag{2.28} \]

in which Ag(x(t)) represents the glottal area (the space between the two VFs). Qg is computed using an empirical formula obtained by van den Berg and others [168], who empirically estimated the resistance of the glottal airflow (Ps/Qg) in an orifice. The following formula represents the derived empirical equation for determining Qg as a function of x(t) and Ps:

\[ \frac{0.875\,\rho}{2 d^2\, w^2(x(t))}\, Q_g^2(x(t)) + \frac{12 \mu l}{d\, w^3(x(t))}\, Q_g(x(t)) - P_S(t, x(t)) = 0. \tag{2.29} \]

Two air properties appear in the above equation: the air density ρ and the coefficient of viscosity µ. In this equation, w contributes to the nonlinearity of the model and depends on the trajectory and displacement of the VF mass. Plugging equations (2.27), (2.28), and (2.29) into equation (2.26) shows that the external force F(t, x(t)) is a function of the flowrate, which itself depends on the VF displacement x(t). Moreover, as can be seen from equation (2.29), the formula is a quadratic equation in which Qg is the unknown. Only the positive root of this quadratic equation is considered in obtaining the flowrate value.
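As an illustration, the positive root of equation (2.29) follows directly from the quadratic formula. The following minimal MATLAB sketch uses the constants later listed in Section 2.8.2 and an illustrative instantaneous width w:

    % Minimal sketch: solving equation (2.29) for the glottal flowrate Qg,
    % keeping only the positive root (CGS units; constants from Section 2.8.2).
    rho = 1.2e-3;       % air density, g/cm^3
    mu  = 1.86e-5;      % air viscosity coefficient, as given in Section 2.8.2
    l   = 1.4;          % VF length, cm
    d   = 0.3;          % VF thickness, cm
    w   = 0.05;         % instantaneous glottal width, cm (illustrative value)
    Ps  = 8000;         % subglottal pressure, dyn/cm^2

    a = 0.875*rho / (2 * d^2 * w^2);   % coefficient of Qg^2
    b = 12*mu*l / (d * w^3);           % coefficient of Qg
    c = -Ps;                           % constant term

    Qg = (-b + sqrt(b^2 - 4*a*c)) / (2*a);   % positive root only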
In order to compute Ag(x(t)), a resting position of the VFs is defined such that when the displacement x(t) of the mass is zero, the glottal area corresponds to the initial area Ag0 between the two VFs, which is a constant. Accordingly, the change of the glottal area during VF vibration can be calculated using the following formula:

\[ A_g(x(t)) = A_{g0} + l\, x(t). \tag{2.30} \]

Another parameter that we define is the critical displacement Xc, which is the displacement at the initiation of the closure phase:

\[ X_c = -\frac{A_{g0}}{l}. \tag{2.31} \]

The closure phase in this study refers to when the glottal area is equal to zero in the experimental data (when the VFs are in contact). When the mass displacement reaches and exceeds the predefined critical displacement value Xc, a complete glottal closure happens and the two VFs come into contact with each other. During the closure phase, when the mass exceeds the critical value Xc, the glottal area Ag(x(t)) becomes zero (theoretically, it would turn negative). It should be noted that the glottal volume flowrate Qg also becomes zero during closure. The derived differential equation and the defined variables and constants of the introduced lumped model can be summarized and rephrased as the following set of equations:

\[ m\ddot{x} + c(t, x(t))\,\dot{x} + kx = e_1 P_S(t, x(t)) - e_2 P_B(t, x(t)), \qquad e_1 = \frac{ld}{2}, \qquad e_2 = 1.87\,\frac{ld}{2}, \tag{2.32} \]

where e1 and e2 are constants. The parameterized variables c(t, x(t)) and Ps(t, x(t)) as well as the nonlinear function PB(t, x(t)) depend on the state of the glottis (either open or closed). These three variables are computed as follows.

If x(t) ≥ Xc (the open glottis, referring to the opening phase):

\[ c(t, x(t)) = 0, \qquad P_S(t, x(t)) = \bar{P}_S, \qquad P_B(t, x(t)) = \frac{\rho\, Q_g^2(x(t))}{2\,(A_{g0} + l\,x(t))^2}. \tag{2.33} \]

If x(t) ≤ Xc (the closed glottis, referring to the closure phase) first occurs at a specific time t = t0c, taken as the initial moment of closure, then

\[ c(t, x(t)) = c', \qquad P_S(t, x(t)) = P_{Smax}, \qquad P_B(t, x(t)) = 0 \quad \text{when } t - t_{0c} \le t_c. \tag{2.34} \]

\(\bar{P}_S\) refers to the typical value of the subglottal pressure. PSmax indicates the pressure level to which the subglottal pressure is raised, acting as a built-up pressure during the closure time. The time at which the closure phase begins is indicated by t0c, which depends on both x(t) and t. The subglottal pressure is parameterized as Ps = f(t, PSmax, tc, x(t)). This parameterization allows the change of the subglottal pressure during vocalization to be represented. The parameterization is performed as a step function, a simplified representation of the actual variations that occur in the subglottal pressure during VF vibration. Also, according to equation (2.34), the damping coefficient c' is parameterized as a function of t and x(t) to simulate the additional viscous damping occurring during the closure between the two VFs. Therefore, the above equations provide a mathematical representation of the model parameters, including five input parameters m, k, c', PSmax, and tc; the model output is the predicted glottal area Ag(x(t)). These input parameters are optimized using the procedure explained in section 2.8.3. Based on the derived mathematical representation, during the closure phase the external forcing function becomes 0.5(Ps l d), computed from equations (2.26) and (2.27), such that the forces (mainly coming from Ps) act on the mass to open the VFs.
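To make the switching logic of equations (2.33)-(2.34) concrete, a minimal MATLAB sketch is given below. The function and field names are illustrative assumptions, not the dissertation's own code, and the behavior after the closure interval tc elapses is an assumption noted in the comments:

    function [c, Ps, PB] = glottalState(t, x, Qg, Xc, t0c, p)
    % Piecewise parameterization of equations (2.33)-(2.34); p is an
    % illustrative struct with fields cPrime, PsBar, PsMax, tc, rho, Ag0, l.
    if x >= Xc                               % open glottis, equation (2.33)
        c  = 0;
        Ps = p.PsBar;
        Ag = p.Ag0 + p.l * x;
        PB = p.rho * Qg^2 / (2 * Ag^2);
    else                                     % closed glottis, equation (2.34)
        c  = p.cPrime;                       % extra viscous damping in contact
        if (t - t0c) <= p.tc
            Ps = p.PsMax;                    % built-up pressure during closure
        else
            Ps = p.PsBar;                    % assumption: reverts after tc
        end
        PB = 0;                              % no Bernoulli pressure while closed
    end
    end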
During VF contact, Ps increases, building up to the value PSmax to help push the VFs apart over the course of the opening phase. Another increase during this time is in the damping coefficient, which is raised by the additional viscous damping c' to simulate the overdamped contact between the VFs. When the VFs start to open (x(t) > Xc), the damping coefficient returns to the value c and the forcing function changes back to being derived from Ps and PB. The rest of this section discusses the numerical integration of these equations in order to simulate the theoretical glottal area waveform Ag(x(t)) as the main output of the system.

The above differential equation is solved using the classical fourth-order Runge-Kutta approach, an effective time-integration method, to determine the theoretical trajectory and displacement of the vibrating mass as well as the theoretical glottal area. To do so, the differential equation is rephrased as two first-order differential equations, formulated as follows:

\[ \dot{x} = V, \tag{2.35} \]

\[ \dot{V} = \frac{1}{m}\left[ -c(t, x(t))\,V - kx + e_1 P_S(t, x(t)) - e_2 P_B(t, x(t)) \right]. \tag{2.36} \]

Overall, the Runge-Kutta method computes the solution and performs the integration by iteratively updating the approximation at each iteration (time step). This update is done using a weighted average of function evaluations. The target of the computation is to numerically integrate the above differential equations and obtain x(t). For a clearer representation of the execution of the method, the two differential equations are combined into a single function dxdt as follows:

\[ dxdt(t, x) = \left[ x_2,\;\; \frac{1}{m}\left[ -c(t, x_1)\,x_2 - kx_1 + e_1 P_S(t, x_1) - e_2 P_B(t, x_1) \right] \right], \tag{2.37} \]

where x refers to a solution vector with two components: x1, indicating the displacement x(t); and x2, indicating the velocity V. Considering these definitions, the method is implemented through the following steps:

I. Initial values are first set for the solution vector x, for both the initial displacement and velocity, at t = 0: x(0) = {x, V}, where x1 = x cm and x2 = V. In addition, the time step Δt of the numerical solution is kept at 0.25 ms (1/4,000 s) in order to match the frame rate of the high-speed camera used to record the experimental data.

II. At each time step, intermediate slopes k1, k2, k3, and k4 are approximated. These slopes approximate the derivative of the solution at three different stages within the time step: the beginning, the middle, and the end of each time step Δt. They are evaluated using the derivative function dxdt at these timestamps. The purpose of these slopes is to capture the variation of the derivative function within the time step Δt, which yields better accuracy in approximating the solution across time steps. k1 and k4 refer to the slopes at the beginning and end of the time step Δt, while k2 and k3 indicate the slopes at the midpoint (Δt/2), evaluated using k1 and k2, respectively. The variable i is a loop counter used to iterate across the time steps. Below is the mathematical representation of the computations implemented for each slope:

\[ k_1 = \Delta t \cdot dxdt(t(i),\, x(i)), \]
\[ k_2 = \Delta t \cdot dxdt(t(i) + \Delta t/2,\; x(i) + k_1/2), \]
\[ k_3 = \Delta t \cdot dxdt(t(i) + \Delta t/2,\; x(i) + k_2/2), \tag{2.38} \]
\[ k_4 = \Delta t \cdot dxdt(t(i) + \Delta t,\; x(i) + k_3). \]

III.
The approximated solution at the next time step is updated from the computed slopes using a weighted average, which enables a better estimate of the solution at the next time step according to the following formula:

\[ x(i+1) = x(i) + \frac{1}{6}\,(k_1 + 2k_2 + 2k_3 + k_4). \tag{2.39} \]

IV. Steps II and III are repeated iteratively until the specified time duration of the numerical integration is reached and the solution vector has been computed at each time step, returning the approximated value of the displacement across time.

The above numerical integration method is implemented using 64-bit MATLAB R2020b (MathWorks Inc., Natick, MA) as a powerful platform for building such models. After computing the theoretical displacement of the mass through the numerical integration technique, the simulated glottal area waveform is determined as the model output using equation (2.30). In the next section, the simulation as well as the values of the model parameters are discussed.

2.8.2. Model Parameters Initialization

In this subsection, the initialization of the model parameters and the underlying assumptions are discussed in order to simulate the VF oscillatory behavior. The results of this simulation are discussed in Chapter 3. The lumped-element model discussed above is simulated to produce theoretical displacements of the VFs. The main output of the simulation is the theoretical glottal area waveform Ag. This theoretical area is subsequently optimized against the experimental one, as discussed in the following subsection. In order to simulate the model, the following model constants are used [169]: P̄s = 8,000 dyn/cm², Ag0 = 0.05 cm², µ = 1.86×10⁻⁵ g/(cm²·s), ρ = 1.2×10⁻³ g/cm³, l = 1.4 cm, d = 0.3 cm. The CGS system of units is used in this study. The damping coefficient c is considered zero during VF vibration when the VFs are open; when contact occurs, the damping coefficient is taken as c'. Ps is not constant during the simulation; instead, it is parameterized as a function of the simulation time (in s), the contact duration between the two VFs (tc, in s), and PSmax, the brief build-up of pressure during closure discussed before. During the opening phase of the VF oscillations, Ps is fixed at 8,000 dyn/cm² (~800 Pa); during the closure phase, it is increased to a maximum value PSmax and returned to 8,000 dyn/cm² when the VFs start to open again. In order to numerically solve the model, the initial displacement and velocity are defined as x = 0.01 cm and V = 0. These values are based on previous simulation studies [92, 169]. For the model simulation, the values of the mass m, the damping coefficient during closure c', the spring stiffness k, the maximum subglottal pressure PSmax, and the closure time tc are set to 0.24 g, 500 g/s, 5,000 g/s², 10,000 dyn/cm², and 2.75 ms, respectively. These values are chosen based on previous studies [169]. The simulation is carried out using these values to generate results that reflect the model's behavior prior to the optimization.

2.8.3. Model Optimization

After the mathematical representation and simulation of the proposed one-mass model, the model is optimized with the experimental HSV data.
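Each candidate parameter set examined during this optimization requires one forward simulation of the model. As an illustration of the numerical core described in Section 2.8.1 (equations (2.35)-(2.39)), the following minimal MATLAB sketch integrates a reduced version of the system; the force terms are collapsed into a toy linear restoring force purely so the sketch runs stand-alone, whereas the full model substitutes equations (2.32)-(2.34):

    % Minimal sketch of the RK4 core in equations (2.37)-(2.39). dxdt returns
    % [velocity; acceleration]; the toy oscillator below stands in for the
    % full force terms of equations (2.32)-(2.34).
    m = 0.24; k = 5000; c = 0;               % illustrative values (Section 2.8.2)
    dxdt = @(t, x) [x(2); (-c*x(2) - k*x(1)) / m];

    dt = 0.25e-3;                            % 0.25 ms step, matching 4,000 fps
    t  = 0:dt:0.05;                          % 50 ms of simulated vibration
    x  = zeros(2, numel(t));
    x(:,1) = [0.01; 0];                      % initial displacement (cm), velocity

    for i = 1:numel(t)-1
        k1 = dt * dxdt(t(i),        x(:,i));
        k2 = dt * dxdt(t(i) + dt/2, x(:,i) + k1/2);
        k3 = dt * dxdt(t(i) + dt/2, x(:,i) + k2/2);
        k4 = dt * dxdt(t(i) + dt,   x(:,i) + k3);
        x(:,i+1) = x(:,i) + (k1 + 2*k2 + 2*k3 + k4) / 6;   % equation (2.39)
    end

The fixed 0.25 ms step mirrors the 4,000 fps camera, so the simulated samples align one-to-one with the HSV frames.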
From the above description of the proposed biomechanical model, the oscillatory pattern of the VF vibration can be represented using a parameter vector (set), denoted by q, with six optimizing parameters: q = (α, m, k, c', tc, PSmax). Since the units of the simulated and experimental glottal area waveforms do not match, the scaling factor α is used in the optimization process to minimize the difference between the amplitudes of the two waveforms. In summary, the input model parameters used in the optimization process are the mass m, spring stiffness k, damping coefficient during the closure phase c', closure time tc, and the maximum subglottal pressure PSmax. The last two optimizing parameters, tc and PSmax, are used to compute the subglottal pressure. The optimizing parameters serve as inputs to the mathematical model in order to obtain the simulated glottal area waveform Ag as the simulation output.

The optimization procedure aims to match the theoretical/simulated glottal area waveform generated by the model to the experimental glottal area waveform. The experimental glottal area waveform is extracted using HSV from a vocally normal participant during a vocalized segment, where the glottal area is computed at each frame. The experimental glottal area is automatically detected using the DNN glottal area segmentation tool developed in this dissertation [170]. To optimize the theoretical/simulated glottal area waveform toward a good match with the experimentally extracted waveform, an objective function is computed as the sum of squared errors between the simulated and experimental glottal area waveforms at every time step (or frame), normalized by the experimental glottal area. Equation (2.40) shows the objective function (Obj) formula considered in the current study:

\[ Obj(q_i) := \frac{\sum_{n=1}^{\#frames}\left[ A_{Model}(n \cdot \Delta t) - A_{HSV}(n) \right]^2}{\sum_{n=1}^{\#frames} A_{HSV}^2(n)}, \tag{2.40} \]

where qi denotes the different sets of optimizing parameters, qi, i = 1, 2, …, N. As such, each set q refers to a parameter vector containing potential values of the six optimizing parameters mentioned above. AModel is the simulated glottal area, while AHSV represents the glottal area extracted from the HSV recording. Δt indicates the simulation time step (0.25 ms), used to obtain the time corresponding to a specific frame number n. In order to minimize the above objective function, the particle swarm optimization (PSO) technique [171] is employed to determine the optimum value of each optimized parameter (the optimum parameter vector q*). Below is a brief description of the optimization method and its implementation in the present study.

PSO is a population-based stochastic method, originally inspired by the flocking behavior of birds [171]. This technique works with a population of individuals (called particles), which navigate a designated search space in steps, with adjustable positions and velocities, through an iterative procedure. Each particle has a position, which represents a candidate solution for the optimization problem in the search space; a candidate solution is a possible set of values for the optimizing parameters (qi). Moreover, each particle has a velocity (the particle's momentum/movement), which determines the direction in which the particle is guided towards a better position (potential solution).
The potential solution corresponding to each particle is evaluated at each iteration step using the above objective function to assess how accurate the particle's solution/position is. After this evaluation, an updated velocity value is computed and assigned to each particle, according to which the particles adjust their positions in the next iteration step. After updating the velocity and position of each particle, the algorithm reevaluates the adjusted positions and iterates. In this process, the algorithm iteratively updates the velocity of each particle based on several factors: the particle's current velocity, the best position/solution found by that particle throughout the completed iterations (the local/personal best), and the best position/solution observed among all particles of the entire swarm (the global best). Throughout the iterations, all particles can coalesce towards a location in the search space (a possible solution) that may reflect the optimum solution (optimum particle positions). Convergence can also occur in PSO when the local or global best approaches a predetermined local optimum (an acceptable error level/threshold).

In the present study, the PSO algorithm is implemented using the following notation and assumptions. The iteration number of the algorithm is denoted by j, j = 1, 2, …, J, with a total iteration number of J = 400. The swarm includes a total of N = 200 particles, and each particle is identified by a unique i-value (same notation as in equation 2.40). A potential position of a specific particle i at a certain iteration step is represented by qi(j), a vector of six potential values of the optimizing parameters. Similarly, the particle's velocity is denoted by vi(j). The personal best position of each particle and the global best position of the entire swarm are denoted by pb and gb, respectively. The algorithm is implemented through the following steps:

I. Initializing the swarm particles with specified initial positions and random velocities. The same initial position is given to all particles at the first iteration step, acting as an initial guess of the target optimizing parameters. In the present work, the initial values for the six optimizing parameters α, m, k, c', tc, and PSmax are set to 120, 0.1 g, 40,000 g/s², 400 g/s, 5×10⁻³ s, and 15,000 dyn/cm², respectively. In other words, the particles' positions are chosen as qi(1) = (120, 0.1, 40000, 400, 0.005, 15000) in the first iteration. Also, a specific range is assigned to each optimization parameter to constrain the search space, as follows (listed in the same order as the initial values): α = 110-130, m = 0.04-0.30 g, k = 10,000-60,000 g/s², c' = 300-800 g/s, tc = (3-8)×10⁻³ s, and PSmax = 8,000-20,000 dyn/cm².

II. Evaluating a fitness value by computing the objective function outcome (equation 2.40) using each particle's position (qi(j)) in order to assess the performance (the error) of the potential solutions. The fitness value is a scalar (the inverse of the error) representing the quality of the candidate solution: the larger the fitness value, the lower the error and the better the solution.

III. Updating the local best position and the global best position: (1) if the evaluated current qi(j) is better than pb, pb is updated; (2) if the current pb is better than gb, gb is updated. These updated best values are then used in the subsequent iteration of the algorithm.

IV.
Updating the current velocity and, accordingly, the position of each particle. The velocity is adjusted first based on the current velocity, the distance from the personal best position, and the distance from the global best position. This adjusted velocity is then applied to the current particle's position according to the following formula [171]:

\[ v_i(j+1) = W\,v_i(j) + Z_1 r_1 \left[ pb_i - q_i(j) \right] + Z_2 r_2 \left[ gb - q_i(j) \right], \tag{2.41} \]

where r1 and r2 are random numbers between 0 and 1. Three constants are included in the equation: W, Z1, and Z2, whose values are chosen according to the standard values in MATLAB. The first constant, W, represents the inertia weight, given as an adaptive value in the range 0.1-1.1 that is reduced gradually over the iterations. The following formula is used to update the weight at each iteration in order to improve the convergence of the optimization procedure [172]:

\[ W(j) = w_{max} - j\,\frac{w_{max} - w_{min}}{J}, \]

where wmax and wmin, given by 1.1 and 0.1, represent the maximum and minimum weights considered in the optimization process, and J refers to the maximum number of iterations, 400, as mentioned before. Adjusting W influences the relative weight given to the two other constants (Z1 and Z2). Z1 is a cognitive component that controls the particle's tendency to move towards its own best position (given by 1.49), and Z2 is a social component that controls the particle's tendency to move towards the global best position of the entire swarm (given by 1.49).

V. Moving and updating the particles' positions based on the updated velocities. The new position of each particle is determined using the following formula [171]:

\[ q_i(j+1) = q_i(j) + v_i(j+1). \tag{2.42} \]

VI. Repeating steps II-V until the best solution is found, the optimization procedure converges to a specified acceptable error level, or the termination criterion is met. The termination criterion in this work is reaching the maximum number of iterations (400), after which the algorithm stops and returns the best solution found, including the optimum values of the optimized parameters.

After carrying out the optimization procedure, the six optimized parameters are obtained (denoted by q*). The optimum parameters other than the scaling factor, i.e., (m*, k*, c'*, tc*, PSmax*), are used as input model parameters to obtain the theoretical glottal area Ag* from the simulation. Ag* refers to the glottal area simulated using the optimized model parameters. The remaining optimizing parameter, the scaling factor, is used to obtain the optimized glottal area such that AModel = α*·Ag*, which can then be compared with the experimental waveform AHSV. The optimized parameters are also used to estimate the Elasticity Index k*/m* and the Viscosity Index c'*/m* as the biomechanical measures corresponding to the investigated VF vibration. Additionally, these optimized parameters were validated against typical values found in the literature.

CHAPTER 3: RESULTS

3.1. Study I: Automated Detection of Vocal Fold Image Obstructions

A sample of manually labeled frames from the training dataset is shown in Figure 3.1. The figure depicts the two different classes of frames considered during the training: “Unobstructed Vocal Fold” and “Obstructed Vocal Fold”. Under the “Unobstructed Vocal Fold” class, frames from different phonatory events at various gestures in connected speech are depicted, such as sustained VF vibration, phonation onsets/offsets, and no vibration.
In contrast, the “Obstructed Vocal Fold” group displays different configurations of VF obstruction observed during running speech. The figure presents partial/full obstructions due to the epiglottis, arytenoid cartilages, laryngeal constriction, or false VFs, or cases in which the VFs fall outside the view of the endoscope.

Figure 3.1. A sample of classified HSV images during connected speech using the manual analysis (visual classification). The two sets of three columns display the two different groups of frames: “Unobstructed Vocal Fold”, showing the presence of the true vocal folds, and “Obstructed Vocal Fold”, demonstrating an obstructed view of the vocal folds.

The results of applying the automated deep learning approach to the testing dataset are presented in Figure 3.2. A sample of random frames from the testing dataset, classified and labeled by the developed automated approach, is presented in the figure. The figure depicts the classification outcome of the trained network for both the “Unobstructed Vocal Fold” class (left side panels) and the “Obstructed Vocal Fold” class (right side panels). For each class, different testing frames are shown displaying various unobstructed views of the VFs and different configurations of obstructed views. Almost all the frames included in the figure show correct classifications by the developed tool. The figure includes only one misclassified frame in the “Obstructed Vocal Fold” class (the right-side frame in the second row), which should have been classified in the “Unobstructed Vocal Fold” class.

Figure 3.2. The classification results using the automated deep learning approach on the testing dataset. The two sets of three columns display the correctly classified frames of the testing dataset as “Unobstructed Vocal Fold” (left side panels) and “Obstructed Vocal Fold” (right side panels).

The performance of the developed CNN is demonstrated using confusion matrices for the trained network applied to the validation and testing datasets, as shown in Figure 3.3. The horizontal and vertical labels in the two matrices represent the predicted outcome of the classifier and the true visual observation classes, respectively. The cells show the number of correctly classified (in blue) and misclassified (in orange) frames in the “Unobstructed Vocal Fold” class and the “Obstructed Vocal Fold” class, represented in the figure as “VF” and “No VF”, respectively. The associated accuracy of each classification is also presented in the green cells. As can be seen, the overall accuracies of detecting VFs in the validation and testing frames are 99.15% and 94.18% (shown inside the dark green cells). In both datasets, the network has slightly higher accuracy in recognizing the VFs in the frames than in detecting an obstructed view of the VFs. This slight difference can be seen in the precision values of each class (in the two light green cells in the bottom row of the matrices): 99.66% versus 98.64% for the validation frames and 97.24% versus 91.11% for the testing frames, respectively. The two light green cells in the right columns of the matrices show the sensitivity and specificity of detecting VF obstruction in the frames, which are 99.66% and 98.66% for the validation dataset and 97.06% and 91.62% for the testing dataset, respectively.
Furthermore, the F1-score in the validation dataset was 0.99 for both the Unobstructed and Obstructed Vocal Fold classes, while these scores dropped marginally to 0.94 in the testing dataset.

Figure 3.3. Confusion matrices of the deep learning network, showing its performance on classification of the validation dataset (panel A) and the testing dataset (panel B). Blue and orange cells refer to the number of frames/images in each category, and the green cells represent the associated accuracy of each row and column; the overall classifier accuracy is highlighted in the dark green cells. The horizontal labels represent the predicted outcome of the classifier on the “Unobstructed Vocal Fold” class (VF) and “Obstructed Vocal Fold” class (No VF). The vertical labels refer to the ground-truth labels observed by the rater for each class.

The receiver operating characteristic curve of the developed CNN is illustrated in Figure 3.4. The figure depicts the curve (in blue) for the network applied to the validation dataset (panel A) and the testing dataset (panel B). The closer the curve is to the upper left corner, the higher the overall accuracy of the CNN. The diagonal red line represents points where Sensitivity = 1 − Specificity. Along with each curve, the value of the area under the curve (AUC) is included in the figure. As can be seen, the validation and testing curves show almost the same behavior; the AUC values of both curves are almost 1.00.

Figure 3.4. The sensitivity-specificity curve (receiver operating characteristic curve), in blue, for the validation dataset (panel A) and the testing dataset (panel B). AUC refers to the area under the sensitivity-specificity curve. The diagonal red line represents points where Sensitivity = 1 − Specificity.

The robustness of the proposed automated classifier was evaluated by comparing the CNN performance against manual classification using two complete HSV recordings: one from a vocally normal participant and one from a patient with AdLD. The results of the comparison are listed in Table 3.1 for 264,400 HSV images from the vocally normal participant and 399,384 images from the patient. For the vocally normal participant, the manual analysis reveals 38,497 out of 264,400 frames (14.56%) with an obstructed view of the VFs, and the automated analysis shows almost the same number of frames, 39,009 (14.75%). Likewise, for the patient, the manual and automated analyses yield close numbers of frames with obstructed VFs: 96,571 versus 97,545 out of 399,384 frames (24.18% versus 24.42%), respectively. The difference in the number of frames with obstructed VFs detected by the automated technique versus the manual observation is 512 (0.19%) for the vocally normal participant and 974 (0.25%) for the disordered participant.

Table 3.1.
Robustness evaluation: a comparison between the visual observation and the automated technique in terms of the number/percentage of frames with an obstructed view of the vocal folds in the entire HSV recordings of a vocally normal participant and a patient with AdLD.

                          # HSV Frames   # Obstructed   % Obstructed   Difference    Difference
                                         Frames         Frames         (# Frames)    (%)
    Normal participant       264,400                                   512           0.19
      Visual observation                 38,497         14.56
      Automated analysis                 39,009         14.75
    Patient with AdLD        399,384                                   974           0.25
      Visual observation                 96,571         24.18
      Automated analysis                 97,545         24.42

The two confusion matrices for the robustness evaluation are presented in Figure 3.5 for a detailed comparison of the proposed CNN against the manual analysis, using the same two HSV recordings as in Table 3.1. The two matrices in Figure 3.5 have the same formatting and color code as in Figure 3.3. For the vocally normal participant (Figure 3.5, panel A), the manual analysis shows 225,852 frames with an unobstructed view of the VFs and 38,548 frames with an obstructed view; relative to the manual analysis, the proposed CNN correctly classifies 221,955 (98.27%) and 35,112 (91.09%) of these frames, respectively. For the HSV video of the patient (panel B), the automated method successfully identifies the unobstructed and obstructed VF views in 287,055 out of 302,700 frames (94.83%) and 81,900 out of 96,684 frames (84.71%), respectively. Furthermore, the developed automated approach shows an overall accuracy of 97.23% and 92.38% for the entire HSV videos of the vocally normal participant and the patient, respectively. The overall accuracy measures the network's ability to correctly recognize both the presence and absence of the VFs in the HSV frames.

Figure 3.5. Confusion matrix of the developed deep learning network for classification of the HSV recordings of a vocally normal participant (panel A) and a patient with AdLD (panel B). The blue and orange cells refer to the number of frames/images in each category, and the green cells represent the associated accuracy of each row and column; the overall classifier accuracy is highlighted in the dark green cell. The horizontal labels represent the predicted outcome of the classifier on the “Unobstructed Vocal Fold” class (VF) and “Obstructed Vocal Fold” class (No VF). The vertical labels refer to the ground-truth labels, which are visually/manually observed for each class.

As in Figure 3.3, the evaluation metrics resulting from applying the proposed automated classifier to the two manually analyzed HSV videos are represented for each class as light green cells in Figure 3.5. For the normal participant's recording, the automated technique has a sensitivity and specificity of 98.27% and 91.09% with respect to recognizing VF obstruction in HSV frames, whereas these values fall to 94.83% and 84.71% for the patient's recording. The CNN precision scores are higher for the “Unobstructed Vocal Fold” class, at 98.48% and 95.10% for the normal and disordered participants, respectively, than for the “Obstructed Vocal Fold” class, at 90.01% and 83.96%. A similar pattern is found in the F1-scores for both HSV videos: 0.98 and 0.95 for the “Unobstructed Vocal Fold” class and 0.91 and 0.84 for the “Obstructed Vocal Fold” class.

Figure 3.6 shows the resulting receiver operating characteristic curve of the introduced classifier (in blue) for the two HSV videos (used for the robustness evaluation) of the vocally normal participant (panel A) and the patient with AdLD (panel B).
The figure shows the change in the network threshold of the binary classification with respect to the sensitivity and specificity of the developed classifier in recognizing VF obstruction over the entire video frames. As can be seen, the classifier shows better performance, with a larger area under the curve, when analyzing the normal participant's HSV sequence than the patient's. This is clear when comparing the two corresponding AUC values, shown in the bottom right corner of the panels in Figure 3.6. The AUC for the vocally normal participant is 0.99, while it drops marginally to 0.96 for the AdLD patient.

Figure 3.6. The sensitivity-specificity curve (receiver operating characteristic curve), in blue, of the developed deep learning network performance on binary classification of the entire two HSV videos of a vocally normal participant (panel A) and a patient with AdLD (panel B). AUC refers to the area under the sensitivity-specificity curve.

A detailed frame-by-frame comparison between the automated technique and the visual/manual analysis is illustrated in Figure 3.7. The comparison is shown for each frame of the entire two HSV videos of the vocally normal participant (panel A) and the patient with AdLD (panel B). For each video sequence, the red and blue colors represent the automated and manual methods, respectively, for the instances during which the VFs were visually obstructed. As can be seen in the figure, the results of the automated and visual detection display a similar pattern. Besides the visual assessment, the accumulated overall accuracy (solid black line), the precision of detecting an obstructed view of the VFs (dotted dark red line), and the precision of detecting an unobstructed view of the VFs (dashed green line) are also illustrated in the figure. These accuracies represent the performance of the developed classifier as a function of time for each HSV video. The accuracies were computed at accumulated time steps of 1,000 frames over the entire length of the videos. That is, for each time step, a confusion matrix was generated to evaluate the performance of the automated technique versus the manual analysis on the accumulated number of frames; these matrices were then used to compute the accumulated accuracies over the video sequence. As such, the values of the three accuracies at the end of each video (at the last frame) refer to the accuracies over the entire HSV video, as shown in Figure 3.5. As can be seen in Figure 3.7, both the overall accuracy and the precision of recognizing an unobstructed view of the VFs show similar behavior, with high values across each video's frames. In line with the previous results, the precision of detecting an obstructed view of the VFs shows slightly lower values than that of the unobstructed class over the entire video frames; the two curves also show different trends.

Figure 3.7. Comparison between automated (in blue) and manual (in red) analysis of the instances during which the vocal folds are obstructed. The comparison is shown for the entire two HSV videos of a vocally normal participant (panel A) and a patient with AdLD (panel B). The accumulated overall accuracy (solid black line), precision of detecting an obstructed view of the vocal folds (dotted brown line), and precision of detecting an unobstructed view of the vocal folds (dashed green line) are also illustrated.

After validating the proposed method, the difference between the AdLD group and the normal control group was investigated.
The AdLD patients showed a considerable difference in the average percentage of VF obstruction in comparison with the vocally normal controls. That is, the AdLD group exhibited an average obstruction percentage of 26.1% of the recorded HSV running speech sample, whereas the vocally normal speakers demonstrated a noticeably shorter average duration of obstruction (19.75%).

3.2. Study II: Image Segmentation of Vocal Fold Edges

3.2.1. Image Segmentation Approach: Active Contour Modeling (ACM)

Using the classical image processing technique for temporal segmentation of the color HSV data in preprocessing, the timestamps of all the vocalized segments of the HSV connected speech recording were extracted (except for the segments with epiglottic obstruction). Subsequently, motion compensation was applied to each vocalized segment of the “Rainbow Passage” to capture the location of the vibrating VFs across the frames. The results of applying the motion window to three individual frames during five different vocalizations are depicted in Figure 3.8a-e. Each row in Figure 3.8 shows three frames for a different vocalization between the following frame numbers: 40,505-41,255 (Figure 3.8-a), 42,975-43,815 (Figure 3.8-b), 84,281-84,891 (Figure 3.8-c), 103,942-104,577 (Figure 3.8-d), and 109,548-110,363 (Figure 3.8-e). The individual frame numbers are shown above the HSV frames in each figure panel. The figure shows that the implemented motion window captures both the location and the size of the vibrating VFs in different frames.

Figure 3.8. HSV frames along with the applied motion windows for three different frames at five different vocalized segments (panels (a-e)).

The kymograms were extracted for each vocalization of the “Rainbow Passage” after aligning the VFs across the frames using the motion compensation method. Examples of the extracted kymograms at the medial section of the VFs are shown in Figure 3.9. The kymograms for five different vocalizations are shown in panels a-e, between the same frame numbers as in Figure 3.8. The onset and offset of phonation can be seen in each kymogram, and the darker glottal area lies almost on a straight line across the frames of each kymogram.

Figure 3.9. HSV kymograms at the medial section of the vocal folds for five different vocalized segments (panels (a-e)). The L and R on the y-axis refer to the left and right VFs, respectively.

The results of applying the ACM method to four kymograms, extracted at different cross sections of the VFs, are illustrated in Figure 3.10. That is, after the snake initialization using the first moment of inertia line (a horizontal line spanning the centers of the glottal areas in the kymograms), the active contour algorithm was applied to the kymograms. The upper and lower snakes (active contours), corresponding to the left and right VFs, are shown for four kymograms in Figure 3.10. The number of frames shown in the figure is 546, between frames 40,585 and 41,165, including the voicing onset, VF vibration, and voicing offset. Two zoomed-in image segments are included in Figure 3.10 to better visualize the performance of the algorithm. As seen, the ACM approach detects both the left and right VF edges (solid green line and dotted yellow line, respectively) at different cross sections, providing an analytical representation of the glottal edges. Moreover, the algorithm is able to capture the edges before the phonation starts and after the phonation ends.

Figure 3.10.
Kymograms between frames 40,585 and 41,165 at four different cross sections of the vocal folds (panels a-d), along with the upper and lower active contours (solid green line and solid yellow line, respectively) corresponding to the left and right vocal folds. Two zoomed-in image segments are included to better visualize the performance of the algorithm. The L and R on the y-axis refer to the left and right VFs, respectively.

3.2.2. Image Segmentation Approach: The Hybrid Method

The following results demonstrate the implementation of the proposed hybrid method (ACM + k-means clustering) on the color HSV data. An example of five cropped HSV frames extracted from a vocalization, after applying the temporal segmentation and motion compensation techniques discussed previously, is illustrated at the top of Figure 3.11. This vocalization was extracted between frames 32,659 and 35,111. The frame numbers are shown at the top of panels (b)-(f). As seen, the motion window captures the size and the spatial location of the VFs during different phases of the vibratory cycle. After applying the motion window, the HSV kymograms were extracted at various cross sections of the VFs during each vocalized segment. Four kymograms, extracted at four different cross sections of the VFs during the same vocalization, are shown in Figure 3.11-g-j. The y-axis of the kymograms represents the left-right dimension of the HSV frame, while the x-axis refers to time (number of frames). Each kymogram in the figure displays the voicing onset and offset along with the vibration of the VFs.

Figure 3.11. Panels (b)-(f): five cropped HSV frames (frames #32,974, 32,979, 32,984, 32,992, and 32,997) after applying the motion window to five different HSV frames (one full HSV frame is shown in panel (a)). Panels (g)-(j): four extracted kymograms at different cross sections of the vocal folds. The R and L on the y-axis indicate the right and left VFs in the HSV frames, respectively.

The k-means clustering technique was implemented for each kymogram. Different subsets of features were fed into the machine learning (ML) algorithm to determine the proper number and combination of features leading to an accurate VF edge representation. Figures 3.12-3.14 illustrate a comparison between the results of applying two different combinations of features for glottal area/edge detection: i) the red and green channel intensities as two features (panel (a) in the figures) versus ii) the image gradient along with the red and green channel intensities as three features (panel (b) in the figures). Using the other subsets of the aforementioned features for clustering produced poorer performance than the selected feature combinations in Figures 3.12-3.14. Figure 3.12 shows the result of applying the clustering technique to the kymogram shown in Figure 3.11-h between frames 32,709 and 35,061 (for a total of 143,167 data points). The scatter plot in Figure 3.12-a is generated by feeding the clustering algorithm the two intensity features: the green channel intensity and the red channel intensity. The scatter plot in Figure 3.12-b is generated using the gradient feature along with both the red and green channel intensity features. The glottal area cluster in the kymogram is shown by red diamonds, and the non-glottal cluster is shown by blue circles.
As seen, after adding the gradient feature to the intensity features in Figure 3.12-b, the two clusters can be distinguished in the scatter plot; in contrast, when depending only on the intensities as features, it is relatively hard to divide the data points into two different clusters. The better performance of the ML method using the three features is more evident in Figure 3.13.

Figure 3.12. Scatter plots of the two clusters when applying the clustering method to the kymogram in Figure 3.11-h between frames 32,659 and 35,111: (a) using the green and red channel intensities as the features; and (b) using both green and red channel intensities along with the gradient as features.

Figure 3.13 shows the two clusters after applying the k-means clustering technique to the kymogram in Figure 3.11-h. The top figure illustrates the clustered regions using the two intensities as features and the bottom figure shows the result when using both the gradient and the intensities (green and red channel intensities) as three features. Figure 3.13 illustrates the clustered areas on the binary labeled kymogram so that only two distinct colors are shown, representing the two clusters obtained. As seen, using the gradient in addition to the intensity allows us to capture more information about the glottal area, which aligns well with the results obtained from the previous figure.

Figure 3.13. The clustered kymogram (from Figure 3.11-h) by employing the k-means clustering algorithm using (a) the red and green channel intensities as features and (b) the gradient along with the red and green channel intensities as three features.

Figure 3.14 shows the detected edges of the glottal area based on the results of clustering. In this figure, only the glottal cluster region is shown with a white line in the original kymogram to provide a better visual representation of the performance of the clustering method using the intensity features (panel (a)) and the gradient and intensity features (panel (b)). The comparison of panels (a) and (b) shows the improvement in clustering after adding the gradient feature to the intensity features. As can be seen in this figure, using only intensity features results in missing spatial information about the glottal area, particularly during the sustained vibration of the VFs. On the other hand, the glottal edges were detected more accurately when the gradient feature was used along with both the red and green channel intensities. This improvement is clear during the sustained oscillation of the VFs while it is not considerable during voicing onsets and offsets.

Figure 3.14. The detected glottal edges based on the results of the k-means clustering algorithm using (a) the green and red channel intensities as features and (b) the gradient along with both red and green channel intensities as three features.

The preliminary segmented glottal edges resulting from the clustering technique were used as inputs to the ACM method. Figure 3.15 shows how using k-means clustering as an initialization step for the ACM impacts the accuracy of the method. The results are presented in four kymograms extracted at four different vocalizations. The detected glottal edges using the ACM alone and the developed machine-learning-based hybrid method are shown for two good-quality kymograms (between frames 40,505 to 41,255, panel (b), and 103,992 to 104,522, panel (d)) and for two challenging kymograms (between frames 18,975 to 19,803, panel (a), and 98,105 to 98,651, panel (c)).
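Before turning to those results, the initialization step of the hybrid method can be sketched as follows: the upper and lower boundaries of the k-means glottal cluster seed the two snakes, which are then refined by the active contour algorithm. Here scikit-image's active_contour stands in for the dissertation's ACM implementation, and the smoothness weights are assumed values.

```python
# Hedged sketch of the hybrid step: cluster boundaries seed the two snakes.
# `glottal_mask` is the boolean output of the clustering sketch above.
import numpy as np
from skimage.segmentation import active_contour

def hybrid_edges(kymo_gray: np.ndarray, glottal_mask: np.ndarray):
    rows, cols = np.nonzero(glottal_mask)
    n_cols = kymo_gray.shape[1]
    upper = np.full(n_cols, np.nan)
    lower = np.full(n_cols, np.nan)
    for c in range(n_cols):
        r = rows[cols == c]
        if r.size:  # this frame (column) has glottal pixels
            upper[c], lower[c] = r.min(), r.max()
    # Fill frames where the cluster found no glottal pixels (e.g., closure).
    x = np.arange(n_cols)
    good = ~np.isnan(upper)
    upper = np.interp(x, x[good], upper[good])
    lower = np.interp(x, x[good], lower[good])

    # Snakes are (row, col) polylines refined on the grayscale kymogram;
    # 'fixed' keeps the endpoints anchored (the contours are open curves).
    snake_up = active_contour(kymo_gray, np.stack([upper, x], axis=1),
                              alpha=0.01, beta=0.1, gamma=0.01,
                              boundary_condition="fixed")
    snake_low = active_contour(kymo_gray, np.stack([lower, x], axis=1),
                               alpha=0.01, beta=0.1, gamma=0.01,
                               boundary_condition="fixed")
    return snake_up, snake_low
```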
The figure depicts the result of applying the ACM method alone along with the performance of the hybrid method at each vocalization. Although the ACM performed better for the top kymograms in panels (b) and (d) in comparison with the (more challenging) kymograms at the top of panels (a) and (c), this method missed the glottal edges for several cycles, as seen in the top figures in panels (b) and (d). The ACM was not able to capture the glottal edges for many glottal cycles in the dim kymograms, as seen in the top figures in panels (a) and (c). In contrast, the hybrid method showed a considerable enhancement in performance and high accuracy, detecting the glottal edges precisely for all the kymograms – as seen in the bottom kymograms of panels (b) and (d) as well as in the challenging, inferior-quality kymograms of panels (a) and (c).

Figure 3.15. The detected glottal edges using the ACM method (top kymograms in panels (a)-(d)) versus the hybrid method (bottom kymograms in panels (a)-(d)) for the kymograms extracted at four different vocalizations.

In Figure 3.16, five HSV frames are presented from each of the four different vocalizations in Figure 3.15 along with the detected glottal edges by the hybrid method. This figure shows the captured edges after registering the glottal edges from the kymograms back to the HSV frames. For each vocalization, the five frames are chosen to show several frames from different phases of a vibrating cycle of the VFs. As can be seen in Figure 3.16, the hybrid method was able to track the left and right VF edges accurately during the VF vibration in different frames and vocalizations.

Figure 3.16. Five HSV frames from four different vocalizations (panels (a)-(d): between frames 18,975-19,803, 40,505-41,255, 98,105-98,651, and 103,942-104,577) after implementing the hybrid method to spatially register the edges of the vibrating vocal folds.

3.3. Study III: Deep-Learning-based Representation of Vocal Fold Dynamics

3.3.1. Deep Learning Approach: Segmenting Network on Color HSV Data

In this section, the results of applying the DNN to the color HSV video are presented. Overall, the automated labeling tool (the hybrid method discussed in the previous section) was able to segment the glottal area in the training HSV frames on which the neural network was trained. The DNN was successfully trained on the automatically segmented frames in the training dataset. The trained network was then tested on manually labeled HSV frames, yielding a promising performance. The results are summarized below, starting with the temporal segmentation method, followed by the labeling tool (the hybrid method), the neural network, and the generation of the glottal area waveforms. Figure 3.17 shows the results of each preprocessing step at four different vocalized segments between frame numbers 4,261-5,551 (panel (a)), 42,999-43,774 (panel (b)), 84,900-86,118 (panel (c)), and 98,162-98,542 (panel (d)). In each panel, the outcome of applying the temporal segmentation, motion compensation, and kymogram extraction for a vocalization is illustrated. As shown, the utilized motion compensation specifies the true location of the VFs in the cropped frames. The stacked frames/cropped images refer to the sequence of image sections during the vocalized segments of the connected speech. These frames were used to generate multiple kymograms at different cross sections of the vibrating VFs (represented by stacked kymograms in the figure).
Examples of the extracted kymograms at the medial section of the VFs, showing the variation in the glottal region across the frames, can be seen on the right side of the figure. The kymograms span the entire vocalization – clearly representing the vibratory patterns and behavior, namely, phonation onset, the sustained vibration of the VFs, and phonation offset.

Figure 3.17. Results of applying temporal segmentation, motion compensation, and kymogram extraction at four different vocalized segments between frames: 4,261-5,551 (panel (a)), 42,999-43,774 (panel (b)), 84,900-86,118 (panel (c)), and 98,162-98,542 (panel (d)). The stacked frames/image sections refer to the sequence of frames and the cropped images during each vocalized segment. The stacked kymograms, at each vocalized segment, represent the multiple kymograms extracted at different cross sections of the vibrating vocal folds.

For each kymogram, the hybrid method (hereafter, k-means-ACM) was applied to segment and detect the glottal edges during vocalizations. Figure 3.18 illustrates the results of implementing the k-means-ACM algorithm on various kymograms of Figure 3.17 that were extracted from different vocalizations. As illustrated in Figure 3.18, the k-means-ACM technique was able to accurately segment the edges of the right and left VFs, shown as solid white lines in the kymograms (left panels of the figure). The glottal edges were then registered back to each HSV frame in the cropped images (see the mid panels in Figure 3.18) and the original HSV frames (shown in the right-side panels). This was done to segment the glottal area in each image; the glottal areas are shown in cyan in the mid and right-side panels.

Figure 3.18. Results of applying k-means-ACM at four different vocalized segments between frames: 4,261-5,551 (panel (a)), 42,999-43,774 (panel (b)), 84,900-86,118 (panel (c)), and 98,162-98,542 (panel (d)).

Figure 3.19 shows the results of training the proposed DNN for two different vocalizations (panels (a) and (b)). Each panel shows the result for four frames, extracted from a different vocalization. The results in Figure 3.19 are displayed for the following frame numbers: #41,658, #41,738, #41,880, and #41,986 in panel (a) and frames #104,061, #104,162, #104,311, and #104,460 in panel (b). For each frame, the original HSV frame along with the associated binary segmentation masks is depicted for both the k-means-ACM (the automated labeling tool) and the proposed DNN. The segmented glottal areas using the k-means-ACM (in cyan) and the DNN (in yellow) are overlaid on top of each other (in the right-side panels of Figure 3.19) to demonstrate their differences. The DC and the BF (i.e., F1) scores associated with evaluating the similarities between the two segmented areas are included in the figure as well. As shown in the segmented frames and by the scores, the DNN demonstrates a relatively similar performance to the k-means-ACM on most of the presented frames in accurately segmenting the glottal regions. Most of the frames in the figure show that DC > 0.80 and BF > 0.90. In addition, it can be seen that the introduced network can even outperform the k-means-ACM in some frames (e.g., frames #41,658, #104,311, and #104,460), providing smoother glottal edges.
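The region-overlap scores quoted here follow their standard definitions; the following is a minimal sketch for two binary masks of equal shape, not the dissertation's evaluation code.

```python
# Standard overlap metrics for comparing two binary segmentation masks.
import numpy as np

def dice_coefficient(a: np.ndarray, b: np.ndarray) -> float:
    """DC = 2|A n B| / (|A| + |B|); 1.0 is a perfect match.

    Two empty masks count as a perfect match, which mirrors the convention
    for frames where the glottis is fully obstructed.
    """
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 1.0 if denom == 0 else 2.0 * np.logical_and(a, b).sum() / denom

def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection over union (Jaccard index) of two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    return 1.0 if union == 0 else np.logical_and(a, b).sum() / union
```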
Figure 3.19. Results of implementing the k-means-ACM and the trained DNN; the segmented HSV frames along with the associated binary segmentation masks are shown for eight different frames extracted from two different vocalizations (a and b). (a) Frames #41,658, #41,738, #41,880, and #41,986. (b) Frames #104,061, #104,162, #104,311, and #104,460. The segmented glottal areas using the k-means-ACM and DNN are shown in cyan and yellow, respectively. The DC and the BF scores associated with the two segmented areas, overlaid on each other, are included at the lower right corner of the images.

Figure 3.20 illustrates the performance of the proposed DNN on HSV frames extracted from three different vocalized segments (panels (a)-(c)): frame numbers 40,505-41,204 (panel (a)), 98,732-99,451 (panel (b)), and 106,118-108,084 (panel (c)). These frames were selected among those that were not used for training or testing the network, showing the performance of the network for new frames. The network was implemented on the entire frame sequence of each vocalization – segmenting the glottal regions across frames. The glottal area of each frame in the sequence was computed and plotted in the figure during each vocalized segment to show how the algorithm captures the glottal area variations at the onsets and offsets. The HSV frames in Figure 3.20 (indicated by red dots on the glottal area waveforms) were selected during different behaviors of the VFs. As such, for the two vocalizations in panels (a) and (b), the segmented frames were extracted near the voicing offset and onset at 138-172.5 ms and 14-33.5 ms, respectively. The segmented frames shown in panel (c) were extracted during the sustained oscillation of the VFs between 222.5 and 229.5 ms – representing a sudden larger degree of VF abduction during the sustained vibration.

Figure 3.20. The glottal area waveform as well as five segmented HSV frames after applying the trained neural network at three different vocalized segments between frames: 40,505-41,204 (panel (a)), 98,732-99,451 (panel (b)), and 106,118-108,084 (panel (c)). The selected frames are marked by red dots on the glottal area waveforms.

Besides the visual inspection, the network was also tested against manually labeled frames (testing dataset) in order to provide a quantitative evaluation of the segmentation performance. When the proposed network was applied to the testing dataset, the results revealed promising accuracy scores and a good match between the predicted and the manually segmented glottal areas in the testing frames. As such, the results demonstrated that the mean IoU and DC of the segmented glottal region were 0.82 (STD: 0.26) and 0.88 (STD: 0.25), respectively; STD refers to the standard deviation. In addition, the contour-based evaluation metric (BF score) showed a mean value of 0.96 (STD: 0.12) in terms of detecting the glottal area boundary.

3.3.2. Deep Learning Approach: Neural Network on Monochrome HSV Data

The network that was trained and evaluated on the color HSV dataset in the previous subsection was retrained on and applied to the monochrome HSV dataset. The results of this application are presented here. A sample of the images from the training dataset is shown in Figure 3.21. The top panel shows HSV frames from the vocally normal subjects and the bottom panel depicts HSV images from the AdLD subjects. Each panel illustrates a variety of image qualities and different gestures of the VFs during several phonation tasks in running speech. The images display the VFs in different spatial locations, orientations, scales, and brightness conditions.
Also, various states of VF behavior are shown, e.g., during vibration, adduction/abduction at various degrees, and when the VFs are not vibrating. Additionally, other images show partially and fully obstructed views of the VFs due to the movement of the epiglottis, the constrictions of the arytenoid cartilages, and when the VFs fall outside the endoscope view.

Figure 3.21. A sample of images considered in the training of the developed deep neural network. Top and bottom panels show HSV frames from the vocally normal subjects and AdLD patients (referred to as AdSD in the figure), respectively.

After building the training dataset and manually segmenting the images using the labeling tool, an independent test set was created from new images, different from the training ones, from the HSV recording of an AdLD subject. A sample of the testing HSV frames is depicted in Figure 3.22. The figure includes two different sets of frames: individual testing images (top panel) and a set of consecutive frames. The top panel displays random HSV testing frames showing both a clear view of the VFs with different gestures and a partially/fully covered view of the VFs with different obstructions. Additionally, in the bottom panel, a sequence of HSV frames is illustrated that shows several cycles of VF vibration. The short sequence includes 27 images, selected with a step size of 12 frames from the HSV recording. The sequence displays a partial obstruction of the VFs by the right arytenoid cartilage. Due to the laryngeal maneuvers, the right arytenoid cartilage came very close to the endoscope in these frames, displaying a bright area. It can be noticed that during the oscillation of the VFs, the arytenoids are also moving, which is clear when comparing the first and last images in the sequence. That is, in the first frame, most of the VF tissues are covered by the right arytenoid cartilage, whereas the VFs are almost unobstructed in the last frame.

Figure 3.22. A sample of random individual frames (top panel) and a short sequence of HSV frames (bottom panel), selected from the test dataset to evaluate the developed neural network.

The performance of the developed DNN on the testing dataset is shown in Figure 3.23. The figure illustrates the results of applying the developed DNN to 14 testing frames. Some of these frames are selected from Figure 3.22 (top panel), displaying different VF configurations with a relatively open glottal area and various image qualities, in addition to a few testing frames (not in Figure 3.22) for a better segmentation evaluation. The testing frames are divided into two panels for clarity. For each frame, the original HSV image, before segmentation, is displayed alongside two binary segmentation masks: one resulting from manual segmentation of the glottal area (ground truth labeling) and the other resulting from the automated segmentation by the DNN. Additionally, a zoomed-in view of the segmented glottal area is shown in the last column of each test image, where the manually detected area (in yellow) and the automatically segmented one (in blue) are overlaid on top of each other. Moreover, the DC values, which allow for a quantitative measurement of the similarity between the two detected glottal regions, are included inside each segmented frame. Hence, both subjective and quantitative evaluations of each frame are provided in the figure. As can be seen from the visual information in the figure, the automated approach demonstrates a large degree of match with the manual segmentations.
Furthermore, this high match is reflected in the DC scores, since most of the frames achieved DC values above 0.85. It can be noticed that, among the 14 images in Figure 3.23, two images show an obstructed view of the VFs and glottal area. In these two images, the automated DNN results in a completely black mask, similar to the manual mask, with a DC of 1 (perfect match) – indicating the high ability of the network to recognize the absence of the glottis in these challenging images.

Figure 3.23. Performance of applying the developed neural network to segment the glottal area in a sample of 14 testing frames. The results of the different testing frames are displayed in two panels (seven per panel). The original HSV images are illustrated alongside the corresponding segmentation masks of the manual (second column in each panel) and automated analysis (third column in each panel). The segmented areas of each testing image are shown in the last column of each frame, where the manual (in yellow) and the automated (in blue) segmented areas are overlaid on top of each other.

In addition to the evaluation of the DNN shown in the previous figure (Figure 3.23) on individual frames, the DNN was also assessed by applying it to the sequence of testing frames presented in Figure 3.22 (bottom panel). This sequence was a subset of the original testing set (1,000 frames) on which the performance of the neural network was demonstrated in Figure 3.24. This subset included 166 consecutive HSV frames, which contained about nine abduction-adduction cycles of VF vibration. Among the 166 images, the results for 32 frames are displayed in the top panel of Figure 3.24. These sample frames were selected such that the glottal area is relatively large, to facilitate the reader's visual comparison between the manual and the automated segmentation performances. As such, on each sample image in the top panel, the manually segmented glottal region and the automatically segmented one are overlaid on top of each other to demonstrate the discrepancy/match between them. The manual annotation is highlighted in yellow while the DNN segmentation is depicted in blue. Also, the corresponding time (in ms) of each frame is included on each consecutive frame in yellow font. As shown, there is considerable agreement between the results of the developed DNN and the manual labeling in the displayed frames.

Figure 3.24. Results of applying the developed automated approach to a testing sequence of HSV frames (166 consecutive images). The top panel shows a comparison between the manual (in yellow) versus automated (in blue) glottal area segmentation on 32 frames, selected from the sequence; the corresponding time of each frame is shown in yellow font. The bottom panel illustrates the segmented glottal area variation across the 166-frame sequence (in ms) when computed via the manual (solid green line) and the automated method (dotted red line). DC, IoU, F1, and accuracy scores are also included in the bottom panel.

As can be seen in Figure 3.24, the glottal area waveform of the testing sequence is plotted in the bottom panel. The plot demonstrates the change in the segmented glottal area (measured in pixels) across the 166-frame sequence (measured in ms) as a result of using the manual analysis (green line) and the automated DNN (red dotted line).
Both glottal area waveforms match well and exhibit a similar behavior across most of the sequence frames – demonstrating the promising performance of the developed model. Also, there is a slight discrepancy between the manually and automatically computed areas during a few instances when the VFs are fully abducted. During these instances, the DNN slightly overestimates the glottal area. In addition to the subjective evaluation, a quantitative assessment of the DNN was also conducted. The DC, IoU, F1, and accuracy scores were computed to quantitatively assess the segmentation quality of the DNN on the testing sequence. As shown, the mean DC, IoU, F1, and accuracy values are 0.84, 0.79, 0.98, and 0.88, respectively – demonstrating the high segmentation quality in terms of both the detected glottal region and its edges.

The developed automated DNN was also tested on the entire testing dataset, which consisted of 1,000 HSV frames, in order to provide a quantitative evaluation of its segmentation performance. When the DNN was applied to the entire set of testing images, including both individual frames and short sequences of VF vibration in consecutive frames, the results revealed high segmentation accuracy and a good match between the estimated glottal region and the manually segmented one. The resulting mean scores of IoU, DC, and accuracy were 0.81, 0.86, and 0.89, respectively, revealing a high similarity between the manual and automated analyses as well as the promising performance of the developed DNN in detecting the glottal region in the testing frames. In addition, the automated model demonstrated high precision in estimating the glottal boundaries and VF edges with a mean F1 score of 0.93.

In addition to the results presented above in terms of the performance and validation of the proposed tool in detecting the glottal area (using, e.g., IoU) and its boundary/edges (using, e.g., F1), another method was developed for left and right VF detection. The following results illustrate the performance of the developed tool in detecting the left and right VF edges. Figure 3.25 shows three HSV frames. The first frame demonstrates the precise detection of the midline points in cyan along with the fitted second-order curve in yellow. The second image exhibits the HSV frame including only the midline for clarity. The third image represents the detected left and right VF edges plotted with the midline, showing how successfully the computed midline captures and matches the shape of the glottal area.

Figure 3.25. A schematic diagram of the detected glottal line on an HSV frame showing the midline in yellow along with the detected left and right vocal fold edges (in red and blue, respectively).

Figure 3.26 shows the results of the DNN when applied to individual HSV frames showing different configurations and gestures of the VFs in terms of their location, size, image quality, and partial obstruction. Seven frames are presented in the figure, organized in five rows. Each row shows different segmentation results except the first one, which depicts the original HSV frames. The second and third rows exhibit the glottal area segmentation results in the form of a segmentation mask and the original frames with the glottal area highlighted in red. The last two rows show the detection of the midline in yellow and the left (in red) and right (in green) VF edges.
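A hedged sketch of this midline construction follows: midpoints between the detected edges are collected row by row from the binary glottal mask, and a second-order polynomial is fitted through them, as described above. The helper name and the mask orientation are illustrative assumptions.

```python
# Sketch: fit a second-order midline through row-wise glottal midpoints,
# assuming `mask` is a binary glottal-area mask with the glottis roughly
# vertical in the image.
import numpy as np

def fit_glottal_midline(mask: np.ndarray):
    """Fit x = p(y) (second-order) through the row-wise glottal midpoints."""
    ys, mid_xs = [], []
    for y in range(mask.shape[0]):
        xs = np.nonzero(mask[y])[0]
        if xs.size:  # this row intersects the glottis
            ys.append(y)
            mid_xs.append(0.5 * (xs.min() + xs.max()))
    coeffs = np.polyfit(ys, mid_xs, deg=2)  # second-order midline curve
    return np.poly1d(coeffs), np.array(ys)

# Boundary pixels left of the fitted midline belong to one fold and those
# right of it to the other, splitting the contour into left/right VF edges.
```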
As can be seen, the developed tool to detect the left and right VF edges was able to accurately capture the various shapes of the glottal area.

Figure 3.26. Results of the automated segmentation on seven different HSV frames with different vocal fold gestures. The five rows shown in the figure represent the original HSV frames, segmentation masks (glottal area in white), segmented glottal area (in red), the glottal midline (in yellow), and the left and right vocal fold edges (highlighted in red and green, respectively).

Another figure is included in order to demonstrate the performance of the developed tool for midline and VF edge detection. Figure 3.27 includes results related to the automated segmentation in terms of detecting the glottal area and the VF edges in a sequence of around 760 frames. The top panel illustrates the segmented frames, displaying the segmented glottal area in red and the VF edges in blue and red for the right and left VFs, respectively (along with the detected glottal midline in yellow). The bottom panel depicts the extracted glottal area waveform (measured in pixels), including the timestamps where the frames were extracted, shown as red dots. The timestamps are also included on the frame images in the figure in white. As shown in the figure, the developed tool was able to capture the glottal area as well as the left and right VF edges at different temporal locations within the presented HSV sequence. The sequence represents the high-quality performance of the DNN tool in capturing the detailed shapes and sizes of the glottal area – including when the VFs are widely open as well as when the VFs are vibrating. Also, the detected glottal midline was able to follow the different gestures of the glottal area.

Figure 3.27. Automatically segmented glottal area and vocal fold edges in a sequence of 12 HSV frames depicted in the top panel. The segmented area, midline, left and right vocal fold edges are highlighted in red, yellow, red, and blue, respectively. The bottom panel illustrates the extracted glottal area waveform (measured in pixels). The timestamps where the frames were extracted are shown as red dots and included on the frame images.

3.4. Study IV: Automated Measurements of Glottal Attack and Offset Time

In this section, the deep learning tool introduced earlier was implemented and was able to accurately determine the glottal area across the HSV frames during the different vocalized segments in each monochrome video of each subject. Accordingly, the edges of the VFs were precisely determined based on the segmented glottal area/boundary. This accurate segmentation enabled the successful computation of the GAT and GOT measurements corresponding to the onset and offset of each vocalization, respectively. This section demonstrates the results associated with developing the GAT and GOT measures. Results from applying the developed deep learning tool to a set of sequential frames captured during a phonation onset are presented in Figure 3.28. The top panel displays a subset of 12 HSV frames selected from various timestamps throughout the sequence and segmented using the automated tool. The segmented glottal area is shown in green, and the corresponding timestamps are indicated on each segmented image in yellow font. As can be seen, the segmented frames demonstrate the accurate detection of the glottal area, capturing the complex details of the glottis.
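The GAW referenced below can be read directly off the per-frame segmentation masks. The following is a minimal sketch, assuming the masks are stacked in a boolean array and a 4,000 fps recording rate (an assumed value, not stated here):

```python
# Sketch: derive the glottal area waveform (GAW) from per-frame masks.
# `masks` has shape (n_frames, height, width); `fps` is assumed.
import numpy as np

def glottal_area_waveform(masks: np.ndarray, fps: float = 4000.0):
    """Return (time_ms, area_px): glottal area in pixels for each frame."""
    area_px = masks.reshape(masks.shape[0], -1).sum(axis=1)
    time_ms = np.arange(masks.shape[0]) / fps * 1000.0
    return time_ms, area_px
```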
The timestamps at which the segmented frames were selected are marked by red dots on the generated GAW, which is illustrated in the bottom panel to help with temporal referencing. The automatically generated GAW in the bottom panel provides an accurate visual representation of the change in the glottal area (in pixels) and the VF behavior during the initiation of phonation.

Figure 3.28. Results of applying the developed deep learning tool to a sequence of frames during phonation onset. The top panel shows automated glottal area segmentation (highlighted in green) on 12 HSV frames, selected from different timestamps within the sequence. The corresponding time is indicated on each frame in yellow font. The bottom panel illustrates the segmented glottal area variation (measured in pixels) across the sequence during the onset of phonation.

Results of the automated measurement of the GAT during the phonation onset of a vocalization in running speech are shown in Figure 3.29. The automatically generated GAW (normalized) and the average medial glottal contact waveform are displayed in the top two panels, in red and blue, respectively. The energy contours of the GAW and the contact waveform are also illustrated in the figure below the associated waveforms in the same colors. In the bottom panel, the cross-correlation results are shown as a green curve. On the cross-correlation graph, the automatically computed GAT is indicated by the time difference between the two horizontal dashed lines. As shown in the figure, the automated algorithm was able to detect the energy rise of both the GAW and the contact waveform and, accordingly, compute the delay between the two energy lines. In this example, the automated algorithm reveals a delay time of 14.75 ms – the GAT value during this particular phonation onset.

Figure 3.29. Results of the automated measurement of the glottal attack time (GAT) during a phonation onset. The top two panels show the magnitudes of the normalized glottal area waveform (GAW) in red and the average medial glottal contact waveform in blue. The bottom two panels illustrate the energy contours corresponding to each waveform along with the outcome of the cross-correlation in green. The measured GAT is marked on the cross-correlation graph as the time delay between the two horizontal dashed lines (14.75 ms).

The outcome of applying the DNN tool to a sequence of HSV frames during the offset of phonation is presented in Figure 3.30. The figure has the same formatting as Figure 3.28 – including two panels (the top panel showing a sample of 12 segmented frames within the sequence and the bottom panel showing the generated GAW during phonation offset). As shown in the top panel, although the segmented frames span a short period of time within running speech, different image qualities and altered configurations/views/sizes of the VFs can be observed across the different frames. Despite that, the automated segmentation tool was able to precisely detect the glottal area regardless of these variations. Moreover, the segmented GAW, shown in the figure, accurately illustrates the dynamic characteristics of the glottal area and VF behavior during the offset of phonation. The plotted GAW accurately captures not only the oscillation during the steady-state vibration portion but also the small-amplitude oscillations that existed toward the end of the phonation offset.
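The delay estimation used for both GAT and GOT can be sketched as follows: short-time energy contours of the GAW and the contact waveform are cross-correlated, and the lag at the correlation peak gives the attack (or offset) time. The energy window length and the normalization are assumed details, not the dissertation's exact implementation.

```python
# Sketch: estimate GAT/GOT as the cross-correlation lag between the
# energy contours of the GAW and the medial glottal contact waveform.
import numpy as np

def short_time_energy(x: np.ndarray, win: int = 50) -> np.ndarray:
    """Moving-average energy contour of a zero-mean, normalized signal."""
    x = (x - x.mean()) / (np.abs(x).max() + 1e-12)
    return np.convolve(x ** 2, np.ones(win) / win, mode="same")

def attack_time_ms(gaw: np.ndarray, contact: np.ndarray,
                   fps: float = 4000.0) -> float:
    """Lag (ms) between the GAW and contact energy rises."""
    e_gaw = short_time_energy(gaw)
    e_con = short_time_energy(contact)
    xcorr = np.correlate(e_gaw - e_gaw.mean(),
                         e_con - e_con.mean(), mode="full")
    lag = np.argmax(xcorr) - (len(e_con) - 1)  # samples between the rises
    return abs(lag) / fps * 1000.0
```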
Figure 3.30. Results of applying the developed deep learning tool to a sequence of frames during phonation offset. The top panel shows automated glottal area segmentation (highlighted in green) on 12 HSV frames, selected from different timestamps within the sequence. The corresponding time is indicated on each frame in yellow font. The bottom panel illustrates the segmented glottal area variation (measured in pixels) across the sequence during the offset of phonation.

Figure 3.31 depicts the results of the automated GOT measurement during a phonation offset selected from a running speech sample. Similar to Figure 3.29, the top two panels display the normalized GAW and the average medial contact waveform, automatically generated using the segmentation tool and represented in red and blue, respectively. The derived energy contours of each waveform are illustrated as well. The two energy waveforms accurately represent the drop in oscillation energy corresponding to the damping motion of the VFs at the end of phonation. As can be seen in the figure, there is a time lag between the two energy lines, which can be precisely observed in the cross-correlation graph. This lag is shown between the dashed lines, indicating the offset time between the two waveforms, which is 28 ms in this phonation offset sample.

Figure 3.31. Results of the automated measurement of the GOT during a phonation offset. The top two panels show the magnitudes of the normalized glottal area waveform (GAW) in red and the average medial glottal contact waveform in blue. The bottom two panels illustrate the energy contours corresponding to each waveform along with the outcome of the cross-correlation in green. The measured GOT is marked on the cross-correlation graph as the time delay between the two horizontal dashed lines (28 ms).

The developed automated method for computing GAT and GOT was first validated against the visual analysis. This was done by applying the automated method to the HSV recordings of all the subjects and generating the GAT and GOT values during the different phonation onsets and offsets in each recording. The mean values of GAT and GOT during each recording were obtained using the automated algorithm in addition to the corresponding visual measurements. The values computed from the two methods were compared. The results of this comparative analysis are shown in Figure 3.32 for both the vocally normal subjects (N) and the AdLD patients. As can be seen in the figure, the solid blue and green lines indicate the automated measurements of GAT and GOT, respectively, whereas the dashed lines indicate the visual measurements. Overall, the automated measurements precisely detect the GAT and GOT values – showing a close alignment and agreement with the visual analysis in most of the subjects with minimal differences. It can also be observed that the automated measurements of GAT and GOT were more accurate in the vocally normal subjects than in the AdLD patients – showing a marginally better agreement with the manual measures. In addition, the overall automated computation of the GAT showed slightly more accurate values compared to GOT.

Figure 3.32. Results of the comparison between the automated measurements and the visual measurements of the GAT and GOT, both measured in ms, for the vocally normal participants (N) and the AdLD. The automated measurements of GAT and GOT are shown in solid blue and green bars, respectively.
The manual measurements of GAT and GOT are illustrated in dashed blue and green bars.

Overall, the analytical comparison revealed a minimal average discrepancy between the automated and the manual measurements. The average difference in the mean GAT between the automated and manual analyses was 1.6 ms across all the recordings. The mean GOT showed a slightly higher average difference of 2.7 ms between the automated and visual measurements. Moreover, an additional quantitative analysis was carried out between the automated and the visual analyses to compare the magnitudes of the GAT and GOT at the level of the vocalized segments across the various subjects. The comprehensive statistical analysis demonstrated a strong and significant correlation between the automated and manual measurements for both GAT and GOT. A high correlation coefficient of Pearson r = 0.93 was found for GAT, suggesting a significant level of agreement between the two measurement methods. Likewise, the GOT measurements demonstrated a strong correlation coefficient of r = 0.91. An independent t-test was performed using the various vocalized segments in order to investigate to what degree the measurements of the automated and manual analyses differ for GAT and GOT. It was observed that there was no statistically significant difference between the automated and visual measurements – indicating a high level of similarity between the two methods of measurement. That is, the resulting p-values from the t-tests conducted between the automated and manual measurements were 0.86 for the GATs and 0.77 for the GOTs – referring to a minimal statistical discrepancy between the manual and the automated approaches.

After validating the automated algorithm against the visual measurements, the automated method was used to compute the GAT and GOT values for all the subjects so that a comparison could be made between the vocally normal subjects and the AdLD patients. Figure 3.33 and Figure 3.34 provide the mean values (shown in blue) along with the STD (shown in light orange) of the GAT and GOT measurements, respectively, for each normal control and AdLD participant. As shown in Figure 3.33, the mean GAT values of almost all the AdLD patients were higher than those of the vocally normal individuals, with a noticeable difference. In addition, the figure shows a wider spread in the mean GAT measurements of the AdLD individuals (ranging from 14.8 to 22.9 ms) than in the normal controls, whose values fluctuated minimally (14.9 – 15.2 ms). Also, as can be seen from the figure, the STD shows high values in AdLD in comparison with the vocally normal group – reflecting the high variability observed in AdLD. Overall, the AdLD group had higher average GAT values (18.95 ms) than the vocally normal group (14.65 ms). The statistical analysis revealed that this difference between the two groups was statistically significant (p-value < 0.001). Conversely, the mean GOT values demonstrated only a slight increase in the AdLD group (varying between 23.8 and 32.1 ms) relative to the normal controls (ranging from 25.3 to 31.4 ms). Accordingly, the mean GOT value across all the AdLD patients, 28.9 ms, did not demonstrate a statistically significant difference (p-value = 0.2) compared to the vocally normal controls, with an average value of 27.3 ms. Similar to the variability found in the GAT results, the GOT demonstrated larger STD values in the AdLD group compared to the normal controls.
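The validation statistics above can be reproduced with scipy; a minimal sketch, assuming the paired per-segment automated and manual values are available as two equal-length arrays (illustrative names):

```python
# Sketch: correlation and t-test between automated and manual measures,
# mirroring the Pearson r and independent t-test reported in the text.
import numpy as np
from scipy import stats

def compare_methods(auto_vals: np.ndarray, manual_vals: np.ndarray) -> dict:
    """Pearson correlation and independent t-test between the two methods."""
    r, r_p = stats.pearsonr(auto_vals, manual_vals)
    t, t_p = stats.ttest_ind(auto_vals, manual_vals)
    return {"pearson_r": r, "pearson_p": r_p, "t_stat": t, "t_p": t_p}

# A t-test p-value near 0.86 (GAT) or 0.77 (GOT), as reported above, would
# indicate no statistically significant difference between the analyses.
```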
Figure 3.33. Results of the automated measurements of GAT (measured in ms) between the vocally normal participants (N) and the AdLD. The mean and STD values are shown in blue and light orange bars, respectively.

Figure 3.34. Results of the automated measurements of GOT (measured in ms) between the vocally normal participants (N) and AdLD. The mean and STD values are shown in blue and light orange bars, respectively.

3.5. Study V: Lumped Modeling

This section introduces both the simulation and the optimization results obtained using the developed single-mass lumped model of the VFs. The model was built using the one-mass approach such that it can generate oscillatory behavior comparable to that observed in the HSV data. After building the model, the experimental data were extracted using the previously introduced segmentation tool. These experimental data were represented by the automatically generated glottal area waveform. By matching the experimentally extracted glottal area waveform with the simulated one, the optimization was conducted in order to infer biomechanical measurements of the dynamics of the VFs. For that purpose, a vocalized segment extracted from the HSV of a vocally normal participant during running speech – showing VF vibrations – was considered for the analysis. The results corresponding to the modeling simulation and optimization are described in this section.

The simulation results, presented in the following two figures, were based on the input parameters chosen in the second chapter (Methodological Approach). The model constants Ag0, µ, ρ, l, and d were set to 0.05 cm2, 1.86×10−5 g/(cm2·s), 1.2×10−3 g/cm3, 1.4 cm, and 0.3 cm, respectively. The damping coefficient was set to zero during the open phase and was given a value of 500 g/s (c’) only during the closure phase of the VFs. The values of m and k were set to 0.24 g and 5,000 g/s2. Also, the subglottal pressure, which was treated as another variable here, was parameterized during the closure phase by PSmax = 10,000 dyn/cm2 and tc = 2.75 ms; during the opening phase of the VFs, the subglottal pressure was given a value of 8,000 dyn/cm2. Based on these model input parameters and constants, the simulation results were generated. Figure 3.35 shows the resulting glottal area waveform measured in cm2 as a function of time. The figure illustrates the behavior of the VF vibrations and how the area between the VFs changes over time. Also, the closure phase is represented in this figure as a dashed line below zero, showing how the high value of viscous damping during the closure phase acts to greatly attenuate the momentum and damp the motion of the VFs due to their contact.

Figure 3.35. Simulation results of the theoretical glottal area waveform as a function of time. The dashed line shows the movement of the vocal folds during the closure phase.

Figure 3.36 illustrates several simulation results corresponding to different model parameters. The figure depicts the change in the mass displacement, glottal air volume flowrate, subglottal pressure, elastic restoring force induced by the spring, total external force acting on the vibrating mass, and the damping coefficient during the closure phase. Each subplot shows the variation of one variable over time during the simulation.
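A schematic sketch of such a one-mass simulation is given below, using the constants listed above; the linear area law, the contact condition, and the initial state are simplifying assumptions rather than the dissertation's exact formulation (CGS units throughout).

```python
# Schematic one-mass VF model with phase-dependent damping and subglottal
# pressure; an illustrative sketch, not the dissertation's equations.
import numpy as np
from scipy.integrate import solve_ivp

m, k = 0.24, 5000.0                 # mass (g), spring stiffness (g/s^2)
c_closed = 500.0                    # viscous damping during closure (g/s)
Ag0, l, d = 0.05, 1.4, 0.3          # rest area (cm^2), VF length/depth (cm)
Ps_open, Ps_max = 8000.0, 10000.0   # subglottal pressure (dyn/cm^2)

def glottal_area(x):
    return Ag0 + 2.0 * l * x        # assumed linear area law; <= 0 is contact

def rhs(t, y):
    x, v = y                        # displacement (cm), velocity (cm/s)
    closed = glottal_area(x) <= 0.0
    c = c_closed if closed else 0.0       # damping only while VFs touch
    Ps = Ps_max if closed else Ps_open    # pressure build-up during closure
    force = Ps * l * d                    # pressure on the medial VF surface
    return [v, (force - c * v - k * x) / m]

# Start slightly closed so the contact phase is exercised from the outset.
sol = solve_ivp(rhs, (0.0, 0.2), y0=[-0.05, 0.0], max_step=1e-5)
area = glottal_area(sol.y[0])       # negative values trace the closure phase,
                                    # as the dashed line does in Figure 3.35
```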
The displacement subplot refers to the spatial movement of the VF mass, measured in cm. The glottal air flowrate exhibits a behavior similar to the displacement – demonstrating the change in the airflow corresponding to the abduction (maximum flowrate) and adduction (no flowrate) of the VFs. Also, the subglottal pressure and the damping coefficient plots reflect the fluctuation of their values during the closure (maximum build-up pressure and a large value of viscous damping) and the opening of the VFs (typical value of the subglottal pressure and zero damping). In addition, the behavior of the total external force acting upon the VF mass during vibration is illustrated in the figure, as well as the characteristics of the elastic spring force. Overall, as can be seen from Figures 3.35 and 3.36, the model can capture the oscillatory behavior of the VFs and even represent some degree of nonlinearity in the numerical solution. In addition, the contact between the VFs was also incorporated into the model in the form of a closure time and is shown in the simulation plots.

Figure 3.36. Simulation results of the mass displacement, glottal air volume flowrate, subglottal pressure, elastic restoring force, total external force acting on the vibrating mass, and the damping coefficient during the closure phase.

After simulating the proposed lumped model, it was optimized against the experimental data. The optimization was conducted using the simulated glottal area waveform generated from the model along with the glottal area waveform automatically extracted from the HSV data using a vocalized segment of the VF vibration. The optimization procedure was carried out using initial values and constrained ranges associated with the optimization vector q (α, m (g), k (g/s2), c’ (g/s), tc (s), PSmax (dyn/cm2)) based on the optimization parameters mentioned previously in the second chapter. The initial values were chosen such that q = {120, 0.1 g, 40,000 g/s2, 400 g/s, 0.005 s, 15,000 dyn/cm2}, and the constrained ranges associated with each optimization parameter were set to {110 – 130, 0.04 – 0.30 g, 10,000 – 60,000 g/s2, 300 – 800 g/s, 0.003 – 0.008 s, 8,000 – 20,000 dyn/cm2}, respectively. After running the optimization technique (particle swarm method), the objective function was computed at each iteration. Figure 3.37 depicts the convergence plot associated with the optimization approach, showing the objective function value at each iteration during the optimization process. The figure shows the normalized error (related to the objective function) versus the iteration number. As can be seen, the objective function value decreases and converges to a minimum value of 0.04155, exhibiting a plateau-like behavior after a specific number of iterations (around 60). Hence, the final objective function returned a normalized optimization error of 0.04155 between the theoretical/simulated glottal area waveform and the experimental glottal area waveform.

Figure 3.37. Results of the objective function value (normalized error) at each iteration during the optimization process.

Upon the successful completion of the optimization process, the optimization method returned the optimized parameters associated with the model – indicating that an optimal combination was achieved after the iterative optimization procedure refined the solution.
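A minimal particle-swarm sketch of this fitting procedure is shown below, under the stated bounds; simulate_gaw(q) is a placeholder for a run of the lumped model with parameter vector q, and the inertia/acceleration constants (w, c1, c2) are assumed values.

```python
# Minimal particle swarm optimization (PSO) sketch for fitting the model
# GAW to the experimental GAW under the stated parameter bounds.
import numpy as np

rng = np.random.default_rng(0)
lo = np.array([110, 0.04, 10_000, 300, 0.003,  8_000])   # lower bounds on q
hi = np.array([130, 0.30, 60_000, 800, 0.008, 20_000])   # (alpha, m, k, c', tc, PSmax)

def objective(q, gaw_exp):
    gaw_sim = simulate_gaw(q)  # placeholder: run the lumped model with q
    return np.linalg.norm(gaw_sim - gaw_exp) / np.linalg.norm(gaw_exp)

def pso(gaw_exp, n_particles=30, n_iters=100, w=0.7, c1=1.5, c2=1.5):
    pos = rng.uniform(lo, hi, size=(n_particles, lo.size))
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_f = np.array([objective(p, gaw_exp) for p in pos])
    gbest = pbest[np.argmin(pbest_f)].copy()
    for _ in range(n_iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, lo, hi)       # keep particles inside bounds
        f = np.array([objective(p, gaw_exp) for p in pos])
        better = f < pbest_f
        pbest[better], pbest_f[better] = pos[better], f[better]
        gbest = pbest[np.argmin(pbest_f)].copy()
    return gbest, pbest_f.min()

# The biomechanical indices quoted below are consistent with k/m
# (elasticity, 1/s^2) and c'/m (viscosity, 1/s) from the optimized q.
```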
The optimization procedure resulted in a set of optimized values of 124.67, 0.0501 g, 11,787 g/s2, 414.69 g/s, 0.0030005 s, and 8,571.3 dyn/cm2 corresponding to the optimization parameters: the scaling factor, the mass, the spring stiffness, the damping coefficient during closure, the closure time, and the maximum subglottal pressure, respectively. Based on the optimized parameters, the biomechanical measure of the elasticity index was 235,269 1/s2 and the viscosity index was 8,277 1/s. The obtained optimized parameters were used as inputs to the developed lumped model in order to visualize the resulting simulation. After incorporating the optimized parameters, the simulation of the VF movement was generated. Figure 3.38 illustrates the simulation results of the optimized model. The figure plots the computed glottal area waveform as a function of frame numbers (converted from time). The displayed theoretical glottal area waveform was multiplied by the optimized scaling factor in order to match the experimental glottal area (which was measured in pixels). The figure shows the oscillatory behavior of the mass (VF) including the simulated closure time.

Figure 3.38. Simulation results of the theoretical glottal area waveform as a function of time using the optimized parameters.

In addition, the behavior of the optimized subglottal pressure during the simulated VF vibration is illustrated in Figure 3.39. The figure demonstrates the corresponding changes in the subglottal pressure as a function of time. As can be seen in the figure, the pressure oscillates between the typical subglottal pressure value (8,000 dyn/cm2) and the maximum pressure, which was the outcome of the optimization procedure (8,571 dyn/cm2). In addition, the plot exhibits the adduction time (the optimized closure time tc) during which the subglottal pressure was maintained at the maximum optimized pressure (referring to the build-up pressure during the closure of the VFs).

Figure 3.39. Results of the change in the parameterized subglottal pressure during the vibration of the vocal folds as a function of time using the optimized parameters. Pmax refers to the maximum optimized build-up pressure, and tc indicates the optimized closure time.

In order to compare the optimized theoretical glottal area waveform with the experimentally generated one, both waveforms were plotted in the same figure for better visualization, as can be seen in Figure 3.40. The simulated glottal area (resulting from the optimized parameters) is plotted in blue along with the experimental glottal area, shown as a red dotted line. The simulated movement of the VFs during closure is depicted by the dotted blue line – referring to the negative area. As shown, the figure reveals a relatively similar behavior between the simulation and the experimental data, where the optimized simulation model captures the main vibratory characteristics of the experimentally extracted glottal area.

Figure 3.40. Optimization results between the simulated (in blue) and the experimental glottal area (in red) waveforms – overlaid on top of each other. The dotted blue line shows the movement of the simulated vocal folds during the closure time.

CHAPTER 4: DISCUSSION
4.1. Study I: Automated Detection of Vocal Fold Image Obstructions

This study was intended to fulfill Aim 1 by addressing the following: Q1.1: Can DNN accurately classify HSV frames in AdLD during connected speech regardless of the excessive laryngeal maneuvers? H1.1: DNN can accurately classify HSV frames based on whether these frames display an obstructed view of the VFs. Q1.2: Does the presence of AdLD affect the durations over which VFs are visually obstructed in HSV during running speech? H1.2: The duration of the visual obstruction of the VFs will be longer in AdLD versus normal controls during connected speech.

A deep learning technique was successfully developed as a classifier to automatically detect VF obstruction in HSV data recorded during connected speech. The introduced automated framework was developed and implemented based on HSV recordings of vocally normal individuals and patients with AdLD. A robust training dataset was created from a sample of visually labeled HSV frames displaying various obstructed and unobstructed VF views. The deep neural network was built using a CNN and was successfully trained and validated on the dataset to classify HSV frames into two classes: frames with or without VF obstruction. The overall visual evaluation of the performance of the trained network showed a high capability in recognizing VF obstruction in HSV video frames. The results of implementing the trained CNN on a testing dataset, created from an HSV dataset on which the network was not trained, demonstrated a high classification capability of the network in detecting different obstructions of the VFs with an overall accuracy of 94.18%. This indicated how flexible and general the presented automated approach was toward classifying the VF obstruction of new HSV data from different participants, with a high sensitivity and specificity of 97.24% and 91.11%, respectively. The developed network also returned a high F1-score of 0.94 when applied to the testing dataset. This high F1-score revealed the high precision of the developed framework in classifying the different obstructed views of the VFs in the HSV frames.

A robustness evaluation was done to assess the performance of the trained CNN-based classifier through a thorough comparison between the results of the automated method and the manual analysis of two HSV recordings. The two videos were selected from two different participants: one from a vocally normal person and another from a patient with AdLD. The comparison was conducted on the entirety of the two HSV videos (consisting of 264,400 and 399,384 frames; over half a million HSV images in total). This massive number of images (663,784 frames) was manually classified by a rater to compare the developed CNN against the visual observation of the rater. The results of the comparison revealed a promising performance of the automated classifier against the visual analysis. The percentage of the total number of frames in the two HSV videos that showed an obstructed view of the VFs was almost the same between the manual observation and the automated analysis – 14.56% versus 14.75% in the vocally normal individual’s HSV video and 24.18% versus 24.42% in the patient’s HSV video, respectively. As found in this study, the patient showed a higher number of frames with an obstructed view of the VFs, which can be explained by the excessive laryngeal spasms in the AdLD patient.
The high robustness of the developed technique in classifying laryngeal HSV data against the enormous number of manually labeled frames is an apparent advantage over previously introduced classifiers. This is because the previous deep learning models were tested on considerably smaller image sets of around 720 [136, 137, 138], 1,176 [140], and 5,234 laryngoscopic images [139] – far fewer than the number of images used for assessing the introduced automated technique. The comparison between the manual and automated classification of each individual frame was further extended over the two HSV recordings by generating confusion matrices. Different metrics were used, based on the resulting confusion matrices, to provide a detailed evaluation of the developed CNN's performance in this comparison: sensitivity, specificity, precision, F1-score, and accuracy for detecting the obstructed/unobstructed view of the VFs in the HSV frames. Overall, the proposed deep learning approach demonstrated high robustness when applied to the two HSV videos against the visual observation, with overall accuracies above 92%. The automated technique showed a better performance, with higher accuracy, when identifying the VFs in the frames than when recognizing an obstructed view of the VFs. The reason for this was that the VFs can be obstructed in different ways and configurations in connected speech – imposing a more challenging view for the automated approach. Furthermore, the developed network showed a higher overall accuracy in the vocally normal participant’s recording (97.23%) than in the patient’s recording (92.38%). This is because, in running speech, patients with AdLD demonstrate more laryngeal maneuvers and more complex VF obstructions than vocally normal persons – imposing more challenging conditions for the developed technique to maintain a high classification accuracy. These complex obstructions could be due to, for instance, the epiglottis, the left/right arytenoid cartilages, laryngeal constriction, false VFs, or any combination of these. In addition, the developed network had a few challenges in detecting the frames with partial VF obstructions. This is because, in the manually labeled data used for training, it was challenging for the rater to exactly determine partial VF obstruction – the convention being that if more than 50% of the VFs were obstructed, the frame was classified as a frame with VF obstruction.

This study is the first work that developed and applied a fully automated deep learning approach to detect VF obstructions in HSV data during connected speech. To the best of our knowledge, there are no other studies in the literature that used a state-of-the-art deep learning technique as a classifier for frame selection in HSV recordings; instead, several studies applied deep learning schemes to laryngoscopic images [139, 140]. The HSV data used in the present study in running speech exhibit lower image quality along with excessive laryngeal movements and significant changes in glottal posture, which impose considerable challenges upon applying deep learning approaches compared to high-quality laryngoscopic images. In spite of these challenges, the developed approach was highly successful as a classifier in automatically selecting HSV frames based on the presence and absence of the VFs.
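For reference, the confusion-matrix metrics used throughout this comparison follow directly from the four counts; a minimal sketch, assuming obstructed frames are treated as the positive class:

```python
# Metrics derived from a binary confusion matrix (obstructed vs.
# unobstructed frames); the counts would come from comparing the CNN
# labels against the rater's labels.
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    sensitivity = tp / (tp + fn)     # recall on obstructed frames
    specificity = tn / (tn + fp)     # recall on unobstructed frames
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "F1": f1, "accuracy": accuracy}
```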
The introduced technique achieved an overall classification accuracy of 94.18% on the testing dataset, which is comparable to the accuracies reported in the literature (86-96%) using the better-quality, less challenging laryngoscopic images [139, 140]. This, therefore, demonstrates that the present deep learning-based approach not only proved its high robustness in classifying HSV data against a huge number of manually labeled frames, but also revealed a promising performance on HSV data with challenging image quality in connected speech. Accordingly, hypothesis H1.1 was accepted.

After validating the accuracy of the proposed classification network, which addressed H1.1, it was implemented to investigate the differences between the AdLD patients and the vocally normal individuals in terms of the durations within the HSV recordings when the VFs were visually obstructed. This investigation was performed to address H1.2. The comparative results demonstrated that there was a noticeable difference in the durations of visual obstruction of the VFs in connected speech between AdLD (with an average obstruction of 26.1% across all the participants) and the normal controls (with an average obstruction of 19.7% across the different individuals). The outcome of the comparison supported the acceptance of H1.2. This difference stems from the impaired laryngeal control and excessive movements of the laryngeal tissues in AdLD – leading to frequent obstructions of the VF view – in comparison with the vocally normal subjects. Overall, the results demonstrated the applicability and the potential of this measurement in studying the differences between AdLD and normal controls. Therefore, the duration of visual obstruction might be a good measurement that could be used in the future for determining the severity of AdLD; however, a larger sample size would be useful to strengthen the findings and investigate the clinical relevance of this measure.

4.2. Study II: Image Segmentation of Vocal Fold Edges

4.2.1. Image Segmentation Approach: ACM

This study was intended to fulfill Aim 2 by addressing the following: Q2: Can VF edges be accurately and robustly segmented in HSV data during running speech in the presence of image noise? H2.1: The dark glottal area can be successfully silhouetted against the brighter surrounding VF tissue. H2.2: ACM can accurately segment VF edges in HSV data with excessive image noise during VF vibrations.

The temporal segmentation technique, developed in a previous study [119], was successfully utilized to determine the vocalized segments of the “Rainbow Passage.” Subsequently, the motion compensation precisely located the vibrating VFs across the frames during the extracted vocalizations. After applying the motion compensation and determining the location of the vibrating VFs, digital kymograms were successfully extracted at various cross sections of the VFs. The vibrating VFs always appeared on an almost straight line in the extracted kymograms, which was necessary for a better performance of the spatial segmentation algorithm. The automated snake initialization tool was successfully developed and accurately located a line that spanned through the glottis center in the extracted kymograms. The adjusted moment line was introduced because the results revealed how vulnerable the first moment of inertia line was to the noise in the kymogram image.
Based on the results, the proposed adjusted moment line provided a better estimate of the glottis center than the first moment of inertia line alone. Obtaining an accurate snake initialization line was a necessary step toward a better performance of the ACM algorithm and its convergence. The ACM was successfully implemented for the kymograms of the vocalized segments of the "Rainbow Passage." The application of ACM allowed the analytic representation of the VF edges at different cross sections of the VFs from the anterior to the posterior commissure. The algorithm performed well despite the challenging quality of the HSV images. Of 76 vocalizations of the "Rainbow Passage," visual observation of the detected edges and the HSV kymograms showed that the algorithm's error was not more than one pixel for 67 vocalizations (88%), deeming it successful for precise detection of the glottis boundaries. Due to dim lighting in some of the frames in the kymograms of the other nine vocalizations, the ACM was not able to find the glottal edges; visual observation also could not determine the glottal edges in these kymograms because of the same lighting issue. This study showed the feasibility of automatic VF edge detection using the proposed ACM method in challenging data obtained using a color high-speed camera, leading to the acceptance of hypotheses H2.1 and H2.2. Color images are preferred over monochrome images by clinical specialists since color images allow them to evaluate the health of the tissues while observing and evaluating the vibrations of the VFs. Therefore, this study used a color high-speed camera to demonstrate that the proposed algorithm can be applied to color images. Moreover, the goal of this study was to develop an algorithm that works under the most challenging conditions given color images. Color images are more challenging to analyze than monochrome images due to the inherently higher dynamic range (image quality) of monochrome images and the significantly more accurate representation of edge gradients in monochrome images. Despite the edge uncertainties in the color images, the paired active contour was not attracted to erroneous edges and maintained optimal rigidity. Since this work shows the robustness of the spatial segmentation method under the most challenging conditions posed by color images, this method can be a promising image processing technique to detect VF edges in HSV data regardless of image quality. After registering the segmented edges in the kymograms back to the HSV frames, visual inspection of the results showed that the implemented ACM successfully detected the edges of the vibrating VFs across the frames during each vocalized segment. This method not only addressed the sensitivity of prior image segmentation techniques to image noise and intensity inhomogeneity, but also tackled the more challenging video quality of HSV data in connected speech. Despite its promising performance, the ACM was vulnerable to very dim lighting conditions in the connected speech data, where the kymograms had inferior lighting. This issue occurred due to the high sensitivity of the active contours to their initialization, creating a challenge in accurately localizing the contours near the glottal edges.
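To make the contour-evolution idea concrete, here is a greedy active-contour sketch for a single kymogram edge. It is a simplified stand-in for the dissertation's ACM formulation: the contour trades attraction to strong image gradients against smoothness between neighboring columns, and the weight alpha and iteration count are illustrative.

```python
import numpy as np

def greedy_snake(grad_mag: np.ndarray, y0: np.ndarray,
                 alpha: float = 0.5, iters: int = 100) -> np.ndarray:
    """Evolve one edge contour y(column) over a kymogram.

    grad_mag: gradient-magnitude image (rows x columns).
    y0: initialization row for each column (e.g., the adjusted moment line).
    """
    y = y0.astype(int).copy()
    n_rows, n_cols = grad_mag.shape
    for _ in range(iters):
        for i in range(n_cols):
            left = y[i - 1] if i > 0 else y[i]
            right = y[i + 1] if i < n_cols - 1 else y[i]
            best, best_cost = y[i], np.inf
            for d in (-1, 0, 1):                      # local greedy move
                c = min(n_rows - 1, max(0, y[i] + d))
                cost = alpha * ((c - left) ** 2 + (c - right) ** 2) - grad_mag[c, i]
                if cost < best_cost:
                    best, best_cost = c, cost
            y[i] = best
    return y

grad = np.random.rand(64, 512)                 # placeholder gradient image
edge = greedy_snake(grad, np.full(512, 32))    # start from the center line
```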
Moreover, since ACM is an iterative method, it required a relatively long time to converge because the analysis is done at all cross sections of the VFs for each vocalization, which could include thousands of frames. Therefore, this technique is best suited to HSV data collected using rigid videoendoscopy, with its higher image quality. The following section discusses an advanced method (the hybrid approach) that provides enhanced performance and overcomes ACM's dependency on contour initialization and its high computational cost, so that it can be implemented even on dim, inferior-quality images during connected speech.

4.2.2. Image Segmentation Approach: The Hybrid Method

This study was focused on completely addressing Aim 2, in combination with the previous one, by fulfilling the following:

Q2: Can VF edges be accurately and robustly segmented in HSV data during running speech in the presence of image noise?

H2.3: A clustering technique can be combined with ACM to build a hybrid method improving the edge segmentation accuracy of ACM during vocalization and when VFs are not vibrating.

The temporal segmentation and motion compensation algorithms were successful in capturing the location of the vibrating VFs in a cropped motion window, which prepared the HSV frames for kymogram extraction. The HSV kymograms were generated at different cross sections of the VFs during each vocalization. The moment of inertia was used to successfully determine a horizontal line spanning through the center of the VFs in each kymogram, an important step before applying the hybrid spatial segmentation method to the kymograms. Appropriate features were selected and extracted in order to implement the unsupervised ML technique (i.e., k-means clustering). Different numbers and combinations of features were fed into the ML algorithm to determine the salient subset of features for the development of the method. These features included the intensities of the red and green channels and the image gradient; using these three features proved the most appropriate combination in terms of obtaining adequate clustering performance (see the sketch after this paragraph). Given the three considered features, the implemented clustering algorithm was able to precisely cluster the kymograms' pixels into two clusters (glottal area pixels and non-glottal area pixels). Subsequently, the edges of the clustered glottal area pixels were spatially segmented, returning the top and bottom initialization contour lines corresponding to the left and right VFs, respectively. After obtaining the initial contours from the clustering technique, they were used as inputs to the ACM method to enhance its performance in segmenting the VF edges. The ACM method was successfully applied to the kymograms utilizing the initialized contours. The main weakness of the ACM method is its sensitivity to the contour initialization, which should be close to the glottal edges. In this study, using the clustering technique to initialize the active contours significantly improved the accuracy of the hybrid ACM in comparison with using the ACM alone. This hybrid method allowed for the accurate representation of the edges of the vibrating VFs in the kymograms at different intersections of the VFs.
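The sketch below illustrates the clustering-based initialization just described, using the three features named above (red intensity, green intensity, and gradient magnitude); scikit-learn's KMeans stands in for the dissertation's implementation, and the gradient definition is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

def glottal_pixel_mask(kymo_rgb: np.ndarray) -> np.ndarray:
    """Cluster kymogram pixels into glottal vs. non-glottal classes."""
    red = kymo_rgb[..., 0].astype(float)
    green = kymo_rgb[..., 1].astype(float)
    gy, gx = np.gradient(red + green)                 # assumed gradient feature
    grad = np.hypot(gx, gy)
    feats = np.stack([red.ravel(), green.ravel(), grad.ravel()], axis=1)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(feats)
    labels = labels.reshape(red.shape)
    # The glottal gap is the darker of the two clusters.
    glottal = int(red[labels == 1].mean() < red[labels == 0].mean())
    return labels == glottal

kymo = np.random.randint(0, 256, (64, 512, 3)).astype(np.uint8)  # placeholder
init_mask = glottal_pixel_mask(kymo)   # its top/bottom edges initialize the ACM
```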
A comparison between the new machine-learning-based hybrid method and the ACM alone was conducted to show to what extent the new hybrid technique enhanced VF edge representation relative to using only the ACM approach. The performance of the hybrid method was compared with that of the ACM by applying the two methods to two decent-quality kymograms and two kymograms with dim lighting and degraded quality. The results of the comparison revealed a significant improvement in edge detection by the hybrid method over the ACM alone. This enhancement was more noticeable in the lower-quality kymograms, indicating that the proposed hybrid method was less vulnerable to image noise than the ACM, which failed to detect the edges in the presence of significant noise in the kymograms. In addition, the computational cost of the hybrid method was half that of the ACM technique. After applying the hybrid method, the segmented edges in the kymograms, which were extracted at different VF cross sections, were registered back to the HSV spatial frames to detect the VF edges in each individual HSV frame. The performance of the proposed hybrid method was tested through visual inspection of the detected VF edges in the HSV kymograms of different vocalization segments of the "Rainbow Passage." Out of 76 vocalizations, the visual inspection of the detected VF edges in the extracted kymograms demonstrated that the developed hybrid technique successfully captured the glottal edges for 74 vocalizations with an error of less than ±1 pixel. This yielded a high accuracy of 97.4% in VF edge representation using the hybrid method for HSV data during connected speech. The only other study performing the same task with which we can compare our work is our previously developed ACM method [34], which detected the glottal edges accurately in 88% of the vocalizations in the same HSV sample. There are no other known studies of automated VF segmentation of HSV recordings during connected speech. The current study presented several of the vocalizations where the ACM method failed. The higher accuracy and performance of the hybrid method, as shown in this study, reveal its superiority over the ACM method; hence, hypothesis H2.3 was accepted. The extracted kymograms of the two vocalizations in which the hybrid method did not perform accurately had extremely dim lighting across most of the frames, which also made visual detection of the glottal boundaries impossible, making it challenging to create an accurate manual reference. The hybrid method in this study is the first ML-based approach developed for VF segmentation during connected speech. The recently developed deep learning approaches for VF segmentation were all employed for HSV analysis during sustained vocalization with higher image quality [142, 143, 144, 145]. The developed hybrid method is fully automated, whereas the previously developed deep learning techniques required manual labeling of part of the dataset in order to train the deep neural networks. Moreover, the prior deep learning methods are all spatial segmentation techniques; the hybrid method in this study, in contrast, is a spatiotemporal method, which leads to higher robustness in the case of irregular VF closure.
The hybrid method in this study relies on the accurate performance of the developed motion compensation method; however, this is not an issue for HSV analysis during sustained vocalization given the small change in VF location across frames. Since there is no known gold-standard method to fully capture the VF edges from HSV data during connected speech, visual inspection served as the reference for validating the performance of the developed technique. It should be noted that this study showed the feasibility of the hybrid method for VF edge representation (in HSV data) during connected speech in one participant with no history of voice disorder. The proposed hybrid approach showed a promising performance for HSV data with the most challenging images, obtained by a color HSV system. This facilitates the future implementation of the proposed method on less challenging monochromatic images, since a monochrome camera provides higher sensitivity and dynamic range with better pixel representation, potentially leading to higher accuracy and faster performance of the hybrid method on monochromatic HSV data. This study aimed to show the feasibility of this approach for color HSV images, which are preferred over monochromatic images by many voice specialists since color images allow them to better evaluate the health of the tissues. Although the promising performance of the hybrid method was shown during VF oscillation, the algorithm did not perform accurately before and after the onset and offset of VF vibration. This was due to the deviation of the motion window from the VF location before and after the oscillation. However, this did not contradict the purpose of this study, which was to track the edges of the VFs during vocalization. Therefore, the hybrid technique is most efficient during the more sustained (vibratory) portions of phonation in connected speech, as it provides accurate detection of the VF edges during vocalized segments in running speech regardless of the inferior image quality. The next section discusses the development of an algorithm to automatically detect the edges of the VFs when adducted and not vibrating, which would be valuable in studying laryngeal maneuvers as well as phonation onsets and offsets during connected speech.

4.3. Study III: Deep-Learning-based Representation of Vocal Fold Dynamics

4.3.1. Deep Learning Approach: Segmenting Network on Color HSV Data

This study intended to partially address Aim 3 by fulfilling the following:

Q3: Can the GAW be automatically extracted given the inferior image quality in the fiberoptic HSV and the excessive laryngeal movements in AdLD during running speech?

H3.1: The hybrid method can be used as an automated labeling tool to train a robust DNN on detecting the glottal area in HSV during running speech.

The technique in this study used the power of the developed hybrid method, discussed in the previous section, together with deep learning to overcome the challenges of the previous two image segmentation methods (ACM and the hybrid technique) in detecting the glottal area during all phonatory tasks (including nonstationary portions) and when the VFs are not vibrating. Hence, the proposed deep learning approach can be used as a robust and cost-effective tool for segmenting the glottal edges, regardless of the image quality and the phonatory tasks during running speech.
This study showed the successful utilization of the previously developed hybrid segmentation technique as an automated labeling tool to form a training dataset. In the hybrid method, the k-means clustering technique was successfully applied to cluster the kymogram's pixels into two clusters (glottal area and non-glottal area). The edges of the glottal area cluster were roughly segmented as initialized contours for the ACM method, which was then implemented to accurately segment the edges of the vibrating VFs in the kymograms. The combination of k-means and ACM yielded a precise detection of the VF edges, which were registered back to the original HSV frames to segment the glottal area. The hybrid method showed an accurate performance, but mainly during sustained VF oscillation, as mentioned before. Hence, the hybrid method was applied to segment a set of frames during those instances of sustained vocalization in the HSV data. This allowed for automatic labeling of a huge number of HSV frames. A subset of these segmented/labeled frames was sufficient to create a training dataset for a deep neural network, a more robust segmentation technique that can work during phonatory events other than sustained vibration. Using the hybrid method as an automated labeling tool offered a huge advantage over the manual labeling commonly used in the literature [142, 143, 144, 145]: the proposed deep neural network was trained using only automatically segmented frames (utilizing the hybrid approach), without the need for any manual labeling. One advantage of this method is therefore that larger training datasets can be formed using the developed automated labeling tool in a cost-effective and objective manner, which is favorable for training deep learning techniques. The deep neural network was built based on the U-Net architecture. Several networks with different configurations were successfully trained on the automatically labeled dataset. Since the quality and performance of the automated labeling tool were evaluated in the previous study (the hybrid method), the automatically labeled dataset was sufficient to successfully train the networks. In addition, to ensure the training process using the automatic labeling was appropriate, we evaluated the automatically segmented frames via visual assessment before the training; furthermore, the trained networks were assessed against manually labeled frames (ground truth data). Among the trained networks, we found that the network trained using a batch size of 10 and built with an encoder-decoder depth of four had the best performance on the testing dataset (the ground truth data), with the highest mean IoU (0.82). The other networks, with different encoder-decoder depths and batch sizes, showed poorer performance and lower IoUs. The visual evaluation of the HSV data of the female subject showed that the best trained network (the proposed one) outperformed the automated labeling tool (the hybrid method), demonstrating better accuracy in segmenting the glottal edges and area and higher robustness to image noise based on our visual assessment. This promising performance of the trained network indicated the acceptance of hypothesis H3.1. In addition, the developed network had considerably lower complexity because it did not depend on several image processing steps to achieve the segmentation task, as the hybrid approach did.
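The following PyTorch sketch shows a U-Net-style encoder-decoder of the kind described above, with four stages and a batch of 10 as in the best configuration; the channel widths, layer details, and two-class head are assumptions for illustration, not the dissertation's exact network.

```python
import torch
import torch.nn as nn

def block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
                         nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True))

class TinyUNet(nn.Module):
    def __init__(self, widths=(16, 32, 64, 128)):   # encoder-decoder depth of four
        super().__init__()
        self.enc = nn.ModuleList()
        cin = 3                                      # RGB HSV frames
        for w in widths:
            self.enc.append(block(cin, w))
            cin = w
        self.pool = nn.MaxPool2d(2)
        self.ups, self.dec = nn.ModuleList(), nn.ModuleList()
        for w in reversed(widths[:-1]):
            self.ups.append(nn.ConvTranspose2d(cin, w, 2, stride=2))
            self.dec.append(block(2 * w, w))         # skip connection doubles channels
            cin = w
        self.head = nn.Conv2d(cin, 2, 1)             # glottis vs. background logits

    def forward(self, x):
        skips = []
        for i, enc in enumerate(self.enc):
            x = enc(x)
            if i < len(self.enc) - 1:
                skips.append(x)
                x = self.pool(x)
        for up, dec, skip in zip(self.ups, self.dec, reversed(skips)):
            x = dec(torch.cat([up(x), skip], dim=1))
        return self.head(x)

net = TinyUNet()
logits = net(torch.randn(10, 3, 256, 256))   # batch size of 10, as in the study
```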
Overall, visual inspection of the performance of the introduced network showed successful segmentation when it was implemented on the video frames. The accurate representation of the glottal area using the developed network enabled the precise measurement of the variation of the glottal area over time (the glottal area waveform). While the glottal area might be influenced by relative motion of the endoscope and tissues during phonation in connected speech, it is still an important measure that allows evaluation of the oscillation of the VFs in the HSV data [173]. Although the network was trained on frames segmented during sustained VF vibration, it generalized well and was able to correctly segment frames during more complex, nonstationary events such as onsets/offsets of phonation, voice breaks, irregular VF vibrations, and when the VFs were not vibrating, overcoming our previous method's limitation. Also, we found that the performance of the proposed approach was relatively stable and did not vary between the different phonatory tasks. Moreover, since the proposed network was trained on HSV frames segmented using the developed automated labeling tool, it was important to also validate the network against manually segmented HSV frames. Hence, a separate manually labeled dataset (testing dataset) was created, in which the glottal area in a set of new HSV frames was manually segmented, to test and quantify the performance of the proposed network. Different metrics were utilized to evaluate the network's performance against the manually segmented frames: a contour-based metric (BF score) to evaluate the detected boundary of the segmented glottal area (the glottal edges) and an area-based metric (IoU) to assess the segmented glottal area itself. The introduced network showed a high mean BF score of 0.96 (SD = 0.12), indicating high accuracy of the network in localizing the edges of the glottal area (i.e., VF edges). Furthermore, the developed network achieved a mean IoU of 0.82 (SD = 0.26) and a mean DC of 0.88 (SD = 0.25), signifying high precision in detecting the glottal area. This study introduced the first deep learning-based scheme for automatically segmenting the glottal area in connected speech; hence, there are no other studies using state-of-the-art deep neural networks for glottal area segmentation in running speech to compare with. The recently introduced deep learning models applied deep neural networks to segment the glottal area in grayscale [145, 144] and RGB [142] HSV data during sustained phonation using rigid endoscopes, but not during running speech using flexible endoscopy as in this study. HSV data in running speech, however, exhibit even lower image quality and excessive laryngeal maneuvers leading to considerable changes in the spatial location of the VFs. These constraints impose more challenges for deep neural networks to successfully segment the glottal area in HSV in connected speech. Despite these challenges, the introduced approach showed a mean IoU of 0.82 and a DC of 0.88, which are even above the baseline scores reported in the literature for less challenging, higher-quality HSV data: an IoU of 0.799 [143, 145] and a DC of 0.85 [142]. Although this comparison involves different datasets, the promising performance of the proposed method on more challenging data demonstrates its high competitiveness against the other related methods.
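For reference, the area-based metrics reported above can be computed from binary masks as sketched below (the BF score additionally matches boundary pixels within a distance tolerance and is omitted here for brevity); the toy masks are illustrative.

```python
import numpy as np

def iou_and_dice(pred: np.ndarray, truth: np.ndarray) -> tuple:
    """IoU and Dice coefficient between predicted and manual glottis masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    inter = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    return inter / union, 2 * inter / (pred.sum() + truth.sum())

# Illustrative check on toy masks.
a = np.zeros((8, 8), bool); a[2:6, 2:6] = True
b = np.zeros((8, 8), bool); b[3:7, 3:7] = True
print(iou_and_dice(a, b))   # IoU = 9/23, Dice = 9/16
```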
Furthermore, the previous deep learning approaches for HSV analysis [142, 143, 144, 145] were utilized solely for spatial segmentation. Among these studies, Fehling et al. [142] was the only research group that designed deep neural networks that could keep the temporal information of the HSV, and they evaluated segmentation conformity over the course of time; however, the sequences they utilized were quite short. In contrast, the introduced deep learning model is a spatiotemporal technique, in which the HSV data are first preprocessed using a temporal segmentation algorithm to extract the vocalized segments, and the proposed deep neural network is then applied to long HSV sequences. This spatiotemporal feature enhances the robustness of the proposed model toward, for example, irregular VF closure. The present work was conducted to demonstrate the capability and robustness of a new deep learning-based technique for automatically segmenting connected speech in challenging images, using color HSV data from one subject. It should be noted that the current work applied the developed method to color HSV data, which have smaller dynamic ranges than monochrome images; even higher accuracy can therefore be expected when the method is applied to monochrome data with a higher dynamic range. In the next section, this approach is applied to a larger sample of individuals with and without voice disorders and to HSV data recorded using a monochrome camera with less challenging image quality.

4.3.2. Deep Learning Approach: Neural Network on Monochrome HSV Data

This study was conducted to pursue Aim 3 by addressing the following:

Q3: Can the GAW be automatically extracted given the inferior image quality in the fiberoptic HSV and the excessive laryngeal movements in AdLD during running speech?

H3.1: The hybrid method can be used as an automated labeling tool to train a robust DNN on detecting the glottal area in HSV during running speech.

H3.2: This trained DNN will be successfully implemented for the automated extraction of the GAW in AdLD and normal controls even with its challenging image conditions.

H3.3: The glottal midline along with the left and right VF edges can be successfully captured based on the segmented glottal area.

In the current study, the aim was to provide a DNN model trained on a larger sample, compared to the previous study, to achieve the glottis segmentation task. To do so, the DNN developed and discussed in the previous section was retrained using a sample of both vocally normal participants and AdLD patients to validate its robustness and efficacy and to provide a reliable, generalizable automated tool for HSV analysis in running speech. Moreover, the literature on analyzing AdLD via HSV during connected speech has been almost nonexistent. Hence, the main objective of this work was to develop and validate an accurate methodology to identify VF edges and the glottal area in AdLD and vocally normal participants. Apart from the poor image quality of HSV in connected speech (found in the present video data), AdLD imposes excessive laryngeal tissue movements, which makes analyzing HSV recordings of AdLD patients more difficult. Successful implementation of an automated DNN method to segment glottal edges in a challenging HSV dataset such as AdLD would therefore allow for analysis and measurement of VF dynamics in AdLD during running speech.
This can provide information on the characteristics of the impaired voice production in AdLD, which can potentially reduce AdLD misdiagnosis. The present DNN approach was built on the efficient architecture of the deep learning network discussed in the previous section. The DNN was then successfully trained using a robust training dataset created from HSV recordings in running speech of vocally normal subjects and AdLD patients. A sample of HSV frames was randomly selected from the recordings – including VF gestures in several phonatory events and different obstructed views of the VFs – in which the glottal area was manually segmented by two raters after reaching a consensus on the detection of the area. This procedure guaranteed a fair representation of the various running speech events in the training/validation dataset and pixel-accurate segmentation of the manually annotated frames. Different architecture modifications and training strategies were considered to obtain generalized, high segmentation performance from our previously designed U-Net DNN. The quantitative evaluation of these different trained DNNs was performed on a test set of manually segmented frames from an independent HSV recording of an AdLD patient on which the DNNs were not trained. The test set was important to allow an unbiased assessment of the developed DNN and a realistic estimate of model performance on new images different from the training frames. Among the various trained DNNs, we found that the DNN with an architecture of four encoder-decoder stages, retrained with a batch size of 16, showed the best performance. That is, this optimal network showed a parallel improvement in all four assessment metrics (mean DC, IoU, F1, and accuracy scores) on the test set. In contrast, the other trained networks (built with different architectures and trained using different batch sizes) demonstrated poorer performance and lower assessment scores on the testing frames. The implementation of the best performing DNN on the testing dataset showed high-quality glottal area segmentation, with a mean IoU score of 0.81, a DC of 0.86, and an accuracy of 0.89 when compared to the manual annotation. In addition to these area-based metrics (IoU and DC), the developed DNN was also evaluated with respect to its precision in predicting the glottal boundary (the VF edges) using the F1 score. This dual evaluation was particularly important in two cases: (1) when the glottal edges were accurately segmented but some pixels inside the glottal area were missed or incorrectly predicted by the network, and (2) when the glottal area was precisely detected but the pixels located on the VF edges were incorrectly estimated by the network. The tested DNN demonstrated a high mean F1 score of 0.93, signifying high accuracy of the DNN in detecting the edges of the VFs and a good match between the estimated glottal edges and the edges manually delineated by the raters. Despite that, some discrepancies between the manual and automated segmentations were found, mainly near the glottal edges. One reason could be that, due to the poor image quality throughout the dataset including the testing frames, the edges were blurry, which made their manual segmentation challenging and sometimes inaccurate. Since the model did not face that challenge, it may well have outperformed the manual segmentation by providing accurate VF edges.
This observation was clear when the VFs were fully open and right before and after the closure phase. Another discrepancy, though slight, was detected during the fully abducted posture, particularly when there was no vibration, including during inhalation. This is because, when the glottal area became large, the VF edges became blurrier and the glottis became brighter, yielding a more challenging condition for the model to perform an accurate segmentation. In addition to the quantitative evaluation, the overall visual assessment of the best performing DNN demonstrated high quality in detecting the glottal area/edges in the various phonatory tasks that frequently occurred during running speech. This accurate performance was shown in imaging data that included sustained and irregular VF vibrations, offsets/onsets of phonation, voice breaks, and instances when the VFs were not moving. The trained DNN was not only able to detect the presence of the VFs in the frames when they were clearly displayed in the image, but was also able to recognize the absence of the VFs in the frames when they were visually obstructed by, for example, the epiglottis, arytenoid cartilages, and other laryngeal constrictions. This capability of the developed DNN was crucial given that excessive laryngeal activities occur often in running speech. Furthermore, an automated method to detect the glottal midline as well as the left and right VF edges based on the detected glottal area was developed and showed a promising performance. The introduced method was able to capture the edges of the VFs even for complex glottal area shapes and various sizes (including wide opening and during vibrations). This high performance was in line with the high quantitative score for detecting the glottal boundary, which supported the feasibility of the introduced glottal midline detection tool. Also, even with poor HSV quality and excessive laryngeal AdLD spasms, the tool was able to detect the glottal midline as a fitted second-order curve (a minimal sketch of such a fit appears at the end of this subsection) and, hence, capture the edges of the VFs. Therefore, based on both the quantitative and visual evaluation of the developed DNN, hypotheses H3.2 and H3.3 were accepted. It was found that the model was slightly more accurate for normal speakers than for AdLD patients. That is because AdLD frequently produced uncontrolled laryngeal tissue movements that led to partial or full coverage of the VFs during phonation, which might have been more challenging for the model to identify. The ability of the developed technique to recognize these different postures and obstructions of the VFs, and to avoid incorrect glottis segmentations, makes it a robust and efficient method for analyzing VF dynamics in connected speech. Furthermore, the precise segmentation of the glottal edges/area using the introduced method allowed for the accurate measurement of the glottal area waveform in running speech. In comparison with the manual segmentation of the glottal area in a sequence of HSV frames, the automated method provided an even smoother glottal area waveform, with no abrupt changes in the area across frames, indicating accurate detection of the glottal area. The promising glottis segmentation quality demonstrated by the proposed approach in running speech addresses an existing literature gap in the automated analysis of VF dynamics in HSV data.
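As a concrete illustration of the midline fitting mentioned above, the sketch below fits a second-order curve through the per-row midpoints of a segmented glottal mask; the midpoint construction is an assumption for illustration, while the quadratic form follows the text.

```python
import numpy as np

def glottal_midline(mask: np.ndarray) -> np.poly1d:
    """Fit the glottal midline of a binary glottis mask as a quadratic.

    For each image row intersecting the segmented glottal area, the midpoint
    column is taken; the left/right VF edges correspond to the leftmost and
    rightmost glottal columns in that row.
    """
    rows, mids = [], []
    for r in range(mask.shape[0]):
        cols = np.flatnonzero(mask[r])
        if cols.size:
            rows.append(r)
            mids.append(0.5 * (cols[0] + cols[-1]))
    return np.poly1d(np.polyfit(rows, mids, deg=2))   # column = f(row)

# Toy example: a slightly curved, slit-like glottis.
mask = np.zeros((40, 40), bool)
for r in range(5, 35):
    c = int(20 + 0.01 * (r - 20) ** 2)
    mask[r, c - 1:c + 2] = True
midline = glottal_midline(mask)
```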
Regarding that gap, previous image segmentation methods, including deep learning models, were developed and evaluated using only rigid HSV in sustained phonation [142, 143, 144, 145], not flexible HSV in connected speech as in the present work. Bridging this gap is essential to analyze VF vibratory characteristics and function in voice disorders, which mostly manifest during running speech. In addition, HSV data recorded in connected speech exhibit poorer image quality and higher variability in the spatial location of the VFs, due to the excessive movements of laryngeal tissues, than in sustained phonation. This creates more difficulties for DNNs to achieve successful glottis segmentation in running speech HSV data. Despite that, the proposed method showed mean IoU and DC scores even larger than the baseline accuracies of previous deep learning methods tested on high-quality HSV data, with an IoU of 0.799 [143, 145] and a DC of 0.85 [142]. This comparison shows that our proposed technique, though tested on a more challenging dataset, is considerably competitive against the related deep learning approaches in the literature. This study fills another research gap: the limited number of studies that have analyzed AdLD using HSV. To the best of our knowledge, this work is one of the earliest attempts in the literature to provide accurate methodologies for quantifying VF vibrations during running speech, which enable us to further investigate VF dynamics in AdLD using HSV in connected speech. Additionally, unlike the other deep learning methods in the literature, the automated approach developed in the present study is the first deep learning-based approach for segmenting the glottal area in HSV data obtained from AdLD subjects. The promising performance of the introduced method in detecting the glottal area change in running speech on challenging HSV data of AdLD subjects with excessive laryngeal movements can facilitate the development of HSV-based measures. Such measures allow for quantifying the vibratory behavior of the VFs and the prephonatory adjustments, such as the measurement of glottal attack and offset times in vocally normal speakers versus AdLD patients, as an effective approach to evaluate the severity of this disorder, which is discussed in the next section.

4.4. Study IV: Automated Measurements of Glottal Attack and Offset Time

The purpose of this study was to fulfill Aim 4 by addressing the following:

Q4: Are the glottal attack and offset times different between AdLD and normal controls?

H4.1: An automated algorithm can be developed to measure GAT and GOT with comparable accuracy to visual measurements.

H4.2: GAT and GOT will be significantly higher in AdLD versus normal controls.

H4.3: GAT and GOT will show more variability in AdLD subjects.

The goal of this study was to develop an automated algorithm for measuring GAT and GOT from HSV in connected speech as objective measures that could potentially facilitate the diagnosis of AdLD in the future. To achieve that goal, the automated segmentation tool developed and discussed in the previous section was implemented on the monochrome recordings. The segmentation technique showed successful detection of the glottal area during the various onsets and offsets of phonation in running speech.
The DNN tool demonstrated high-performance capabilities in detecting the varied sizes, geometries, locations, and configurations of the glottal area and VFs that commonly exist during the different phonation onsets and offsets in running speech. Being able to capture this variability and the transitional states, extending from various degrees of VF opening and small-amplitude oscillations to steady-state vibrations, even in the presence of inferior and variable image quality, demonstrated the high reliability and consistent accuracy of the developed tool in glottal area segmentation during the onset and offset of phonation. The successful segmentation outcome in capturing and quantifying the dynamic change in the glottal area facilitated the automated measurement of GAT and GOT. Based on the segmented glottis, the contact between the VFs was successfully determined. This allowed the precise computation of the energy contours associated with both the dynamic vibration of the VFs (represented in the GAW) and the VF contact (represented in the GCW). Given these two contours, the time delay was calculated using the cross-correlation technique between the rises in the energy contours, in the case of phonation onset (GAT), and the drops in the energy contours, in the case of phonation offset (GOT). The automated algorithm provided efficient measurements of GAT and GOT for the vocally normal group as well as the AdLD group. To validate the developed automated approach, visual analyses were carried out by three raters to obtain manual measurements of GAT and GOT by visually determining the timestamps between the first oscillation and first contact (for GAT) and between the last contact and last oscillation (for GOT). Manual and automated measurements were obtained from each recording. The comparative analysis showed close agreement between the automated and visual measurements of GAT and GOT in most of the recordings, with minimal differences. As a measure of the accuracy of the developed approach, the average difference between the automated and manual measurements – computed based on the mean of each recording – was small: 1.6 ms for GAT and 2.7 ms for GOT. This minor difference between the automated and visual analyses was even lower than the error found among the three raters (up to around 4.5 ms), which is considered an acceptable deviation in GAT and GOT measurements [162]. Additionally, the statistical analysis performed between the automated and manual measurements within different vocalizations demonstrated a strong correlation between the two in computing both GAT (r = 0.93) and GOT (r = 0.91), indicating a high degree of similarity between the two measures. These findings were also supported by the conducted t-test, in which no significant difference was found between the automated and visual analyses for GAT or GOT, with p-values of 0.86 and 0.77, respectively. These comparative results reflect the reliability and accuracy of the automated analysis technique relative to the visual analysis in estimating GAT and GOT, leading to the acceptance of H4.1. Furthermore, the results demonstrated that the automated algorithm was marginally more precise in determining GAT than GOT across the different subjects.
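A minimal sketch of the delay estimation described above follows: the GAT is taken as the lag that maximizes the cross-correlation between the rising GAW and GCW energy contours. The contour definitions and the 4,000 fps frame rate are assumptions for illustration.

```python
import numpy as np

def contour_delay_ms(gaw_energy: np.ndarray, gcw_energy: np.ndarray,
                     fps: float = 4000.0) -> float:
    """Lag (ms) by which the GCW energy contour trails the GAW contour."""
    a = gaw_energy - gaw_energy.mean()
    b = gcw_energy - gcw_energy.mean()
    xcorr = np.correlate(b, a, mode="full")
    lag_frames = np.argmax(xcorr) - (len(a) - 1)
    return 1000.0 * lag_frames / fps

# Toy onset: the contact (GCW) rise trails the oscillation (GAW) rise by 60 frames.
t = np.arange(2000)
gaw = 1.0 / (1.0 + np.exp(-(t - 500) / 20.0))
gcw = 1.0 / (1.0 + np.exp(-(t - 560) / 20.0))
print(contour_delay_ms(gaw, gcw))   # about 15 ms at 4,000 fps
```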
The small deviation in accuracy between GAT and GOT primarily derived from the longer durations of the GOT, which exhibited minimal amplitudes of VF oscillation toward the end of the offset that were difficult for the raters to define, causing small discrepancies with the automated method. Similarly, the automated measures obtained for the vocally normal group showed marginally higher precision for both GAT and GOT than for the AdLD patients. This was mainly due to the irregularity of the dynamic VF vibration in AdLD patients as well as the excessive phonatory breaks, making the analysis more challenging for both the automated and visual analyses. Another likely cause of the (minimal) difference between the automated and manual analyses was the inferior image quality of the recordings and the blurriness found just prior to the adduction of the VFs, creating difficulty in determining the first and last contact frames. After validating the introduced automated approach, it was used to compute the GAT and GOT values across all the recordings in order to draw conclusions about the differences between the normal controls and the AdLD group. The results revealed that, overall, the GAT measures were longer in the AdLD patients than in the vocally normal participants. The statistical analysis demonstrated a significant difference (p < 0.001) between the average GAT of the AdLD group (18.95 ms) and that of the normal controls (14.65 ms); a sketch of such a group comparison follows this paragraph. This significant difference in GAT supported the part of H4.2 stating significantly higher GAT in AdLD versus normal controls. This finding is consistent with a previous study that demonstrated a delay between the onset of phonation and the activation of the laryngeal muscles [174]. The results of the present study also agree with findings in the literature showing longer GAT in AdLD than in normal controls [175, 153]. In contrast, although the present automated analysis demonstrated a slight increase in the mean GOT of the AdLD group (28.9 ms) versus the vocally normal group (27.3 ms), the difference was not statistically significant (p = 0.2). This nonsignificant difference in GOT rejected the part of H4.2 stating significantly higher GOT in AdLD versus normal controls. Furthermore, the results demonstrated larger variability in the GAT and GOT measurements within the AdLD group, with particularly greater variability observed in GOT. In contrast, within the normal controls this variability was smaller, especially for GAT, which was consistent with a minimal range of variability across the vocally normal individuals. This finding supported the acceptance of H4.3. The explanation primarily lies in the irregular, inconsistent behavior of the VF vibrations in AdLD along with the underlying neurological dysfunction impairing laryngeal muscle control [66]. Hence, given the statistical significance between AdLD and normal controls, along with the greater consistency and lower variability within the normal controls, GAT may be a more valuable clinical measure than GOT. A larger sample size could substantiate the findings on the impact of AdLD on GAT and GOT and establish the clinical significance of the introduced measures.
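The group comparison sketched below mirrors the statistical test described above, using SciPy's independent-samples t-test on hypothetical per-participant GAT means; the listed values are illustrative, not the study's data.

```python
from scipy import stats

# Hypothetical per-participant mean GATs in ms (illustrative values only).
gat_adld = [17.8, 21.2, 18.4, 19.9, 17.5, 20.3]
gat_norm = [14.1, 15.2, 13.9, 15.4, 14.6, 14.7]

t_stat, p_value = stats.ttest_ind(gat_adld, gat_norm)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```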
4.5. Study V: Lumped Modeling and Optimization of Vocal Fold Vibration

The purpose of this work was to fulfill Aim 5 by addressing the following:

Q5: Can a simplified one-mass model be optimized to accurately match the vibratory behavior of VFs extracted from HSV?

H5.1: A simplified one-mass model can successfully simulate both the vibratory and closure phases of VF motion.

H5.2: The particle swarm optimization technique will enable accurate optimization of the model to predict the experimental glottal area waveform.

H5.3: The optimized model parameters, obtained through inverse analysis of HSV data, can estimate the VF mass, elasticity, and viscosity indices.

The goal of this study was to introduce a biomechanical model that can mimic the mechanical vibrations of the VFs. A lumped-element model was built using a one-mass model, designed such that each VF tissue was described by a rigid mass coupled with springs and dampers. The model was combined with experimental data in which the change in the glottal area was extracted from a vocalized segment in the monochrome HSV data. The aim was to build this model and optimize its parameters so that it could generate oscillation behavior similar to the glottal area waveform extracted from the HSV. For optimization, the particle swarm technique, commonly used in the literature [93, 103], was employed. The one-mass lumped-element model was successfully implemented to reproduce behavior relatively close to the VF vibrations. To do so, the model was developed to incorporate the dynamic behavior of the VFs during both the vibration and the closure phases, and several parameters were incorporated into the model simulation. The model included parameters related to the VF mass and the spring stiffness (representing the elastic behavior of the VF). In addition, during VF closure, an extra damping coefficient was included as a parameter to simulate the characteristics of the damped motion during VF adduction. A relative increase in the subglottal pressure is expected during VF closure – a build-up pressure – which helps push the VFs apart and complete the vibratory cycle. To incorporate this build-up pressure, a variable subglottal pressure was considered as a function of time: during the closure time, the subglottal pressure increases to reach a maximum value. This maximum pressure during closure, as well as the closure duration, was also incorporated into the model in an attempt to obtain a realistic simulation and capture the dynamic oscillatory behavior of the VFs. After incorporating the proposed parameters into the one-mass model, the modified model was successfully implemented and simulated. Although the model had a limited number of degrees of freedom (using only one mass to represent each VF), it was able to sufficiently mimic the oscillatory behavior of the VFs during both the vibration and adduction phases, leading to the acceptance of H5.1. In addition to building the model, several simulations with different model parameters were carried out to ensure that the model offered acceptable performance suitable for optimization against experimental data. By using the additional damping parameter during the closure phase, the model was able to produce behavior close to that of the VFs when they are in contact.
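A minimal sketch of such a one-mass equation of motion follows. It integrates a single mass-spring-damper per fold with state-dependent closure damping and build-up pressure; the parameter values are illustrative (loosely based on the ranges cited later), the adducted rest position is an assumption, and the Bernoulli flow coupling of the full model is omitted.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Illustrative parameters (CGS units), not the optimized values.
m, k = 0.05, 11_787.0        # VF mass (g) and stiffness (g/s^2)
c_extra = 60.0               # extra damping during closure (g/s); zero when open
x_rest = -0.01               # assumed adducted rest half-width (cm)
L_g, A_drive = 1.4, 0.12     # fold length (cm) and pressure-loading area (cm^2)
P0, P_max = 8000.0, 8571.0   # baseline and build-up subglottal pressure (dyn/cm^2)

def rhs(t, y):
    """One-mass model: m*x'' = P*A - c(x)*x' - k*(x - x_rest)."""
    x, v = y
    closed = x <= 0.0                       # folds in contact
    c = c_extra if closed else 0.0          # overdamped motion during closure
    p = P_max if closed else P0             # idealized pressure build-up
    return [v, (p * A_drive - c * v - k * (x - x_rest)) / m]

sol = solve_ivp(rhs, (0.0, 0.05), [x_rest, 0.0], max_step=1e-5)
gaw = np.clip(2.0 * L_g * sol.y[0], 0.0, None)   # glottal area waveform (cm^2)
```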
During adduction, specifically, the simulated VFs showed an overdamped movement (due to the increased value of the viscous damping parameter, c') in order to emulate the impact of collision between the VFs. Moreover, the subglottal pressure was effectively incorporated into the model simulation as a variable parameter. The simulation results showed an accurate representation of the amplified subglottal pressure, reflecting the build-up behavior during the closure phase of the VFs. The model was also able to estimate the variation in the glottal airflow as well as the change in the glottal area waveform resulting from the oscillatory behavior of the simulated VFs. After the successful simulation of the model, an optimization technique was applied: the particle swarm method was successfully employed to fit the theoretical glottal area waveform resulting from the model simulation to the experimental glottal area waveform extracted from a vocalized segment of the HSV. Optimizing the model with a vocalized segment during VF vibration was chosen because the VF properties, such as mass and elasticity, were not expected to vary considerably there, allowing for better optimization. Most previous efforts to optimize lumped models against HSV-based VF oscillations were likewise conducted using samples during VF vibration, as in the present work [93, 103, 104, 105, 84, 106, 92]. Six model parameters were successfully optimized: a scaling factor, the mass, the spring stiffness, the damping coefficient during closure, the closure time (tc), and the maximum subglottal pressure (build-up pressure). The scaling factor was used because the units of the simulated and experimental glottal area waveforms did not match; the scaling factor minimized the difference between the two waveforms. The outcome of the optimization demonstrated a relatively good match between the simulated and experimental behavior during both the vibratory portions, when the VFs were open, and the closure portions, when the VFs were in contact. The successful optimization revealed the efficacy of the model parameters in producing vibratory behavior close to the experimental one; hence, H5.2 was accepted. The optimized parameters also agreed with the typical ranges found in the literature. The optimized VF mass (0.05 g) lay within the expected range reported in previous studies, 0.016-0.10 g, and the optimized elasticity (11,787 g/s2) agreed closely with several studies that reported a wide range of VF elasticity values (6,000-32,000 g/s2) [176, 177, 178, 179]. Moreover, the optimized subglottal pressure in the present study was between 8,000 and 8,571 dyn/cm2, which falls within the typical pressure values found in the literature, where there is broad consensus around 8,000 dyn/cm2 with a range of 4,000-14,000 dyn/cm2 [169, 165, 179]. The optimized VF closure time of 0.003 s in the introduced model closely approximated the value observed in the HSV recording, revealing the success of the optimization process in capturing the experimental closure phase time. With the optimized parameters, it was possible to estimate biomechanical measurements of the VFs, including the elasticity index and the viscosity index associated with the vibrating VFs. Therefore, hypothesis H5.3 was accepted.
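A bare-bones particle swarm optimizer is sketched below to illustrate the fitting procedure; the swarm size, inertia, and acceleration constants are illustrative, and the commented loss shows how the six model parameters above would be fit to the HSV-derived waveform (simulate and gaw_hsv are hypothetical names).

```python
import numpy as np

rng = np.random.default_rng(0)

def pso(loss, bounds, n_particles=30, iters=200, w=0.7, c1=1.5, c2=1.5):
    """Minimize loss(params) over box bounds with a basic particle swarm."""
    lo, hi = np.array(bounds, float).T
    x = rng.uniform(lo, hi, (n_particles, lo.size))      # positions
    v = np.zeros_like(x)                                 # velocities
    pbest, pval = x.copy(), np.array([loss(p) for p in x])
    gbest = pbest[pval.argmin()]
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = np.clip(x + v, lo, hi)
        val = np.array([loss(p) for p in x])
        improved = val < pval
        pbest[improved], pval[improved] = x[improved], val[improved]
        gbest = pbest[pval.argmin()]
    return gbest, pval.min()

# For the model above, the six parameters (scale, m, k, c_extra, t_c, P_max)
# would be fit by minimizing the waveform mismatch, e.g.:
# loss = lambda p: np.mean((p[0] * simulate(p[1:]) - gaw_hsv) ** 2)
best, best_val = pso(lambda p: np.sum((p - 1.0) ** 2), bounds=[(-5, 5)] * 6)
```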
The successful development and application of such a simple one-mass model to a vocalized segment of running speech is, to our knowledge, one of the earliest attempts in the literature to achieve this kind of optimization with an improved one-mass model on running speech data. Overall, the successful implementation of the introduced model can open a new line of future research and serve as a tool that can be further developed and used for non-invasively estimating important biomechanical features of the VFs during connected speech, features that cannot be extracted using conventional clinical assessment tools. Also, the successful validation of the related hypotheses of this study demonstrated the potential of the simplified model introduced here for quantifying some biomechanical properties of the VFs using HSV data. By leveraging the advantages of simple models, a better understanding of the temporal dynamics in running speech and of the impaired vocal function in voice disorders such as AdLD can be pursued.

4.6. Limitations and Directions for Future Studies

This dissertation, while providing pioneering HSV analysis methods and valuable insights into laryngeal mechanisms and vocal function in AdLD during running speech, is not without its limitations. The aim of this section is to examine the constraints associated with the present work and potential directions for future investigation. In study I, a new approach was developed to detect laryngeal activity in AdLD as a potential measure of severity by classifying the durations of VF image obstructions in HSV. Although the approach demonstrated promising accuracy, it showed difficulty in detecting partially obstructed VF images in HSV. This is mainly because, in the manually labeled images used for training the developed DNN, it was challenging for the rater to exactly determine partial vocal fold obstruction. This shortcoming could be avoided by modifying the criteria used for partial obstructions or by adding one more classification category: instead of obstructed and unobstructed VF images, three classes – unobstructed, partially obstructed, and fully obstructed – could be considered. Also, the developed measurement in this study was based only on whether the VF images were obstructed, without identifying the type of obstruction. This limitation is a potential direction for future research as well, where different types of VF image obstructions could be detected, such as obstructions due to the epiglottis, arytenoids, laryngeal compressions, etc. Being able to classify these different types of tissue obstructions might lead to useful insights into the specific spasmodic behavior of AdLD or other neurological voice disorders. In addition, with a larger sample size, the clinical relevance of this measure could be further emphasized and diagnostically relevant information extracted. As a future direction, the findings of the present study could assist in developing appropriate passages and speech texts with a minimally obstructed view of the VFs for more effective voice assessment in connected speech. Study II developed two image segmentation techniques for capturing VF edges in HSV during running speech. The first technique, ACM, showed difficulties in detecting VF edges given the excessive image noise and inferior image quality in the HSV data. The second method was the hybrid approach.
The hybrid technique addressed the ACM challenges and was able to accurately detect VF edges in poor HSV image quality. However, the algorithm encountered some difficulties in finding the VF edges before and after the steady-state vibrations. This might be due to the deviation of the motion window from the location of the VFs before and after VF oscillation, yielding inaccurate extraction of the kymogram images during the onset and offset of phonation. Because of this deviation, the vibrating VFs were not exactly in the center of the motion window. However, this was not an issue in the present study, since the main goal was to capture the VF edges during vocalization and the vibratory phase of the VFs. For future directions, the introduced ACM technique could be more effectively applied to video data recorded via rigid HSV, with higher image quality and less image noise. For more challenging conditions in running speech with inferior image quality, the developed hybrid tool can provide accurate performance in detecting VF edges, particularly when applied to the more stationary phonation tasks in running speech to analyze the vibratory behavior of the VFs using kymograms. Study III provided a quantitative representation of VF dynamics in AdLD during running speech using HSV. Although the method demonstrated accurate detection of the glottal area and its edges given the poor image quality and excessive laryngeal maneuvers in AdLD, there is potential for further improvement. The main failure mode, though infrequent, was mistakenly detecting dark spots located toward the corners or edges of the HSV frames as glottal area. A prospective research path could address this limitation by cropping the HSV frames as a preprocessing step before applying the DNN for glottal area segmentation. This can not only improve the segmentation accuracy by avoiding these misclassified pixels, but also reduce the computational cost, because the input images to the DNN would be cropped (smaller in size) and would require less processing time. Moreover, the developed tool opens new avenues of future research by enabling the extraction of objective HSV-based measures from both vocally normal individuals and patients with voice disorders during VF vibration. Having access to such automated measures would help future clinicians analyze the huge HSV datasets in connected speech and could potentially facilitate voice disorder diagnosis. Study IV developed and analyzed automated measurements of GAT and GOT using HSV in running speech. In spite of the substantial findings of this study demonstrating that GAT can be a potential candidate for assessing AdLD in running speech, a larger sample size is needed to substantiate these findings and differentiate between vocally normal adults and AdLD. Another related limitation of this study highlights the need to compare AdLD with other neurological voice disorders such as MTD and ET. This comparison would be beneficial to reveal the main differences among the different voice disorders using the developed automated measures, leading to a better understanding of the difficulties associated with the misdiagnosis of these patients. Moreover, a potential research direction lies in combining the different candidate measures found in the present work to offer an assessment procedure for AdLD.
That is, the measurement of the frequency and intensity of the laryngeal tissue activities can be combined with the GAT and GOT measurements as a more effective way of assessing the severity of AdLD. Moreover, the present study investigated only samples of running speech. This points to an important future research avenue: studying the impact of different speech tasks and phonetic contexts on the severity of different voice disorders' symptoms compared with normal controls. For example, future studies may benefit from comparing sustained phonation, the different CAPE-V sentences, and the various phonetic sounds included in the Rainbow Passage. These investigations may uncover new findings regarding the correlations between phonetic context and the severity of various neurological voice disorders. The last study, despite demonstrating the successful simulation and optimization of a simple lumped model to capture the vibratory characteristics of the VFs, has limitations. The limitations mainly arise from the simplifications and assumptions made in implementing the model. Describing each VF as one mass with a single degree of freedom, compared to multi-mass models, imposed a few constraints. Including multiple masses allows for more precise simulation of the intricate VF vibrations, particularly in the closure phases. However, the present model was built to partially overcome this limitation and simulate the closure phase by including an extra damping coefficient to emulate the impact of VF collision. Future investigations could therefore compare the simulation of the present model with multi-mass models. Moreover, although the subglottal pressure in this model was parameterized as a step function to simulate the build-up pressure during closure, this was an idealized representation of the actual variation of the subglottal pressure during VF oscillation. That is, the sudden increase in pressure at the beginning of the closure phase should be simulated as a smooth rise; similarly, the sharp drop in pressure right after the closure phase could be replaced by a smooth drop. This would achieve a closer emulation of reality during VF vibration, enhancing the fidelity of the simulation. Another limitation of the present model is that the VFs were simulated with a zero damping coefficient during the opening phases. This represents another point for future study: incorporating a damping coefficient for the VFs during oscillation in addition to the extra damping during the closure phases. Also, the limitation of relying solely on the Bernoulli equation to model the airflow underscores the need to incorporate advanced aerodynamic models to precisely simulate the turbulent effects occurring during VF vibration. Lastly, the fact that the model was optimized during VF vibration highlights the potential of advancing the present model to also optimize the onset and offset of phonation in future investigations.

CHAPTER 5: CONCLUSION

In order to gain a more comprehensive understanding of the impaired vocal function in AdLD and the persistent issue of misdiagnosis, this dissertation introduced one of the earliest endeavors in the literature to utilize advanced HSV technology in studying this disorder during connected speech.
CHAPTER 5: CONCLUSION

In order to gain a more comprehensive understanding of the impaired vocal function in AdLD and the persistent issue of its misdiagnosis, this dissertation introduced one of the earliest endeavors in the literature to utilize advanced HSV technology in studying this disorder during connected speech. Despite the challenges posed by HSV analysis in running speech and the lack of effective automated methods, various automated approaches and techniques were successfully developed in the present work to bridge this major gap in the literature and facilitate investigating the dynamic characteristics of the VFs in AdLD. Toward that end, this dissertation presented five studies aiming to tackle these challenges. The conclusions of each study are summarized as follows.

Study I introduced an automated tool to classify HSV frames by detecting the instances during which the image of the VFs is optically obstructed. This tool enabled exploring how the presence of AdLD affects the durations over which the VFs were visually obstructed in HSV during running speech, indicating the degree of laryngeal activity in AdLD. The developed technique can accurately perform automated frame selection in HSV recordings, even with challenging image quality, to recognize and classify HSV frames with a clear view of the VFs, and it supports precise analysis of laryngeal maneuvers in AdLD patients within running speech. Using this tool, the analysis showed remarkable differences in the durations of visual obstruction of the VFs between AdLD patients and vocally normal individuals during connected speech; these durations might serve as a measurement for determining the severity of AdLD. Overall, this tool can provide insights into the impaired laryngeal activities in AdLD and other voice disorders in terms of the spasmodic behavior of laryngeal tissue movements.

Study II proposed a novel image segmentation tool for VF edge detection in HSV-based kymograms during the vibratory segments of running speech, one that performs accurate automated analysis despite the poor image quality and noise present in transnasal HSV. This automated tool addressed the lack of effective spatial segmentation methods amenable to HSV analysis in connected speech. The temporal segmentation and motion compensation approaches used in this study successfully extracted the timestamps of the vocalized segments and localized the vibrating VFs in the HSV recordings of the Rainbow Passage. The study showed the successful development of two image segmentation approaches for VF edges, using ACM and an unsupervised machine learning technique (k-means clustering). The combination of these two approaches resulted in a powerful hybrid method for spatial segmentation, with promising performance in precisely capturing the VF edges in kymogram images across frames. This hybrid method significantly improved the performance of the ACM method by reducing the dependency of the ACM on contour initialization, enhancing the edge representation accuracy, mitigating the sensitivity to image noise, and lowering the computational cost. The k-means component of this hybrid method is illustrated in the sketch below.
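As a toy illustration of that k-means component, the sketch below clusters the pixel intensities of a grayscale kymogram and treats the darkest cluster as the glottal gap, from which one edge per fold can be read in every time column. The cluster count, the synthetic kymogram, and the edge convention are assumptions made for the example, not the dissertation's configuration.

    import numpy as np
    from sklearn.cluster import KMeans

    def kymogram_edges(kymo: np.ndarray, n_clusters: int = 3):
        # Rows index position across the glottis; columns index time.
        h, w = kymo.shape
        labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0) \
            .fit_predict(kymo.reshape(-1, 1)).reshape(h, w)
        # Assume the darkest intensity cluster corresponds to the glottal gap.
        means = [kymo[labels == i].mean() for i in range(n_clusters)]
        gap = labels == int(np.argmin(means))
        upper = np.full(w, np.nan)   # one fold's edge per time column
        lower = np.full(w, np.nan)   # the opposing fold's edge
        for col in range(w):
            rows = np.flatnonzero(gap[:, col])
            if rows.size:
                upper[col], lower[col] = rows.min(), rows.max()
        return upper, lower

    # Synthetic kymogram: a dark oscillating band on a bright background.
    t = np.arange(200)
    kymo = np.full((60, 200), 200.0)
    half = 8 + 6 * np.sin(2 * np.pi * t / 40)
    for col, hw in zip(t, half):
        kymo[int(30 - hw):int(30 + hw), col] = 40.0
    upper_edge, lower_edge = kymogram_edges(kymo + 5 * np.random.randn(60, 200))

In a hybrid scheme of this kind, such an intensity-based edge trace can initialize the active contour, which is one way the dependency on contour initialization noted above can be reduced.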
Study III presented a quantitative representation of VF dynamics in AdLD during connected speech, a novel application of HSV for examining this disorder. This study extracted the GAW and glottal edges from HSV data in connected speech. These analyses offered a considerable contribution to the existing literature by providing a distinctive quantitative portrayal of the impaired behavior of VF vibrations. The study showed the successful implementation of an automated labeling tool for spatial segmentation of the glottal area in HSV frames, based on the hybrid method developed in Study II, which generated a large training data set to effectively train the DNN. The developed DNN even outperformed the labeling tool by improving the segmentation accuracy of the glottal area, enhancing the robustness to poor image quality and noise, lowering the computational cost, and increasing the flexibility to accurately perform segmentation during complex events such as phonation onsets/offsets and voice breaks. The study also demonstrated the successful application of the developed DNN and the accurate representation of VF dynamics in running speech HSV recordings of vocally normal individuals and patients with AdLD. The developed approach accurately segmented the glottal area, overcoming the inferior image quality and the excessive laryngeal spasms in the AdLD recordings. Moreover, based on the detected glottal area, an edge detection algorithm was developed to capture the left and right VF edges along with the glottal midline. The high segmentation accuracy of the developed tools was demonstrated during onsets/offsets of vibration, voice breaks, prephonatory adjustments, and regular/irregular VF vibrations. These tools can aid clinicians in addressing the diagnostic challenges and early detection of this disorder.

Study IV investigated the pathological vocal function during phonation onset and offset in AdLD using HSV in connected speech, bridging a major gap in the literature. In this work, an automated analysis was conducted to measure the GAT and GOT of vocally normal participants and AdLD patients and to investigate the differences between the two groups. These analyses were carried out with a newly developed automated measurement method. The automated measurements showed minor, non-significant differences relative to the visual analysis, with strong correlations between the two methods. The automated measurements demonstrated two main findings: AdLD patients showed significantly longer GATs compared with the vocally normal group, and more variability was observed in both the GATs and GOTs of the AdLD patients due to the considerable irregularity of their impaired VF vibrations. Since this study is among the earliest attempts in the literature to investigate these measurements in running speech for AdLD, these findings can serve as a baseline for future research using larger sample sizes and different voice disorders. The developed automated approach for glottal attack and offset time measurement can be valuable in clinical practice; obtaining such measures enables the exploration of clinically relevant information to address the diagnostic challenges in AdLD. One plausible sketch of the GAT measurement is given below.
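Since the dissertation's exact measurement criteria are not restated here, the sketch below uses one plausible operationalization, assuming GAT is the interval between the first detectable oscillation of the glottal area waveform at a phonation onset and the first complete glottal closure; the thresholds and the name gaw_segment are illustrative assumptions.

    import numpy as np

    def glottal_attack_time(gaw: np.ndarray, fps: float,
                            osc_frac: float = 0.1, closed_frac: float = 0.02):
        # One plausible operationalization (not the dissertation's exact
        # criteria): time from the first frame whose glottal area exceeds
        # osc_frac of the segment maximum to the first subsequent frame
        # whose area falls below closed_frac of it (first full closure).
        peak = float(gaw.max())
        active = np.flatnonzero(gaw > osc_frac * peak)
        if active.size == 0:
            return None                       # no oscillation detected
        start = active[0]
        closure = np.flatnonzero(gaw[start:] < closed_frac * peak)
        if closure.size == 0:
            return None                       # folds never fully close
        return closure[0] / fps               # seconds

    # Hypothetical usage with a 4000-fps HSV recording:
    # gat = glottal_attack_time(gaw_segment, fps=4000.0)

A GOT estimate can be sketched symmetrically at a phonation offset; in either case, automated values should be validated against manual visual analysis, as was done in Study IV.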
Study V developed a lumped-element model that can determine the biomechanical characteristics of the VFs from an HSV running speech sample. A one-mass model was successfully implemented and simulated in this study, and an optimization procedure based on the particle swarm optimization technique was performed to fit the model parameters to the experimental HSV data in terms of the glottal area waveforms. The optimization procedure, with six parameters representing the main characteristics of the oscillatory behavior of the VFs, succeeded in determining biomechanical measures of the VFs (including VF mass, elasticity, and viscosity) and in generating a waveform that closely matched the HSV-based one. Although the model comprised only one mass with a limited degree of freedom, it was able to sufficiently emulate the VF vibrations observed in HSV. This work addressed an existing gap in the literature, as previous studies using this inverse-analysis technique neither simulated connected speech samples nor studied the impaired VF vibrations in AdLD. The study showed that even a simplified one-mass model can quantify biomechanical properties of the VFs using an HSV running speech sample. Overall, the successful implementation of the developed model paves the path toward a new line of future research in which it can be further developed and used for estimating clinically relevant, non-invasive biomechanical features of the VFs in connected speech. The inverse-analysis loop is illustrated in the sketch below.
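To illustrate that inverse-analysis loop, the sketch below implements a bare-bones global-best particle swarm optimizer and fits a six-parameter surrogate waveform to a target glottal area waveform by minimizing the mean-squared error. The surrogate (a decaying, rectified sinusoid), the parameter bounds, and the swarm settings are stand-ins assumed for the example; they are not the dissertation's one-mass model or its fitted biomechanical parameters.

    import numpy as np

    rng = np.random.default_rng(0)

    def surrogate_gaw(p, t):
        # Six illustrative parameters: baseline, amplitude, frequency (Hz),
        # phase, pulse skew, and decay rate.
        base, amp, freq, phase, skew, decay = p
        pulse = np.maximum(np.sin(2 * np.pi * freq * t + phase), 0.0) ** skew
        return base + amp * pulse * np.exp(-decay * t)

    def pso(objective, lo, hi, n_particles=30, n_iters=200,
            w=0.7, c1=1.5, c2=1.5):
        # Bare-bones global-best particle swarm optimization.
        dim = len(lo)
        x = rng.uniform(lo, hi, size=(n_particles, dim))
        v = np.zeros_like(x)
        pbest = x.copy()
        pbest_f = np.array([objective(p) for p in x])
        gbest = pbest[np.argmin(pbest_f)].copy()
        for _ in range(n_iters):
            r1, r2 = rng.random((2, n_particles, dim))
            v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
            x = np.clip(x + v, lo, hi)
            f = np.array([objective(p) for p in x])
            better = f < pbest_f
            pbest[better], pbest_f[better] = x[better], f[better]
            gbest = pbest[np.argmin(pbest_f)].copy()
        return gbest, float(pbest_f.min())

    # Synthetic target; in practice this would be the HSV-derived GAW.
    t = np.arange(400) / 4000.0                   # 400 frames at 4000 fps
    target = surrogate_gaw([0.05, 1.0, 110.0, 0.3, 1.5, 4.0], t)
    mse = lambda p: float(np.mean((surrogate_gaw(p, t) - target) ** 2))
    lo = np.array([0.0, 0.1, 60.0, -np.pi, 0.5, 0.0])
    hi = np.array([0.2, 2.0, 300.0, np.pi, 3.0, 20.0])
    best_params, err = pso(mse, lo, hi)

Swapping the surrogate for a numerical one-mass model turns the same loop into the kind of inverse analysis performed in Study V, with the optimized parameters interpreted as biomechanical properties such as mass, elasticity, and viscosity.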