EXPLORING AND ADDRESSING THE VULNERABILITIES OF MULTIMEDIA SERVICES OVER MOBILE NETWORKS: FROM DEVICES TO INFRASTRUCTURE

By Jingwen Shi

A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science - Doctor of Philosophy

2025

ABSTRACT

As mobile systems evolve from traditional telephony network architectures (e.g., 3G) to all-IP-based network architectures (4G, 5G, and beyond), the IP Multimedia Subsystem (IMS) was introduced to provide users with a variety of multimedia services, such as voice calls, video calls, SMS, and emergency communications. However, while IMS enriches daily communication over cellular networks, it also introduces new security threats to the mobile communication ecosystem. In this dissertation, we systematically investigate the vulnerabilities introduced by architectural shifts in mobile networks, spanning from user devices to network infrastructure: (1) on the device side, we analyze the negative impact of transitioning IMS client implementations from traditional hardware-based solutions (in cellular modems) to software-based applications on mobile phones. Our study reveals that this shift significantly expands the attack surface, enabling adversaries to hijack, spoof, or manipulate signaling and media data across various multimedia services; and (2) on the infrastructure side, we examine privacy leakage issues in voice calls over IMS. Although all voice packets and signaling messages are encrypted, the underlying transmission patterns remain observable, thereby compromising user privacy.

There are three key lessons learned from our study. First, current IMS standards lack robust security protections for IMS signaling routing on phones. As a result, ordinary socket-based interprocess communication can reach the IMS client from other processes within the same mobile system. This architectural gap enables malware to easily intercept or forge IMS signaling between the IMS client and the IMS server.
It enables attacks that can prevent mobile users from accessing multimedia services across all available radio access networks, including 4G, 5G, and Wi-Fi. It also allows adversaries to spoof SMS messages with arbitrary display names. Second, IMS video sessions lack encryption and integrity protection beyond the IP layer. As a result, even radio and IP layer protection cannot safeguard IMS video data on a compromised mobile device before the data is sent over the air. This opens the door for adversaries to hijack legitimate video streams. We demonstrate that an attacker can hijack video sessions as covert channels, completely bypassing operator-level monitoring and charging policies. Third, although 5G/4G voice calls are encrypted for security and privacy, we unveil that side-channel vulnerabilities persist. In particular, transmission patterns and signaling metadata can still leak sensitive information about 5G/4G call states. We demonstrate a Cross-domain Identity Linkage (CrossIL) attack that can link user identities to their cellular identities with a success rate of 89% to 98%, highlighting the need for deeper privacy-aware design in encrypted mobile voice services.

Building on our findings and lessons learned, we propose innovative countermeasures that not only address the identified security vulnerabilities but also pave the way for enabling more reliable and resilient multimedia services over mobile networks.

Copyright by JINGWEN SHI 2025

"Live yourself as a light, Because you don’t know, Who by thy light, Out of the darkness."
- Rabindranath Tagore

ACKNOWLEDGEMENTS

Pursuing a Ph.D. has been a long, challenging, and unforgettable journey, marked by moments of happiness, excitement, hope, and, at times, discouragement and daze. I am deeply grateful to the many individuals whose support, guidance, and companionship have shaped and sustained me throughout this endeavor.
First and foremost, I would like to express my sincere appreciation to my advisor, Dr. Guan-Hua Tu, for his unwavering support and mentorship. His encouragement led me to explore an entirely new domain in cellular networks. Under his guidance, I developed a deeper understanding of high-quality research practices, logical reasoning, and attention to detail. I am also grateful to Dr. Chi-Yu Li and Dr. Chunyi Peng for their invaluable feedback and guidance, particularly on academic writing. Collaborating with them has been a truly rewarding experience, and I am profoundly thankful for their mentorship. I would like to express my heartfelt gratitude to Yaron Koral, my mentor at AT&T Labs, for offering me a fresh perspective on the industry, as well as for his trust and encouragement.

I would also like to thank the members of my dissertation committee (Dr. Guan-Hua Tu, Dr. Zhichao Cao, Dr. Tianxing Li, and Dr. Yuying Xie) for their insightful feedback, constructive suggestions, and generous support throughout my research.

My heartfelt thanks go to Xitong Zhang, who has been a steadfast companion throughout the past twelve years of my academic journey. I am also thankful to Changhan Ge, Shufan Wang, and Yanbin Liu for their encouragement during my transition into a new research path, and for making my internship at AT&T Labs a memorable and enriching experience. I extend my gratitude to my labmates (Xinyu Lei, Tian Xie, Sihan Wang, Yiwen Hu, Minyue Chen, Height Yan, Yu-An Chen, Moyan Lyu, and Jared Singh Sekhon) for the camaraderie, discussions, and shared moments. I am equally appreciative of the friends I was fortunate to meet at MSU during my Ph.D. years, including (but not limited to) Guangliang Liu, Haitao Mao, Bocheng Chen, Zhiyu Xue, Lan Wang, Wei Ao, Wei Wang, Boyang Liu, Yunshi Liang, Mengying Sun, Deliang Yang, Haoyu Zheng, Tooba Nasir, and Catherine Mfinanga. What I have learned from all of you has shaped me into a more inclusive, patient, and determined person.
To those I may have unintentionally omitted, please accept my sincere apologies; you are no less appreciated.

Above all, I extend my deepest gratitude to my family, whose love, patience, and support have been the cornerstone of my journey. Their belief in me has carried me through every challenge, and for that, I am forever thankful.

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION
CHAPTER 2 BACKGROUND AND STATE-OF-ART
CHAPTER 3 SECURING IP MULTIMEDIA SUBSYSTEM (IMS) ON MOBILE DEVICES
CHAPTER 4 ENHANCING THE PRIVACY OF VOICE SERVICES OVER IMS INFRASTRUCTURE
CHAPTER 5 CONCLUSION
BIBLIOGRAPHY

CHAPTER 1 INTRODUCTION

Multimedia services such as voice calls, video calls, messaging, and emergency communication are vital to modern life. They play a crucial role in both everyday personal communication and life-saving public safety services, such as 911 calls. As technology continues to advance, the scope and impact of these services are expanding. Emerging application scenarios such as autonomous vehicles and the Internet of Things (IoT) increasingly depend on robust and reliable multimedia communication over mobile networks.

Parallel to this growth, mobile networks have evolved significantly from the early days of 3G to today’s 5G and the upcoming 6G. A central milestone in this evolution has been the architectural shift from circuit-switched networks to IP-based packet-switched infrastructures. In traditional circuit-switched networks, communication requires establishing a dedicated physical link between two endpoints before any data exchange can occur.
This model provides reliability but lacks scalability and efficiency. By contrast, modern packet-switched networks break data into discrete IP packets that are independently routed across IP-based networks. This paradigm shift enables more flexible and efficient communication and gave rise to an advanced multimedia service platform for mobile networks: the IP Multimedia Subsystem (IMS) [66]. IMS supports a wide range of media services by integrating key functionalities, including user authentication, session control, media processing, charging, and Quality of Service (QoS) enforcement. Though introduced during the 4G era, IMS is not confined to 4G networks. It is designed to support multimedia services across various access technologies, including 2G, 3G, Wi-Fi, the Internet, and landline phones, providing a unified service backbone regardless of the underlying connection.

Over the last two decades, IMS has gained widespread adoption across diverse access networks, including 4G LTE, 5G New Radio (NR), and Wi-Fi (through Voice over Wi-Fi, VoWi-Fi). As of April 2023, over 290 service providers across 235 networks had deployed IMS-based voice services. By the end of 2022, 4.7 billion subscribers relied on IMS for voice communication, and this figure is projected to rise to 7.5 billion by 2028, representing approximately 90% of all combined 4G and 5G subscriptions [60].

While IMS greatly enhances flexibility, scalability, and service reliability, it also introduces a range of complex security challenges. In response to these architectural shifts and emerging threats, this dissertation presents a comprehensive study of IMS-based mobile multimedia services.

1.1 Current Research Contributions

The current research contributions can be categorized into two primary domains: one focused on mobile devices and the other on network infrastructure.

Figure 1.1 An overview of current research contributions.

Specifically, my work spans two main research directions:
• Securing IP Multimedia Subsystem on Mobile Devices: focusing on vulnerabilities introduced at the mobile device.
• Enhancing the Privacy of Voice Services over IMS Infrastructure: addressing user privacy leakage and side-channel threats within the network infrastructure.

Securing IP Multimedia Subsystem on Mobile Devices. The IP Multimedia Subsystem (IMS) is a foundational framework for delivering multimedia services, such as voice and video calling, SMS, and emergency communication, across cellular networks. While its security mechanisms have been substantially strengthened over the past two decades, most of these enhancements have focused on the network infrastructure layer. Techniques such as mutual authentication via Authentication and Key Agreement (AKA), IPsec encryption, and STIR/SHAKEN caller ID verification provide strong protection within the operator’s domain. However, a critical blind spot remains: the mobile equipment (ME) itself. As smartphone architecture has evolved, shifting IMS functionality from secure hardware modems to more flexible but exposed application processors, device-side security standards have failed to keep pace. This discrepancy opens up new and under-explored attack surfaces.

This research direction presents the first comprehensive security analysis of IMS client behavior on modern smartphones. We identify four key vulnerabilities in current IMS implementations: unprotected signaling routing, unrestricted signaling sources, insecure video data delivery, and unauthorized use of ViIMS channels.
Based on these weaknesses, we design and demonstrate three proof-of-concept attacks: (1) the DoS-ALL attack, which denies IMS access across Wi-Fi and cellular networks; (2) the NameSpoofing attack, which fabricates sender names, bypassing carrier-level validation mechanisms; and (3) the ViIMS-Any attack, which exploits high-priority ViIMS channels. These attacks are experimentally validated on commercial smartphones with leading carriers in the U.S. and Taiwan. Our findings reveal a critical need to re-examine and strengthen mobile-side IMS security in light of architectural transitions. We propose a set of countermeasures to mitigate these emerging threats.

Enhancing the Privacy of Voice Services over IMS Infrastructure. Mobile voice communication remains a fundamental aspect of daily life, even as rich communication services proliferate over mobile broadband. With the transition to all-IP in 4G and 5G networks, voice services have evolved into Voice over IP Multimedia Subsystem (VoIMS), encompassing VoLTE and VoNR technologies. These services are deployed by hundreds of operators worldwide and are expected to support billions of devices in the near future. Security mechanisms for VoIMS are robust in design, leveraging layered encryption, mutual authentication via SIM-based keys, and standardized IPsec protection. However, while these protocols ensure encryption and integrity of signaling and media packets, they do not fully address emerging privacy risks arising from how voice traffic behaves under real-world conditions.

This research direction examines user privacy threats arising from network-level optimizations designed to enhance voice call performance. Techniques such as guaranteed-bit-rate bearers, Robust Header Compression (ROHC), Adaptive Multi-Rate (AMR) codecs, and comfort noise generation, though effective individually and collectively, create distinct and predictable traffic patterns.
We demonstrate that these patterns can be exploited to passively infer confidential information about encrypted calls, including call activity, call state, and even caller or callee identity, without decrypting any voice content. Through a series of proof-of-concept attacks, we show that adversaries can link users to specific cellular identities and mute call participants, posing serious privacy and integrity threats. Our contributions include an empirical analysis of new side-channel risks in VoIMS traffic and a standards-compliant mitigation strategy that addresses these weaknesses without sacrificing service quality.

1.2 Dissertation Structure

The structure of this dissertation is outlined as follows. Note that this dissertation does not cover in detail the projects on safeguarding cellular emergency service security and improving user authentication for Internet applications, because those works are not the author’s main contributions.

Chapter 2 provides the necessary background and reviews the state of the art. Section 2.1 introduces the evolution of multimedia services in mobile networks, with a focus on the architectural shift from circuit-switched to IP-based infrastructure. Section 2.2 first provides a technical primer on IMS in 4G and 5G networks, covering key components, security mechanisms, and signaling flows. It then introduces the system architecture and performance optimizations for IMS-based voice services in 4G and 5G. Section 2.3 surveys existing research on IMS security, identifying gaps in both device- and infrastructure-level protections.

Chapter 3 investigates the security vulnerabilities of IMS implementations on mobile devices. Section 3.1 presents the threat model and experimental setup. Sections 3.2 and 3.3 reveal two fundamental issues: unprotected routing of SIP signaling messages and insecure access to IMS media sessions. Section 3.4 discusses the security of modem-based IMS clients and iPhones.
Section 3.5 proposes a lightweight and standards-compliant solution that secures both signaling and media paths without modifying mobile infrastructure.

Chapter 4 focuses on privacy vulnerabilities in IMS-based voice services. Section 4.1 outlines the threat model and methodology. Section 4.2 introduces a traffic analysis-based side-channel attack that enables call inference, and Section 4.3 demonstrates proof-of-concept attacks capable of inferring call states and speaker identity. Section 4.4 discusses practical deployment considerations, and Section 4.5 presents our defense mechanism.

The final chapter, Chapter 5, summarizes the key contributions of the dissertation and outlines directions for future research in securing next-generation IMS and emergency services. Section 5.1 consolidates the research findings across the dissertation, emphasizing their significance for both academic and practical stakeholders. Section 5.2 summarizes the insights and lessons learned from our study. Section 5.3 introduces two key areas for future exploration: (1) investigating the security of Next-Generation 911 (NG911) services on mobile devices and (2) detecting and analyzing privacy leakage in IMS-based robocalls and IVR systems. These topics reflect the ongoing evolution of mobile services and the need for forward-looking security research.

The overarching goal of our research is to enhance the security of mobile multimedia technology, safeguarding network infrastructure, mobile equipment, and, ultimately, mobile users.

CHAPTER 2 BACKGROUND AND STATE-OF-ART

In this chapter, we introduce the evolution of mobile multimedia services, the network infrastructure and protocols, the IMS service signaling flows, and an IMS voice call primer relevant to this dissertation. We further present the related state-of-the-art studies.
2.1 The Evolution of Multimedia Services over Mobile Networks

We now present the evolution of mobile multimedia services, highlighting the historical timeline, associated standards, and key architectural shifts that have shaped the development of voice, video, and messaging services across generations of mobile networks.

2.1.1 An Overview of Evolution of Multimedia Services

Figure 2.1 An overview of the standard release timeline and the development of multimedia services.

The evolution of cellular technology [48, 130] has been marked by significant milestones, each generation (G) taking approximately a decade to develop, as shown in Figure 2.1. These advancements have not only transformed the way we communicate but have also played a pivotal role in the development of multimedia services. In this section, we will explore the key developments and standards that have shaped the cellular landscape, highlighting their impact on multimedia services.

Before 3G, cellular networks evolved from fragmented analog systems to standardized digital infrastructures, primarily supporting voice through circuit-switched technology. These early generations lacked efficient support for data services, limiting multimedia capabilities.

The Evolution of 3G (1998): The Split Architecture of Circuit-Switching and Packet-Switching. In December 1998, the 3rd Generation Partnership Project (3GPP) was established to develop a specification for a 3G mobile phone system building upon the 2G GSM system [131]. 3GPP is responsible for designing and developing standards not only for 3G cellular technology but also for all subsequent generations, including 4G and 5G cellular networks. The introduction of 3G marked a pivotal shift from circuit-switched technology to packet switching. Additionally, Release 5 of the 3GPP standards formalized the concept of the IP Multimedia Subsystem (IMS) [17], which laid the foundation for multimedia services integration within cellular networks.
As a result, data services such as web browsing and email became possible over mobile networks for the first time. However, multimedia services like voice and SMS continued to rely on legacy circuit-switched methods. This led to a split architecture: IP-based transport for data and circuit-switched infrastructure for multimedia services.

The Emergence of 4G (2008): Transition to a Fully Packet-Switched Network. With 4G, the mobile network fully transitioned to packet switching. More importantly, it introduced the IP Multimedia Subsystem (IMS), a standardized architectural framework specifically designed to deliver IP-based multimedia services over mobile networks. IMS represents a pivotal shift: voice and messaging are no longer confined to circuit-switched infrastructure. Instead, they are integrated into the packet-switched domain, enabling richer communication services and significantly greater scalability.

The Advent of 5G (2018): Enabling Cloud-Native Architectures. With the advent of 5G, IMS has evolved into a cloud-native architecture, delivering greater flexibility and enhanced performance. This transformation paves the way for real-time multimedia applications, including mission-critical use cases such as autonomous driving.

As we progress from 3G to the current 5G, the evolution of multimedia services within cellular networks is notable. With the transition to 3G and beyond, multimedia services have become increasingly integrated into the cellular landscape. We will delve into the details of these advancements and compare the differences in multimedia services from the 3G era to the present state of 5G. This comparison will shed light on the profound impact that each cellular generation has made on the development and delivery of multimedia services.

2.1.2 Multimedia Services through Circuit Switched Core Network in 3G

Figure 2.2 Architecture of 3G Mobile Networks.
3G introduced significant optimizations for data transmission, boosting the bit rate from kilobits per second (kbps) to megabits per second (Mbps). Whereas pre-3G mobile networks primarily provided voice and text services, 3G extended these capabilities to include multimedia services such as image and video transmission. We next introduce the 3G network shown in Figure 2.2, which consists of four parts: User Equipment (UE), radio access network, core network, and data network.

User Equipment (UE). The User Equipment (UE) serves as the entry point for mobile users into the cellular network. It encompasses two domains: the Mobile Equipment (ME) and the Universal Subscriber Identity Module (USIM). The ME is responsible for radio transmission and executing various applications on the mobile device. The USIM, on the other hand, consists of a standalone smart card that stores user information used for authentication purposes [18]. Together, the ME and USIM form the user’s interface with the network.

Radio Access Network (RAN). The Radio Access Network (RAN) is responsible for establishing and maintaining the wireless communication link between the user equipment (UE) and the core network. In 3G, the base station within the RAN is known as the NodeB. These base stations facilitate wireless communication through various air interfaces and are crucial for network coverage and capacity.

Circuit-Switched Core Network. For voice communication in 3G networks, the Circuit-Switched Core Network is employed. This network includes components such as the Media Gateway (MGW), Mobile Switching Center (MSC) server, and Gateway Mobile Switching Center (GMSC) server.
The MSC and GMSC servers primarily handle call control and mobility control functions. Under their control, the MGW establishes the bearer connection required for every voice session [38].

Packet-Switched Core Network. In contrast to circuit-switched voice communication, data transmission in 3G networks is achieved through the Packet-Switched Core Network, often referred to as the General Packet Radio Service (GPRS). Key elements in the GPRS include the Serving GPRS Support Node (SGSN) and Gateway GPRS Support Node (GGSN). These components are responsible for handling data transmission to external data networks, thus enabling mobile data services and internet connectivity [45].

Data Network. 3G networks support various types of data networks to accommodate different applications. For voice, two notable data networks are the Public Switched Telephone Network (PSTN) and the Integrated Services Digital Network (ISDN). PSTN is a traditional analog circuit-switched telephone network, whereas ISDN is a digital communication technology that offers faster data transfer rates, higher call quality, and the ability to handle multiple simultaneous connections. For data services, the Public Data Network (PDN) is utilized to provide data connectivity, including internet access, to mobile users. These data networks play a crucial role in supporting both voice and data services within the 3G cellular networks.

Understanding the core components of 3G networks is essential for comprehending the functioning of these mobile communication systems. From the UE to the various core network elements, each component plays a unique role in enabling voice and data services. In the subsequent sections, we will delve deeper into their evolution in successive generations of mobile networks.

2.1.3 Multimedia Services through Packet Switched Core Network in 4G and 5G

Figure 2.3 Architecture of 4G and 5G Mobile Networks.
Figure 2.3 illustrates the architecture and operations of 5G/4G networks for both the control plane and the user plane. From left to right, user traffic passes through several key components: the UE (User Equipment), the RAN (Radio Access Network), and the 5G/4G core network. Depending on the service, it then proceeds either to the Internet for mobile broadband access or to the IMS network for multimedia services (voice, video, 911 calls, SMS, and more). In the RAN, 5G employs the gNodeB and 4G uses the eNodeB as the Base Station (BS) to provide radio access to the UE. For the control plane, the Mobility Management Function (MMF) handles tasks like registration, authentication, IP connectivity management, and mobility management. The Home Environment (HE) serves as the repository for user data. In the user plane, Gateways (GWs) play a critical role in forwarding traffic and managing IP connectivity.

The IMS network consists of the Call Session Control Function (CSCF), Application Server (AS), and Home Subscriber Server (HSS). The three CSCFs are the Proxy-CSCF (P-CSCF), Interrogating-CSCF (I-CSCF), and Serving-CSCF (S-CSCF), which collectively manage the SIP signaling for initiating, maintaining, modifying, and terminating IMS services (e.g., an IMS call). The CSCFs route the SIP signaling and media data to the assigned Application Server (AS), which executes the application. The HSS is a database that contains subscription-related information to support user authentication and authorization. It also stores the subscriber’s location and IP information [23].
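The unified terms used above (BS, MMF, HE, GWs) each stand for different concrete components in 4G and 5G. The short Python sketch below writes that correspondence down as a lookup table. The component names are taken from the labels of Figure 2.3; the table name `UNIFIED_TERMS` and the helper `resolve` are hypothetical, introduced here purely for illustration and not part of any 3GPP specification or tool.

```python
from __future__ import annotations

# Unified term -> functionally equivalent components in each generation.
# Component names follow the labels of Figure 2.3; the table layout and
# the resolve() helper are illustrative assumptions.
UNIFIED_TERMS: dict[str, dict[str, list[str]]] = {
    "BS":  {"4G": ["eNodeB"],     "5G": ["gNodeB"]},      # base station
    "MMF": {"4G": ["MME"],        "5G": ["AMF", "SMF"]},  # mobility management
    "HE":  {"4G": ["HSS"],        "5G": ["UDM", "UDR"]},  # home environment
    "GWs": {"4G": ["SGW", "PGW"], "5G": ["UPF"]},         # user-plane gateways
}


def resolve(term: str, generation: str) -> list[str]:
    """Return the concrete components behind a unified term."""
    try:
        return UNIFIED_TERMS[term][generation]
    except KeyError:
        raise ValueError(
            f"unknown term or generation: {term!r}, {generation!r}"
        ) from None


if __name__ == "__main__":
    # Print one line per unified term with its 4G and 5G counterparts.
    for term in UNIFIED_TERMS:
        in_4g = "+".join(resolve(term, "4G"))
        in_5g = "+".join(resolve(term, "5G"))
        print(f"{term}: 4G {in_4g} | 5G {in_5g}")
```

For instance, `resolve("MMF", "5G")` returns the AMF and SMF, the two 5G functions that the figure groups under the unified term MMF.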
2.1.4 New Challenges in Mobile Network Evolution

While IMS significantly enhances flexibility, scalability, and service reliability, it also introduces a new set of complex security challenges. These challenges stem from four fundamental shifts in the design and operation of IMS-based networks: (1) the transition from isolated, hardware-based solutions to more accessible, software-based clients (e.g., IMS operating as an Android application); (2) the adoption of more flexible and extensible signaling protocols; (3) the migration of signaling from the secure control plane to the less protected data plane; and (4) the elevated prioritization of network resources for IMS traffic (i.e., IMS services versus general data services).

(1) Transition from Hardware-Based to Software-Based Clients. In legacy mobile systems, call and SMS functionalities were tightly integrated into the hardware modem. This hardware-based design offered strong isolation and minimal exposure to software-level threats. With the introduction of IMS, this model has shifted. Most IMS clients now operate as software applications within the mobile operating system, similar to conventional Android apps. Services such as SMS and video calls have also migrated to the software layer, while only voice communication remains in the modem for reasons of backward compatibility. This software-centric architecture significantly expands the attack surface, making client-side exploitation more feasible and increasing the potential for privilege escalation, spoofing, or data tampering.

(2) Flexible and Extensible Protocol Standards. Alongside the shift to software-based clients comes a second challenge: flexible signaling protocols.
In 2G and 3G networks, signaling relied on rigid, binary-encoded formats strictly enforced by hardware, which made spoofing and message fabrication exceedingly difficult. In contrast, IMS in 4G and 5G employs the text-based Session Initiation Protocol (SIP), which is highly flexible and extensible. However, this flexibility also lowers the barrier for attackers, who can now craft, intercept, or modify SIP messages with relatively little effort, increasing the risk of fraud and impersonation attacks. Additionally, IMS must support multiple standards for SMS, such as 3GPP (used by Verizon) and 3GPP2 (used by T-Mobile and AT&T), to ensure compatibility across networks. If even one standard contains a vulnerability, the requirement for cross-support can allow that flaw to impact users on otherwise unaffected networks, thereby amplifying the scope and complexity of potential attacks.

(3) Migration from Control Plane to Data Plane. A third challenge arises from the relocation of IMS signaling from the control plane to the data plane. In 2G and 3G, signaling took place over the control plane, which was safeguarded by well-tested security mechanisms such as authentication, encryption, and integrity protection. In 4G and 5G, however, IMS signaling moves to the data plane, which is optimized for performance but lacks equivalent security protections. As a result, these messages may be exposed to weaker or inconsistently configured defenses, increasing the likelihood of spoofing, interception, or unauthorized access. This architectural change introduces a trade-off between performance and security, where the latter is often compromised.

(4) Elevated Resource Priority for IMS Traffic. Finally, IMS services are assigned high-priority bearers with elevated QoS guarantees across the device, radio access network, and core infrastructure. These guarantees ensure low latency and high reliability for critical services like voice and emergency calls.
However, this prioritization also introduces new security vulnerabilities and gives attackers an incentive to abuse those resources. Adversaries who manage to impersonate an IMS client may gain preferential access to network bandwidth, effectively hijacking system resources. Such abuse can lead to service degradation for legitimate users and may even allow attackers to bypass traffic monitoring and policy enforcement mechanisms.

In light of these architectural changes and emerging security concerns, this dissertation undertakes a comprehensive study of IMS-based mobile multimedia services. The research spans mobile devices and network infrastructure, as well as protocol specifications and implementations. Our objective is to uncover previously unexamined vulnerabilities, demonstrate real-world attack scenarios, and propose practical, scalable defenses to enhance the overall security of IMS-enabled multimedia services.

2.2 5G/4G IMS Primer

In this section, we introduce the foundational concepts of 5G/4G network architecture, along with their associated security mechanisms and performance optimizations. For simplicity, we use a unified terminology to represent functionally equivalent components across 4G and 5G networks.

2.2.1 5G/4G IMS Architecture and Service Flows

In this section, we first present the necessary background on 5G/4G network architecture and its security measures. We then introduce the architecture and network stack on Mobile Equipment (ME), and finally present the IMS service flow.

Figure 2.4 5G/4G mobile network architecture and its security; the architecture and potential security vulnerabilities of ME.

5G/4G mobile network architecture. Figure 2.4(a) shows 5G/4G network architecture and its operations in both the control plane and the user plane. From right to left, user traffic traverses the UE (User Equipment), RAN (Radio Access Network), 5G/4G core network, and Internet (mobile broadband) or IMS (voice/text).
UE is the ME equipped with a valid USIM (UMTS Subscriber Identity Module); RAN uses 5G gNodeB or 4G eNodeB as the BS (Base Station) to provide radio access to the UE. In the control plane, MMF (Mobility Management Function) administrates registration, authentication, IP connectivity, and mobility, whereas HE (Home Environment) stores user data. In the user plane, GWs (Gateways) are used to forward traffic and manage IP connectivity. To offer guaranteed network performance for each UE, multiple IP flows are created and assigned distinct QoS levels. Specifically, one flow is established for mobile broadband service to the Internet, whereas two flows are created to support multimedia services (e.g., voice and video calls) offered by the IMS: one for signaling and the other for media traffic; they are managed by IMS signaling servers and media gateways, respectively. The IMS signaling uses Session Initiation Protocol (SIP) [50] and the media traffic is transported over Real-Time Transport Protocol (RTP) [23, 39].

[Figure 2.4 panels: (a) 5G/4G mobile network architecture. (b) 5G/4G security architecture. (c) ME architecture. Unified terms used in the figure: MMF (4G: MME; 5G: AMF+SMF), HE (4G: HSS; 5G: UDM+UDR), BS (4G: eNodeB; 5G: gNodeB), Gateway (4G: SGW+PGW; 5G: UPF).]

5G/4G security architecture. Figure 2.4(b) shows that 5G/4G uses a multi-layer security architecture with three strata: application, serving, and transport.
The security functions are divided into four domains [43, 44]: (I) network access domain, which ensures mutual authentication between the core network and the ME, as well as secure service access; (II) network domain, which guarantees secure communication among network entities; (III) user domain, which secures communication between the ME and the USIM; (IV) application domain, which protects message exchanges between ME applications and network servers (e.g., IMS). This security architecture shows that the access between the applications and the ME is not explicitly protected.

ME architecture. Figure 2.4(c) illustrates the ME architecture, which includes both software and hardware components, with Android phones serving as examples. The ME software includes the OS, applications, and the user interface. The applications can be classified into IMS and non-IMS types with different protocol stacks on top of the Linux kernel; specifically, each IMS application serving as an IMS client runs on the Telephony Framework and the RIL (Radio Interface Layer) for IMS functionalities. The ME hardware contains two major components. One is an application processor supporting the ME software, whereas the other is the cellular modem offering cellular connectivity and cellular-related services. The modem mainly contains cellular L1/L2 protocols, including PHY, MAC, RLC (Radio Link Control), and PDCP (Packet Data Convergence Protocol) for both 5G and 4G networks, as well as SDAP (Service Data Adaptation Protocol) for the 5G network only. Moreover, it contains a function, TFT (Traffic Flow Template), which associates packets with each specified IP flow based on the 5-tuple (source/destination IP addresses, source/destination port numbers, protocol ID) so that the corresponding routing and QoS policy can be applied [19].

IMS service flows. Figure 2.5 depicts IMS service flows for text, voice, and video services.
To access an IMS service, the UE needs to perform three actions. First, IP Connectivity Establishment [23] is performed to obtain IP connectivity for communicating with the IMS server. Second, IMS Service Registration [23] is performed, which not only registers the UE with the IMS server but also mutually authenticates them. It uses the SIP Registration procedure with IMS-AKA (Authentication and Key Agreement) [41]. When the IMS signaling security is enabled, IPsec SAs (Security Associations) between the IMS server and the UE are established during the registration. Third, the UE carries out IMS Service Session Establishment to establish an IMS service session with another UE [23, 9] using SIP. The IMS text and call services have different establishment procedures, which begin with the SIP MESSAGE and SIP INVITE messages, respectively.

Figure 2.5 IMS service flows.

In particular, to ensure carrier-grade IMS service quality, the IMS signaling with SIP messages and the IMS media traffic with RTP/RTCP packets are both prioritized over the traffic of mobile data services. Specifically, the QoS levels assigned to mobile data services are best-effort transmission, with priority indexes ranging from 8 to 9 [34]. In contrast, those assigned to IMS signaling and media traffic are best-effort transmission with a priority of 1 (smaller values indicate higher priority) and guaranteed-bit-rate transmission, respectively.
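The 5-tuple matching performed by the TFT and the QoS levels described in this section can be sketched together as follows. This is a minimal illustration, not the modem's implementation: the class names, the filter list, the IP addresses (taken from the 2001:db8::/32 documentation prefix), and the use of port 5060 for SIP are all assumptions made for the example. The priority of the media flow is left unset because the text above states only that it is a guaranteed-bit-rate flow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Packet:
    """The 5-tuple the TFT inspects."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str


@dataclass(frozen=True)
class TftFilter:
    """One hypothetical TFT filter; a field left as None is a wildcard."""
    flow: str
    src_ip: Optional[str] = None
    dst_ip: Optional[str] = None
    src_port: Optional[int] = None
    dst_port: Optional[int] = None
    protocol: Optional[str] = None

    def matches(self, pkt: Packet) -> bool:
        pairs = [
            (self.src_ip, pkt.src_ip),
            (self.dst_ip, pkt.dst_ip),
            (self.src_port, pkt.src_port),
            (self.dst_port, pkt.dst_port),
            (self.protocol, pkt.protocol),
        ]
        return all(want is None or want == got for want, got in pairs)


# Hypothetical addresses, for illustration only.
IMS_SERVER = "2001:db8::10"
MEDIA_GATEWAY = "2001:db8::20"

FILTERS = [
    TftFilter(flow="ims-signaling", dst_ip=IMS_SERVER, dst_port=5060),
    TftFilter(flow="ims-media", dst_ip=MEDIA_GATEWAY, protocol="UDP"),
]

# QoS per flow as described in the text; None means "not stated there".
QOS = {
    "ims-signaling": {"type": "best-effort", "priority": 1},
    "ims-media": {"type": "guaranteed-bit-rate", "priority": None},
    "internet": {"type": "best-effort", "priority": 9},
}


def classify(pkt: Packet, filters=FILTERS, default: str = "internet") -> str:
    """Return the IP flow of the first filter that matches the packet."""
    for tft_filter in filters:
        if tft_filter.matches(pkt):
            return tft_filter.flow
    return default


sip = Packet("2001:db8::1", IMS_SERVER, 5060, 5060, "UDP")
rtp = Packet("2001:db8::1", MEDIA_GATEWAY, 50000, 50002, "UDP")
web = Packet("2001:db8::1", "2001:db8::99", 40000, 443, "TCP")
for pkt in (sip, rtp, web):
    flow = classify(pkt)
    print(flow, QOS[flow])
```

Note that the classifier looks only at packet headers. Nothing in the 5-tuple identifies which application produced the packet, which is why the source of IMS traffic matters for the vulnerabilities studied later.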
[Figure 2.5 message flows among the caller, the 5G/4G core network, the IMS server, and the callee: IP Connectivity Establishment; IMS Service Registration (SIP REGISTER, 401 Unauthorized, SIP REGISTER, 200 OK); IMS Service Session Establishment, with Case 1: Text over IMS (SIP MESSAGE, 202 Accepted, SIP MESSAGE, 200 OK) and Case 2: Voice/Video Call over IMS (SIP INVITE, 100 Trying, Session Progress, 180 Ringing, 200 OK, followed by the voice/video conversation over RTP/RTCP packets).]

2.2.2 IMS Voice Service Primer

In this section, we introduce the architecture and optimizations of the 5G/4G voice service (i.e., VoIMS). VoIMS is an essential VoIP-based voice solution for 5G/4G networks [69]. Figure 2.6 depicts its network architecture, main protocols, and a basic workflow.

Figure 2.6 Network architecture, main protocols and an operation flow for 5G/4G voice over VoIMS.

5G/4G network architecture supporting VoIMS. It comprises two parts: the 5G/4G network infrastructure and the IMS domain. The former provides User Equipment (UE, e.g., mobile phones) with active mobile connections (user-plane data pipes) to deliver user traffic over IP within 5G/4G networks. User traffic packets in turn traverse the UE, the base station, and the gateways in the core cellular network to reach the external Internet or the IMS domain (for 5G/4G voice), or vice versa. The IMS domain comprises two key components: the media gateway and the signaling server. The former delivers IP multimedia data (e.g., voice packets) to IMS clients (e.g., UEs); the latter processes all signaling messages, which are used to establish and manage voice call sessions over IP.

Main protocols for VoIMS. The main protocols above IP are Session Initiation Protocol (SIP) and Real-time Transport Protocol (RTP). SIP is used for voice signaling to initiate, maintain, modify, and terminate voice calls over IP. RTP transmits a live multimedia stream over IP. VoIMS adopts the same protocols as VoIP. Below IP, the main protocol is PDCP [46]. It performs three main functions within 5G/4G networks.
First, it compresses the IP headers of user-plane data packets to improve transmission efficiency over the air. Second, it encrypts and decrypts user-plane data, namely, IP packets. The session keys are generated through 5G/4G security functions in the control plane [44, 43]. Third, it dispatches the upper-layer data to their corresponding radio bearers: Dedicated Radio Bearer (DRB) and Signaling Radio Bearer (SRB). DRB is used to carry traffic in the user plane, and SRB is for the 5G/4G signaling in the control plane. PDCP is the only Layer-2 protocol studied in this work because PDCP wraps other lower L2/L1 protocols to offer a user-plane pipe for IP packet delivery. Conceptually, there is no difference between 5G and 4G except that 5G supports varying QoS settings for distinct IP data flows [33].

[Figure 2.6 labels: the UE protocol stack (SIP and RTP over IP over encrypted PDCP); the IMS domain with the signaling server and the media gateway; the 4G gateways and the 5G UPF (similar to the 4G gateways); the VoIMS call flow steps (0) establish a 4G/5G user-plane pipe, (1) establish a VoIMS call (via SIP bearer), (2) run a VoIMS call session (via RTP bearer), and (3) terminate a VoIMS call (via SIP bearer); and the VoIMS enhancements E1 to E4.]

VoIMS call flow. As illustrated in Figure 2.6, a VoIMS call typically takes three steps: establishment (1), call conversation (2), and termination (3), provided that a 5G/4G user-plane pipe below IP is available. Otherwise, the UE first establishes this pipe (0). This pipe is encrypted using the keys derived from the mutual authentication between the UE and the network. A VoIMS call session is established by SIP signaling; the establishment starts when someone dials a phone number to generate a call request and ends when the call request is accepted by the other call party (1). A call conversation is then carried over this established call session (2).
The voice call application uses a speech audio codec to convert voice traffic into a digital format, which is later delivered by RTP. To end the call, SIP is used again to terminate the VoIMS call session (3). Both RTP and SIP packets are further encapsulated into IP packets for delivery. Specifically, they are forwarded to the IMS through the user-plane (PDCP) pipe provided and encrypted by 5G/4G networks.

Voice enhancement techniques. As illustrated in Figure 2.6, four techniques have been introduced by 3GPP to enhance the quality and efficiency of VoIMS services. From the bottom up, they are special radio bearers (E1), ROHC (E2), comfort noise (CN) (E3), and AMR speech codecs (E4). VoIMS uses a special DRB with a guaranteed bit rate to ensure sufficient radio bandwidth for voice [69]; ROHC compresses the headers of VoIMS packets to reduce transmission overhead [46]; CN injects some background noise to prevent an unexpected call termination caused by a period of total silence [25]; AMR speech codecs (e.g., AMR [25], EVS [28]) offer adaptive rates for voice speech, and VoIMS uses a lower coding rate for unvoiced packets that carry background noise. These techniques indeed enhance 5G/4G voice quality and efficiency. However, we find that, taken together, they bring previously unreported side effects that leak confidential call information despite encryption.

2.3 State of the Art on IMS over Mobile Networks

Many studies have explored the security issues of IMS services on both mobile equipment and mobile network infrastructure.

Mobile Equipment. The IMS security of the ME has attracted much attention recently. The related studies can be classified into two directions, namely, IMS service abuse and DoS attacks. In the first direction, [123] studies the insecurity of the IMS-based SMS and then uncovers the corresponding SMS abuse and spoofing attacks.
[84] compromises the phone modem to abuse the IMS voice session to transmit malicious data. [59] defends against caller-ID spoofing by verifying the caller’s call state based on a callback. The other direction focuses on DoS attacks against IMS services. Specifically, [99] hijacks the VoWiFi signaling session to launch stealthy IMS call DoS attacks based on an insecure design of the call state machine. [92] spams the voice bearer to launch a DoS attack by muting an ongoing VoLTE call. [76] presents several vulnerabilities of the IMS service, including an improper cross-layer security binding, that enable DoS attacks on the cellular emergency service against anonymous UEs. [140] introduces side-channel inference techniques to identify specific IMS call signaling messages and launch DoS attacks on the IMS service over Wi-Fi.

Network infrastructure. Several works focus on the insecurity of the IMS server deployed in the cellular network infrastructure. They can be classified into two categories. First, two studies [101, 120] investigate potential flooding and DoS attacks against the IMS server. Specifically, one [101] shows that the adversary can flood the IMS server with SIP registration messages, consuming extra CPU processing power on the server. The other [120] shows that abrupt changes in the content of SIP session requests, as well as in the SIP message sequence, can be used as features for detecting IMS flooding. Second, three research works [127, 114, 53] attack the authentication and privacy of IMS sessions at the IMS server. They observe that differentiated call response times can be used to identify cellular IoT devices, introduce an attack that eavesdrops on the victim’s VoLTE call based on an implementation flaw of reusing the network key stream, and uncover that the weak requirement of network certification in the standard may cause the leakage of the IMSI/APN information for a UE involved in a VoWiFi call, respectively.
Over recent years, the realm of 4G/5G voice security on the network infrastructure side has garnered increasing attention from the research community [81, 91, 98, 105, 57, 58, 75, 113]. Previous investigations have predominantly delved into various security challenges, including aspects such as VoLTE call reliability [81], unauthorized data access via VoLTE signaling [91], caller spoofing attacks [85], Denial of Service (DoS) attacks [98, 58], 911 call security [75], and overall security analyses [57]. Intriguingly, a study [58] enabled voice monitoring and harnessed vulnerabilities in 5G standalone networks (specifically, EPS fallback) while also exposing encryption algorithm insecurities in 2G GSM networks. A separate recent investigation suggested that encrypted packets of VoLTE calls might be susceptible to decryption [113]. This claim hinged on the reuse of encryption keys across different VoIMS calls by the same mobile user, despite such key reuse being explicitly prohibited by 3GPP standards [32].

CHAPTER 3
SECURING IP MULTIMEDIA SUBSYSTEM (IMS) ON MOBILE DEVICES

The IP Multimedia Subsystem (IMS) delivers IP multimedia services, such as voice/video calling and texting, to mobile users over cellular networks. In the past two decades, IMS services have been augmented to support various access networks, incorporating VoLTE (Voice over LTE), VoNR (Voice over New Radio), and VoWi-Fi (Voice over Wi-Fi). IMS security is also enhanced with a suite of well-examined mechanisms, including 5G/4G AKA (Authentication and Key Agreement), cellular-specific multi-layer security, and IMS media security. Specifically, secret keys required for IMS sessions [40] are derived from the AKA mutual authentication, wireless transmission in the air (Layer 2) is encrypted using the derived keys, and IP session (Layer 3) is secured by Internet Protocol Security (IPsec) [42].
Moreover, network operators enforce additional measures such as STIR/SHAKEN [129], which the FCC requires for caller ID authentication, to protect IMS services from malicious attacks.

However, these security enhancements are primarily centered on cellular network infrastructure. Our security analysis reveals that security measures on the mobile equipment (ME) side have remained relatively unchanged over the years, even though the ME itself has advanced considerably; for example, smartphone vendors have migrated the IMS client from 5G/4G modem chips to application processors and separated IMS voice and video media processing between modem chips and application processors. Unfortunately, we find that 3GPP-mandated IMS security measures on the ME side fail to keep pace with device-side technological advances, resulting in new security vulnerabilities and unprecedented attacks.

Our security analysis on ME shows that neither IMS media sessions nor their control signaling is well protected. Specifically, we discover four new vulnerabilities: (V1) unprotected IMS signaling routing, (V2) unrestricted IMS signaling source, (V3) unprotected video data delivery, and (V4) unrestricted source for IMS video delivery. Details are elaborated in §3.2 and §3.3. By exploiting these vulnerabilities on ME, we further develop three proof-of-concept attacks against IMS services: (A1) Denial of Service over All Networks (DoS-ALL), (A2) Named SMS Source Spoofing (NameSpoofing), and (A3) Covert Communications over Video-over-IMS (ViIMS-ANY).

Table 3.1 Summary of four vulnerabilities and three proof-of-concept attacks in this work. Note: † ViIMS experiments were conducted in US-I and US-II because US-III supports ViIMS only with very limited phone models; TW-I and TW-II do not support ViIMS yet.

Category: Unprotected ME Routing for IMS Client Signaling (§3.2)
  Vulnerabilities:
    V1. Unprotected IMS Signaling Routing: ME does not ensure that all outgoing IMS signaling messages are sent to the IMS servers deployed by network operators; routing to malicious programs at the ME is allowed. (§3.2.1)
    V2. Unrestricted IMS Signaling Source: ME does not protect IMS client software from receiving IMS signaling messages originated from non-IMS servers (say, local apps). (§3.2.2)
  Attacks:
    [A1] DoS-ALL, a novel DoS attack that prevents IMS clients from using all access networks over Wi-Fi, 4G LTE, 5G NR. (§3.2.3.1)
    [A2] NameSpoofing, an SMS spoofing attack that fabricates the sender’s name, which is prohibited by the network. (§3.2.3.2)
  Empirical validation:
    Carrier: US-I, US-II, US-III, TW-I, TW-II
    Device: LG (G3, G7), TCL (40 XL), Samsung (S8, S10, S21)
    OS: Android 4.4.2, 7, 8, 9, 11, 13

Category: Insecure ME Access for IMS Media Sessions (§3.3)
  Vulnerabilities:
    V3. Unprotected Video Data Delivery: The IMS media transmission between IMS client and cellular network modem is not provided with confidentiality and integrity protection. (§3.3.1)
    V4. Unrestricted Source for IMS Video Delivery: Cellular network modem cannot verify whether IMS video data is transmitted by IMS clients or other non-IMS applications. (§5.2)
  Attacks:
    [A3] ViIMS-ANY, an attack that abuses ViIMS as a covert communication channel between two malicious MEs, bypassing operator policies. (§3.3.3)
  Empirical validation:
    Carrier: US-I†, US-II†
    Device: LG (G3), Samsung (S8, S10), Google (Pixel 1/3/5/7)
    OS: Android 4.4.2, 7, 8.1, 10, 13

The first attack, DoS-ALL, prevents the victim phones from accessing IMS services over all access networks, including Wi-Fi, 4G LTE, and 5G NR. It is more threatening than any previously reported DoS attack: it not only denies IMS service access over Wi-Fi networks but also prevents access through all alternative cellular networks. The second attack, NameSpoofing, creates a fake short message with a fabricated sender name, which the cellular infrastructure prohibits mobile users from doing.
Figure 3.1 gives an illustrative example where NameSpoofing is successfully launched on our lab smartphone and the victim receives a message from “Mark Zuckerberg verified by Verizon”.

Figure 3.1 A successful NameSpoofing attack.

Unlike the existing SMS spoofing attacks, NameSpoofing is much more threatening because it fabricates the sender name instead of the phone number. Note that network operators do not allow SMS users to fabricate the sender’s name (here, Mark Zuckerberg) even when the phone number is spoofed; more importantly, “verified by Verizon” cannot be added to the sender’s name unless Verizon authenticates that the sender number is not spoofed and is truly used by Mark Zuckerberg. It is much harder for victims to recognize smishing/phishing attacks, particularly when the names are “verified” by network operators. In comparison to fake Amber/Wireless alert attacks [90, 54], which primarily target emergency scenarios, this attack is applicable to a broader range of scenarios. The third attack, ViIMS-ANY, abuses ViIMS, which is designated for delivering video calls over IMS. In this attack, two adversary MEs use ViIMS for arbitrary data communication, which obtains guaranteed bit rates and a higher service priority than normal data services are entitled to. As such, ViIMS-ANY bypasses data service policies enforced by operators.

Table 3.1 summarizes the new vulnerabilities and attacks, which are experimentally validated using commodity phones with three top-tier U.S. carriers and two major operators in Taiwan. We further propose countermeasures to address the identified vulnerabilities and evaluate their effectiveness (§3.5).

3.1 Threat Model and Methodology

To support multimedia services, the IMS client is designed differently from the traditional client, which offers circuit-switched call and text services. It contains control-plane and data-plane operations.
For the control plane, the IMS client is expected to support various multimedia services and to be dynamically updatable, so most phone models use a software-based design in which the client functions as a mobile application. Compared with the traditional client implemented in the phone modem, it has a larger attack surface and may thus be more vulnerable. It may suffer from the hijacking of the IMS signaling session due to the acquisition of root privilege [99, 84] or from the delivery of spoofed signaling messages (i.e., SMS) given unprotected packet routing. In §3.2, we focus on the latter security threat, which has not been explored.

For the data plane, there are currently two major multimedia services: Voice over IMS (VoIMS) and ViIMS. These two services are supported in different ways according to their different processing resource requirements. The voice data of VoIMS is processed by the phone modem; thus, the corresponding voice packets cannot be captured in the mobile OS. They are inherently protected by the hardware security of the modem. Although the modem can still be compromised by some specialized tools (e.g., QXDM [108]), thereby causing its voice session to be hijacked [84], the security threat is limited since the assumption that attackers can do so is too strong to be practical. However, because video requires far more processing resources, a multi-core application processor is deployed to process the video data of ViIMS. This new component might lack the conventional hardware security protection from the modem, thus broadening the attack surface. This motivates us to investigate the security of the IMS data-plane framework in §3.3.

We next present the threat model and experimental methodology, ethical considerations, as well as responsible disclosure.

Threat model.
For the control-plane security threats of IMS services in §3.2, victims are mobile users with a subscription to operational IMS services, whereas the adversary develops a malware application and installs it on the victims’ MEs; notably, malware can propagate in many ways [106], which is not our focus. The malware application does not require any root privileges. Specifically, the DoS-ALL attack (A1) encompasses two attack scenarios with distinct requirements. First, the adversary compromises the Wi-Fi router that the victim ME connects to and assigns the ME a Wi-Fi configuration during the initial Wi-Fi association. Such a malicious router can be deployed in public areas (e.g., cafes, airports, and restaurants). In this case, the malware requires at most the INTERNET permission. More threateningly, if the victim’s ME supports IPv6, the malware is not needed. Second, in case the victim ME is not trapped in the malicious Wi-Fi network, the malware is required and needs not only the INTERNET permission but also the BIND_VPN_SERVICE permission. As for the NameSpoofing attack (A2), the malware requires the INTERNET permission, and the BIND_VPN_SERVICE permission is also needed for victim MEs running Android 9 or higher.

For the data-plane security threats in §3.3, victims are mobile operators, and adversaries are mobile users abusing IMS video channels. For the ViIMS-ANY attack (A3), it is assumed that the adversary can install a malware application with root privileges on their own ViIMS-supported MEs. This assumption is practical since the compromised MEs are attack devices held by the adversary for attacking the infrastructure. Notably, only the ME software is compromised; the other components, including the ME hardware, are not.

Methodology. To validate the presented security threats, we conduct experiments in the networks of three top-tier U.S.
carriers and two Taiwan carriers, which are denoted as US-I, US-II, US-III, TW-I and TW-II for privacy reasons. We mainly focus on 4G networks and 5G NSA (Non-Standalone) networks, since the 5G SA (Standalone) network has not been widely deployed yet. In total, we test 10 carrier-certified COTS (Commercial Off-The-Shelf) phone models, including LG G3/G7, Samsung S8/S10/S21, Google Pixel 1/3/5/7, and TCL 40XL, from four major brands. Their Android versions range from 4.4.2 to 13. We mainly choose Android phone models because Android holds the largest share, 71.8% [1], of the worldwide mobile OS market.

Ethical consideration. We bear in mind that some feasibility tests and attack evaluations may harm mobile users or carriers, so we conduct all the experimental studies in two responsible ways. First, we use only our own phones as the victim UEs. Second, we purchase unlimited plans for the text, call, and data services on all the tested phones. Notably, we do not seek to cause any unnecessary damage but rather to disclose potential security threats in operational mobile networks.

Responsible disclosure. We have reported all the identified vulnerabilities to the parties involved, including mobile OS vendors, phone manufacturers, and carriers. The proposed remedies have also been provided to them.

3.2 Unprotected ME Routing for IMS Client Signaling

The ME routing requirement for the IMS client signaling seems simple and easy to fulfill, but it may not be restricted or protected from a security perspective. Specifically, the requirement needs to cover the delivery of both incoming and outgoing IMS signaling messages; the incoming ones shall originate from the IMS server and be delivered to the IMS client, whereas the outgoing ones shall be sent in the opposite direction.
The other routing rules shall be prohibited; otherwise, some security threats may occur, e.g., the IMS server or the IMS client is spammed/spoofed by a third-party entity, and the IMS signaling session is hijacked by a man-in-the-middle (MiTM) attack. However, we discover that the routing rules for the IMS client signaling on the ME are not restricted exclusively to this routing requirement; that is, the IMS signaling messages may be received from non-IMS parties or be maliciously routed to them. In the following, we present the corresponding two security vulnerabilities, namely (1) unprotected IMS signaling routing and (2) unrestricted IMS signaling source, and introduce two proof-of-concept attacks to show the real-world impact.

3.2.1 V1. Unprotected IMS Signaling Routing

In the mobile OS, a network interface is created for the exclusive use of the IMS service (e.g., “rmnet_data0”), designated as an IMS interface, and is associated with a set of routing rules to route the IMS signaling. According to our investigation on Android OS with versions from 4 to 13, the IMS signaling routing is implemented by two components: RPDB (Routing Policy Database) and iptables. The RPDB defines the priority of routing policies, as shown in Figure 3.2, and each routing policy specifies a rule matching a routing table managed by the iptables. For example, the highest priority is the rule at the topmost line, “from all lookup local”, which means looking up the “local” routing table for the packets from “all” sources. For each packet, the priority list is searched in order, and the first matched rule is applied.

Figure 3.2 Routing Policy Database (RPDB).

To support the IMS signaling over the IMS interface (e.g., “rmnet_data0”), there are two approaches observed.
First, in older Android versions, the IMS server address is explicitly specified in a routing policy of the RPDB, and the policy is to look up the routing table of the IMS interface. Second, in newer Android versions, the packets generated by the IMS client are identified by a framework mark [4] that is assigned to the client, and the mark is associated with the routing table in a routing policy. For example, the policy set to route IMS signaling packets in the RPDB shown in Figure 3.2 is the one with the framework mark (i.e., “fwmark”) “0x10fa4/0x1ffff”: all the packets carrying this mark are routed by looking up the “rmnet_data0” routing table. Seemingly, the routing of the IMS signaling is secure, since non-IMS applications without root privilege are not allowed to modify the RPDB and routing tables. However, we discover that it is still possible to add a routing rule that matches and routes the IMS signaling packets before they are matched with the IMS routing policy, using operations that normal applications without root privilege are allowed to perform. For example, activating the VPN service on an Android phone is allowed to create a virtual interface (e.g., “tun0”) and assign the interface an address; connecting the phone to a Wi-Fi network is allowed to assign an IP address to the Wi-Fi interface, which is usually given by the DHCP service of the Wi-Fi network. Once the adversary can compromise
Once the adversary can compromise 26 0:from all lookup local 10000:from all fwmark0xc0000/0xd0000 lookup legacy_system10500:from all iiflo oifdummy0 uidrange0-0 lookup dummy0 10500:from all iiflo oifrmnet_data0 uidrange0-0 lookup rmnet_data0 10500:from all iiflo oifrmnet_data1 uidrange0-0 lookup rmnet_data1 10500:from all iiflo oifswlan0 uidrange0-0 lookup local_network13000:from all fwmark0x10063/0x1ffff iiflo lookup local_network13000:from all fwmark0x10fa4/0x1ffff iiflo lookup rmnet_data0 13000:from all fwmark0xd0254/0xdffff iiflo lookup rmnet_data1 Highest Priority Routing Policy (a) Before activating the malicious VPN service. (b) After activating the malicious VPN service. Figure 3.3 The routing rules of the ’local’ routing table. Figure 3.4 Failing to send SMS messages, there are no responses received from the IMS server. a VPN application or a Wi-Fi network, the assigned IP address can be set to the IMS server’s IP address and the IMS signaling can be thus routed to a compromised network interface, instead of the IMS interface. Experimental validation. We validate this vulnerability by developing a VPN application that assigns the IMS server’s IP address to the virtual interface. The experiment is conducted across three U.S. carriers and two Taiwan carriers. For each tested phone, we activate the VPN service using the developed application, and then send one SMS message to another phone number using the GUI of the SMS service. For all the tested phones, we observe that the VPN application can successfully assign the IMS server’s IP address to its established virtual network interface, and its routing information is updated in the “local” routing table, as shown in Figure 3.3. Moreover, the outgoing IMS signaling messages that carry SMS ones are routed to the VPN interface (i.e., “tun0”) instead of the IMS interface. 
This misrouting causes the delivery of the SMS messages to fail, and no responses from the IMS server are received, as shown in Figure 3.4, for all the phones except Google Pixel ones.

[Figure 3.3 content, entries of the “local” routing table.
 (a) Before opening the VPN interface:
  local ::1 dev lo proto kernel metric 0 pref medium
  local 2600:1007:110b:9b03:c923:69ce:83f:d04d dev rmnet_data0 proto kernel metric 0 pref medium
 (b) After opening the VPN interface (the VPN is using the IMS server IP):
  local ::1 dev lo proto kernel metric 0 pref medium
  local 2001:4888:2:fe40:a0:104:0:232 dev tun0 proto kernel metric 0 pref medium
  local 2600:1007:110b:9b03:c923:69ce:83f:d04d dev rmnet_data0 proto kernel metric 0 pref medium]

[Figure 3.4 annotation: retransmission until timeout since the IMS client receives no response.]

Figure 3.5 Pixel phone modem’s extended debug messages collected via the QXDM [108].

This vulnerability does not affect the Google Pixel phones because they use the IMS client supported in the phone modem to access IMS services, instead of the software-based IMS client in the Android OS, which is employed by the other tested phones. The modem-based IMS client can be accessed by Android applications via QMI (Qualcomm MSM Interface) [134], which is a proprietary interface for interacting with Qualcomm baseband processors. For example, to initiate a VoIMS call, the call application of a Pixel phone sends its modem a QMI_VOICE_DIAL_CALL_REQ message with a specified calling number and the call type (e.g., emergency and auto-selected).
The modem then starts the call setup procedure by transmitting a SIP INVITE message to the cellular infrastructure without involving the Android OS, as shown in Figure 3.5. Thus, this modem-based operation makes the Pixel phones immune to V1.

Notably, common VPN applications cannot intercept the IMS signaling without V1. Although they handle most data transmissions on the phones, and malicious ones may cause severe attacks on them, the VPN data transmissions do not cover the IMS signaling transmitted over 4G/5G networks. The importance of V1 lies in its ability to allow malware to intercept IMS signaling messages across all radio access networks, creating a new attack surface.

Root cause and lesson learned. This vulnerability arises from a conventional function (i.e., packet routing) on phones, but the root cause is still a design issue in the IMS standard; that is, there is a lack of security protection over the IMS signaling routing on the phones. The mobile OS has fulfilled the requirement of routing all the packets generated from the IMS client to the IMS server, and this routing policy cannot be modified without root privilege. Without an explicit security requirement on the IMS signaling routing in the IMS standard, the mobile OS should not take the blame. To address this vulnerability, a new security mechanism is needed to prevent any potential IMS-related policies or rules from nullifying the actual IMS routing policy.

3.2.2 V2. Unrestricted IMS Signaling Source

SA   Protocol   IMS Client IP   IMS Client Port   Direction   IMS Server IP   IMS Server Port
1    TCP        IP_A            Server Port A     ↔           IP_B            Client Port B
2    TCP        IP_A            Client Port A     ↔           IP_B            Server Port B
3    UDP        IP_A            Server Port A     ↔           IP_B            Client Port B
4    UDP        IP_A            Client Port A     ↔           IP_B            Server Port B

Table 3.2 Four IPsec SAs needed for IMS services.

When IMS signaling security is enabled, the IPsec SAs between the IMS client and the IMS server will be established during the IMS registration procedure [50].
The number of the established IPsec SAs can be up to four, as shown in Table 3.2, since the SIP messages are transmitted in two directions (i.e., outgoing and incoming) and can be sent over UDP and TCP. According to the IMS standard [50], all the packets belonging to these four IPsec SAs shall be offered encryption and integrity protection. Nevertheless, the IMS standard [50] does not expressly specify that packets which are sent to the IMS client or the server but do not belong to those four IPsec SAs shall be discarded, so the ME may still route them based on its own policy, e.g., a pass-through policy. In particular, when the source of IMS signaling packets is not restricted, a malware application on the victim UE may be allowed to send fabricated IMS signaling packets to the IMS client locally on the same UE. Given that SMS messages are delivered based on the IMS signaling, this vulnerability can be exploited to launch SMS spoofing attacks when the IMS client accepts and processes the fabricated packets.

Experimental validation. We validate this vulnerability by developing an Android application, designated as FakeIMSSingaling, with only the INTERNET permission. It can fabricate one type of IMS signaling message, SIP MESSAGE [35], which is designed to carry SMS messages. Given the IMS client’s IP and port, FakeIMSSingaling sends fabricated signaling messages to the IMS client using a local UE IP address as the source IP address, which is different from the IMS server’s. The experiment is conducted in the networks of three U.S. carriers and two Taiwan carriers.

Figure 3.6 A forged SIP message containing an SMS message.

Figure 3.7 Packet traces collected from US-I (top, w/o IPsec), US-II (middle, with IPsec), and US-III (bottom, with IPsec).

Figure 3.8 The number of the observed new IMS server IP addresses over time for three U.S. carriers and two Taiwan carriers.
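The gap described in this subsection, namely that packets outside the four IPsec SAs are not required to be discarded, can be illustrated with a small model. The following is a minimal Python sketch under stated assumptions, not the IMS client’s or any vendor’s actual logic: the `SecurityAssociation` and `Packet` types, the port numbers, and the addresses (taken from the IPv6 documentation prefix) are hypothetical, and real ESP packets would not expose transport ports in this way. It contrasts a pass-through policy with a strict policy that drops every packet not matching an established SA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SecurityAssociation:
    """One row of a Table 3.2-style SA list (all values are hypothetical)."""
    protocol: str        # "TCP" or "UDP"
    client_ip: str
    client_port: int     # port used on the IMS client side
    server_ip: str
    server_port: int     # port used on the IMS server side


@dataclass(frozen=True)
class Packet:
    """Simplified incoming packet; a modeling assumption, not a wire format."""
    protocol: str
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    ipsec_protected: bool


def belongs_to_sa(pkt: Packet, sa: SecurityAssociation) -> bool:
    """True if an incoming (server-to-client) packet matches the SA."""
    return (
        pkt.ipsec_protected
        and pkt.protocol == sa.protocol
        and (pkt.src_ip, pkt.src_port) == (sa.server_ip, sa.server_port)
        and (pkt.dst_ip, pkt.dst_port) == (sa.client_ip, sa.client_port)
    )


def deliver_to_ims_client(pkt: Packet, sas, strict: bool) -> bool:
    """Pass-through (strict=False) delivers everything; strict drops non-SA packets."""
    if any(belongs_to_sa(pkt, sa) for sa in sas):
        return True
    return not strict


CLIENT, SERVER = "2001:db8::a", "2001:db8::b"   # placeholder addresses
SAS = [
    SecurityAssociation(proto, CLIENT, c_port, SERVER, s_port)
    for proto in ("TCP", "UDP")
    for c_port, s_port in ((6000, 7001), (6001, 7000))
]

# A packet inside an SA, and a plain-text packet from another local address.
genuine = Packet("UDP", SERVER, 7001, CLIENT, 6000, ipsec_protected=True)
forged = Packet("UDP", "2001:db8::c", 40000, CLIENT, 5060, ipsec_protected=False)
```

Under the pass-through policy the forged packet reaches the client, which mirrors the behavior V2 relies on; under the strict policy only the packet that belongs to an SA is delivered.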
For each tested phone, the application sends a plain-text IMS signaling message in the format of SIP MESSAGE, as illustrated in Figure 3.6, to the IMS client, while the tcpdump program captures routed packets. If IPsec is adopted for the IMS signaling security (US-II, US-III, and TW-I), the message is assigned a UDP port number that is not used by the four established IPsec SAs; otherwise (US-I and TW-II), the UDP port number is randomly selected. Our experimental results show that the fabricated IMS signaling message can be locally routed to the IMS client for all the tested carriers regardless of whether IPsec is used, as shown in Figure 3.7. However, this vulnerability does not work for all the tested phones. The Google Pixel phones are an exception. As explained in §3.2.1, they employ the modem-based IMS client instead of the IMS client in the Android OS, so no IMS signaling messages can be observed in the OS or sent to the IMS client successfully.

[Figure annotation: Faked SIP signaling carries SMS message. Figure 3.8 panels: US-III, US-II, US-I, TW-I, TW-II; axes: Time (Hour) and Variation.]

Root cause and lesson learned. At first glance, phone vendors should take most of the blame. On second thought, this may not be the case, for two reasons. First, common socket communication allows interprocess communication within a system, so the malware can use it to easily send fabricated SIP packets to the IMS client within the same system. However, the IMS standard does not explicitly prohibit it. Second, the phone vendors indeed fulfill the IPsec requirement for the IMS signaling, but the IMS standard does not stipulate how to deal with packets that do not belong to the IPsec SAs but are sent to the IMS client.
In view of these root causes, the vulnerability arises from a design issue of the IMS standard: the IMS client is not protected from receiving messages that do not originate from IMS servers. It thus calls for a new security mechanism to verify the source of the IMS signaling for the IMS client.

3.2.3 Proof-of-concept Attacks

We devise two novel attacks using the vulnerabilities V1 and V2, respectively: (1) Denial of Service over All Networks, designated as DoS-All; and (2) Named SMS Source Spoofing, designated as NameSpoofing. DoS-All launches denial of IMS services by exploiting the Wi-Fi association of the victim UE or by installing a malware application without root privilege on the UE. It causes the IMS services not only to suffer over Wi-Fi [139] but also to be blocked over cellular radios (e.g., 5G NR). NameSpoofing allows the malware to send spoofed SMS messages, with arbitrarily assigned sender nicknames (e.g., “Daddy”), to the IMS client locally on the victim UE; they can be successfully received by the SMS application and shown to the victim. Notably, these two attacks do not require root privilege and have been successfully validated with Android versions ranging from 4.4.2 to 13.

Figure 3.9 Overview of DoS-All attack on IMS signaling routing. (a) Before the attack, IMS signaling is untouchable. (b) After the attack, IMS signaling is redirected.

3.2.3.1 DoS-All Attack

This attack can cause denial of IMS services for the victim UE by exploiting V1, even though the IMS service continuity between different access networks is supported (e.g., when an IMS service is not available over Wi-Fi, its offering can be handed over to another access network, 4G LTE or 5G NR [139, 138]). Launching it is challenging, since IMS services must be blocked across all the access networks, particularly the cellular ones.
Another reason IMS traffic is difficult to block lies in its transmission through a dedicated network interface that operates independently of the standard data path. This isolation makes it inaccessible to conventional VPN applications, which cannot capture or interfere with the traffic, as illustrated in Figure 3.9a. By exploiting vulnerability V1, the adversary can assign the IP address of the IMS server serving the victim UE to one of the UE’s network interfaces (i.e., Wi-Fi or VPN), causing all IMS signaling messages to be transmitted to that local network interface. This prevents the IMS client from communicating with the IMS server, no matter which access network is used, as illustrated in Figure 3.9b.

There are two attack cases. First, the victim UE associates with a compromised Wi-Fi network, and the Wi-Fi interface is maliciously assigned the IMS IP address. In this case, no malware application is needed on the victim UE. Second, the victim UE installs a compromised VPN application, so the VPN interface is abused to be assigned the IMS IP address.

However, it is not trivial in practice to obtain the IP address of the IMS server assigned to the victim UE. With Android version 10 or lower, this information can be easily obtained when the “READ_PHONE_STATE” permission is granted. Later Android versions restrict access to this IMS information to a privileged permission, “READ_PRIVILEGED_PHONE_STATE”. To avoid the requirement of privileged permissions, we propose that the adversary collect a list of IMS server IP addresses in advance within the proximity of each victim UE and assign them to the UE during the attack.
This mechanism is motivated by the observation that multiple UEs at nearby locations are likely assigned IMS servers from the same pool; it is also reasonable that serving the UEs in a given range requires only a few IMS servers to be deployed. We conduct an experiment to validate the effectiveness of this mechanism for all the tested carrier networks. In this experiment, we periodically disable and enable the airplane mode on the tested phones to trigger a new assignment of the IMS server IP while moving to different locations over time. The locations span up to 400 km and 181 km in the U.S. and Taiwan experiments, respectively.

As shown in Figure 3.8, where the number of newly observed IP addresses over time varies across locations, there are two main findings. First, the number of IMS IP addresses assigned to the UEs within nearby areas is limited; specifically, there are 16, 7, 40, 2, and 2 different IP addresses for carriers US-I, US-II, US-III, TW-I, and TW-II, respectively. Moreover, all the IP addresses can be collected within a short time, since no new IP addresses appear after the first two hours in each experiment. Second, the IP addresses collected from two different areas for each carrier overlap substantially: 57.1% (US-II), 92.5% (US-III), 100% (TW-I), and 100% (TW-II), except for US-I (0%). Moreover, for all tested carriers, the overlap always reaches 100% for any two locations no more than 5 km apart, a distance much larger than the Wi-Fi network range.

This experimental result shows that the proposed mechanism allows the adversary to collect a set of potential IMS IP addresses for each target victim UE. Note that although the IMS IP address assigned to a victim UE cannot be accurately identified, the Wi-Fi and VPN interfaces can both be assigned multiple IP addresses, so the set of potential IMS IP addresses can be directly used for the attack.
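The overlap comparison reported above amounts to simple set arithmetic. The sketch below is only an illustration: the text does not state the exact overlap formula, so the code assumes overlap = |A ∩ B| / |A ∪ B|. If the reported per-carrier counts are read as the union over both areas, this assumption is numerically consistent with the reported values (4 shared addresses out of 7 gives 57.1%, and 37 out of 40 gives 92.5%). The addresses are placeholders from the IPv6 documentation prefix 2001:db8::/32, not measured data, and the function name is hypothetical.

```python
def overlap_percentage(area_a: set, area_b: set) -> float:
    """Share of addresses seen in both areas, relative to all addresses seen.

    Assumed definition (not stated in the text): |A & B| / |A | B| * 100.
    """
    union = area_a | area_b
    if not union:
        return 0.0
    return 100.0 * len(area_a & area_b) / len(union)


# Placeholder address sets for two hypothetical collection areas.
area_a = {f"2001:db8::{i:x}" for i in range(1, 6)}   # ::1 .. ::5
area_b = {f"2001:db8::{i:x}" for i in range(2, 8)}   # ::2 .. ::7

shared = area_a & area_b          # 4 addresses
candidates = area_a | area_b      # 7 addresses in total
print(f"{overlap_percentage(area_a, area_b):.1f}%")
```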
We have experimentally validated that up to 50 IP addresses can be successfully assigned to either the Wi-Fi or the VPN interface; this number is greater than the number of IMS IP addresses observed for each carrier in the experiment.

Attack implementation and evaluation. We implement the DoS-All attack in two ways, namely a compromised Wi-Fi network and VPN malware. A successful attack depends on whether the IMS server IP address assigned to the victim UE can be correctly assigned to its Wi-Fi or VPN interface.

⋄ Compromised Wi-Fi network: We develop a customized DHCP server on a widely used Wi-Fi router, GL.iNet GL-AX1800, with OpenWrt 21.02. It assigns a set of prepared IMS IP addresses to selected Wi-Fi clients (e.g., only smartphones) based on the device model name specified in each DHCP request message. The DHCP server supports the assignment of both IPv4 and IPv6 addresses; the three tested U.S. carriers all use IPv6, whereas the two tested Taiwan carriers use IPv4. The IPv6 interface can be assigned multiple IP addresses with a mandatory multi-address feature [112], but only a single IP address is accepted by the IPv4 one. To launch the attack against the carriers using the IPv4 address type, a malware program with the INTERNET permission needs to be deployed on the victim UE; notably, the malware is not needed when IPv6 networks are supported by devices and carriers. The malware detects whether an assigned IP address is correct by listening on the port numbers used by the IMS session (e.g., 5060); when it is correct, the malware can receive IMS signaling messages. If the assigned IP address is incorrect, the malware disconnects the UE, which is then assigned another IP address via the DHCP server. Notably, the two Taiwan carriers using IPv4 each have only two IMS IP addresses, so the IP address can be correctly assigned with at most two assignments.
⋄ VPN malware: We develop a VPN malware application on Android phones and deploy a VPN server on the Internet. The VPN malware creates and manages the VPN interface based on the VpnService class. When connecting to the VPN server, it gets a set of potential IMS IP addresses and assigns them to the VPN interface using VpnService.Builder.addAddress.

We further launch the two attack approaches against all the tested phone models, except for the Pixels, in all the tested carrier networks. The result shows that victim UEs always suffer from denial of IMS services and are prevented from using IMS-based call or text services, even though the cellular or Wi-Fi signal quality is good in all the experiments. Figure 3.10 shows examples of successful attacks in three different access networks.

Figure 3.10 Successful denial of IMS services over Wi-Fi (Left), 4G (Middle), and 5G (Right) networks.

Figure 3.11 USSD service over IMS signaling (Left) and successful DoS attack on USSD service (Right).

Notably, we notice that some carriers deploy additional security mechanisms. Specifically, the phones tested in the networks of US-II and TW-II downgrade their access networks to the legacy ones (e.g., 2G and 3G networks) so that they can still have the legacy call and text services, but those tested for the other carriers suffer from denial of all the call and text services.

Notably, DoS-All can intercept and drop IMS signaling, disrupting not only IMS-based calls and texts but also other services reliant on IMS signaling, such as USSD and RCS (see footnote 1). For example, USSD is another IMS-based service (Figure 3.11), enabling users to dial quick codes for tasks like checking balances, paying fees, or changing passwords.
When a DoS-All attack is launched, the IMS client loses connection to the IMS server, resulting in user-facing errors like “SYSTEM UNAVAILABLE” or “MMI Complete.” These observations reveal the broader impact of DoS-All on IMS-dependent services without additional attack techniques.

Attack variants. The DoS-All attack can be extended to launch various MiTM attacks. The above malware, lacking root privileges, can intercept all outgoing IMS client messages, as shown in Figure 3.12, and then forward the messages to a remote server or, when IPsec is absent, interact directly with the client through fabricated replies.

Footnote 1: Rich Communication Services (RCS) relies on IMS signaling to complete the configuration and registration procedures [68].

Figure 3.12 Intercepting a SIP message carrying an SMS message in an MiTM attack.

3.2.3.2 NameSpoofing Attack

This attack exploits V2 to send victims spoofed SMS messages in which the sender names can be arbitrarily specified. It differs from conventional SMS spoofing attacks, which deliver spoofed SMS messages through the core network based on spoofed phone numbers, and has two advantages over them. First, the attack does not require SMS messages to be sent through the network, so it cannot be impeded by any security mechanisms deployed in the network. In contrast, the conventional attacks have become much more challenging, since the FCC in the U.S. has mandated carriers to deploy STIR/SHAKEN [129] in the core network to defend against the spoofing attacks; specifically, STIR incorporates digital certificates into the IMS signaling to validate the identity of the SMS sender or the caller.
Second, the attack can show an arbitrary spoofed sender name without investigating any phone numbers trusted by the victim or stored in the contact list, whereas the conventional attacks can show only unconvincing spoofed numbers on the victim phone if those numbers are not in the contact list.

To show spoofed names, we fabricate SMS messages in the format stipulated by the 3GPP2 standard [51], as shown in Figure 3.13, instead of the 3GPP standard format [22]. The former offers the capability of presenting the message’s originating address using ASCII characters (up to 128 characters), whereas the latter allows it to be only in the E.164 format (i.e., the format of phone numbers). Since the spoofed SMS messages do not need to pass through the core network, they can be successfully delivered to the IMS client based on V2 whenever the UE supports the 3GPP2 format. Surprisingly, most phone vendors, such as Samsung, LG, and TCL, still support both 3GPP and 3GPP2 standards for international roaming services. As a result, the present attack can successfully show spoofed names on victim UEs by specifying them in the field of the originating address.

Figure 3.13 A fabricated SMS message with a spoofed sender name in the 3GPP2 format.

Attack implementation and evaluation. We develop the malware based on the FakeIMSSingaling application, which has only the INTERNET permission, by adding two new features for the attack: (1) identifying the IMS client’s IP address and UDP port number, which were manually configured during the validation of vulnerability V2; and (2) fabricating SMS messages in the 3GPP2 format. We conduct an experiment that uses the malware to send a spoofed SMS message with the sender name “Mark Zuckerberg verified by Verizon” to the IMS client locally, for all the tested phones except Google Pixel phones, and all the tested carriers; notably, the carrier name “Verizon” is used solely for testing purposes.
The result shows that all the tested phones can successfully display the spoofed SMS message no matter which carrier network is connected. Figure 3.1 illustrates the spoofed SMS message displayed on a phone using US-I. Although carriers US-II and US-III conform to the 3GPP standard, the attack can still succeed since the fabricated SMS messages do not pass through the core network. This also validates that many mobile phones support both 3GPP and 3GPP2 standards.

[Figure 3.13 annotations: SMS in 3GPP2 format; spoofed name using 8-bit ASCII.]

Moreover, note that the malware needs to use the IMS server IP address as the source address of the forged SIP packet carrying the spoofed SMS message if the victim phone is running Android 9 or higher; otherwise, the source address can be assigned any IP address. The IMS clients with Android 9 or higher validate the source address and discard SIP packets not from IMS servers. For malware with only the INTERNET permission, setting the IMS server IP as the source IP address of fabricated SIP packets can be achieved by binding a UDP socket to the local network interface that is assigned the IMS server IP. For assigning the IP to a network interface, there are two approaches, as presented in §3.2.3.1: Wi-Fi-based and VPN-based. When the VPN-based approach is adopted, an additional permission, BIND_VPN_SERVICE, is required for the malware.

Figure 3.14 Overview of ghost conversation attack (Left) and a successful fabricated conversation (Right).

Attack variants. The NameSpoofing attack enhances traditional phishing techniques, such as SMS messages with malicious URLs [121]. Unlike conventional phishing, which typically originates from unknown numbers, NameSpoofing allows attackers to impersonate trusted entities (e.g., friends or official institutions) by spoofing the sender name, making messages appear more credible. Additionally, it can fabricate ghost SMS conversations that do not exist.
As shown in Figure 3.14, the adversary can use malware to interact with the IMS client in SMS conversations, creating fabricated evidence for use in court that can instantly damage the victim’s reputation. As a result, NameSpoofing significantly increases the effectiveness of existing phishing attacks.

3.3 Insecure ME Access for IMS Media Sessions

We further explore the insecurity of the IMS media session built on the ME. According to the IMS standard [20], the IMS media session should be encrypted and integrity-protected based on SRTP (Secure RTP); the SRTP keys are derived from the IMS call setup procedure. Since this security mechanism offers end-to-end protection between the IMS client and the IMS server, when the protection is in place it is almost impossible to forge valid media packets or hijack the media session on the ME without compromising the IMS client. However, SRTP is not a mandatory feature, so it may be absent on the ME, which allows the adversary to fabricate valid media packets that are simply plaintext in the RTP format. Moreover, when the phone modem does not verify the originator of the received IMS media packets, they may be dispatched to the radio bearer dedicated to the IMS media session. Thus, the IMS media bearer may be abused. Notably, the IMS voice packets are generated by the phone modem itself, so they do not have this security issue. In the following, we focus on the IMS video session.

Unfortunately, the above potential security threat is found on COTS MEs. We identify two vulnerabilities that facilitate the potential abuse of IMS video sessions. The first vulnerability (V3) reveals that video data delivery is not protected by SRTP. All ViIMS packets are transmitted without confidentiality and integrity protection. Consequently, the adversary can easily use ViIMS packets to carry non-video data.
The second vulnerability (V4) confirms that the phone modem does not impose any restrictions on the source of ViIMS packets, allowing the adversary to bypass the authentic IMS client and transmit non-video data to the cellular infrastructure over the IMS media bearer. We next elaborate on these two vulnerabilities and devise a proof-of-concept attack. Note that experiments are mainly conducted in US-I and US-II networks, as ViIMS is not yet supported by TW-I and TW-II, and US-III supports it on only a few phone models.

3.3.1 V3. Unprotected Video Data Delivery

It has been reported that no SRTP protection is provided over the IMS voice session [92, 84], where voice packets originate from the phone modem. For the video session, although the video data are processed by a different component, the application processor, it is highly probable that SRTP is still missing, following the same common practice. This practice can leave the video data unprotected on the UE before delivery. Once the UE is compromised with root access, the video packets can be captured and analyzed in preparation for forging valid IMS video packets.

Figure 3.15 Unprotected IMS video packets in plaintext.

Experimental validation. We validate this vulnerability on LG G3, Google Pixel 1/3/5/7, and Samsung S8/S10 with Android versions ranging from 4.4.2 to 13. On each tested phone connecting to a carrier network, we use Wireshark to capture packets while dialing a video call to another phone. It is observed that all the IMS video packets are in plaintext without any security protection on all the tested phones. Figure 3.15 shows one test result as an example.

Root cause and lesson learned. The absence of the SRTP protection is not without reason. The IMS video data delivery has been protected by the user-plane security built between the UE and the base station; it is performed at the PDCP layer with ciphering and integrity protection.
Therefore, phone vendors and carriers may consider that such a security mechanism defends the video session against all potential threats. However, it cannot safeguard the IMS video data on a compromised UE before they are sent over the air. This vulnerability is rooted in the lack of end-to-end security between the IMS client and the IMS server; in particular, the ME security is not considered. Note that the SRTP protection can be applied, along with SELinux, to safeguard IMS video sessions. This prevents video calls from being tampered with or extracted, even by adversaries with root privileges. The reason is that SELinux, integrated into Android, employs MAC (Mandatory Access Control) [64] to restrict user access, including that of root users.

3.3.2 V4. Unrestricted Source for IMS Video Delivery

The phone modem employs the TFT filter to identify IMS video packets and then dispatches them to the IMS video bearer [49], which offers guaranteed performance for the IMS video session. The TFT filter rule set for each bearer is based on the 5-tuple information (i.e., source/destination IP addresses, source/destination port numbers, and protocol ID); it can be easily obtained by the adversary from normal video packets or control-plane SIP messages. If the forged video packets are given the correct 5-tuple information and the phone modem does not deploy any security mechanism to verify their delivery source, they could be forwarded to the IMS video bearer by the modem. Moreover, unlike the IMS voice data processed by the modem directly, the IMS video data are sent from the Android OS to the modem, so the forged video packets can possibly be delivered by a malware application in the same way.

Figure 3.16 A collected trace including forged RTP packets at the ViIMS callee for US-I.

Experimental validation. We develop a malware application with root privilege to validate this vulnerability with US-I and US-II.
Given a ViIMS call, the malware at the caller generates RTP packets with various payload sizes and sends them to the callee; notably, these packets are assigned a unique RTP SSRC (Synchronization Source) ID, 1234567890 (0x499602D2), and contain random data in the payload (as shown in Figure 3.16). Based on the trace collected at the callee, it is observed that all the RTP packets with sizes ranging from 100 to 1346 can be successfully delivered from the caller to the callee in US-I, whereas US-II allows only 10 particular sizes: 37, 169, 393, 489, 537, 585, 729, 1129, 1237, and 1294.

Root cause and lesson learned. The root cause is that the phone modem does not verify the source of the IMS video data delivery, but depends only on the default TFT filter for the dispatching of video packets. Although the IMS server could inspect the payload content of video packets to identify the forged ones, this is not allowed without approval from at least one party involved in the video call or from a court, due to legal provisions for carriers [2]. Thus, this vulnerability has to be addressed at the ME.

Figure 3.17 Illustration of ViIMS-ANY attack.

3.3.3 ViIMS-ANY: Covert Communications over Video-over-IMS

We next present a proof-of-concept attack in which two UEs communicate covertly with each other over the ViIMS data-plane channel by exploiting vulnerabilities V3 and V4. Different from previous attacks, the victims in this attack are carriers (e.g., US-I), not individual users. Specifically, adversaries are individuals who have full control over their own phones and seek to exploit the carriers’ high-priority resources reserved for the ViIMS service to establish covert communication channels. The impact of this attack is expected to grow rapidly as the ViIMS service becomes more popular, even though ViIMS is still in the early stages of deployment. The three major U.S.
operators (AT&T, Verizon, and T-Mobile) have introduced the ViIMS service, and some of them support inter-operator ViIMS calls. According to a report [110] by Juniper Research, the number of subscribed users is projected to reach 4.5 billion by 2025, representing 50% of global mobile subscribers. Unlike other video call services such as Skype, ViIMS guarantees performance with minimal overhead and relies on the IMS application already used for the widely deployed VoIMS service; no additional applications are needed.

Attack implementation and evaluation. We develop an attack library called ViIMSSocket using C and the raw socket APIs provided by the Linux kernel. It is given root privilege and provides upper-layer applications with a UDP-like packet transmission method for executing covert ViIMS communication, as shown in Figure 3.17. For adversaries with an engineering background, obtaining root privileges is technically feasible using tools like Magisk [136] and One-Click Root [15] on Android. The library contains three major APIs: (1) ViIMSSocket(Callee’s Number), which establishes a covert communication channel with the callee over ViIMS and returns a socket ID; (2) ViIMSSocketWriteData(socketID, data), which transmits data to the callee; and (3) ViIMSSocketReadData(socketID, buffer), which receives data from the callee. Notably, ViIMSSocket prevents the actual IMS video packets from being transmitted to the IMS server, to maximize the communication capacity.

We evaluate the throughput performance of the covert communication in the networks of US-I and US-II by sending a 10 MB file from one UE to another UE. The experiment runs ten times for each carrier. It is observed that the file is always delivered successfully. The average throughput measured on US-I and US-II is 545.7 Kbps and 581.4 Kbps, respectively.
The achieved throughput values are much greater than that (up to 38 Kbps) measured for data transmission over the IMS voice data-plane channel [84]. By abusing IMS video sessions, the covert communication channel is given the guaranteed bit rate resource, so the throughput is guaranteed even in congested scenarios. Furthermore, it is observed that the covert communication can be sustained for at least 100 minutes during a ViIMS call.

Attack variants. With the developed ViIMSSocket, potential attacks extend beyond covert communication. Adversaries can hijack video calls to launch video spoofing attacks, allowing for mobile deepfake video calls, encrypted stealthy communication channels, and video frame steganography attacks [135], evading carrier detection of non-video data transmission. Importantly, the video spoofing attack does not require a malware application on the victim UE receiving the spoofed video call.

3.4 Discussion

We now analyze the security implications of modem-based IMS implementations and conclude with a brief discussion of the iPhone’s architecture and built-in protections.

Is the modem-based IMS client better? Google Pixel phones employ the modem-based IMS client, ensuring that no IMS signaling packets are routed in the Android OS, thus making them immune to vulnerabilities V1 and V2. This hardware-based approach appears more secure than the software-based methods on other phones but has its limitations. First, the Pixel phones remain vulnerable to the DoS-All attack when it is extended to prevent IMS media from being sent to the IMS media server. This extension can be achieved by assigning the IMS media server address to a local network interface of the victim UE. Consequently, IMS media data generated in the application processor’s domain (e.g., video), rather than the phone modem’s, can be sent to the local interface instead of the IMS media server.
Second, the modem-based IMS client lacks flexibility in updating IMS-related services, e.g., enabling any of the rich communication services (RCS) [67], since updating the phone modem requires collaboration from modem vendors like Qualcomm. It is less convenient and more time-consuming than a software update.

Are iPhones secure? We conduct experiments on iPhones with four iOS versions (15/15.5/16.5/17) to validate the four discovered vulnerabilities. iPhones are immune to vulnerabilities V1 and V2 due to different network policies applied in iOS, which is built on a Unix-like OS (Darwin) [132]. Specifically, iOS employs an interface-oriented approach, which restricts the routing of the IMS signaling to only the cellular interface, so V1 does not exist. It also drops the IMS packets that do not belong to the established IPsec SAs, thereby avoiding V2. Since most recent iPhones do not support ViIMS [133], the validation of V3 and V4 is left for future investigation.

3.5 Solution

In this section, we propose two remedies to address these four vulnerabilities and evaluate their effectiveness.

3.5.1 Restricted IMS Routing

We propose a restricted IMS routing mechanism that contains two methods to address vulnerabilities V1 (unprotected IMS signaling routing) and V2 (unrestricted IMS signaling source), respectively. First, the mobile OS shall prohibit any local network interface from being assigned the IMS server’s IP address, so that IMS signaling packets are routed to the IMS server rather than locally. Second, the mobile OS shall be prevented from sending the IMS client any packets originating from local applications, so all the routing policies and tables shall prohibit local IMS traffic routing. Take the routing table in Figure 3.3a as an example; the routing rule, “local 2600:...:83f:d04d dev rmnet_data0”, with the IMS client’s IP address shall be removed.
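The second method above can be illustrated with a minimal sketch that scans routing-table text for local-delivery entries bound to the IMS client's address. It is not the dissertation's implementation: the function name, the sample table, and the 2001:db8::/32 documentation addresses are hypothetical, and the entry layout ("local <address> dev <interface> ...") is assumed from the example rule quoted above.

```python
import ipaddress


def find_local_ims_rules(route_lines, ims_client_ip):
    """Return routing entries that deliver packets for ims_client_ip locally.

    Assumes each local-delivery entry starts with: local <address> dev <interface>
    """
    target = ipaddress.ip_address(ims_client_ip)
    flagged = []
    for line in route_lines:
        fields = line.split()
        if len(fields) < 2 or fields[0] != "local":
            continue
        try:
            address = ipaddress.ip_address(fields[1])
        except ValueError:
            continue  # not a single host address (e.g., a prefix), so skip it
        if address == target:
            flagged.append(line.strip())
    return flagged


# Hypothetical routing-table excerpt; addresses come from the documentation range.
sample_table = [
    "local 2001:db8::1 dev rmnet_data0 table local proto kernel metric 0",
    "local ::1 dev lo table local proto kernel metric 0",
    "2001:db8::/64 dev rmnet_data0 proto kernel metric 256",
]
rules_to_remove = find_local_ims_rules(sample_table, "2001:db8::1")
print(rules_to_remove)
```

Comparing parsed addresses rather than raw strings keeps the check robust to different textual forms of the same IPv6 address.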
3.5.2 Protected IMS Media Sessions

Applying the SRTP protection to safeguard IMS voice sessions can prevent video call tampering (vulnerability V3), but it does not forbid transmitting non-video data. The SRTP protection is built between the IMS client and the IMS server, so it does not enable the modem to verify the authenticity of the IMS video source (vulnerability V4). To this end, a secure communication channel between the IMS client and the modem has to be built. We adopt DHKE (Diffie-Hellman Key Exchange), which is effective in deriving shared secret keys, to establish this channel. The solution leverages the cellular infrastructure as a trusted intermediary in the DHKE procedure, preventing DHKE's common threat, MiTM attacks. It thereby avoids the use of asymmetric cryptography, which is commonly used to address MiTM attacks but may not be supported on all MEs. The proposed DHKE procedure exchanges DHKE parameters between the IMS client and the modem during the SIP registration, while performing mutual authentication based on the 3GPP symmetric cryptography. The established secure communication channel remains secure unless the 3GPP symmetric cryptography is compromised. Notably, the proposed solution does not require modifying cellular network protocols or adding new signaling messages.
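The key agreement that this remedy builds on can be sketched with textbook Diffie-Hellman arithmetic. The snippet below is only an illustration under stated assumptions: the prime q = 23, the primitive root alpha = 5, and both private keys are toy values chosen for readability (a real deployment would use a large prime and random private keys), the function names are hypothetical, and the transport over SIP and RRC messages as well as the 3GPP-based mutual authentication are omitted.

```python
def dh_public_key(alpha, private_key, q):
    """Public key: alpha raised to the private key, modulo q."""
    return pow(alpha, private_key, q)


def dh_shared_key(peer_public_key, private_key, q):
    """Shared secret: the peer's public key raised to our private key, modulo q."""
    return pow(peer_public_key, private_key, q)


# Toy public parameters (assumed for illustration only).
q, alpha = 23, 5

# Private keys: x_a for the IMS client, x_b for the modem (toy values).
x_a, x_b = 6, 15

y_a = dh_public_key(alpha, x_a, q)  # sent from the IMS client toward the modem
y_b = dh_public_key(alpha, x_b, q)  # sent from the modem back to the IMS client

k_modem = dh_shared_key(y_a, x_b, q)   # computed on the modem side
k_client = dh_shared_key(y_b, x_a, q)  # computed on the IMS client side

print(y_a, y_b, k_client, k_modem)
```

Both sides arrive at the same key without ever transmitting a private key, which is why an eavesdropper who only sees the exchanged public values cannot read the key directly.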
Figure 3.18 illustrates the proposed DHKE procedure with seven steps: (1) at the initiation of the IMS registration procedure [50], the IMS client selects a large prime number, $q$, a primitive root of $q$, $\alpha$, and a private key, $X_a$; (2) the IMS client calculates its public key as $Y_a = \alpha^{X_a} \bmod q$; (3) the IMS client transmits the SIP REGISTER message carrying $q$, $\alpha$, and $Y_a$ to the IMS server; (4) the IMS server coordinates the PCF, the AMF, and the serving base station to transmit $q$, $\alpha$, and $Y_a$ to the RRC (Radio Resource Control) layer on the phone modem using the RRC Reconfiguration message; (5) the RRC layer on the phone modem selects a private key, $X_b$, calculates the corresponding public key, $Y_b = \alpha^{X_b} \bmod q$, calculates the shared secret key, $K = Y_a^{X_b} \bmod q$, provides $K$ to the PDCP layer, and then transmits $Y_b$ to the base station using the RRC Reconfiguration Complete message; (6) the phone modem’s $Y_b$ is delivered to the IMS client through the SIP OK in response to the SIP REGISTER message; and (7) with the received public key, $Y_b$, the IMS client calculates the shared secret key, $K = Y_b^{X_a} \bmod q$, which it then shares with the phone modem.

Figure 3.18 The DHKE procedure integrated into the 3GPP cross-layer communication framework.

Note that DHKE ensures that two communicating parties can derive a shared secret key over an insecure channel. Even with an eavesdropper inside or outside the ME (e.g., eavesdropping on RRC messages) during the DHKE procedure, the shared secret key cannot be inferred or leaked. After the secret key is derived, the mobile OS must ensure that no applications, even those with root privileges, can access the IMS client’s memory where the key is stored.

Against legacy and compromised UEs. Adversaries may use legacy UEs or compromised UEs built based on SDR (Software-Defined Radio) platforms (e.g., srsUE [7]) to launch the ViIMS-ANY attack, since they do not allow the proposed solution to be deployed.
To address this issue, carriers can reduce attack incentives by limiting the bandwidth of video sessions, and thus preventing high-bandwidth video calls, for UEs on which the deployment of the proposed solution is not detected. Moreover, the infrastructure can also detect such UEs by monitoring their IMS media usage [55].

3.5.3 Prototype and Evaluation

We next prototype and evaluate the proposed remedies.

Restricted IMS routing. We develop an Android system application, designated as IMSProtector, with root privilege. It mainly monitors three pieces of information: (1) the RPDB, (2) routing tables, and (3) network interfaces. It not only removes any routing rule allowing local IMS traffic routing but also deactivates any interface that is detected to be assigned the IMS server’s IP address. To assess the effectiveness of IMSProtector, we launch the attacks of the ineluctable denial of IMS services and the named SMS source spoofing. As shown in Figure 3.19, IMSProtector can successfully defend against these two attacks. Specifically, it deactivates the local Wi-Fi network interface (i.e., tun0) assigned the IMS server’s IP address. It is also observed that the attack application, SMSNameSpoofer, is not allowed to transmit any SMS messages with named sources to the IMS client due to a lack of local IMS routing.

Protected IMS media sessions.
We implement and evaluate this solution on an SDR platform, using srsUE (v23.04) for emulating a 5G UE, srsRAN (v23.04) for emulating a 5G gNB, and open5GS (v2.4.11) for emulating a 5G core network; ZeroMQ [3] is used to implement the radio link between the gNB and the UE. Moreover, we develop an IMS client and an IMS server in Python, and deploy them on the srsUE and the open5GS, respectively. The PCF and the AMF in the core network, as well as the gNB, are modified to support the proposed DHKE procedure. This platform is built on a Dell XPS 13 laptop running Ubuntu 22.04, equipped with an i7-1185G7 CPU and 16GB of RAM. In the prototype, we use the shared secret key, K, to provide integrity and data origin authentication for IP packets exchanged between the IMS client and the phone modem. In particular, we add an option using an unassigned option type of 150 [5] to IP headers for the MAC (Message Authentication Code) verification. To examine the effectiveness of this solution, we launch the ViIMS-ANY attack after a secure communication between the IMS client and the phone modem is established. As shown in Figure 3.20, it is observed that the fabricated IMS video packets are detected and then dropped.

Figure 3.19 IMSProtector: (Left) disabling an interface assigned the IMS IP address; (Right) local IMS routing is forbidden.

Figure 3.20 Evaluation of enabling secure communications between the IMS client and the phone modem. (a) Traces on SDR UE (An adversary view). (b) Errors shown on the terminal of the SDR modem.

Carriers have deployed the IMS system since launching VoLTE. Although 3GPP kept improving its security designs over the last two decades, most enhancements have been focused on the cellular infrastructure. This caused the ME security in the IMS standard to lag behind the infrastructure security, posing security risks to cellular users and carriers.
We conducted a comprehensive security study regarding the IMS signaling and media delivery on the ME; four vulnerabilities were identified, and the corresponding three attacks were exposed. These security threats have been validated using ten phone models and five carriers across two countries. Although we have proposed remedies to address them, completely solving them requires collaboration among carriers, phone vendors, and the cellular standard community.

CHAPTER 4 ENHANCING THE PRIVACY OF VOICE SERVICES OVER IMS INFRASTRUCTURE

Mobile voice communication has been a long-standing and widely utilized service. Despite the growing popularity of third-party communication services over mobile broadband, voice calls remain prevalent, with a substantial user base [109, 118, 111]. In the transition to 5G/4G networks fully reliant on IP technology, voice communication has evolved into Voice over IP Multimedia Subsystem (VoIMS), known as Voice-over-New-Radio (VoNR) for 5G and Voice-over-LTE (VoLTE) for 4G [9]. Presently, more than 235 operators in 105 countries offer VoIMS services, with a projection to serve five billion devices by 2025 [65].

The security of 5G/4G voice calls is a primary concern, with encryption measures in place to safeguard confidentiality, privacy, and security. These security protocols incorporate well-established mechanisms such as 5G/4G Authentication and Key Agreement (AKA) and multi-layer security at Network Layer 3 and Layer 2 [44, 43]. In Layer 3, voice calls are protected through IPsec (Internet Protocol Security) for confidentiality and integrity [42]. To protect transmissions over the air, Layer 2 provides encryption through the Packet Data Convergence Protocol (PDCP) [46, 47].

Unfortunately, the pursuit of optimizing VoIMS quality and efficiency through standardized techniques, adopted by commercial 5G/4G networks, introduces unanticipated security implications.
These optimization techniques include the use of guaranteed-bit-rate radio bearers, RObust Header Compression (ROHC) to compress packet headers [78], the implementation of Adaptive Multi-Rate (AMR) audio codecs [24] for varying radio conditions, and the incorporation of comfort noise for handling silence during calls [25]. Each technique individually aims to enhance call quality and efficiency. However, the good turns evil when they are put together: combined, they present unexpected vulnerabilities, potentially turning VoIMS calls into security threats. For instance, the combination of ROHC and comfort noise generates very small packets (less than 16 bytes) that can distinguish VoIMS traffic from other types. By this means, a VoIMS call can be detected by checking the presence of tiny packets. Moreover, examining voice packet patterns with and without comfort noise can infer voice call states. More inference details are elaborated in §4.2. We want to highlight that such inference is not easy because of the sheer volume of encrypted packets over the air and the rich real-world complexity of human speech activities.

These security concerns led to the development of proof-of-concept passive (§4.3.1) and active (§4.3.2) attacks. The passive attacks leverage precise call state information to uncover the identity of the caller/callee (i.e., linking user identities to cellular identities such as phone numbers). In contrast, the active attack focuses on selectively muting one of the call participants, rendering Call Denial of Service (DoS) attacks more inconspicuous and efficient. These attacks have undergone rigorous validation and assessment through experiments conducted with all three major US operators, utilizing commercial mobile phones. A standard-compliant fix to these vulnerabilities has been proposed and assessed for its effectiveness in §4.5.
In summary, our work pioneers the inference of confidential 5G/4G call information without decrypting voice packets, showing that beneficial call enhancement techniques can be turned into security concerns and emerging threats against 5G/4G call security and privacy.

4.1 Threat Model and Methodology

We next present the threat model, followed by our responsible methodology and ethical considerations.

Threat Model. In our threat model, adversaries are defined as individuals or entities that seek to monitor or launch attacks against mobile users by exploiting vulnerabilities in 5G/4G radio channels. These adversaries have the capability to eavesdrop on all communications occurring over public channels, such as 5G/4G radio channels, but they lack the ability to decrypt encrypted messages without access to the requisite decryption keys. In practice, adversaries can deploy their own equipment, like a 5G/4G sniffer, in close proximity to victim User Equipment (UEs) to intercept all packets transmitted over the air. However, they do not possess the means to compromise the security of any victim’s smartphone or the 5G/4G networks themselves. To simplify the scenario, we consider an adversary, referred to as "Evil," eavesdropping on one of the call participants (e.g., Alice) during a voice call, given that both call parties (e.g., Alice and Bob) are typically not in close proximity and do not utilize the same radio channel that the adversary is intercepting.

Responsible methodology and ethics. We adhere to a responsible and ethical research approach. Real experiments were performed in collaboration with all three prominent U.S. operators, which we refer to as OP-I, OP-II, and OP-III. The primary objective was to validate the identified vulnerabilities and evaluate the potential impact of attacks. Recognizing that certain feasibility tests and attack assessments could pose risks to network operators and their mobile users, we took precautions.
Unless explicitly stated otherwise, experiments were carried out within a fully controlled environment. In this controlled setting, we utilized a 5G/4G sniffer implemented via software-defined radio (SDR) to gather information about voice calls from smartphones. Importantly, all smartphones involved in these experiments were the property of our research laboratory. To avoid inadvertent attacks on smartphones not participating in the study, we implemented two critical measures:

1. All experiments were conducted within a private laboratory during off-peak times, and strict measures were taken to ensure the absence of any unauthorized individuals nearby. In this setup, one smartphone served as the victim, while several other smartphones acted as simulated users in proximity to replicate normal 5G/4G traffic.

2. In situations where potential passersby were present, we utilized phone-side cellular trace collectors, such as MobileInsight [102], to exclusively collect cellular radio traffic data from the victim’s smartphone. This approach guaranteed that no cellular radio traffic from non-participating smartphones was inadvertently collected.

In some cases, specific attack experiments were conducted in semi-controlled environments or public places. Further details regarding the experimental configurations for each attack are provided in §4.3.1 and §4.3.2. This comprehensive approach underscores our commitment to responsible research practices and ethical considerations in the study of mobile network security.

4.2 Side-Channel Call Inference

In this section, we present call inference techniques to obtain confidential call information over encrypted packets (without knowing the decryption keys). Figure 4.1 gives an overview of side-channel inference with three tasks. First, it detects the presence of a VoIMS call out of all the packets received in the air (§4.2.1).

Figure 4.1 Overview of side-channel call inference.
Note that most packets are not for voice (say, for mobile data and 5G/4G signaling). Second, it infers call states for the detected VoIMS call, particularly who is talking (§4.2.2). This means that the adversary Evil can learn more about how the voice call is proceeding by dividing a call conversation into fine-grained segments (e.g., Alice talks most of the time or rarely talks), thereby launching attacks based on precise call states. Last, it further infers the start and end time for each conversation segment (marked as “+” and “×” in Figure 4.1). Such precise call state information makes it possible to launch attacks that infer more confidential information (say, user identities in §4.3.1) or selectively manipulate the target victim call at specific times (§4.3.2).

4.2.1 Detecting VoIMS Calls

At first glance, detecting the presence of an ongoing VoIMS call is not challenging, even though all the packets are encrypted. This is because the radio bearers used for VoIMS (voice traffic and signaling) differ from those used for mobile Internet data. A previous study [113] has observed the use of distinct DRBs (for example, DRB1: mobile Internet data, DRB2: VoIMS signaling, DRB3: VoIMS voice packets). Even though traffic is ciphered at PDCP, it is not hard to detect VoIMS calls by analyzing the use of all DRBs.

However, the reality is more complex and challenging. First, the mapping between a DRB number and its supported traffic type (e.g., DRB3 for VoIMS voice traffic) is never fixed or explicitly defined by any VoIMS standard. As a result, it varies with network operators and changes over time. Notably, each UE can have up to 8 DRBs. All PDCP packets transmitted over DRBs are encrypted, making it challenging to discover the DRB that transmits VoIMS packets. Second, there are many packets over the air; it is not scalable to inspect all the packets in a large number of concurrent DRBs for an extended period.
Here, we need a reliable, scalable, and lightweight solution to effectively detect the presence of VoIMS calls by concurrently screening all DRBs used by nearby mobile users. Our call detection approach exploits two voice quality optimization techniques: ROHC [46, 31, 37] and CN [25, 27, 21]. It is based on two key facts: (1) both ROHC and CN are mandatory features for VoIMS services regulated by 3GPP standards [25, 27, 21]; and (2) they produce special voice packets whose sizes are significantly smaller than non-VoIMS packets. The use of both techniques has been observed in all VoIMS call experiments conducted with three U.S. operators. A CN voice packet contains 35∼48 bits (4.375∼6 bytes) for noise information [24, 26, 28]; it is then encapsulated into an RTP packet using the payload type of 13 [79]. With ROHC, it is further compressed into a PDCP packet with a length of 8∼13 bytes. These tiny PDCP packets only appear during VoIMS calls, making the detection of their presence an indicator of VoIMS calls. Once a tiny packet is detected, we can determine its DRB number, revealing which DRB is used for VoIMS. Notably, the vulnerability lies in the fact that both ROHC and CN are exclusive to VoIMS and not used for other types of traffic.

Empirical Validation. We conducted experiments with three U.S. operators to validate that only VoIMS calls generate tiny PDCP packets, while non-VoIMS applications do not. Our tests covered four phone models, including Google Pixel 5, Pixel 3, Samsung S5, and LG G3, all supporting VoLTE, and Pixel 5 supporting VoNR as a 5G phone. We ran three VoIMS-based applications (VoLTE, VoNR, and Google Voice) and 57 non-VoIMS applications selected from the top-100 mobile Internet applications. These test applications were roughly categorized into three other groups: (1) Non-VoIMS VoIP (e.g., Skype), (2) Non-VoIP streaming (e.g., Netflix, YouTube), and (3) Non-streaming (e.g., Amazon, Twitter, Reddit).
For all VoIP applications (including VoIMS and non-VoIMS VoIP), each test consisted of a 30-second voice call with 10 seconds for ringing and 20 seconds for call conversation. For non-VoIP streaming applications, each test involved video streaming for approximately 5 minutes. For non-streaming applications, we continuously accessed their Internet services (e.g., sending messages, refreshing online content, and searching for products) for 1 minute. Table 4.1 shows the PDCP packet lengths for three VoIMS applications and the minimal length observed for each non-VoIMS application in our study. We made four key findings from this data: First, we observed that all three U.S. operators have implemented both ROHC and CN techniques for VoIMS calls. Second, none of the non-VoIMS applications produced tiny PDCP packets, while many of these tiny packets were consistently present during VoIMS calls. This finding remained consistent across all four phone models and three operators. Third, our approach was effective in detecting the presence of multiple concurrent VoIMS calls, with the capability to identify up to four concurrent VoIMS calls tested in our study. Finally, the number of tiny packets detected during a VoIMS call varied across network operators and mobile device models. We later demonstrate that this variation is primarily due to differences in speech coding rates (AMR), which directly impact call state inference (as discussed in §4.2.2). 4.2.2 Inferring Call States In this section, we describe the method for inferring finer-grained call states once a VoIMS call is detected. When the adversary, denoted as Evil, deploys a sniffer near Alice, the observed call states from Alice’s perspective include two primary conditions: "talking" and "not-talking (listening)" while the call conversation is active. In this context, "talking" indicates that Alice is speaking, while "listening" means that Alice is not speaking but rather listening to Bob. 
Table 4.1 Tiny packets are only observed in VoIMS.
Groups: VoIMS; Non-IMS VoIP; Non-VoIP Streaming; Non-Streaming.
No. and App Name (1 to 30): Skype Telegram Discord TextNow VoLTE (4G) VoNR (5G) Google Voice 1 2 3 4 WhatsApp 5 WhatsApp Business 6 7 8 9 10 Google Hangouts Snapchat 11 Twitch 12 Spotify 13 14 Netflix 15 16 Disney+ 17 Amazon Prime Video 18 Xbox 19 NewsBreak 20 Bigo Live 21 Shazam! 22 YouTube Kids 23 24 Microsoft Edge Twitter 25 26 Photomath 27 Microsoft Teams 28 29 Booking 30 Duolingo The Weather Channel SoundCloud Zillow
Len (1 to 30): 11 – 13 12 – 13 13 32 32 37 42 42 42 50 54 42 42 42 42 42 42 42 42 42 54 59 42 42 42 42 54 42 42 42
No. and App Name (31 to 60): Snapfish Instagram TikTok 31 Amazon 32 Reddit 33 McDonald’s 34 DuckDuckGo 35 DoorDash 36 WeatherPort 37 Waze 38 39 40 Walmart 41 Airbnb 42 AAA 43 44 Apex News 45 Expedia 46 Chrome 47 Brave Browser 48 Google Earth 49 50 51 Google Translate 52 Microsoft Authenticator 53 Acrobat Reader 54 Google Docs 55 DNS Speed Test 56 FTP Server 57 Outlook 58 Canva 59 Google Authenticator 60 SpeedVPN Thunder VPN Pinterest
Len (31 to 60): 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 54 61 30 37 42 42 42 42 42 42 42 54 54 54

Figure 4.2 Call state inference for a single detected call (§4.2.2).

A straightforward approach for inferring call states is to examine the presence of Comfort Noise (CN) and non-Comfort-Noise (nonCN) voice packets in both the downlink (DL) and uplink (UL) directions to determine whether Alice is talking. Specifically, UL-nonCN packets are transmitted when Alice is speaking, while UL-CN packets are transmitted when Alice is not speaking and is in a period of silence. Similar observations are made for DL packets, depending on whether Bob is talking. However, several practical issues need to be addressed, as shown in Figure 4.2.

First, there may be situations where the call conversation has not yet begun, even when packets are transmitted over the target DRB.
This scenario occurs in cases of premium voice services. For instance, with features like Early Media [80], the callee can play alerting media (e.g., a song) to the caller before the conversation starts. Consequently, we introduce an additional call state, "no conversation," which indicates that the DRB is active, but the call has not been fully established. This state is identified when both #UL-CN and #DL-CN are equal to zero. The conversation begins as soon as the first CN packet is observed. Notably, the first CN packet can appear in either the DL or UL direction, as Alice can be either the caller or the callee. Second, distinguishing CN and nonCN packets solely based on their packet lengths can be challenging. However, our observation reveals that nonCN packets are consistently larger than CN packets. NonCN voice packets have a minimum PDCP payload length of 15.08 bytes (rounded up to 16 bytes), which is observed when the lowest VoIMS codec bit rate for nonCN voice packets is used (4.75 Kbps via AMR), and the inter-arrival time is an average of 20 ms (resulting in 50 packets per second). ROHC further reduces the size of RTP headers to 3.2–6.5 bytes. In contrast, CN packets have a maximum length of 13 bytes. In our approach, we employ a threshold of 𝜃 = 16, where packets with a payload length of less than 16 bytes are considered CN packets. To infer call states more accurately, we collect packet statistics every second, including the counts of CN and nonCN packets in the UL and DL, labeled as #UL-CN, #UL-nonCN, #DL-CN, and #DL-nonCN. 
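The thresholding and per-second counting described above can be sketched as follows. This is a minimal illustration rather than the dissertation's tool: the trace format (timestamp in seconds, direction, PDCP payload length in bytes), the function names, and the sample packets are assumptions made for the example; only the threshold of 16 bytes and the four counter labels come from the text.

```python
from collections import Counter

CN_THRESHOLD_BYTES = 16  # theta in the text: shorter payloads are treated as CN


def classify(payload_length):
    """Label a PDCP payload as comfort noise (CN) or non-comfort-noise (nonCN)."""
    return "CN" if payload_length < CN_THRESHOLD_BYTES else "nonCN"


def per_second_counts(packets):
    """Count packets per one-second bin, keyed like '#UL-CN' or '#DL-nonCN'.

    packets: iterable of (timestamp_seconds, direction, payload_length),
    where direction is "UL" or "DL" (an assumed trace format).
    """
    counts = {}
    for timestamp, direction, payload_length in packets:
        second = int(timestamp)
        label = f"#{direction}-{classify(payload_length)}"
        counts.setdefault(second, Counter())[label] += 1
    return counts


# Hypothetical trace: Alice speaks in second 0 and falls silent in second 1.
trace = [
    (0.00, "UL", 31), (0.02, "UL", 31), (0.16, "DL", 9),
    (1.00, "UL", 10), (1.02, "DL", 33),
]
stats = per_second_counts(trace)
print(stats[0]["#UL-nonCN"], stats[0]["#DL-CN"], stats[1]["#UL-CN"], stats[1]["#DL-nonCN"])
```

Using a Counter per second means any label that never occurs in a bin simply reads as zero, which matches how the "equal to zero" conditions are phrased.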
Table 4.2 Intermediate criteria used to infer call states (∗: wildcard).
Call State | #UL-CN | #UL-nonCN | #DL-CN | #DL-nonCN
Talking | ∗ | >0 | >0 | ∗
Listening | >0 | ∗ | ∗ | >0
No conversation | =0 | ∗ | =0 | >0

Table 4.2 lists the intermediate criteria used to infer the three call states based on the presence of these four packet types: "No conversation" for situations before the call begins, when both #UL-CN and #DL-CN are equal to zero; "Talking" when #UL-nonCN is greater than zero and #DL-CN is greater than zero, indicating that the user is sending voice packets to the remote party while receiving comfort noise packets; and "Listening" when #DL-nonCN is greater than zero and #UL-CN is greater than zero, indicating that the user is sending comfort noise packets while receiving voice packets from the remote party. In practice, short-term call state inference based on these criteria may not be sufficient to accurately determine call time due to potential noise. We observe tiny CN packets in both directions even when Alice is talking or listening, as explained in Section 4.2.3 and Figure 4.3a. In the following sections, we present our final approach for call state inference and discuss inferring call time.

4.2.3 Inferring Call Time

Figure 4.3 Comparison in a real-world instance. (a) An intuitive approach (§4.2.2). (b) Our final approach with time inference.

The approach outlined in Section 4.2.2 involves inferring call states on a per-second basis and accumulating the time periods associated with the talking and listening states. However, practical implementation reveals that this method encounters challenges due to the presence of additional noise packets. These noise packets fall into two distinct categories: "hidden" noise and "redundant" comfort noise.

Figure 4.4 Side-channel inference uses DBSCAN and MAVG to prevent unnecessary talking-listening state switches (§4.2.3).
Hidden noise packets are unvoiced nonCN packets generated in response to unresolved environmental noises when one party involved in the call is in the listening state. On the other hand, redundant comfort noise arises during brief speech pauses while the person is talking. The consequence of these noise packets is frequent and undesired transitions between the talking and listening states, as depicted in Figure 4.3a. Short speech pauses, often accompanied by redundant comfort noises, lead to inaccurate inferences regarding the talking state. These inaccuracies involve wrongly concluding that talking has ceased and transitioning to the listening state, only to revert quickly to the talking state as the person continues talking. Similarly, hidden noise packets affect the precision of inferences about when the listening state ends.

Accurate inference of call times is essential for various call-state-based applications and attacks. This is particularly important in situations where multiple users engage in phone calls simultaneously. Inaccurate call times may not furnish adversaries with sufficient information to distinguish between calls or associate them with specific cellular/user identities.

To address these issues, we introduce two approaches: (1) density-based spatial clustering of applications with noise (DBSCAN) and (2) the moving average of the voiced packet ratio (MAVG). These strategies mitigate unnecessary transitions between the listening and talking states, resulting in more precise inferences of the start and end times for each call state, as depicted in Figure 4.4.

◦ DBSCAN serves to manage hidden noises and prevent unwarranted transitions from the listening state to the talking state. We analyze nonCN packets and categorize them into two groups: (1) voiced nonCN and (2) unvoiced nonCN. Hidden noise packets belong to the unvoiced nonCN category. Unvoiced nonCN packets do not contain user voice but carry uncanceled environmental noise.
By classifying these unvoiced nonCN packets as comfort noise packets, we mitigate the issue of prematurely leaving the listening state. Differentiating between these categories remains challenging due to the variable voice coding rates inherent in the AMR audio codec used by VoIMS. To address this, we employ a classifier based on the well-established DBSCAN algorithm, classifying nonCN packets into voiced and unvoiced nonCN packets with a user-defined number of clusters or categories and a suitable $\epsilon$ value, representing the maximum distance range for data points in the same cluster. In our prototype, $\epsilon$ is set to 10, delivering comparable performance across all audio coding rates outlined in VoIMS standards [26, 24, 28].

◦ MAVG addresses redundant comfort noises and prevents unwarranted transitions from the talking state to the listening state. This is achieved by ensuring that the transition from talking to listening is not solely based on minimal comfort noise. We develop a moving average algorithm that considers both comfort noise packets and unvoiced nonCN packets, which may be generated during short speaking pauses. The algorithm operates as follows: First, within a predefined time window (e.g., 2 to 4 seconds), denoted as $wnd$, we collect statistics on the numbers of uplink comfort noise packets, unvoiced nonCN packets, and voiced nonCN packets, represented as #CN, #Unvoiced-nonCN, and #Voiced-nonCN, respectively. Subsequently, we compute the percentage of voiced packets, labeled as $VP$, within each $wnd$, using the formula $VP = \frac{\#\text{Voiced-nonCN}}{\#\text{CN} + \#\text{Unvoiced-nonCN} + \#\text{Voiced-nonCN}} \times 100\%$. If the observed $VP$ exceeds 50%, the state is inferred as "talking"; otherwise, it is inferred as "listening."

Figure 4.3b provides an example illustrating the efficacy of our proposed solutions in preventing unnecessary state transitions, thereby resulting in more precise inferences of talking and listening times.

Table 4.3 The accuracy of VoIMS call detection, state and time inference (†: VoNR). C: Conversation, T: Talking, L: Listening.
Column groups: Cross-Carrier Exp. (OP-I, OP-II, and OP-III with the S5) and Cross-Phone Exp. (OP-III with the G3, Pixel 3, and Pixel 5†).
Inference accuracy (C, T, L): 100% in every column.
Time errors, in extracted order: 0.51s 0.57s 1.8s 0.89s 0.99s 0.36s 0.76s 1.22s 0.48s 0.39s 0.22s 0.31s 0.45s 0.14s 0.53s 0.6s 0.3s 0.4s

In conclusion, we determine the time for a call conversation as follows: The conversation initiates when the "talking" or "listening" state is identified for the first time. This approach is more robust than detecting the first PDCP packet over the target DRB. The conversation concludes when the last PDCP packet is sent or received over this target DRB. However, it is essential to acknowledge that time inference may exhibit slight inaccuracies because the official establishment or termination of the call occurs via SIP, delivered over SRB, rather than DRB.

4.2.4 Evaluation on 5G/4G Call Inference

We conducted extensive experiments involving three major US operators, namely OP-I, OP-II, and OP-III, to evaluate the effectiveness (accuracy) of the proposed side-channel inference techniques in terms of VoIMS call detection, and call state and time inference. Our tests involved four 4G/5G smartphones: the Samsung Galaxy S5, LG G3, Google Pixel 3, and Google Pixel 5 (a 5G phone), and two mainstream VoIMS services, VoLTE and VoNR (limited to Google Pixel 5). Each experiment setting, which includes the operator, phone model, and VoIMS service, was executed 20 times. In each run, a victim VoIMS call spanned 30 seconds: 10 seconds for alerting, 10 seconds for talking, and 10 seconds for listening. The callee answered an incoming call 10 seconds after the ringtone started, after which the caller and callee engaged in a 20-second conversation.
Simultaneously, other participating smartphones (not the victim) ran various accompanying traffic, including non-VoIMS Internet applications and VoIMS calls. We opted for fixed 10-second intervals in the evaluation experiments because they offered ample time to explore different call state transitions. The evaluation of call inference with varying intervals is discussed in Sections 4.3.1 and 4.3.2 (proof-of-concept attacks).

Table 4.3 illustrates the success of side-channel call inference in both cross-carrier and cross-phone scenarios. Due to space constraints, we present the results for all operators using the Samsung S5 and the results for all phone models using OP-III. Our observations include the following:

1. All VoIMS calls and states are reliably detected, and non-VoIMS traffic is never mistakenly recognized as VoIMS calls.

2. Average time errors remain consistently below 9%, except for the inference using the Samsung S5 over OP-III. Specifically, the average errors range from 0.3 seconds to 0.57 seconds (1.5% to 2.85%) for conversation (20 seconds), from 0.14 seconds to 0.89 seconds (1.4% to 8.9%) for talking (10 seconds), and from 0.31 seconds to 0.76 seconds (3.1% to 7.6%) for listening (10 seconds).

3. In experiments involving the Samsung S5 with OP-III, we noted that during a VoIMS call, comfort noise packets were transmitted to the cellular infrastructure at a low rate (at most 7 packets per second). This rate is significantly lower than the rate stipulated by the VoIMS standard (50 packets per second) [24, 26, 28, 16]. Consequently, call state inference took longer and exhibited higher error rates. This anomaly could be attributed to an implementation flaw specific to the Samsung S5 and OP-III, as similar issues were not observed with the other tested phone models and operators.
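The MAVG rule evaluated above (count packets per window, compute 𝑉𝑃, compare against 50%) can be sketched in a few lines of Python. This is a minimal illustration rather than the VoIMSAnalyzer implementation: the function names are hypothetical, packets are assumed to be already labeled as "CN", "unvoiced", or "voiced" (for example, by the DBSCAN step), timestamps are assumed to be in milliseconds, and the windows are non-overlapping unless a smaller step is passed.

```python
from collections import Counter


def voiced_percentage(labels):
    """VP for one window: voiced nonCN packets over all counted packets, in percent."""
    counts = Counter(labels)
    total = counts["CN"] + counts["unvoiced"] + counts["voiced"]
    return 100.0 * counts["voiced"] / total if total else 0.0


def infer_states(packets, wnd_ms=2000, step_ms=2000, threshold=50.0):
    """Infer one call state per window.

    packets: list of (timestamp_ms, label) tuples, label in {"CN", "unvoiced", "voiced"}.
    Returns a list of (window_start_ms, state) tuples.
    """
    if not packets:
        return []
    packets = sorted(packets)
    first_ts, last_ts = packets[0][0], packets[-1][0]
    states = []
    t = first_ts
    while t <= last_ts:
        # Collect the labels of all packets that fall inside [t, t + wnd_ms).
        window = [label for ts, label in packets if t <= ts < t + wnd_ms]
        vp = voiced_percentage(window)
        states.append((t, "talking" if vp > threshold else "listening"))
        t += step_ms
    return states
```

With a 2-second window, a 200 ms pause filled with comfort noise inside otherwise voiced speech leaves 𝑉𝑃 at 90%, so the window stays in the talking state instead of flipping to listening and back.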
4.3 Proof-of-concept Attacks

In the following sections, we introduce several proof-of-concept attacks that leverage the inferred VoIMS call state information. We start with passive attacks that reveal more sensitive information beyond call states, such as the caller's identity (§4.3.1). Subsequently, we present an active attack that uses the inferred call information to overshadow a selected voice call and mute the victim at specific times (§4.3.2).

4.3.1 Who is Calling?

The first passive attack aims to identify the caller and connect her two distinct identities: the user identity (e.g., name and phone number) and the cellular identity (e.g., International Mobile Subscriber Identity (IMSI), Radio Network Temporary Identity (RNTI), and Globally Unique Temporary Identity (GUTI)). If an attacker can link a mobile user's user and cellular identities, it opens the door to powerful cellular-identity-based cyberattacks, such as IMSI-based Denial-of-Service (DoS) attacks [141], which can then be targeted at specific high-value victims instead of random individuals.

Numerous studies, including [56, 97], have demonstrated how adversaries can easily acquire the user identities (including names and phone numbers) of mobile users through online payment services (e.g., PeopleLooker [10]), social network platforms, and data breaches from online service providers [11]. Regarding cellular identities, several methods [86, 77, 88] have been proposed to infer them or link them to each other (e.g., by forcing a device to transmit its IMSI or linking RNTIs to an IMSI [88]). However, no study has yet presented a stealthy method to link a mobile user's user identity to their cellular identity. While some research suggests the feasibility of such linkage, the techniques are not covert. For instance, Hussain et al. [77] require making multiple calls to the victim while already knowing their phone number.
Importantly, compromising the carrier's infrastructure is not part of the threat model outlined in Section 4.1. We thus developed a novel attack called the "Cross-domain Identity Linkage attack," or "CrossIL," which leverages precise call state inference and correlates the inferred call states with related visual data extracted from the visualization domain (such as video recordings). The core idea of this attack is illustrated in Figure 4.5.

Figure 4.5 Overview of a user identity linkage attack.

The motivation for the CrossIL attack arises from two key factors in the visualization domain:

Distinct User Postures: Users typically exhibit different postures when using VoIMS services, such as holding a phone next to their ear. This behavior contrasts with how they interact with other mobile services, like Internet surfing or texting. To validate this, we conducted an online survey with college students. Among the 83 collected responses, 53 participants preferred to place their phones near their ears during phone calls in public, while 30 participants did not (e.g., they used earphones).

Face Recognition Advances: Face recognition techniques have become increasingly sophisticated, allowing adversaries to recognize people's faces with high accuracy (greater than 90%) in video frames, even when the faces are small or appear in tiny frames. Once a face is recognized, adversaries can use reverse image search engines (e.g., PimEyes [13]) to find the owner's name and then obtain their phone number through paid online services (e.g., PeopleLooker and Spokeo).

Note that this attack involves deploying cameras and sniffers in a specific area (e.g., an airport or subway) to identify potential victims making calls on the spot. The attack does not target pre-selected individuals.
Once eligible victims are identified, adversaries can launch cellular-identity-based attacks (e.g., IMSI-based Denial-of-Service [141]) against high-value victims only, either immediately or at a later time, instead of randomly selecting targets, which is easier but less effective.

Practicality of the Attack. Critics might raise three concerns about the practicality of the CrossIL attack: (1) Deployment of Devices: adversaries need to install cellular radio sniffers and surveillance cameras in public areas, where they could be discovered. (2) Multiple Users Making Calls: multiple users might have VoIMS calls at similar times, making it challenging to distinguish between them. (3) Lack of VoIMS Users: there may be no VoIMS users with ongoing calls during surveillance, making the attack less practical. However, we believe that these issues can be addressed without significant technical challenges. First, modern hidden cameras [12] are discreet, have extended battery life, and offer ample storage capacity, and portable sniffers can cover more than 1 km [86], enabling covert operations. Second, our precise call state inference mechanism can differentiate between multiple VoIMS users with concurrent calls. Third, because the CrossIL attack specifically targets VoIMS users, adversaries can strategically deploy the sniffers and cameras in selected public locations where phone calls are frequent, such as airports and hotel lobbies. With a high volume of people passing through during the day, it is likely that individuals making calls will be observed within the surveillance coverage.

Attack design. At a high level, this attack gathers cellular identities from radio traces (via VoIMSAnalyzer) and user identities from the visualization domain (via VideoAnalyzer), and then links them by correlating the victim's call states and related motions.
The primary challenge in launching this attack lies in accurately detecting from recorded videos when a voice conversation starts. Specifically, the main difficulty is distinguishing between two scenarios: (1) the user initiates an outgoing call and waits for the called party to answer, and (2) the user answers an incoming call and listens to the caller without speaking at all. In both scenarios, the user exhibits similar behaviors: they move their phone to their ear and show no lip motion for a period. This ambiguity leads to notable errors in call state determination from the visualization domain and significantly reduces the effectiveness of the attack. To address this challenge, we have devised a novel approach called "cross-domain indeterministic call state correlation." This approach introduces an indeterministic state L' to account for these two scenarios. We now elaborate on its three key components.

◦ VoIMSAnalyzer. The new functionality introduced for this attack extracts the cellular radio identity of each VoIMS call, such as C-RNTI, IMSIs, and TMSIs [30, 36]. In other words, VoIMSAnalyzer is now capable of not only discovering cellular identities but also inferring the corresponding VoIMS call states (e.g., talking and listening times) from encrypted radio traces.

Figure 4.6 Three steps for cross-domain identity linkage.

◦ VideoAnalyzer. The attack capitalizes on increasingly mature face recognition techniques that can recognize individuals with high accuracy (exceeding 90%) in video frames, even when dealing with small or tiny faces [74, 95, 116, 73]. Furthermore, it takes advantage of public image search engines designed to identify people using facial images. For instance, PimEyes.com, as of August 2022, boasts a database containing more than 2.1 billion unique faces [13]. VideoAnalyzer's methodology involves the extraction of call-related motions specific to each user's identity.
This can include actions like moving a phone closer to or further away from an ear. These extracted motions are then used to generate estimated call statistics from video recordings. VideoAnalyzer comprises three sub-modules:

(1) Call Motion Detector: This component detects two voice-call-specific human activities: moving a phone close to an ear and moving it away from an ear. These actions are used to identify the start and end times of phone calls, respectively.

(2) Lip Motion Detector: This component identifies the start and end times of each talking or listening interval by analyzing human lip motions with a recurrent neural network (RNN) model [115].

(3) Face Detector and Recognizer: This module locates each user's face in video frames and identifies the corresponding user identity, such as their name. It uses the Dual Shot Face Detector (DSFD) [93] for detecting human faces and the ResNet50 [72] model for recognizing these faces.

For every identified user identity, VideoAnalyzer outputs the start and end times of each call conversation, along with the talking and listening time intervals, which may be interleaved.

◦ Cross-domain identity linkage. This component associates a cellular radio identity with a user identity by correlating their corresponding call event sequences generated by VoIMSAnalyzer and VideoAnalyzer. The process of cross-domain indeterministic call state correlation consists of three steps, as depicted in Figure 4.6.

Step 1. Given a video-induced call record produced by VideoAnalyzer, the correlator searches through all radio-induced call records and checks whether any of them overlap with it based on their call start and end times.

Step 2. Due to the indeterminacy of L', which indicates either listening to the other call party or waiting for the called party to answer, the video-induced call event sequence, designated as 𝐶𝐸𝑆𝑒𝑞_video, is not in a deterministic form. We thus expand 𝐶𝐸𝑆𝑒𝑞_video into multiple deterministic call event sequences by exploring all states that L' can take in practice. For example, the video-induced call event sequence "S, L', L', T, T, L, E" is expanded into three sequences: (1) "S, L, L, T, T, L, E"; (2) "S, L, T, T, L, E"; and (3) "S, T, T, L, E," which together cover all possible call event sequences.

Step 3. We calculate matching scores between each of the radio-induced call event sequences and all the sequences expanded from the given video-induced call event sequence; the correlation with the highest matching score is chosen. We calculate the Edit distance (i.e., Levenshtein distance [87]), which quantifies the similarity between two strings, between two selected call event sequences and obtain their matching score as

$$\text{Matching score} = 1 - \frac{\text{Edit distance}}{|\text{Longest call event sequence}|}$$

For example, the Edit distance between "S, T, T, L, E" and "S, L, T, T, L, E" is 1, and their matching score is 0.83 ($1 - \frac{1}{6}$).

Attack implementation. In addition to VoIMSAnalyzer, we implement VideoAnalyzer using Python3 on HPCC servers with the following libraries: Keras (Mask R-CNN and RNN lip movement model), PyTorch (Dual Shot Face Detector), keras_vggface (ResNet50), scipy (Cosine Similarity), and cv2. Notably, we did not need to collect a large-scale dataset, since all the models we used were pre-trained [71]. For example, Mask R-CNN had been trained on the COCO dataset with 80K training images [71] and validated on 35K test images; ResNet50 had been trained with 1.28 million training images from ImageNet and evaluated on 50K test images. Correlator was implemented in Python3 using the timestamps recorded in radio traces and videos.
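Steps 2 and 3 of the correlation can be illustrated with a short Python sketch. It is not the Correlator used in the prototype: the function names are hypothetical, and the expansion assumes that all L' states form one contiguous run in which the earliest L' states are the ones that may correspond to waiting (and are therefore dropped), which is how the example in Step 2 expands.

```python
def expand_indeterministic(seq, unknown="L'"):
    """Expand a video-induced sequence into all deterministic candidates.

    Assumption: the L' states form a single contiguous run. Dropping the
    first j of them models "waiting for an answer"; the rest become "L".
    """
    positions = [i for i, state in enumerate(seq) if state == unknown]
    if not positions:
        return [list(seq)]
    first, k = positions[0], len(positions)
    head, tail = list(seq[:first]), list(seq[first + k:])
    return [head + ["L"] * (k - j) + tail for j in range(k + 1)]


def edit_distance(a, b):
    """Levenshtein distance between two sequences of call events."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute x with y
        prev = cur
    return prev[-1]


def matching_score(a, b):
    """1 - Edit distance / length of the longer sequence."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def best_match(video_seq, radio_seqs):
    """Return (radio_id, score) of the radio record that best fits the video record."""
    candidates = expand_indeterministic(video_seq)
    scored = {
        radio_id: max(matching_score(c, seq) for c in candidates)
        for radio_id, seq in radio_seqs.items()
    }
    best = max(scored, key=scored.get)
    return best, scored[best]
```

For the sequence "S, L', L', T, T, L, E", a radio record "S, L, T, T, L, E" matches the second expansion exactly and receives a score of 1.0, while the pair "S, T, T, L, E" and "S, L, T, T, L, E" yields the 0.83 computed in Step 3.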
The attack evaluation is performed in both controlled (without passersby) and wild (with passersby) environments. The controlled experiment was conducted in a classroom during holidays, when only the experiment participants were on campus, whereas the wild experiment was carried out in the lobby of a dormitory with passersby. There were 7 participants, and each of them was asked to freely dial or receive 15 VoIMS calls within two hours under the surveillance of two cameras (iPhone 12). The participants could make phone calls simultaneously. In the experiment, we measured the inference accuracy in terms of call start time, call end time, talking/listening times, and the association between the cellular and user identities. To obtain the ground truth, we not only collected cellular radio traces and recorded videos but also logged VoIMS call events from the participants' smartphones using Android logcat.

Table 4.4 summarizes the experimental results. We have four observations.

First, the success rates of linking cellular identities to user identities are 59/60 (98.3%) and 40/45 (88.9%) in the controlled and wild environments, respectively. Such high success rates are achieved even when VideoAnalyzer has up to 17.3% error in estimating talking and listening times, since Correlator employs multiple call states instead of only talking and listening times.

Second, most estimation errors from VideoAnalyzer in the wild environment are noticeably larger than those in the controlled environment. There are two reasons: (1) cameras were occasionally blocked by passersby (1/45 phone calls), and (2) the brightness of natural light is not always stable; specifically, 14/45 phone calls experienced short (a few seconds) underexposure or overexposure. To address this issue, adversaries may deploy multiple hidden surveillance cameras to reduce potential interference and noise, such as when a victim's face is blocked by passersby.
Third, VideoAnalyzer can precisely recognize the faces of participants for all the VoIMS calls and then discover their names from our database.

Fourth, VoIMSAnalyzer has similar errors in estimating call start and end times in the controlled and wild experiments, whereas its estimation of talking and listening times has higher errors in the wild setting (2.8%∼6.5%) than in the controlled setting (2.8%∼4.9%). The reason is that the background noise of the wild environment is larger than that of the controlled one.

Table 4.4 Summary of cross-domain identity linkage attack performance.

| Module | Performance | Metric | User1 | User2 | User3 | User4 | User5 | User6 | User7 |
| Setting | | | Controlled | Controlled | Controlled | Controlled | Wild | Wild | Wild |
| RadioAnalyzer | Call Events Time Estimation | Call start error | 0.92s | 0.32s | 0.85s | 0.85s | 1.20s | 1.43s | 1.60s |
| RadioAnalyzer | Call Events Time Estimation | Call end error | 0.18s | 0.27s | 0.32s | 0.37s | 0.32s | 0.15s | 0.28s |
| RadioAnalyzer | Call Events Time Estimation | Talking & listening time error | 1.8s (4.6%) | 1.4s (2.8%) | 2.3s (4.9%) | 1.6s (3.9%) | 1.5s (2.8%) | 2.88s (6.5%) | 2.65s (6.3%) |
| VideoAnalyzer | Call Events Time Estimation | Call start error | 1.87s | 2.53s | 1.99s | 2.0s | 6.42s | 3.09s | 2.85s |
| VideoAnalyzer | Call Events Time Estimation | Call end error | 2.15s | 3.74s | 3.97s | 2.14s | 3.01s | 4.95s | 1.18s |
| VideoAnalyzer | Call Events Time Estimation | Talking & listening time error | 3.2s (8.2%) | 4.12s (8.3%) | 2.15s (4.6%) | 2.23s (5.4%) | 9.25s (17.3%) | 6.67s (15.1%) | 4.37s (10.4%) |
| VideoAnalyzer | Face Recognition | Accuracy | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
| Correlator | Cellular and User ID Linkage | Accuracy | 100% (15/15) | 93.5% (14/15) | 100% (15/15) | 100% (15/15) | 86.7% (13/15) | 86.7% (13/15) | 93.5% (14/15) |

Current prototype limitations. While our proposed attack demonstrates effectiveness in our experiments, the current prototype has several limitations: (1) it requires video recordings at a resolution of 1080P or higher; (2) skewed, crooked, and blurred faces cannot be well recognized; (3) video call and earphone users are immune to this attack; and (4) it only considers locations with good cellular signals.
To address these shortcomings, various techniques can be employed to enhance the prototype's performance. For instance, tiny face recognition techniques such as those presented in [82, 70, 74, 95] can facilitate the recognition of small faces within low-resolution video recordings. Skewed and crooked faces can be handled with the methods outlined in [128, 94, 96], and techniques from studies like [107, 100] can be applied to recognize blurred faces. We defer the exploration and implementation of these improvements to future work.

Limited to 4G? It might be argued that 5G users are immune to the proposed attacks since 5G mobile devices do not transmit the permanent cellular identity (e.g., IMSI) in plaintext but in ciphertext, known as the Subscription Concealed Identifier (SUCI), making it impossible to learn the cellular identity. However, this may not be the case, as researchers have demonstrated network downgrade attacks [83] capable of downgrading 5G mobile devices to legacy 4G networks.

Mapping RNTIs to IMSIs in a Short Time. While our sniffer is designed for long-term traffic sniffing, prior studies have shown that adversaries can compel a mobile device to transmit its IMSI to the cellular infrastructure using a fake EMM Service Request message [61]. Subsequently, they can continuously correlate all RNTIs assigned to this device [89].

4.3.2 Selective Voice Muting Attack

We have developed an active attack that selectively mutes one of the call parties based on inferred call states. This attack is a form of Denial-of-Service (DoS) attack. However, it differs from typical jamming attacks, which disrupt calls by continuously and indiscriminately transmitting wireless noise.
Such jamming attacks often degrade channel quality, which can be easily detected through physical-layer performance metrics, including bit error rate (BER) and signal-to-noise ratio (SNR) [52]. In contrast, our attack overpowers the victim's voice packets with stronger signals only when necessary, namely when the victim is talking. It accomplishes this by transmitting valid PDCP (Packet Data Convergence Protocol) packets over the victim's assigned uplink channels at specific times. Consequently, the remote call party cannot hear the victim's voice, which may lead to call termination after a prolonged silent period. Compared with conventional jamming attacks such as Jammer-V [113], which can be readily detected through abnormal fluctuations in BER and SNR, our attack offers a dual advantage: it evades detection, and it reduces the attack cost by transmitting PDCP packets only while the victim is talking.

Attack design. This attack uses the cellular sniffer and the VoIMS analyzer built for the previous attacks. The main change is that we reduce the inference window from several seconds to 200 ms in order to launch the attack in real time. The inference threshold changes accordingly: if the percentage of voice packets is greater than 40%, the talking state is inferred; otherwise, the listening state is inferred. Whenever the victim enters the talking state, this component signals the voice muter to start the overshadowing attack, which continues until the victim returns to the listening state. The attack overshadows the victim's uplink voice packets through the Uplink Voice Muter.
With the victim's C-RNTI (Cell Radio Network Temporary Identity) and uplink control information, it fabricates valid packets (i.e., random Internet Control Message Protocol (ICMP) packets) and transmits them over the physical uplink channels granted to the victim using stronger signals (e.g., 3 dB higher [141]).

Attack implementation. We implement this attack on a software-defined radio (SDR) platform using srsRAN (v20.10.1) [14], which can connect to the operational cellular network. It monitors the Physical Downlink Control Channel (PDCCH) to collect the uplink and downlink control information (DCI) [29] of nearby cellular devices. When the talking state of the target victim is detected, both the C-RNTI and the uplink control information are sent to the uplink voice muter, which overshadows the victim's uplink signals. To generate stronger signals, we modify the values of the transmission gain (tx_gain) and receiver gain (rx_gain) in the ue.conf file, as done in [141].
Figure 4.7 The selective muting attack only overshadows Alice's uplink voice data when she is talking.

Attack evaluation. We assess the efficacy of the proposed attack, referred to as SelectiveMuter, by comparing it against Jammer-V, an attack that uses state-of-the-art techniques [113] to disrupt VoIMS calls. Our experimental setup involves 30 participants, comprising 4 callers and 26 callees. Each callee is paired with a caller. The callees are categorized into two groups, "friends" and "strangers": in the first group, the assigned caller's phone number is known to the callee, while in the second group it is unknown. The experiments span two weeks. Each caller places a minimum of three voice calls to their assigned callees, under both the SelectiveMuter and Jammer-V attacks. The muting attacks are executed against approximately two-thirds of the outgoing calls, with SelectiveMuter and Jammer-V employed in equal measure. Figure 4.7 illustrates the operation of SelectiveMuter in a specific instance. Meanwhile, Figure 4.8 compares the two attacks using the "attack time ratio" metric, defined as $\frac{\text{Overshadowing time}}{\text{Call time}}$.
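To make the attack time ratio concrete, the following Python sketch derives overshadowing intervals from per-window voice packet percentages, using the 200 ms window and 40% threshold from the attack design, and then computes the ratio. It only illustrates the metric: the function names and the sample percentages are hypothetical and are not taken from the experiments.

```python
def muting_intervals(window_ratios, wnd_ms=200, threshold=40.0):
    """Turn per-window voice packet percentages into (start_ms, end_ms) intervals.

    A window counts as "talking" when its percentage exceeds the threshold;
    consecutive talking windows are merged into one overshadowing interval.
    """
    intervals = []
    start = None
    for i, ratio in enumerate(window_ratios):
        talking = ratio > threshold
        if talking and start is None:
            start = i * wnd_ms
        elif not talking and start is not None:
            intervals.append((start, i * wnd_ms))
            start = None
    if start is not None:
        # The call ended while the victim was still in the talking state.
        intervals.append((start, len(window_ratios) * wnd_ms))
    return intervals


def attack_time_ratio(intervals, call_ms):
    """Overshadowing time divided by call time, in percent."""
    overshadowing_ms = sum(end - begin for begin, end in intervals)
    return 100.0 * overshadowing_ms / call_ms


# Hypothetical 2-second call split into ten 200 ms windows.
ratios = [0, 10, 80, 90, 45, 30, 0, 60, 60, 0]
selective = muting_intervals(ratios)
print(selective, attack_time_ratio(selective, call_ms=2000))
# A jammer that overshadows the whole call has a ratio of 100%.
print(attack_time_ratio([(0, 2000)], call_ms=2000))
```

In this made-up trace the victim talks in five of the ten windows, so the selective approach overshadows half of the call, whereas continuous overshadowing always yields a ratio of 100%.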
Figure 4.8 Boxplot of attack time ratios (ATRs) in three groups: Stranger, Friend, and All (Stranger + Friend).

Our analysis yields three key findings. First, all attack instances, whether SelectiveMuter or Jammer-V, achieve a 100% success rate. Across a total of 26 instances each, calls are terminated, on average, within 6.2 seconds (strangers) to 16.8 seconds (friends). Second, SelectiveMuter is more efficient, reducing the median attack time ratio by 59.9% to 61.8%. This improvement arises because Jammer-V overshadows all uplink cellular signals of the victims, regardless of whether they are speaking or listening, whereas SelectiveMuter generates attack signals only during the victim's talking phase, as demonstrated in Figure 4.7. Third, the attack time ratio is slightly higher in the friend group because the callers engage in more conversation and the callees do not promptly terminate the call even when voice communication is disrupted. When combined with the previously mentioned passive attacks, this active attack can compound the damage.

4.4 Discussion

The real-world implications of side-channel VoIMS call inference extend beyond the proof-of-concept attacks presented earlier.

Exploring Further Attacks. Precise call state inference can augment existing research in sociology and linguistics, where call state information is used to infer user profiles (e.g., residents, commuters, visitors [62, 122]), personality traits (e.g., extroversion, agreeableness [117, 103]), and dominant partners [63]. Additionally, it can be leveraged to deduce user behaviors (e.g., spam) [124, 125] and social interactions [137]. Furthermore, outgoing and incoming phone call patterns can be instrumental in inferring individual wealth [119, 63]. Steele et al. [119] illustrated that these call patterns provide valuable information for inferring the poverty and wealth of call parties. The stealthy monitoring and inference of voice calls raise additional privacy concerns in this context. For instance, side-channel VoIMS inference can be exploited in the context of robocalls or Interactive Voice Response (IVR) systems, such as those used by banks or customer service lines. Since IVR systems employ pre-recorded or synthesized voices, they generate distinctive audio fingerprints. By leveraging fine-grained call state inference, adversaries could differentiate robocalls from human conversations and potentially determine whether the victim is interacting with automated services. This includes inferring whether a user is contacting a bank or a specific company, raising significant concerns about the leakage of sensitive personal activities and affiliations. While these attacks can be bypassed by using non-VoIMS apps like Skype, they are effective in scenarios where the phone number serves as the sole means of contact between the two call parties, such as in business-related calls.

4.5 Solution

In this section, we present a standard-compliant solution that prevents VoIMS call states from being inferred, and then assess its effectiveness. At first glance, the above vulnerabilities are not difficult to fix: one can remove the unique features of the exploitable VoIMS packets described previously (i.e., tiny compressed comfort noise packets and distinguishable voiced/unvoiced packets in Section 4.2) by inserting additional padding into VoIMS packets (e.g., padding compressed comfort noise packets so that their sizes are larger than or equal to 21 bytes, the size of the smallest IP packet with a one-byte payload).
However, the conventional padding-based solution has negative real-world impacts: (1) users may need to pay for the additional padding when operators charge by the volume of data transmitted and received, and (2) the IMS media gateways (as shown in Figure 4.9) have to remove the padding from VoIMS packets before forwarding them to the call recipients, which significantly increases the load on the IMS media gateways given the large number of voice calls¹.

¹ According to [104], Verizon customers make 800 million wireless calls per day, which is more than double the population of the United States.

Figure 4.9 Singular Rectifier (SR) at the PDCP layer.

Prototype. We thus propose a singular rectifier that adds and removes the necessary padding at the PDCP layer of phones and base stations, as shown in Figure 4.9. This design addresses the aforementioned two issues. First, users do not need to pay for the additional padding, which does not reach the core network and is not counted by the charging function. Second, the load of handling the padding is distributed among front-end base stations, which prevents the overhead from being imposed on the IMS media gateways. We use srsUE [7], srsLTE [8], Open IMS Core [6], and UCT IMS Client 1.0.14 [126] as the 4G UE, the 4G infrastructure, the IMS core with a VoLTE server, and the VoLTE client app, respectively. We implement SR@PDCP by modifying the PDCP layer at both the UE and the eNB to handle the necessary padding of the VoIMS packets. In particular, we pad all PDCP packets whose payload lengths are smaller than 20 bytes and increase their size to 42 bytes, a size that can be observed in many non-VoIMS applications. The inserted padding is removed at the PDCP layer of both the UE and the eNB before the corresponding packets are forwarded to the upper layer or the next network element (e.g., the 4G gateway and 5G UPF).
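The add-and-remove behavior of the rectifier can be sketched in Python as follows. The 20-byte threshold and the 42-byte target size come from the description above; everything else is an assumption made for illustration, because the text does not specify how the receiving side locates the padding. Here a hypothetical "padded" flag travels with the packet, and the last byte of a padded payload stores the original length. The function names are also hypothetical and do not correspond to srsUE or srsLTE code.

```python
from typing import Tuple

MIN_PAYLOAD_LEN = 20  # payloads shorter than this are padded (from the text)
PADDED_LEN = 42       # size of a padded payload (from the text)


def add_padding(payload: bytes) -> Tuple[bool, bytes]:
    """Sender-side PDCP step: pad short payloads up to PADDED_LEN bytes.

    Assumption: the returned flag stands in for whatever marker the real
    PDCP layer would carry, and the final byte records the original length.
    """
    if len(payload) >= MIN_PAYLOAD_LEN:
        return False, payload
    filler = bytes(PADDED_LEN - len(payload) - 1)  # zero bytes
    return True, payload + filler + bytes([len(payload)])


def remove_padding(padded: bool, data: bytes) -> bytes:
    """Receiver-side PDCP step: restore the original payload before forwarding."""
    if not padded:
        return data
    original_len = data[-1]
    return data[:original_len]


# An 11-byte comfort noise payload becomes a 42-byte packet over the air.
flag, wire = add_padding(bytes(11))
print(len(wire), remove_padding(flag, wire) == bytes(11))
```

Because the padding is stripped by the peer PDCP entity, the layers above PDCP and the network elements behind the base station see the original payload, which is why the padding is neither charged nor handled by the IMS media gateways.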
We evaluate its effectiveness and overhead. We re-run the VoIMS call experiments, in which a user places ten VoIMS calls from the srsUE and the callee answers each incoming call immediately. Each call lasts for 25 seconds, with 10 s of talking and 15 s of listening. The results show that VoIMSAnalyzer fails to detect any of the calls. Notably, the 4G core network does not receive any VoIMS packets with additional padding, so no additional charges are incurred. We evaluate the solution overhead in terms of CPU usage, memory usage, and processing time. In practice, VoIMS clients do not keep sending out compressed CN packets during voice calls. Here, we assess the overhead in an extreme case where the VoIMS client keeps transmitting 11-byte comfort noise VoIMS packets to the OpenIMS server. Figure 13 shows that the average CPU and RAM usages increase slightly, by 0.72% and 0.02%, respectively, when the proposed solution is enabled. The average processing time at the PDCP layer increases by 1.46 μs per packet (from 15.24 μs to 16.70 μs). Thus, the solution is effective at an acceptable cost.
The domain of 5G/4G voice calls has seen significant advancements with the introduction of various optimization techniques. These techniques have been designed to enhance the quality and efficiency of voice calls. However, these enhancements do not come without a cost. We shed light on the phenomenon of side-channel call inference that arises from the interplay of these 5G/4G call optimization techniques, which, in turn, gives rise to potential threats against 5G/4G voice calls.
A key finding of our research is that adversaries can accurately infer confidential call information, including whether a call is in progress, who is engaged in the conversation, and when they communicate, all without decrypting the encrypted packets in transit over the air. When exploited, this call information enables both active and passive attacks against 5G/4G call users. It is essential to emphasize that these optimization techniques were not designed with the intention of compromising security for the sake of improving call quality. Instead, the security implications of these techniques are often subtle and result in unanticipated side effects. Addressing the new security challenges posed by evolving mobile network technologies is a formidable and ongoing task. It necessitates collaborative efforts from all stakeholders, including standardization bodies, network operators, equipment vendors, and mobile users.
CHAPTER 5 CONCLUSION
In this chapter, we summarize the key findings of our work, highlight the insights and lessons learned, and outline two promising directions for future research.
5.1 Summary of Results
The rapid evolution of mobile networks, from the early days of circuit-switched systems to today’s IP-based architectures, has transformed the way multimedia services are delivered. The IP Multimedia Subsystem (IMS) has emerged as a central framework enabling advanced services such as voice and video calling, SMS, and emergency communications over 4G and 5G networks. While IMS has greatly improved flexibility, scalability, and service performance, this dissertation demonstrates that it also introduces new and underexplored security challenges that threaten both user privacy and service reliability.
This dissertation presents a comprehensive and systematic security analysis of IMS across two major dimensions: mobile devices and network infrastructure.
Securing IP Multimedia Subsystem (IMS) on Mobile Devices. On the device side, we identify four critical security issues arising from the transition of IMS clients from hardware-based implementations to software-based applications. These weaknesses expose users to three novel attacks that enable denial of service, spoofed messaging, and covert media hijacking. Our evaluation, conducted across multiple devices and carriers in two countries, confirms that these threats are real and impactful.
Enhancing the Privacy of Voice Services over IMS Infrastructure. On the infrastructure side, we uncover a new class of side-channel vulnerabilities stemming from performance-optimizing techniques used in 5G/4G voice services. Despite strong encryption protocols, our findings show that adversaries can infer call states, speaker activity, and user identity simply by analyzing encrypted traffic patterns.
Through these investigations, this dissertation not only reveals previously undocumented vulnerabilities in IMS-based communication systems but also demonstrates how emerging technologies, though well-intentioned, can inadvertently compromise security. To that end, we propose a set of mitigation strategies that offer immediate remedies while also highlighting the necessity for longer-term collaboration among standards bodies, carriers, device manufacturers, and users.
5.2 Insights and Lessons
As we move further into the 6G era and beyond, it is critical to proactively reassess the broader security implications introduced by rapid technological evolution. Design changes, while necessary for innovation, often introduce unforeseen vulnerabilities. Therefore, security must not be treated as an afterthought but as a foundational element embedded throughout the entire innovation lifecycle.
As mobile communication systems become increasingly complex and interconnected, securing them is no longer a one-time effort but a continuous, evolving process. This dissertation offers both foundational insights and practical solutions to support that ongoing journey toward resilient and secure network architectures. We draw two key insights and lessons from our study.
Mobile device security lags behind the advancements in mobile network infrastructure. This gap leaves end users more vulnerable to emerging threats. To achieve proper end-to-end protection, future system designs must extend infrastructure-grade security to the mobile device side.
Privacy must be elevated as a design priority within network architectures such as IMS. We found that traditional designs focus on performance and on standard security mechanisms such as authentication and encryption, while overlooking the growing imperative of user privacy.
5.3 Future Work
Building upon the findings of this dissertation, there are two promising directions for future research.
Security of Next-Generation 911 (NG911) services on mobile devices. NG911 is an advanced, IP-based emergency communication system designed to replace legacy analog 911 infrastructure. It supports richer forms of communication (including voice, text, photos, and videos), enabling more dynamic interaction with emergency services. Notably, NG911 leverages an IP-based network similar to the IP Multimedia Subsystem (IMS) for routing traffic from mobile devices. This means that NG911 may share many of the same vulnerabilities identified in this dissertation. As the emergency communication ecosystem shifts toward NG911, it is critical to investigate how adversaries might exploit mobile phones as entry points to interfere with or hijack these life-saving services.
Future work will focus on identifying mobile-device-level threats specific to NG911, understanding their impact, and proposing defenses that maintain the accessibility and reliability expected of emergency communication systems.
Tackling privacy leakage in IMS-based robocalls involving interactive voice response (IVR). Robocalls (automated calls where users interact with computer-operated voice systems) are increasingly common in sectors like mobile banking and retail customer service. Organizations such as JPMorgan Chase, Wells Fargo, Walmart, and Best Buy widely employ IVR systems to streamline user interaction and operational efficiency. In these calls, the IVR system issues pre-recorded or computer-generated prompts, and users respond through voice commands or keypad inputs. Given the sensitive nature of many robocall scenarios, especially those tied to financial services, any potential compromise in call privacy can carry serious consequences. This future work will continue to examine the security of IMS services within radio access networks, with a focus on detecting robocalls and, when possible, identifying the user privacy leaked during the conversation.
BIBLIOGRAPHY
[1] 20 android statistics in 2023 (market share and users). https://www.demandsage.com/android-statistics/, 2023.
[2] Federal Phone Call Recording Law. https://www.justice.gov/archives/jm/criminal-resource-manual-1050-scope-18-usc-2511-prohibitions, 2020.
[3] srsran 4g with zmq virtual radios. https://docs.srsran.com/projects/4g/en/latest/app_notes/source/zeromq/source/index.html#zeromq-appnote, 2023.
[4] tc-fw(8) - Linux manual page. https://man7.org/linux/man-pages/man8/tc-fw.8.html.
[5] Internet Protocol, 1981. https://www.rfc-editor.org/info/rfc791.
[6] Openimscore. http://openimscore.sourceforge.net/, 2008.
[7] srsue. https://github.com/domi007/srsUE, 2017.
[8] srsENB. https://github.com/topics/srsenb, 2019.
[9] IMS Profile for Voice and SMS.
https://www.gsma.com/newsroom/wp-content/uploads/IR.92-v15.0-4.pdf, 2020.
[10] Peoplelooker. https://www.peoplelooker.com, 2020.
[11] Yahoo! data breaches. https://en.wikipedia.org/wiki/Yahoo!_data_breaches, 2020.
[12] Mini spy camera 1080p. https://www.amazon.com/Spy-Camera-Charger-Hidden-Surveillance/dp/B07GCKZKX8/, 2021.
[13] Pimeyes-ceo: The user is the stalker, not the search engine. https://netzpolitik.org/2022/pimeyes-ceo-the-user-is-the-stalker-not-the-search-engine/, 2022.
[14] srsran 20.10.1. https://github.com/srsran/srsRAN/releases/tag/release_20_10_1, 2022.
[15] Kingroot. https://kingrootapp.net/, Jan 2024.
[16] 3GPP. GSM 06.81: Digital cellular telecommunications system (Phase 2+); Discontinuous Transmission (DTX) for Enhanced Full Rate (EFR) speech traffic channels, 1999.
[17] 3GPP. TS 22.228: Service requirements for the Internet Protocol (IP) multimedia core network subsystem (IMS); Stage 1, 2000.
[18] 3GPP. TS 23.101: General UMTS Architecture, 2000.
[19] 3GPP. TS 23.125: Overall high level functionality and architecture impacts of flow based charging; Stage 2 (Release 7), Jun. 2007. https://portal.3gpp.org/desktopmodules/Specifications/SpecificationDetails.aspx?specificationId=790.
[20] 3GPP. TS 33.328: IP Multimedia Subsystem (IMS) media plane security, Nov. 2018. https://portal.3gpp.org/desktopmodules/Specifications/SpecificationDetails.aspx?specificationId=2295.
[21] 3GPP. TS 26.449: Codec for Enhanced Voice Services (EVS); Comfort Noise Generation (CNG) aspects, March 2019. V15.1.0.
[22] 3GPP. TS 24.011: Point-to-Point (PP) Short Message Service (SMS) support on mobile radio interface, Nov. 2019. https://www.etsi.org/deliver/etsi_ts/124000_124099/124011/15.03.00_60/ts_124011v150300p.pdf.
[23] 3GPP. TS 23.228: IP Multimedia Subsystem (IMS); Stage 2; V16.5, 2020.
[24] 3GPP. TS 26.071: Mandatory speech CODEC speech processing functions; AMR speech Codec; General description, July 2020. V16.0.0.
[25] 3GPP.
TS 26.092: Mandatory speech codec speech processing functions; Adaptive Multi-Rate (AMR) speech codec; Comfort noise aspects, July 2020. V16.0.0.
[26] 3GPP. TS 26.171: Speech codec speech processing functions; Adaptive Multi-Rate - Wideband (AMR-WB) speech codec; General description (Release 16), 2020.
[27] 3GPP. TS 26.192: Speech codec speech processing functions; Adaptive Multi-Rate - Wideband (AMR-WB) speech codec; Comfort noise aspects, July 2020. V16.0.0.
[28] 3GPP. TS 26.441: Codec for Enhanced Voice Services (EVS); General Overview, 2020. (V16.0.0).
[29] 3GPP. TS 36.213: Evolved Universal Terrestrial Radio Access (E-UTRA); Physical layer procedures, 2020.
[30] 3GPP. TS 36.300: Technical Specification Group Radio Access Network; Evolved Universal Terrestrial Radio Access (E-UTRA) and Evolved Universal Terrestrial Radio Access Network (E-UTRAN); Overall description; Stage 2, 2020.
[31] 3GPP. TS 36.306: Evolved Universal Terrestrial Radio Access (E-UTRA); User Equipment (UE) radio access capabilities (Release 13), 2020.
[32] 3GPP. TS 36.331: Technical Specification Group Radio Access Network; Evolved Universal Terrestrial Radio Access (E-UTRA); Radio Resource Control (RRC); Protocol specification, 2020.
[33] 3GPP. TS 37.324: LTE; 5G; Evolved Universal Terrestrial Radio Access (E-UTRA) and NR; Service Data Adaptation Protocol (SDAP) specification, 2020.
[34] 3GPP. TS 23.203: Policy and charging control architecture, Mar. 2021. https://portal.3gpp.org/desktopmodules/Specifications/SpecificationDetails.aspx?specificationId=810.
[35] 3GPP. TS 24.341: Support of SMS over IP networks; Stage 3, 2021.
[36] 3GPP. TS 38.300: Technical Specification Group Radio Access Network; NR; NR and NG-RAN Overall Description; Stage 2 (Release 16), 2021.
[37] 3GPP. TS 38.306: NR; User Equipment (UE) radio access capabilities (Release 16), 2021.
[38] 3GPP. TS 23.205: Bearer-independent circuit-switched core network, 2022.
[39] 3GPP.
TS 26.139: Real-time Transport Protocol (RTP) / RTP Control Protocol (RTCP) verification procedures (Release 17), Apr. 2022. https://portal.3gpp.org/desktopmodules/Specifications/SpecificationDetails.aspx?specificationId=3709.
[40] 3GPP. TS 33.102: 3G security; Security architecture, March 2022. V17.0.0.
[41] 3GPP. TS 33.203: 3G security; Access security for IP-based services (Release 17), Mar. 2022. https://portal.3gpp.org/desktopmodules/Specifications/SpecificationDetails.aspx?specificationId=1055.
[42] 3GPP. TS 33.210: Network Domain Security (NDS); IP network layer security, Sep. 2022. V17.1.0.
[43] 3GPP. TS 33.401: 3GPP System Architecture Evolution (SAE); Security architecture (Release 17), Sep. 2022. https://portal.3gpp.org/desktopmodules/Specifications/SpecificationDetails.aspx?specificationId=2296.
[44] 3GPP. TS 33.501: Security architecture and procedures for 5G System (Release 18), Mar. 2022. https://portal.3gpp.org/desktopmodules/Specifications/SpecificationDetails.aspx?specificationId=3169.
[45] 3GPP. TS 48.016: General Packet Radio Service (GPRS), 2022.
[46] 3GPP. TS 36.323: Evolved Universal Terrestrial Radio Access (E-UTRA); Packet Data Convergence Protocol (PDCP) specification, Mar. 2023. https://portal.3gpp.org/desktopmodules/Specifications/SpecificationDetails.aspx?specificationId=2439.
[47] 3GPP. TS 38.323: NR; Packet Data Convergence Protocol (PDCP) specification, Mar. 2023. https://portal.3gpp.org/desktopmodules/Specifications/SpecificationDetails.aspx?specificationId=3196.
[48] 3GPP. 3GPP Portal. https://portal.3gpp.org/#/55934-releases, 2023.
[49] 3GPP. TS 24.008: Mobile radio interface Layer 3 specification; Core network protocols; Stage 3 (Release 18), Apr. 2023. https://portal.3gpp.org/desktopmodules/Specifications/SpecificationDetails.aspx?specificationId=1015.
[50] 3GPP.
TS 24.229: IP multimedia call control protocol based on Session Initiation Protocol (SIP) and Session Description Protocol (SDP); Stage 3 (Release 18), Apr. 2023. https://portal.3gpp.org/desktopmodules/Specifications/SpecificationDetails.aspx?specificationId=1055.
[51] 3GPP2. 3GPP2 C.S0015-A: Short Message Service (SMS) for Wideband Spread Spectrum Systems Release A, Sep. 2004. https://www.3gpp2.org/Public_html/Specs/C.S0015-A_v2.0_051006.pdf.
[52] Youness Arjoune and Saleh Faruque. Smart jamming attacks in 5g new radio: A review. In 2020 10th Annual Computing and Communication Workshop and Conference (CCWC), pages 1010–1015, 2020.
[53] Jaejong Baek, Sukwha Kyung, Haehyun Cho, Ziming Zhao, Yan Shoshitaishvili, Adam Doupé, and Gail-Joon Ahn. Wi not calling: Practical privacy and availability attacks in wi-fi calling. In Proceedings of the 34th Annual Computer Security Applications Conference, pages 278–288, 2018.
[54] Evangelos Bitsikas and Christina Pöpper. You have been warned: Abusing 5g’s warning and emergency systems. In Proceedings of the 38th Annual Computer Security Applications Conference, pages 561–575, 2022.
[55] Fabio Cecchinato, Lorenzo Vangelista, Giulio Biondo, and Mauro Franchin. Anomaly detection using lstm neural networks: an application to voip traffic. In 2021 IEEE International Conference on Recent Advances in Systems Science and Engineering (RASSE), pages 1–7, 2021.
[56] Xiaolin Chen, Xuemeng Song, Guozhen Peng, Shanshan Feng, and Liqiang Nie. Adversarial-enhanced hybrid graph network for user identity linkage. In ACM SIGIR’21.
[57] Hyungjin Cho, Seongmin Park, Youngkwon Park, Bomin Choi, Dowon Kim, and Kangbin Yim. Analysis against security issues of voice over 5g. IEICE TRANSACTIONS on Information and Systems, 104(11):1850–1856, 2021.
[58] Zhiwei Cui, Baojiang Cui, Junsong Fu, and Renhai Dong. Security threats to voice services in 5g standalone networks. Security and Communication Networks, 2022, 2022.
[59] Haotian Deng, Weicheng Wang, and Chunyi Peng. Ceive: Combating caller id spoofing on 4g mobile phones via callee-only inference and verification. In Proceedings of the 24th Annual International Conference on Mobile Computing and Networking, pages 369–384, 2018.
[60] Ericsson. Voice and communications services trends and outlook, 2023.
[61] Simon Erni, Martin Kotuliak, Patrick Leu, Marc Roeschlin, and Srdjan Capkun. Adaptover: adaptive overshadowing attacks in cellular networks. In Proceedings of the 28th Annual International Conference on Mobile Computing And Networking, pages 743–755, 2022.
[62] Barbara Furletti, Lorenzo Gabrielli, Chiara Renso, and Salvatore Rinzivillo. Identifying users profiles from mobile calls habits. In ACM SIGKDD’12.
[63] Julia A Goldberg. Interrupting the discourse on interruptions: An analysis in terms of relationally neutral, power-and rapport-oriented acts. Journal of Pragmatics, 1990.
[64] Google. Android security paper 2023. https://blog.google/products/android-enterprise/android-security-paper-2023/, Jan 2023.
[65] GSMA. Volte market update. https://www.gsma.com/services/wp-content/uploads/2022/04/GSMAi-VoLTE-Market-Update-final.pdf.
[66] GSMA. Ims profile for voice and sms. version 13.0. https://www.gsma.com/newsroom/wp-content/uploads//IR.92-v13.0-2-1.pdf, 2019.
[67] GSMA. RCS Universal Profile Service Definition Document, Oct. 2019. https://www.gsma.com/futurenetworks/wp-content/uploads/2019/10/RCC.71-v2.4.pdf.
[68] GSMA. Rich Communication Suite - Advanced Communications Services and Client Specification, Oct. 2019. https://www.gsma.com/solutions-and-impact/technologies/networks/wp-content/uploads/2019/10/RCC.07-v11.0.pdf.
[69] GSMA. Ims profile for voice and sms version 15.0. https://www.gsma.com/newsroom/wp-content/uploads//IR.92-v15.0-4.pdf, 2020.
[70] Mohammad Haghighat and Mohamed Abdel-Mottaleb. Low resolution face recognition in surveillance systems using discriminant correlation analysis.
In 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), pages 912–917. IEEE, 2017.
[71] Kaiming He, Georgia Gkioxari, Piotr Dollár, et al. Mask r-cnn. In IEEE ICCV’17.
[72] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE CVPR’16.
[73] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[74] Peiyun Hu and Deva Ramanan. Finding tiny faces. In IEEE CVPR’17.
[75] Yiwen Hu, Min-Yue Chen, et al. Uncovering insecure designs of cellular emergency services (911). In ACM Mobicom’22.
[76] Yiwen Hu, Min-Yue Chen, Guan-Hua Tu, Chi-Yu Li, Sihan Wang, Jingwen Shi, Tian Xie, Li Xiao, Chunyi Peng, Zhaowei Tan, and Songwu Lu. Uncovering insecure designs of cellular emergency services (911). In Proceedings of the 28th Annual International Conference on Mobile Computing And Networking, MobiCom ’22, page 703–715, New York, NY, USA, 2022. Association for Computing Machinery.
[77] Syed Rafiul Hussain et al. Privacy attacks to the 4g and 5g cellular paging protocols using side channel information. In NDSS’19.
[78] IETF. RObust Header Compression (ROHC): Framework and four profiles: RTP, UDP, ESP, and uncompressed. https://datatracker.ietf.org/doc/html/rfc3095, 2001.
[79] IETF. Real-time Transport Protocol (RTP) Payload for Comfort Noise (CN). https://datatracker.ietf.org/doc/html/rfc3389, 2002.
[80] IETF. Early Media and Ringing Tone Generation in the Session Initiation Protocol (SIP). https://www.rfc-editor.org/rfc/rfc3960, 2005.
[81] Yunhan Jack Jia, Qi Alfred Chen, Zhuoqing Morley Mao, Jie Hui, Kranthi Sontinei, Alex Yoon, Samson Kwong, and Kevin Lau. Performance characterization and call reliability diagnosis support for voice over lte. In ACM MobiCom’15.
[82] Dmitri Kamenetsky, Sau Yee Yiu, and Martyn Hole. Image enhancement for face recognition in adverse environments. In IEEE DICTA’16.
[83] Mohsin Khan, Philip Ginzboorg, Kimmo Järvinen, and Valtteri Niemi. Defeating the downgrade attack on identity privacy in 5g. In SSR’18.
[84] Hongil Kim, Dongkwan Kim, Minhee Kwon, Hyungseok Han, Yeongjin Jang, Dongsu Han, Taesoo Kim, and Yongdae Kim. Breaking and fixing volte: Exploiting hidden data channels and mis-implementations. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pages 328–339, 2015.
[85] Hongil Kim, Dongkwan Kim, Minhee Kwon, Hyungseok Han, Yeongjin Jang, Dongsu Han, Taesoo Kim, and Yongdae Kim. Breaking and fixing volte: Exploiting hidden data channels and mis-implementations. In Conference on Computer and Communications Security (CCS), pages 328–339, 2015.
[86] Martin Kotuliak, Simon Erni, Patrick Leu, Marc Roeschlin, and Srdjan Capkun. LTrack: Stealthy Tracking of Mobile Phones in LTE. In USENIX Security’22.
[87] Joseph B Kruskal. An overview of sequence comparison: Time warps, string edits, and macromolecules. SIAM review, 1983.
[88] Swarun Kumar, Ezzeldin Hamed, Dina Katabi, and Li Erran Li. Lte radio analytics made easy and accessible. In ACM SIGCOMM’14.
[89] Swarun Kumar, Ezzeldin Hamed, Dina Katabi, and Li Erran Li. Lte radio analytics made easy and accessible. ACM SIGCOMM’14.
[90] Gyuhong Lee, Jihoon Lee, Jinsung Lee, Youngbin Im, Max Hollingsworth, Eric Wustrow, Dirk Grunwald, and Sangtae Ha. This is your president speaking: Spoofing alerts in 4g lte networks. In International Conference on Mobile Systems, Applications, and Services (MobiSys), pages 404–416, 2019.
[91] Chi-Yu Li, Guan-Hua Tu, Chunyi Peng, Zengwen Yuan, Yuanjie Li, Songwu Lu, et al. Insecurity of voice solution volte in lte mobile networks. In ACM CCS’15.
[92] Chi-Yu Li, Guan-Hua Tu, Chunyi Peng, Zengwen Yuan, Yuanjie Li, Songwu Lu, and Xinbing Wang.
Insecurity of voice solution volte in lte mobile networks. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, CCS ’15, pages 316–327, New York, NY, USA, 2015. ACM.
[93] Jian Li, Yabiao Wang, Changan Wang, Ying Tai, Jianjun Qian, Jian Yang, Chengjie Wang, Jilin Li, and Feiyue Huang. Dsfd: dual shot face detector. In IEEE CVPR’19.
[94] Pei Li, Loreto Prieto, Domingo Mery, and Patrick J Flynn. On low-resolution face recognition in the wild: Comparisons and new techniques. IEEE Transactions on Information Forensics and Security, 14(8):2000–2012, 2019.
[95] Zhihang Li, Xu Tang, Junyu Han, Jingtuo Liu, et al. Pyramidbox++: High performance detector for finding tiny face. arXiv preprint arXiv:1904.00386, 2019.
[96] Shengcai Liao, Anil K Jain, and Stan Z Li. Partial face recognition: Alignment-free approach. IEEE Transactions on pattern analysis and machine intelligence (TPAMI), 35(5):1193–1205, 2012.
[97] Siyuan Liu, Shuhui Wang, Feida Zhu, et al. Hydra: Large-scale social identity linkage via heterogeneous behavior modeling. In ACM SIGMOD’14.
[98] Yu-Han Lu, Chi-Yu Li, Yao-Yu Li, Hsiao, et al. Ghost calls from operational 4g call systems: Ims vulnerability, call dos attack, and countermeasure. In ACM MobiCom’20.
[99] Yu-Han Lu, Chi-Yu Li, Yao-Yu Li, Sandy Hsin-Yu Hsiao, Tian Xie, Guan-Hua Tu, and Wei-Xun Chen. Ghost calls from operational 4g call systems: Ims vulnerability, call dos attack, and countermeasure. In Proceedings of the 26th Annual International Conference on Mobile Computing and Networking, MobiCom ’20, New York, NY, USA, 2020. Association for Computing Machinery.
[100] Feifan Lv, Bo Liu, and Feng Lu. Fast enhancement for non-uniform illumination images using light-weight cnns. In ACM Multimedia’20.
[101] Jamila Manan, Atiq Ahmed, Ihsan Ullah, Leïla Merghem-Boulahia, and Dominique Gaïti. Distributed intrusion detection scheme for next generation networks.
Journal of Network and Computer Applications, 147:102422, 2019.
[102] MobileInsight. Mobileinsight. http://www.mobileinsight.net/, 2021.
[103] Bjarke Mønsted, Anders Mollgaard, et al. Phone-based metric as a predictor for basic personality traits. Elsevier Journal of Research in Personality.
[104] NYtimes. The humble phone call has made a comeback. https://www.nytimes.com/2020/04/09/technology/phone-calls-voice-virus.html, 2020.
[105] Seongmin Park, HyungJin Cho, Youngkwon Park, Bomin Choi, Dowon Kim, and Kangbin Yim. Security problems of 5g voice communication. In Information Security Applications: 21st International Conference, WISA 2020, Jeju Island, South Korea, August 26–28, 2020, Revised Selected Papers 21, pages 403–415. Springer, 2020.
[106] Sancheng Peng, Shui Yu, and Aimin Yang. Smartphone malware and its propagation modeling: A survey. IEEE Communications Surveys & Tutorials, 16(2):925–941, 2013.
[107] Abhijith Punnappurath, Ambasamudram Narayanan Rajagopalan, Sima Taheri, Rama Chellappa, and Guna Seetharaman. Face recognition across non-uniform motion blur, illumination, and pose. IEEE Transactions on image processing, 24(7):2067–2082, 2015.
[108] Qualcomm. Qxdm professional tool. https://www.qualcomm.com/media/documents/files/qxdm-professional-qualcomm-extensible-diagnostic-monitor.pdf, 2020.
[109] Grand View Research. Voice over lte market: Global industry trends, share, size, growth, opportunity and forecast 2023-2028. https://www.researchandmarkets.com/reports/5732780/voice-over-lte-market-global-industry-trends, Jan 2023.
[110] Juniper Research. Video calling demand booms during pandemic. https://pipelinepub.com/news/12307, Jan 2024.
[111] KBV Research. Global mobile voice market size, share & industry trends analysis report by transmission, by end user, by regional outlook and forecast, 2022 - 2028. https://www.reportlinker.com/p06364618/, Oct 2022.
[112] RFC. Dynamic Host Configuration Protocol for IPv6 (DHCPv6), 2018.
https://datatracker.ietf.org/doc/html/rfc8415.
[113] David Rupprecht, Katharina Kohls, Thorsten Holz, and Christina Pöpper. Call me maybe: Eavesdropping encrypted lte calls with revolte. In USENIX Security’20.
[114] David Rupprecht, Katharina Kohls, Thorsten Holz, and Christina Pöpper. Call me maybe: Eavesdropping encrypted LTE calls with ReVoLTE. In 29th USENIX security symposium (USENIX security 20), pages 73–88, 2020.
[115] Sachinsdate. Speaker detection by watching lip movements. https://github.com/sachinsdate/lip-movement-net, 2016.
[116] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In IEEE CVPR’15.
[117] Clemens Stachl, Quay Au, Ramona Schoedel, et al. Predicting personality from patterns of behavior collected with smartphones. PNAS’20.
[118] Statista. Mobile voice - worldwide. https://www.statista.com/outlook/tmo/communication-services/mobile-voice/worldwide, 2023.
[119] Jessica E Steele, Carla Pezzulo, Maximilian Albert, Christopher J Brooks, Elisabeth zu Erbach-Schoenberg, Siobhán B O’Connor, Pål R Sundsøy, Kenth Engø-Monsen, Kristine Nilsen, Bonita Graupe, et al. Mobility and phone call behavior explain patterns in poverty at high-resolution across multiple settings. Humanities and Social Sciences Communications, 8(1):1–12, 2021.
[120] Qibo Sun, Shangguang Wang, Ning Lu, Kok-Seng Wong, and Myung Ho Kim. Sfads: A sip flooding attack detection scheme with the internal and external detection features in ims networks. Journal of Internet Technology, 17(7):1327–1338, 2016.
[121] Sarah Tabassum, Cori Faklaris, and Heather Richter Lipford. What drives SMiShing susceptibility? a U.S. interview study of how and why mobile phone users judge text messages to be real or fake. In Twentieth Symposium on Usable Privacy and Security (SOUPS 2024), pages 393–411, 2024.
[122] Deborah Tannen. Turn-taking and intercultural discourse and communication.
The handbook of intercultural discourse and communication, pages 135–157, 2012.
[123] Guan-Hua Tu, Chi-Yu Li, Chunyi Peng, Yuanjie Li, and Songwu Lu. New security threats caused by ims-based sms service in 4g lte networks. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS ’16, pages 1118–1130, New York, NY, USA, 2016. ACM.
[124] G Vennila, MSK Manikandan, and MN Suresh. Detection and prevention of spam over internet telephony in voice over internet protocol networks using markov chain with incremental svm. International Journal of Communication Systems, 30(11):e3255, 2017.
[125] Ganesan Vennila, MSK Manikandan, and MN Suresh. Dynamic voice spammers detection using hidden markov model for voice over internet protocol network. Computers & Security, 2018.
[126] David Waiting et al. the uct ims client. In IEEE TRIDENTCOM.
[127] Sihan Wang, Guan-Hua Tu, Xinyu Lei, Tian Xie, Chi-Yu Li, Po-Yi Chou, Fucheng Hsieh, Yiwen Hu, Li Xiao, and Chunyi Peng. Insecurity of Operational Cellular IoT Service: New Vulnerabilities, Attacks, and Countermeasures, pages 437–450. Association for Computing Machinery, New York, NY, USA, 2021.
[128] Xintao Wang, Kelvin CK Chan, Ke Yu, Chao Dong, et al. Edvr: Video restoration with enhanced deformable convolutional networks. In IEEE CVPR’19.
[129] Wikipedia. STIR/SHAKEN. https://en.wikipedia.org/wiki/STIR/SHAKEN.
[130] Wikipedia. 3GPP. https://en.wikipedia.org/wiki/3GPP, 2023.
[131] Wikipedia. 3GPP. https://en.wikipedia.org/wiki/3GPP, 2023.
[132] Wikipedia. Darwin (operating system). https://en.wikipedia.org/wiki/Darwin_(operating_system), Feb 2024.
[133] Wikipedia. ios. https://www.apple.com/iphone-15/specs/, Jan 2024.
[134] Wikipedia. Qualcomm msm interface. https://en.wikipedia.org/wiki/Qualcomm_MSM_Interface, Jan 2024.
[135] Wikipedia. Steganography. https://en.wikipedia.org/wiki/Steganography, Jan 2024.
[136] John Wu. Magisk. https://github.com/topjohnwu/Magisk, Jan 2024.
[137] Danny Wyatt, Tanzeem Choudhury, Jeff Bilmes, and James A Kitts. Inferring colocation and conversation networks from privacy-sensitive audio with implications for computational social science. ACM TIST’11.
[138] T. Xie, G. Tu, C. Li, C. Peng, J. Li, and M. Zhang. The dark side of operational wi-fi calling services. In 2018 IEEE Conference on Communications and Network Security (CNS), pages 1–1, May 2018.
[139] Tian Xie, Guan-Hua Tu, Bangjie Yin, et al. The untold secrets of wifi-calling services: Vulnerabilities, attacks, and countermeasures. IEEE TMC’22.
[140] Tian Xie, Sihan Wang, Xinyu Lei, Jingwen Shi, Guan-Hua Tu, and Chi-Yu Li. Mpkix: Towards more accountable and secure internet application services via mobile networked systems. IEEE Transactions on Mobile Computing, pages 1–1, 2022.
[141] Hojoon Yang, Sangwook Bae, Mincheol Son, Hongil Kim, et al. Hiding in plain signal: Physical signal overshadowing attack on LTE. In USENIX Security’19.