Abstract
Screaming is the most primitive and natural signal for requesting help, yet no effective system exists to make use of it. In this paper we present a deep learning-based scream classification system for automated emergency response. The system distinguishes screams from casually loud sounds and further classifies screams as alarming or non-alarming, using an ensemble of CNN, CRNN, and YAMNet-based models trained and evaluated on the ESC50, FSD50K, and RAVDESS datasets. All audio processing is designed to run on-device to protect privacy, with particular emphasis on achieving a low false positive rate for practical deployment.
1. INTRODUCTION
Seeking help in emergency situations is a critical matter directly linked to survival. An ideal help request should be (1) directed to a trustworthy recipient, (2) immediate, and (3) delivered in an appropriate manner. Generally, the most representative method that satisfies these conditions is directly contacting public authorities through emergency hotlines such as 112 or 119. However, in actual crime situations or violent incidents, victims frequently face difficulties in making direct emergency calls.
A dating violence incident that occurred in the early morning of September 14, 2024, at a motel in downtown Seoul clearly demonstrated this problem. The victim continuously screamed for approximately 40 minutes, yet despite being in an area with high foot traffic near a police station, not a single person reported the incident. The victim was unable to call for help directly as the perpetrator had blocked access to their phone, and the situation was only resolved when police were dispatched following a report from a nearby resident. This case provides two important implications. First, situations do occur where victims cannot access direct means of reporting. Second, despite screaming being the most primitive and natural signal for requesting help, there exists no effective system to utilize it.
1-1. Current Solutions and Limitations
Emergency response applications currently used domestically and internationally primarily adopt button-based reporting methods. These systems are structured so that when users press the power button or volume button multiple times consecutively, notifications are sent to designated contacts or police. While this approach has the advantage of low false alarm rates due to its reliance on clear user intent, it has a fundamental limitation in that it does not work at all when victims are physically unable to press buttons during crisis situations.
A prior example of scream detection was the “Chilla” application developed in 2015 by Indian developer Kishlay Raj. This application detected screams and sent emergency signals via SMS, but used a simple detection method based on frequency and volume rather than deep learning models, and the service has since been discontinued. According to an interview with the developer, technical barriers such as Android OS’s strengthened background app restrictions and some manufacturers’ battery-saving features that interfered with app execution were major obstacles to service continuation.
Apple’s Sound Recognition feature, introduced in iOS 14, is an accessibility function that detects surrounding sounds and provides notifications to users. This feature ensures privacy as all processing occurs within the device, and can detect various sounds (fire alarms, dog barking, shouting, etc.). However, according to user reviews, detection accuracy is inconsistent and battery consumption is high. Most importantly, Apple does not provide this feature as an API, limiting its use in third-party applications.
1-2. Research Challenges
The most fundamental challenge in developing a scream-based emergency detection system is the classifiability of screams. Screams occur in various emotional states, including not only fear, anger, and pain but also joy, excitement, and surprise. According to research by Arnal et al. (2020), human screams can convey at least six different emotions and possess acoustically distinguishable characteristics. However, interestingly, it has been experimentally confirmed that ordinary people have difficulty auditorily distinguishing between screams of fear and screams of joy.
This has important implications for machine learning model development. First, the fact that screams are acoustically distinguishable suggests that classification through deep learning models is theoretically possible. Second, the fact that humans cannot intuitively distinguish them means that a systematic methodology is needed in the labeling process of training data. Since it is difficult to accurately classify screams into ‘alarming’ and ‘non-alarming’ categories by simple listening alone, a labeling strategy that considers both vocalization context and situational information is required.
Another significant challenge is managing false positives. In emergency reporting systems, excessive false positives lead to decreased reliability and can reduce response efficiency in actual emergency situations. The case of excessive disaster message notifications during the COVID-19 pandemic causing alarm fatigue among the public illustrates this problem well. Therefore, a scream detection system faces the challenge of simultaneously achieving high accuracy and low false positive rates.
Several constraints also exist in terms of technical implementation. First, to protect privacy, all audio processing must be completed within the device, with no transmission to or storage on external servers. Second, the system must operate 24/7 in the background on mobile devices while minimizing battery consumption. Third, it must operate robustly even in various environmental noise conditions.
1-3. Research Objectives
The objective of this study is to develop a deep learning-based scream classification system to address the problems presented above. The specific research objectives are as follows:
- Establishing a scream classification framework: Construct a hierarchical classification system that distinguishes screams from general loud sounds (casually loud sounds), and further classifies screams into alarming screams (danger signals) and non-alarming screams (a minimal inference sketch follows this list).
- Building a training dataset: Construct a labeled dataset consisting of at least 1,000 samples per category utilizing publicly available audio datasets such as Google AudioSet.
- Developing an efficient model: Design a lightweight deep learning model capable of real-time operation on mobile devices, simultaneously achieving high accuracy and low false positive rates.
- Validating practicality: Verify deployment feasibility through performance testing under various environmental conditions.
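As referenced in the first objective, here is a minimal sketch of how such a two-stage (hierarchical) decision could be wired up at inference time. The `stage1` and `stage2` models stand in for hypothetical Keras-style classifiers; the label indices and the 0.5 thresholds are illustrative assumptions, not values from the study.

```python
# Hypothetical two-stage inference: stage1 separates screams from casually
# loud sounds; stage2 separates alarming from non-alarming screams.
SCREAM, ALARMING = 1, 1  # hypothetical label indices for the positive classes

def classify(features, stage1, stage2, threshold=0.5):
    """Return 'alarming', 'non-alarming', or 'not a scream' for one clip."""
    p_scream = stage1.predict(features, verbose=0)[0][SCREAM]
    if p_scream < threshold:
        return "not a scream"
    p_alarm = stage2.predict(features, verbose=0)[0][ALARMING]
    return "alarming" if p_alarm >= threshold else "non-alarming"
```

Cascading the two decisions keeps the always-on first stage small, and lets the false-positive-sensitive alarming/non-alarming decision run only on clips that already look like screams.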
This research aims beyond mere technical performance improvement to construct a practical system that can automatically request help even when actual crime victims are unable to report directly. Through this, we expect to provide an additional safety net, particularly for foreigners and people in geographically unfamiliar environments.
2. BACKGROUND
There is considerable context behind this project. In this section we provide a brief overview of its background and goals.
- Objective: Build an AI model to detect and classify screams in real-time for automated emergency response in public spaces.
- Motivation: Inspired by a 2023 incident where I reported a distress call at 3 a.m., saving a life, and reinforced by the need for faster emergency responses in urban settings.
- Applications: Smartphones, IoT devices (e.g., smart speakers), urban surveillance systems.
- Vision: Open-source the model to enable global collaboration and deployment for public safety.
3. SYSTEM OVERVIEW
3-1. Model Architecture
The proposed system employs an ensemble of multiple models, including a Convolutional Neural Network (CNN), a Convolutional Recurrent Neural Network (CRNN), and YAMNet, a pretrained audio event classifier built on the MobileNetV1 architecture, for the audio classification task. CNNs were selected due to their demonstrated efficacy in capturing temporal and spectral patterns in audio signals. For acoustic feature extraction, Mel-Frequency Cepstral Coefficients (MFCCs) were utilized as the primary feature representation. MFCCs effectively capture the perceptual characteristics of human auditory perception, making them particularly suitable for distinguishing between distress vocalizations and ambient sounds.
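As a concrete illustration of the MFCC front end, here is a minimal sketch using Librosa (the library named in the implementation details below). The file path, 16 kHz sample rate, 40 coefficients, and 3-second clip length are illustrative assumptions, not the study's exact parameters.

```python
# Minimal MFCC feature extraction sketch with Librosa.
import librosa
import numpy as np

def extract_mfcc(path, sr=16000, n_mfcc=40, duration=3.0):
    """Load a clip, pad/trim to a fixed duration, and return a normalized MFCC matrix."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    y = librosa.util.fix_length(y, size=int(sr * duration))  # uniform input length
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # Per-coefficient normalization stabilizes training across recording conditions.
    return (mfcc - mfcc.mean(axis=1, keepdims=True)) / (mfcc.std(axis=1, keepdims=True) + 1e-8)

features = extract_mfcc("scream_001.wav")  # shape: (40, ~94) for 3 s at 16 kHz
```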
The selection of a Convolutional Recurrent Neural Network (CRNN) for this scream detection task was motivated by its proven effectiveness in capturing spatiotemporal dependencies in sequential data, as demonstrated in various domains including traffic forecasting and audio event detection [1].
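To make the CRNN idea concrete, here is a minimal TensorFlow/Keras sketch: a convolutional front end over the MFCC time-frequency map followed by a recurrent layer over time. The layer sizes, the (40, 94, 1) input shape, and the three-class output are illustrative assumptions, not the exact architecture used in the study.

```python
# A minimal CRNN sketch: CNN for local time-frequency patterns, GRU for temporal context.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_crnn(input_shape=(40, 94, 1), n_classes=3):
    inp = layers.Input(shape=input_shape)          # (mfcc_coeffs, frames, 1)
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(inp)
    x = layers.MaxPooling2D(2)(x)                  # -> (20, 47, 32)
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(2)(x)                  # -> (10, 23, 64)
    x = layers.Permute((2, 1, 3))(x)               # put the time axis first: (23, 10, 64)
    x = layers.Reshape((23, 10 * 64))(x)           # one feature vector per time step
    x = layers.Bidirectional(layers.GRU(64))(x)    # summarize the temporal sequence
    out = layers.Dense(n_classes, activation="softmax")(x)
    model = models.Model(inp, out)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```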
Model performance was assessed using a comprehensive set of classification metrics, including accuracy, precision, recall, and F1-score. Particular emphasis was placed on minimizing false positive rates, as the application context necessitates reliable detection to avoid unnecessary alarm triggers that could undermine system credibility and user trust.
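A hedged sketch of this evaluation step, using scikit-learn; `y_true` and `y_pred` are placeholders for held-out labels and model predictions, and the scream label index is an assumption. Note the separate "scream vs. everything else" view used to inspect the false positive rate the paragraph above emphasizes.

```python
# Classification metrics with an explicit false-positive-rate readout.
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix

SCREAM = 1  # hypothetical label index for the positive (scream) class

def report(y_true, y_pred):
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    # Collapse to "scream vs. everything else" to measure the false positive rate.
    t = np.asarray(y_true) == SCREAM
    p = np.asarray(y_pred) == SCREAM
    tn, fp, fn, tp = confusion_matrix(t, p, labels=[False, True]).ravel()
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"accuracy": acc, "macro_precision": prec, "macro_recall": rec,
            "macro_f1": f1, "scream_fpr": fpr}
```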
A custom dataset comprising 1,000 audio samples was compiled for model training and evaluation. The dataset composition was as follows: 500 scream samples representing the target class, 300 background noise samples encompassing various environmental sounds, and 200 casual conversation samples to represent typical non-emergency vocalizations. This distribution was designed to provide sufficient representation of both positive and negative classes while addressing the class imbalance inherent in emergency detection scenarios.
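One common way to offset the 500/300/200 imbalance described above is inverse-frequency class weighting at training time; the sketch below shows this with scikit-learn and Keras. The label indices are illustrative assumptions, and this is one option among several (oversampling or focal loss would be alternatives).

```python
# Inverse-frequency class weights for the 500/300/200 split.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

labels = np.array([0] * 500 + [1] * 300 + [2] * 200)  # scream, noise, conversation
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1, 2]), y=labels)
class_weight = dict(enumerate(weights))  # e.g. {0: 0.67, 1: 1.11, 2: 1.67}
# model.fit(X, y, class_weight=class_weight, ...)
```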
The model was implemented using Python 3, leveraging the TensorFlow framework for deep learning operations and the Librosa library for audio processing and feature extraction. All training procedures were conducted on an NVIDIA GeForce RTX 3060 GPU, which provided adequate computational resources for iterative model development and experimentation.
3-2. Challenges and Limitations
This study encountered several challenges that warrant discussion as future research directions.
- Acoustic interference: The model demonstrated susceptibility to false positive detections in noisy environments, particularly when exposed to high-frequency ambient sounds. To address this limitation, future work should integrate advanced noise filtering techniques, such as Generative Adversarial Network (GAN)-based audio enhancement algorithms, which have shown promise in speech signal processing applications.
- Privacy considerations: Continuous audio monitoring raises significant ethical concerns regarding user privacy and data security, and the current implementation requires careful consideration of data handling protocols. To mitigate these concerns, we propose exploring on-device processing architectures that minimize data transmission to external servers, thereby reducing privacy risks while maintaining system functionality.
- Computational efficiency: The current model architecture is computationally intensive, and its resource requirements exceed the capabilities of typical low-power embedded systems, limiting deployment on edge devices. Future iterations require optimization strategies, including model compression techniques such as quantization and pruning, to enable efficient deployment on resource-constrained devices (a conversion sketch follows this list).
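As a sketch of the quantization route named above, the TensorFlow Lite converter supports post-training quantization directly on a Keras model. `build_crnn` refers to the hypothetical model sketch earlier in this post; this is one possible path, not the study's deployed pipeline.

```python
# Post-training quantization sketch with the TensorFlow Lite converter.
import tensorflow as tf

model = build_crnn()  # in practice: a trained model, e.g. tf.keras.models.load_model(...)

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enables weight quantization
# Recurrent layers sometimes need the TF-ops fallback to convert cleanly:
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,
    tf.lite.OpsSet.SELECT_TF_OPS,
]
tflite_model = converter.convert()

with open("scream_detector.tflite", "wb") as f:
    f.write(tflite_model)
```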
4. EXPERIMENTS
We conduct experiments on three publicly available real-world datasets. (1) ESC50: prior work has achieved up to 99.49% accuracy in environmental sound classification on this dataset by using Mel spectrograms with transfer learning models (ResNet-152, DenseNet-161) and meaningful data augmentation applied directly to the audio clips [2]. (2) FSD50K: a dataset containing 51,197 Freesound clips totalling over 100 hours of audio, manually labeled using 200 classes drawn from the AudioSet Ontology; the clips are CC-licensed, making the dataset freely distributable, including the audio waveforms [3]. We split 3,000 of the original clips into shorter segments of 2-3 seconds each [4]. These studies also inspired us to use average-pooled transformer layers alongside the CRNN and CNN models to maximize accuracy. (3) Lastly, we adopt RAVDESS to train the second-phase model to recognize fearful emotion [5].
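A hedged sketch of the 2-3 second segmentation step described above, using Librosa and the soundfile library; the 2.5-second window, 16 kHz rate, and file naming are illustrative assumptions.

```python
# Split a long clip into fixed-length segments and write them to disk.
import librosa
import soundfile as sf

def segment_clip(path, out_prefix, sr=16000, seg_seconds=2.5):
    y, _ = librosa.load(path, sr=sr, mono=True)
    seg_len = int(sr * seg_seconds)
    for i in range(0, max(len(y) - seg_len + 1, 1), seg_len):
        chunk = y[i:i + seg_len]
        sf.write(f"{out_prefix}_{i // seg_len:03d}.wav", chunk, sr)

segment_clip("fsd50k_clip.wav", "segments/fsd50k_clip")
```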
5. CONCLUSIONS
In this paper, we formulated the scream detection problem and proposed an ensemble of three different models: a convolutional recurrent neural network (CRNN) that captures spatiotemporal dependencies [1], a convolutional neural network, and the pretrained YAMNet used for transfer learning, trained on three datasets: ESC50, FSD50K, and RAVDESS. In particular, we divided the system into two phases: a first-phase model that discriminates screams from ordinary sounds, and a second-phase model trained to detect fear in those sounds. We further integrated the three models with voting. When evaluated on the three datasets, our approach obtained significantly better predictions than the baselines. For future work, we will investigate two aspects: (1) integrating noise-robust front ends, such as GAN-based audio enhancement, to reduce false positives in noisy environments; and (2) compressing the ensemble through quantization and pruning so that it can run continuously on resource-constrained mobile devices.
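For concreteness, here is a minimal sketch of the voting integration described above. Soft voting (averaging class probabilities) is assumed; the three model handles stand in for the CNN, CRNN, and YAMNet-based classifiers and are hypothetical names.

```python
# Soft-voting ensemble sketch: average class probabilities, then take argmax.
import numpy as np

def ensemble_predict(models, features):
    """Average the class-probability outputs of several Keras-style models."""
    probs = np.mean([m.predict(features, verbose=0) for m in models], axis=0)
    return probs.argmax(axis=-1)

# predictions = ensemble_predict([cnn, crnn, yamnet_head], mfcc_batch)
```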
REFERENCES
- [1] Yaguang Li, Rose Yu, Cyrus Shahabi, Yan Liu, "Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting," ICLR, 2018. https://arxiv.org/abs/1707.01926
- [2] Zohaib Mushtaq, Shun-Feng Su, Quoc-Viet Tran, "Spectral images based environmental sound classification using CNN with meaningful data augmentation," Applied Acoustics, 2020. https://www.sciencedirect.com/science/article/abs/pii/S0003682X2030685X
- [3] Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, Xavier Serra, "FSD50K: An Open Dataset of Human-Labeled Sound Events," arXiv:2010.00475, 2020. https://arxiv.org/abs/2010.00475
- [4] Prateek Verma, Jonathan Berger, "Audio Transformers: Transformer Architectures for Large Scale Audio Understanding," arXiv:2105.00335, 2021. https://arxiv.org/abs/2105.00335
- [5] Ohad Cohen, Gershon Hazan, Sharon Gannot, "Multi-Microphone and Multi-Modal Emotion Recognition in Reverberant Environment," arXiv:2409.09545, 2024. https://arxiv.org/abs/2409.09545
