Optimizing Performance of Real-time Detection | ITEA Journal

SEPTEMBER 2024 | Volume 45, Issue 3

Optimizing Performance of Real-time Detection and Classification for Military Personnel and Weapons

Jim Ferry

Karly Parcell

Undergraduate Student,
United States Military Academy, NY

Brian Choi, Ph.D.

Assistant Professor,
United States Military Academy, NY

James K. Starling, Ph.D.

Associate Professor,
United States Military Academy, NY

 

Abstract

This study explores the application of the YOLOv5 model for real-time object detection, testing a modified YOLOv5 command line to maximize model performance in threat detection of soldiers and weapons. By leveraging a modified YOLOv5 model, this study demonstrates significant improvements in detecting and classifying dismounted threats from surveillance footage, and it aims to minimize the deployment of government employees for surveillance tasks and to reduce human error in threat identification. Our findings indicate enhanced true positive rates for detecting guns and persons, reducing the need for extensive human monitoring. The study also discusses ethical considerations and future enhancements, including the use of synthetic data to train the model under varied conditions. This research contributes to the field by integrating cutting-edge AI to enhance the efficiency and accuracy of security operations, potentially transforming military and surveillance practices.

Keywords: YOLOv5, Real-time Classification, Object detection, Drones

Introduction

With the growing volume of real-time streaming footage that must be analyzed during military operations, the number of operators required to monitor this footage has steadily increased to maintain accuracy. This is inefficient: resources are allocated to monitoring footage rather than to soldiers working on the ground. Not only does this increase the person-hours associated with monitoring real-time footage for threats, it does nothing to minimize the human error associated with operators constantly monitoring an excess of footage.

Real-time object detection and classification is at the forefront of emerging technology. You Only Look Once (YOLO) has been a continuously developing algorithm, making it a cutting-edge resource in object detection applications (Redmon et al. 2016). Various versions of the YOLO algorithm have been evaluated through implementation in current-day operations (Lee and Hwang 2022). The algorithm has been assessed across a diversity of fields, ranging from use in self-driving cars, agriculture, and the healthcare profession, to military purposes (Wang et al. 2022, Turay and Vladimirova 2022). The versatility of YOLO, used in multiple applications, is one of its primary strengths.

This research aims to address the issue of reallocating soldiers from combat to operator duty for monitoring incoming surveillance footage, and to increase the timeliness of decision making regarding military threats during combat. By utilizing video feeds sourced from both mounted cameras and drones, it puts soldiers back in the fight. Our model not only decreases the demand for operators to monitor screens in real time but also improves the efficiency and accuracy of monitoring footage by limiting human error. The general purpose of the model is to minimize the risk placed on soldiers during times of combat and assist them in detecting threats, thereby improving the strategic placement of manpower and firepower during times of conflict.

We begin with the original YOLOv5 model and its parameters, using the command line to maximize the true positive rate of our model for the detection of guns and persons. A true positive detection occurs when the model is correct in both the location of an object within an image and the classification of the object detected. YOLOv5 is the version of YOLO chosen for our training and testing. After many iterations of training and testing our model across a multitude of parameters, we present our highest-performing model along with an explanation of each parameter and its place within our model.

Background

1. Evolution of YOLO

Since YOLO first appeared in 2016, the evolution of YOLO models has played a major role in advancing artificial intelligence (AI) usage across a multitude of fields (Redmon et al. 2016). The increased usage, understanding, and applications of detection and classification models are apparent when analyzing the evolution of YOLO. A distinction between classification and detection is made in the Detection vs Classification subsection below. The release of YOLOv1 revolutionized real-time object detection through its one-pass, grid-based approach, from which the name YOLO derives (Terven and Cordova-Esparza 2023). While groundbreaking, YOLOv1 had many limitations, performing poorly on small, overlapping objects and failing to distinguish between object types. Building on the foundations of YOLOv1, YOLOv2 emerged in 2017, improving accuracy with the addition of anchor boxes and a refined architecture while maintaining real-time capabilities (Terven and Cordova-Esparza 2023). In 2018, YOLOv3 was released to cater to a wider range of practical applications with its multi-scale detection and deeper architecture (Terven and Cordova-Esparza 2023).

In 2020 the release of YOLOv4 marked a pivotal moment in the advancement of real-time object detection through its architecture, introducing enhancements such as optimization techniques, augmented data layers, and advanced features such as CSPDarknet53 for the backbone layers and the Path Aggregation Network (Terven and Cordova-Esparza 2023). YOLOv4 excelled in accuracy while preserving its hallmark near-real-time to real-time inference speeds. Through improved loss functions and innovative techniques, YOLOv4 produced greater precision on benchmark datasets and extended its usability across various platforms. YOLOv4 affirmed YOLO’s position as a pioneering solution in computer vision, setting the stage for subsequent iterations. This brings us to YOLOv5, the model we utilized for this research, which further elevated accuracy, robustness, and usability in the field of real-time object detection (Terven and Cordova-Esparza 2023).

2. Detection vs Classification

Establishing a distinction between image classification and object detection is necessary to understand the application of a Convolutional Neural Network (CNN) as a base for many different image classification and object detection algorithms. The most apparent distinction between the two is that image classification takes an image and predicts the class of that image as a whole (Kniazieva 2023), while object detection singles out objects within an image and detects each object separately. This allows for the classification of multiple objects within a single image, a process typically performed using bounding boxes, specifically when discussing YOLO. Rather than giving an image one broad label, each boxed object is given its own classification, using the same process as classifying an entire image but splitting one large image into multiple smaller sections and predicting their labels separately. Figure 1 below shows this difference: on the left-hand side, the image containing only a dog is classified as "dog" when classification is used, while on the right-hand side, the dog and cat in the image are detected first and then classified individually.

Figure 1 Classification vs Detection

Architecturally, the differences between basic CNNs and YOLOv5 are significant (Ashraf et al. 2022). A standard CNN architecture typically comprises a sequence of convolutional layers for feature extraction, followed by pooling layers, fully connected layers, and a final output layer for classification (Terven and Cordova-Esparza 2023). This structure is relatively straightforward and efficient for image classification. YOLOv5, on the other hand, features a more complex architecture tailored for object detection. It utilizes a single neural network that divides the input image into a grid and predicts bounding boxes and class probabilities for each grid cell. This approach enables simultaneous detection across the entire image. YOLOv5 additionally incorporates advanced features like skip connections, batch normalization, and a unique loss function to enhance learning efficiency and detection accuracy, making it far more sophisticated than an ordinary CNN (Chen et al. 2021).
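To make this distinction concrete, the short sketch below (ours, not part of the study's pipeline) runs the open-source Ultralytics YOLOv5 hub model on a single image and prints one bounding box, class, and confidence per detected object, in contrast to a plain image classifier, which would return a single label for the whole image. The image path is a placeholder.

```python
# Minimal sketch: YOLOv5 returns one box, class, and confidence per detected object,
# whereas an image classifier would assign a single label to the whole image.
# Assumes the open-source Ultralytics YOLOv5 release; the image path is a placeholder.
import torch

# Load the pretrained medium model from the Ultralytics hub (downloads weights on first use).
model = torch.hub.load("ultralytics/yolov5", "yolov5m", pretrained=True)

# Run detection on a single image.
results = model("frame_0001.jpg")

# Each row is one detected object: xmin, ymin, xmax, ymax, confidence, class id, class name.
detections = results.pandas().xyxy[0]
for _, det in detections.iterrows():
    print(f"{det['name']}: confidence {det['confidence']:.2f}, "
          f"box ({det['xmin']:.0f}, {det['ymin']:.0f}, {det['xmax']:.0f}, {det['ymax']:.0f})")
```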

3. Pretrained YOLOv5 Labels Data Set

The open-source YOLOv5 model has been pretrained on the COCO labels dataset (Lin et al. 2014). The dataset consists of approximately 118,000 training images spanning 80 different labels. While "person" is a label contained within the COCO labels dataset, no type of gun or weapon is accounted for in the list of object categories. As "gun" and "person" are the two classification labels of interest for us in defining a threat, the base YOLO model must be modified and retrained to include them.
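For illustration, retraining YOLOv5 on labels outside of COCO requires a small dataset definition file. The sketch below is our assumption about a workable layout, not the authors' exact configuration; the directory paths are placeholders, and the class ordering is illustrative.

```python
# Sketch of the dataset definition YOLOv5 expects when retraining on custom labels.
# Paths and class order are assumptions for illustration only.
import yaml  # PyYAML, already a YOLOv5 dependency

dataset_config = {
    "train": "datasets/youtube_gdd/images/train",   # assumed directory layout
    "val": "datasets/youtube_gdd/images/val",
    "nc": 2,                                         # number of classes
    "names": ["gun", "person"],                      # assumed class order: 0 = gun, 1 = person
}

with open("gdd.yaml", "w") as f:
    yaml.safe_dump(dataset_config, f, sort_keys=False)
```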

4. YOLOv5 Architecture

The YOLOv5 architecture is broken into three sections: backbone, neck, and head. The backbone layers are where feature extraction takes place, making up the first 10 layers of a model. The use of the backbone in the modified model is described in depth in the following section. The neck layers of YOLOv5 are used to detect features of various sizes; they consist of five cross stage partial (CSP) layers, four convolutional layers, and four connectional layers, producing a pyramid of features at varying complexities. The final section of YOLOv5 is the head, which is responsible for detection and classification, performing the detection step using bounding boxes (Jooshin, Nangir, and Seyedarabi 2023). There are multiple open-source sizes of YOLOv5 available for use, nano, small, medium, large, and extra-large, where the choice of model size depends on factors such as the dimensions and sample sizes of the training data. This study uses the medium version, which strikes a balance between speed and accuracy. YOLOv5 bounding boxes identify the region of the image where an object has been detected. Larger objects are detected with fewer anchor boxes and at lower resolutions, while smaller objects require more anchor boxes and a higher-resolution image. Anchor boxes are prediction boxes for where the model believes an object could be during training and are turned into bounding boxes for the model's final prediction.

5. Training Images

The compilation of images used throughout the training phases of our model was taken from the YouTube Gun Detection Dataset (YouTube-GDD) (Gu, Liao, and Qin 2022). This dataset consists of 5,000 images taken from 343 different YouTube videos. All images contain a gun, a person, or both. Within the 5,000 images, there are 16,064 labeled instances of "gun" and 9,046 labeled instances of "person." The 5,000 images were split into 10 groups of 500 images each. Two of these groups were set aside as the testing set, and the other eight groups were used for the training set. The training set of 4,000 images was used to create and update the weights applied during testing of our model on the remaining 1,000 images. All images were prelabeled for use with YOLOv5, which reduced the preparation required for training and testing compared with the time that would have been spent labeling the images for other versions of YOLO.
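The split described above can be reproduced in a few lines of Python. The sketch below is our illustration, with placeholder file paths and an arbitrary random seed; it does not claim to recreate the authors' actual partition.

```python
# Sketch of the 80/20 split: 5,000 prelabeled YouTube-GDD images divided into
# ten groups of 500, with two groups held out for testing. Paths are placeholders.
import random
from pathlib import Path

images = sorted(Path("datasets/youtube_gdd/images").glob("*.jpg"))
assert len(images) == 5000, "expecting the full YouTube-GDD image set"

random.seed(0)          # fixed seed so the split is reproducible
random.shuffle(images)

groups = [images[i:i + 500] for i in range(0, 5000, 500)]   # ten groups of 500 images
test_set = [img for group in groups[:2] for img in group]   # 2 groups -> 1,000 test images
train_set = [img for group in groups[2:] for img in group]  # 8 groups -> 4,000 training images
```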

5. Modifying YOLOv5

In the field of object detection using deep learning, YOLOv5 stands as a significant development, primarily due to the specific training parameters it offers for optimizing model performance. The original YOLOv5 model used for training our data did not produce strong results as measured by the true positive rate. For the modified model, the focus was on maximizing the true positive rate and, conversely, minimizing the false negative rate of prediction performance during testing. With these as the primary measures of our model's performance, many parameters of the command line used during the training phase were manipulated.

1. Defining Parameters of Original vs Modified Training Models

This section explains the purpose and importance of each of the parameters listed in Table 1. The parameters used for the original and modified models are image size, batch, epoch, data, weights, freeze, and cache; the role each parameter's value plays in creating a higher-performing modified model is described below. The original command line is set to optimize performance for images containing objects that the model has been pretrained on.

Table 1 Original and Modified YOLOv5 Model Parameters

The original model set all images to a standard size of 640 x 640 pixels. Fixing a standard size is not ideal when our objective is to create a robust model that can detect objects from a variety of angles and distances. For the modified model, the multi-scale parameter is added; this not only allows but forces a variety of resolutions in the images used to train the model, ranging from as few as 320 x 320 pixels up to 1024 x 1024. Multi-scale allows the model to train at a variety of resolutions, which improves the robustness, generalization, and adaptability of the model, increasing performance in both image and video testing.
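Conceptually, multi-scale training means each batch is run at a randomly selected resolution that is a multiple of the network's 32-pixel stride. The toy sketch below is ours and simply illustrates the idea, using the 320 to 1024 range quoted above; the exact range YOLOv5 uses is tied to the base image size.

```python
# Illustration only: pick a random training resolution per batch instead of a fixed 640 x 640.
import random

GRID_STRIDE = 32
sizes = list(range(320, 1024 + 1, GRID_STRIDE))   # 320, 352, ..., 1024

for batch_idx in range(5):
    resolution = random.choice(sizes)
    print(f"batch {batch_idx}: train at {resolution} x {resolution}")
```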

Both the original and modified models have a batch size of 50. The batch size is the number of images the model goes through before updating weights. For one iteration through our training set of 4,000 images, the modified model updates 80 times, or once every 50 images. Updating the model frequently can open it up to overfitting on the dataset, but by taking the general information learned over a series of 50 images, our model is able to adapt to more generalized trends rather than to specific images. Various batch sizes were tested throughout the training process, and 50 resulted in the greatest testing performance. This enables robust model performance when generalizing to a larger population.

The epoch value is the number of times the training dataset is run through. With an epoch value of 100, the model is trained on 400,000 images, or 100 iterations of each image in the 4,000-image training set. While the epoch and batch size are the same for both the original and modified command lines, the addition of the multi-scale parameter in the modified model means that each time an image passes through the modified model, it does so at a different randomly selected resolution, whereas the original model trains on each image at the same resolution for every iteration. The multi-scale parameter, paired with the batch size and epoch value in the modified model, helps to minimize overfitting to the training dataset compared to the original model (Models, n.d.).

The original model did not set parameters for weights, freezing layers, or cache. The modified model was given the weights parameter, set to the YOLOv5 weights for a medium-sized model. The weights correspond to the architecture size of the model, and using the medium-sized model gives a balance between training speed and accuracy. While small weights would increase speed and large weights would prioritize accuracy, medium weights balance the two, allowing us to train on the images in a timely manner while performing with high accuracy during testing.

The freeze parameter was added to the modified model and given a value of 10, meaning the first 10 layers of the model are frozen for each iteration of training. As described in the YOLOv5 Architecture subsection above, the first 10 layers are the backbone layers of our model, where feature extraction takes place. Freezing the 10 backbone layers preserves the feature extraction abilities that the pretrained YOLOv5 model developed over millions of iterations through its 118,000 images with 80 different object labels. In testing multiple models, it was found that freezing additional layers, reaching into the neck and head where the early steps of detection begin, was not beneficial for our data. Doing so relied on the pretrained model for detections and predictions, yet the pretrained model was not trained to detect guns, so many guns present in training went undetected. While freezing more layers did increase the true positive rate for persons, since the pretrained model can already detect people, it decreased the predictions and confidence for guns. Lastly, the cache parameter was added to improve the training speed of the model. This additional parameter does not affect the performance of the model, only the speed of training and testing: after the images are run through once, they are saved for use in subsequent iterations (Jocher 2020).
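Pulling these parameters together, a plausible reconstruction of the modified training invocation, using the standard YOLOv5 train.py flags and the values discussed above, is sketched below. The dataset file name is a placeholder, and the exact command line used by the authors may differ.

```python
# Our reconstruction of the modified YOLOv5 training command, not the authors' verbatim command.
# Values reflect the parameters discussed in the text; "gdd.yaml" is a placeholder dataset file.
import subprocess

subprocess.run([
    "python", "train.py",
    "--img", "640",              # base resolution; varied per batch by --multi-scale
    "--multi-scale",             # randomly rescale images during training
    "--batch", "50",             # update weights every 50 images
    "--epochs", "100",           # 100 passes over the 4,000 training images
    "--data", "gdd.yaml",        # dataset definition with the "gun" and "person" labels
    "--weights", "yolov5m.pt",   # medium pretrained weights (speed/accuracy balance)
    "--freeze", "10",            # freeze the 10 backbone layers used for feature extraction
    "--cache",                   # cache images after the first pass to speed up training
], check=True)
```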

2. True Positive / False Negative Rate Calculation

A true positive (TP) occurs when our model accurately predicts the object: either a gun is present in the image and is correctly detected and classified as a gun, or a person is present and is correctly detected and classified as a person. A false negative (FN) occurs when the model fails to detect, or misclassifies, the gun or person; for example, a gun is present in the image but is not detected by the model, or the gun is detected but misclassified as a person. These metrics are useful for comparing the performance of different command lines during the testing phases of YOLOv5, as seen in the results section below. The values in the confusion matrix comparison in Figure 2 are produced by the YOLOv5 model, and understanding how these values are produced is key to measuring the performance of the model.

The true positive rate (TPR) is calculated by taking the number of true positives (the times a gun is correctly detected when present) as the numerator and dividing by all instances of a gun present in the testing images, which is the sum of all TP and FN, as the denominator. The equation for calculating the true positive rate is given in Equation (1).

TPR = TP / (TP + FN)     (1)

The false negative rate (FNR) is the complement of the TPR. The FNR can be found as (1 − TPR), or by taking the FN (instances when a gun is left undetected or misclassified) as the numerator and dividing by all instances of a gun present in the testing images, the sum of FN and TP. Equation (2) gives the equation for the false negative rate.

FNR = FN / (TP + FN) = 1 − TPR     (2)
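A minimal sketch of these two calculations, applied per class, is shown below; the counts are placeholders standing in for values read from the YOLOv5 confusion matrix.

```python
# Sketch of Equations (1) and (2) applied per class; counts are illustrative placeholders.
def true_positive_rate(tp: int, fn: int) -> float:
    """TPR = TP / (TP + FN): fraction of present objects detected and classified correctly."""
    return tp / (tp + fn)

def false_negative_rate(tp: int, fn: int) -> float:
    """FNR = FN / (TP + FN) = 1 - TPR: fraction of present objects missed or misclassified."""
    return fn / (tp + fn)

# Example with illustrative counts for the "gun" class.
tp_gun, fn_gun = 66, 34
print(true_positive_rate(tp_gun, fn_gun))   # 0.66
print(false_negative_rate(tp_gun, fn_gun))  # 0.34
```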

6. Results

1. Confusion Matrix Comparison

The primary method of measuring our model's performance and comparing the modified model to the original is through confusion matrices. The values where the True and Predicted labels are the same classification are the true positive rates for the objects ("gun" and "person"). The other boxes in each column are summed to get the false negative rate for each object. The values in the confusion matrices shown in Figure 2 are produced from the performance of each model on the testing set of the YouTube-GDD.

Figure 2 Confusion Matrix Comparison

In terms of maximizing the true positive rate, the original model's 73% and 50% true positive rates for person and gun, respectively, were both increased when measuring the performance of the modified model on the testing image set. The modified model predicted with a true positive rate of 87% for person, an increase of 14 percentage points over the original model's predictive performance. For the true positive rate of guns, there was a 16 percentage point increase, from 50% to 66% in the modified model.

2. Performance on Video Testing

Starting with the original command line for training the model and replacing multiple parameters, the modified model is produced. This model maximized the TPR for detecting both guns and persons in our testing data. The values to the right of the classification labels in the figures shown below are the confidence of the model in its prediction. For example, for the largest bounding box on the left side of Figure 3, the confidence is 0.28, meaning the model is 28% certain that it has correctly identified a gun in that particular region of the image. The highest confidence is one, meaning the model would be 100 percent confident. Figure 3 shows a frame taken from a video used for testing with the weights from our original model.

Testing with the original model in Figure 3, no persons are detected in the frame, and the guns detected on the left side of the image are bounded and classified with very low confidence; only one gun is predicted by the model with high confidence. A possible explanation is that the model is overfitting to the training data: an imbalance exists, as there are far more instances of guns than persons, and the precautions discussed above to minimize overfitting in the modified model, such as multi-scale and freezing layers, were not taken.

Figure 3 Image from Video Testing Original YOLOv5 Model

Figure 4 Image from Video Testing Modified YOLOv5 Model

The modified model's predictions are shown in Figure 4. While the model still fails to detect a person and a gun within the image, the two persons and guns that are detected have accurate bounding boxes and correct classifications, and the model is highly confident in its predictions. The modified model's predictions in Figure 4 also do not indicate any misidentifications, while the original model's predictions in Figure 3 show a box around the person, predicting them as a gun. While a gun is contained within that box, comparison with the predictions of Figure 4 makes it easy to see that the prediction should have been "person", not "gun". Similarly, the smaller bounding box within the large box, on the shoulder/back of the individual, also predicts "gun", when no gun is contained within that region of the image.
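For reference, the sketch below (our illustration, with placeholder paths and weight files) shows how trained weights can be run over video frames and the per-detection confidences read out through the YOLOv5 hub interface, which is how figures like Figures 3 and 4 are typically generated.

```python
# Sketch of running trained weights on a video frame and reading per-detection confidences.
# Paths and weight files are placeholders, not the authors' artifacts.
import cv2
import torch

model = torch.hub.load("ultralytics/yolov5", "custom", path="runs/train/exp/weights/best.pt")

cap = cv2.VideoCapture("test_video.mp4")
ok, frame = cap.read()
if ok:
    results = model(frame[..., ::-1])          # OpenCV frames are BGR; convert to RGB
    for *box, conf, cls in results.xyxy[0].tolist():
        print(f"{model.names[int(cls)]}: confidence {conf:.2f}")
cap.release()
```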

7. Discussion

Optimizing the YOLOv5 model for real-time detection and classification of military personnel and weapons has led to significant enhancements in model performance, particularly in increasing the true positive rates for detecting guns and persons. This improvement is critical for military operations, where accurate and timely threat detection can significantly impact decision making and resource allocation. The modified YOLOv5 model, with its advanced training parameters such as multi-scale training, layer freezing, and balanced batch size, has shown a substantial increase in detection accuracy. This advancement reduces the need for human monitoring and minimizes the potential for human error.

One of the primary implications of our findings is the potential to transform military surveillance practices by reducing the number of personnel required to monitor real-time footage. This shift allows for the reallocation of these personnel to more critical on-ground tasks, enhancing operational efficiency and effectiveness. The increased true positive rates for both guns and persons signify a more reliable and robust detection system that can operate with higher confidence in varied environments and scenarios. This improvement is particularly relevant for enhancing the strategic placement and utilization of military assets during combat situations.

8. Conclusion

1. Future Work

The first priority for future work lies in the presentation and effectiveness of the alert notification that brings the operator's attention to a detected threat while providing a brief and accurate description of it. Currently, the ability exists to extract information about the contents detected in a frame, but there is no platform on which to present the notification. One solution is to present the extracted information via a desktop notification, or an email/text notification, in order to extend alerts not just to those behind the screen but to those out in the field. Additionally, a threshold needs to be created, and criteria defined, for when an alert is sent. Following the extraction of information, an alert would be sent each time the classification changes, since repeated detections might unnecessarily overload the operator, as sketched below. Implementing well-defined criteria for when the extracted information is sent is critical to optimizing response time and confirmation by the operator.
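A hypothetical sketch of such a criterion is shown below; the confidence threshold and the notify function are placeholders for choices that would be made in future work, not an implemented system.

```python
# Hypothetical alert criterion: notify the operator only when the set of detected threat
# classes changes, rather than on every repeated detection. All values are placeholders.
CONFIDENCE_THRESHOLD = 0.50   # assumed threshold; a real value would be tuned during testing

def notify(message: str) -> None:
    print(f"ALERT: {message}")  # placeholder for a desktop, email, or text notification

previous_classes: set[str] = set()

def process_frame_detections(detections: list[tuple[str, float]]) -> None:
    """detections: (class_name, confidence) pairs for one frame."""
    global previous_classes
    current = {name for name, conf in detections if conf >= CONFIDENCE_THRESHOLD}
    if current and current != previous_classes:
        notify(f"threat classes detected: {', '.join(sorted(current))}")
    previous_classes = current
```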

Utilizing synthetic imagery to train our model, as opposed to the current dataset, could further increase model performance (Dewi et al. 2021). While the modified model showed improvement, factors outside the model must still be considered to further increase the true positive rate. Training on synthetic data generated with Arma 3 increases robustness by removing the limits imposed by outside factors such as weather and access to personnel, weapons, and uniforms; the model can train on a virtually unlimited set of images created in the game (Bohemia Interactive 2023). This provides the ability to create scenarios, from the angles of drones and mounted cameras and in 360 degrees, of soldiers in various uniforms, with a multitude of guns, and in a variety of weather conditions and geographic locations. This has the potential to create a more robust model and aims to improve the real-time true positive rates of classification for the model.

Finally, we plan to diversify the training data to account for the model's performance in distinguishing non-weapons from real weapons. Examples include an individual holding a camera in the same stance in which a gun is held, or a soldier holding brooms or sticks instead of a gun. In addition, training on both friendly and foreign uniforms will account for soldier distinction. This will minimize the false positive threat alerts sent to an operator.

2. Ethical Considerations

The AI is not given the power to make decisions; rather, it helps inform operators so they can make a timelier decision. The drone does not advise or act on any classification or detection it performs. While ideally the model would make decisions to eliminate the need for operator action and time, the model would need to undergo greater testing and improved performance before that becomes a viable option. The majority of ethical considerations with our model surround the information the model is trained on. Because the model is not currently trained on a diverse population of images, specifically uniforms, there is a risk that many individuals will be identified as threats when they are friendly forces. To mitigate this, before use in a combat scenario, the model will send an alert to notify an operator of the threat detected; the model's detection and classification is not the sole authority for actions taken toward a detected threat.

Producing a diverse dataset that includes potential enemy forces' uniforms and weapons could also be seen as unethical, depending on how the dataset is produced. To mitigate this concern, future work will increase the diversity of uniforms, weapons, backgrounds, and person positions through synthetic images. Synthetic images can provide a more diverse dataset for training and testing without the need to use real soldiers, offering lifelike images that can be manipulated and used to train our model.

Acknowledgements

We would like to thank Ms. Cathleen Rush and Ms. Natalie Stone at the United States Army Special Operations Command (USASOC) AI division at Fort Liberty, NC.

References

Ashraf, Abdul Hanan, Muhammad Imran, Abdulrahman M. Qahtani, Abdulmajeed Alsufyani, Omar Almutiry, Awais Mahmood, Muhammad Attique, and Mohamed Habib. 2022. “Weapons Detection for Security and Video Surveillance Using CNN and YOLO-V5s.” Computers, Materials and Continua 70 (2): 2761–75. https://doi.org/10.32604/cmc.2022.018785.

Bohemia Interactive. 2023. “Arma 3.” https://arma3.com/.

Chen, Leiyu, Shaobo Li, Qiang Bai, Jing Yang, Sanlong Jiang, and Yanming Miao. 2021. “Review of Image Classification Algorithms Based on Convolutional Neural Networks.”

Dewi, Christine, Chen Rung-Ching, Liu Yan-Ting, Jiang Xiaoyi, and Hartomo Kristoko Dwi. 2021. “Yolo V4 for Advanced Traffic Sign Recognition With Synthetic Training Data Generated by Various GAN.” IEEE Access. https://doi.org/10.1109/ACCESS.2021.3094201.

Horvat, Marko, and Gordan Gledec. 2022. "A Comparative Study of YOLOv5 Models Performance for Image Localization and Classification." In 33rd Central European Conference on Information and Intelligent Systems (CECIIS), 349.

Jocher, Glenn. 2020. Ultralytics YOLOv5. Version 7.0. AGPL-3.0. https://github.com/ultralytics/yolov5. https://doi.org/10.5281/zenodo.3908559.

Jooshin, H. K., M. Nangir, and H. Seyedarabi. 2023. “Inception-YOLO: Computational Cost and Accuracy Improvement of the YOLOv5 Model Based on Employing Modified CSP, SPPF, and Inception Modules.” IET Image Processing. https://doi.org/10.1049/ipr2.13077.

Kniazieva, Yuliia. 2023. “What’s the Difference Between Image Classification & Object Detection?” 2023. https://labelyourdata.com/articles/object-detection-vs-image-classification.

Lee, Jeonghun, and Kwang il Hwang. 2022. “YOLO with Adaptive Frame Control for Real-Time Object Detection Applications.” Multimedia Tools and Applications 81 (25). https://doi.org/10.1007/s11042-021-11480-0.

Models, Machine Learning. n.d. “Optimizing Machine Learning: Determining the Ideal Number of Epochs.” https://machinelearningmodels.org/optimizing-machine-learning-determining-the-ideal-number-of-epochs/.

Redmon, Joseph, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016. “You Only Look Once: Unified, Real-Time Object Detection.” In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2016-December:779–88. IEEE Computer Society. https://doi.org/10.1109/CVPR.2016.91.

Terven, Juan, and Diana Cordova-Esparza. 2023. “A Comprehensive Review of YOLO: From YOLOv1 and Beyond,” April. http://arxiv.org/abs/2304.00501.

Lin, Tsung-Yi, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. 2014. "Microsoft COCO: Common Objects in Context." https://arxiv.org/abs/1405.0312.

Turay, Tola, and Tanya Vladimirova. 2022. “Toward Performing Image Classification and Object Detection with Convolutional Neural Networks in Autonomous Driving Systems: A Survey.” IEEE Access 10. https://doi.org/10.1109/ACCESS.2022.3147495.

Wang, Aichen, Tao Peng, Huadong Cao, Yifei Xu, Xinhua Wei, and Bingbo Cui. 2022. “TIA-YOLOv5: An Improved YOLOv5 Network for Real-Time Detection of Crop and Weed in the Field.” Frontiers in Plant Science. https://doi.org/10.3389/fpls.2022.1091655.

Gu, Yongxiang, Xingbin Liao, and Xiaolin Qin. 2022. "YouTube-GDD: A Challenging Gun Detection Dataset with Rich Contextual Information." https://arxiv.org/abs/2203.04129.

Author Biographies

Karly Parcell is an Undergraduate student, majoring in Applied Statistics and Data Science at the United States Military Academy.

Dr. Brian Choi is an Assistant Professor in the Department of Mathematical Sciences at the United States Military Academy at West Point. His PhD was in pure and applied math at Boston University.

Dr. James K. Starling is an Associate Professor in the Department of Mathematical Sciences at the United States Military Academy at West Point. He serves as an Operations Research/Systems Analysis (ORSA) Army officer with over twenty-four years of active-duty experience. He received his PhD from the University of Washington in Industrial & Systems Engineering.
