
DECEMBER 2024 | Volume 45, Issue 4

Synthetic Data for Target Acquisition

Maxwell Felter

Cadet; United States Military Academy

Major Javier Sustaita

Instructor; United States Military Academy

Dr. James Starling

Associate Professor; United States Military Academy

DOI: 10.61278/itea.45.4.1004

Abstract

The application of computer vision for target acquisition of military combat vehicles requires ample labeled data. This data can be hard to come by and challenging to create. This article encourages the use of synthetic data as an alternative to the creation of real data. As a proof of concept for this alternative route, we explore using computer-generated images (CGI) and artificial intelligence (AI) text-to-image generation to improve model performance on the object detection of combat vehicles. Both forms of synthetic data show roughly a 10% improvement in relevant performance metrics. Our study also concluded that the way synthetic data is introduced to models during training is vital to their performance. The most successful approach mixed real and synthetic images during training; close behind was pre-training on synthetic images before training on real images. We found that training on real photos alone, or pre-training on real images before synthetic ones, produced less effective models. These results give the military reason to direct more resources and funding toward synthetic data creation, and they suggest what types of synthetic data to create and how to introduce it during model training. These results also shape the direction of further research as synthetic environments improve over time and will help prepare our computer vision-driven systems for the future battlefield.

Keywords: Computer vision, synthetic data, CGI, generative AI, digital sets

Introduction

Modern-day applications of computer vision often find roadblocks in the quantity and quality of training data available. Especially in the case of object detection tasks, it can be time-consuming to create and label new data. Military applications of computer vision, often within the realm of target acquisition, require effective models that can be relied on to act appropriately. The primary objective of this study is to evaluate the effectiveness of synthetic data in improving the performance of computer vision models for the object detection of military combat vehicles. Specifically, this research aims to compare the performance of models trained on computer-generated images from the Unreal Engine with those trained on AI text-to-image photos produced using the Stable Diffusion architecture. Additionally, this research explores the optimal training technique for integrating synthetic data with real-world data to enhance model accuracy and reliability.

The necessity for this technology has long been acknowledged. In 1999, a white paper from the US Army’s Training and Doctrine Command (TRADOC) stated that this autonomous capability would “augment manned platforms in every facet of operations on the ground, sea, air, and space” (US Army Training and Doctrine Command 1999). Our nation’s future conflicts will increasingly rely on these technologies and the image sensors we continue to attach to our equipment. The Department of Defense has maintained the view that “we have always been at the forefront of technological advances to ensure an enduring competitive military advantage against those who threaten our security and safety” (Department of Defense 2019).

With the advancing use of drones on the battlefield, we are further inclined to develop this technology to match our adversaries. In the high-stakes application of computer vision to the object detection of combat vehicles, there will likely always be demand for more quality data to improve models. However, large and organized stores of this data can be hard to come by. The openly available data that does exist can be poorly labeled and require many hours of labor to clean, and using it risks introducing unnecessary bias into the models, since they can only be as good as the data they are trained on.

Creating new, high-quality data is challenging due to the need for various combat vehicles, physical environments, and photographic equipment. Additionally, we cannot be sure that the domain of our training data closely matches the domain of the next conflict in which we intend to employ this technology. It is possible that our next conflict could see changes to vehicle structure, environment, and weather conditions. A feasible and sustainable solution to this layered problem could be found in the use of synthetic data. In the context of military vehicles, synthetic data can provide a level of operational security not possible when collecting real-world data. Collecting real data requires putting the vehicles out in the open in varying terrains, which could expose intentions to the adversary. Furthermore, synthetic data's minimal barriers to creation mean that a small team with minimal equipment can produce it without any extensive approval process.

Background

Synthetic data has been increasingly used as AI researchers exhaust available datasets, particularly for niche tasks. From autonomous driving (Ros et al. 2016) to commercial object detection (Borkman et al. 2021), synthetic data has played a key role in bypassing the need for real data. Our initial hypothesis is that the closer a type of synthetic data resembles the real target domain, the more it will benefit computer vision models and translate well to their application in real scenarios. How closely a synthetic environment matches a target domain is, however, challenging to measure.

Researchers have explored the use of synthetic data for military applications, but due to the technological constraints of simulation environments and their rapid advancement, the synthetic images in earlier work often look dated compared with what is possible today. Lee et al. (2020) applied synthetic data to military vehicle detection; however, they focused on the augmentation of synthetic data to maximize results rather than on creating realistic images to begin with. Can, Er, and Kunt (2021) also investigate the use of synthetic data for vehicle detection from an overhead point of view (POV) but do not provide a comparison of training techniques. Both studies focus primarily on creating a suitable synthetic set and do not discuss how to best use that set.

In this study, we analyze and compare the effects two different modes of synthetic data have on this specific task to suggest whether they should be further used to bolster target acquisition models. We also explore three training techniques beyond the strictly mixed approach, in which all images, real and synthetic, are combined for training. The first mode of synthetic data we explore is CGI through the Unreal Engine environment. The Unreal Engine is a popular 3D gaming engine created by Epic Games (Epic Games, Unreal Engine), commonly used for photorealistic games and scene creation. We use the 2022 version, Unreal Engine 5. The second mode of synthetic data we inspect is AI text-to-image photos produced using the Stable Diffusion architecture (Automatic1111 2022). Stable Diffusion is a mainstream text-to-image diffusion model released in 2022 and used to render pictures from custom prompts. This study provides value to the discussion of synthetic data's use in military object detection, as it analyzes two of the most modern sources of synthetic imagery as well as the most effective practices for their utilization.

Technical Content

1. Set Creation

Before creating our synthetic images, we needed to decide what type of real photos we intended to mimic with the synthetic environments. Given that many available battlefield photos today come from drones and other aerial vehicles (especially due to the war in Ukraine), we chose to focus on drone-style images. We created our target set of real images from various publicly hosted datasets on roboflow.com ("Roboflow" n.d.). Roboflow, a repository for creator-labeled image datasets, has recently seen an influx of drone-captured battlefield images due to global conflicts. Using these various datasets, we selected aerial-style images from varying distances and angles to form a representative collection of photos. Ultimately, our real set consisted of 508 images. The majority of these images appear to come from the war in Ukraine; however, some come from US Army training videos. Since many of these images come from the same region and time, the dataset contains a bias toward those specific types of equipment and conditions. A future study that creates a more distributed selection of images could help confirm that our results extend to more domains. Because we did not have access to a more diverse set of clean images, we allowed the target dataset for our study to focus on drone-style images from the war in Ukraine. A sample of three selected images is shown in Figure 1.

Figure 1: Example images from the real-only target set.
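Our real set was assembled from publicly hosted Roboflow datasets. The sketch below is only an illustration of how such a set could be pulled in YOLO format with the Roboflow Python package; the API key, workspace, and project names are placeholders, not the specific datasets used in this study.

```python
# Illustrative sketch: exporting a hosted Roboflow dataset in YOLOv5 format.
# The api_key, workspace, and project names are placeholders.
from roboflow import Roboflow

rf = Roboflow(api_key="YOUR_API_KEY")
project = rf.workspace("example-workspace").project("drone-tanks-example")
dataset = project.version(1).download("yolov5")  # writes images, labels, and data.yaml locally

print(dataset.location)  # local folder containing the train/valid/test splits
```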

To produce our CGI dataset, we began with an Unreal Engine (UE) setup containing various M1 Abrams assets to use in different scenes. Scenes were created both from scratch and from preset environments in UE5. Key considerations included varying the types of environments, backdrops, and extraneous objects present in the scenes. Our synthetic data can provide context to a computer vision model about how tanks appear in these environments and which objects should not be detected as tanks. After setting up a scene, we used the built-in UE camera tool to record videos capturing the M1 Abrams tank in the environment. To simulate drone-style images, we collected videos from above the tanks, manually moving the camera to mimic how a drone might fly through the scene rather than scripting its path. Given our limited experience with the software, we incorporated additional open-source videos from YouTube created in UE5 to increase the size of our CGI dataset. Many of these videos used M1 Abrams models as well as additional tank models. Altogether, our CGI dataset contained images from 10 clips taken in our six personally created scenes as well as 15 clips from YouTube.
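Because the CGI imagery originated as video clips, each clip had to be split into frames before labeling. We relied on CVAT's built-in frame splitting (described in the next paragraph); the following is only a sketch of how the same sampling could be done offline with OpenCV, with the file paths and sampling stride chosen purely for illustration.

```python
# Illustrative sketch: sampling every Nth frame from a rendered UE clip.
# Paths and the sampling stride are placeholders; we used CVAT's built-in
# frame splitting in practice.
import cv2
from pathlib import Path

def extract_frames(video_path: str, out_dir: str, stride: int = 10) -> int:
    """Save every `stride`-th frame of a video as a JPEG; return the count saved."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    saved, index = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % stride == 0:
            cv2.imwrite(f"{out_dir}/frame_{index:06d}.jpg", frame)
            saved += 1
        index += 1
    cap.release()
    return saved

extract_frames("ue5_abrams_scene01.mp4", "cgi_frames/scene01", stride=10)
```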

To label these videos, we used two pipelines: CVAT (cvat.ai) and Roboflow. In CVAT, videos were uploaded and split into frames. Using its interpolation tool, we labeled every 5th or 10th frame, and the tool automatically filled in the labels between them. When labeling for this task, we used only one class, tank, but had to produce bounding boxes for all instances in each image. Labeled frames were then uploaded to Roboflow, which also has a built-in labeling feature, for storage. When labeling within Roboflow, we relied on its automatic labeling assistant, which predicts the bounding boxes around tanks with a You Only Look Once (YOLO) computer vision model trained on more general objects such as vehicles, and we manually adjusted predictions when needed. An example of these images is shown in Figure 2.
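For reference, both CVAT and Roboflow can export annotations in the plain-text YOLO label format consumed by our training pipeline: one file per image, one line per tank, with the class index followed by the normalized box center and size. The label line, image size, and resulting pixel coordinates below are made up for illustration.

```python
# Illustrative YOLO-format label handling for a single "tank" class (class 0).
# Each line of a label file is: class x_center y_center width height (all normalized 0-1).
def yolo_to_pixels(line: str, img_w: int, img_h: int):
    cls, xc, yc, w, h = line.split()
    xc, yc = float(xc) * img_w, float(yc) * img_h
    w, h = float(w) * img_w, float(h) * img_h
    x1, y1 = xc - w / 2, yc - h / 2
    return int(cls), (round(x1), round(y1), round(x1 + w), round(y1 + h))

# Made-up example: one tank centered at (42%, 55%) of a 1280x720 frame.
print(yolo_to_pixels("0 0.42 0.55 0.10 0.08", img_w=1280, img_h=720))
# -> (0, (474, 367, 602, 425))
```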

Figure 2: Example images from CGI dataset (Unreal Engine environment).

For the creation of our AI dataset, we needed to produce prompts in bulk to feed into the Stable Diffusion model. To do this, we utilized ChatGPT (ChatGPT, OpenAI), prompting it to generate image descriptions, 100 at a time, that would produce varying drone-style images of tanks; roughly 1,000 prompts were used in total. The ChatGPT prompt used to create these descriptions is shown in Figure 3.

“Produce 50 prompts I can feed into a text-to-image AI model (stable diffusion) that will create images of military tanks from various aerial perspectives. Ensure different environments, weather conditions, and angles are produced.”

Figure 3: ChatGPT prompt used to create image descriptions.

Aerial view of a military tank in a desert landscape at sunrise.

Aerial view of a military tank in a dense forest during autumn.

Aerial view of a military tank crossing a river in a jungle environment.

Figure 4: ChatGPT output examples fed into stable diffusion architecture.

With proper formatting, the image descriptions were fed directly into the Stable Diffusion architecture to form two images per text description. An example of these image descriptions is shown in Figure 4. The Stable Diffusion architecture uses a classifier-free guidance (CFG) scale to dictate how closely the model should be held to the provided prompt (Automatic1111 2022). An image created with a lower CFG value may include additional items not mentioned in the prompt, whereas a higher value will only contain items explicitly mentioned in the prompt. When creating images, we utilized a CFG scale of 8, as that value has been cited as producing the best results in synthetic data creation (Tian et al. 2023). After each batch of image creation (200 images from 100 prompts), we manually went through the images and selected photos without obvious errors. Many photos were discarded due to unrealistic representations of tanks or the absence of any tank. Three examples of AI images are shown in Figure 5. When the desired number of photos was reached for each session, we uploaded them to Roboflow for labeling and storage, just as we did with the CGI images.
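We generated our images through the Automatic1111 web UI. As a minimal sketch of the same settings (CFG scale of 8, two images per description) using the Hugging Face diffusers library instead, the snippet below hard-codes two of the Figure 4 descriptions; the model checkpoint name and output folder are illustrative assumptions, not the exact configuration we used.

```python
# Hedged sketch of batch text-to-image generation with a CFG scale of 8 and
# two images per description, using the diffusers library. The checkpoint
# name and output folder are examples; we used the Automatic1111 web UI.
import os
import torch
from diffusers import StableDiffusionPipeline

os.makedirs("ai_dataset", exist_ok=True)

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")  # assumes a CUDA-capable GPU is available

prompts = [
    "Aerial view of a military tank in a desert landscape at sunrise.",
    "Aerial view of a military tank in a dense forest during autumn.",
]

for i, prompt in enumerate(prompts):
    # guidance_scale is the CFG scale; 8 follows the setting cited in the text.
    images = pipe(prompt, guidance_scale=8, num_images_per_prompt=2).images
    for j, img in enumerate(images):
        img.save(f"ai_dataset/prompt{i:03d}_{j}.png")
```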

Comparing Figures 1, 2, and 5 reveals much about the differing nature of real, UE, and AI images. We expect only realistic things to be present in the real images. However, the frequent poor image quality, weather conditions, and smoke present can produce unique pictures that are not well represented in the synthetic forms. Notice in the left picture of Figure 1 how the leftmost tank is partially obscured by the radiating smoke. This effect is often challenging to recreate in UE or to prompt for with AI. In the images of Figure 2, there is little variability in the visibility of tanks. Fog, rain, and smoke can be added to UE scenes; however, the computing power required to maintain high image quality and the labor involved in implementing these effects made them impractical given the resources we worked with. In the rightmost image of Figure 5, many of the tanks are shown with no main gun, something we know is unrealistic given the structure of the tanks present. While selecting images for our AI dataset, we chose to allow some minor errors such as these, since it is very challenging to consistently create images free of all errors.

Figure 5: Example images from AI dataset (stable diffusion).

2. Experimental Design

Now, with each set created and labeled, we compare the modes of synthetic data. We contrast four training techniques, as shown in Figure 6: training on each set individually (T1), training on combined datasets that hold the real images together with one mode of synthetic images, keeping the two synthetic modes separate (T2), training on real before synthetic (T3), and training on synthetic before real (T4). The corresponding letters in each branch of the flowchart describe each separate model trained within a technique. For example, T1 trains three models (a, b, c), one for each dataset type (real, UE, AI). We standardize the amount of training by keeping the number of epochs constant. An epoch constitutes one full pass of the training data through the model, allowing it to see every image in the training set and adjust weights accordingly. As epochs progress, the model optimizes its weights toward the best solution for the training data. All rounds of model training used 300 epochs and took the best weights (rather than the weights on the 300th epoch) for comparison. We chose 300 epochs since it serves as a common benchmark for model comparison and allows for model convergence without excessive training time ("PyTorch" n.d.). In T1, we trained a separate model for each dataset for 300 epochs on that dataset alone. In T2, we mix each synthetic dataset with the real dataset and train for 300 epochs on each merged dataset (one for each mode of synthetic data). In T3 and T4, we take a sequential training approach. T3 consists of training a model on the real dataset for 300 epochs and then on one of the synthetic datasets for another 300 epochs. T4 reverses this order, training on a synthetic dataset before the real dataset. Keeping each training stage at 300 epochs ensures the models see every image the same number of times across all training techniques. For example, 300 passes of a combined dataset in T2 give the model the same number of looks at the images as 300 epochs of training on the real dataset followed by another 300 epochs of training on a synthetic set.
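As a sketch of how these techniques map onto the standard ultralytics/yolov5 training entry point, the snippet below chains 300-epoch runs; the dataset YAML files, run names, and image size are placeholders rather than our exact configuration. T2 simply points the data file at a merged image set, while T3 and T4 carry the best weights from the first stage into the second.

```python
# Hedged sketch of the four training techniques using the ultralytics/yolov5
# repo's train.py. Dataset YAMLs and run names are placeholders; real.yaml,
# cgi.yaml, and real_plus_cgi.yaml are assumed to list train/val/test paths.
# The AI (Stable Diffusion) datasets would follow the same pattern.
import subprocess

def train(data_yaml: str, weights: str, name: str, epochs: int = 300):
    subprocess.run(
        ["python", "train.py", "--img", "640", "--epochs", str(epochs),
         "--data", data_yaml, "--weights", weights, "--name", name],
        check=True,
    )

# T1: each dataset alone, starting from the pretrained yolov5s checkpoint.
train("real.yaml", "yolov5s.pt", "t1_real")
train("cgi.yaml", "yolov5s.pt", "t1_cgi")

# T2: real and CGI images merged into a single training set.
train("real_plus_cgi.yaml", "yolov5s.pt", "t2_mixed_cgi")

# T3: real first, then fine-tune on synthetic (weights carried forward).
train("real.yaml", "yolov5s.pt", "t3_stage1")
train("cgi.yaml", "runs/train/t3_stage1/weights/best.pt", "t3_stage2")

# T4: synthetic first, then fine-tune on real.
train("cgi.yaml", "yolov5s.pt", "t4_stage1")
train("real.yaml", "runs/train/t4_stage1/weights/best.pt", "t4_stage2")
```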

To quantify model performance on the validation and test sets, we used the mean average precision (mAP). Precision is defined as the ratio of true positive detections to the total number of positive detections (both true and false positives), measuring the accuracy of the model's positive predictions. The mean average precision averages this precision over all photos within a validation or test set. In our calculation of mAP, a correct prediction of a military vehicle is defined by a predicted bounding box that has an intersection over union (IoU) of at least 0.5 with a ground-truth box. That is, the intersecting area of the ground-truth bounding box (label) and the predicted box, divided by the union of the two shapes (without double counting the overlapping area), must be at least 0.5. Essentially, this means we caught at least half of the object without using a box too large to do so. The mAP is the metric most commonly used to gauge computer vision model performance, since it classifies correct predictions and can be applied to tasks that involve distinguishing between multiple classes; in our study, however, we use only one class: military vehicles. We expected T1 and T3 to underperform compared to the other techniques, since in T3 the model readjusts its weights to maximize performance on the synthetic images rather than the real images it was trained on initially, and in T1 the synthetic-only models never see real images at all. We also anticipated an increase in performance for T2 and T4 when compared to training on the real images only, since they should allow the models to gain transferable information from the synthetic data without overriding the benefits of the real data.
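A minimal sketch of the IoU test behind this definition, with made-up box coordinates: a predicted box counts as a true positive only if its IoU with a ground-truth box is at least 0.5.

```python
# Minimal sketch of the IoU >= 0.5 criterion used to count a detection as a
# true positive. Boxes are (x1, y1, x2, y2) in pixels; the values are made up.
def iou(box_a, box_b) -> float:
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

ground_truth = (100, 100, 200, 180)   # labeled tank
prediction   = (110, 105, 210, 190)   # model output
print(iou(ground_truth, prediction) >= 0.5)  # True: counts as a true positive
```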

As shown in Table 1, the CGI dataset is roughly 2.5 times as large as the real training set, and the AI dataset is roughly 1.6 times as large. We allow the CGI dataset to be larger than the AI dataset because the CGI images are inherently less diverse due to their collection from videos. We expect images taken every few frames from a video to be relatively similar, whereas very few AI images look alike. The sizes of our datasets aim to provide enough images for training while keeping the datasets small enough to show an observable margin between training methods. With more real images provided, we hypothesize that more synthetic images would need to be created to produce the same benefit. The two modes of synthetic data were never mixed in the same set, as we are concerned not only with whether synthetic data can improve model performance but also with which mode of synthetic data is more effective. For every training technique, the validation and test sets were composed of only real images and contained the same images for proper comparison. The experimental design is described in the flowchart shown in Figure 6. The CGI and AI datasets lack validation and test components since we only intend to use those images for training. If we were to validate or test models on synthetic images, the output would describe their performance on synthetic images; instead, we measure performance on real images only, since these models are designed to be used on real-world data.

Dataset                  Train    Validation    Test
Real                       508           193     183
CGI (Unreal Engine)       1254             —       —
AI (Stable Diffusion)      810             —       —

Table 1: Image counts per dataset (the synthetic datasets are used only for training, so their validation and test counts are marked with a dash).

Figure 6: Experimental design flowchart depicting the four training techniques. Technique 1 (leftmost) trains models on each set individually without mixing. Technique 2 (middle left) trains models on a mixture of the real set and a synthetic set. Technique 3 (middle right) trains models on the real set first and then on a synthetic set. Technique 4 (rightmost) trains models on a synthetic set first and then on the real set.

We used the YOLO version 5 (YOLOv5) small architecture to assess the performance of our four training techniques (Jocher et al. 2022). Though higher-performing models exist in the YOLO community, such as the recently released YOLOv9, we believe that the effects of synthetic data will be similar regardless of the specific model architecture used. YOLOv5 includes five model sizes: nano, small, medium, large, and extra-large. We chose the small architecture due to the relatively small size of our datasets and the single class within our domain. Smaller architectures can generalize better from smaller training sets and less complex domains. Smaller models are also less computationally expensive, allowing quicker training and testing as well as smaller file sizes when saving the best weights of each run.
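For completeness, the sketch below shows how a trained YOLOv5-small checkpoint can be loaded for inference through PyTorch Hub, following the hub page cited in our references; the weights path and test image are placeholders.

```python
# Illustrative inference with a trained YOLOv5-small checkpoint via PyTorch Hub.
# The weights path and image file are placeholders.
import torch

model = torch.hub.load(
    "ultralytics/yolov5", "custom",
    path="runs/train/t2_mixed_cgi/weights/best.pt",
)
model.conf = 0.25  # confidence threshold for reported detections

results = model("drone_frame.jpg")        # run detection on one image
results.print()                           # summary: counts and inference speed
detections = results.pandas().xyxy[0]     # columns: xmin, ymin, xmax, ymax, confidence, class, name
print(detections.head())
```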

3. Results

The training output shown in T1 of Figure 7 indicates that using only synthetic data to train models did not produce optimal performance on the real images. Furthermore, the T3 graph in Figure 7 shows that training a model on synthetic data after it had already been trained on real images led to no significant improvement in performance. This makes sense: models trained on the real data already have a decent understanding of how tanks manifest in different real environments, and training on synthetic data afterward distorts this understanding and begins to optimize the weights for the synthetic style of tank representations. The two superior techniques ended up being the mixed (T2 in Figure 7) and real-after-synthetic (T4 in Figure 7) approaches.

Figure 7: Validation results throughout training for T1, T2, T3, and T4.

As shown in the first bar pair of Figure 8, the test set mAP for models trained on the real and synthetic mixture sets was 0.700 and 0.719, whereas for the real-after-synthetic approach it was 0.695 and 0.716. Only one mAP is reported for testing since the model performs one round of predictions on all photos and then calculates the mAP without updating weights, as it would while validating during training. The mixed approach (T2) leads the model to treat each synthetic and real photo it sees during training with equal importance when updating weights. Since we intend to prioritize performance on real photos, we want the model to place more emphasis on each real photo than on each synthetic photo. The synthetic-to-real approach (T4), shown in the second bar pair of Figure 8, attempts to work around this by allowing the model to gain an initial understanding of the various tank representations in the synthetic environment and then alter that understanding and optimize its weights on the real photos. However, though T4 did perform well, its results were slightly below those of T2 for both synthetic datasets. We do not expect, given these environments and resources, to be able to completely replace real images with synthetic ones. A notable detail in our results is that the AI images showed a greater increase in performance than the CGI images on the validation set in most training methods, whereas the CGI dataset showed superiority on the test set after training. A possible explanation for this result is that the test and validation sets have inherent differences due to chance. Given their sizes, 183 and 193 images, it is possible that they include a slightly different variety of images, though we would not expect large differences given their random selection and the use of image augmentation.

Figure 8: Test set mAP by technique.

Discussion

The results from this experiment provide proof of synthetic data's potential to benefit computer vision models applied to the aerial view of military vehicles on the battlefield. Simple synthetic environments can be created with little upfront cost and can significantly speed up the process of producing a model ready to be deployed in the real world. Our experiment saw roughly a 10% improvement for the best techniques using synthetic data, and there is likely more room for improvement given more and higher-quality synthetic data. The advantage provided by training models either on a mix of all synthetic and real images (T2) or on real images after synthetic (T4) can also inform researchers as they begin to implement synthetic data for similar applications, so that they do not have to run a similar experiment of their own before doing so.

Future Research

As more and more synthetic environments are created, a study that details the best environments for this application could help guide future research on synthetic data for an aerial view of the battlefield. Deciding which environment or mode of synthetic data is best could also depend on the specific task and the potential for automation. In this experiment, we did not pursue an automated framework for synthetic data creation, labeling, and model training. However, developing such a framework would be ideal for future use of synthetic data, leveraging its greatest advantage over real data. A project that automates each step of the pipeline to train computer vision models could see great success compared to a manual approach. Within this pipeline, if model weaknesses can be categorized automatically and fed into a synthetic data generator that targets those weaknesses, performance could improve much more rapidly.

For example, automatically knowing at what angle, weather condition, or distance a model is underperforming can allow for more deliberate synthetic data generation. In a CGI environment such as a video game engine, class labels and bounding boxes can be automatically generated when creating new photos. AI images, however, do not offer this possibility. For an object detection task such as target acquisition, using a gaming engine could be the most effective approach simply because it might be most easily automated. AI images may be better suited to tasks requiring image classification rather than object detection, as labels can easily be produced from the prompt that created each image. For example, when producing an image containing tanks, we know that the label for that image should be "tank," though we do not know where in the image the text-to-image model will place the tank. Images can then be automatically created with attached labels and fed into an image classifier continuously to improve performance on the given task. Another area for further research is the transformation of pre-trained classification models into object detection models. This would allow the use of AI and an automated pipeline to train a classifier and then transform the classifier to provide bounding boxes, and therefore become viable for target acquisition technologies.

Conclusion

While these synthetic environments do not at present show an ability to replace real photos, the results from this experiment serve as proof of concept that they can be used to supplement real photos and produce more effective models. As mentioned earlier, it is time-consuming and costly to produce large amounts of new real data for this specific military application of computer vision; synthetic data shows a promising route to curtail this issue. Instead of allocating funds toward the creation of new real photos, those developing these target acquisition technologies can instead work on producing a highly effective synthetic environment to ensure models are trained on all possible representations of the given vehicles. The aerial sector of the battlefield has only become more important in recent conflicts. As a way of getting ahead during these changing times and best equipping our aerial technologies, we must do everything we can to properly train the computer vision models that will eventually be deployed. Concrete proof that synthetic data has a role to play in this aerial spectrum, and a suggestion on how to utilize it, serve as a first step as military researchers dive deeper into the topic and prepare us for what comes next.

References

Automatic1111. 2022. Stable Diffusion Web UI. https://github.com/AUTOMATIC1111/stable-diffusion-webui.

Borkman, Steve, Adam Crespi, Saurav Dhakad, Sujoy Ganguly, Jonathan Hogins, You-Cyuan Jhang, Mohsen Kamalzadeh, et al. 2021. “Unity Perception: Generate Synthetic Data for Computer Vision.” ArXiv.org. July 19, 2021. https://doi.org/10.48550/arXiv.2107.04259.

Can, Oguzhan, Ozlem Er, and Yusuf Kunt. 2021. “Use of Synthetic Data on Object Detection Applications.” June. https://doi.org/10.1109/siu53274.2021.9477977.

ChatGPT, OpenAI. n.d. https://chat.openai.com/chat.

“CVAT.” n.d. https://www.cvat.ai/.

Department of Defense. 2019. “Summary of the 2018 Department of Defense Artificial Intelligence Strategy: Harnessing AI to Advance Our Security and Prosperity.” February. https://media.defense.gov/2019/Feb/12/2002088963/-1/-1/1/SUMMARY-OF-DOD-AI-STRATEGY.PDF.

Epic Games. 2019. Unreal Engine (version 4.22.1). https://www.unrealengine.com.

Jocher, Glenn, Ayush Chaurasia, Alex Stoken, Jirka Borovec, NanoCode012, Yonghye Kwon, Kalen Michael, et al. 2022. “Ultralytics/Yolov5: V7.0 – YOLOv5 SOTA Realtime Instance Segmentation.” Zenodo. November 22, 2022. https://zenodo.org/badge/latestdoi/264818686.

Lee, Hyeongkeun, Kyungmin Lee, Hunmin Yang, and Se-Yoon Oh. 2020. “Applying FastPhotoStyle to Synthetic Data for Military Vehicle Detection.” October. https://doi.org/10.23919/iccas50221.2020.9268331.

“PyTorch.” n.d. https://pytorch.org/hub/ultralytics_yolov5/.

“Roboflow.” n.d. https://roboflow.com/.

Ros, German, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M. López. 2016. “The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes.” Computer Vision and Pattern Recognition, June. https://doi.org/10.1109/cvpr.2016.352.

Tian, Yonglong, Lijie Fan, Phillip Isola, Huiwen Chang, and Dilip Krishnan. 2023. “StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners.” arXiv. http://arxiv.org/abs/2306.00984.

US Army Training and Doctrine Command. 1999. “Concept of Employment for Unmanned Systems (Draft).” August 24. See also US Army Training and Doctrine Command, “Unmanned System Future Capability (Working Draft).” August 19.

Biographies

Maxwell Felter is a cadet in the class of 2026 at the United States Military Academy pursuing an Honors BS in Applied Statistics and Data Science. Since beginning his undergraduate research as a sophomore, he has found a passion for machine learning and defense innovation. He plans to further his research during his undergraduate years and aspires to attend graduate school during his Army career.

Major Javier J. Sustaita is an instructor in the Department of Mathematical Sciences at the United States Military Academy at West Point. He has served as an Army Aviation officer with over fourteen years of active-duty experience. Major Sustaita received his master’s in applied mathematics from the Naval Postgraduate School in Monterey, California.

Dr. James K. Starling is an Associate Professor in the Department of Mathematical Sciences at the United States Military Academy at West Point. He serves as an Operations Research/Systems Analysis (ORSA) Army officer with over twenty-four years of active-duty experience. He received his PhD from the University of Washington in Industrial & Systems Engineering.
