MARCH 2025 | Volume 46, Issue 1
IN THIS JOURNAL:
- Issue at a Glance
- Chairman’s Message
Conversations with Experts
- Testing Without Being a Tester: A Conversation with Dr. Bill D'Amico
Technical Articles
- INNOVATION Independent Automated Verification and Validation Testbed for Test and Evaluation
- Defect Characterisation in ICT Scanned Energetic Materials Using Machine Learning
- SCOUT - Pushing High Performance Computing to the Data
- Real-Time Inference for Unmanned Ground Vehicles Using Lossy Compression and Deep Learning
- An AI Model Performance Benchmarking Harness for Reproducible Performance Evaluation
News
- Association News
- Chapter News
- Corporate Member News
SCOUT – Pushing High Performance Computing to the Data
J. Michael Barton, Ph.D.
Parsons Fellow
Aberdeen Proving Ground
Thomas Kendall
DEVCOM Army Research Laboratory
Aberdeen Proving Ground
Jamie Stack
DEVCOM Army Research Laboratory
Aberdeen Proving Ground
Abstract
SCOUT (SuperComputing OUTpost) is a first-of-a-kind mobile, deployable, turnkey high performance computing (HPC) system. It contains the same processors as one of the world’s fastest computers, Summit, and is available to all DoD services and agencies. It is housed in a 53’ trailer with compute, memory, storage, network, cooling, backup power, power conditioning, and fire suppression. All it needs for operation is external power, networking, and a ½” water line for humidification. Computing requirements in remote locations and at the tactical edge continue to grow, especially for artificial intelligence, machine learning, and autonomous systems. The DoD High Performance Computing Modernization Program provides this supercomputing capability in support of the DoD science & technology, test & evaluation, and acquisition engineering communities. SCOUT details are presented, three use cases are described, and lessons learned are enumerated for consideration in future deployable systems.
Keywords: Edge computing, high performance computing, deployable high performance computer, machine learning, image recognition
Introduction
An increasingly instrumented world is producing massive amounts of data of which only a small fraction is necessary for knowledge discovery. Managing and gleaning value from the data has become a challenging problem demanding new artificial intelligence (AI)-enabled adaptive computing solutions for reducing complexity and accelerating inference at the point of need. The environment is highly dynamic and algorithms must manage the dynamism on possibly resource-constrained edge devices. The flexibility of machine learning algorithms presents an opportunity to continuously adapt task execution and implement results throughout a network to meet changing requirements and available resources.
The study by Zhang et al.1 determined that the internet is growing exponentially, doubling in size every 5.32 years. Evans2 analyzed internet growth and concluded that somewhere between 2008 and 2009 there were more ‘things’ connected to the internet than there were people in the world, establishing the birth of the Internet of Things (IoT). The term IoT was originally coined in 1999 for radio frequency identification tags, according to Ashton3. IoT means devices with sensing and processing ability that exchange data with other devices and systems over a network.
The current explosion of the internet is enabled by a nexus of developments: wireless connectivity, readily available bandwidth, IPv6, low-cost sensors, and powerful, small form factor processors. The introduction of smartphones, tablet computers, and wearables produced explosive growth in mobile computing, and social networking became the killer app that drove the marriage between mobile devices and cloud computing. The most prolific contributor to this growth today is the IoT. According to CISCO4, there are about 29 billion networked devices. The fastest growing mobile device category is machine-to-machine, followed by smartphones, and growth is exponential. Within machine-to-machine, connected cars are the fastest growing type. Mobile computing is not restricted to commercial applications; the same technology is embodied in myriad military devices.
In military applications, fixed infrastructure is rarely available. Edge and core devices are resource constrained in network bandwidth, battery life, storage capacity, memory capacity, and processor speed. Compounding the issue are the time-critical nature of nearly every application, constantly changing proximity, and network delays and disruptions. Size, weight, and power (SWaP) constraints are augmented with a time constraint, SWaP-T, in contested and adversarial environments. Besides time to decision being a critical driver, the “time constant” for machine learning models is measured in days and weeks, not months and years. We must be able to adapt on the battlefield, or at least at the edge; we cannot afford to reach back to garrison for new versions, updates, and fixes. These increased challenges for tactical applications have led to the concept of the Internet of Battlefield Things (IoBT5), the IoT with military ‘things’ and military constraints.
Computing at the edge is another example of pushing computation to the data rather than moving data to a central location for processing. It has also come under the names of data locality6 and processing forward7. Early manifestations were to use portable compute capability, for example, in a truck, in remote locations for training, test, experimentation, and demonstration events. In the late 1990s, a feasibility study was conducted by the Army to deploy a Thinking Machines CM-5 for data acquisition, reduction, and visualization, and post-test data analysis8. A few more generations of Moore’s law were needed to shrink the capability from two tractor-trailers to a more practical size, a prototype of which is the SCOUT system at the DEVCOM Army Research Laboratory (ARL) (https://centers.hpc.mil/systems/unclassified.html#SCOUT).
Meet SCOUT
In 2019, the High Performance Computing Modernization Program (HPCMP) directed the ARL DoD Supercomputing Resource Center (DSRC) to develop specifications for a mobile HPC capability. The purpose was to demonstrate HPC-at-the-edge capabilities to support large-scale artificial intelligence and machine learning tasks, including training and inference. It was designed to be deployable to the tactical edge, DoD test ranges, and other remote locations. The initial specifications were then subject to the constraints of being self-contained aside from external power, fitting within the volume of an International Organization for Standardization (ISO)-compliant 53-foot-long intermodal shipping container, supplying its own cooling for everything contained within, and, of course, cost. The SCOUT system is the result of the trade studies and compromises.
Requirements presented to the ARL team were in the form of designing a system to meet a use case involving an inference-dominated workflow to support processing video streams from hundreds to thousands of drones. A secondary requirement was to support retraining of neural networks to counter potential target identification mitigations an adversary might deploy to reduce the success of a surveillance drone swarm. A market research study identified that commercial offerings of ISO intermodal container-based data centers were limited to a 125 kW power envelope and were typically air cooled, with condenser units installed on the exterior walls or roof. Our requirement prohibited these appendages, as the unit would no longer be transportable without significant disassembly and re-assembly. Shipping the system intact also led to the requirement for only solid state storage devices, as magnetic spinning media would be prone to erasure or other damage if exposed to a strong magnetic field from nearby shipping containers or to vibrations during shipping and handling.
In developing the final acquisition requirements for SCOUT, the ARL team accepted a moderate risk by requiring an ISO-compliant solution coupled with a combined training and inference requirement that far exceeded the commercially available 125 kW. The SCOUT requirement was approximately twice this value and led industry to develop a water-cooled solution within the ISO constraint.
Architecture
SCOUT is an IBM Power9 architecture, the first in the HPCMP. It consists of 22 machine learning training nodes (6 NVidia Volta V100 GPUs per node), 128 inference nodes (4 NVidia T4 GPUs per node), 2 visualization nodes, 4 login nodes, 5 scheduling/container orchestration nodes, and a parallel storage system that provides three independent file systems with 1.3 petabytes (PB) of shared solid state capacity. All components are connected via 100 gigabit per second InfiniBand and dual 10 gigabit per second Ethernet networks. The processors are the same as those in Summit, the supercomputer at Oak Ridge National Laboratory that debuted in June 2018 at #1 on the TOP500 list (www.top500.org). In June 2024, Summit was in 9th place; the #1 position was held by Frontier, also at Oak Ridge National Laboratory and the first supercomputer in the world to exhibit exascale performance as measured by the High Performance LINPACK benchmark (https://icl.utk.edu/hpl/).
Receiving and Installing SCOUT
Due to its containerized packaging, SCOUT arrived nearly ready to be powered on for integration into the ARL DSRC network. The system was delivered on a lowboy trailer, and a single crane set it in place on the pad, Figure 1.

Figure 1: SCOUT Installation
The prepositioned 480V, 3-phase power cables were routed and terminated into SCOUT’s power panel, the humidification water line was connected, and ARL DSRC’s network switches and fiber optic networking cables were installed in dedicated rack space to connect the system to the DSRC. The lithium-ion batteries for the uninterruptible power supply (UPS) were shipped in specially designed wooden crates stored within the container. These batteries were installed in the UPS battery cabinet and UPS startup by the manufacturer’s representative was completed. The closed chilled water loop was filled using vessels that hold the approximately 100 gallons necessary for operation, which can also be shipped within the container. Following these steps, the system was powered on for integration and acceptance testing.
SCOUT’s Home and Relocation Requirements
SCOUT resides on a 55’ x 10.5’ concrete pad with no more than 1” elevation difference from corner to corner. It is provided with 480V, 3-phase external power, network, and a ½” water line for humidification, Figures 2 and 3.

Figure 2: SCOUT at home

Figure 3: SCOUT interior
It is designed for outdoor placement and can withstand environmental stresses caused by wind, sun, precipitation, thermal fluctuations, and smoke or particulates. The following are considerations for relocation:
- Availability and location of utilities, such as electricity, fresh water, and telecommunications. Water demands are quite low and can be accommodated via a storage tank if a pressurized source is not available.
- A location that is secure, can be protected from vehicular traffic and other external risks, and is in an access controlled area.
- Placement away from hazardous conditions such as harmful fumes and gas vapors, corrosive or bio-hazardous sources, flammable or explosive objects, or high external contaminate-producing sources.
- Above the high-water mark of the land; avoid areas prone to flooding or where water collects. Recommend placement outside the 500-year flood plain, away from riverbanks, coastal areas, and downstream stretches of waterways.
- Away from residential areas (recommend 150 feet)
- Away from overhead high-voltage lines
- Away from sources of electromagnetic interference such as antennas, microwaves, radar, or cellular towers
SCOUT Use Cases
We ran several use cases on SCOUT to stress-test the system and to understand its capabilities and limitations. Two were training and inference benchmarks using production-level artificial intelligence systems: the Video Processing Exploitation Framework and the Ordnance Threat Target Automated Recognition system9. For another we chose ImageNet10, an industry-standard object recognition dataset. A final use case was face recognition using the Janus code11. Results with ImageNet and Janus are described below.
ImageNet
In preparation for receiving SCOUT, we chose a test case to run on an existing platform containing an NVidia DGX with 4 nodes (SuperMicro) and 8 Volta V100 GPUs per node. We compared results with those using NVidia Pascal P100 GPUs. For comparisons we used ImageNet with ResNet. ImageNet is an ongoing effort to provide researchers around the world with image data for training large-scale object recognition models, https://www.image-net.org/about.php. It is an image dataset organized according to the WordNet hierarchy. Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a “synonym set” or “synset”. There are more than 100,000 synsets in WordNet; 80,000+ of them are nouns. The goal of ImageNet is to provide on average 1000 images to illustrate each synset. Images of each concept are quality controlled and human annotated. We worked with 14 million images (350GB) with labels in 1000 basic categories (dog, cat, table…), see Figure 4.

Figure 4: Examples from the ImageNet Collection
ResNet, short for Residual Network, is a specific type of neural network introduced in 2015 by He et al.12 to address issues of convergence and degrading accuracy as more layers are added to deep convolutional neural networks. We applied the ResNet50 model, meaning a 50-layer network, to the ImageNet dataset.
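The core idea of a residual block is that the skip connection lets each block learn a correction to the identity mapping rather than a full transformation, which is what keeps very deep networks trainable. A minimal sketch in NumPy (illustrative only, not the ResNet50 implementation used in these tests):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, W1, W2):
    """y = relu(x + F(x)): the skip connection lets the block fall
    back to the identity, easing training of very deep stacks."""
    return relu(x + W2 @ relu(W1 @ x))

rng = np.random.default_rng(0)
d = 16
x = relu(rng.normal(size=d))   # a nonnegative activation vector

# With zero weights the block reduces to the identity, so stacking
# additional blocks cannot degrade the representation.
y = residual_block(x, np.zeros((d, d)), np.zeros((d, d)))
assert np.allclose(y, x)
```

The zero-weight check illustrates why added depth need not hurt accuracy: a residual block can always do nothing, whereas a plain stacked layer must relearn the identity.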
Figure 5 compares single node performance for increasing GPU count for the P100 and V100. Not surprisingly, the V100 performance is superior.

Figure 5: ImageNet Training on the DGX, P100 versus V100 Performance
Figure 6 presents V100 performance on the same test case with increasing node and GPU count. For this specific case there does appear to be a point of diminishing returns, where additional nodes and GPUs do not produce increased performance. This behavior will likely change with larger datasets and more network layers. Figure 7 presents results on SCOUT for ImageNet and ResNet50 compared to results using the NVidia DGX.
Figure 6: ImageNet Training on the DGX
The newer hardware yields a payoff in AI/ML workloads, illustrated by the jump in performance between the NVidia DGX (SuperMicro) nodes and SCOUT nodes. This improvement alone saves time and could justify acquisition of the new hardware. There is a point of diminishing returns, however: beyond it, adding nodes does not increase performance. This is typical of classical parallel computing as well, for example with message passing, where there is a limit on how much of a given code can be parallelized (Amdahl’s Law13), or the amount of computation per node becomes too small to derive further benefit.
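Amdahl’s Law makes the diminishing-returns behavior concrete: if only a fraction p of a workload parallelizes, speedup is bounded by 1/(1-p) no matter how many nodes are added. A quick illustration (the fraction below is hypothetical, not a measured SCOUT value):

```python
def amdahl_speedup(parallel_fraction: float, n_workers: int) -> float:
    """Ideal speedup when only `parallel_fraction` of the runtime
    can be spread across `n_workers` (Amdahl's Law)."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_workers)

# Even a 95%-parallel workload is capped at a 20x speedup:
for n in (4, 16, 64, 256):
    print(n, round(amdahl_speedup(0.95, n), 1))
```

Doubling the node count near the cap buys almost nothing, which is the flattening visible in the scaling figures.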

Figure 7: ImageNet Training on SCOUT
The Janus Project
The Intelligence Advanced Research Projects Activity (IARPA) launched project Janus in 2014 to address perceived inadequacies in face recognition algorithms. Historically, automated face recognition software works best on well-lit frontal poses, such as passport and driver’s license photos. It is less accurate when lighting is poor, resolution is low, faces are obstructed, camera angles vary, and facial expression is uncontrolled. Project Janus improved the performance of face recognition tools by fusing temporal and contextual information from social media14. Our primary objective was to answer three questions: (1) How much speedup do we realize in training the models on high performance computers? (2) Does a high-performance computer allow us to train using a larger set of identities and images? (3) Does using the larger dataset improve model accuracy?
| System | Type | Nodes | Cores | GPUs | Memory | Storage |
| --- | --- | --- | --- | --- | --- | --- |
| SCOUT | IBM Power9 | 152 (22 training, 128 inference) | 6,080 | 132 NVidia V100s; 512 NVidia T4s | 45 TB | 330 TB SSD |
| SOFIA | IBM Power9 | 17 (11 training, 2 inference) | 520 | 66 NVidia V100s; 4 NVidia T4s | 1 PB | 1.2 PB |
Table 1: SCOUT and SOFIA Hardware
We utilized SCOUT and a smaller computer called SOFIA. SCOUT and SOFIA are GPU-dense, with NVidia V100 training nodes and T4 inference nodes. The ARL center-wide file system provided 3.3 PB of bulk storage accessible from all systems. Table 1 contains system details.
The updated version of Janus we received is referred to as 3C-Prime. A significant task was to work through the builds for the target HPC platforms to assemble, port, and package the software. Our approach followed six steps:
- Install the Janus software to perform baseline testing and establish initial workflows
- Upload and process the DoD dataset
- Evaluate demographic statistics within the dataset and develop methods to address statistical skews in demographic representation for model training
- Update Janus software to improve function and performance
- Package the software using a Singularity container for simplifying deployment on other HPC systems
- Retrain Janus models using the full DoD dataset
We validated the algorithms and used a subset of the full DoD dataset to test the inference tools and algorithms using a test dataset of 300,000 identities and 2 million images. Figure 8 depicts the Janus architecture as developed at the University of Maryland.

Figure 8: University of Maryland – Janus Architecture
Previous experience with the original Janus 3C demonstrated training runtimes on the order of three weeks using 300,000 identities. Training with 300,000 identities was a limitation of GPU memory size at the time. The training component of the software was limited to a single node using multiple GPUs. To support a larger number of identities, a multi-node multi-GPU solution was developed. We incorporated a technique called Partial-FC (partial fully connected layers of the deep neural network)15,16. The training software and workflow were modified to support Partial-FC.
The use of Partial-FC allowed scaling over much larger datasets by distributing the final layer of the neural network across multiple GPUs on multiple nodes. This removed the limitation on training dataset size, although some practical upper bound undoubtedly exists. In general, results showed that the best performers were models trained with the largest datasets. When verified against a subset of DoD data that was held back and not used for training, performance of the best new models was significantly better than the original.
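The essence of distributing the final layer can be sketched in a few lines: shard the classification weights across workers, compute local logits, and recover the global softmax with a max/sum reduction (an all-reduce across nodes in the real multi-GPU case). This single-process NumPy sketch is illustrative only; it omits Partial-FC’s sampling of negative classes, and all sizes and names are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, emb_dim, n_workers = 12, 8, 4

# Full classification layer: one weight row per identity (class).
W = rng.normal(size=(n_classes, emb_dim))
x = rng.normal(size=emb_dim)          # embedding for one face image

# Shard the final layer across workers, as Partial-FC does across GPUs.
shards = np.array_split(W, n_workers, axis=0)

# Each worker computes logits only for its local classes...
local_logits = [shard @ x for shard in shards]

# ...then a reduction recovers the numerically stable global softmax
# without any single worker ever holding the full weight matrix W.
g_max = max(l.max() for l in local_logits)
g_sum = sum(np.exp(l - g_max).sum() for l in local_logits)
probs = np.concatenate([np.exp(l - g_max) / g_sum for l in local_logits])
```

Because each worker stores only 1/n of the final layer, the number of identities can grow with the number of GPUs rather than being capped by a single GPU’s memory.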
One advantage of the DoD dataset is that it contains more identities than the original 3C-Prime was trained against. To see how the number of identities in the training set affects performance, models were trained on datasets of 300K, 1 million, 1.3 million, and 2.3 million identities. Figure 9 shows the 1-to-N search comparison of the top 2 performers from each dataset size along with the original 3C-Prime. Performance improves when increasing from 300K to 1 million identities. Much less of a performance increase is realized from 1 million to 1.3 million or 2.3 million, indicating that growing beyond 2.3 million identities may not be worth the additional compute time for the incremental gains in accuracy.

Figure 9: True Positive Rate of 3C-Prime and Best Performing 10 Trained Models Against IJB-C
The project successfully addressed the three major areas where questions existed. We demonstrated the ability to train models faster and on much larger datasets than previously reported. The original Janus 3C work reported training on 300K identities in approximately three weeks; we trained on the same dataset in 2 days. We were able to train on as many as 2.3 million identities, a limit imposed by the data available for training, not by the software. We demonstrated that training against a much larger dataset did, in fact, produce better models, with some caveats for the context and use cases of the trained models.
Discussion
The use case presented to the ARL team was inference heavy. This, coupled with the power constraint estimated at twice the then commercially available 125 kW, led to an architecture with fewer training-capable GPGPUs than the actual research and development workload ultimately demanded. Exceeding the 125 kW capability of commercial ISO containerized data centers resulted in a system heavier than standard trailers for intermodal containers could safely transport. After delivery, IBM notified ARL that the design of the inference nodes precluded transport of the system with the nodes installed in the racks. The nodes must be removed, packaged, and transported separately, with the process reversed prior to use at the remote site. (IBM and the Government ultimately agreed to a consideration package due to this situation.) Consequently, SCOUT was never relocated and remained on the original pad at ARL. The container constituted approximately 20% of the value of SCOUT; a fixed data center asset could be more training-resource heavy for the same cost.
The Janus project illuminated some issues and needs that appear characteristic of machine learning applications, not specific to SCOUT. As the project progressed, many new files were created by each user. To give all project members access to the data, chmod and chgrp needed to be run recursively. In most situations this is a non-issue; however, when working with tens to hundreds of millions of files, consideration must be given to having the HPC environment write the files with the needed permissions and groups a priori. The same applies when moving, removing, and renaming files: the sheer number of files involved can bring the file system to a near halt. The large number of files and the volume of data also required planning the calculations: staging data, selecting naming conventions, agreeing on file and folder structure, etc. SCOUT was especially useful in the context of data staging, since its compute nodes had direct access to the shared center-wide file system; this was not the case on SOFIA, where moving such volumes of data required careful planning.
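One common way to write files with the needed permissions a priori is to combine a group-friendly umask with the setgid bit on project directories, so neither a recursive chmod nor a recursive chgrp is ever required. A small illustrative sketch (not the project’s actual tooling):

```python
import os
import stat
import tempfile

def create_shared(path: str, data: bytes) -> int:
    """Create a file that is group-writable at creation time,
    avoiding a recursive chmod over millions of files later."""
    old = os.umask(0o002)              # clear group-write from the mask
    try:
        with open(path, "wb") as f:
            f.write(data)
    finally:
        os.umask(old)                  # restore the caller's umask
    return stat.S_IMODE(os.stat(path).st_mode)

# On a directory with the setgid bit set (chmod g+s), new files also
# inherit the directory's group, so no recursive chgrp is needed.
with tempfile.TemporaryDirectory() as d:
    mode = create_shared(os.path.join(d, "features.bin"), b"\x00")
    print(oct(mode))                   # group-readable and group-writable
```

Setting these defaults once per project directory is cheap; walking hundreds of millions of existing files after the fact is what brings a parallel file system to a near halt.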
Conclusion
SCOUT was slow to be adopted by users due to the steep learning curve of moving existing codes and workflows to the Power9 environment, and SCOUT’s mobility limitations meant it was never used at remote sites. Yet SCOUT provided us with knowledge of mobile systems that we could not have obtained any other way and that will be used in the future. Since SCOUT’s deployment at ARL, AI accelerators have continued to grow in both raw performance and performance per watt; however, the power draw of each accelerator has increased from approximately 300 watts to 1,000 watts. Further, nodes now package multiple accelerators in a memory-coherent configuration, so nodes require several kilowatts each and water cooling. A future deployable HPC for AI workloads will need to retain the water-cooled approach that SCOUT pioneered. With compute density increasing, it may be feasible to develop a capability in a 40’ intermodal container or to separate the power and cooling infrastructure into dedicated containers that are each under the standard shipping weight threshold.
Author Affiliations
1DEVCOM Army Research Laboratory, DOD Supercomputing Resource Center and Parsons Corporation, Aberdeen Proving Ground, MD
2DEVCOM Army Research Laboratory, Army Research Directorate, Aberdeen Proving Ground, MD
3DEVCOM Army Research Laboratory, DOD Supercomputing Resource Center, Aberdeen Proving Ground, MD
References
1. Zhang, G-Q., Zhang, G-Q., Yang, Q-F., Cheng, S-Q. and Zhou, T. 2009. Evolution of the Internet and its Cores. New Journal of Physics 10: 1-11.
2. Evans, D. 2011. The Internet of Things: How the Next Evolution of the Internet Is Changing Everything. Cisco Internet Business Solutions Group White Paper, April. https://www.cisco.com/c/dam/en_us/about/ac79/docs/innov/IoT IBSG 0411FINAL.pdf (accessed August 14, 2017).
3. Ashton, K. 2009. That ‘Internet of Things’ Thing. RFID Journal, May-June.
4. 2020. Cisco Annual Internet Report (2018–2023). CISCO White Paper, 1–35.
5. Wigness, M., Abdelzaher, T., Russell, S. and Swami, A. 2023. Internet of Battlefield Things: Challenges, Opportunities, and Emerging Directions. In IoT for Defense and National Security, IEEE, 5-22, doi: 10.1002/9781119892199.ch1.
6. Bittencourt, L., Immich, R., Sakellariou, R., Fonseca, N., Madeira, E., Curado, M., Villas, L., DaSilva, L., Lee, C. and Rana, O. 2018. The Internet of Things, Fog and Cloud Continuum: Integration and Challenges. Internet of Things 3–4: 134-155, doi.org/10.1016/j.iot.2018.09.005.
7. Perconti, P. and Morris, C. 2016. Future Military Signal Processing and Information Fusion. Keynote Address, Sensor Signal Processing for Defence Conference, Edinburgh, Scotland, September 22, 2016.
8. Barton, J.M. and Namburu, R. 2017. 70 Years of Scientific Computing in the Army. The ITEA Journal of Test and Evaluation 38: 196-207.
9. Brewer, W., Geyer, C., Kleiner, D. and Horne, C. 2021. Streaming Detection and Classification Performance of a POWER9 Edge Supercomputer. In IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, USA, 1-7, doi: 10.1109/HPEC49654.2021.9622852.
10. Russakovsky, O., Deng, J., Su, H. et al. 2015. ImageNet Large Scale Visual Recognition Challenge. Int J Comput Vis 115: 211–252, doi.org/10.1007/s11263-015-0816-y.
11. Adams, J. et al. 2023. Accelerating Image Recognition Using High Performance Computing. The ITEA Journal of Test and Evaluation 44 (4).
12. He, K., Zhang, X., Ren, S. and Sun, J. 2015. Deep Residual Learning for Image Recognition. arXiv:1512.03385v1 [cs.CV].
13. Amdahl, G.M. 1967. Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities. AFIPS ’67 (Spring): Proceedings of the April 18-20, 1967, Spring Joint Computer Conference, 483–485, doi.org/10.1145/1465482.1465560.
14. Chellappa, R. 2020. Final Technical Report, Janus Transition Contract. University of Maryland.
15. An, X. et al. 2021. Partial FC: Training 10 Million Identities on a Single Machine. In 2021 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).
16. An, X. et al. 2022. Killing Two Birds with One Stone: Efficient and Robust Training of Face Recognition CNNs by Partial FC. arXiv:2203.15565v1.
Author Biographies
J. MICHAEL BARTON, Ph.D., Parsons Fellow, has worked on Aberdeen Proving Ground since 2001, ten years supporting the Army Test and Evaluation Command and the last nine with the Army Research Laboratory, working in large-scale data analytics, high-performance computing, and outreach to test and evaluation and other stakeholders. Dr. Barton’s career is in physics-based modeling and simulation, with six years as an aerospace consultant; twelve years as a contractor supporting the Air Force at the Arnold Engineering Development Complex in Tennessee and the National Aeronautics and Space Administration Glenn Research Center in Ohio; and the first four years of his career with The Boeing Company in Seattle. He received Bachelor of Science and Ph.D. degrees in engineering science and mechanics from The University of Tennessee-Knoxville and a Master of Engineering degree in Aeronautics and Astronautics from The University of Washington.
THOMAS KENDALL has worked in the field of high-performance computing for ARL and its predecessor, the U. S. Army Ballistics Research Laboratory, for over 35 years. He currently serves as the High Performance Computing and Infrastructure Advisor to the Director of the Army Research Directorate. Notably, from 2017 through 2022, through a detail to the DoD HPCMP, he led the acquisition of the program’s supercomputers, including the SCOUT system. Mr. Kendall holds a Bachelor of Science (Magna Cum Laude) in Mechanical Engineering and Mechanics from Drexel University and a Master of Science in Theoretical and Applied Mechanics from the University of Illinois at Urbana-Champaign.
JAMIE STACK has worked in the field of software development for over 15 years, starting in the commercial world, where she developed GPGPU code for FDTD applications before coming to ARL 10 years ago. Ms. Stack’s career has focused on the research and development of high-performance codes for a variety of architectures. In addition, she has developed methods to port modeling and simulation tools to high-performance computers to match real-world problems. Ms. Stack currently serves as the Director of Customer Success for the DoD High Performance Computing Modernization Program ARL Center. She holds a Bachelor of Science in Computer Engineering and a Master of Engineering in System Engineering from the Pennsylvania State University.
Dewey Classification: L 681 12

