SEPTEMBER 2023 | Volume 44, Issue 3

Laura Freeman, Editor in Chief


DATAWorks Special Issue


Welcome to a special issue of the ITEA Journal. This issue showcases T&E research from the Defense and Aerospace Test and Analysis Workshop (DATAWorks), which took place this past spring. DATAWorks is the result of a multi-organization collaboration among the Director of Operational Test & Evaluation (DOT&E) within the Office of the Secretary of Defense, the National Aeronautics and Space Administration (NASA), the Institute for Defense Analyses (IDA), and the Section on Statistics in Defense and National Security (SDNS) of the American Statistical Association (ASA). The workshop is strategically designed to strengthen the community by applying rigorous statistical approaches to test design and data analysis in the fields of defense and aerospace.

ITEA is thrilled to showcase 10 peer-reviewed articles in this special issue that capture the research from DATAWorks. Numerous other presentations and additional content are available at: DATAWorks 2024 Site (testscience.org).

A special thank you to this issue’s guest editors: Dr. Victoria Sieck and Dr. Joseph Warfield.  Here is a little more background on our guest editors:

  • Victoria R. C. Sieck is currently a Squadron Director of Operations. Additionally, she is an Adjunct Professor at the Air Force Institute of Technology (AFIT), where her research interests include design of experiments and developing innovative Bayesian approaches to DoD testing. Formerly the Deputy Director of the Scientific Test and Analysis Techniques Center of Excellence (STAT COE), she worked with major acquisition programs within the Department of Defense (DoD) to apply rigor and efficiency to current and emerging test and evaluation methodologies through the application of the STAT process. As an Operations Research Analyst in the US Air Force (USAF), her experiences in the USAF testing community include serving as a weapons and tactics analyst and an operational test analyst. Dr. Sieck has an M.S. in Statistics from Texas A&M University and a Ph.D. in Statistics from the University of New Mexico.
  • Dr. Warfield is a principal applied statistician and group supervisor of the Systems Performance Analysis Group in the Force Projection Sector at the Johns Hopkins University Applied Physics Laboratory (JHU/APL). His work combines analytical capability and domain expertise to test and evaluate the technical performance of strategic weapon systems for the Navy and Air Force throughout their lifecycles. He also serves as a trusted resource in statistics, design of experiments, data science, and machine learning across JHU/APL, participating in projects supporting the Department of Homeland Security (DHS), the Department of Health and Human Services (DHHS), and many other Department of Defense (DoD) programs. His current research focuses on sequential design methods for generalized linear model applications, Bayesian methods for reliability estimation, and statistical methods for anomaly detection in unstructured data sources. He has developed and instructed several JHU/APL strategic education courses in statistics and design of experiments. Dr. Warfield has a B.S. in Mathematics from Loyola University Maryland, an M.S. in Statistics from Virginia Polytechnic Institute, and a Ph.D. in Statistics from the University of Maryland, Baltimore County.

To kick off the issue, we have another outstanding conversation with an expert. In this issue's conversation, J. Michael Barton, Ph.D., talks with Mr. Robert (Bob) J. Arnold about his career in Test and Evaluation. Bob shares reflections from across his career, including his time as an Air Force Senior Leader.

Below is the technical article lineup:

I-TREE: A Tool for Characterizing Research Taxonomies by Aayushi Verma

  • Abstract: The Institute for Defense Analyses (IDA) is developing a data strategy that implements data governance, data management, and data architecture practices and infrastructures. The data strategy leverages data to build trusted insights and establishes a data-centric culture. One component of the data strategy is a set of research taxonomies that describe and characterize research at IDA. We have created a dataset that consumes disparate data sources related to these taxonomies and unites them with metadata about research products and projects to create quantified analytics addressing questions about research at IDA. This dataset is curated and ingested by an interactive Shiny web application using R, which has been named I-TREE (IDA-Taxonomical Research Expertise Explorer). In this paper, I explain how we used data science to create I-TREE, which aids IDA in collecting new insights and making informed decisions.

Scientific Measurement of Situation Awareness in Operational Testing by Elizabeth A. Green, Miriam E. Armstrong, and Janna Mantua

  • Abstract: Situation Awareness (SA) plays a key role in decision making and human performance; higher operator SA is associated with increased operator performance and decreased operator errors. While maintaining or improving “situational awareness” is a common requirement for systems under test, there is no single standardized method or metric for quantifying SA in operational testing (OT). This leads to varied and sometimes suboptimal treatments of SA measurement across programs and test events. This paper introduces Endsley’s three-level model of SA in dynamic decision making, a frequently used model of individual SA; reviews trade-offs in some existing measures of SA; and discusses a selection of potential ways in which SA measurement during OT may be improved.

Using Changepoint Detection and Artificial Intelligence to Classify Fuel Pressure States in Aerial Refueling by Nelson Walker, Michelle Ouellette, Andrew Welborn, and Nicholas Valois

  • Abstract: An open question in aerial refueling system test and evaluation is how to classify fuel pressure states and behaviors reproducibly and defensibly when visually inspecting the data stream, post-flight. Fuel pressure data streams are highly stochastic, may exhibit multiple types of troublesome behavior simultaneously in a single stream, and may exhibit unique platform-dependent discernible behaviors. These data complexities result in differences in fuel pressure behavior classification determinations between engineers based on experience level and individual judgment. In addition to consuming valuable time, discordant judgments between engineers reduce confidence in metrics and other derived analytic products that are used to evaluate the system’s performance. A fuel-pressure artificial intelligence classification system (FACS), consisting of a changepoint detection algorithm and expert system, has provided a consistent and reproducible solution in classifying various fuel pressure states and behaviors with adjustable sensitivity. In this paper, we explain how the FACS system was built, provide examples of the solution in action, and discuss implications of this method.
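For readers who want a concrete picture of changepoint detection, the sketch below implements a simple two-sided CUSUM mean-shift detector on a synthetic pressure trace. It is a generic illustration with invented data and thresholds, not the FACS algorithm described in the paper:

```python
def detect_changepoint(series, threshold):
    """Two-sided CUSUM against a baseline taken from the first five points.

    Returns the index at which a sustained mean shift is first flagged,
    or None if no shift is detected.
    """
    baseline = sum(series[:5]) / 5.0  # assume the trace starts in a nominal state
    pos = neg = 0.0
    for i, x in enumerate(series):
        pos = max(0.0, pos + (x - baseline))  # accumulates upward drift
        neg = max(0.0, neg - (x - baseline))  # accumulates downward drift
        if pos > threshold or neg > threshold:
            return i
    return None

# Synthetic trace: steady around 50 psi, then a sustained jump to ~58 psi.
trace = [50, 51, 49, 50, 50, 50, 51, 58, 59, 58, 57, 58]
print(detect_changepoint(trace, threshold=15.0))  # flags the shift at index 8
```

The threshold trades sensitivity for false alarms, mirroring the "adjustable sensitivity" the abstract mentions; a production system would tune it per platform.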

Post-hoc Uncertainty Quantification of Deep Learning Models Applied to Remote Sensing Image Scene Classification by Alexei N. Skurikhin, Giri R. Gopalan, Natalie E. Klein, and Emily M. Casleton

  • Abstract: Steadily growing quantities of high-resolution aerial and satellite imagery provide an exciting opportunity for geographic profiling of activities of interest. Advances in deep learning, such as large-scale convolutional neural networks (CNNs) and transformer models, offer more efficient ways to exploit remote sensing imagery. However, while transformers and CNNs are powerful models, their predictions are often taken as point estimates. They do not provide information about how confident the model is in its predictions, which is important information in many mission-critical applications, and this therefore limits their use in this space. We present and discuss results of post-hoc uncertainty quantification (UQ) of deep learning classification models. In particular, we compute UQ metrics on trained “black-box” models to assess each model’s calibration. We consider an application of ten deep learning models to remote sensing image scene classification, and compare classification predictions of these models using image data augmentation and evaluation metrics, such as classification accuracy, Brier score, and expected calibration error.
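The calibration metrics named in this abstract are straightforward to compute from a model's predicted probabilities. Below is a minimal binary-classification sketch of the Brier score and expected calibration error; the probabilities and labels are invented purely for illustration:

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average |confidence - accuracy| gap over confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # bin index by confidence
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(p for p, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece

# Invented predictions and outcomes, purely for illustration.
probs = [0.95, 0.85, 0.65, 0.35, 0.25, 0.55]
labels = [1, 1, 0, 0, 0, 1]
print(brier_score(probs, labels))
print(expected_calibration_error(probs, labels))
```

A well-calibrated model drives both numbers down; the paper extends these ideas to multi-class scene classification.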

Development of Wald-Type and Score-Type Statistical Tests to Compare Live Test Data and Simulation Predictions by Carrington A. Metts and Curtis G. Miller

  • Abstract: This work describes the development of a statistical test created in support of ongoing verification, validation, and accreditation (VV&A) efforts for modeling and simulation (M&S) environments. The test computes a Wald-type statistic comparing two generalized linear models estimated from live test data and analogous simulated data. The resulting statistic indicates whether the M&S outputs differ from the live data. After developing the test, we applied it to two logistic regression models estimated from live torpedo test data and simulated data from the Naval Undersea Warfare Center’s Environment Centric Weapons Analysis Facility (ECWAF).  We developed this test to handle a specific problem with our data: one weapon variant was seen in the in-water test data, but the ECWAF data had two weapon variants.  We overcame this deficiency by adjusting the Wald statistic via combining linear model coefficients with the intercept term when a factor is varied in one sample but not another.  A similar approach could be applied with score-type tests, which we also describe.
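To illustrate the flavor of a Wald-type comparison, the sketch below tests whether a single logistic-regression coefficient differs between a live-data fit and a simulation fit. The estimates and standard errors are placeholders; the paper's actual test compares full coefficient vectors using their estimated covariance matrices:

```python
import math

def wald_statistic(beta_live, se_live, beta_sim, se_sim):
    """Squared standardized difference; ~ chi-square(1 df) under H0: equal effects."""
    z = (beta_live - beta_sim) / math.sqrt(se_live ** 2 + se_sim ** 2)
    return z * z

def chi2_1df_p_value(w):
    """Survival function of chi-square(1): P(Z**2 > w) = erfc(sqrt(w / 2))."""
    return math.erfc(math.sqrt(w / 2.0))

# Placeholder estimates for one coefficient from the two fitted models.
w = wald_statistic(beta_live=1.20, se_live=0.30, beta_sim=0.45, se_sim=0.25)
print(w, chi2_1df_p_value(w))
```

A small p-value suggests the M&S output diverges from live data on that effect; the vector-valued version in the paper uses the same logic with a quadratic form and more degrees of freedom.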

Test and Evaluation of Systems with Embedded Machine Learning Components by Michael R. Smith, Carianne Martinez, Joe Ingram, Mark DeBonis, Christopher Cuellar, and Deepu Jose

  • Abstract: As Machine Learning (ML) continues to advance, it is being integrated into more systems. Often, the ML component represents a significant portion of the system that reduces the burden on the end user or significantly improves task performance. However, the ML component represents an unknown complex phenomenon that is learned from collected data without the need to be explicitly programmed. Despite the improvement in task performance, the models are often black boxes. Evaluating the credibility and the vulnerabilities of ML models poses a gap in current test and evaluation practice. For high consequence applications, the lack of testing and evaluation procedures represents a significant source of uncertainty and risk. To help reduce that risk, we present considerations to evaluate systems embedded with an ML component within a red-teaming inspired methodology. We focus on (1) cyber vulnerabilities to an ML model, (2) evaluating performance gaps, and (3) adversarial ML vulnerabilities.

Experimental Design for Operational Utility by Jim Ferry

  • Abstract: Experimental Design for Operational Utility (EDOU) is a variant of Bayesian Experimental Design (BED) developed for principled cost/benefit analysis. Whereas traditional BED posits an information-based utility function, EDOU’s utility function quantifies the value of knowledge about a system in terms of this knowledge’s operational impact.  Rough knowledge of important characteristics can matter more than precise knowledge of unimportant ones.  EDOU assesses various testing options according to the tradeoff between the expected utility of the knowledge gained versus the cost of testing.  When stakeholder priorities are captured in an operational utility function, it can recommend optimal decisions about which tests to conduct and whether further testing is worth the cost.  The framework is illustrated with a simple example.
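A toy version of this value-of-testing tradeoff can be worked out with a conjugate model. In the sketch below, a system's success probability has a Beta prior, deployment utility is a linear function of that probability, and the expected value of an n-trial test is computed by averaging the best post-test decision over beta-binomial outcomes. All numbers are invented; this is a simplified stand-in for the EDOU framework, not its implementation:

```python
from math import comb, exp, lgamma

def beta_binomial_pmf(k, n, a, b):
    """P(k successes in n trials) when theta ~ Beta(a, b)."""
    log_ratio = (lgamma(a + b) - lgamma(a) - lgamma(b)
                 + lgamma(a + k) + lgamma(b + n - k) - lgamma(a + b + n))
    return comb(n, k) * exp(log_ratio)

def utility_of_best_action(mean_theta):
    """Deploy (utility 100*theta - 60) or do nothing (utility 0), in expectation."""
    return max(0.0, 100.0 * mean_theta - 60.0)

def expected_utility_no_test(a, b):
    return utility_of_best_action(a / (a + b))

def expected_utility_with_test(n, a, b):
    """Preposterior analysis: average the best decision over possible test outcomes."""
    return sum(beta_binomial_pmf(k, n, a, b)
               * utility_of_best_action((a + k) / (a + b + n))
               for k in range(n + 1))

# Value of information from a 10-trial test under a Beta(2, 2) prior:
voi = expected_utility_with_test(10, 2, 2) - expected_utility_no_test(2, 2)
print(voi)  # the test is worth conducting if its cost is below this value
```

Comparing this value of information against the test's cost is exactly the kind of tradeoff the abstract describes, here reduced to one scalar decision.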

Review of Surrogate Strategies and Regularization with Application to High-Speed Flows by Gregory J. Hunt, Robin L. Hunt, and Christopher D. Marley

  • Abstract: Surrogate modeling, also known as meta-modeling or emulation, is an important class of techniques used to reduce the burden of resource-intensive computational models by creating efficient and accurate approximations. Surrogates have been used to great effect in design, optimization, exploration, and uncertainty quantification for a range of problems. Consequently, the development, analysis, and practice of surrogate modeling is of broad interest. In this work, select surrogate modeling strategies are studied as archetypes in a discussion on parametric/nonparametric strategies, local/global modeling, complexity regularization, uncertainty quantification, and strengths/weaknesses. In particular, we consider several variants of two powerful surrogate modeling strategies: polynomial chaos and Gaussian process regression. We evaluate these approaches on several synthetic benchmarks and real models of a hypersonic inlet and thermal protection system. Throughout, we analyze trade-offs that must be navigated to create accurate, flexible, and robust surrogates.
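As a pointer to what a surrogate looks like in practice, here is a bare-bones Gaussian process regression surrogate with a squared-exponential kernel, with sin(x) standing in for an expensive simulation. Hyperparameters are fixed by assumption rather than estimated, unlike the fuller treatments the paper compares:

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential covariance between 1-D input arrays a and b."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_predict(x_train, y_train, x_test, noise=1e-8):
    """Posterior mean and pointwise variance of the GP surrogate at x_test."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = rbf_kernel(x_test, x_train)
    mean = Ks @ np.linalg.solve(K, y_train)
    cov = rbf_kernel(x_test, x_test) - Ks @ np.linalg.solve(K, Ks.T)
    return mean, np.diag(cov)

# Treat sin(x) as a stand-in for an expensive simulation run at five settings.
x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_train = np.sin(x_train)
mean, var = gp_predict(x_train, y_train, np.array([1.5]))
print(mean[0], var[0])  # prediction near sin(1.5), with a small variance
```

The posterior variance is the surrogate's own uncertainty estimate, which is what makes GP regression attractive for the uncertainty quantification uses the abstract discusses.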

Estimating Sparsely and Irregularly Observed Multivariate Functional Data by Maximillian G. Chen

  • Abstract: With the rise in availability of larger datasets, there is a growing need for methods to rigorously analyze datasets that vary over a continuum, such as time. These datasets exist in many fields, such as defense, finance, sports, and medicine. Functional data analysis (FDA) methods, which address data varying over a continuum, typically make three assumptions that are often violated in real datasets: all observations exist over the same continuum interval (such as a closed interval [a,b]), all observations are regularly and densely observed, and, if the dataset consists of multiple covariates, the covariates are independent of one another. We present the functional principal components analysis (FPCA), multivariate FPCA (MFPCA), and principal components analysis through conditional expectation (PACE) methods, which allow us to estimate multivariate functional data observations that are sparsely and partially observed with measurement error. We discuss implementation results on simulated and real datasets, and suggest open problems for future research.
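For intuition, the sketch below runs classical FPCA on densely observed simulated curves by eigendecomposing the sample covariance on a grid. The sparse, irregular setting the paper addresses (via PACE) replaces this direct decomposition with covariance smoothing and conditional expectation:

```python
import numpy as np

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 50)

# Simulate 200 curves: random scores on two basis functions plus noise.
scores = rng.normal(size=(200, 2)) * np.array([2.0, 0.5])
basis = np.vstack([np.sin(2 * np.pi * grid), np.cos(2 * np.pi * grid)])
curves = scores @ basis + 0.1 * rng.normal(size=(200, 50))

# FPCA: eigendecompose the sample covariance of the centered curves.
centered = curves - curves.mean(axis=0)
cov = centered.T @ centered / (len(curves) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)              # ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # descending

explained = eigvals[:2].sum() / eigvals.sum()
print(explained)  # close to 1: two components recover most of the variation
```

The leading eigenvectors are the estimated principal component functions; truncating to a few of them is what lets FDA methods represent whole curves with a handful of scores.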

Statistical Methods Development Work for M&S Validation by Curtis G. Miller

  • Abstract: Modeling and simulation (M&S) environments feature frequently in test and evaluation (T&E) of Department of Defense (DoD) systems. Many M&S environments do not suffer the resourcing limitations associated with live test. We thus recommend testers apply higher-resolution output generation and analysis techniques compared to those used for collecting live test data. Space-filling designs (SFDs) are experimental designs intended to fill the operational space for which M&S predictions are expected. These designs can be coupled with statistical metamodeling techniques that estimate a model that flexibly interpolates or predicts M&S outputs and their distributions at both observed settings and unobserved regions of the operational space. Analysts can study metamodel properties to decide whether an M&S environment adequately represents the original system. This paper summarizes a presentation given at the DATAWorks 2023 workshop.
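A minimal example of a space-filling design is a Latin hypercube, which places exactly one run in each equal-probability stratum of every factor. The sketch below generates one over the unit square; production designs would come from a vetted library and optimize additional space-filling criteria:

```python
import random

def latin_hypercube(n_runs, n_factors, seed=0):
    """One jittered sample per equal-probability stratum in each factor."""
    rng = random.Random(seed)
    columns = []
    for _ in range(n_factors):
        # Place one point inside each of the n_runs strata, then shuffle.
        column = [(i + rng.random()) / n_runs for i in range(n_runs)]
        rng.shuffle(column)
        columns.append(column)
    return list(zip(*columns))  # rows are runs, entries are factor settings

design = latin_hypercube(n_runs=8, n_factors=2)
for factor in range(2):
    strata = sorted(int(point[factor] * 8) for point in design)
    print(strata)  # [0, 1, 2, 3, 4, 5, 6, 7]: one run per stratum
```

Each row of the design is a candidate M&S run; a metamodel fit to the resulting outputs can then interpolate across the unsampled regions of the operational space.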

Enjoy the Issue!

  • Join us on LinkedIn to stay updated with the latest industry insights, valuable content, and professional networking!