Adaptive T&E via Bayesian Decision Theory | ITEA Journal

SEPTEMBER 2024 | Volume 45, Issue 3

Adaptive Test & Evaluation via Bayesian Decision Theory

Jim Ferry
Metron, Inc.
Reston, Virginia

Nate Crookston
Metron, Inc.
Reston, Virginia

Adam Ahmed
Metron, Inc.
Reston, Virginia

DOI: 10.61278/itea.45.3.1002

Abstract

Bayesian decision theory is a framework for making decisions that are expected to have the best outcomes based on available data. Dynamo is a paradigm that applies Bayesian decision theory to Test & Evaluation (T&E). It provides capabilities to visualize how knowledge of a system under test is updated as test results arrive, to assess the ways in which this knowledge could be refined under various test plans, and to recommend the test decisions with the best potential for furthering stakeholder priorities.

Dynamo stands for Dynamic knowledge + Moneyball. Dynamic knowledge refers to the management of the various forms of uncertainty associated with T&E. This knowledge is updated in real time and is leveraged to provide various analytic capabilities. Moneyball refers to optimization criteria that express stakeholder priorities and testing costs in a common currency to facilitate testing decisions that provide the most “bang for the buck.”

This article discusses the principles behind the Dynamo paradigm and the progression of capabilities it provides for T&E as it is deployed in three phases. It ends with a simple example that illustrates how the impetus to test is generated at decision boundaries in knowledge space.

Keywords: Dynamo, Dynamic knowledge, Moneyball, Test & Evaluation, OT&E, Bayesian decision theory, Sequential Bayesian inference, Operational utility

Motivation

How much is conducting a single trial of a defense system worth?

This is the animating question for this article. Working through the answer leads to an approach to Test & Evaluation (T&E) that differs from the current state of the art. It requires an analysis of the mechanism through which testing generates value, which, in turn, requires precise management of the knowledge accumulated during testing.

For example, suppose a defense system with a given delivery schedule has been valued at $1 billion if it meets its requirements and is delivered on time. This value may be assigned during the defense authorization and appropriations process and re-examined during Nunn–McCurdy review [1]. Suppose a proposed test event for the system costs $20 million. Is this too expensive? What about a single trial that costs $20 thousand? On what basis does one even decide such questions?

The Dynamo Paradigm for Test & Evaluation

Dynamo is a paradigm for T&E that answers these questions using Bayesian decision theory (BDT) [2]. To apply Dynamo, one formulates the knowledge one has about a system under test (SUT) and utility functions for decisions about it. The knowledge is dynamic in two senses:

  • it changes in response to (even small amounts of) test data, and
  • it is powerful enough to provide probabilistic predictions of key outcomes.

The utility functions are defined over a set of terminal decisions made at some decision point in the T&E process. They quantify the utility of a decision for a given state of knowledge. The utility is expressed in the same units as the cost of testing, enabling the kind of principled cost/benefit analysis that Michael Lewis called Moneyball when applied to building a baseball team [3].

The Dynamo paradigm may be illustrated by a simple example. Suppose current knowledge about a SUT is that there is a 50% chance it is “good” and a 50% chance it is “bad.” The set of terminal decisions is whether to accept or reject the system. Accepting it entails a cost of $400 million to build it, and the value of the system will be $1 billion if it is good and $0 if it is bad. In this case, the expected utility of the accept decision is 50% × $1 billion + 50% × $0 − $400 million = $100 million. On the other hand, rejecting the system entails a $0 cost but provides a $0 value.

Suppose there is a test decision available in addition to the terminal decisions. This decision is to conduct a $20 million test event that resolves whether the system is good or bad so that the system can only be built when it is known to be good. The net utility of accepting a good system is $1 billion − $400 million = $600 million. The expected utility of the test decision is 50% × $600 million + 50% × $0 − $20 million = $280 million. Because the expected utilities of the test, accept, and reject decisions are $280 million, $100 million, and $0, respectively, it is best to test in this case.
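
For readers who prefer to see the arithmetic, the short sketch below recomputes the three expected utilities from the illustrative figures above; it is a worked calculation, not part of Dynamo itself.

```python
# A few lines that recompute the expected utilities above.  The dollar figures are
# the illustrative values from this example, not real program costs.
p_good = 0.5                  # current knowledge: 50% chance the system is "good"
value_good = 1_000_000_000    # value of a good system
build_cost = 400_000_000      # cost to build (incurred only if the system is accepted)
test_cost = 20_000_000        # cost of the test event that resolves good vs. bad

u_accept = p_good * value_good + (1 - p_good) * 0 - build_cost   # $100M
u_reject = 0.0                                                   # $0
u_test = p_good * (value_good - build_cost) - test_cost          # $280M: build only if good

best = max({"accept": u_accept, "reject": u_reject, "test": u_test}.items(),
           key=lambda kv: kv[1])
print(f"accept ${u_accept:,.0f}  reject ${u_reject:,.0f}  test ${u_test:,.0f}  -> {best[0]}")
```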

Exemplar System: Q-53

The knowledge state one has about a SUT is much more complex, in practice, than simply “50% good, 50% bad.” The form of the knowledge state must be tailored to the specifics of an individual SUT. The research and development (R&D) prototype for Dynamo (called “core Dynamo”) has been developed around the AN/TPQ-53 (Q-53) radar system as its primary exemplar. The Q-53 system attempts to localize the Point of Origin (POO) of a battery firing one of several munition types. Each trial of the system is represented as a pair (x,y), where x is the environment and y is the outcome. The environment x is a vector that represents the various test factors described in the Q-53’s Test and Evaluation Master Plan (TEMP). These include the range from the radar to the POO, the aspect angle (i.e., the munition’s heading relative to the vector from the radar to the POO), the radar’s operating mode, and the munition type. The outcome y represents the POO error: i.e., the distance between the true POO and the radar’s estimate of it. Other than some implementation details (e.g., using logarithms of distances and one-hot encodings for categorical quantities), the data stream of test results from the Q-53 system is a sequence of these (x,y) pairs.
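
To make the data representation concrete, the purely hypothetical sketch below encodes one trial as an (x,y) pair; the factor names, levels, and encoding choices are illustrative stand-ins, not the actual Q-53 TEMP specification.

```python
import numpy as np

# Hypothetical sketch of how one trial's environment x and outcome y might be encoded.
# The factor names, levels, and encoding choices below (e.g., sine/cosine for aspect
# angle) are illustrative stand-ins, not the actual Q-53 TEMP specification.
RADAR_MODES = ["normal", "extended"]                # assumed operating modes
MUNITION_TYPES = ["mortar", "artillery", "rocket"]  # assumed munition categories

def encode_environment(range_km, aspect_deg, mode, munition):
    """Build the environment vector x for a single trial."""
    x = [1.0,                              # intercept term
         np.log(range_km),                 # log of range from radar to POO
         np.cos(np.radians(aspect_deg)),   # one possible encoding of aspect angle
         np.sin(np.radians(aspect_deg))]
    x += [1.0 if mode == m else 0.0 for m in RADAR_MODES[1:]]           # one-hot, baseline dropped
    x += [1.0 if munition == m else 0.0 for m in MUNITION_TYPES[1:]]
    return np.array(x)

def encode_outcome(poo_error_m):
    """Outcome y: log of the POO error in meters."""
    return np.log(poo_error_m)

x = encode_environment(range_km=6.0, aspect_deg=30.0, mode="extended", munition="rocket")
y = encode_outcome(poo_error_m=35.0)
print(x, y)
```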

Dynamo processes this stream of (x,y) pairs into the knowledge one has about the Q-53 system at any time. This knowledge object can be leveraged for a variety of purposes, including Moneyball cost/benefit analysis for testing decisions.

KNOWLEDGE REPRESENTATION IN DYNAMO

Figure 1: Dynamo leverages the knowledge about a SUT for multiple purposes

As depicted in the first row of Figure 1, it is possible to use the data stream of (x,y) pairs directly to assess whether a SUT meets the requirements given in its TEMP. The TEMP specifies requirements for Key Performance Parameters (KPPs) and Key System Attributes (KSAs). These requirements may be expressed as statistical properties that outcomes y must satisfy for various sectors of the design space. Care must be taken, however, to ensure that the test environments x representing a sector are not biased toward particularly easy or hard regions. For example, if part of the definition of a sector were a range of 1 to 10 km, then one might specify a test plan (or design matrix) X whose rows x have ranges spread uniformly over 1 to 10 km. Nevertheless, if the radar had a bias toward acquiring targets at shorter ranges, the resulting data stream of (x,y) pairs would be biased toward the short-range cases acquired. Because POO errors tend to be smaller at short range, the conclusions for the sector would be biased as well.

The larger issue, however, is that valuable information is thrown away when the data stream of (x,y) pairs is used only to evaluate requirements. For example, an excellent way to characterize a system’s performance is to indicate what outcome y would occur when it is used in an arbitrary environment x. If x happened to be one of the test points, one could simply return the corresponding outcome y. If repeated trials were performed in environment x, one could return the set of outcomes y, pointing out that, because x does not include all factors that may influence the outcome, various outcomes are possible. The challenge, of course, is to indicate what outcomes y are likely to occur when x is not one of the test points. In this case, one must implicitly or explicitly invoke some model that has been informed by the test data.

When multiple outcomes y are possible for an environment x, one may model the inherent variability of the outcome as a likelihood function L(y|x,θ), which provides the probability distribution of y for any value of x. Here, θ is a parameter vector that encodes the (probabilistic) relationship between x and y: cf. the second row of Figure 1. If the form of the likelihood function is well-suited to the SUT, then there exists some value of θ for which L(y|x,θ) is an accurate model of the probability of the outcome y in any environment x. Whether this value can be estimated well from the testing data depends on the SUT, the choice of model, and the amount of testing data available.

A model allows requirements to be assessed by averaging over all environments in a sector rather than just over the ones tested, which mitigates biases introduced by averaging over test points. A model also enables new types of evaluation criteria, such as the optimal design metrics developed for linear–Gaussian models [4]. Such metrics quantify how precisely θ would be known after executing a specific test design matrix X. However, as Figure 1 indicates, there are conflicts between the compliance mindset of satisfying requirements and the information mindset of precisely characterizing the SUT’s behavior. The compliance mindset can lead to overtesting for cases in which the value of a KPP has been narrowed down to a small interval that straddles a compliance threshold. The information mindset can lead to overtesting when the value of a KPP still has a broad range of possible values, but all such values lie on one side of the compliance threshold. See [5] for further discussion of this point.
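
As a concrete instance of such an optimal-design metric, the sketch below scores two toy design matrices with the classical D-optimality criterion, log det(XᵀX), for a linear-Gaussian model; the designs (an intercept column plus a range column) are made up for illustration.

```python
import numpy as np

# Sketch of a classical optimal-design metric for a linear-Gaussian model:
# D-optimality, i.e., log det(X^T X).  The two candidate designs are toy examples
# (an intercept column plus a range column), not actual Q-53 test plans.
def d_optimality(X):
    """log det(X^T X): larger values mean theta would be pinned down more precisely."""
    sign, logdet = np.linalg.slogdet(X.T @ X)
    return logdet if sign > 0 else -np.inf

X_spread = np.column_stack([np.ones(8), np.linspace(1.0, 10.0, 8)])    # ranges spread over 1-10 km
X_clustered = np.column_stack([np.ones(8), np.linspace(1.9, 2.1, 8)])  # ranges clustered near 2 km
print(d_optimality(X_spread), d_optimality(X_clustered))  # the spread design scores higher
```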

The main message of Figure 1 is that better testing models support more and better evaluation criteria. If one can develop a probabilistic model L(y|x,θ) and learn an accurate value of θ, this would seem to be the ultimate evaluation criterion. It would not only provide a prediction of the system’s behavior for any environment one encounters, it would also provide the full probability distribution of outcomes that could occur. Assuming x and y have been formulated well, the model would also provide the required information about the KPPs and KSAs. While this capability would indeed be a very useful characterization of the system once it is fully tested, there is an additional consideration during the T&E process itself: the distinction between aleatoric and epistemic uncertainty.

Aleatoric Uncertainty. Assuming there is a unique value of θ that optimizes the accuracy of the model L(y|x,θ), the ideal outcome of testing would be to learn this value of θ exactly. For example, suppose a vast number of tests conducted in one environment had a median POO error of 30 meters and an (equal-tailed) 95% containment interval of 25-to-35 meters. In another environment, a vast number of tests may have a median of 20 meters and a 95% containment interval of 5-to-80 meters. Finding a value of θ for which the likelihood function L(y|x,θ) exhibits these same statistics over y, for those two environments x, would be quite promising. However, it is natural to react to an interval of 5-to-80 meters by deeming it too large and thinking more testing is needed to find “the true POO error” for that environment. This reaction is misguided. If a vast number of tests exhibit a 95% containment interval of 5-to-80 meters, that is simply what the truth is for this environment. It indicates that in this environment, the SUT itself has a POO error with a large aleatoric uncertainty. Aleatoric uncertainty cannot be reduced by further testing. Rather, exhaustive testing would indicate precisely how much aleatoric uncertainty a SUT exhibits in various environments x.

Epistemic Uncertainty. Thorough testing may or may not produce a good estimate of θ. When θ can be estimated well, it quantifies the aleatoric uncertainty of a SUT. However, there is an additional kind of uncertainty present in the T&E process: the epistemic uncertainty of not knowing θ precisely. Just as the aleatoric uncertainty is encoded in the model L(y|x,θ), the epistemic uncertainty is encoded in a Bayesian model P(θ|κ) over the parameter vector θ given the knowledge κ, as shown in the third row of Figure 1. The term “aleatoric” refers to the “roll of the dice” uncertainty intrinsic to a system. The parameter vector θ encodes both an estimate of the outcome y in any environment x and the aleatoric uncertainty about y. On the other hand, “epistemic” refers to the knowledge that one has about a SUT. The knowledge κ encodes both an estimate of the parameter vector θ and the epistemic uncertainty one has about θ.

A Bayesian model P(θ|κ) further enhances the tester’s ability to assess requirements and optimize the information characteristics of a test design. Rather than using the data to provide an estimate of θ, the knowledge κ maintains the probabilities of the various values θ could have, updating these probabilities whenever a new data pair (x,y) arrives. This fine-grained quantification of knowledge is necessary to quantify the expected utility of testing an environment x or executing a test plan X.

In the Dynamo prototype, the Normal–Normal-Inverse-Gamma model is used for L(y|x,θ) and P(θ|κ). In this model, the parameter vector θ comprises a scalar σ² and a vector c, and L(y|x,θ) is a normal distribution on y with mean c⋅x and variance σ². The knowledge κ has four components. It has two scalars, α and β, that specify an inverse gamma distribution on σ². It also has a vector μ and a covariance matrix V that, together with σ², specify a multivariate normal distribution on the vector c. Thus, κ has all the information necessary to fully describe the probability distribution P(θ|κ). The details of this model, including its Bayesian update rules, are discussed in [6].
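
As a hedged illustration of such an update, the sketch below assimilates a single (x,y) pair under the standard Normal-Inverse-Gamma conventions (c | σ² ~ N(μ, σ²V), σ² ~ InvGamma(α, β)); the exact parameterization and update rules used in core Dynamo are the ones given in [6] and may differ in detail.

```python
import numpy as np

# Sketch of a single-trial Bayesian update for a Normal / Normal-Inverse-Gamma model.
# Conventions assumed here: c | sigma^2 ~ N(mu, sigma^2 * V), sigma^2 ~ InvGamma(alpha, beta),
# y | x, c, sigma^2 ~ N(c.x, sigma^2).  Core Dynamo's exact parameterization may differ.
def update_knowledge(mu, V, alpha, beta, x, y):
    """Assimilate one (x, y) pair: kappa = (mu, V, alpha, beta) -> kappa+."""
    V_inv = np.linalg.inv(V)
    V_new = np.linalg.inv(V_inv + np.outer(x, x))
    mu_new = V_new @ (V_inv @ mu + x * y)
    alpha_new = alpha + 0.5
    beta_new = beta + 0.5 * (y**2 + mu @ V_inv @ mu
                             - mu_new @ np.linalg.inv(V_new) @ mu_new)
    return mu_new, V_new, alpha_new, beta_new

# Example: a diffuse prior over a 2-component c (intercept + log-range coefficient).
mu, V, alpha, beta = np.zeros(2), 10.0 * np.eye(2), 2.0, 1.0
x, y = np.array([1.0, np.log(6.0)]), np.log(35.0)   # one encoded trial (environment, outcome)
mu, V, alpha, beta = update_knowledge(mu, V, alpha, beta, x, y)
print(mu, alpha, beta)
```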

Phased Deployment of Dynamo

Dynamo currently exists as the R&D prototype “core Dynamo.” The first deployment of Dynamo will begin shortly and will proceed in the phased manner discussed below. One goal for all such deployments is to fold the lessons learned back into core Dynamo to streamline subsequent deployments. The three phases of Dynamo deployment are based on the three properties listed in Table 1. These properties define what is essential to implement to deploy Dynamo. The implementation of each property leads to a corresponding capability enhancement.

Table 1: Capabilities enabled by properties in a three-phase deployment of Dynamo

Phase | Property   | Capability
------|------------|--------------------------------
1     | Fidelity   | Dynamic Knowledge Visualization
2     | Prediction | Test Option Analysis
3     | Utility    | Moneyball Decision Aid

Property 1: Fidelity

The knowledge κ must represent what is known about the SUT at any time. The SUT’s TEMP specifies requirements for KPPs and KSAs. The information contained in κ must suffice to produce probability distributions over their values. The requirements specified for the KPPs and KSAs are used

  • to visualize κ for the tester in a way that highlights critical information, and
  • to inform the utility functions discussed below under Property 3.

Whenever new information about the SUT becomes available, whether this be via test results or subject matter expertise, there must be a mechanism to update the knowledge state κ to some state κ+ that reflects the updated understanding. The requirement that the knowledge κ exists at any time implies that one must be able to:

  • represent whatever knowledge one has prior to testing, and
  • assimilate even a small amount of data to make a small adjustment to κ.

A common objection to the first point is that one should assume one knows nothing prior to testing in order not to introduce subjective biases. However, a complete lack of knowledge is never a realistic stance. If miss distance were a KPP, a complete lack of knowledge would imply that one has no idea whether miss distances are typically less than the diameter of the solar system. Similarly, it may be difficult to appreciate how a single data point updates knowledge, but if a single data point did not update knowledge, then a billion data points would not either. Dynamo’s Fidelity property requires that the knowledge κ respect these logical requirements of reasoning under uncertainty so that the benefit of performing a test can be properly assessed.

Phase 1 of implementing the Dynamo paradigm for a specific SUT focuses on developing an appropriate representation of the knowledge κ for that system. The goal of a Phase 1 deployment is to achieve fidelity in this representation. This is accomplished by implementing an interactive visualization of the knowledge state κ, customized to highlight the information that testers find the most useful. This visualization would provide drill-down and analysis capabilities to allow testers to explore and understand the knowledge state κ as it evolves in response to new test results. In addition to providing a valuable tool, developing the knowledge representation in this way ensures that subsequent Phase 2 and 3 deployments will be properly tailored to tester priorities.

Figure 2: Example Dynamo screenshot: high-level, Phase 1 summary

Figure 2 is a screenshot of the core Dynamo GUI applied to the “Mock-Cannon” scenario. This scenario is used in lieu of Q-53 to illustrate Dynamo in unclassified contexts. It depicts the initial, high-level visualization of the knowledge κ. In this case, the tester is interested in an operational Effectiveness metric, which is a composite of two KPPs. The tester is also shown experimenting with additional Suitability and Survivability metrics, but not with any metric for Lethality. For each evaluation criterion, the dials show (a) a needle pointing to the median and (b) the 10th-to-90th percentile wedge. The GUI provides drill-down capabilities for the details of the Effectiveness information being shown, as well as Phase 2 and 3 analytics.

Property 2: Prediction

The second property required of the knowledge κ is that it can be leveraged to simulate the kinds of test results that are ingested by the Phase 1 deployment. The nature of the simulation required is different from the high-fidelity physics simulations used in digital twins. For Dynamo, the purpose of the knowledge κ is to define the probability distribution over the quantities that indicate what the system’s KPPs and KSAs are. Thus, in the Q-53 example, the goal is not to develop an exquisite simulation of raw radar data, but rather to maintain probability distributions over the quantities that directly inform the KPPs and KSAs. In this case, one of the KPPs is about the size of the POO error in various environments. Thus Property 2 requires that κ is sufficiently powerful to provide a probability distribution over the POO error in any environment that one may wish to test.
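
Under the standard Normal-Inverse-Gamma conventions sketched earlier (which may differ in detail from core Dynamo's), this predictive distribution is available in closed form: the outcome in a new environment x follows a Student-t distribution, as in the sketch below.

```python
import numpy as np
from scipy import stats

# Sketch: predictive distribution of the outcome y in environment x, given knowledge
# kappa = (mu, V, alpha, beta), under the standard Normal-Inverse-Gamma conventions
# assumed earlier.  (Core Dynamo's exact parameterization may differ.)
def predictive_y(mu, V, alpha, beta, x):
    """Return a frozen Student-t distribution over y for environment x."""
    loc = mu @ x
    scale = np.sqrt((beta / alpha) * (1.0 + x @ V @ x))
    return stats.t(df=2.0 * alpha, loc=loc, scale=scale)

mu, V, alpha, beta = np.zeros(2), 10.0 * np.eye(2), 2.0, 1.0
x = np.array([1.0, np.log(6.0)])
dist = predictive_y(mu, V, alpha, beta, x)
print(dist.ppf([0.1, 0.5, 0.9]))   # 10th/50th/90th percentiles of the (log) POO error
```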

E. T. Jaynes argues that Bayesian reasoning is essentially the only consistent way to extend Aristotelian logic to handle uncertainty [7]. Implementing Properties 1 and 2 in accordance with the laws of Bayesian reasoning yields a system for sequential Bayesian inference [8]. Property 2 allows one to predict the distribution of outcomes y, for environment x, given the current state of knowledge κ. Property 1 allows one to assimilate even a single test datum (x,y) to update a knowledge state κ to κ+. Combining these properties allows one to determine the distribution of updated knowledge states κ+, given the current state κ, for a trial in environment x. The same process, applied in batch, provides the probability distribution of updated knowledge states κ+ for any ensemble of environments x comprising the rows of a test plan matrix X [9].
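
This predict-then-assimilate loop is straightforward to prototype. The sketch below does so for a toy hit/miss knowledge state κ = (a, b) (the same beta-distribution form used in the worked example at the end of this article); the numbers are arbitrary, and the same loop applies to richer knowledge states such as the Normal-Inverse-Gamma one.

```python
import numpy as np

# Sketch of Phase 2 "prediction": simulate the distribution of updated knowledge states
# kappa+ implied by the current knowledge kappa and a proposed batch of trials.  A toy
# hit/miss knowledge state kappa = (a, b) keeps the recipe short; the same
# predict-then-assimilate loop applies to richer models.
rng = np.random.default_rng(1)

def simulate_kappa_plus(a, b, n_trials, n_sims=10_000):
    """Monte Carlo over possible kappa+ after n_trials additional hit/miss trials."""
    outcomes = []
    for _ in range(n_sims):
        p = rng.beta(a, b)                        # Property 2: draw theta from P(theta | kappa)
        hits = rng.binomial(n_trials, p)          # ...and simulate the trials' outcomes
        outcomes.append((a + hits, b + n_trials - hits))   # Property 1: assimilate them
    return outcomes

kappa_plus = simulate_kappa_plus(a=3, b=2, n_trials=10)
# e.g., the distribution of the post-test estimated hit rate a+/(a+ + b+)
post_means = np.array([a / (a + b) for a, b in kappa_plus])
print(post_means.mean(), np.quantile(post_means, [0.1, 0.9]))
```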

This phase of Dynamo deployment enables a tester to analyze various possible test plans X, assessing and visualizing the range of possibilities of what would be known about the system afterward. Phase 1 of Dynamo deployment is focused on developing an effective visualization capability for a single state of knowledge κ. This already involves managing both aleatoric and epistemic uncertainty. Phase 2 adds a third layer: the probability distribution over what the updated knowledge state κ+ would be given the current knowledge κ and a proposed test plan X.

Property 3: Utility

A Phase 1 Dynamo deployment provides real-time updates of the knowledge κ about a SUT. A Phase 2 deployment allows the tester to analyze the impact of selecting a test plan X when the current knowledge state is κ. These uncertainty management capabilities are enabled by Properties 1 and 2 above. Property 3 connects the knowledge state κ to the real-world operational impact of the SUT.

To formulate the third property requires the notion of a terminal decision. To inform decisions during a test event, Dynamo requires a specification of the space of possible decisions after the test event is complete. If there is only one possible terminal decision – if the system were preordained to be accepted, for example – then Dynamo would typically issue the sensible recommendation to stop testing immediately. The third property required for Dynamo is that every terminal decision d has a utility function ud (κ) that reflects the operational value of making decision d when the knowledge state is κ.

A Phase 3 implementation of Dynamo leverages these terminal utilities to compute the net utility of the various possible testing options available during a test event. It then recommends decisions that provide the most bang for the buck by comparing the cost of a test to the expected value of the knowledge it would produce. However, there are both practical and technical challenges to developing the full Phase 3 capability.

One practical issue is that the planning for test events does not permit the flexibility to select which trials to perform on the fly. Another is that such tests need to be scheduled and resourced in such a way that there is little cost saving from not performing all the planned tests. A third is that eliciting the required terminal utility functions is difficult because it requires stakeholders to explicitly specify and quantify their priorities. The purpose of this paper, then, is not to promote the immediate adoption of Phase 3 Dynamo but rather to socialize the concept. For example, although test events may be rigidly planned and resourced, Moneyball cost/benefit analysis could be applied to other decision points in the acquisition life cycle where there is more flexibility.

The flip side of these considerations is that rigidity in the T&E process also has downsides. Systems can be tested against requirements that quickly become outdated in a rapidly changing world. One reason it is difficult to elicit stakeholder priorities is that doing so forces hidden assumptions into the light. However, once these priorities are explicitly codified, it is easier to update them to accommodate a shifting geostrategic landscape.

The technical challenge of Phase 3 Dynamo is that an exact implementation is impractical except in simple cases. The outline of an exact implementation is as follows. Suppose the range of testing options available to an Operational Test Agency is specified at every stage during a test event and that the cost of any test plan X is specified as some value cX. Then, the utility of the knowledge state κ can be computed recursively using sequential Bayesian decision theory [10]. First, one can compute the utility uD (κ) of terminating the test event now: uD (κ) is defined to be the maximum of ud (κ) over all terminal decisions d∈D. Next, one can compute the utility uC (κ) of continuing the event: uC (κ) is defined to be the maximum of uX (κ) over all X∈C, where uX (κ) is the expected utility of executing the test plan X. The utility u(κ) of the knowledge state κ is then the maximum of uD (κ) and uC (κ). This part of the test decision aid procedure is straightforward.

The challenge comes in computing uX (κ) for test plan X. The utility uX (κ) is defined to be the expected value of u(κ+) over all updated knowledge states κ+ given the current state κ and test plan X, minus the cost cX of executing the test plan [6]. Phase 2 of Dynamo provides the required probability distribution of κ+ given κ and X. However, computing u(κ+) involves the same procedure as computing u(κ) itself. This, then, involves a second recursion over twice-updated knowledge states κ++, and so on. This recursion can continue many steps into the future, which quickly becomes intractable.

One approximate method being developed for computing this recursion is based on a sequential probability ratio test (SPRT) [11]. Another is based on the way in which the impetus to test propagates outward from terminal decision boundaries, which is discussed below.

Example: Exact Utility Calculations

The utility equations discussed above can be solved exactly for a simple system in which

  • There is no environment variable x, and
  • The outcome is simply “hit” or “miss.”

In this case, the parameter vector θ reduces to a single scalar: the hit rate p. The likelihood function L(y|x,θ) is then just L(y|p)=p when y is a hit and L(y|p)=1-p when y is a miss. That is, the SUT has some unknown hit rate p governing whether outcomes are hits or misses, and one wants to make decisions about the system based on the observed outcomes. The knowledge in this simple example can be represented as κ=(a,b), where a and b are the parameters of a beta distribution. The values of a and b can be interpreted as the number of hits and misses observed, respectively (plus some “pseudo” hits and misses that represent prior knowledge). Results and insights for this case have been discussed extensively in [6] and [9], so the discussion below is somewhat abbreviated. Despite its simplicity, the tractability of this case continues to yield insights into the mechanism by which testing generates value.
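
A minimal sketch of this tractability is given below: it solves the utility recursion described above by backward induction over a finite trial budget and prints a coarse decision chart. The two terminal decisions (accept/reject), the dollar figures, the cost per trial, and the horizon are illustrative choices, not the three-decision configuration behind the figures that follow.

```python
from functools import lru_cache

# Sketch of the exact Beta-Bernoulli utility recursion, solved by backward induction
# over a finite budget of remaining trials.  The values, costs, two terminal decisions
# (accept / reject), and horizon are illustrative choices, not the configuration used
# to produce the figures in this article.
VALUE, BUILD_COST, COST_PER_TRIAL, HORIZON = 1_000.0, 400.0, 2.0, 50

def u_terminal(a, b):
    """Best terminal utility at knowledge (a, b): accept or reject now."""
    p_hat = a / (a + b)                           # posterior mean hit rate
    return max(VALUE * p_hat - BUILD_COST, 0.0)   # accept vs. reject

@lru_cache(maxsize=None)
def u(a, b, budget):
    """Utility of knowledge (a, b) with `budget` trials still allowed."""
    stop = u_terminal(a, b)
    if budget == 0:
        return stop
    p_hit = a / (a + b)                           # predictive probability of a hit
    cont = (p_hit * u(a + 1, b, budget - 1)
            + (1 - p_hit) * u(a, b + 1, budget - 1)
            - COST_PER_TRIAL)
    return max(stop, cont)

def v(a, b, budget=HORIZON):
    """Marginal value of retaining the option to keep testing."""
    return u(a, b, budget) - u_terminal(a, b)

# A coarse decision chart over knowledge states (a, b): 'T' = keep testing.
for b in range(1, 11):
    row = ""
    for a in range(1, 11):
        if v(a, b) > 1e-9:
            row += "T"
        else:
            row += "A" if VALUE * a / (a + b) - BUILD_COST > 0 else "R"
    print(row)
```

Near the boundary where accepting and rejecting have equal utility, the chart recommends continued testing; lowering COST_PER_TRIAL in this sketch widens that band of 'T' cells, mirroring the cost-per-trial behavior shown in Figure 5 below.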

To illustrate this mechanism, let v(κ) = u(κ) − uD (κ) = max(0, vC (κ)) be the marginal value of having the option to continue testing a system. The recursion for u(κ) described above can be translated into a recursion for v(κ). This may be expressed concisely as

v(κ) = max(0, vC (κ)),   where   vC (κ) = max over X∈C of [ gX (κ) + E[v(κ+) | κ, X] − cX ].

The change of variables from u(κ) to v(κ) introduces a new source term, the terminal gain gX (κ), into the equations:

gX (κ) = E[uD (κ+) | κ, X] − uD (κ).

The quantity gX (κ) is the expected increase in the terminal utility from executing the test plan X. The terminal gain gX (κ) concentrates at the boundaries between terminal decision regions (provided the terminal utilities ud (κ) are intrinsic (cf. [6])). This is illustrated in Figure 3 for a case with three terminal decisions: Accept (best when a is large and b is small: i.e., the system has a high hit-rate), Improve (i.e., send the system back to the developer for improvement – best for intermediate hit-rates), and Reject (best when a is small and b is large).

Figure 3: Terminal gain concentrates at terminal decision boundaries

When the above recursion for v(κ) is solved, it yields the decision chart shown in Figure 4. One can interpret this as gX (κ) generating an “impetus to test” at the terminal decision boundaries, which then propagates outward in some manner to recommend further testing even when the benefit of testing cannot be realized in a single trial.

Figure 4: Decision chart for Beta-Bernoulli system

Figure 5 shows a variant of Figure 4 with a different color scheme, and for a range of costs per trial. Here the different terminal decision regions are shown in grayscale, and the Continue Testing region is colored using the value of v(κ). As the cost per trial decreases, the marginal value of having the option to continue testing increases, and the Continue Testing region widens.

Figure 5: Continue Testing region expands as cost-per-trial decreases

The knowledge κ may be rewritten as the pair (a, n), where n = a + b. Figure 6 plots v(κ) for the case n=100.

Figure 6: Numerical computations of marginal testing value converge to zero-cost formula

The marginal value of having the option to continue testing has a cusp-like peak at the terminal decision boundaries. The form of these cusps converges to a zero-cost-per-trial limit. This zero-cost case is not just numerically tractable but analytically tractable as well: there is a closed-form expression for the value, involving the incomplete gamma function Γ(s,x), which can in turn be expressed in terms of the error function erf(x) in this case.

Because this case is simple enough to allow such analytic calculations, it is helpful to use it to formulate approximate decision methods. The figures above demonstrate how the impetus to test is generated at the terminal decision boundaries via the terminal gain function gX (κ). This suggests that an approximate decision aid could be formulated by computing the location of decision boundaries, the strength of the terminal gain there, and the shape of the functional form via which it propagates outward from these boundaries.

Summary

The Dynamo paradigm formulates T&E in accordance with the principles of Bayesian decision theory. It processes testing data and subject matter expertise into a knowledge state that manages the epistemic uncertainty associated with testing. This knowledge state is responsive to even small amounts of new test data. It is also sufficiently powerful to provide probabilistic predictions of key outcomes and of the KPPs and KSAs formulated in terms of these outcomes. Maintaining the knowledge state in this way supports capabilities to visualize test progress in real-time and to predict the impact of proposed test plans.

Dynamo leverages the knowledge state to assess which test plans are expected to provide the most “bang for the buck.” There are practical and technical challenges in realizing this final “Moneyball” phase of Dynamo deployment. The approach to addressing the practical challenges is to identify decision points in the acquisition life cycle where Dynamo’s cost/benefit analysis would be compatible with existing procedures. The approach to addressing the technical challenges is to develop approximate methods for computing the required utilities of test plans. An exact solution for these utilities was computed for a simple hit/miss example. This example suggests a general principle of the impetus to test being “generated” at terminal decision boundaries and then propagating into the interior of the terminal decision regions. Understanding the nature of the impetus to test is one avenue being explored to develop effective approximate utility computation methods.

Dynamo addresses the complexity of T&E by formulating the ideal Bayesian-decision-theoretic solution and then developing the approximations needed to make it feasible in practice. A benefit of this is that the later phases of Dynamo impose rigorous demands on the earlier phases. In particular, the Moneyball approach to utility requires sophisticated uncertainty management about the state of the test. This uncertainty management module then becomes a valuable tool in its own right.

The Dynamo paradigm has been implemented as an R&D prototype (core Dynamo). Its first deployment will begin soon, starting with the formulation, management, and visualization of the knowledge state. The long-term plan for Dynamo is to deploy it for a series of systems, incorporating the lessons learned into core Dynamo in order to accelerate subsequent deployments. As this plan is realized, Dynamo will provide novel analytic and visualization tools and will facilitate protocols designed to reach trustworthy decisions sooner and at a lower cost.

Acknowledgements

This work was supported by The Director, Operational Test & Evaluation (DOT&E). The authors would also like to thank the reviewers, whose comments were insightful, professional, helpful, and extensive.

References

[1] Schwartz, Moshe. 2015. The Nunn-McCurdy Act: Background, Analysis, and Issues for Congress. Congressional Research Service.

[2] Berger, James O. 1985. Statistical Decision Theory and Bayesian Analysis. New York: Springer-Verlag.

[3] Lewis, Michael. 2003. Moneyball: The Art of Winning an Unfair Game. New York: W. W. Norton and Company.

[4] Wu, C.F. Jeff, and Michael Hamada. 2009. Experiments: Planning, Analysis, and Optimization. 2nd ed., Wiley.

[5] Ferry, James, Lawrence Stone, Thomas Corwin, Adam Ahmed, Sean Daugherty, Jeremy Werner, and Sandra Hobson. 2024. “Use of Bayesian methods to optimize decisions,” Naval Engineers Journal, 136 (1): 79–84.

[6] Ferry, James and Adam Ahmed. 2024. “A Bayesian Decision Theory Paradigm for Test and Evaluation,” Proceedings of the 27th International Conference on Information Fusion, Venice, Italy, July 7-11, 2024.

[7] Jaynes, Edwin T. 2003. Probability Theory: The Logic of Science. Cambridge, UK: Cambridge University Press.

[8] Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2013. Bayesian Data Analysis. 3rd ed. Boca Raton, FL: CRC Press.

[9] Ferry, James. 2023. “Experimental design for operational utility,” The ITEA Journal of Test and Evaluation, 44 (3).

[10] Chernoff, Herman. 1972. Sequential Analysis and Optimal Design. Philadelphia, PA: Society for Industrial and Applied Mathematics.

[11] Wald, Abraham. 1945. “Sequential Tests of Statistical Hypotheses,” Annals of Mathematical Statistics, 16 (2): 117–186.

Author Biographies

Jim Ferry, Ph.D. is a Principal Research Scientist at Metron, Inc. His research focuses on the problems of the Defense and Intelligence communities. His interests encompass the application of Bayesian methods to a variety of domains: Test & Evaluation, tracking and data association for kinematic and non-kinematic data, and the synthesis of classical detection and tracking theory with the modern theory of networks. Prior to Metron, he worked in computational fluid dynamics at UIUC, specializing in multiphase flow and thermal convection. Dr. Ferry holds a Ph.D. in applied mathematics from Brown University and an S.B. in mathematics from MIT.

Nate Crookston, Ph.D. is a Senior Software Developer at Metron, Inc. He is the lead developer for the Metron DOT&E effort as well as projects relating to naval logistic resupply. Prior to Metron, he worked at the Space Dynamics Laboratory on near-real-time tactical image processing, display, and dissemination problems with a focus on image processing algorithm development. Dr. Crookston holds a Ph.D. in Electrical & Computer Engineering from Johns Hopkins University, and an M.S. in Computer Engineering and a B.S. in Electrical Engineering from Utah State University.

Adam Ahmed, Ph.D. is a Research Scientist at Metron, Inc. He is the technical lead for the Metron DOT&E effort. His research interests include applying novel Bayesian approaches to testing and evaluation, machine learning methods for small datasets as applied to undersea mine classification, and time series classification for continuous active sonar systems. Dr. Ahmed holds a Ph.D. and M.S. in physics from The Ohio State University, and a B.S. in physics from UIUC.
