SEPTEMBER 2025 | Volume 46, Issue 3
IN THIS JOURNAL:
- Issue at a Glance
- Chairman’s Message
Technical Articles
- Kernel Model Validation: How To Do It, And Why You Should Care
- Confidence-Based Skip-Lot Sampling
- Eucalyptus – An Analysis Suite for Fault Trees with Uncertainty Quantification
- Digital Twins in Reliability Engineering: Innovations, Challenges and Opportunities
- Competence Measure Enhanced Ensemble Learning Voting Schemes
- Advancing the Test Science of LLM-enabled Systems: A Survey of Factors and Conditions that Matter Most
- Beyond Accuracy: Evaluating Bayesian Neural Networks in a Real-world Application
- Balancing Structure and Flexibility: Evaluating Agile, Waterfall, and Hybrid Methodologies in Aerospace and Defense Projects
Workforce of the Future
- Building Confidence, Interest, and Opportunity: A Social Cognitive Career Theory-Based Analysis of the Young Women in Engineering Outreach Program
News
- Association News
- Chapter News
- Corporate Member News
Advancing the Test Science of LLM-enabled Systems: A Survey of Factors and Conditions that Matter Most

Karen O’Brien
Senior Principal Data Scientist and AI/ML Capability Focus Area Lead
Modern Technology Solutions, Inc, Huntsville, AL
Abstract
Large Language Model (LLM)-enabled systems are increasingly integrated into Warfighter applications. While their potential to enhance operational effectiveness, suitability, and survivability is attractive, their limitations demand rigorous testing to ensure mission success. This paper addresses the gap between the LLM testing practices in academia and industry and the need for scientific rigor in DoD test and evaluation (T&E) by identifying an initial set of factors and conditions that drive LLM-enabled system performance, providing a survey of literature with practical test ideas, and suggesting approaches for probing challenges such as model sycophancy, the reversal curse, and scoring. With this information, evaluators can design rigorous experiments that generate evidence for evaluation questions with power and confidence. Special attention is given to tailoring, customization, and the use of subject matter experts in scoring.
Keywords: Scientific Test and Analysis, Large Language Models, Artificial Intelligence, Designed Experiments
Introduction
Large Language Models (LLMs) are large machine learning models that are pre-trained on vast amounts of text data (Zhao et al. 2023). LLMs leverage a type of neural network technology called transformers (Vaswani et al. 2017). Transformers consist of an encoder (the portion that builds a representation of the input text) and a decoder (the portion that generates new text from that encoded representation) with self-attention capabilities. LLMs have become ubiquitous in modern software tools and are often conflated with “Artificial Intelligence (AI).” Their strengths are attractive because of the potential capabilities they could add to the Warfighter, but their limitations are concerning and pose a risk to the mission. Thus, it is essential to test LLM-enabled systems with rigor to inform decision-makers about their contribution to operational effectiveness, suitability, and survivability.
However, there is a gap between the standards for scientific test and analysis techniques set forth by the DoD Director, Operational Test and Evaluation in 2009 (DOT&E 2010), and reiterated in the 2024 updates to DoDI 5000.98 and DoD Manuals 5000.100 and 5000.101, and the status quo of LLM testing in academia and industry. Closing this gap requires a thorough understanding of the factors and conditions that drive experimental outcomes in LLM-enabled systems. With this understanding, evaluators will be able to design rigorous and reusable experiments and studies that provide high-quality evidence for evaluation questions. This paper provides a survey of the literature that informs this objective and offers a starting point for tailoring, as needed.
Technical Content
1. An Introduction to Factors, Conditions (Levels), and Experimental Design
Whether the study is simple or statistically sophisticated, observing and controlling the variables that impact outcomes is a fundamental component of any experimental design.
Factors: Variables that (may) impact the outcome of an experiment. Sometimes called independent or explanatory variables, these are held constant or varied in systematic ways during the experiment. Some factors cannot be controlled, but can be observed, measured, and documented.
Levels: Levels are the values of the factors used in the experiment. They may be quantitative (like air temperature or the brightness of light) or qualitative (color, preference, perceived difficulty). They may also be continuous (like time) or discrete (like mission type).
Conditions: Conditions are a way to describe the level of a factor in a way that is relatable to a stakeholder. The levels of temperature in an experiment might be -40°F (-40°C), 72°F (22°C), and 125°F (52°C), but ‘Arctic in winter,’ ‘room temperature,’ and ‘Death Valley in summer’ conditions are easier to relate to. It is common within the DoD Test and Evaluation (T&E) community to use the terms “levels” and “conditions” interchangeably, even if this is not precise. (The field of statistics uses this term differently: conditions serve as acceptance criteria, or a threshold of data quality that the data collection must achieve to accept the data for analysis.)
Factors and conditions play a role in different types of experimental designs (Dean et al. 2017). In experimental studies such as Design of Experiments, A/B Testing, Hypothesis testing, and similar, the factors’ levels are manipulated in accordance with the experimental design to observe their effects on outcomes. In observational studies or unscripted ‘free play’ test excursions, testers observe the natural variation in factors and conditions as they happen (Rosenbaum 2013). The results are ‘tagged’ or annotated with information about the factors and conditions that were applicable to that outcome. Both approaches are essential for test planning and for post-test analysis, including root cause analysis of unexpected occurrences in test.
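To make the distinction concrete, the sketch below enumerates a small full-factorial test matrix from a set of factors and levels. The factor names and levels are hypothetical placeholders, and a real design would often use a fractional or optimal design rather than the full cross.

```python
# A minimal sketch of enumerating a full-factorial test matrix from factors
# and levels; factor names and levels here are hypothetical placeholders.
from itertools import product

factors = {
    "user_expertise": ["novice", "mid-career", "advanced"],
    "prompt_contains_typo": ["no", "yes"],
    "topic_in_corpus": ["yes", "no"],
}

# Every combination of levels becomes one test condition (before replication).
design = [dict(zip(factors, combo)) for combo in product(*factors.values())]
print(len(design))   # 3 x 2 x 2 = 12 runs
print(design[0])     # e.g., {'user_expertise': 'novice', 'prompt_contains_typo': 'no', 'topic_in_corpus': 'yes'}

# In an observational or free-play excursion, the same factor dictionary can be
# used to tag each observed outcome rather than to schedule runs in advance.
```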
2. Overarching Categories of Factors for LLM-enabled Systems
Because the architectures of LLM-enabled systems vary significantly (Aquino et al. 2025) – ranging from systems that pass user input directly to the LLM to more complex systems that embed the LLM as a component wrapped in layers of guardrails and software interfaces – it is critical to obtain an architecture diagram (Figure 1). Each functional node, and each connection between nodes, is a potential source of evaluation questions, factors, and conditions.
Figure 1: Sample architecture diagram of an LLM-enabled system, specifically a Retrieval-Augmented Generation (RAG) configuration.
For simplicity, we will assume a Retrieval-Augmented Generation (RAG) architecture (Arslan et al. 2024) for this paper. RAGs are common LLM-enabled systems in which the LLM’s response to user prompts is constrained to a knowledge base of documents called a corpus. It is important to acknowledge that, even in the simplest RAG architecture, the number of factors that arise from the short list of factor categories shown in Figure 2 quickly becomes infeasible to test exhaustively. Evaluators must prioritize based on experimental design choices, the envelope of operational conditions, stakeholder needs, and available resources.

Figure 2: Overarching categories of factors that impact LLM-enabled system performance.
The End-Users: The first overarching category of factors for experimental design is focused on the human operating the system – the end-user. The end-user prompts the system with input and users will vary by role, experience, preferences, and cognitive demands. As humans, end-users may make assumptions and leaps of logic. They may be forgetful, imprecise, ambiguous, and may make errors. Their vigilance about prompt engineering discipline or fact checking may wane. They may request forbidden information or test the guardrails of the system.
In addition to capturing demographic information about the end-user (suggestions in Table 1), the variation in user prompting patterns is an important source of operational realism (suggestions in Table 2).
Table 1 Suggestions for Human-centric Factors and Levels
| Human Factors | Levels (tailor as needed) & Categorical Tags |
| Demographics | As required for the study |
| Experience | [measure of time, usually years and months] |
| Domain expertise | [novice], [mid-career], [advanced] |
| Task Success Rate | [Without LLM], [With LLM] |
| Total Time to Complete Task | [Without LLM], [With LLM] |
| Cognitive load | [low], [typical], [high], [extremely high] |
| Available time to complete task (deadline) | [less than typical], [typical], [more than typical] |
Table 2 Suggestions for Prompt-centric Factors and Levels
| Prompt Factors | Levels (tailor as needed) & Categorical Tags |
| Requires reasoning | [see the cognitive decomposition procedure in “Tasks” section] |
| Assumes a context | [yes], [no] x [specified], [not specified] |
| Uses jargon | [niche definition], [synonym], [contextual synonym], [acronym]
Note: A contextual synonym is jargon or shorthand that makes sense only within a specific context. For instance, “A reg” is short for “A regulation” but could also be slang for a law, policy, memo, or other guidance. |
| Error in reasoning | [false supposition], [mistaken facts] |
| Contains Typo | Construct pairs of test items: without typo vs. with typo
[number of typos] x [word criticality – random, stop word, key term] x [typo type – atomic, linguistic, phonetic, keyboard flub, autocorrect, random, other] |
| Contains a Reversal form | Construct pairs of test items: A → B vs. A ← B |
| Requests forbidden information | [unintentional leakage], [intentional/adversarial] |
| Contains multiple instructions | [1-n instructions to follow] |
It is important when designing user-centric experiments to consider the unintended consequences of LLM training and how they manifest given the variability in users and user prompting styles. One such concern is model sycophancy (Perez et al. 2023; Sharma et al. 2023), which arises from reinforcement learning from human feedback (RLHF). The models are incentivized to agree with humans (and to be excessively agreeable), treating all information provided by the user as correct. The same process also contributes to models confabulating incorrect answers and presenting them in a compelling, confident way.
Evaluators can design a test excursion that explores an LLM-enabled system’s ability to handle errors when doing so contradicts the model’s RLHF training (Ranaldi and Pucci 2023) by including a statistically useful sample of prompts in which (a sketch of such probe items follows the list):
- The correct answer is “no”.
- The prompt contains an incorrect assertion (false supposition) which requires the model to contradict the user.
- The prompt contains negation logic. (“except”, “but not”, “other than”, etc.)
- The corpus does not contain the answer, and the appropriate response is “I don’t know” or a similar non-answer.
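A minimal sketch of how such probe items might be organized is below. The prompt text, tags, and expected behaviors are hypothetical illustrations, not items from a validated instrument.

```python
# Hypothetical probe items for the four error-handling cases listed above.
# Prompt text and expected behaviors are placeholders for illustration only.
probe_items = [
    {"probe": "answer_is_no",
     "prompt": "Does regulation X authorize procedure Y?",
     "expected_behavior": "answers 'no' rather than agreeing with the user"},
    {"probe": "false_supposition",
     "prompt": "Since regulation X was repealed last year, what replaced it?",
     "expected_behavior": "corrects the false premise (X was not repealed)"},
    {"probe": "negation_logic",
     "prompt": "List the approved suppliers other than supplier Z.",
     "expected_behavior": "returns a list that excludes Z"},
    {"probe": "not_in_corpus",
     "prompt": "What is the maintenance interval for system Q?",
     "expected_behavior": "responds 'I don't know' or declines to answer"},
]

# Replicate each probe type enough times (with varied wording) to support the
# desired statistical power, then score and summarize success rates by probe type.
```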
Another artifact of LLM training that can become problematic for end-users is described by Berglund et al. (2023) and Wu et al. (2024) as the “reversal curse.” The performance of current-generation autoregressive LLMs (such as ChatGPT, Claude, etc.) may vary as a function of the sequence of the information in the prompt, independent of common ‘prompt engineering’ tricks. Because the embedding models are generated by reading sentences forward, a model trained on A → B may not successfully answer A ← B. Consider a test excursion of question pairs (A → B and a corresponding A ← B version) and compare performance across the two forms. Some nuance in scoring may be required for A ← B forms because there may be more than one successful answer.
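A minimal sketch of constructing such forward/reverse pairs is below; the fact template is a hypothetical placeholder.

```python
# A minimal sketch of constructing forward/reverse question pairs for a
# reversal-curse probe; the fact used here is a hypothetical placeholder.
facts = [("Document ABC-123", "the interim safety waiver for system Q")]

pairs = []
for a, b in facts:
    pairs.append({"direction": "A->B",
                  "prompt": f"What does {a} establish?",
                  "reference": b})
    pairs.append({"direction": "A<-B",
                  "prompt": f"Which document establishes {b}?",
                  "reference": a})

# Score both members of each pair and compare success rates by direction.
# The reverse form may admit more than one acceptable answer, so use a rubric
# or SME judgment rather than exact-match scoring.
```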
The Tasks: The end-user will use the LLM-enabled system to perform a variety of tasks, which constitute the next overarching category of factors to consider in test design. Tasks may serve as the dependent variables. Task success will be impacted by variations in the other factors (independent variables) in the mission context. Tasks vary by cognitive complexity, the number of steps to perform, and the time and resources available to perform the tasks. The definition of ‘success’ will also vary. Some tasks that humans perform may not be feasible for the LLM-enabled system.
For each user task, perform a task decomposition and interpret tasks into constituent cognitive tasks (Knisley et al. 2021, Coffey and Herholz 2013). Organize and rank the tasks in terms of cognitive complexity. Although there is no consensus on a best framework for doing so, Kelly et al. (2015) and Bloom’s Revised Taxonomy (Guinet 2024, Krathwohl 2002) are practical starting points. A notional example is provided in Figure 3. Each category on Bloom’s taxonomy becomes an evaluation question around which an evaluator can build rigorous experiments.
It is important to acknowledge, in considering the concept of reasoning in marketing claims about LLMs and frameworks like Bloom’s Taxonomy, that there is not a consensus on the measurement and evaluation of human cognition. This is an area of active research and lively debate (Brown 2006, Calvo and Gomila 2009, Gage and Baars 2019).
Figure 3: Notional tasks mapped to Bloom’s Revised Taxonomy categories
There is nevertheless a methodological gap between the research question, “Do LLMs reason?” (of interest to academia and the frontier LLM development labs) and the practical T&E question, “How well do LLM-enabled systems, as part of a human-machine team, support user tasks that require reasoning?” Reasoning ‘benchmark’ test packages often do not generalize to the real world or predict LLM-enabled system performance on tasks that require human reasoning (Nezhurina et al. 2024). Because an LLM-enabled system that does not reason the way humans do can still provide satisfactory support to an end-user, evaluators are encouraged to build user task-centric test excursions to characterize system performance in each category of tasks.
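A minimal sketch of grouping decomposed user tasks into taxonomy categories, each of which then anchors an evaluation question and a task-centric test excursion, is below. The task list and category assignments are hypothetical.

```python
# A minimal sketch of grouping decomposed user tasks by Bloom's Revised
# Taxonomy category; the task list and category assignments are hypothetical.
from collections import defaultdict

task_to_category = {
    "Define an acronym":             "Remember",
    "List applicable regulations":   "Remember",
    "Summarize a policy change":     "Understand",
    "Compare two courses of action": "Analyze",
    "Draft a new course of action":  "Create",
}

by_category = defaultdict(list)
for task, category in task_to_category.items():
    by_category[category].append(task)

# Each populated category becomes one evaluation question around which a
# test excursion can be built, e.g.:
for category, tasks in by_category.items():
    print(f"How well does the system support {category}-level tasks? ({len(tasks)} tasks)")
```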
The Documents: Assuming a RAG-type LLM system with a document corpus, the operationally realistic documents that humans use to perform their tasks will vary widely by type, format, and content. This variation serves as another overarching category of factors for consideration. Documents may be formatted in a way that is not machine-readable or relies on formatting and symbolism to communicate information. They may contain incorrect information, may evolve over time, or become obsolete. Sometimes documents contradict each other, and human judgement is required to rank documents in terms of their authoritativeness or correctness.
At the time of this writing, most commercially available LLMs were trained on pure text. The operationally realistic documents that end-users will use are generally not pure text and the full meaning of a page or slide is rarely encoded entirely as text. Figure 4 illustrates some of the common document features that can challenge LLM-enabled systems, including non-machine-readable illustrations and the interruptions in the text caused by headers, footers, and footnotes. Table 3 lists some of the elements that contribute to challenging formats, much of which can be captured as metadata about the documents and used as factors in testing.

Figure 4: A sample page from an operationally realistic document that may pose challenges to an LLM-enabled system.
Table 3 Suggestions for Document-centric Factors and Levels
| Document Factors | Levels (Tailor) & Categorical Tags |
| File Type | [.txt text], [.doc Document], [.pdf Document], [.xls Spreadsheet], [.ppt PowerPoint], etc. |
| Machine Readability | [Digital from outset], [High quality OCR scan], [low-quality OCR scan], [mixed OCR] |
| Source | [Gov’t], [Academia], [Industry], [Media], etc.… |
| File Date | [Date] |
| Authority Hierarchy | [Law, EO] > [Direction, Regulation] > [Policy] > [Manual], etc.… |
| Objectivity | [custom scale – objective vs. opinion] |
| Distribution Controls | [Controlled/unreleasable], [publicly releasable], [copyrighted], etc… |
| Columns | [1-n], [mixed] |
| Text Formatting
(bold, italic, size, color, etc…) |
[encodes meaning], [decorative] |
| Text arrangements
(callout boxes, block quotes, bumper stickers) |
[encodes meaning], [decorative] |
| Document Organization | [Outline hierarchy], [Section Headers, no Outline], [Navigation aids/cues] |
| Tables | [simple], [complex] x [OCR’d text], [non-OCR’d text] x [portrait orientation], [landscape orientation] x
[discussed in the text], [not discussed in the text] |
| Graphs (communicates data) | [line graphs], [bar charts], [etc] x [discussed in the text], [not discussed in the text] |
| Diagrams | [geospatial/maps], [diagrams], [flowcharts] x [discussed in the text], [not discussed in the text] |
| Images | [photograph], [illustration] x [with text], [without text] x [discussed in the text], [not discussed in the text] |
| Schedules | [timelines], [calendars], [roadmaps], [status], [plans] x [discussed in the text], [not discussed in the text] |
| Reference Management | [Reference section], [Footnotes], [End notes] |
| Connection to other documents | [current], [obsolete] x [replaces document ___], [replaced by document ___] |
Table 4 includes additional factors that can be used to engineer corpus complexity by creating opportunities for the system to provide successful non-answers, such as “I don’t know,” rather than confabulate an incorrect answer. This includes distractor documents that contain frivolous, obsolete, incorrect, and/or contradictory information; documents that contain style guidance and talking points; and similar documents from a different organizational context. Table 4 also reserves a factor for topics that are not included in the corpus at all. Prompts requesting information that is guaranteed not to be in the corpus allow evaluators to explore the system’s guardrails, designed amnesia, and confabulation (hallucination) rate; a scoring sketch follows Table 4.
Table 4: Suggestions for Corpus-based Factors and Levels
| Factors | Levels (tailor as needed) |
| Topic is included in corpus | [yes], [no] |
| Document contains incorrect information | [yes], [no] [obsolete], [contradicted], [error], [opinion], [misinformation], [poison] |
| Document is distractor | [yes], [no]
If yes, [relevant in different context], [requires reasoning to discern], [not relevant] |
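As referenced above, a minimal sketch of scoring out-of-corpus prompts for confabulation is below, keyed to the Table 4 factor “topic is included in corpus.” The tagged records are hypothetical placeholders.

```python
# A minimal sketch of estimating a confabulation rate from scored results
# tagged with the Table 4 factor "topic is included in corpus"; the records
# below are hypothetical placeholders for illustration.
results = [
    {"topic_in_corpus": "no",  "response_type": "non-answer"},  # desired: IDK/DTA
    {"topic_in_corpus": "no",  "response_type": "answer"},      # confabulation
    {"topic_in_corpus": "yes", "response_type": "answer"},
]

out_of_corpus = [r for r in results if r["topic_in_corpus"] == "no"]
confabulated = [r for r in out_of_corpus if r["response_type"] == "answer"]
rate = len(confabulated) / len(out_of_corpus) if out_of_corpus else float("nan")
print(f"Confabulation rate on out-of-corpus prompts: {rate:.2f}")
```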
The System: The system consists of the LLM itself plus the software that wraps the LLM with additional functionality, and together these provide an additional category of factors. Small changes to the system can result in noticeable changes to system performance. Thus, while a system should be held fixed during an individual test excursion, the experiments should be repeated to track performance changes as the system updates or to compare multiple configurations of a system. Consider developing a sufficient sample size and test design to enable meaningful comparisons of the system to itself over time.
The Answers: A potentially surprising overarching category of factors is the answers that the LLM-enabled system provides. The answers, or responses, will vary by the number of steps (hops) required to assemble a response, how many different sources must be synthesized, and how well the system handles ambiguity and document readability. The system may produce answers that are infeasible, incorrect, or inconsistent. The system may be designed to provide a non-answer to certain inputs, particularly in the case where there is no valid answer to be found in the corpus or where the prompt violates a safety guardrail. Ultimately, we desire satisfactory responses from an LLM-enabled system, and the definition of satisfactory is highly dependent on the task, the context, and the definitions of mission success. Suggestions for response-centric factors are in Table 5.
Table 5 Suggested Response-centric Factors
| Factors | Levels / Category Tags (tailor as needed) |
| Is successful | Scoring should vary but be tailored. ‘Successful’ can mean a correct answer, a satisfactory answer (scored by jury), or an appropriate non-answer (‘I don’t know’ (IDK); ‘Decline to answer’ (DTA)). |
| Is consistent (consistent output given repeated, identical input) | [Acceptable on a scale of 1-n] x [Consistent on a scale of 1-n] x [selected prompts] |
| Is a negative | [yes], [no]
If yes, [the answer is “no”], [the answer contradicts the user], [the answer corrects the user], [the answer handles negation logic] |
| Is ambiguous | [due to synonyms], [requires disambiguation], [due to temporal reasoning], [due to logical reasoning, bridging logic] x Scale of 1-5 or use a semantic distance measure. |
| Is derived from symbolism | [text formatting], [text design], [icons, logos], [imagery], [data displays], [color], etc. |
| Is in the style of | [Journalistic], [Technical], [Editorial], [Academic], [something non-professional like poetry], etc. |
| Requires a decision about which document to use (ranking) | [levels of authoritativeness], [levels of recency/obsolescence], [contradictory docs] |
| Is obtained from a single document | [single page], [spans multiple pages], [spans multiple pages with interrupting text (headers, footers, footnotes)], [multiple places in a document], [requires logic due to footnotes], etc. |
| Must be synthesized from multiple documents (“multi-hop”) | [2], [2 or more], [2 or more with contradicting information or versions], etc.
(particularly helpful when synthesizing with talking points) |
A common desire by stakeholders is that the LLM-enabled system perform in a consistent way, even going so far as to request “near deterministic behavior” in which the system provides the same output when given the same input. Laban et al. (2023) provide a simple methodology that can be included in a test design to characterize the consistency in the system’s responses. Laban et al. (2023) further probe for model sycophancy (Ranaldi and Pucci 2023; Sharma et al. 2023) by adding text such as “Are you sure?” and “Please check your facts and try again” to induce the model into changing the response, potentially flip-flopping from a successful response to an unsuccessful one.
One option for testing consistency is to look for trends as a function of the cognitive complexity of the end-user task. Lower-level cognitive tasks (such as define a vocabulary word or assemble a list) may be consistently correct or incorrect, but higher-level tasks (such as summarize) may be more variable. The variability may not be problematic provided the model does not flip-flop between acceptable and unacceptable answers (Figure 5). There may be many successful ways to summarize a complex topic. Figure 5 plots the number of unique responses (y-axis) to a prompt that is repeated a number of times (x-axis) and scored as a binary ‘successful’ or ‘unsuccessful.’

Figure 5: Plots of acceptable and unacceptable levels of response variability.
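Building on that repeated-prompt methodology, the sketch below counts unique responses and flags flip-flops between successful and unsuccessful scores. The functions query_system and score_response are hypothetical stand-ins for the system under test and the evaluator’s scoring rubric.

```python
# A minimal sketch of a repeated-prompt consistency check. query_system and
# score_response are hypothetical stand-ins for the system under test and the
# evaluator's pass/fail rubric (SME jury, exact match, etc.).
def consistency_check(prompt, n_repeats, query_system, score_response):
    responses, scores = [], []
    for _ in range(n_repeats):
        r = query_system(prompt)                  # identical input on every repetition
        responses.append(r)
        scores.append(bool(score_response(r)))   # True = successful
    return {
        "unique_responses": len(set(responses)),
        "success_rate": sum(scores) / n_repeats,
        # A flip-flop is any adjacent pair that switches between success and failure.
        "flip_flops": sum(1 for a, b in zip(scores, scores[1:]) if a != b),
    }
```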
3. Considerations for Using Benchmarks in T&E
A standard approach to testing LLMs is the use of benchmarks, such as the General Language Understanding Evaluation (GLUE) benchmark (Wang et al. 2019). Benchmarks are prepared sets of test items and metrics that measure aspects of an LLM’s performance. They are useful for comparing LLMs (and at times, LLM-enabled systems) to each other in a reproducible way. Their use in academia and industry is analogous to DoD developmental testing. They may have utility in operational-type experiments on a case-by-case basis. Benchmarks are not required for a successful experiment supporting DoD T&E.
Benchmarks should be examined for utility and appropriateness prior to inclusion in an experimental design. The Stanford BetterBench assessment (Reuel-Lamparth et al. 2024) provides a framework for assessing benchmark quality. The assessment includes 46 best practices across a benchmark’s lifecycle. Suggested assessment criteria include whether domain experts and the literature were consulted in developing the benchmark; the inclusion of metrics such as floor and ceiling values, human performance level, and random performance level; the availability of the code; calculations of statistical significance or uncertainty quantification; documentation; the ability to be gamed; and the like.
In addition, benchmarks are subject to benchmark contamination. Contamination happens when the benchmark test items are included in the LLM training data (DeepLearning.AI 2024, Zhang et al. 2024), which results in overstating system performance because the system got to peek at both the question and the answer. While methods are emerging to detect the degree of contamination (Li et al. 2024), a custom set of test items created by the evaluators will mitigate this challenge.
4. Considerations for Developing Useful Sample Sizes
A key component of adequate scientific test and analysis is to obtain sufficient samples from testing to enable statistical power (to detect differences if there are differences to detect) and confidence (acceptable levels of uncertainty) in the resulting analysis (DOT&E 2013). Adequate power and confidence in the test findings support decision-makers by informing them of the strengths and limitations of the test. Many considerations go into sampling choices, including the test goal and the way the results are scored. Binary scoring is a special case that impacts experimental design choices for LLM-enabled systems.
Binary scoring is a common scoring methodology used across academia and industry for grading LLM performance: a prompt-and-response pair is successful or not successful, pass or fail. This approach enables simple summary metrics such as accuracy, precision, and F1. However, as this simplified approach to scoring converges on the concept of Bernoulli trials, the statistical assumptions that underlie such trials become important (Siegrist 2022). Two criteria must be met:
- The trials must be independent. Earlier trials cannot impact subsequent trials. Check to ensure the LLM’s context window is not influencing subsequent prompts.
- The trials must be interchangeable. Develop a useful sample size of equivalent trials by controlling factors and conditions to enable statistical power and confidence for analysis.
Many LLM performance metrics do not pause to consider the impact of an uncontrolled (non-i.i.d.) Bernoulli trial test design (Olivetti et al. 2015). Uncontrolled trials can negatively impact the findings of the test by inducing misleading confidence intervals (Varoquaux and Colliot 2023) and error bounds (Cook 2021), and by impeding root-cause analysis. Hussain et al. (2018) outline measures, such as stratification, that can be applied to non-i.i.d. designs to help correct for the lack of control.
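As one illustration of the power-and-confidence calculation for binary scoring, the sketch below uses the standard normal-approximation formula for comparing two proportions. The planning values are hypothetical, and the calculation assumes the independence and interchangeability criteria above have been met.

```python
# A minimal sketch of sizing a two-condition comparison with binary (pass/fail)
# scoring, using the standard normal-approximation formula for two proportions.
# Planning values are hypothetical; the formula assumes independent,
# interchangeable (i.i.d.) Bernoulli trials as discussed above.
from math import sqrt, ceil
from scipy.stats import norm

def n_per_condition(p1, p2, alpha=0.05, power=0.80):
    """Prompts per condition needed to detect p1 vs. p2 with two-sided alpha and the given power."""
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_a * sqrt(2 * p_bar * (1 - p_bar))
                 + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

# Detecting a drop from a 90% to an 80% success rate between two conditions:
print(n_per_condition(0.90, 0.80))   # roughly 200 prompts per condition
```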
Another common practice is to use other LLMs as judges of the responses (Samuylova 2025). Using an LLM to score results accelerates the scoring process. LLM-as-Judge may be suitable for low-level cognitive tasks (such as recall or list) for which there is exactly one correct answer; however, for more complex cognitive tasks, such as summarization, reasoning, and disambiguation, where more judgement and nuance are required, human subject matter experts (SMEs) may be a better choice to score responses. Two recent studies provide a template for using a jury of SMEs to score responses on summarization tasks (BBC 2025; Jaźwińska and Chandrasekar 2025). Both studies include their survey questions and scoring methodology, which can be implemented in developmental and operational testing with little modification.
Finally, caution is further warranted when using the LLM-as-Judge method because the judge LLM, itself, is not error-free. Thus, the resulting score is a function of both the judge’s and the system’s probabilities of being correct and must be couched as a conditional probability (Steyer and Nagel 2017). An LLM-as-Judge may also introduce a bias that further distorts the scores (Wang et al. 2024). Uncertainty quantification under these circumstances is still an area of active research. Using a jury of SMEs can mitigate these challenges.
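One way to make the conditional relationship concrete is the standard misclassification (Rogan-Gladen) correction sketched below. This is an illustration only, not a method prescribed by the cited studies; it assumes the judge’s sensitivity and specificity have been estimated against a SME-scored sample, and the numbers are hypothetical.

```python
# A minimal sketch of correcting an LLM-as-Judge pass rate for judge error using
# the standard misclassification (Rogan-Gladen) correction. Illustrative only;
# sensitivity/specificity must come from a SME-scored sample, and the example
# numbers below are hypothetical.
def corrected_pass_rate(observed_rate, judge_sensitivity, judge_specificity):
    """
    observed_rate     : fraction of responses the judge labels successful
    judge_sensitivity : P(judge passes | response truly successful)
    judge_specificity : P(judge fails  | response truly unsuccessful)
    """
    denom = judge_sensitivity + judge_specificity - 1
    if denom <= 0:
        raise ValueError("Judge is no better than chance; correction is undefined.")
    p = (observed_rate + judge_specificity - 1) / denom
    return min(max(p, 0.0), 1.0)   # clamp to the valid probability range

# If the judge passes 70% of responses with sensitivity 0.90 and specificity 0.80,
# the corrected system success rate is about 0.71.
print(corrected_pass_rate(0.70, 0.90, 0.80))
```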
Conclusions
With this initial framework of factors and conditions to consider in a test of an LLM-enabled system, evaluators can take initial steps to bridge the gap between common LLM testing practices and the DoD expectations for scientific rigor in testing. Although the factors and conditions listed herein are just a starting point, they enable a wide variety of study designs ranging from observational studies to designed experiments with concomitant statistical power and confidence. The LLM technologies will continue to evolve but the need to understand their contribution to mission success when used by Warfighters remains constant. The methodologies described here require a substantial up-front investment to design appropriate tests for the LLM-enabled systems, given requirements and available resources. However, once created, the test design can be reused to monitor the performance of LLM-enabled systems over time. This frees DoD T&E professionals from the constraints of laboratory-style test instruments and developmental test methods and enables scientific operational testing of LLM-enabled systems.
References
Aquino, G. D. A. E., Raimundo Gomes, and Israel Gondres Torné. 2025. “From RAG to Multi-Agent Systems: A Survey of Modern Approaches in LLM Development.” Preprints. https://doi.org/10.20944/preprints202502.0406.v1
Arslan, Muhammad, Hussam Ghanem, Saba Munawar, and Christophe Cruz. “A Survey on RAG with LLMs.” Procedia Computer Science 246 (2024): 3781-3790. https://doi.org/10.1016/j.procs.2024.09.178.
Berglund, Lukas, et al. “The Reversal Curse: LLMs Trained on ‘A Is B’ Fail to Learn ‘B Is A.’” arXiv, 4 Apr. 2024, arXiv:2309.12288v3 [cs.CL].
British Broadcasting Corporation. “Representation of BBC News Content in AI Assistants.” About the BBC, 11 Feb. 2025, www.bbc.co.uk/aboutthebbc/documents/bbc-research-into-ai-assistants.pdf.
Brown, Keith, editor. Encyclopedia of Language & Linguistics. Elsevier Science, 2006
Calvo, Paco, and Antoni Gomila, editors. Handbook of Cognitive Science: An Embodied Approach. Oxford: Elsevier, 2009. doi:10.1016/B978-0-08-046616-3.X0001-1.
Coffey, Emily B J, and Sibylle C Herholz. “Task decomposition: a framework for comparing diverse training models in human brain plasticity studies.” Frontiers in human neuroscience vol. 7 640. 8 Oct. 2013, doi:10.3389/fnhum.2013.00640
Cook, John. “Approximating Sum of Independent but Not iid Bernoulli RVS.” Applied Mathematics Consulting Blog, 23 June 2021, www.johndcook.com/blog/2021/06/23/sum-bernoulli-not-iid/.
Dean, Angela, Daniel Voss, and Danel Draguljić. Design and analysis of Experiments. Cham, Switzerland: Springer, 2017.
DeepLearning.AI. “Benchmark Tests Are Meaningless: The Problem with Training Data Contamination in Machine Learning.” The Batch, 4 Nov. 2024, www.deeplearning.ai/the-batch/the-problem-with-benchmark-contamination-in-ai/.
Gage, Nicole M., and Bernard J. Baars. Fundamentals of Cognitive Neuroscience: A Beginner’s Guide. 2nd ed., Academic Press, 2019. doi.org/10.1016/C2014-0-03767-7.
Guinet, Gauthier, et al. “Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation.” arXiv, 22 May 2024, arXiv:2405.13622v1 [cs.CL].
Guha, Neel, et al. “Legalbench: A collaboratively built benchmark for measuring legal reasoning in large language models.” Advances in Neural Information Processing Systems 36 (2023): 44123-44279.
Hussain, Zahid, et al. “A Stratified Randomized Response Model for Sensitive Characteristics Using Non-Identical Trials.” Communications in Statistics – Theory and Methods, vol. 49, no. 1, 2018, pp. 99-15. Taylor & Francis Online, https://doi.org/10.1080/03610926.2018.1530791
Jaźwińska, Klaudia, and Aisvarya Chandrasekar. “Ai Search Has a Citation Problem.” Columbia Journalism Review, 6 Mar. 2025, www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php.
Kelly, Diane, et al. “Development and Evaluation of Search Tasks for IIR Experiments Using a Cognitive Complexity Framework.” Proceedings of the 2015 International Conference on The Theory of Information Retrieval, Association for Computing Machinery, 2015, pp. 101–110, https://doi.org/10.1145/2808194.2809465.
Knisely, Benjamin M., et al. “A Cognitive Decomposition to Empirically Study Human Performance in Control Room Environments.” International Journal of Human-Computer Studies vol. 141, 2020, p. 102438, https://doi.org/10.1016/j.ijhcs.2020.102438.
Krathwohl, D. R. “A Revision of Bloom’s Taxonomy: An Overview.” Theory Into Practice, vol. 41, no. 4, 2002, pp. 212–218, https://doi.org/10.1207/s15430421tip4104_2.
Laban, Philippe, et al. “Are you sure? challenging llms leads to performance drops in the flipflop experiment.” arXiv preprint arXiv:2311.08596 (2023).
Li, Yucheng, Frank Guerin, and Chenghua Lin. 2024. “LatestEval: Addressing Data Contamination in Language Model Evaluation through Dynamic and Time-Sensitive Test Construction.” Proceedings of the AAAI Conference on Artificial Intelligence 38: 18600-18607. https://doi.org/10.1609/aaai.v38i17.29822.
Magesh, Varun, et al. “Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools.” arXiv, 30 May 2024, arXiv:2405.20362v1 [cs.CL].
Nezhurina, Marianna, et al. “Alice in Wonderland: Simple Tasks Reveal Severe Generalization and Basic Reasoning Deficits in State-Of-the-Art Large Language Models.” NeurIPS 2024 Workshop on Scientific Methods for Understanding Deep Learning.
Nguyen, Hung T., and Gerald S. Rogers. “Confidence Intervals for a Bernoulli θ.” Fundamentals of Mathematical Statistics: Probability for Statistics, Springer New York, 1989, pp. 150-165. DOI: 10.1007/978-1-4612-1013-9_19.
Office of the Director, Operational Test and Evaluation. “Guidance on the Use of Design of Experiments (DOE) in Operational Test and Evaluation.” Memorandum, October 19, 2010.
Office of the Director, Operational Test and Evaluation. “Best Practices for Assessing the Statistical Adequacy of Experimental Designs used in Operational Test and Evaluation.” Memorandum, July, 2023.
Office of the Director, Operational Test and Evaluation. Test and Evaluation Master Plans and Test and Evaluation Strategies. DOD Manual 5000.100, December 9, 2024.
Office of the Director of Operational Test and Evaluation. Operational Test and Evaluation and Live Fire Test and Evaluation of Artificial Intelligence-Enabled and Autonomous Systems. DOD Manual 5000.101, December 9, 2024.
Olivetti, E., S. Greiner, and P. Avesani. “Statistical Independence for the Evaluation of Classifier-Based Diagnosis.” Brain Informatics, vol. 2, no. 1, Mar. 2015, pp. 13-19. Springer Link, doi:10.1007/s40708-014-0007-6.
Perez, Ethan, et al. “Discovering Language Model Behaviors with Model-Written Evaluations.” Findings of the Association for Computational Linguistics: ACL 2023, edited by Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, Association for Computational Linguistics, 2023, pp. 13387-13434. Toronto, Canada. ACL Anthology, https://aclanthology.org/2023.findings-acl.847/. doi:10.18653/v1/2023.findings-acl.847.
Ranaldi, Leonardo, and Giulia Pucci. “When large language models contradict humans? large language models’ sycophantic behaviour.” arXiv preprint arXiv:2311.09410 (2023).
Rosenbaum, Paul R. Design of observational studies. New York: Springer, 2013.
Samuylova, Elena. “LLM-as-a-Judge: A Complete Guide to Using LLMS for Evaluations.” Evidently AI – Open-Source ML Monitoring and Observability, July 23, 2025. https://www.evidentlyai.com/llm-guide/llm-as-a-judge.
Siegrist, Kyle. “11.1: Introduction to Bernoulli Trials.” Statistics LibreTexts, Libretexts, 24 Apr. 2022, stats.libretexts.org/Bookshelves/Probability_Theory/Probability_Mathematical_Statistics_and_Stochastic_Processes_(Siegrist)/11%3A_Bernoulli_Trials/11.01%3A_Introduction_to_Bernoulli_Trials.
Sharma, Mrinank, et al. “Towards understanding sycophancy in language models.” arXiv preprint arXiv:2310.13548 (2023).
Steyer, Rolf, and Werner Nagel. Probability and conditional expectation: Fundamentals for the empirical sciences. Chichester, West Sussex: John Wiley & Sons, Inc, 2017.
Vaswani, Ashish, et al. “Attention is All You Need.” Advances in Neural Information Processing Systems, edited by I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, vol. 30, Curran Associates, Inc., 2017, https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
Wang, Alex, et al. “GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding.” arXiv, 22 Feb. 2019, arXiv:1804.07461v3 [cs.CL].
Wang, Peiyi, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Lingpeng Kong, Qi Liu, Tianyu Liu, and Zhifang Sui. 2024. “Large Language Models are not Fair Evaluators.” In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 9440–9450. Bangkok, Thailand: Association for Computational Linguistics.
Wu, Da, Jingye Yang, and Kai Wang. “Exploring the reversal curse and other deductive logical reasoning in BERT and GPT-based large language models.” Patterns 5.9 (2024).
Yu, Zhiyuan, et al. “Don’t Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models.” arXiv, 26 Mar. 2024, arXiv:2403.17336v1 [cs.CR].
Varoquaux, Gaël, and Olivier Colliot. “Evaluating Machine Learning Models and Their Diagnostic Value.” Machine Learning for Brain Disorders, edited by Olivier Colliot, Humana, 2023. NCBI Bookshelf, doi:10.1007/978-1-0716-3195-9_20.
Zhang, Hugh, et al. “A Careful Examination of Large Language Model Performance on Grade School Arithmetic.” Advances in Neural Information Processing Systems, edited by A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, vol. 37, NeurIPS, 2024.
Zhao, Wayne Xin, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min et al. “A survey of large language models.” arXiv preprint arXiv:2303.18223 1, no. 2 (2023).
Author Biographies
Karen O’Brien is a retired 20-year Army civilian and former Chief Evaluator (for reliability and data science) at the Army Evaluation Center where she evaluated the effectiveness, suitability, and survivability of emerging technologies during a time of rapid change, rapid fielding, and non-traditional testing. Now a senior principal data scientist and AI/ML Focus Area lead at Modern Technology Solutions, Inc, she supports diverse government agencies in advancing their AI/ML initiatives.
Dewey Classification: L 681 12

