DECEMBER 2025 | Volume 46, Issue 4

Information Technology (IT) System Reliability and Availability Testing

Tom Roltsch

ManTech International Corporation, Herndon, VA
tom.roltsch@mantech.com

DOI: 10.61278/itea.46.4.1006

Abstract

This paper outlines the methodologies, metrics, and probability distributions used to model interarrival times of failures for IT systems. The often-used Classical MTBF Test is compared to four Bayesian methods, and the results are summarized. The analysis also examines how the discrimination ratio, formed from the objective and threshold MTBF requirements, influences test design. Each test method is analyzed for its advantages and disadvantages, with a focus on balancing test efficiency and risk. The approaches discussed include Classical MTBF, MTBF50, MTBF50 Sequential Probability Ratio Test (SPRT), and Bayesian methods using Gamma and Beta distributions. These methods are shown to reduce test durations by 40% to 60% compared to traditional approaches while maintaining low risk to both the producer and the consumer.

This paper also discusses the implications of these methodologies for education and training, and identifies future research directions and test methods, such as fuzz testing, to address challenges posed by increasingly complex IT systems and evolving cybersecurity threats. By incorporating automated testing with operational testing under realistic load conditions, both IT system reliability and operational availability can be accurately estimated. The findings provide actionable insights for test organizations, enabling them to optimize testing processes, reduce acquisition risks, and achieve high levels of reliability and availability for IT systems.

Keywords: Software Reliability test; Software MTBF test; IT System Reliability test; Beta-Binomial Test; MTBF50 Test; Bayesian Test; IT System Acceptance Test; IT System Availability

Symbols and Definitions

Threshold MTBF Requirement: The minimum value of Mean Time Between Failure (MTBF) that the consumer will accept.

Objective MTBF Requirement: The value of MTBF that the consumer wants.

Discrimination Ratio: The ratio of the objective MTBF requirement to the threshold MTBF requirement.

Rate of Occurrence of Failures (ROCOF): Used for repairable systems. It is analogous to, but not the same as, the failure rate for non-repairable systems.

Operational Availability (Ao): The mean availability of a system during its useful life.

MTBF50: A median estimate for MTBF based on evidence or opinion. Evidence, such as previous test data, is less subjective.

Consumer’s Risk: The probability that a system not meeting the threshold requirement will pass the test. Often set at 20% as a tradeoff between test cost and consumer’s risk. Also, it can be expressed as an 80% confidence level.

Producer’s Risk: The probability that a system meeting the objective requirement will fail the test. Often set at approximately 20% to match the consumer’s risk and produce a test that ensures a system meeting the objective requirement for MTBF will pass the test. Also, it can be expressed as an 80% power level.

Introduction

Decades ago, many software engineers, and reliability engineers as well, wondered whether the concept of reliability applied to software. After all, software does not appear to degrade over time or wear out. However, software can have latent defects even if a rigorous software quality model and developmental testing regimen are followed. Latent software defects can cause failure under certain conditions. This was never more apparent than on June 4, 1996. On that day, after the European Space Agency had spent 10 years and $7 billion developing the Ariane 5 rocket, it exploded 34 seconds into its maiden flight along with its payload of four satellites. The root cause of the explosion: software (Shereef 2019).

Malfunctioning software can cause any system to fail, although probably not as dramatically as a rocket exploding. Nowadays, it is accepted that software has reliability. In fact, the Institute of Electrical and Electronics Engineers (IEEE) defines software reliability as, “(A) The probability that software will not cause the failure of a system for a specified time under specified conditions. (B) The ability of a program to perform the required function under stated conditions for a stated period of time. NOTE—For definition (A), the probability is a function of the inputs to and use of the system, as well as a function of the existence of defects in the software. The inputs to the system determine whether existing defects, if any, are encountered” (IEEE 2016, 17). Software defects, also referred to as faults or bugs, can lead to software failures.

IT Systems are prone to emergent behaviors whereby the mix or concatenation of apparently mature and otherwise low-risk constituent parts can create problems during integration, usability, security, or stress testing. (Joiner 2018, 321).

Software can fail in many ways. The Common Weakness Enumeration (CWE) is a list of software defect types that it refers to as “weaknesses,” prioritized by how dangerous they are based on the known vulnerabilities associated with them. In 2024, there were 31,770 records of software defects. Table 1 lists the ten most dangerous software defects, those most often exploited by attackers to compromise IT systems (MITRE Corporation 2024). Reliable software not only produces high operational availability systems but also aids in cybersecurity (MITRE Corporation 2020).

Table 1. 10 Most Dangerous Software Defects From CWE

Defect Rank Description
1 Improper Neutralization of Input During Web Page Generation
2 Out-of-bounds Write
3 Improper Neutralization of Special Elements used in an SQL Command
4 Cross-Site Request Forgery (CSRF)
5 Improper Limitation of a Pathname to a Restricted Directory (‘Path Traversal’)
6 Out-of-bounds Read
7 Improper Neutralization of Special Elements used in an OS Command
8 Use After Free
9 Missing Authorization
10 Unrestricted Upload of File with Dangerous Type

The increasing complexity and interconnectedness of modern IT systems, coupled with the critical need for reliable and available software, necessitate robust reliability testing methodologies. Despite the importance of such testing, there remains a lack of comprehensive guidance on best practices for evaluating software reliability and availability.

This paper addresses this gap by describing and comparing five inferential test methods designed to verify the reliability and availability of new, repairable IT systems. The paper begins by introducing the typical IT system and software development methods and defining reliability, maintainability, and failure. This is followed by an overview of software defects, referred to in the literature as “bugs.” A discussion of the RAM metrics for IT systems is followed by sections on automated testing and operational testing. Finally, the five test methods are introduced and compared, followed by a discussion and conclusion.

The operational and cost benefits of these methods are substantial: they can reduce test durations by 40% to 60% compared to traditional approaches. Additionally, this paper emphasizes the importance of automating software reliability testing to enhance efficiency and accuracy prior to operational testing, and of using new techniques such as fuzz testing. The paper also discusses the implications of these approaches for policy, education, and training, and identifies future trends and research directions in IT system reliability testing. By providing actionable insights and practical methodologies, this paper aims to assist test organizations in optimizing their processes and achieving high levels of operational availability and reliability for IT systems.

Background

The typical IT system consists of a central core software application, along with ancillary software applications, hosted in high-reliability commercial data centers, which together provide service to many users simultaneously through network connectivity. The core software application may also interface with internal or external databases and other IT systems. Figure 1 shows a diagram of a typical IT system (DHS 2025, 48).

Figure 1. Typical IT System

IT systems can be modeled as a combination of a hardware component and a software component. The hardware component comprises one or more data centers. The software component comprises the primary software application and various ancillary software applications. The hardware component of an IT system often has a reliability, in terms of MTBF, of 10,000 hours or more. The dramatic increases in IT system hardware reliability over the past decade, resulting from technological advances and redundancy, have made a traditional MTBF test, and therefore an estimate of operational availability, impractical for the hardware portion of an IT system.

Fortunately, most new IT systems are hosted by commercial web services platforms like Amazon Web Services, Amazon GovCloud, or Microsoft Azure and use a service level agreement (SLA) that states the operational availability that they provide. Commercial web services platforms have years of operational data to demonstrate that they can provide the operational availability specified in the SLA; therefore, the hardware component of the system covered under the SLA does not need to be tested. The focus of the test and evaluation for a new IT system is the software developed for the consumer by the producer and its integration (DHS 2025, 49). For example, if the SLA provides for platform as a service (PaaS), then the producer may only be responsible for the software application and the deployment of the software onto the cloud platform.

“APIs, like any software, are likely to have defects; that is, anything that deviates from the expected functioning of the system. Vulnerabilities are a subset of defects, where the system can be abused (intentionally or unintentionally) to create significant risk for the operation of the system, people, and/or organization. What constitutes risk depends on the software component, data, and/or code in question; the level of risk the organization is willing to tolerate; and other factors. APIs are particularly risky because they serve as system access points. Consequently, an API defect or vulnerability can enable bad actors to enter the system or cause it to behave unexpectedly. Preventing these threats requires continuous testing of the software product to provide evidence that the system works as expected” (Gomez and Vesey 2024, 27).

Fuzzing is an important new technique in software and security testing that involves continuously generating a large number of test cases against target programs to discover unexpected behaviors such as bugs, crashes, and vulnerabilities. Fuzz testing often incorporates AI. Recently, fuzzing has advanced considerably owing to the emergence of new methods and corresponding tools. However, it still suffers from low coverage, ineffective detection of specific vulnerabilities, and difficulty in deploying complex applications (Zhao et al. 2024).

“Fuzz testing stimulates a system or emulation of one with semi-valid inputs in order to determine vulnerabilities. Fuzzing is an important new T&E approach in industry and Government to find software vulnerabilities” (Joiner 2024).

For the reliability testing of the software component of the IT system, the IEEE definition of reliability can be used. “The probability that software will not cause the failure of a system for a specified time under specified conditions. The probability is a function of the inputs to and use of the system, as well as a function of the existence of defects in the software. The inputs to the system determine whether existing defects, if any, are encountered” (IEEE 2016, 17). Thus, the mean time between failure is really the mean time between discovery of a latent defect that causes a failure. The discovery of latent defects is dependent on the number of inputs and the stress on the system. Therefore, the system must be operated near capacity for a predetermined length of time to verify the reliability and availability requirements.

There are two ways to model the software “repair” process. One is to treat each version of software as a non-repairable component of the IT system. Since the software version is non-repairable, it has a mean time to failure (MTTF) and fails only once. When it fails, it is replaced with an updated version (which is not identical to the one it replaces). Alternatively, the software repair process can be modeled the same way as the hardware repair process: the production of an updated software version with the defect eliminated is considered a “repair.” If the latter method is used, then one can assign, per IEEE 1633-2016, IEEE Recommended Practice on Software Reliability, the metrics Mean Time to Software Restore (MTSWR) for software maintainability and Mean Time Between Critical Failures (MTBCF) for software reliability.

Failure categories and chargeability are defined in a Failure Definition and Scoring Criteria (FDSC). An FDSC is necessary prior to testing because objective evaluation of the IT system depends on being able to properly identify what constitutes a failure, identify the severity of the failure, and assign chargeability of that failure to the software component, hardware component, or to external systems as appropriate (DHS 2024, 1). Software failure modes and effects analysis, reliability block diagrams and fault tree analysis will greatly assist in the development of an FDSC.

Software Defects (Bugs)

Historically, defects in software were referred to using the colloquial term “bugs.” Thus, academic journals still use the historic names given to certain types of defects. The reliability testing for the software component of an IT System focuses on two main types of software defects – Bohrbugs and Mandelbugs, either of which could cause a system failure.

The Bohrbug is named after the atomic model developed by Danish physicist Niels Bohr (Rouse 2014). A Bohrbug is a fault that is easy to isolate, and whose manifestation is consistent under a well-defined set of conditions. Once a Bohrbug has caused a failure, it is a relatively simple task to track it down in the software code and to remove it because the same input will always cause the same failure. Failures from Bohrbugs are repeatable if one knows the conditions that resulted in the failure, and they can be fixed by changing the code (Grottke 2016, 70).

Mandelbugs are named for Professor Benoit Mandelbrot, the “Father of Fractal Geometry,” who spent his career working at IBM (Rouse 2016). The behavior of a Mandelbug may appear non-deterministic because the same set of input data will sometimes cause a failure, but sometimes not cause a failure. Therefore, failures caused by Mandelbugs are difficult to reproduce (Grottke 2016, 71). While Bohrbugs can be effectively counteracted through testing and debugging, it is a daunting task to find and remove Mandelbugs because of their elusive nature. Once a Mandelbug has been understood, it can be fixed in the code (Grottke 2016, 71). A special type of Mandelbug is called a Heisenbug after Heisenberg’s Uncertainty Principle. These bugs change their behavior when studied, making them particularly elusive (Grottke and Trivedi 2005, 428).

Another special type of Mandelbug is the aging bug. Aging bugs, such as memory leaks, can cause failure as the system accumulates operating time. Probably the most infamous aging bug was that of the Patriot missile system on February 25, 1991. Due to an aging bug, the Patriot system failed to intercept a Scud missile launched at Saudi Arabia during Desert Storm; the Scud struck a US Army barracks, killing 28 soldiers. The bug was due to a programming error that,

“led to an inaccurate tracking calculation that became worse the longer the system operated. At the time of the incident, the Patriot battery had been operating continuously for over one hundred hours. By then, the inaccuracy was serious enough to cause the system to look in the wrong place for the incoming Scud” (United States General Accounting Office 1992, 1).

The types of aging bugs are summarized in Table 2 (Trivedi et al. 2011, 226).

Table 2. Defects Due to Aging (Aging Bugs)

Aging Rate of Occurrence of Failure (ROCOF) Increases or Performance Degrades Over Time
Memory Leak Memory allocations and objects are not released even if not required.
Cursor Leak Open database cursors are not closed.
TCP Aging Degraded performance after a certain number of Transmission Control Protocol (TCP) connections have been opened.
Numeric Overflow A numeric quantity, such as a sequence number, overflows.
Fragmentation Performance degradation due to the creation of holes in database files with frequent inserts and data purging.
Memory Trampler Shared data structures corrupted by a participant process.

While aging bugs represent only a small fraction of software bugs, they are difficult to detect during testing because they do not manifest themselves until the system has been running for a long time. During testing, even with continuous testing and a dedicated software reliability growth program, it is mainly Bohrbugs that will be discovered.

“The analysis of software anomalies in NASA missions conducted by Grottke et al. revealed that, although Bohrbugs represented most defects, Mandelbugs accounted for a substantial share, in the 20% to 40% range. Moreover, these authors showed that the proportions of Bohrbugs and Mandelbugs for different missions seem to [be] stable around almost the same value. They also found that aging-related bugs represent a non-negligible share of defects, even in long-running mission-critical software (4.4%)” (Grottke et al. 2010, 455).

Although software is unlikely to be “bug-free,” not every software defect will cause a failure. Some defects may never be discovered, and some may cause a failure that is not critical.

Reliability, Availability, and Maintainability (RAM) Metrics for IT Systems

With the software application considered as one component of an IT system that includes at least one hardware component and at least one software component, metrics can be designated at the component and system levels. The hardware component is treated as a known quantity for the purposes of RAM testing. Five inferential tests for software reliability are described below. Inferential tests are not measurements but rather indicate, with some probability or confidence, that the system, as delivered to the consumer, meets the threshold requirement. Most RAM requirements are written at the system level as mean time between failure (MTBF), mean time to repair (MTTR), and operational availability (Ao) with threshold and objective requirements agreed upon by both the consumer and producer prior to the acquisition. Ultimately, the owners and users of any IT system want high Ao. To achieve high Ao, a system must have high reliability, represented by long interarrival times between failures, and high maintainability, represented by the system being restored quickly after a failure. It is safe to assume that the Ao of the service level agreement is the Ao for the hardware component. The hardware component reliability and maintainability could be calculated from the service provider’s records, but it is unnecessary.

For systems that operate less than 24 hours per day, the software operational availability is calculated with the duty cycle included, as shown in Equation [1] (DHS 2025, 46). The software mean time between critical failure (MTBCF) is divided by the duty cycle, which is a number between 0 and 1. A duty cycle equal to 1 represents a system that operates continuously, while a duty cycle of 0.25 represents a system that operates an average of 6 hours per day. The minimum value of Ao occurs when the duty cycle is 1, and Ao increases when the duty cycle is less than 1, as shown in Equation [1]:

Ao = (MTBCF / Duty Cycle) / ((MTBCF / Duty Cycle) + MTSWR)    [1]

A similar calculation can be performed for the hardware component. For example, a commercial web services platform may provide 8,000 hours MTBF and 10 hours MTTR as demonstrated by operating 24 hours per day for many years. At a duty cycle of 0.25, the Ao for the commercial web services platform would be about 99.97%. However, the Ao value with the duty cycle represents the operational availability only for the time that the system is expected to be “on.” The time that the system is “off” is excluded from the calculation for Ao. Therefore, the Ao calculated with the duty cycle does not represent the proportion of time that the system is fully operational; it represents the proportion of the “agreed service time” that the system is fully operational. In practice, software reliability is often so high that very few failures are witnessed during operational testing. Therefore, the metric MTSWR is usually estimated. Methods for estimating MTSWR can be found in IEEE Std 1633-2016, IEEE Recommended Practice on Software Reliability.
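The duty-cycle availability calculation described above can be checked numerically. The sketch below uses a generic restore-time parameter (MTTR for a hardware component, MTSWR for a software component) and the hardware example from the text; the function name is illustrative, not from the cited DHS guidance.

```python
def operational_availability(mtbf_hours, restore_hours, duty_cycle=1.0):
    """Ao with the duty cycle applied: MTBF is divided by the duty cycle,
    so a system that is 'on' for fewer hours accrues fewer failures per
    calendar period, raising Ao over the agreed service time."""
    effective_mtbf = mtbf_hours / duty_cycle
    return effective_mtbf / (effective_mtbf + restore_hours)

# Commercial web services example from the text: 8,000 h MTBF, 10 h MTTR.
ao_continuous = operational_availability(8000, 10)        # duty cycle 1.00
ao_quarter = operational_availability(8000, 10, 0.25)     # ~6 h/day average

print(f"Ao (duty cycle 1.00): {ao_continuous:.4%}")
print(f"Ao (duty cycle 0.25): {ao_quarter:.4%}")  # about 99.97%
```

The duty cycle of 0.25 reproduces the approximately 99.97% figure quoted above for the hardware component.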

Software Development, Agile, DevOps, and Continuous Testing

Agile development is often the approach for software development and delivery for IT programs and projects (DHS 2020, 1). A continuous testing environment exists until the software is deemed ready for an operational test. Each release is tested using automated scripts. The automated scripts can analyze inputs and outputs much faster than manual testing could in real time; thus, a 25-hour automated test can represent the same number of transactions that would be expected in five thousand hours of operation. An estimate for MTBCF can be made by dividing the time equivalent of the continuous test by the number of critical failures observed during the test.

Table 3 shows a system that has five releases and is tested after each release using automated scripts that simulate five thousand hours of operation. After a release, the rate of occurrence of failure (ROCOF) is high because of the defects from the most recent release, plus any defects remaining from previous releases. After the final release is debugged and tested, the simulated MTBCF value can be used to develop an estimate that serves as an input for subsequent operational testing. The system is ready for operational testing after the last release is debugged and the backlog is cleared. Using only the debugged value for MTBCF after the final release to verify the reliability requirement is not recommended by the author because the discovery of latent defects depends on the number of inputs and the stress on the system. IT systems sometimes slow down considerably when operating under a full load. A case in point is the Healthcare.gov system, which was not adequately tested before being fielded. Healthcare.gov's debut on October 1, 2013, ended abruptly when the system crashed with just a few thousand users. The Government Accountability Office determined that Healthcare.gov was launched “without verification that it met performance requirements” (GAO 2013, 25).

Table 3. Testing Following Each Release

Critical Failures Test 1 Test 2 Test 3 Test 4 Test 5
Release 1 11 1 1 0 0
Release 2 4 1 1 0
Release 3 5 2 0
Release 4 7 1
Release 5 4
Cumulative Test Time (h) 5,000 10,000 15,000 20,000 25,000
Cumulative MTBCF 454 625 652 606 657
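The cumulative MTBCF row in Table 3 is simply the cumulative simulated test time divided by the cumulative critical-failure count across all releases. A minimal sketch, with values truncated to whole hours as the table appears to do:

```python
# Critical failures observed per release at each 5,000-hour simulated test.
# Release r first appears at test r, so row r holds tests r through 5.
failures_by_release = [
    [11, 1, 1, 0, 0],   # Release 1, Tests 1-5
    [4, 1, 1, 0],       # Release 2, Tests 2-5
    [5, 2, 0],          # Release 3, Tests 3-5
    [7, 1],             # Release 4, Tests 4-5
    [4],                # Release 5, Test 5
]

hours_per_test = 5000
num_tests = 5

cumulative_mtbcf = []
total_failures = 0
for test in range(num_tests):
    for release, row in enumerate(failures_by_release):
        col = test - release          # column of this release's row, if tested
        if 0 <= col < len(row):
            total_failures += row[col]
    cumulative_time = hours_per_test * (test + 1)
    cumulative_mtbcf.append(int(cumulative_time / total_failures))

print(cumulative_mtbcf)  # [454, 625, 652, 606, 657]
```

This reproduces the cumulative MTBCF row of Table 3 exactly.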

Reliability testing for hardware requires that the configuration be unchanged during the test period. When testing software, configuration changes must be allowed. Each time a software application fails and is “repaired,” it is a new configuration that, hopefully, is more reliable than the previous configuration.

However, software updates, patches, and “repairs” can introduce new defects that could eventually cause a failure. Thus, over time, there may be no increase in reliability as shown in Figure 2 (Eusgeld et al. 2008, 111). The same trend was noted by Carnegie Mellon University’s Peter Feiler, best known as the technical lead and author of the SAE AS-2C Architecture Analysis & Design Language (AADL) standard (Feiler 2017, 9).

Figure 2. Long-Term Software MTBF based on Eusgeld, et al.

Operational Testing

Clearing the backlog (debugging) is an entrance requirement to operational testing. Programs are encouraged to use best practices for software development, such as those found in IEEE Std 1633-2016, IEEE Recommended Practice on Software Reliability. Programs can use a software reliability growth tracking graph, such as the one in Figure 3, to track reliability during the software development process. As shown, defects in the backlog are considered “open.” Defects that have been “fixed” but not tested are considered “resolved.” After a resolved defect has passed regression testing, it is considered “closed.”

Figure 3. IT System Software Reliability Tracking

Inferential tests are designed so that if the producer delivers a system to the consumer that meets the objective requirement or better, the system will likely pass the test. If the producer delivers a system to the consumer that does not meet the threshold requirement, the system will likely fail the test. These concepts can be quantified using the discrimination ratio, consumer’s risk, and producer’s risk. For all the test types, a higher discrimination ratio corresponds to a shorter test with equivalent consumer’s and producer’s risk. To achieve 20% consumer’s risk and 20% producer’s risk, the test time for requirements with a discrimination ratio of three is at least 40% less than the test time for requirements with a discrimination ratio of two. IT systems typically have requirements with discrimination ratios between two and four. A discrimination ratio of three or greater is advantageous because it is more efficient to test. A discrimination ratio of three is used in all the test examples below.
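The effect of the discrimination ratio on test length can be illustrated with a standard fixed-duration test plan under the exponential-failure (Poisson counting) assumption: for each allowable failure count, find the test time that caps the consumer's risk, then take the smallest count whose producer's risk is also acceptable. This is an illustrative plan for comparing discrimination ratios, not necessarily the exact design behind the tests in Tables 4 and 5 (which permit slightly higher producer's risk).

```python
import math

def poisson_cdf(k, mu):
    """P(X <= k) for X ~ Poisson(mu), accumulated term by term."""
    term, total = math.exp(-mu), 0.0
    for i in range(k + 1):
        total += term
        term *= mu / (i + 1)
    return total

def plan_fixed_duration_test(discrimination_ratio, consumer_risk=0.20,
                             producer_risk=0.20):
    """Smallest (failures allowed c, test time T) meeting both risk caps.
    T is expressed in multiples of the threshold MTBF; the system passes
    if c or fewer failures occur in time T."""
    for c in range(100):
        # Bisect for T where P(pass | MTBF = threshold) = consumer_risk.
        lo, hi = 0.0, 1000.0
        for _ in range(200):
            mid = (lo + hi) / 2.0
            if poisson_cdf(c, mid) > consumer_risk:
                lo = mid
            else:
                hi = mid
        t = (lo + hi) / 2.0
        # Producer's risk: P(fail | MTBF = objective = DR * threshold).
        if 1.0 - poisson_cdf(c, t / discrimination_ratio) <= producer_risk:
            return c, t
    raise ValueError("no feasible plan found")

c2, t2 = plan_fixed_duration_test(2)   # discrimination ratio 2
c3, t3 = plan_fixed_duration_test(3)   # discrimination ratio 3
print(f"DR=2: allow {c2} failures, T = {t2:.2f} x threshold MTBF")
print(f"DR=3: allow {c3} failures, T = {t3:.2f} x threshold MTBF")
print(f"time reduction: {1 - t3 / t2:.0%}")
```

Under this plan, moving from a discrimination ratio of two to three cuts the required test time by roughly half, consistent with the "at least 40% less" statement above.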

The type of test and confidence required depends on the acquisition risk and the criticality of the system. An IT system could be considered low risk for acquisition but serves a critical role. If the consumer is willing to accept a higher risk, then inferential tests can be designed at a 70% confidence level to further decrease the test time. For high acquisition risk or critical IT systems, testers use at least an 80% confidence level as a tradeoff between cost and risk. Inferential tests are based on probability and are Go/No-Go tests. They do not measure reliability or MTBF but rather indicate at a prescribed confidence or probability that the reliability or MTBF meets or exceeds the threshold requirement. Good test design also ensures that a system meeting the objective requirement has a high probability of passing the test. That is, if the producer delivers to the consumer what the consumer wants, then the system should pass the test. Figure 4 shows a standard test strategy for verifying the reliability and operational availability of the software component, while Figure 5 shows an accelerated test strategy used for low-risk acquisition.

Figure 4. Standard T&E Process

Figure 5. Low-Risk Acquisition T&E Process

Comparison of Five Test Methods

Depending on the system and risk level, one of the following five test methods can be used to test a new, repairable IT system. Operational tests demand that the IT system operate at or near its maximum user capacity. Automated testing should inform operational testing and evaluation and is essential to IT system development. However, automated testing is not sufficient to determine a system's reliability and operational availability with acceptable accuracy. To get defensible, accurate estimates for the reliability and operational availability of IT systems, the data must incorporate the operational use of the system by actual users (Guertin 2023, 1).

Performance and stress testing of the integrated and usable IT system is necessary to ensure manageable demands during the full range of usage cycles. For the Department of Homeland Security, with over 26 million users and approximately 600,000 authentications daily, it is nearly impossible to reproduce that level of load in virtual environments. Some system errors can only be replicated under load. Due to this scale, it is extremely difficult for testers of IT systems to produce accurate estimates for reliability and operational availability using automated testing or a simulated production environment, even with all of the components integrated, without testing the system while operating under load (Joiner et al. 2018, 327).

1. Classical MTBF Test (Roltsch 2024)

The Classical MTBF test is widely used due to its simplicity: “operate a system for a period, and the sample estimate for MTBF is the operating time divided by the number of failures. An exact confidence interval is applied to the sample estimate using the chi-squared distribution as a pivot to determine the test result. The confidence interval is calculated using the chi-squared distribution, with two times the number of failures plus two failures for the degrees of freedom. Adding two failures to the degrees of freedom in the calculation of the chi-squared probability allows a solution when there are zero failures and makes the test more conservative” (Roltsch 2024). Although this is the least efficient test listed here, IT programs sometimes use it for two reasons: a) While Bayesian reliability test methods can save test time and cost, the program wants a high degree of objectivity and a conservative test result, b) For data analysis from a system that has already been operational for a period long enough to verify the threshold requirement. At an 80% confidence level, this test is at least 40% longer than the other tests listed here and requires the assumption that interarrival times of failures are exponentially distributed.
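The pass criterion described above can be sketched in code. Because the degrees of freedom 2r + 2 are always even, the chi-squared CDF reduces to a Poisson sum, so the quantile can be computed with the standard library alone; this is an illustrative implementation, not the author's tool.

```python
import math

def chi2_cdf_even_df(x, df):
    """Chi-squared CDF for even df = 2k:
    P(X <= x) = 1 - sum_{i<k} e^(-x/2) (x/2)^i / i!."""
    k = df // 2
    term, tail = math.exp(-x / 2.0), 0.0
    for i in range(k):
        tail += term
        term *= (x / 2.0) / (i + 1)
    return 1.0 - tail

def chi2_quantile_even_df(p, df):
    """Invert the CDF by bisection (CDF is increasing in x)."""
    lo, hi = 0.0, 1000.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_cdf_even_df(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def mtbf_lower_bound(test_hours, failures, confidence=0.80):
    """One-sided lower confidence bound on MTBF, using 2r + 2 degrees
    of freedom so that zero-failure tests still yield a bound."""
    return 2.0 * test_hours / chi2_quantile_even_df(confidence, 2 * failures + 2)

def required_test_time(threshold_mtbf, failures_allowed, confidence=0.80):
    """Shortest test time whose lower bound still meets the threshold."""
    df = 2 * failures_allowed + 2
    return threshold_mtbf * chi2_quantile_even_df(confidence, df) / 2.0

# 500-hour threshold, one failure allowed, 80% confidence.
print(round(required_test_time(500, 1)))  # about 1,497 hours
```

The result matches the 1,497-hour Classical MTBF entry in Table 4 for the 500-hour threshold at 80% confidence with one allowed failure.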

2. MTBF50 Test (NIST 2012)

The debugged value from continuous testing after the final release can be used to form an MTBCF50 estimate that functions as a Bayesian prior for a subsequent Classical MTBF test. The MTBCF50 estimate, along with a “weak” prior, is effective in terms of saving test time (NIST 2012, 8.2.5). The drawback of the MTBF50 test is the non-objectivity of the MTBCF50 estimate. The test requires the assumption that interarrival times of failures are exponentially distributed and requires an estimate for MTBCF50 based on the final debugged value after development.

3. MTBF50 SPRT (Roltsch 2021)

The MTBCF50 estimate can also be used as a Bayesian prior for Wald’s traditional Sequential Probability Ratio Test (SPRT). The time to accept is reduced, making the MTBF50 SPRT even more efficient than the traditional SPRT (Roltsch 2021, 24). The test requires the assumption that interarrival times of failures are exponentially distributed and requires an estimate for MTBCF50 based on the final debugged value after development. The MTBF50 SPRT is the most efficient test but has the drawback of non-objectivity in the MTBCF50 estimate.
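For reference, the accept boundary of the traditional (non-Bayesian) SPRT for exponentially distributed interarrival times can be sketched as follows, assuming the usual Wald boundaries A = (1 − β)/α and B = β/(1 − α). Because this plain version carries no MTBCF50 prior, its zero-failure accept time (about 1,040 hours for a 500/1,500-hour test at 20%/20% risks) is longer than the 526 hours shown for the MTBF50 SPRT in Table 4. The function and parameter names are illustrative.

```python
import math

def sprt_accept_time(failures, threshold_mtbf, objective_mtbf,
                     consumer_risk=0.20, producer_risk=0.20):
    """Earliest cumulative operating time at which Wald's SPRT accepts,
    given 'failures' observed so far, under exponential interarrival
    times. Accept when the likelihood ratio for the objective hypothesis
    reaches A = (1 - producer_risk) / consumer_risk."""
    lam0 = 1.0 / threshold_mtbf   # failure rate under the threshold (reject) hypothesis
    lam1 = 1.0 / objective_mtbf   # failure rate under the objective (accept) hypothesis
    ln_a = math.log((1.0 - producer_risk) / consumer_risk)
    # ln LR = r * ln(lam1/lam0) + (lam0 - lam1) * t >= ln A, solved for t.
    return (ln_a + failures * math.log(lam0 / lam1)) / (lam0 - lam1)

# 500 h threshold, 1,500 h objective (discrimination ratio 3), 20%/20% risks.
for r in range(3):
    print(r, round(sprt_accept_time(r, 500, 1500)))
```

Each additional observed failure pushes the accept boundary later, which is why Table 4 reports a range of total test times for the SPRT.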

4. Combination of Classical MTBF tests using the Gamma Distribution as a Conjugate Prior (NIST 2012)

An MTBCF from an operational assessment can be combined with an MTBCF from an operational test using the gamma distribution. This test is very efficient but requires the assumption that interarrival times of failures are exponentially distributed.
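With an exponential failure model, the gamma distribution is conjugate: treating the operational assessment as a gamma prior with shape equal to its failure count and rate equal to its operating hours, the operational test simply adds to both parameters. For an integer shape, the posterior probability that the MTBF meets the threshold reduces to a Poisson sum, so no statistics library is needed. The prior parameterization and the counts below are illustrative choices, not the specific procedure in the cited handbook.

```python
import math

def posterior_prob_mtbf_meets(threshold_mtbf, oa_failures, oa_hours,
                              ot_failures, ot_hours):
    """P(MTBF >= threshold) under a gamma posterior for the failure rate.
    Prior: Gamma(shape=oa_failures, rate=oa_hours) from the operational
    assessment (OA); the operational test (OT) adds its counts and hours.
    For integer shape k, P(lambda <= c) = 1 - sum_{i<k} e^(-mu) mu^i / i!
    with mu = c * rate."""
    shape = oa_failures + ot_failures
    rate = oa_hours + ot_hours
    mu = rate / threshold_mtbf
    term, tail = math.exp(-mu), 0.0
    for i in range(shape):
        tail += term
        term *= mu / (i + 1)
    return 1.0 - tail

# Hypothetical data: 1 failure in 1,000 h (OA) plus 1 failure in 1,000 h (OT),
# against a 500-hour threshold MTBCF.
print(round(posterior_prob_mtbf_meets(500, 1, 1000, 1, 1000), 4))
```

In this hypothetical case the posterior probability exceeds 0.90, so the combined evidence would pass an 80% criterion even though neither test alone is long.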

5. Combination of Binomial tests using the Beta Distribution as a Conjugate Prior (ReliaSoft 2024)

A binomial operational assessment can be combined with a binomial operational test using the Beta Distribution. This test is also referred to as the beta-binomial test or Bayesian non-parametric test. It is very efficient and has two advantages: a) No assumption is made about the distribution of the interarrival time of failures, b) The test is based on trials, not on time, so the test is the same length for any duty cycle.
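The beta-binomial combination can be sketched with the standard library: successes and failures from the operational assessment update a Beta prior, the operational test updates it again, and the posterior probability that per-trial reliability meets the threshold (here exp(−24/500) ≈ 0.953 for 24-hour trials against a 500-hour threshold MTBCF) is found by numerically integrating the posterior density. The counts and the uniform prior below are hypothetical.

```python
import math

def beta_log_pdf(x, a, b):
    """Log density of Beta(a, b) at x in (0, 1)."""
    return ((a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x)
            + math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b))

def prob_reliability_at_least(threshold, a, b, steps=200_000):
    """P(R >= threshold) for R ~ Beta(a, b), by midpoint-rule integration
    (midpoints avoid the endpoints, where the log density can diverge)."""
    width = (1.0 - threshold) / steps
    total = 0.0
    for i in range(steps):
        x = threshold + (i + 0.5) * width
        total += math.exp(beta_log_pdf(x, a, b))
    return total * width

# Hypothetical counts: uniform Beta(1, 1) prior, then 30 successes / 0 failures
# from the OA and 20 successes / 0 failures from the OT.
a = 1 + 30 + 20
b = 1 + 0 + 0
threshold = math.exp(-24.0 / 500.0)   # per-24-hour-trial reliability, ~0.953
p = prob_reliability_at_least(threshold, a, b)
print(round(p, 3))
```

Because the test is stated in trials rather than hours, the same calculation applies unchanged at any duty cycle, which is the second advantage noted above.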

Test Efficiency

Tables 4 and 5 compare the total test time of the five tests. Table 4 shows the total test time by test type for new, continuously operating, repairable IT systems with a discrimination ratio of three, an objective MTBCF requirement of 1500 hours, 20% consumer’s risk, and no more than 26% producer’s risk. Table 5 shows the total test time by test type for new, repairable IT systems with a discrimination ratio of three, an objective MTBCF requirement of 1500 hours, 30% consumer’s risk, and no more than 33% producer’s risk. In Tables 4 and 5, the Combine Binomial test gives a result in terms of reliability, so an equivalent MTBCF is estimated using the exponential distribution. Each trial is taken to be 24 hours long. The threshold MTBCF requirement of 500 hours then gives a threshold reliability requirement of 95.3% over 24 hours.
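The 95.3% figure follows directly from the exponential model, R(t) = exp(-t / MTBCF):

```python
import math

def reliability(mission_hours, mtbcf):
    """Mission reliability under exponentially distributed interarrival times."""
    return math.exp(-mission_hours / mtbcf)

# Threshold MTBCF of 500 hours over a 24-hour trial.
print(round(reliability(24, 500), 3))   # 0.953, the 95.3% quoted above
```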

For the combinations of two tests, the total test time includes both tests. The MTBCF50 estimate for the MTBF50 test and the MTBF50 SPRT is assumed to equal the threshold requirement of 500 hours. If the MTBCF50 estimate were less than 500 hours, the test would be longer; if it were greater than 500 hours, the test would be shorter.

Table 4. Test length for 20% consumer’s risk and no more than 26% producer’s risk for the five types of tests for a system with software MTBCF objective requirement of 1500 hours and a discrimination ratio of three. Duty cycle 100%.

Total Test Time | Test Type | Number of Failures Allowed | Input from Previous Test
1151 h | MTBF50 | 1 | MTBCF50 estimate of 500 hours from continuous testing
526 h to 1926 h | MTBF50 SPRT | 0 to 3 | MTBCF50 estimate of 500 hours from continuous testing
1497 h | Classical MTBF | 1 | None
805 h | Combine MTBF | 1 | Operating time and number of failures from OA
35 trials (840 h if 24 h trials) | Combine Binomial | 1 | Number of successes and failures from OA

Table 5. Test length for 30% consumer’s risk and no more than 33% producer’s risk for the five types of tests for a system with software MTBCF objective requirement of 1500 hours and a discrimination ratio of three. Duty cycle 100%.

Total Test Time | Test Type | Number of Failures Allowed | Input from Previous Test
255 h | MTBF50 | 0 | MTBCF50 estimate of 500 hours from continuous testing
217 h to 1299 h | MTBF50 SPRT | 0 to 2 | MTBCF50 estimate of 500 hours from continuous testing
602 h | Classical MTBF | 0 | None
602 h | Combine MTBF | 1 | Operating time and number of failures from OA
26 trials (624 h if 24 h trials) | Combine Binomial | 1 | Number of successes and failures from OA

The Combine MTBF and Combine Binomial tests cannot be designed for zero failures; however, the system passes if zero or one failure occurs. The Classical MTBF test, even with a discrimination ratio of three, has a high producer’s risk when designed for zero failures; thus it must generally be designed with at least one failure allowed at an 80% confidence level. At a 70% confidence level, however, the Classical MTBF test can be designed for zero failures with acceptable risks. The MTBF50 SPRT is a continuous test that can be stopped as soon as the system meets the test criteria. A reliable system will pass this test quickly and an unreliable system will fail it quickly, but a system between the threshold and objective requirements may need the maximum test time.
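These risk figures can be checked directly from the Poisson model. For a fixed-length test of T hours that passes with at most c failures, the acceptance probability at true MTBCF θ is the Poisson CDF at c with mean T/θ; consumer’s risk is the acceptance probability at the threshold, and producer’s risk is one minus the acceptance probability at the objective. A sketch reproducing the Classical MTBF entries of Tables 4 and 5:

```python
import math

def accept_prob(test_hours, max_failures, mtbcf):
    """Poisson probability of observing at most max_failures in test_hours."""
    m = test_hours / mtbcf
    return sum(math.exp(-m) * m ** k / math.factorial(k)
               for k in range(max_failures + 1))

# Table 5: 602-hour, zero-failure test (threshold 500 h, objective 1500 h).
consumer = accept_prob(602, 0, 500)        # ~0.30 consumer's risk
producer = 1 - accept_prob(602, 0, 1500)   # ~0.33 producer's risk

# Table 4: 1497-hour, one-failure test.
consumer_t4 = accept_prob(1497, 1, 500)        # ~0.20 consumer's risk
producer_t4 = 1 - accept_prob(1497, 1, 1500)   # ~0.26 producer's risk
```

Designing a test then amounts to choosing T and c so that the consumer’s risk at the threshold equals the target, and checking that the resulting producer’s risk at the objective is tolerable.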

Conclusions

Statistical Test and Analysis (STAT) training for IT testers is needed because technology and mathematics keep pushing forward and enabling more efficient ways to validate system performance. IT systems are ideally suited for automated testing because they have no moving parts. A resident run-time program can continuously test the system. IT system testing will continue to evolve as the synthesis of mathematics and technology continues to improve. However, the question, “does it work in the real world?” will need operational testing to get an answer, at least for the foreseeable future.

This paper provides five inferential methodologies for testing the reliability and operational availability of new, repairable IT systems. Five inferential test methods, including Classical MTBF, MTBF50, MTBF50 SPRT, Gamma Combination of Classical MTBF tests, and Beta Combination of Binomial tests, have been described and compared.

The paper also introduced the idea that a discrimination ratio of three between the objective and threshold MTBF requirements, rather than the often-used ratio of two, makes for more efficient testing; the author recommends this practice.

Tables 4 and 5 clearly show the savings in test time over the Classical MTBF test. Combining these inferential test techniques along with a discrimination ratio of three appears to be a sweet spot for inference. Using modern inferential approaches can optimize test efficiency while maintaining confidence in system performance and keeping consumer’s and producer’s risks low.

The analysis highlights that these methods can reduce test durations by 40% to 60% compared to the traditional approach.

The paper underscores the importance of emerging trends in automated testing, such as fuzz testing, as a critical component of software development. Reliability and cybersecurity are intertwined but separate metrics: a software defect (a reliability issue) can lead to a cybersecurity failure, and a cybersecurity failure unrelated to any software bug can in turn cause the system to fail, which is a reliability issue.

Automation enhances efficiency and is great for finding Bohrbugs, but it is not sufficient on its own; operational testing remains essential to assess the system’s reliability and operational availability under real-world conditions.

Future research should focus on addressing limitations in current test methodologies, exploring emerging trends such as fuzz testing, and developing innovative solutions to tackle challenges posed by complex IT systems and evolving cybersecurity threats.

By adopting the strategies outlined in this paper, test organizations can reduce acquisition risk, improve cost efficiency, and ensure the delivery of reliable and highly available IT systems that meet the demands of modern users and operational environments.

References

DHS. 2025. Department of Homeland Security, Reliability, Availability, and Maintainability (RAM) Test and Evaluation Supplemental Guidance, Version 4.0.

DHS. 2024. Department of Homeland Security Test and Evaluation Supplemental Guidance Failure Definition Scoring Criteria Version 3.0, April 2024.

DHS. 2020. Department of Homeland Security. Agile Method for Software Development and Delivery for Information Technology. DHS Directives System, Instruction Number: 102-01-004 Revision Number: 02.

Eusgeld, I., F. Fraikin, M. Rohr, et al. 2008. Software Reliability. In Eusgeld, I., F. C. Freiling, and R. Reussner (eds.), Dependability Metrics. Lecture Notes in Computer Science, vol. 4909.

Feiler, Peter. 2017. Improvements in Safety Analysis for Safety-critical Software Systems. Software Solutions Symposium, Carnegie Mellon University.

Ferreira, F. H., E. Y. Nakagawa, and R. P. dos Santos. 2021. Reliability in Software-intensive Systems: Challenges, Solutions, and Future Perspectives. Paper presented at the 47th Euromicro Conference on Software Engineering and Advanced Applications (SEAA).

Gomez, A., and A. Vesey. 2024. On the Design, Development, and Testing of Modern APIs. https://insights.sei.cmu.edu/library/on-the-design-development-and-testing-of-modern-apis/ (accessed October 15, 2025).

Grottke, Michael. 2016. Recovery From Software Failures Caused by Mandelbugs. IEEE Transactions on Reliability. 65 (1): 70-87. doi: 10.1109/TR.2015.2452933 (accessed July 9, 2025).

Grottke M. and K. Trivedi. 2005. Software Faults, Software Aging and Software Rejuvenation. The Journal of Reliability Engineering Association of Japan. 27 (7): 425-438.

Grottke, M., A. P. Nikora and K. S. Trivedi. 2010. An empirical investigation of fault types in space mission system software. IEEE/IFIP International Conference on Dependable Systems & Networks. 447-456.

Guertin, N., and A. Goodman. 2023. Operational Test and Evaluation (OT&E) for Rapid Software Development. The ITEA Journal 44 (1).

IBM. 2025. Heritage. Benoit Mandelbrot. https://www.ibm.com/history/benoit-mandelbrot (accessed July 9, 2025).

IEEE. 2016. Institute of Electrical and Electronics Engineers Standard 1633-2016, IEEE Recommended Practice on Software Reliability.

Joiner, K. 2024. Review of Fuzz Testing to Find System Vulnerabilities. The ITEA Journal 45 (4).

Joiner, K. F., A. Ghildyal, N. Devine, A. Laing, A. Coull, and E. Sitnikova. 2018. Four Testing Types Core to Informed ICT Governance for Cyber-resilient Systems. International Journal of Advances in Security 11.

McQuade, M., M. Medin, and R. Murray. 2018. Defense Innovation Board Do’s and Don’ts for Software (draft). https://media.defense.gov/2018/Oct/09/2002049593/-1/-1/0/DIB_DOS_DONTS_SOFTWARE_2018.10.05.PDF (accessed October 15, 2025).

Mitre Corporation. 2024. 2024 CWE Top 25 Key Insights. https://cwe.mitre.org/top25/archive/2024/2024_key_insights.html (accessed July 9, 2025).

Mitre Corporation. 2020. CWE-1306: CWE Category: CISQ Quality Measures – Reliability (4.17) (accessed July 9, 2025).

NIST. 2012. NIST/SEMATECH e-Handbook of Statistical Methods. https://www.itl.nist.gov/div898/handbook/ (accessed July 10, 2025).

ReliaSoft. 2024. Reliability Test Design. ReliaSoft Life Data Analysis Reference. https://help.reliasoft.com/reference/life_data_analysis/lda/reliability_test_design.html#Reliabil (accessed July 11, 2025).

Roltsch, Tom. 2024. The Robust Classical MTBF Test. The ITEA Journal 45 (2).

Roltsch, Tom. 2021. Bayesian MTBF50 Sequential Probability Ratio Test. The ITEA Journal 42 (1).

Rouse, M. 2014. Bohr Bug. https://www.techopedia.com/definition/18011/bohr-bug (accessed July 9, 2025).

Rouse, Margaret. 2016. Mandelbug. https://www.techopedia.com/definition/31977/mandelbug (accessed July 9, 2025).

Trivedi, Kishor, Rajesh Mansharamani, Dong Seong Kim, et al. 2011. Recovery from Failures Due to Mandelbugs in IT Systems. Proceedings of the 2011 IEEE 17th Pacific Rim International Symposium on Dependable Computing. 224–233.

United States General Accounting Office. 1992. Software Problem Led to System Failure at Dhahran, Saudi Arabia.

United States General Accounting Office. 2013. Healthcare.gov: Ineffective Planning and Oversight Practices Underscore the Need for Improved Contract Management.

Weiss, J., and D. Patt. 2022. Software Defines Tactics: Structuring Military Software Acquisitions for Adaptability and Advantage in a Competitive Era. https://www.hudson.org/national-security-defense/software-defines-tactics-structuring-military-software-acquisitions (accessed October 15, 2025).

Zhao, H. Qu, J. Xu, X. Li, W. Lv, and G. G. Wang. 2024. A Systematic Review of Fuzzing. Soft Computing 28 (6): 5493-5522.

Author Biographies

Tom Roltsch is a senior reliability engineer for ManTech International Corporation. He serves as a reliability subject matter expert for the Office of Test and Evaluation at the Department of Homeland Security Science and Technology Directorate. He has worked in test and evaluation since 2000 and specialized in RAM test and evaluation for 11 years. Tom holds a Bachelor of Science degree in physics from Virginia Military Institute, a Master of Interdisciplinary Studies in Energy and Sustainability from George Mason University, and a Master of Science in Government Contracts from George Washington University Business School.

ISSN: 1054-0229, ISSN-L: 1054-0229
Dewey Classification: L 681 12
