SEPTEMBER 2025 | Volume 46, Issue 3
IN THIS JOURNAL:
- Issue at a Glance
- Chairman’s Message
Technical Articles
- Kernel Model Validation: How To Do It, And Why You Should Care
- Confidence-Based Skip-Lot Sampling
- Eucalyptus – An Analysis Suite for Fault Trees with Uncertainty Quantification
- Digital Twins in Reliability Engineering: Innovations, Challenges and Opportunities
- Competence Measure Enhanced Ensemble Learning Voting Schemes
- Advancing the Test Science of LLM-enabled Systems: A Survey of Factors and Conditions that Matter Most
- Beyond Accuracy: Evaluating Bayesian Neural Networks in a Real-world Application
- Balancing Structure and Flexibility: Evaluating Agile, Waterfall, and Hybrid Methodologies in Aerospace and Defense Projects
Workforce of the Future
- Building Confidence, Interest, and Opportunity: A Social Cognitive Career Theory-Based Analysis of the Young Women in Engineering Outreach Program
News
- Association News
- Chapter News
- Corporate Member News
Competence Measure Enhanced Ensemble Learning Voting Schemes

Francesca R. McFadden
Department of Mathematics and Statistics, University of Maryland, Baltimore County; Baltimore, MD, USA
Abstract
Ensemble methods comprise multiple individual base models, each producing a prediction. Voting schemes are used to combine, fuse, or select among the predictions of the base models. A well-formed ensemble should be built from models with diverse assumptions, e.g., differing underlying training data, feature space selection, and therefore decision boundaries. Diversity across the models is an advantage, but it is not fully exploited by existing simple voting schemes. The purpose of the described approach is to enhance current voting schemes by considering base model competence. A base model's vote on a given input is counted only if the model achieves a threshold competence score for that input. This approach augments confidence-based schemes by ensuring that inputs are consistent with the prediction space of the base models. An application using random forest classifiers demonstrates that integrating competence score estimation into ensemble learning leads to better performance than the highest-confidence selection strategy.
Keywords: Ensemble learning, voting scheme, model competence, explainability, trust score
Introduction
| Classifier Model | Class A Confidence | Class B Confidence | Acceptance Threshold |
| --- | --- | --- | --- |
| Classifier 1 | 0.6 | 0.4 | 0.5 |
| Classifier 2 | 0.3 | 0.7 | 0.8 |

Table 1: Example of two classifier predictions on an input data point.
To describe candidate voting schemes, we use the example shown in Table 1 where two classifiers have two possible class label predictions, Class A and Class B. Suppose the same data point is input into each classifier. Class label predictions are typically supplied along with associated confidence probabilities. The corresponding confidence in each of the class label predictions is shown, e.g., Classifier 1 predicts the input is Class A with confidence of 0.6. In some cases, analysis may have been performed to restrict use of a model when it is not sufficiently confident in a prediction. For example, the model’s outcomes would not be used if the confidence does not meet or exceed a given acceptance threshold.
| Sample Voting Scheme Description | Outcome |
| --- | --- |
| The class with the highest confidence prediction is selected | Class B |
| The class with the highest average confidence is selected | Class B |
| Confidence must meet or exceed an acceptance threshold | Class A |
| Prior analysis was done to select the results of one model | Class A |

Table 2: Examples of voting schemes.
Table 2 presents four sample voting schemes. Each row contains two columns – a description of a voting scheme and the outcome of the voting. Even for an example with two models and two possible class labels, there are differences in the ensemble prediction depending on the voting scheme. The four schemes are explained from top to bottom.
- In one common scheme, the ensemble prediction is the class corresponding to the highest confidence across any of the base classifiers. In the Table 1 example, 0.7 is the highest confidence, so the ensemble would yield Class B.
- Another common fusion option is to choose the class with the highest average confidence as the ensemble prediction. In the Table 1 example, the Class A average is 0.45 and the Class B average is 0.55, indicating that Class B is the ensemble prediction. A weighted average may also be applied so that one classifier’s vote counts more than others in the ensemble.
- Since confidence values may not be normalized across models, acceptance thresholds may be applied to filter out votes from insufficiently confident models. If the acceptance thresholds in Table 1 are applied, the Class B prediction from Classifier 2 (confidence 0.7, below its 0.8 threshold) would not be considered. Hence, only the Classifier 1 prediction remains and the ensemble prediction is Class A.
- In disagreements, prior analysis may have been performed to select the outcome of a model. For example, it may be that Classifier 2 is known to have high false positives for Class B. Therefore, in disagreements where Classifier 1 predicts Class A and Classifier 2 predicts Class B, the Classifier 1 Class A decision is used as the ensemble prediction.
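The first three schemes above can be sketched in a few lines. This is an illustrative Python sketch (the article's own workflow used MATLAB), with the Table 1 numbers plugged in; the function names are my own, not the author's:

```python
def highest_confidence(preds):
    """preds: list of (label, confidence), one per classifier.
    Pick the single most confident vote across all classifiers."""
    return max(preds, key=lambda p: p[1])[0]

def highest_average(preds_per_class):
    """preds_per_class: dict mapping label -> confidences across classifiers.
    Pick the class with the highest mean confidence."""
    return max(preds_per_class,
               key=lambda c: sum(preds_per_class[c]) / len(preds_per_class[c]))

def threshold_filtered(preds, thresholds):
    """Drop votes whose confidence is below that classifier's acceptance
    threshold, then take the most confident remaining vote (None if none remain)."""
    kept = [(label, conf) for (label, conf), t in zip(preds, thresholds) if conf >= t]
    return max(kept, key=lambda p: p[1])[0] if kept else None

# Table 1: Classifier 1 -> (A, 0.6), Classifier 2 -> (B, 0.7)
preds = [("A", 0.6), ("B", 0.7)]
print(highest_confidence(preds))                             # B
print(highest_average({"A": [0.6, 0.3], "B": [0.4, 0.7]}))   # B (0.55 > 0.45)
print(threshold_filtered(preds, [0.5, 0.8]))                 # A (0.7 < 0.8, so Classifier 2 is dropped)
```

Each function reproduces the corresponding Table 2 outcome, which is the transparency property the article highlights: an end user can trace exactly which vote produced the ensemble prediction.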
We refer to a voting scheme as simple when it is straightforward to understand which models were used in the prediction, how each model voted, and, when applicable, what characteristics caused one model to be preferred as a decider. The voting schemes in Table 2 are simple voting schemes. An advantage of each of these simple schemes is that an end user has transparency into how the base model predictions were combined, or selected from, to yield the ensemble prediction.
Motivation
Diversity among models in an ensemble will result in a more robust solution, better prediction performance, and lower errors (Polikar, 2006). Accordingly, a well-formed ensemble should include models with various assumptions, e.g., differing training data, feature space selection, learning algorithms, training parameters, and decision boundaries (Polikar, 2006). Base models may have different prediction granularity, where one model offers higher specificity, e.g., distinguishing between breeds of dog, than others in the ensemble. In such cases, broader or more general class labels may be mapped to appropriately during the ensemble combination process. Complementary features, class representation, and training data distribution across base models are an advantage, but they are not learned or exploited by the simple schemes in the Introduction (Kundu, 2021).
On the other hand, more advanced network-based techniques do aim to exploit these complementary traits, but often at the cost of explainability. The web-based Essential Guide to Ensemble Learning provides an overview of several ensemble learning techniques with illustrations (Kundu, 2024). More advanced approaches apply a training framework or network wrapping a set of base models. Brief descriptions of common network-based fusion techniques are provided to demonstrate the increasingly advanced nature of the techniques:
- In a bagging parallel ensemble, T ≥ 2 subsamples are drawn from the training data, with replacement. Each subsample is used to train a base model. The model predictions are aggregated through majority voting.
- In a boosting sequential ensemble, three models are formed. The first model is trained on a random sample of the available data. The second model’s training data is resampled such that half comes from the set of points the first model predicted correctly and the other half from the set it predicted incorrectly. The third model is trained on points where the other two disagree. The model predictions are aggregated through majority voting.
- In a stacked ensemble, T ≥ 2 base models are trained on a random sample of the available data. Their predictions are used to train a meta-model.
- Mixture of Experts ensembles apply weights to combinations of the predictions from T ≥ 2 base models, each trained on one of T subsamples of the training data. A gating network, often a neural network, learns the weights.
It is expected that the base models are diverse and correct “most of the time” (Polikar, 2006). As the ensembles become more advanced, their potential to exploit or learn where the models are complementary increases. The bagging ensemble should produce less variance in prediction, meaning it is less sensitive to the particular training set drawn. In practice, T may be on the order of one hundred or more for bagging ensembles, each model built with an independently sampled training set. Meta-models have an opportunity to learn when false positives or false negatives may come from one of the base models, improving the ensemble prediction.
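The bagging procedure described above can be sketched compactly. This is an illustrative, self-contained Python sketch: the one-nearest-neighbor base learner and the toy one-dimensional data are my assumptions (the article's ensembles use random forests), chosen only to make bootstrap resampling and majority voting concrete:

```python
import random
from collections import Counter

def bagging_predict(train, x, T=101, train_model=None, seed=0):
    """Train T base models on bootstrap resamples of `train` (sampling with
    replacement) and majority-vote their predictions on input x."""
    rng = random.Random(seed)
    votes = []
    for _ in range(T):
        # bootstrap resample: same size as the training set, drawn with replacement
        sample = [train[rng.randrange(len(train))] for _ in range(len(train))]
        model = train_model(sample)
        votes.append(model(x))
    # majority vote across the T base models
    return Counter(votes).most_common(1)[0][0]

def one_nn(sample):
    """Toy 1-nearest-neighbor 'model': predict the label of the closest point."""
    return lambda x: min(sample, key=lambda p: abs(p[0] - x))[1]

train = [(0.1, "A"), (0.2, "A"), (0.9, "B"), (1.0, "B")]
print(bagging_predict(train, 0.15, train_model=one_nn))  # "A"
```

An individual bootstrap sample may happen to contain no point near the input, but the majority over T resamples smooths such cases out, which is the variance-reduction effect noted above.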
While the more advanced techniques may be able to learn complementary traits of the classifiers, they lose transparency in how base model decisions are weighed to produce a decision and result in loss of explainability to end users. This paper proposes and demonstrates an approach for enhancing ensemble learning performance through integration of model competence scores in a simple voting scheme, aiming to exploit complementary traits of classifiers while preserving explainability to end users. To explain the approach, an overview of model competence score estimation is provided.
Model Competence Estimation
Model competence estimators enable detection of when inputs to a machine-learned model are outside of its prediction space. Implicitly, model competence estimators indicate how well a model is expected to perform on inputs potentially outside of its training set.
An influential approach to point-wise classification model competence estimation came from the 2019 Neural Information Processing Systems (NeurIPS) conference. The approach was titled “Accurate Layerwise Interpretable Competence Estimation,” and appropriately the algorithm is referred to as ALICE (Rajendran & LeVine, 2019). The ALICE approach produces a competence score indicating whether the model will be competent to predict the true class f(x) of an input x. The ALICE concept extends confidence to incorporate additional mechanisms that may introduce uncertainty into a prediction. Specifically, the ALICE competence score is a product of probability factors accounting for distributional, data, and model uncertainties. The variable D is the event that the input x is in-distribution; the in-distribution probability p(D | x) accounts for distributional uncertainty. An error function E is applied to evaluate the distance from f(x) to each one-hot label c_j for each class j = 1, 2, …, k the model may produce. A user-set error threshold, δ, is applied to the error function to indicate model uncertainty. For each class j = 1, 2, …, k, p(c_j | x, D) is the in-class probability for an input x and accounts for the data uncertainty. The resulting ALICE competence score is

C(x) = p(D | x) · Σ_{j=1}^{k} p(E(f(x), c_j) < δ | c_j, x, D) · p(c_j | x, D)

Equation 1: ALICE Competence Score
The competence scores are compared to a user-set threshold τ ∈ (0, 1), and the model is deemed competent for values above it. Both the error threshold δ and the overall competence score threshold τ often require expert judgement to set.
The resulting scores may be employed in a workflow to indicate when the user should trust a model based on its prediction space. Moreover, they may indicate whether the model could ever predict a class label corresponding to the true class of an input. The ALICE concept applies to classifier models. While other competence estimation or similarly inspired techniques may be considered, we employ the ALICE method to estimate model competence to demonstrate the following approach. Namely, the score will be used to indicate to the user when and where to trust a base model for a given input.
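A simplified reading of Equation 1 can be sketched numerically. This is an illustrative Python sketch, not the ALICE authors' implementation: it treats the model-uncertainty factor as an indicator (error below δ or not), uses an L2 error function, and takes the softmax outputs as stand-ins for the in-class probabilities p(c_j | x, D), all of which are my simplifying assumptions:

```python
import math

def alice_competence(softmax_probs, in_dist_prob, delta=0.5):
    """Sketch of an ALICE-style score: p(D|x) times the sum, over classes whose
    prediction error against the one-hot label c_j falls below delta, of the
    in-class probability p(c_j | x, D)."""
    k = len(softmax_probs)
    score = 0.0
    for j in range(k):
        one_hot = [1.0 if i == j else 0.0 for i in range(k)]
        # L2 error between the model output f(x) and the one-hot label c_j
        err = math.sqrt(sum((p - o) ** 2 for p, o in zip(softmax_probs, one_hot)))
        if err < delta:                  # model uncertainty: indicator of E < delta
            score += softmax_probs[j]    # data uncertainty: p(c_j | x, D)
    return in_dist_prob * score          # distributional uncertainty: p(D | x)

# Class 0 output is close to its one-hot label (err ~0.141 < 0.5), class 1 is not.
score = alice_competence([0.9, 0.1], in_dist_prob=0.95, delta=0.5)
print(round(score, 3))  # 0.855 (= 0.95 * 0.9)
```

The two user-set quantities noted in the text, δ here and the overall threshold τ applied to the returned score, are exactly the knobs the later threshold-selection discussion is concerned with.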
Approach
The purpose of the described concept is to enhance ensemble learning voting scheme approaches through consideration of base model competence scores, ensuring that the input data are appropriate to the prediction space of the individual base models. While the approach may extend to other types of models, the discussion and results focus on classifier models with a common, finite number of class labels in their prediction space.
For a given input x to an ensemble, the predictions of each base model are considered. When all models agree, meaning there is consensus, the unanimous label is the ensemble prediction. When there is non-consensus, the described approach considers individual base model competence for the input x to filter which base model predictions are counted in the voting for the ensemble prediction. A base model is counted in the voting for the specified input only if it achieves a threshold model competence score, with each base model i prescribed its own threshold τi for i = 1, 2, …, T. The error thresholds δi may also be tailored to each base model. This approach augments confidence-based schemes by ensuring that inputs are consistent with the training data and overall prediction space of the individual models. If non-consensus remains after the competence score filtering step, traditional simple voting schemes from Table 2 may be applied among the remaining models.
As an example, suppose both models shown in Table 1 were assigned a corresponding competence score threshold. The two models do not agree since Classifier 1 predicted Class A and Classifier 2 predicted Class B, meaning there was non-consensus. Therefore, we consider the competence scores of each model. We consider three alternatives:
- Suppose not all base models are deemed competent. In that case, the predictions from the base models that are not competent are not considered in voting. As an example, suppose Classifier 1 had a competence score above threshold while Classifier 2 did not. Therefore, we filter out the vote of Classifier 2 and only the Classifier 1 vote of Class A remains. Then, the ensemble prediction is Class A.
- Suppose all models are deemed competent for the input. Then non-consensus remains after the competence score filtering step. Now, one of the techniques in Table 2 could be applied. In this case, the highest confidence prediction among all models is then selected resulting in ensemble prediction Class B.
- Suppose no base model is deemed competent. Similarly, non-consensus will remain after the filtering step and one of the techniques in Table 2 could be applied. In this case, the highest confidence prediction results in Class B.
Simple threshold filtering and voting techniques maintain transparency for end users into which model predictions are used and when filtering occurs. With the addition of vote filtering based on base model competence, complementary traits of the base models, such as their training sets, should be leveraged while preserving explainability to end users.
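The decision flow above, consensus check, competence filtering, then a simple fallback scheme, can be sketched as follows. This is a minimal illustrative Python sketch of the proposed logic (names and the highest-confidence fallback choice are mine; the article leaves the fallback scheme open):

```python
def ensemble_predict(votes, competences, thresholds):
    """votes: list of (label, confidence), one per base model.
    competences / thresholds: per-model competence scores and cutoffs tau_i."""
    labels = {label for label, _ in votes}
    if len(labels) == 1:
        return labels.pop()              # consensus: unanimous label wins
    # non-consensus: keep only models deemed competent for this input
    kept = [v for v, c, t in zip(votes, competences, thresholds) if c >= t]
    if not kept:
        kept = votes                     # no competent model: fall back to all votes
    # simple fallback scheme among remaining votes: highest confidence
    return max(kept, key=lambda v: v[1])[0]

votes = [("A", 0.6), ("B", 0.7)]
print(ensemble_predict(votes, [0.55, 0.25], [0.5, 0.2]))  # both competent -> "B"
print(ensemble_predict(votes, [0.55, 0.10], [0.5, 0.2]))  # only Classifier 1 -> "A"
print(ensemble_predict(votes, [0.10, 0.10], [0.5, 0.2]))  # none competent  -> "B"
```

The three printed cases correspond one-to-one to the three alternatives enumerated above, and each branch remains traceable, which is the explainability claim the approach rests on.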
Application Data Set
To demonstrate the approach, the Behavioral Risk Factor Surveillance System (BRFSS) 2015 survey (CDC, 2015) was used. The Centers for Disease Control and Prevention (CDC) surveyed Americans from all fifty states and three US territories on health and lifestyle risk factors. Historically, the survey has been conducted annually.
The cleaned version of the data set from Kaggle (Teboul, 2022), including 253,680 interviews, was employed in the workflow. The data includes twenty-one features including a mixture of feature types with quantitative and qualitative responses. The feature types in the data include binary, e.g., smoker or not, integer, e.g., body mass index (BMI), and categorical scale, e.g., a general health score. The general health score is reported on a scale where 1 represents ‘excellent,’ 2 represents ‘very good,’ 3 represents ‘good,’ 4 represents ‘fair,’ and 5 represents ‘poor.’
Approximately 84% (213,703 of 253,680) of surveys reported no diabetes. A binary label was used: 0 for no diabetes or only gestational (during pregnancy) diabetes, and 1 for prediabetes or diabetes. Pearson’s correlation coefficients were calculated between each feature and the binary diabetes indicator; the ten features with the highest absolute correlations are presented in Table 3. The top five features correlated to the diabetes indicator are a general health score, indication of high blood pressure, body mass index (BMI), indication of difficulty walking, and indication of high cholesterol.
| BRFSS 2015 Feature | Correlation |
| --- | --- |
| General Health Score (1-excellent to 5-poor) | 0.30 |
| High Blood Pressure (BP) (0-low, 1-high) | 0.27 |
| Body Mass Index (BMI) | 0.22 |
| Difficulty Walking (0-no, 1-yes) | 0.22 |
| High Cholesterol (0-low, 1-high) | 0.21 |
| Age in Years | 0.19 |
| Heart Disease (HD) or Attack (0-no, 1-yes) | 0.18 |
| Income Scale (1- < $10K to 8- ≥ $75K) | -0.17 |
| Physical Activity within 30 days (0-no, 1-yes) | -0.12 |
| Prior Stroke (0-no, 1-yes) | 0.10 |

Table 3: Pearson correlation coefficients to the binary diabetes indicator.
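The feature screening behind Table 3 can be sketched as follows. This is an illustrative Python sketch with a tiny made-up stand-in for the BRFSS data (the feature values and labels below are invented for demonstration, not drawn from the survey):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy stand-in for the BRFSS screening: rank features by |r| against the label.
features = {"bmi": [22, 31, 35, 24, 40], "income": [8, 3, 2, 7, 1]}
label = [0, 1, 1, 0, 1]  # binary diabetes indicator
ranked = sorted(features, key=lambda f: abs(pearson(features[f], label)), reverse=True)
print(ranked)
```

Ranking by absolute value matters because negatively correlated features, like the income and physical activity rows in Table 3, are just as informative as positively correlated ones.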
Application Ensemble
Two random forest classifiers, each with one hundred trees, were produced using different features and training data subsets. The first classifier, Classifier 1, was trained on a random sixty percent sample of the BRFSS data and incorporated the top ten features correlated to the binary diabetes indicator. The second classifier, Classifier 2, was trained on a different random sixty percent sample of the BRFSS data and incorporated the top five features correlated to the binary diabetes indicator. Thirty percent of the data, belonging to neither training set, was reserved as a test set.
The confusion matrices comparing model predictions against true class for the two classifiers are shown in Figure 1. The matrices to the right of each confusion matrix show the row-normalized percentages: the true negative rate (top left), false positive rate (top right), false negative rate (bottom left), and true positive rate (bottom right). The ensemble results in a later section will be presented in the same format. The percentages along the diagonal should be high (closer to 100%) while the off-diagonal percentages should be low (closer to 0%). Classifier 2 has a higher true positive rate, higher true negative rate, lower false positive rate, and lower false negative rate. Therefore, Classifier 2 is the better performing model. Ensemble learning does not guarantee that an ensemble improves on a high-performing base model.
Figure 1: Confusion matrices for (left) Classifier 1 and (right) Classifier 2.
Regions of Consensus
When all classifiers in an ensemble concur on the predicted class of a given input, there is consensus among them. In the current example application, consensus was obtained for 94.3% (72,070 of 76,411) of the test set cases. The identical confusion matrices for both classifiers at the test points where there was consensus are shown in Figure 2. As anticipated, for points where the models agree there is a performance improvement: higher true positive and true negative rates, and lower false positive and false negative rates.
Figure 2: Identical confusion matrices on test points where there is consensus between (left) Classifier 1 and (right) Classifier 2.
When the classifiers in an ensemble do not agree on the predicted class of a given input, there is non-consensus among them. Each classifier has two possible outcomes, diabetes and non-diabetes, so in cases of non-consensus exactly one classifier is correct and one is incorrect. This means opportunity exists to identify where Classifier 1 and Classifier 2 are individually accurate at the non-consensus test points. We aim to show that the competence-based approach performs better than other voting techniques, e.g., highest confidence selection.
Application Competence Scores
Setting competence score thresholds requires examining the distribution of scores over the training set. When the score distributions are multi-modal or exhibit a rapid decay in counts, identifying appropriate thresholds is more complicated. In this case, a combination of the competence score distributions and the underlying model confidence is used to establish the score thresholds. Automating how the thresholds are set is part of future work.
For the application ensemble, a subset of the training data, consisting of 5000 points, was used to analyze the distributions of competence scores to determine thresholds which may be applied to the base models in the ensemble. Figure 3 shows the distribution of the competence scores in the subset of training data for both classifiers.
Figure 3: Distribution of competence scores in the training subset for (left) Classifier 1 and (right) Classifier 2.
The Classifier 1 competence scores are roughly symmetric around a peak at 0.5, so the threshold of 0.5 was chosen for Classifier 1. In contrast, the Classifier 2 scores present two modes with a rapid drop in counts at a score of approximately 0.2. The confidence values for Classifier 2 were examined to verify that a drop in high confidence values (> 0.9) below a competence score of 0.2 was the root cause of the shape of the Classifier 2 distribution. The competence threshold of 0.2 was chosen for Classifier 2.
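The manual inspection described above, looking for a rapid drop in binned counts, can be approximated programmatically. This is a heuristic Python sketch of one possible automation (a direction the article defers to future work), not the author's procedure; the bin count and the largest-drop rule are my assumptions:

```python
from collections import Counter

def drop_threshold(scores, bins=10):
    """Bin competence scores on [0, 1] and return the right edge of the bin
    after which the counts fall most sharply, as a candidate threshold."""
    counts = Counter(min(int(s * bins), bins - 1) for s in scores)
    hist = [counts.get(b, 0) for b in range(bins)]
    # drop between consecutive bins; the largest drop marks the candidate cutoff
    drops = [hist[b] - hist[b + 1] for b in range(bins - 1)]
    b = max(range(bins - 1), key=lambda i: drops[i])
    return (b + 1) / bins

# Scores clustered below 0.2 with a second mode near 0.6, loosely echoing
# the Classifier 2 distribution described in the text.
scores = [0.05, 0.08, 0.1, 0.12, 0.15, 0.18, 0.55, 0.6, 0.65]
print(drop_threshold(scores))  # 0.2
```

A real automation would also need to handle the roughly symmetric, unimodal case (Classifier 1), where a mode- or quantile-based rule is likely more appropriate than a drop detector.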
As previously described in the approach, predictions on inputs with scores below the chosen thresholds will not be included when non-consensus occurs.
Results
The proposed approach incorporating competence scores is compared to a common highest confidence approach. Figure 4 shows the confusion matrices for both approaches. Incorporating the competence score improved on the maximum confidence approach in all four rates: true positive, true negative, false positive, and false negative. Therefore, the competence-based approach shows better performance.
Figure 4: Confusion matrices for (left) maximum confidence voting and (right) competence-based filtering approach.
Conclusion
The application ensemble was used to demonstrate an approach for incorporating competence score estimation into ensemble learning methods. The results show a performance improvement compared to a traditional highest confidence approach. Additionally, this example showed the approach may be applied to data with a mixture of feature types, e.g., binary, integer, and categorical.
While the performance results are promising, the approach will be applied to additional data sets, classifier algorithms, and ensembles to identify opportunities for further enhancement. In ensembles where the differences between the base models are more pronounced, one would expect a more drastic improvement from the competence-based approach.
We were able to identify and log which classifier was used or selected for each non-consensus test point, providing more transparency in selection for human-machine teaming applications. Therefore, this approach is more explainable to end users than network learning ensemble techniques. Building on this approach, recommender system visualizations may be formed to make ensemble learning with many classifiers more easily understood by end users.
This approach may also help in determining opportunities for making an ensemble more robust as it may be used prior to end user employment of predictions. It could also be used to diversify the ensemble through identification of gaps in the aggregate training and prediction space. As an example, the approach could be incorporated in ensemble development to identify where no base classifiers show competence.
The example shown incorporated two base models in the ensemble. There is interest in determining how the results may scale with a larger number of base models in the ensemble. The approach demonstrated still applies as more base models are introduced. However, as previously noted, the thresholds applied in determining model competence scores necessitate analyst insight. Therefore, to scale the approach it is desirable to identify approaches to automate how the thresholds are set. Future work involves data-driven approaches to set the thresholds based on the competence score distribution.
An additional pursuit is extending this approach beyond classifiers to other types of models. The ALICE score (Rajendran & LeVine, 2019) was used in the demonstration and is appropriate for classification models as shown. The approach described may be generalized using trust score estimation methods developed for regression models, such as the Regression Trust Score (RETRO) method (de Bie, 2021).
Acknowledgements
This work was completed as an independent study at the University of Maryland Baltimore County under the advisement of Dr. Matthias K. Gobbert.
MATLAB was the environment used to process data, create the classifier models, and generate the plots and confusion matrices for the results shown in this article.
Charts corresponding to this article were presented at the DATAWorks conference in Alexandria, VA on April 24, 2025.
References
1. Centers for Disease Control and Prevention (CDC). “Behavioral Risk Factor Surveillance System Survey Questionnaire.” U.S. Department of Health and Human Services, Centers for Disease Control and Prevention, Atlanta, GA, 2015.
2. de Bie, K., Lucic, A., & Haned, H. “To Trust or Not to Trust a Regressor: Estimating and Explaining Trustworthiness of Regression Predictions,” arXiv:2104.06982, 2021.
3. Kundu, R., Singh, P.K., Mirjalili, S., & Sarkar, R. “COVID-19 detection from lung CT-Scans using a fuzzy integral-based CNN ensemble,” Computers in Biology and Medicine, Vol. 138, 2021.
4. Kundu, R. “The Essential Guide to Ensemble Learning,” V7 Labs, 11 Jan 2024. Retrieved from: https://www.v7labs.com/blog/ensemble-learning-guide.
5. McFadden, F. “Applications of model competence estimation” [Conference Presentation], Society for Industrial and Applied Mathematics (SIAM) Mathematics of Data Science (MDS) Conference, Atlanta, GA, USA, 24 October 2024.
6. Polikar, R. “Ensemble based systems in decision making,” IEEE Circuits and Systems Magazine, Vol. 6, No. 3, 2006.
7. Rajendran, V. & LeVine, W. “Accurate Layerwise Interpretable Competence Estimation,” in Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, R. Garnett, editors, Vol. 32, Curran Associates, Inc., 2019.
8. Teboul, A. “Diabetes Health Indicators Dataset,” Kaggle, 2022. Retrieved from: https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset/data
Author Biographies
Francesca R. McFadden has worked on modeling, simulation, and analysis to evaluate system architectures at the Johns Hopkins University Applied Physics Laboratory since 2010. She earned a bachelor’s degree in computational mathematics with an additional major in statistics from Carnegie Mellon University, and a master’s degree in applied mathematics from North Carolina State University. She is a doctoral candidate at the University of Maryland, Baltimore County, studying under the advisement of Dr. Matthias K. Gobbert.
Dewey Classification: L 681 12

