SEPTEMBER 2025 | Volume 46, Issue 3
IN THIS JOURNAL:
- Issue at a Glance
- Chairman’s Message
Technical Articles
- Kernel Model Validation: How To Do It, And Why You Should Care
- Confidence-Based Skip-Lot Sampling
- Eucalyptus – An Analysis Suite for Fault Trees with Uncertainty Quantification
- Digital Twins in Reliability Engineering: Innovations, Challenges and Opportunities
- Competence Measure Enhanced Ensemble Learning Voting Schemes
- Advancing the Test Science of LLM-enabled Systems: A Survey of Factors and Conditions that Matter Most
- Beyond Accuracy: Evaluating Bayesian Neural Networks in a Real-world Application
- Balancing Structure and Flexibility: Evaluating Agile, Waterfall, and Hybrid Methodologies in Aerospace and Defense Projects
Workforce of the Future
- Building Confidence, Interest, and Opportunity: A Social Cognitive Career Theory-Based Analysis of the Young Women in Engineering Outreach Program
News
- Association News
- Chapter News
- Corporate Member News
Beyond Accuracy: Evaluating Bayesian Neural Networks in a Real-world Application

Mark Hinds
Los Alamos National Laboratory,
Los Alamos, NM, 87545

Scott Koermer
Los Alamos National Laboratory,
Los Alamos, NM, 87545
Michael Geyer
Los Alamos National Laboratory,
Los Alamos, NM, 87545
Abstract
Robust uncertainty quantification is essential for deploying machine learning models in high-stakes scientific applications, where predictions must remain reliable even under data scarcity and distribution shift. This work investigates uncertainty-aware neural network models for regression tasks in such settings, with a focus on a specific application in planetary science: the prediction of chemical composition from spectroscopic data collected by the ChemCam instrument onboard NASA’s Mars rover Curiosity. Due to the limited size of labeled training data and differences between Earth-based training samples and Martian measurements, conventional models may exhibit overconfidence or fail to generalize. We evaluate three Bayesian inference methods (Laplace approximation, deep ensembles, and Variational Bayes) with respect to their predictive performance and calibration of both epistemic and aleatoric uncertainty. Our results show that ensemble methods provide more informative uncertainty estimates, particularly under distribution shift. The methodology and insights presented here are broadly applicable to scientific domains requiring trustworthy machine learning under uncertainty.
Keywords: Uncertainty quantification, Bayesian neural networks
1.0 Introduction
In planetary science missions (including NASA’s Mars rover program), instruments such as ChemCam provide crucial spectroscopic data for ascertaining the chemical composition of the Martian surface (Figure 1). Accurate prediction of chemical composition is key to mission success, but there are several challenges facing existing statistical and machine learning methodologies applied to this task. First, supervised learning models are necessarily trained on laboratory data with known chemical composition, which is expensive to obtain and therefore relatively small (on the order of 500 unique geological targets). Complex machine learning models trained on small data sets may suffer from high epistemic (model) uncertainty, as the dataset is not rich enough to fully resolve the model parameters. Second, the models will be applied on data collected on Mars, which is likely to lie outside the distribution of training data. Under such distribution shift scenarios, traditional neural network models may make poor predictions and yet also exhibit overconfidence. In scientific application areas, we argue that robust generalization and meaningful uncertainty quantification are critical. Specifically, reliable uncertainty quantification can indicate when a model is uncertain (due to lack of data or noisy data), guide future data acquisition, and enable efficient prioritization of new targets.
In this work, we focus on developing and evaluating uncertainty-aware neural network models for prediction of chemical composition from ChemCam data. While a number of authors have applied a variety of regression methods to ChemCam data, existing approaches lack principled uncertainty quantification and may exhibit overconfidence, particularly on out-of-distribution inputs. We evaluate three Bayesian convolutional neural network inference approaches (Laplace approximation, deep ensembles, and Variational Bayes) with respect to the quality of their predictive uncertainty and their performance on out-of-distribution data. Specifically, we consider both epistemic (model) uncertainty and aleatoric (data noise) uncertainty and evaluate how epistemic uncertainty changes under distribution shift. In the real-world, high-stakes domain of ChemCam spectral analysis, we find evidence that deep ensembles provide more useful measures of epistemic uncertainty than Laplace approximations or Variational Bayes. While our specific results are centered around this specific data set, the methodology and evaluation criteria we present are broadly applicable to any application of neural network models where accurate characterization of uncertainty is necessary.
In Section 2, we present the mathematical framework and methodology, outline our evaluation metrics, and introduce the ChemCam data. Section 3 gives results: evaluation of the three inference approaches with respect to the metrics, inspection of aleatoric and epistemic uncertainty relative to the overall uncertainty within each method, and characterization of the behavior of each method on out-of-distribution data. Finally, Section 4 summarizes our conclusions and outlines future work.

Figure 1. Left: The Curiosity Mars rover with instrumentation identified, including the ChemCam Laser-induced Breakdown Spectroscopy (LIBS) instrument at the top of the mast. Image credit: NASA/JPL-Caltech. Right: Illustration of LIBS; the laser creates a plasma on the surface of the target which emits light captured in a spectrograph, where different colors (wavelengths) correspond to different elements. Image credit: NASA/JPL-Caltech/LANL.
2.0 Methods
In this section, we describe the mathematical background (Section 2.1), details of the methods implemented (Section 2.2), evaluation metrics (Section 2.3), and the application area and data (Section 2.4).
2.1 Bayesian modeling and uncertainty quantification
Suppose we aim to learn a predictive model that maps X→Y, where the data used to train the model is comprised of N pairs $\{(x_i, y_i)\}_{i=1}^{N}$. We consider a general setting in which a predictive model that depends on parameters θ and noise variance σ2 is described by a likelihood function $p(y \mid x, \theta, \sigma^2)$. That is, given a setting of θ and σ2 and an input x, the likelihood function describes the probability density function over y. We specifically consider regression problems with neural networks, where a natural choice of likelihood function is Gaussian:

$$p(y \mid x, \theta, \sigma^2) = \mathcal{N}\left(y;\, f_\theta(x),\, \sigma^2\right).$$

Here, the neural network fθ gives the mean with a learnable noise variance σ2. In this framework, maximum likelihood estimation seeks to minimize the loss function L corresponding to the negative log likelihood

$$L(\theta, \sigma^2) = -\sum_{i=1}^{N} \log p(y_i \mid x_i, \theta, \sigma^2) = \sum_{i=1}^{N} \left[ \frac{1}{2}\log(2\pi\sigma^2) + \frac{\left(y_i - f_\theta(x_i)\right)^2}{2\sigma^2} \right].$$
If we assume σ2 is known or fixed, this becomes the standard mean squared error loss function typically used for training neural networks for regression tasks. In this work, we estimate σ2 which represents aleatoric uncertainty, or data uncertainty, that reflects underlying noise or variation in the response y.
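As a small concrete sketch of the loss above (using NumPy; the function name `gaussian_nll` and the choice to average over observations are ours, not the paper's implementation):

```python
import numpy as np

def gaussian_nll(y, mu, log_var):
    """Negative log likelihood of y under N(mu, exp(log_var)), averaged
    over observations. With log_var held fixed, minimizing this is
    equivalent (up to an affine transformation) to minimizing MSE."""
    var = np.exp(log_var)
    return np.mean(0.5 * np.log(2 * np.pi * var) + 0.5 * (y - mu) ** 2 / var)
```

Parameterizing the noise through its logarithm keeps the variance positive without constrained optimization, which is also how σ2 is handled in the MAP fit described below.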
While aleatoric uncertainty is an important part of overall predictive uncertainty, we also want to estimate epistemic uncertainty, or model uncertainty; that is, uncertainty about the parameters θ and σ2. In this work, we take a Bayesian approach. Given a likelihood function, we can specify a prior distribution over parameters p(θ, σ2). Bayes’ theorem then gives the posterior distribution over parameters as

$$p(\theta, \sigma^2 \mid x, y) = \frac{p(y \mid x, \theta, \sigma^2)\, p(\theta, \sigma^2)}{p(y \mid x)}.$$
If the posterior distribution were known, then given our data set comprised of x and y, we would have the probability density function over θ and σ2 that reflects epistemic uncertainty. In most practical problems, and particularly with large parameter sets θ corresponding to neural network models, it is intractable to compute the posterior distribution in closed form, and sampling methods like Markov chain Monte Carlo tend to scale poorly (Goan and Fookes 2020); we will describe tractable approximate inference methods in Section 2.2.
Assuming we know the posterior distribution, our predictions are distributed according to the posterior predictive distribution

$$p(y^* \mid x^*, x, y) = \int p(y^* \mid x^*, \theta, \sigma^2)\, p(\theta, \sigma^2 \mid x, y)\, d\theta\, d\sigma^2.$$

That is, for a new input x*, we can compute the probability of the output y* conditional on x* by integrating over parameters sampled from the posterior. In practice, we will approximate this integral with samples from the posterior.
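The sample-based approximation of the posterior predictive might look like the following sketch (NumPy; the function name, signature, and the toy linear model in the test are illustrative assumptions, not the paper's code):

```python
import numpy as np

def posterior_predictive_samples(x_star, theta_samples, f, sigma2, rng):
    """Draw from the posterior predictive at x_star using posterior draws.

    theta_samples: parameter draws from the (approximate) posterior.
    f: model function f(x, theta) returning the predictive mean.
    sigma2: aleatoric noise variance (fixed at its MAP value in the paper).
    """
    means = np.array([f(x_star, th) for th in theta_samples])   # epistemic spread
    noise = rng.normal(0.0, np.sqrt(sigma2), size=means.shape)  # aleatoric draw
    return means + noise
```

With `sigma2 = 0`, the samples reduce to the spread of the network outputs alone, i.e., pure epistemic variation.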
Epistemic uncertainty (uncertainty about θ) is the primary goal in most existing Bayesian neural network applications. In addition, jointly estimating σ2 and θ can be challenging, as the mean function fθ (parameterized by a neural network) is very flexible, making it hard to distinguish signal from noise. As a result, we take a two-step approach to handle the data noise variance σ2. First, we perform maximum a posteriori (MAP) optimization in the full Bayesian model to determine parameters θ and σ2 that minimize the negative log posterior

$$L_{\mathrm{MAP}}(\theta, \sigma^2) = -\sum_{i=1}^{N} \log p(y_i \mid x_i, \theta, \sigma^2) - \log p(\theta, \sigma^2).$$

We use independent standard Gaussian priors for θ; for σ2, we reparameterize to optimize unconstrained over the log variance and use an independent Gaussian prior with mean -0.5 and variance 0.1. Notably, Gaussian priors correspond to familiar regularization terms used in neural network training; the standard Gaussian prior on θ contributes $-\log p(\theta) = \frac{1}{2}\sum_j \theta_j^2 + \text{const}$, which is an l2 weight decay term. After MAP optimization, we fix σ2 (which corresponds to assuming the posterior over σ2 is a point mass at the MAP estimate). Given this fixed σ2, we then compare different Bayesian inference methods that yield approximate posteriors p(θ | x, y). We compute epistemic uncertainty by sampling values of θ to compute fθ(x) (neural network outputs for each value of θ) and total uncertainty by, for each sample of θ, also sampling from the noise distribution (using σ2). This procedure is akin to computing a sample-based posterior predictive but with a delta function for the posterior over σ2.
2.2 Inference for Bayesian neural networks
Since the true posterior is intractable, we explore three approximate inference methods in this work: deep ensembles, Laplace approximations, and Variational Bayes. Other alternatives include Monte Carlo dropout (which can be seen as a restricted version of Variational Bayes under specific assumptions (Gal and Ghahramani 2016)) and stochastic gradient Markov chain Monte Carlo (which can be difficult to tune and scale (Papamarkou et al. 2022)). Each of the methods we explore uses the same base neural network architecture: an input of 5,605 wavelengths (a 1D vector) followed by three 1D convolutional layers with output channels 32, 128, and 1; kernel sizes 11, 5, and 1; stride 3 in each layer; and ReLU activations. These layers are followed by two linear layers (207 input features to 20 output features followed by ReLU, then 20 input features to 20 output features) and a final output linear layer with 9 output features to predict the oxide weight percents. During the original MAP fit, we also estimate a 9-vector corresponding to σ2 for each oxide and fix these values before computing Laplace or Variational Bayes approximations. We use a maximum of 200 training epochs with early stopping on validation loss (patience 50) and the Adam optimizer with a learning rate of 3e-4.
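As a sanity check on the architecture described above, the feature length after the three convolutional layers can be traced with the standard output-size formula (assuming zero padding and no dilation, which the text does not state explicitly):

```python
def conv1d_out_len(n, kernel, stride, padding=0):
    """Output length of a 1D convolution layer (no dilation)."""
    return (n + 2 * padding - kernel) // stride + 1

n = 5605  # input wavelengths
for kernel, stride in [(11, 3), (5, 3), (1, 3)]:
    n = conv1d_out_len(n, kernel, stride)
# n is now 207, matching the 207 input features of the first linear layer
```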
Deep ensembles
Ensembles characterize uncertainty by repeatedly re-fitting a neural network (using maximum likelihood or MAP estimation) starting from different weight initializations (Lakshminarayanan, Pritzel, and Blundell 2017). This procedure results in a collection of models that exhibit predictive variation on held-out datasets and can be interpreted as a Bayesian approximation (Wilson and Izmailov 2020). In this work, we fit an ensemble using MAP optimization with 30 different initial weight values. However, we found that a subset of these models reached very poor local minima, where the neural network predicted a constant value and the estimated noise variance σ2 was very large; we therefore filtered out models with this behavior, resulting in an ensemble of size 11.
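The filtering step could be sketched as follows (a hedged illustration: the thresholds, function name, and `(theta, sigma2)` member representation are our assumptions; the paper filters on this qualitative behavior, not these exact values):

```python
import numpy as np

def filter_degenerate_members(members, x_val, f, max_sigma2=1.0, min_std=1e-6):
    """Drop ensemble members stuck in the poor local minima described
    above: near-constant predictions or an inflated noise variance.

    members: list of (theta, sigma2) MAP fits.
    f: model function f(x, theta) -> predictions on validation inputs.
    """
    kept = []
    for theta, sigma2 in members:
        preds = f(x_val, theta)
        if np.std(preds) < min_std or sigma2 > max_sigma2:
            continue  # constant-output minimum, or noise variance absorbed the signal
        kept.append((theta, sigma2))
    return kept
```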
Laplace approximation
The Laplace approximation begins with a pre-trained MAP solution and expands an approximate Gaussian posterior around this point using the curvature of the loss function. Specifically, we denote the unnormalized posterior distribution as g(θ) = p(y | x, θ) p(θ) and denote $\hat{\theta} = \arg\max_\theta g(\theta)$. We then consider a truncated Taylor approximation of log g around $\hat{\theta}$, where the first-derivative term is zero because $\hat{\theta}$ is a local maximum:

$$\log g(\theta) \approx \log g(\hat{\theta}) - \frac{1}{2}(\theta - \hat{\theta})^\top H (\theta - \hat{\theta}), \qquad H = -\nabla^2_\theta \log g(\theta)\Big|_{\theta = \hat{\theta}}.$$

Exponentiating, we obtain

$$g(\theta) \approx g(\hat{\theta}) \exp\left(-\frac{1}{2}(\theta - \hat{\theta})^\top H (\theta - \hat{\theta})\right),$$
which corresponds to an unnormalized Gaussian distribution centered at the MAP solution with covariance given by the inverse Hessian. We use the Python laplace package (Daxberger, Kristiadi, et al. 2021); for numerical stability and computational efficiency, we compute the last-layer Laplace approximation (fixing all weights except those in the last layer) with a diagonal approximation to the covariance, using prior precision 1.0.
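A simplified sketch of sampling from a diagonal Laplace posterior (our own minimal version of what the laplace package automates; here the posterior precision is the diagonal loss curvature plus the prior precision):

```python
import numpy as np

def diagonal_laplace_samples(theta_map, hess_diag, prior_precision, n, rng):
    """Sample from N(theta_map, diag(1 / (hess_diag + prior_precision))),
    the diagonal Gaussian centered at the MAP solution."""
    std = 1.0 / np.sqrt(hess_diag + prior_precision)
    return theta_map + std * rng.standard_normal((n, theta_map.size))
```

Large curvature (a sharply peaked loss) yields small posterior standard deviations, which is consistent with the small epistemic uncertainty we later observe for this method.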
Variational Bayes
Variational Bayes proposes a family of approximate posterior distributions, then uses optimization to find the member of the family closest to the true posterior. In this work, we use a mean-field approximate posterior $q(\theta) = \prod_i \mathcal{N}\left(\theta_i;\, \mu_i,\, \tau_i^{-1}\right)$, where each weight θi has a tunable mean μi and precision τi. The loss function is

$$L(\mu, \tau) = \mathrm{KL}\left(q(\theta)\,\|\,p(\theta)\right) - \mathbb{E}_{q(\theta)}\left[\log p(y \mid x, \theta)\right].$$
The first term is the KL divergence between the approximate posterior and the prior and is available in closed form since both are Gaussian; the second term is the expected log likelihood with respect to q and is approximated via sampling (Blundell et al. 2015). Many works have also found that weighting the first term with a weight β can be advantageous for optimization (Wenzel et al. 2020). For the results presented here, we used β=1e-4 and a prior noise standard deviation of 0.01 to obtain stable optimization results. We used 30 training epochs with an Adam optimizer with learning rate 1e-3.
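A compact sketch of the β-weighted loss (closed-form Gaussian KL plus a Monte Carlo expected log likelihood; the function name and signature are illustrative, not the paper's implementation):

```python
import numpy as np

def beta_elbo_loss(mu, sigma, y, x, f, noise_var, prior_std, beta, rng, n_mc=8):
    """beta * KL(q || prior) minus a Monte Carlo estimate of the expected
    Gaussian log likelihood, for a mean-field Gaussian q over weights."""
    # Closed-form KL between N(mu_i, sigma_i^2) and N(0, prior_std^2), summed.
    kl = np.sum(np.log(prior_std / sigma)
                + (sigma ** 2 + mu ** 2) / (2 * prior_std ** 2) - 0.5)
    # Monte Carlo expected log likelihood under q(theta).
    ell = 0.0
    for _ in range(n_mc):
        theta = mu + sigma * rng.standard_normal(mu.shape)
        resid = y - f(x, theta)
        ell += np.sum(-0.5 * np.log(2 * np.pi * noise_var)
                      - 0.5 * resid ** 2 / noise_var)
    return beta * kl - ell / n_mc
```

Down-weighting the KL term with a small β (1e-4 here) reduces the pull toward the prior, which is the practical motivation cited from Wenzel et al. (2020).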
2.3 Evaluation metrics
In this section, we discuss evaluation metrics for the probabilistic predictions from the Bayesian neural network approaches. We first evaluate predictive accuracy through the root mean squared error (RMSE) between the predictions and the true values on a held-out test set; for Bayesian approaches, we first take the average across a set of sampled predictions, then compute the RMSE. To evaluate uncertainty, we first consider constructing α-level prediction intervals, then use coverage, interval width, and interval score (which combines coverage and width) on these intervals. The α-level prediction interval is constructed as follows. First, we sample from the posterior predictive distribution with θ sampled from the approximate posterior distribution and σ2 fixed to the MAP value. From the collection of samples, we compute a mean $\hat{\mu}$ and standard deviation $\hat{s}$, then compute intervals as $\hat{\mu} \pm z_{(1+\alpha)/2}\, \hat{s}$, where $z_{(1+\alpha)/2}$ is computed using the inverse CDF of the standard Gaussian distribution.
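The interval construction just described (posterior predictive samples, then mean ± z·std) can be sketched as (a minimal version; the function name is ours):

```python
import numpy as np
from statistics import NormalDist

def prediction_interval(samples, alpha=0.95):
    """alpha-level Gaussian-approximation interval from posterior
    predictive samples: mean +/- z_{(1+alpha)/2} * std."""
    mu = samples.mean(axis=0)
    s = samples.std(axis=0, ddof=1)
    z = NormalDist().inv_cdf((1 + alpha) / 2)  # standard Gaussian inverse CDF
    return mu - z * s, mu + z * s
```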
Coverage
Given an α-level prediction interval, we assess coverage on the held-out test set by computing the proportion of the set for which the true value falls inside the interval. In theory, an α-level prediction interval should achieve at least α-level coverage (e.g., 95% prediction intervals should capture true values inside the intervals 95% of the time).
Interval width
While coverage is important, one could always achieve (or exceed) the nominal coverage by increasing the width of the prediction intervals. Intervals that are narrower while still achieving the nominal coverage are more precise, so we also measure the width of the intervals as a metric of performance.
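The coverage and width metrics above amount to two reductions over the test set (a straightforward sketch; the function name is ours):

```python
import numpy as np

def coverage_and_width(lower, upper, y_true):
    """Empirical coverage and mean width of prediction intervals on a
    held-out set (arrays of matching shape)."""
    inside = (y_true >= lower) & (y_true <= upper)
    return inside.mean(), (upper - lower).mean()
```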
Interval score
A proper scoring rule that combines the notions of coverage and interval width is the interval score (Gneiting and Raftery 2007), defined for an α-level interval with lower bound l and upper bound u as

$$\mathrm{IS}_\alpha(l, u;\, y) = (u - l) + \frac{2}{1-\alpha}\,(l - y)\,\mathbb{1}\{y < l\} + \frac{2}{1-\alpha}\,(y - u)\,\mathbb{1}\{y > u\}.$$

The interval score is on the scale of the data, where the first term u − l is the width of the uncertainty interval. The second and third terms are penalty terms; $\mathbb{1}\{y < l\}$ is the indicator function, equal to 1 if the true value y falls below the lower bound of the interval and 0 otherwise (and analogously for $\mathbb{1}\{y > u\}$). Therefore, if the value y falls outside the interval, we add a penalty to the width equal to how far outside the interval it is, scaled by 2/(1−α) (which penalizes more for intervals that should have higher coverage). Lower interval score values are preferable, and higher values occur either because the width u − l is large or because coverage is not achieved (when y < l or y > u). Therefore, the interval score offers a metric that can be used to compare models while taking into account both the width and (mis)coverage of intervals.
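The interval score is a one-liner in vectorized form (the function name is ours; the α convention matches the coverage-level convention used in this paper):

```python
import numpy as np

def interval_score(lower, upper, y, alpha=0.95):
    """Interval score of Gneiting and Raftery (2007) for alpha-level
    intervals: width plus 2/(1-alpha)-scaled penalties for misses."""
    penalty = 2.0 / (1.0 - alpha)
    score = (upper - lower) \
        + penalty * (lower - y) * (y < lower) \
        + penalty * (y - upper) * (y > upper)
    return np.mean(score)
```

For a 95% interval the penalty factor is 40, so even small miscoverage dominates the width term; this is why the K2O undercoverage shows up so strongly in Table 4.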
Epistemic fraction
To investigate the contribution of epistemic uncertainty to total uncertainty, we compute the fraction of variance coming from epistemic uncertainty as follows. First, we compute the total uncertainty by taking the variance across the posterior predictive samples that form the prediction interval. Second, we compute the epistemic uncertainty by re-computing posterior predictive intervals but with no noise variance; that is, we sample θ then simply take the output of the neural network as the prediction (with no additional sampling of noise variance based on σ2). The ratio of epistemic to total uncertainty is the epistemic fraction.
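The epistemic fraction can be sketched with an analytic variance decomposition (a simplification: we add the fixed aleatoric σ2 directly rather than re-sampling noise as in the procedure above; the function name is ours):

```python
import numpy as np

def epistemic_fraction(network_outputs, sigma2):
    """Fraction of predictive variance that is epistemic.

    network_outputs: f_theta(x) across posterior draws of theta (axis 0);
    their variance is the epistemic part, and the fixed aleatoric sigma2
    completes the total variance.
    """
    ep_var = np.var(network_outputs, axis=0)
    return ep_var / (ep_var + sigma2)
```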
2.4 Application: geochemical analysis using spectral data
The NASA Mars rover missions have worked to deploy remote instrumentation to characterize the chemical composition of the Martian surface. The ChemCam instrument onboard the Curiosity rover was designed to recover the chemical composition of rock and soil samples by measuring the spectral properties of plasma atomic emissions induced by a laser pulse (via Laser-Induced Breakdown Spectroscopy, or LIBS). LIBS is a type of atomic emission spectroscopy in which the laser creates a small plasma on the target surface, causing the atoms to emit light that is collected by high-resolution spectrometers. The primary driver of variation across spectra is the chemical composition of the target, so previous works sought to predict the composition from the spectrum using methods like linear regression, dimension reduction, and deep neural networks (Wiens et al. 2013; Forni et al. 2013; Clegg et al. 2017; Anderson et al. 2017; Castorena et al. 2021; Li et al. 2020; Yu and Yao 2023). Convolutional neural networks have been applied successfully in both LIBS and other types of spectroscopy (Acquarelli et al. 2017; Ghosh et al. 2019; Shen and Viscarra Rossel 2021); however, deterministic machine learning techniques may not accurately reflect uncertainty in predictions, particularly when we train from small data sets and expect models to be applied to potentially out-of-distribution data (such as data collected by the rover instrument on Mars). When applied to data collected on Mars, there is a risk of extrapolation due to unanticipated compositions, suggesting that estimates of uncertainty are crucial for interpretation of results.
We use publicly available data hosted in the Planetary Data System (PDS) (McMahon 1996). We split the geological standards data by geological target name into a training set with 375 targets, a validation set with 47 targets, and a test set with 48 targets. For each target, data was collected with the laser pointed at 4 or 5 distinct locations; while variation across locations should be small for the geologic standards (because they are ground and pressed), some variation in composition could occur. At each location, the laser was fired 50 times; we discard the first 5 shots to remove surface effects, resulting in approximately 225 spectra available per target. We preprocess the spectral data by masking the wavelengths to three wavelength ranges (UV: 246nm to 338nm, VIO: 382nm to 473nm, VNIR: 492nm to 849nm) and normalizing by the maximum value within each spectral range. We then normalize by the maximum across the entire range, remove negative values, and apply a logarithmic transform after adding 0.5 to avoid values near zero. The certified oxide weight percents corresponding to each target (obtained by other analytical techniques) are given in terms of percent values that should add to near 100% across the nine oxides we consider (SiO2, TiO2, Al2O3, FeOT, MnO, MgO, CaO, Na2O, K2O) which should make up the bulk of each sample. We preprocess the oxide weight percents by dividing by 100 (to convert to proportions), then scaling by the standard deviation across the training set. Figure 2 shows spectral data from five example targets within a specific spectral range; the black line shows the mean across all observations for that target, with variation shown in blue. While different targets often have features in specific areas, the feature intensity and shape varies across samples. Figure 3 shows the distribution of the labels (oxide weight percents) across the training set. 
Because the samples are geological in nature, the distributions of the oxides are non-uniform, with many right-skewed (few large values). In addition, the range of the oxides differs significantly across oxides; in particular, SiO2 tends to have the largest concentration, while other oxides like TiO2 and K2O tend to be present only in trace amounts.
To test our models on out-of-distribution data, we obtained 253 averaged spectra collected from the actual ChemCam instrument on Mars and preprocessed them in the same way we preprocessed the laboratory data. The actual composition for the Martian targets is generally not known, but previous work has found distribution shifts between the laboratory and Mars data that suggest we may see changes in model uncertainty on the Mars data (Panda et al. 2023).
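The spectral preprocessing described above can be sketched as follows (assuming a wavelength grid in nm; the function name and the per-range/overall normalization order follow our reading of the text and are not the authors' code):

```python
import numpy as np

# UV, VIO, and VNIR wavelength ranges (nm) from the text.
RANGES_NM = [(246, 338), (382, 473), (492, 849)]

def preprocess_spectrum(wavelengths_nm, intensity):
    """Mask to the three spectral ranges, normalize by the max within
    each range, normalize by the overall max, remove negative values,
    and apply log(x + 0.5) to avoid values near zero."""
    parts = []
    for lo, hi in RANGES_NM:
        mask = (wavelengths_nm >= lo) & (wavelengths_nm <= hi)
        segment = intensity[mask]
        parts.append(segment / segment.max())  # per-range max normalization
    spectrum = np.concatenate(parts)
    spectrum = np.clip(spectrum / spectrum.max(), 0.0, None)
    return np.log(spectrum + 0.5)
```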

Figure 2. Examples of training set spectral data from five geological targets for a subset of the total measured wavelength range. Each target is measured multiple times; the mean spectrum is in black with plus or minus two standard deviations in blue.

Figure 3. Training set distributions of the oxide weight percents for all nine oxides, with the diagonal showing a kernel density plot for a particular oxide, and the off-diagonal showing two-dimensional kernel density estimates for pairs of oxides. Of note are the varied ranges for each oxide and the skewed distributions present in many cases (where the majority of the data has low values but with a few higher values).
3.0 Results
In this section, we describe the performance of ensemble, Laplace approximation, and Variational Bayes methods in our application. Section 3.1 discusses performance and calibration metrics for prediction intervals from each method, while Section 3.2 delves deeper into the contribution of epistemic and aleatoric uncertainty within each method. Finally, in Section 3.3, we explore how each method behaves when applied to out-of-distribution data.
3.1 Performance and calibration metrics
Table 1 shows the test set RMSE for each oxide for the average predictions from the ensemble, Laplace approximation, and Variational Bayes (VB) methods, along with the average RMSE across all oxides. Ensemble generally achieves the smallest RMSE, though for a few oxides, Laplace or Variational Bayes slightly outperforms it. Table 2 shows the coverage rate for 95% prediction intervals including both epistemic and aleatoric uncertainty; values should be at least 0.95. All three methods attain proper coverage for most oxides, though notably, Laplace and VB fall well short of 0.95 coverage for K2O. Table 3 gives the average interval width (per oxide) for each method. Generally, smaller intervals are better as long as they achieve the desired coverage. Because some intervals did not achieve the desired coverage, we bold the smallest interval that does achieve the desired coverage. Laplace approximation generally achieves the smallest intervals, though in a few cases, ensemble gives the smallest intervals that actually achieve nominal coverage. To combine the concepts of width and coverage into one metric, we show the interval score in Table 4. Here, ensemble achieves the lowest interval score across all oxides, but results are mixed for individual oxides. For K2O, ensemble achieves a notably lower interval score than the other methods (because they severely undercover). These results are specific to 95% prediction intervals. Figure 4 is a calibration plot showing observed coverage as a function of desired coverage for all three methods, averaged across oxides; that is, for prediction intervals constructed at each desired coverage level, we measure the observed coverage. All three methods tend to overcover, particularly ensemble, meaning that the intervals are wider than necessary.
Table 1. Test set root mean squared error (RMSE) across all oxides (left) and each individual oxide with the best results (lowest values) bolded. Ensemble achieves the best overall RMSE and the best RMSE within each oxide for 6 of 9 oxides. Laplace approximation achieves the best RMSE for SiO2, Al2O3, and CaO. Variational Bayes incurs the highest overall RMSE.
| Method | All | SiO2 | TiO2 | Al2O3 | FeOT | MnO | MgO | CaO | Na2O | K2O |
|---|---|---|---|---|---|---|---|---|---|---|
| Ensemble | 1.82 | 6.11 | 0.55 | 2.89 | 2.30 | 0.48 | 0.76 | 1.75 | 0.56 | 0.98 |
| Laplace | 1.92 | 5.95 | 0.59 | 2.71 | 2.58 | 0.79 | 0.92 | 1.73 | 0.84 | 1.16 |
| Variational Bayes | 2.17 | 6.78 | 0.51 | 3.51 | 3.08 | 0.50 | 0.98 | 2.13 | 0.67 | 1.41 |
Table 2. Test set coverage across all oxides (left) and each individual oxide. By construction, coverage should be greater than 0.95. All three methods achieve average coverage greater than 0.95 across all oxides but in some cases may undercover (values less than 0.95) for specific oxides. Notably, Laplace and Variational Bayes achieve coverage less than 0.9 for K2O.
| Method | All | SiO2 | TiO2 | Al2O3 | FeOT | MnO | MgO | CaO | Na2O | K2O |
|---|---|---|---|---|---|---|---|---|---|---|
| Ensemble | 0.99 | 0.96 | 1.00 | 0.96 | 1.00 | 1.00 | 1.00 | 0.99 | 1.00 | 0.98 |
| Laplace | 0.96 | 0.94 | 0.98 | 0.94 | 0.95 | 0.99 | 0.99 | 0.97 | 0.97 | 0.87 |
| Variational Bayes | 0.95 | 0.92 | 1.00 | 0.91 | 0.96 | 1.00 | 1.00 | 0.96 | 0.98 | 0.80 |
Table 3. Test set interval width across all oxides (left) and each individual oxide; lower values are generally preferable as long as coverage is greater than 0.95. The smallest width achieving coverage is bolded. Ensemble tends to have larger widths than the other two methods, with Laplace approximation achieving the shortest overall width.
| Method | All | SiO2 | TiO2 | Al2O3 | FeOT | MnO | MgO | CaO | Na2O | K2O |
|---|---|---|---|---|---|---|---|---|---|---|
| Ensemble | 5.02 | 12.81 | 1.89 | 6.20 | 7.35 | 3.73 | 3.80 | 5.09 | 2.13 | 2.14 |
| Laplace | 4.44 | 11.65 | 1.77 | 5.54 | 5.92 | 3.51 | 3.30 | 4.60 | 1.86 | 1.85 |
| Variational Bayes | 4.65 | 12.36 | 1.80 | 5.85 | 6.23 | 3.58 | 3.45 | 4.73 | 1.94 | 1.93 |
Table 4. Test set interval score across all oxides (left) and each individual oxide; lower values are better. The score is essentially equal to the interval width but with additional penalty for predictions that fall outside the prediction intervals. The smallest interval score is bolded. Ensemble achieves the lowest average interval score across oxides, but within individual oxides results are mixed. We note that for K2O in particular, ensemble achieves much lower interval score than the other methods (because the other methods severely undercover the data for K2O).
| Method | All | SiO2 | TiO2 | Al2O3 | FeOT | MnO | MgO | CaO | Na2O | K2O |
|---|---|---|---|---|---|---|---|---|---|---|
| Ensemble | 5.56 | 14.54 | 1.89 | 8.00 | 7.44 | 3.74 | 3.84 | 5.60 | 2.19 | 2.82 |
| Laplace | 6.18 | 13.98 | 2.53 | 7.79 | 8.06 | 3.90 | 3.59 | 5.60 | 2.99 | 7.22 |
| Variational Bayes | 6.70 | 15.56 | 1.87 | 9.49 | 7.83 | 3.61 | 3.52 | 6.17 | 2.22 | 10.02 |

Figure 4. For a range of predictive uncertainty intervals with different desired coverage (horizontal axis), we measure the observed coverage in the test set (vertical axis). The diagonal line would indicate well-calibrated intervals. All three methods are miscalibrated in that their intervals are wider than necessary, with ensemble being the most inflated.
3.2 Epistemic and total uncertainty
To understand how much of the total predictive uncertainty comes from epistemic uncertainty, we calculate for each test set item the fraction of the total variance that comes from epistemic uncertainty. Figure 5 shows the distribution of epistemic uncertainty fraction across the test set for each method. Clearly, the ensemble method has the largest overall contribution from epistemic uncertainty. Laplace approximation, on the other hand, has very small epistemic uncertainty, so that nearly all the predictive variance comes from aleatoric uncertainty.

Figure 5. The fraction of epistemic variance compared to total variance across the test set is shown for the three methods for each oxide. Ensemble tends to have the largest contribution from epistemic uncertainty while Laplace has nearly negligible epistemic uncertainty.
To better visualize the contribution of epistemic uncertainty to the total uncertainty, Figure 6 shows the predicted values versus reference (ground truth) values for a subset of the test set for two of the oxides (with the average prediction and average variance computed across the repeated samples for each target). Good predictions would fall near the diagonal line, with uncertainty intervals that cross the diagonal line. As discussed before, Laplace approximation has very little epistemic uncertainty, and there are a few predictions for which the interval does not cover the true value. VB has some epistemic uncertainty, but it appears to be relatively constant across targets, and again a few intervals do not cover the true value. Ensemble has larger epistemic uncertainty for some targets, indicating that its epistemic uncertainty adapts to the underlying data, and its intervals cover the true values for all examples shown.

Figure 6. Reference (ground truth) oxide weight percent values versus predicted values for a number of test set targets. Prediction intervals (blue, reflecting total uncertainty) are overlaid with the portion coming from epistemic uncertainty (red). Deep ensemble appears to achieve the best coverage (prediction interval contains the true value) and also appears to have the most variable epistemic uncertainty width across targets.
Ideally, increasing epistemic uncertainty would indicate regions of the input space where the model is least certain and where collecting more data could be beneficial. These regions of input space could also be where the model doesn’t predict as well (because it hasn’t seen enough data in this region). Figure 7 shows the prediction error (root mean squared error; RMSE) versus epistemic uncertainty (standard deviation due to epistemic uncertainty) for VB and ensemble across the entire test set. It appears that ensemble has a more positive relationship between RMSE and epistemic uncertainty, such that predictions that incur higher error tend to also have higher epistemic uncertainty. On the other hand, while VB shows some relationship, the epistemic uncertainty values are smaller and tend to flatten out as RMSE increases, indicating the epistemic uncertainty may be less useful for diagnosing potentially poor predictions on new data.

Figure 7. Prediction error (RMSE) versus epistemic uncertainty on the test set for VB and Ensemble. In contrast to VB, ensemble epistemic uncertainty tends to increase with RMSE, indicating that the model “knows what it doesn’t know” to a better extent than VB.
3.3 Application to out-of-distribution data
Finally, we evaluate how the uncertainty changes when we apply the models to out-of-distribution data, where we would expect epistemic uncertainty to increase. We applied our models to a held-out set of data collected on Mars by a twin of the laboratory instrument used to collect the training data; while the ground truth labels are not known, we can assess how the average epistemic uncertainty fraction (per oxide) compares between the Mars data and the Earth test set. Figure 8 shows the results for ensemble and Variational Bayes (the epistemic fraction for Laplace remained negligible for the Mars data). In general, the fraction of uncertainty attributable to epistemic (model-based) sources increases more for the ensemble than for VB, though the effect varies across oxides.

Figure 8. The average fraction of epistemic uncertainty for the Earth test set compared to a Mars test set for each oxide; ensemble shows the largest increase in the epistemic uncertainty across oxides.
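The per-oxide quantity compared in Figure 8 is the average share of total predictive variance contributed by the epistemic component. A minimal sketch of that statistic (names are illustrative):

```python
import numpy as np

def mean_epistemic_fraction(epistemic_var, total_var):
    """Average fraction of total predictive variance that is epistemic.

    epistemic_var, total_var: arrays of per-example variances for one oxide.
    Returns a value in [0, 1]; larger values on out-of-distribution data
    indicate that the model's uncertainty grows for model-based reasons.
    """
    return np.mean(epistemic_var / total_var)
```

Computing this separately on the Earth test set and on the Mars set, and comparing the two values per oxide, yields the kind of side-by-side comparison shown in Figure 8.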
4.0 Conclusions
Reliable epistemic uncertainty is crucial not only for scientific applications with relatively small data, such as the one we study here, but also for very large neural network models such as large language models (Papamarkou et al. 2024). The Bayesian inference framework offers an attractive, principled approach to incorporating multiple sources of uncertainty (including epistemic and aleatoric) within one framework, but existing approaches to Bayesian inference in large models (such as neural networks) face challenges that impede their successful application in practice. In this work, we compared three approaches to approximate Bayesian inference for neural networks on a real-world dataset: deep ensembles, the Laplace approximation, and variational Bayesian inference. While all three methods were somewhat miscalibrated (producing wider prediction intervals than needed), we find that ensembles generally yield good predictive performance metrics, including the error of the mean prediction, the coverage and width of the prediction intervals, and the interval score. Furthermore, we present evidence that the epistemic uncertainty of the ensemble method is more adaptive to the data and increases when the model is applied to out-of-distribution data. In contrast, we found that the Laplace approximation had very little epistemic uncertainty, while the epistemic uncertainty from Variational Bayes was relatively constant across examples and less sensitive to out-of-distribution data.
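Among the metrics named above, the interval score (Gneiting and Raftery 2007) is the least widely known: for a central (1 − α) prediction interval it charges the interval's width plus a penalty of 2/α per unit by which the observation falls outside the interval, so both overly wide and overconfident intervals score poorly (lower is better). A self-contained sketch of the standard definition (not our evaluation code):

```python
import numpy as np

def interval_score(y, lower, upper, alpha=0.05):
    """Mean interval score (Gneiting & Raftery 2007) for central intervals.

    y, lower, upper: arrays of observations and interval endpoints for a
    nominal (1 - alpha) central prediction interval. Lower is better.
    """
    width = upper - lower                                  # sharpness term
    below = (2.0 / alpha) * np.maximum(lower - y, 0.0)     # miss below
    above = (2.0 / alpha) * np.maximum(y - upper, 0.0)     # miss above
    return np.mean(width + below + above)
```

When the observation lies inside the interval the score reduces to the interval width, which is why the score simultaneously rewards calibration and sharpness.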
All three methods can be viewed as approximate Bayesian inference methods, so why might the results differ, particularly with respect to epistemic uncertainty and how it behaves on out-of-distribution data? First, both Laplace and Variational Bayes have several key tuning parameters that may affect their performance. In particular, in this work we found that Variational Bayes was sensitive to the choice of prior variance and the loss function weighting term. In addition, for the model and data we considered, the Laplace approximation suffered from an ill-conditioned Hessian that resulted in unstable covariance estimates; in our experiments, we could not overcome this obstacle via tuning of the prior precision or selection of different subsets of weights. Previous work has identified that deep networks with ReLU activation functions may be particularly prone to this issue, as some weights will have near-zero gradients (Daxberger, Nalisnick, et al. 2021); we plan to explore this issue in greater depth in future work. Second, we did not treat σ² in a fully Bayesian fashion for Variational Bayes and Laplace; instead, we fixed σ² to an initial MAP estimate. While we do not expect this choice to have a large impact, it warrants investigation in future work. Finally, Variational Bayes and Laplace both rely on Gaussian approximations that are inherently unimodal (either centered around the MAP estimate in Laplace, or optimized in Variational Bayes). In light of the consistently superior predictive accuracy and more faithful uncertainty estimates produced by ensemble methods, we contend that the true predictive posterior cannot be adequately captured by a single, unimodal Gaussian surrogate. Rather, the underlying posterior landscape is likely to exhibit one or both of the following features:
- Strong multi-modality. Multiple, well-separated “islands” of high probability mass can arise when different parameter configurations explain the data equally well. A single Gaussian collapses these modes into one averaged estimate, masking alternative plausible explanations.
- Heavy tails. Probability mass that decays more slowly than a Gaussian in the tails indicates enduring uncertainty far from the mode. Gaussian approximations underestimate this uncertainty, leading to overconfident predictions on out-of-distribution inputs.
Our results suggest that richer, more flexible families of approximate posteriors, such as mixtures of Gaussians, normalizing flows, or nonparametric processes, are necessary to faithfully represent the full predictive posterior. Future work should therefore investigate and benchmark such expressive approximations to ensure both calibrated uncertainty quantification and robust predictive performance.
Acknowledgment
Research presented in this article was supported by the Laboratory Directed Research and Development program of Los Alamos National Laboratory under project numbers 20230469ECR and 20240065DR.
References
Acquarelli, Jacopo, Twan van Laarhoven, Jan Gerretzen, Thanh N Tran, Lutgarde MC Buydens, and Elena Marchiori. 2017. “Convolutional Neural Networks for Vibrational Spectroscopic Data Analysis.” Analytica Chimica Acta 954: 22–31.
Anderson, Ryan B, Samuel M Clegg, Jens Frydenvang, Roger C Wiens, Scott McLennan, Richard V Morris, Bethany Ehlmann, and M Darby Dyar. 2017. “Improved Accuracy in Quantitative Laser-Induced Breakdown Spectroscopy Using Sub-Models.” Spectrochimica Acta Part B: Atomic Spectroscopy 129: 49–57.
Blundell, Charles, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. 2015. “Weight Uncertainty in Neural Network.” In International Conference on Machine Learning, 1613–22. PMLR.
Castorena, Juan, Diane Oyen, Ann Ollila, Carey Legget, and Nina Lanza. 2021. “Deep Spectral CNN for Laser Induced Breakdown Spectroscopy.” Spectrochimica Acta Part B: Atomic Spectroscopy 178: 106125.
Clegg, Samuel M, Roger C Wiens, Ryan Anderson, Olivier Forni, Jens Frydenvang, Jeremie Lasue, Agnes Cousin, et al. 2017. “Recalibration of the Mars Science Laboratory ChemCam Instrument with an Expanded Geochemical Database.” Spectrochimica Acta Part B: Atomic Spectroscopy 129: 64–85.
Daxberger, Erik, Agustinus Kristiadi, Alexander Immer, Runa Eschenhagen, Matthias Bauer, and Philipp Hennig. 2021. “Laplace Redux–Effortless Bayesian Deep Learning.” In NeurIPS.
Daxberger, Erik, Eric Nalisnick, James U Allingham, Javier Antorán, and José Miguel Hernández-Lobato. 2021. “Bayesian Deep Learning via Subnetwork Inference.” In International Conference on Machine Learning, 2510–21. PMLR.
Forni, Olivier, Sylvestre Maurice, Olivier Gasnault, Roger C Wiens, Agnès Cousin, Samuel M Clegg, Jean-Baptiste Sirven, and Jérémie Lasue. 2013. “Independent Component Analysis Classification of Laser Induced Breakdown Spectroscopy Spectra.” Spectrochimica Acta Part B: Atomic Spectroscopy 86: 31–41.
Gal, Yarin, and Zoubin Ghahramani. 2016. “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning.” In International Conference on Machine Learning, 1050–59. PMLR.
Ghosh, Kunal, Annika Stuke, Milica Todorović, Peter Bjørn Jørgensen, Mikkel N Schmidt, Aki Vehtari, and Patrick Rinke. 2019. “Deep Learning Spectroscopy: Neural Networks for Molecular Excitation Spectra.” Advanced Science 6 (9): 1801367.
Gneiting, Tilmann, and Adrian E Raftery. 2007. “Strictly Proper Scoring Rules, Prediction, and Estimation.” Journal of the American Statistical Association 102 (477): 359–78.
Goan, Ethan, and Clinton Fookes. 2020. “Bayesian Neural Networks: An Introduction and Survey.” Case Studies in Applied Bayesian Data Science: CIRM Jean-Morlet Chair, Fall 2018, 45–87.
Lakshminarayanan, Balaji, Alexander Pritzel, and Charles Blundell. 2017. “Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles.” Advances in Neural Information Processing Systems 30.
Li, Lu-Ning, Xiang-Feng Liu, Wei-Ming Xu, Jian-Yu Wang, and Rong Shu. 2020. “A Laser-Induced Breakdown Spectroscopy Multi-Component Quantitative Analytical Method Based on a Deep Convolutional Neural Network.” Spectrochimica Acta Part B: Atomic Spectroscopy 169: 105850.
McMahon, Susan K. 1996. “Overview of the Planetary Data System.” Planetary and Space Science 44 (1): 3–12.
Panda, Nishant, Natalie Klein, Dominic Yang, Patrick Gasda, and Diane Oyen. 2023. “Semi-Supervised Learning of Pushforwards for Domain Translation & Adaptation.” arXiv Preprint arXiv:2304.08673.
Papamarkou, Theodore, Jacob Hinkle, M Todd Young, and David Womble. 2022. “Challenges in Markov Chain Monte Carlo for Bayesian Neural Networks.” Statistical Science 37 (3): 425–42.
Papamarkou, Theodore, Maria Skoularidou, Konstantina Palla, Laurence Aitchison, Julyan Arbel, David Dunson, Maurizio Filippone, et al. 2024. “Position: Bayesian Deep Learning Is Needed in the Age of Large-Scale AI.” In International Conference on Machine Learning, 39556–86. PMLR.
Shen, Zefang, and RA Viscarra Rossel. 2021. “Automated Spectroscopic Modelling with Optimised Convolutional Neural Networks.” Scientific Reports 11 (1): 208.
Wenzel, Florian, Kevin Roth, Bastiaan S Veeling, Jakub Świątkowski, Linh Tran, Stephan Mandt, Jasper Snoek, Tim Salimans, Rodolphe Jenatton, and Sebastian Nowozin. 2020. “How Good Is the Bayes Posterior in Deep Neural Networks Really?” In Proceedings of the 37th International Conference on Machine Learning, 10248–59.
Wiens, RC, S Maurice, J Lasue, O Forni, RB Anderson, S Clegg, S Bender, et al. 2013. “Pre-Flight Calibration and Initial Data Processing for the ChemCam Laser-Induced Breakdown Spectroscopy Instrument on the Mars Science Laboratory rover.” Spectrochimica Acta Part B: Atomic Spectroscopy 82: 1–27.
Wilson, Andrew G, and Pavel Izmailov. 2020. “Bayesian Deep Learning and a Probabilistic Perspective of Generalization.” Advances in Neural Information Processing Systems 33: 4697–4708.
Yu, Yan, and Meibao Yao. 2023. “When Convolutional Neural Networks Meet Laser-Induced Breakdown Spectroscopy: End-to-End Quantitative Analysis Modeling of ChemCam Spectral Data for Major Elements Based on Ensemble Convolutional Neural Networks.” Remote Sensing 15 (13): 3422.
Author Biographies
Dr. Natalie Klein joined the Statistical Sciences group at Los Alamos National Laboratory in Fall 2019. Her research centers on the development and application of statistical and machine learning approaches in a variety of application areas, including remote/standoff sensing and high-dimensional physics simulations. A key theme in her work combines statistical ideas with machine learning, including a recent focus on uncertainty quantification for neural networks. Dr. Klein holds a joint Ph.D. in Statistics and Machine Learning from Carnegie Mellon University.
Mark Hinds is a Graduate Research Assistant in the Statistical Sciences group at Los Alamos National Laboratory and a Master’s student in Computer Science at Georgia Institute of Technology. His research focuses on developing probabilistic deep learning methods for uncertainty quantification and explainable AI with applications in planetary science and spectral analysis.
Dr. Scott Koermer is a postdoctoral researcher in the Statistical Sciences group at Los Alamos National Laboratory. Scott's recent work includes investigating the sensitivity of the uncertainty quantification accuracy of Bayesian neural networks to tuning parameter choices for variational posterior approximations, as well as providing a Bayesian update to an explosion monitoring model so that partially missing observations could be used for training. Scott graduated from Virginia Tech in 2022 with a Ph.D. in Mining and Minerals Engineering, where he applied and developed Bayesian methods for the analysis, modeling, and optimization of mineral processing circuits.
Dr. Michael Geyer is a Los Alamos National Laboratory Information Science and Technology Institute Postdoctoral Fellow and researcher in the field of Artificial Intelligence, specializing in Deep Learning, and Robustness of AI Systems. He obtained his Ph.D. in Computer Science from The University of Texas at San Antonio, where his research focused on developing novel techniques for AI explainability and reliability along with adversarial defense techniques. Michael works in the computational physics group at the Los Alamos National Laboratory where his work focuses on robust deep learning models for physical systems.
Dewey Classification: L 681 12


