To examine and compare the performances of Bayesian Optimization using the two ML models (denoted LVGP-BO and Lolo-BO for conciseness), we tested them on both synthetic mathematical functions and materials property optimization problems. Our comparative study is conducted using a modular BO framework (illustrated in Fig. 1; details of implementation are in “Methods”), in which both LVGP and Lolo can serve as the ML model.

We use this BO framework to search for the optimum value of any function \(y(\varvec{v})\), where the input variables \(\varvec{v}=[\varvec{x}, \varvec{t}]\) consist of numerical variables \(\varvec{x}\) and/or categorical variables \(\varvec{t}\), and the response *y* is a scalar. The BO performances are compared in two aspects, accuracy and efficiency. Accuracy relates to the ability to find the optimal objective function value. We record the complete optimization history for every test case, so that accuracy can be compared by looking at the optimal objective values observed at any time in the optimization process. Efficiency, on the other hand, is characterized by the rate of improving the objective function. In application scenarios such as materials design, the design evaluation (experimentation or physics-based simulation) is often very time-consuming, in comparison, the time for fitting ML models and calculating acquisition functions is negligible. Thus, when comparing efficiency, we focus on the time in terms of iteration number instead of actual computational time. The BO method capable of converging to the global optimum in fewer iterations is favored under these metrics. In the following part, we introduce the experimental settings and present the results for each test problem.

### Demonstration: mathematical test functions

We first present test results of minimizing mixed-variable mathematical functions selected from an online library^{37}. For functions that are originally defined on a continuous domain, we convert some of their arguments to be categorical for testing purposes. As these are white-box problems, we can investigate the BO methods’ performance under different problem characteristics, as well as the factors influencing the performance.

#### Low-dimensional simple functions

In the first test case, we use the Branin function, which has two input variables and relatively smooth behavior. We modify its definition as follows:

$$\begin{aligned} f(x,t)=\left( t-\frac{5.1}{4\pi ^2} x^{2}+\frac{5}{\pi } x-6\right) ^{2}+10\left( 1-\frac{1}{8\pi }\right) \cos (x)+10, \end{aligned}$$

(5)

where \(x\in [-5,10]\) is a numerical variable, and *t* is categorical, with categories corresponding to values \(\{0, 5, 10, 15\}\). To provide an intuitive sense of its behavior, we visualize the function in Fig. 2a. Lolo-BO and LVGP-BO are used respectively to minimize the modified Branin function, starting with 10 initial samples. We repeat this 30 times with different random initial designs for each replicate, and the optimization histories across replicates are shown in Fig. 2e. To compare the overall performance and robustness of LVGP-BO and Lolo-BO, we show both the median objective value \(\tilde{y}\) and the scaled median absolute deviation \(\mathrm {MAD}=\mathrm {median}\left( \left| y-\tilde{y}\right| \right) /0.6745\) at every iteration.

Another low-dimensional, simple test function is the McCormick function (visualized in Fig. 2b), in the following modified form:

$$\begin{aligned} f(x,t) = \sin (x+t) + (x-t)^2 – 1.5x + 2.5t +1, \end{aligned}$$

(6)

where \(x\in [-1.5,4]\), and *t*’s categories correspond to integer values \(\{-3, -2, \dots , 4\}\). The initial sample size and number of replicates are the same as described above; optimization histories are shown in Fig. 2f. From the optimization history plots, we observe that for both test functions, LVGP-BO converges to the global minimum in fewer iterations, thus showing better efficiency.

#### Low-dimensional complex functions

We then test LVGP-BO and Lolo-BO in optimizing low-dimensional complex functions (definitions are provided in Methods), in this case, rugged functions with several local and/or global minimums. The Six-Hump Camel function (Fig. 2c):

$$\begin{aligned} f(x,t) = \left( 4-2.1x^2+\frac{x^4}{3}\right) x^2 + xt + (-4+4t^2)t^2, \end{aligned}$$

(7)

with \(x\in [-2,2]\) and \(t\in \{\pm 1, \pm 0.7126, 0\}\), is optimized in 30 runs, each starting from initial samples of size 10. As Fig. 2g shows, both LVGP-BO and Lolo-BO converge to the global minimum, while LVGP leads to faster convergence.

We also test optimizing the Rastrigin function, which has more local minimums (as shown in Fig. 2d):

$$\begin{aligned} f(\varvec{v}) = 10d + \sum _{i=1}^d [v_i^2 – 10\cos (2\pi v_i)], \end{aligned}$$

(8)

where *d* is the adjustable dimensionality. We set \(d=3\), with two numerical variables \(x_{1,2}=v_{1,2} \in [-5.12,5.12]\), and one categorical variable \(t = v_3 \in \{-5, -4, \dots , 5\}\). As Fig. 2h shows, in optimizing this highly multimodal function, LVGP-BO shows more performance superiority: it approaches the global minimum at around 60 iterations and eventually converges to the global minimum, while Lolo-BO does not.

#### High-dimensional functions

Moving beyond low dimensionality, we compare the two BO methods on a series of high-dimensional functions. We are optimizing the Perm function (whose behavior in 2D is shown in Fig. 3a):

$$\begin{aligned} f(\varvec{v}) = \sum _{i=1}^{d}\left( \sum _{j=1}^{d}\left( j^{i}+0.5\right) \left( \left( \frac{v_{j}}{j}\right) ^{i}-1\right) \right) ^{2}, \end{aligned}$$

(9)

the Rosenbrock function (whose behavior in 2D is shown in Fig. 3d):

$$\begin{aligned} f(\varvec{v})=\sum _{i=1}^{d}\left[ 100\left( v_{i+1}-v_{i}^{2}\right) ^{2}+\left( v_{i}-1\right) ^{2}\right] , \end{aligned}$$

(10)

and a simple quadratic function \(f(\varvec{v}) = \sum _{i=1}^d v_i^2\), where *d* denotes the dimensionality.

For the Perm function, we use both low- and high-dimensional settings: (1) six-dimensional (6D), with \(t = v_6 \in \{-4,1,6\}\) and \(x_{1,\dots ,5} = v_{1,\dots ,5} \in [-6,6]\). In this case, the degrees of freedom \(D_f=7\). (2) ten-dimensional (10D), with \(t_1=v_6\in \{-4,1,6\}\), \(t_2=v_7\in \{-8,-3,2,7\}\), \(t_3=v_8\in \{-6,1,8\}\), \(t_4=v_9\in \{\pm 3, \pm 9\}\), \(t_5=v_{10}\in \{0, \pm 5, \pm 10\}\), and \(x_{1,\dots ,5} = v_{1,\dots ,5} \in [-10,10]\). \(D_f=19\) for this function. 10 replicates are run for each test, starting from initial samples of sizes 20 for 6D and 50 for 10D. Observations are that, in the 6D test case, LVGP-BO and Lolo-BO show close efficiencies (Fig. 3b). Whereas in the 10D case, both BO methods have difficulties optimizing the function and get stuck for more than 20 iterations (Fig. 3c); Lolo-BO displays better convergence rate and final minimum objective value.

The Perm function is complex because of its non-convexity, and more significantly, its erratic behavior at the domain boundary: the function value is growing nearly exponentially near the boundary. We test BO of the 10D Rosenbrock and quadratic functions to investigate the influence of high dimensionality, without the erratic complexity. The Rosenbrock function is also non-convex, but is well-behaved at the domain boundary. We define its input variables as following: numerical variables \(x_{1,\ldots ,5}=v_{1,\ldots ,5} \in [-5, 10]\), categorical variables \(t_1=v_6\in \{-4,1,6\}\), \(t_2=v_7\in \{-3, 1, 5, 9\}\), \(t_3=v_8\in \{-5, 1, 7\}\), \(t_4=v_9\in \{-2, 1, 4, 7\}\), \(t_5=v_{10}\in \{\pm 3, \pm 1, 5\}\), which make \(D_f = 19\). The quadratic function is convex and well-behaved. It takes five categorical variables \(t_{1,\dots ,5} = v_{6,\dots ,10} \in \{0, \pm 1, \pm 2\}\) and five numerical variables \(x_{1,\dots ,5} = v_{1,\dots ,5} \in [-2,2]\), with \(D_f = 25\). As Fig. 3e,f shows, both BO methods make progress in descending the function value towards the optimum, while LVGP-BO has a considerably faster convergence rate. Through these, we find that when the dimensionality of the problem is high, convexity influences the comparison between LVGP-BO and Lolo-BO similarly to the low-dimensional situation. For the three 10D functions, Lolo-BO displays consistent behavior of making slow progress; whereas when the function is ill-behaved near the domain boundary, the efficiency of LVGP-BO decreases. In the Supplementary Information (SI), we present additional test cases, which support the findings as well.

#### What determines BO performance?

We seek explanations for LVGP-BO and Lolo-BO’s performance differences from two aspects: fitting accuracy and uncertainty estimation quality. For BO to successfully locate the global optimum, the ML model does not need to fit the response function accurately everywhere, but the accuracy near the optimum matters. This accuracy can be improved with new sample acquisition guided by uncertainty. In regions that are far from optimal, the quantity of data and the resulting prediction accuracy only need to be sufficient to confidently rule the region out as a promising design region, which is why uncertainty quantification is important.

We select the Branin function as a representative, generate 10 initial samples following the same procedure as described in “Methods”, run 20 iterations of BO to acquire 20 more samples, and fit LVGP/Lolo models to the samples of sizes 10 and 30. In Fig. 4 we show the behaviors of LVGP and Lolo in fitting the mixed-variable Branin function.

With a small training set, LVGP can attain better prediction of the function compared to Lolo; moreover, its uncertainty quantification assigns low uncertainty in the vicinity of known observations and high uncertainty in the regions where data are sparse. These enable well-directed sampling in the less explored regions, hence promoting the model to “learn” the target function efficiently in regions where it matters most, i.e., in the vicinity of the optimum. In SI, we show the sampling sequences of Lolo-BO and LVGP-BO optimizing the Branin function to illustrate the difference between LVGP’s and Lolo’s uncertainty quantification and their effects on sample selection.

We conduct a similar fitting test for the Rastrigin function in 2D and show the results in Fig. 5. For this more complex function, both methods fail to attain a good fit with 20 initial samples. Despite this, LVGP gives a better estimation of uncertainty in that it assigns higher uncertainty at the regions with sparser data points, which effectively guides the model towards a better fit. With the training sample size increased to 60, however, LVGP can fit the fluctuating function more accurately than Lolo (compare panels c and d), especially in the regions close to optimum, i.e., where the function values are low.

We next extend this fitting and UQ comparison to high-dimensional cases. Training samples are generated from the 10D quadratic function and the Perm function (Eq. 9), respectively, then LVGP and Lolo models are tested to accurately fit the samples. At high dimensions, the previous visualization is no longer feasible. Instead, we adopt the relative root-mean-square error (RRMSE)

$$\begin{aligned} {\rm RRMSE} = \frac{{\rm RMSE}}{\sigma _y} = \sqrt{\frac{\sum _i (\hat{y}_i-y_i)^2}{\sum _i (y_i-\bar{y})^2}} \end{aligned}$$

(11)

as a metric of fitting quality. Note that RRMSE is related to another widely used metric, the coefficient of determination \(R^2\), through \(R^2 = 1 – \mathrm{RRMSE}^2\). \(\mathrm {RRMSE>1}\) can happen when the fitted model is worse than using the response mean as a constant predictor. For each function, we evaluate the model fitting quality by calculating RRMSE on 1000 test samples generated independently from the training samples. Figure 6a,b shows the RRMSEs across different training sample sets to indicate how well the two models fit the mathematical functions. We also show in Fig. 6c,d the deviations of the models’ predictions from the true responses, i.e., the prediction errors.

As the figures show, in fitting the relatively simpler quadratic function, Lolo attains a higher quality compared to LVGP at small sample size (50); as the sample size increases to 80, the fitting quality of LVGP improves significantly, whereas the fitting quality of Lolo does not change much. However, even at a small sample size, LVGP’s prediction error for samples with low function values (near optimum) is lower than Lolo’s. With well-directed uncertainty quantification, LVGP-BO can add samples that improve the fitting, thus leading to efficient convergence.

In fitting the 10D Perm function, both models fail to attain a good RRMSE; the Lolo model fits slightly better than LVGP, and this comparison is not changed as the training sample size increases. In this case, the dimensionality is too large for the known samples to cover, hence, it is difficult for both ML models to capture the complexity of the Perm function. Neither model shows dominant fitting accuracy near optimum over another model. Lolo’s slightly better global fitting accuracy enables it to display higher efficiency in BO of the Perm function.

### Materials design applications

To assess the performances of two ML models in facilitating materials design, we apply Lolo-BO and LVGP-BO to optimize materials’ properties using several existing experimental/computational materials datasets. We adopt a simple yet generally applicable design representation, using chemical compositions as design variables to optimize the properties. We start with a small fraction of samples randomly selected from the dataset; the evaluation of a sample is imitated by querying its corresponding property from the dataset. Model fitting and acquisition function follow the same procedure as previous sections.

#### Moduli of \(\mathrm {M_2AX}\) compounds

The \(\mathrm {M_2AX}\) materials family^{38} has a hexagonal crystal structure, in which M, A, and X represent different sites, M and X atoms form a 2D network with the X atoms at the center of octahedra, while the A atoms connect the layers formed by M and X. \(\mathrm {M_2AX}\) compounds display high stiffness and lubricity, as well as high resistance to oxidation and creeping at high temperatures. These properties make them promising candidates as structural materials in extreme-condition applications such as aerospace engineering^{39,40}. For both the capability as a structural material and the manufacturability, elastic properties are of particular importance. However, elastic properties have nontrivial dependence on composition. The computational determination of these properties includes the calculation of stresses or energies under several strains using density functional theory (DFT)^{40}, which is resource-intensive. Here we demonstrate how the design optimization of \(\mathrm {M_2AX}\) compounds’ elastic properties directly in the composition space may benefit from mixed-variable BO while assessing the performances of Lolo-BO and LVGP-BO.

From Balachandran et al.^{41}, we retrieve a dataset that reports Young’s, bulk, and shear moduli (*E*, *B*, and *G*, respectively) of 223 \(\mathrm {M_2AX}\) compounds within the chemical space \({\rm M} \in \{{\rm Sc, Ti, V, Cr, Zr, Nb, Mo, Hf, Ta, W}\}\), \({\rm A}\in \{{\rm Al, Si, P, S, Ga, Ge, As, Cd, In,Sn, Tl, Pb}\}\), \({\rm X} \in \{{\rm C, N}\}\). The input variable \(\varvec{v}\) is thus three-dimensional, with all inputs being categorical. Since *E* and *G* are highly correlated (shown in SI), we choose *E* and *B* as target responses and optimize them separately. Both optimizations start with 30 initial samples and run for 50 iterations, adding one sample per iteration.

In Fig. 7, we show the value distributions and optimization histories for *E* and *B*. Both Lolo-BO and LVGP-BO are capable of discovering the material that has optimal modulus within 20 iterations, significantly reducing the required resources as compared to computationally evaluating the whole design space. Of the two BO methods, LVGP-BO exhibits marginally higher rates of convergence in both tasks. As observed from a–b, the input–response relations of both *E* and *B* are relatively well-behaved, without showing abrupt changes or clusters of values, which favors LVGP-BO’s efficiency. Hence, the results here are consistent with the findings in the mathematical test cases.

#### Bandgap and stability of lacunar spinels

In another materials design application case, we consider materials having the formula \(\mathrm {AM^aM^b_3X_8}\) and the lacunar spinel crystal structure^{42}. Element candidates for the sites are \(\mathrm {A\in \{Al, Ga, In\}}\), \(\mathrm {M^a\in \{V, Nb, Ta, Cr, Mo, W\}}\), \(\mathrm {M^b\in \{V,Nb,Ta,Mo,W\}}\), \(\mathrm {X\in \{S,Se,Te\}}\). This is a family of materials that potentially exhibit metal–insulator transitions (MITs)^{43}, i.e., electrical resistivity changing significantly upon external stimuli, such as temperature change across a critical temperature. The MIT property can be leveraged for encoding and decoding information with lower energy consumption compared to current devices^{44}. Hence, the \(\mathrm {AM^aM^b_3X_8}\) materials family shows promise for next-generation microelectronic devices, including neuron-mimicking devices which can accelerate ML^{45}. The origin of the transition is structural distortion triggered by external stimuli, which leads to a redistribution of electrons in the band structure^{46}. Though the physical mechanism behind MITs is complex, two relevant properties may serve as proxies for the performances of candidate materials. One is the bandgap of the insulating ground state \(E_g\), as a larger \(E_g\) generally corresponds to a higher resistivity in the insulating state, and therefore a higher resistivity change ratio upon the phase transition to a metallic phase under the applied field. Another is the decomposition enthalpy of the material \(\Delta H_d\) which is associated with a material’s stability. Stable compounds are more likely to be synthesizable and operable in novel devices. Therefore, we use these two properties corresponding to their functionality and stability as the target in MIT materials design.

In a dataset collected by Wang et al.^{3}, a total of 270 combinations of candidate elements are enumerated, for every compound \(E_g\) and \(\Delta H_d\) calculated from DFT are listed. Similar to the previous test case, we use the four-dimensional (categorical) composition as the inputs and optimize two responses \(E_g\) and \(\Delta H_d\) separately. Figure 7e–g shows the results: starting from 10 initial samples, LVGP-BO and Lolo-BO both discover the compound with optimal \(\Delta H_d\) efficiently, but are relatively slow in optimizing \(E_g\); LVGP-BO shows better efficiency on \(\Delta H_d\) while Lolo-BO shows better efficiency on \(E_g\). When we increase the initial sample size to 30, LVGP-BO and Lolo-BO exhibit similar efficiency on \(E_g\).

We show the different characteristics of the two responses of the dataset by a scatter plot in Fig. 7h. Among the 270 \(E_g\) values in the dataset, 56 are zero and others are positive values. These values form a clustered distribution at 0 and make the target function \(E_g = f(\varvec{v})\) ill-behaved. Combined with the high-dimensionality, this function becomes challenging for LVGP-BO and Lolo-BO to optimize, as we demonstrated in the high-dimensional numerical examples. In contrast, \(\Delta H_d\) values form a relatively well-behaved target function, hence, LVGP-BO performs better on this task, also in agreement with previous findings.

#### Investigating machine learning performance

Though the response functions linking materials compositions to properties are black-box functions, for which the exact behaviors are unknown, we investigate the fitting accuracy of two ML models to get a hint of their performances in BO. The regression plots in Fig. 8 show the test response predictions versus the true response values for the ML models fitted to training data of size 30. Interestingly, for all four target properties, LVGP shows better prediction errors for the high-performance (larger true response) materials, which are the candidates close to optimum. Aligning with the findings from mathematical examples, this explains LVGP-BO’s edge in efficiency over Lolo-BO. Another observation worth noting is that, for the lacunar spinels with zero bandgaps, LVGP’s predictions contain negative values, while Lolo’s do not (Fig. 8c). This is because the LVGP model using the RBF kernel tends to yield smooth response function predictions, which accounts for the influence of the clustered behavior of the response function on LVGP-BO’s performance.