Kinetic mechanisms predict how ion channels and other proteins function at the molecular and cellular levels. Ideally, a kinetic model should explain new data but also be consistent with existing knowledge. In this two-part study, we present a mathematical and computational formalism that can be used to enforce prior knowledge into kinetic models using constraints. Here, we focus on constraints that quantify the behavior of the model under certain conditions, and on constraints that enforce arbitrary parameter relationships. The penalty-based optimization mechanism described here can be used to enforce virtually any model property or behavior, including those that cannot be easily expressed through mathematical relationships. Examples include maximum open probability, use-dependent availability, and nonlinear parameter relationships. We use a simple kinetic mechanism to test multiple sets of constraints that implement linear parameter relationships and arbitrary model properties and behaviors, and we provide numerical examples. This work complements and extends the companion article, where we show how to enforce explicit linear parameter relationships. By incorporating more knowledge into the parameter estimation procedure, it is possible to obtain more realistic and robust models with greater predictive power.

## Introduction

To understand how ion channels and other proteins function at the molecular and cellular levels, one must decrypt their kinetic mechanism, defined as a set of interconvertible structural conformations, with transitions quantified by rate constants that depend on external variables (e.g., membrane potential, ligand concentration, etc.). Modeling molecular kinetics is not trivial, but sophisticated algorithms have been developed that can extract the rate constants for a given model from a variety of experimental data types, such as single-channel or whole-cell voltage-clamp currents (Colquhoun and Hawkes, 1982; Colquhoun and Sigworth, 1995; Qin et al., 1996, 2000; Venkataramanan and Sigworth, 2002; Colquhoun et al., 2003; Milescu et al., 2005; Csanády, 2006; Stepanyuk et al., 2011, 2014), single-molecule fluorescence (Weiss, 2000; Milescu et al., 2006a,b; Liu et al., 2010), or even current-clamp recordings (Milescu et al., 2008). Automated algorithms that can identify the model itself have also been attempted (Gurkiewicz and Korngreen, 2007; Menon et al., 2009). This abundance of data and analysis algorithms is great, but it raises an important issue: how do we make sure that a model is consistent with all these data, new and old? In other words, how do we extract a model that explains new experimental data but also satisfies existing knowledge?

In the first part of this study (see Salari et al. in this issue), we discussed the general principles of enforcing prior knowledge using model constraints. We identified two main types: parameter constraints and behavioral constraints. Parameter constraints represent explicit mathematical relationships between model parameters, which include the pre-exponential and exponential rate constant factors, allosteric and other multiplicative factors, and any external variables that describe the experimental data and the recording conditions. In part one, we presented a unified mechanism that handles both equality and inequality linear parameter constraints, using relatively simple linear algebra methods that convert the interdependent parameters of the model into a reduced set of independent (“free”) parameters that can be passed to the optimization engine. Linear relationships can implement a surprisingly broad range of constraints, particularly after some model parameters are first transformed by the logarithm function. For example, one can scale one rate to another, parameterize allosteric relationships, enforce microscopic reversibility, restrict a parameter to a range of values, etc. Nevertheless, linear parameter constraints are not a universal solution.

We present here a complementary modality for enforcing prior knowledge, which can be used to enforce any type of model behavior, as well as arbitrary parameter relationships. For example, one can fit stationary single-channel data using a maximum likelihood method but simultaneously use constraints to enforce voltage dependence or other nonstationary behavior, as obtained from other types of data or from the literature. The basic idea is to calculate for each applied behavioral constraint a penalty that represents the degree by which the model deviates from that constraint. The penalty is then added to the cost function that measures the error between the data and the prediction of the model. As the optimizer minimizes the cost function in search for an optimal solution, it will generate a model that will not only fit the data but will also satisfy all the prior knowledge.

We illustrate here the two types of model constraints (linear parameter constraints and behavioral constraints) and test their respective computational procedures with a simple ion channel modeling example. First, we simulate stochastic macroscopic data in response to a typical voltage-clamp protocol. Then, we fit these data while enforcing different combinations of model constraints. The calculations are explained step by step, and detailed numerical examples are given. All computational procedures were implemented in our freely available software (Milescu, 2015).

## Materials and methods

All the mathematical and computational algorithms described in this study, as well as the simulation, data processing, and model optimization, were implemented, tested, and performed with the freely available MLab edition of the QuB program, running under the Microsoft Windows operating system.

### Model parameters

To simulate the test data, the model shown in Fig. 2 A was tweaked by hand to generate macroscopic currents resembling voltage-dependent sodium currents (Fig. 3). The simulated data were fitted in multiple runs, with different sets of constraints applied to the model (Fig. 2 B). The model parameter values (true, initial, and estimated) are listed in Table 1.

### Stochastic simulations

Ion channel macroscopic traces were simulated stochastically under the voltage-clamp paradigm, using established procedures (Milescu et al., 2005). To approximate the properties of sodium currents, the single-channel conductance was 10 pS and the reversal potential was +60 mV. Random Gaussian noise with zero average and 5-pA standard deviation was added to each trace to approximate whole-cell recording noise. To generate the activation/inactivation time course (Fig. 3 A) and activation/availability steady-state curves (Fig. 3 B), we used a typical voltage-clamp protocol: the channels were first equilibrated at −120 mV and then subjected to a 200-ms conditioning step at potentials ranging from −120 mV to +40 mV, followed by a 50-ms test step at 0 mV. The peak current from each conditioning step was extracted and converted to conductance (assuming a linear relationship), and the obtained values were used to construct the activation curve. Similarly, the peak current from the test step was extracted and used to construct the availability curve. Together, the currents evoked during the first 5 ms of the conditioning step in the −50 mV to +40 mV range (Fig. 3 A) and the activation and availability curves (Fig. 3 B) were used for model optimization.

### Model optimization

The algorithms were tested by fitting the data shown in Fig. 3. Optimization trials were run on a dual eight-core 3.3 GHz Intel Xeon processor computer, running Windows 7 (64 bit). Each optimization run took less than 20 min to complete. The model was optimized by minimizing the cost function with a modified version of the Davidon–Fletcher–Powell optimizer (dfpmin; Fletcher and Powell, 1963; Press et al., 1992). For efficiency, the cost function was coded for parallel computation. The cost function was calculated as the sum of square errors between the data and the prediction of the model, normalized to the total number of points, plus a penalty term for those optimization trials involving a behavioral constraint, as detailed in Results. The gradients of the cost function with respect to the free parameters were calculated numerically. The prediction of the model for a given set of parameters was obtained by simulating the deterministic response of the model to the same stimulation protocols as used for simulation. Then, the resulting traces were processed to extract the time course and the activation and availability curves, following the same procedure as for the simulated test data (Milescu et al., 2010; Salari et al., 2016).

## Results

### Implementing prior knowledge with model constraints

A theoretical background was given in part one of this study (Salari et al., 2018), where we briefly introduced a few prerequisite topics (kinetic mechanisms, model topology, and parameter estimation). Then, we presented a mathematical formalism and computational procedure for enforcing linear parameter constraints. Here, in part two, we continue with a mechanism for enforcing model behavior and arbitrary parameter relationships. Then, we give a step-by-step numerical example that illustrates the implementation of all types of constraints. Because we will make multiple references to the first part, the equations introduced here are numbered in continuation of those in part one. The mathematical symbols that are not explicitly defined here have been introduced in part one.

#### Behavioral constraints and arbitrary parameter relationships

A good amount of prior knowledge about the channel can be expressed as linear relationships between model parameters, resulting in constraints that can be handled with relatively straightforward linear algebra methods. However, some channel behaviors and properties cannot be easily formulated as explicit functions of model parameters or they need nonlinear functions that are not so easily tractable. For voltage-gated channels, examples of important functional behavior include the open probability (*P*_{O}), the voltage dependence of activation or inactivation, or the use-dependent availability. These properties cannot be easily formulated as functions of rate constants, except for very simple kinetic mechanisms. Furthermore, they must be prescribed in the context of a specific experiment (e.g., a voltage-clamp step protocol).

*Expanding the cost function*

Without explicit parameter relationships, we cannot solve behavioral constraints simply by converting model parameters into free parameters, as we did for linear parameter constraints. Likewise, we cannot use that formalism to solve any parameter constraint that cannot be written as a linear relationship between the transformed model parameters, as captured by the generalized linear constraint Eqs. 32 and 33 (Salari et al., 2018). Instead, the solution we propose here for handling behavioral constraints and arbitrary parameter relationships is to include them into the cost function. Thus, the cost function *F*, which is minimized by the optimizer in search for an optimal solution, can be expanded to include multiple components, one for each set of experimental data and one for each constraint:

where $F i Y $ represents the cost of data component *i* and $F j C $ represents the cost of behavioral constraint *j*. The α quantities are relative weighting factors that multiply the cost function components. Including multiple components in the cost function is known in the optimization literature as multi-objective fitting (Druckmann et al., 2007; Bandyopadhyay and Saha, 2013; Fletcher, 2013). For example, $F i Y $ could stand for newly acquired voltage-clamp data (e.g., the time course of activation and inactivation at different membrane potentials), whereas $F j C $ could be data from the literature (e.g., steady-state activation and inactivation curves) or a hypothesized property (e.g., the open probability *P*_{O}). The cost function components that denote constraints should be formulated in such a way that they take a value of zero when the underlying constraint is satisfied, and a very large value (relative to the data cost components) when the constraint fails, as explained further.

*Formulating behavioral constraints and arbitrary parameter relationships*

Some behavioral constraints can be formulated as mathematical relationships involving simple properties of the channel. For example, we could constrain the maximum open probability reached during a depolarization step to take certain values or to fall within a range:

An example of a parameter constraint that cannot be processed with the formalism developed in part one is restricting a rate constant pre-exponential factor $k ij 0 $ to a range of values:

This range constraint cannot be handled as two linear inequality relationships, because they would be mathematically redundant, where both cannot be simultaneously satisfied. Another example is parameterizing an exponential factor $k ij 1 $ as a product of more than one variable:

where *a* and *b* could stand for the δ_{ij} and *z*_{ij} parameters in Eq. 2 (Salari et al., 2018) and *C* is a constant equal to *F*/(*R* × *T*). Likewise, this equation cannot be handled with the formalism from part one because it is nonlinear.

Algebraically, any equality or inequality relationship can be converted to "=0" or "≥0," respectively. Thus, the above constraints could be rewritten as follows:

The cost function components *F*^{C} that correspond to the above equality and inequality relationships can be formulated as follows:

where α is a weighting factor with the following properties:

and

where “constraint” refers to the left-side term of a constraint equation (Eq. 37; Salari et al., 2018). Thus, these cost function components are equal to zero when the underlying constraints are exactly satisfied but take a positive and quadratically increasing value when the constraint relationships are not satisfied.

*Nonparametric behavioral constraints*

In principle, the same logic can be applied to any other model property. However, some model behaviors cannot be reduced to a single value or cannot be easily calculated theoretically. For example, many functional aspects, such as the recovery from inactivation or the use dependence, can be empirically fitted by one or two exponentials. Unfortunately, these apparent time constants cannot be directly and easily calculated from the model, which actually predicts a larger number of exponentials, equal to the state count minus one (Milescu et al., 2005; Salari et al., 2016). Likewise, the voltage-dependent activation curve can be well approximated and fitted by a Boltzmann equation with only two parameters, but calculating the half-activation and the sensitivity values directly from the model is generally not practical.

In cases like these, it is simpler to simulate the response of the channel to the same stimulation protocol as was used to obtain the experimental (or hypothesized) data. Then, a cost function component can be calculated as the sum of square differences between the simulated and the experimental data:

where *y*_{i} and *x*_{i} are experimental and simulated data points, respectively, and *N* is the number of data points. In the above equation, one could use the raw data directly, point by point, or one could extract some properties from the raw data and use the points on that property curve. For example, when the stimulation protocol is designed to extract the time course of a macroscopic current, one would fit the raw data directly. In contrast, when the stimulation protocol is designed to extract a behavior, such as the recovery from inactivation, one would fit the property curve. Although extracting a property curve involves additional computation, it has the substantial benefit of concentrating the information on a very specific aspect of channel behavior. For example, in a curve that represents the recovery from inactivation, every data point informs directly on the apparent time constants of inactivation. Likewise, every data point in a voltage-dependent activation curve informs directly on the two parameters of the Boltzmann equation.

Whether the cost function for these nonparametric behavioral constraints is calculated from raw data or from property curves or is based on hypothetical values, one must consider the presence of random noise and other artifacts that contaminate the experimental data. Thus, even a perfect model would not generate zero cost for the constraints, which may confuse the optimization engine. A simple solution is to reformulate the problem as an inequality:

where ε is a positive constant proportional to the noise content of the data. Then, the cost function component can be written as follows:

where α is a weighting factor with the following property:

Thus, if the sum of square errors between the simulated and the experimental data is less than ε, then the underlying constraint is considered to be satisfied. In other words, the model only needs to explain the constraining data components “well enough,” as warranted by the inevitable noise and artifacts.

#### A computational framework for solving behavioral constraints and arbitrary parameter relationships

We presented above some examples of behavioral constraints and arbitrary parameter relationships. In general, the problem we must solve is to find a model that not only best explains the experimental data but also satisfies a set of equality or inequality constraints. Mathematically, the problem can be formulated as the minimization of a function subject to a set of nonlinear constraints:

where $X \xaf $ and $F( X \xaf )$ are the vector of free parameters and the cost function, respectively, as defined in part one, and $h i ( X \xaf )$ and $g j ( X \xaf )$ are two sets of *N*_{E} equality and *N*_{I} inequality constraints, respectively. In the case of maximum likelihood methods, instead of maximizing the log-likelihood, one can equivalently minimize its negative.

As discussed in the previous section, one possible solution to this constrained function minimization is to add the constraints to the cost function (Eq. 60). This approach is equivalent to the method of penalties (Fletcher, 2013), which reformulates the problem as an unconstrained optimization, by adding a penalty term to the cost function $F( X \xaf )$. Thus, the objective becomes minimizing a penalized cost function $F'(X\xaf,\alpha )$:

where α and β are penalty factors with the following properties:

Formulating the *h*_{i} and *g*_{j} expressions that correspond to any of the equality and inequality constraints above is straightforward. For example, the first two *P*_{O} constraints given above (Eq. 64) become:

The gradients of the penalized cost function might be required by the optimization engine. These could be calculated analytically, as follows:

The derivatives of *h*_{i} and *g*_{j} with respect to a free parameter $x \xaf k $ depend on the specific constraints used, and one may need to calculate them using the chain differentiation rule, as in the case of linear constraints. Ultimately, if the constraint functions are too complicated, the gradients can be approximated numerically. Whether the gradients are calculated analytically or numerically, one should keep in mind that inequality penalties are only semidifferentiable and may throw off the optimizer. If this is the case, then one possibility is to approximate the penalty into a differentiable function (Bertsekas, 1975). The variance of the estimates can also be calculated using the procedure described in part one.

The main advantage of the penalty method is that it can be used with any optimization algorithm that was originally designed for nonconstrained problems. The main issue, however, is the choice of the penalty parameter α. On the one hand, if α is too small, then the solution found by the optimizer will be pulled toward $F( X \xaf )$ and the constraints defined by $h i ( X \xaf )$ and $g j ( X \xaf )$ may not be exactly satisfied. On the other hand, if α is very large, then the solution will satisfy the constraints (in principle). However, the optimizer engine may have a difficult time finding that solution, because the penalized cost function $F'(X\xaf,\alpha )$ may change very abruptly in the *n*-dimensional parameter space. Thus, although it is conceptually and computationally very simple, using the penalized cost function is not exactly a plug-and-play solution, as in the case of linear constraints.

A possible strategy is to find the solution iteratively, starting with a relatively small *α*, and increasing it until some convergence criteria are satisfied (Himmelblau, 1972). This is the approach we are taking here, as summarized in Fig. 1. Once a model topology is chosen, the workflow starts with defining the linear parameter constraints and the behavioral constraints (if any), including any other arbitrary parameter constraints. The next step is to define the cost function and the penalized cost function, according to the specific application (e.g., macroscopic fitting, single-channel maximum likelihood, etc.). Then, we choose a set of model parameters as the starting point, **K**_{0}. Mathematically, these parameters do not need to satisfy either set of constraints (parameter or behavioral), but starting as close as possible is recommended. From the initial model parameters **K**_{0}, we then calculate the initial set of free parameters, $X \xaf 0 $. Finally, we initialize the penalty parameter α to α_{0}, equal to a small positive number. In practice, this value can be chosen so as to make the data and the penalty components of the overall cost function be approximately equal.

Once these quantities are defined and initialized, we start the optimization procedure, which involves two embedded loops, as shown in Fig. 1. An outer loop, indexed by *p*, handles the schedule for updating the penalty parameter α_{p} that is used to calculate the penalized cost function $F'(X\xaf,\alpha p)$, and an inner loop, indexed by *k*, handles model optimization for a given α_{p}. The penalty parameter α_{p} is progressively increased at each outer loop iteration, to increase the relative weight of the penalty component in $F'(X\xaf,\alpha p)$. Thus, the behavioral constraints may be only loosely satisfied at the end of the first outer loop iteration, but they will get tighter each time α_{p} is increased. The outer loop can be run for a predefined number of iterations or can be terminated when the behavioral constraints are satisfied, if at all possible.

In principle, any type of optimization engine can be used in the inner loop. As explained, the optimizer is completely model and penalty blind. Essentially, the optimizer solves an unconstrained minimization problem, operating with a set of free parameters $X \xaf $. However, as it explores the free parameter space in search for a minimum, the optimizer will require, for a given $X \xaf k $, the penalized cost function $F'(X\xaf,\alpha p)$ and possibly its gradients. For this, the transformed model parameters **R**_{k} are calculated from $X \xaf k $, and then the model parameters **K**_{k} are calculated from **R**_{k}, as outlined in Fig. 3 of the companion article (Salari et al., 2018). The model optimization in the inner loop can be run for a predefined number of iterations or can be terminated when some convergence criteria are satisfied. Typically, convergence requires that there be no substantial changes in the free parameter values and in the cost function (and its gradients be close to zero) from one iteration to the next.

### Testing the algorithm

To clarify the computational procedures described in both parts of this study, we give a step-by-step numerical example. For illustration purposes, we chose the model shown in Fig. 2 A, which is complex enough to accommodate an allosteric factor (Fig. 2 A, *a*_{1}), an external parameter representing the number of active channels in the recording (*N*_{C}), and several parameter and behavioral constraints. At the same time, the model is small enough to allow us to print the vectors and matrices used in the numerical computation. Readers who wish to implement their own code can use these examples for verification. Briefly, we tested the algorithms by fitting a stochastically simulated set of macroscopic data, generated in response to a typical voltage-clamp step protocol. We intentionally chose a relatively small dataset (the time course of activation and inactivation at different voltages and the voltage-dependent steady-state activation and inactivation curves, as shown in Fig. 3) to illustrate potential parameter identifiability issues and the effect of constraints. The data were fitted in multiple runs, with each run enforcing a different set of constraints, as outlined in Fig. 2 B. The simulation, data analysis, and fitting procedures are explained in Materials and methods.

We define the following set of model parameters **K**:

where *a*_{1} is the allosteric factor and *q*_{1} is the channel count. Thus, we have a total of 14 model parameters, with numerical values given in Table 1. The corresponding vector of transformed model parameters **R** is:

where

#### Applying linear parameter constraints

The test model has allosteric relationships that require two sets of linear parameter constraints. The first set applies to the forward transitions C_{1} to C_{2} and C_{2} to C_{3}:

As explained in part one (Salari et al., 2018), we apply the logarithm on both sides of Eq. 80 and obtain a set of two equality relationships:

The backward transitions C_{3} to C_{2} and C_{2} to C_{1} have a similar allosteric relationship, which results in another set of two equality relationships:

Altogether, we have a set of four linear mathematical relationships between the transformed model parameters in **R**:

Together, these four equations reduce the number of free parameters by four, from 14 down to 10. We must point out that the same allosteric relationships could be implemented just as well without the explicit use of an allosteric factor. Thus, we could write the following constraint equation:

Again, after taking the logarithm, we obtain a set of two equations:

The first equation in the set enforces the allosteric relationships at *V* = 0. However, the second equation is not sufficient to enforce the allosteric relationships at any arbitrary voltage. To do so, we must add either one of the following two equations:

Altogether, this is equivalent to having a set of three mathematical relationships:

with the final form:

This result can be easily verified: the same set of equations can be obtained by eliminating $\varphi 1$ between the first two equalities in Eq. 83. Without the explicit use of an allosteric factor, the model would have only 13 model parameters. However, there would be only three constraint equations in that case, which means that the number of free parameters would still be the same: 10. In conclusion, adding an allosteric factor does not necessarily increase the number of free parameters of a model. Instead, it provides a more intuitive way of formulating the relationships that may exist between rate constants.

Another assumption that we made about our test model is that the O_{3} to I_{4} transition has the same voltage sensitivity as the C_{2} to O_{3} transition. This results in one mathematical relationship:

With this relationship, the number of free parameters is down to nine. We note that this relationship follows from the actual model parameters used to simulate the data. However, even if the true model parameters were unknown, the savvy investigator would still enforce this constraint, motivated by the shape of the activation curve, which reaches a constant value toward the more positive voltages (Fig. 3 B). For this particular model, this aspect of the activation curve suggests that the rates of activation and inactivation increase by approximately the same factor with voltage. If, for example, the inactivation rate had a stronger voltage dependence than the activation rate ($k 3,4 1 >k 2,3 1 $), the activation curve would start turning down at more positive potentials.

The final assumptions we made involve inequality constraints. Thus, we constrained the rate of recovery from inactivation (I_{4} to O_{3}) to have negative voltage dependence and the deactivation rates (O_{3} to C_{2} and C_{2} to C_{1}) to have a voltage sensitivity greater than −0.15 mV^{−1}:

Because the $k 3,2 1 $ and $k 2,1 1 $ factors are already constrained to be equal, we apply the inequality constraint only to $k 2,1 1 $, to avoid redundancy. To handle these two inequality constraints, we add two slack variables, *z*_{1} and *z*_{2}, and write two equality relationships:

Thus, although we added two constraints, we also added two slack variables. As a result, the number of free parameters remains the same: nine.

We summarize here all the constraint equations:

#### Linear algebra calculations

We can now formulate the constraint matrix **M** and vector **V**, as in Eq. 37 (Salari et al., 2018):

As **M** contains only constant values, it can now be decomposed with the singular value decomposition technique into three matrices, as in Eq. 40 (Salari et al., 2018):

From **V**_{M}, we can now obtain the **A** matrix, as shown in Eq. 41 (Salari et al., 2018):

The **A**^{−1} matrix is simply obtained by transposing **A**, and we do not show it here. To obtain the **B** vector, we must first calculate the pseudoinverse of **M**, **M**^{+}, as shown in Eq. 43 (Salari et al., 2018). First, we calculate the pseudoinverse of **S**_{M}, **S**_{M}^{+}:

With **S**_{M}^{+}, we can calculate **M**^{+}:

The **M**^{+} matrix can now be used to calculate the **B** vector, as in Eq. 42 (Salari et al., 2018). However, when the model contains inequality constraints, the **V** vector will contain elements that depend on the slack variables *z*_{1} and *z*_{2}. During optimization, the slack variables are changed freely by the parameter estimation engine. However, at the beginning of the optimization, they must be initialized by solving their corresponding constraint equation. In this case, *z*_{1} is initialized as follows:

where 0.1 is the initial value of $k 4,3 1 $. Likewise, *z*_{2} is initialized as:

With the *z*_{1} and *z*_{2} values, we can now calculate the initial **V** and **B** vectors:

To start the optimization, we must initialize the free parameters $X \xaf $. When the model constraints include inequalities, as we have here, $X \xaf $ is formed by the reunion of **X** and **Z** vectors (Eq. 44; Salari et al., 2018). **Z** contains the slack variables, which are initialized as shown above, whereas **X** is initialized from the initial set of model parameters **R**_{0}, using Eq. 39 (Salari et al., 2018). Altogether, the initial free parameter values are:

Each time the cost function is requested by the optimization engine, the transformed model parameters **R** are calculated from the free parameters $X \xaf $ with Eq. 40 (Salari et al., 2018). Then, the model parameters **K** are calculated from **R**.

#### Applying arbitrary parameter constraints and behavioral constraints

In addition to linear parameter constraints, we also tested a few simple but useful constraints that cannot be implemented with the linear algebra formalism (Salari et al., 2018). First, we tested an arbitrary parameter constraint that restricts the channel count *N*_{C} to a range of values. The test data were simulated with *N*_{C} = 5,000. However, to test the algorithms under more realistic conditions, we enforced a range of values away from the true value (6,000–8,000). The same strategy was used with all the behavioral constraints introduced next. The constraint and the corresponding cost function component are the following:

where β_{1} and β_{2} are numerical factors with the following properties:

The normalization to 6,000 or 8,000 makes this penalty component numerically comparable with all the other penalty and data components.

The second is a behavioral constraint that enforces the maximum open probability reached during a brief depolarization step from −120 to 0 mV, as illustrated in Fig. 2 B (run IV). With the true parameter values, the model predicts a maximum *P*_{O} of ∼0.42, but we constrained it to 0.5. The constraint equation and the corresponding cost function component are the following:

Finally, we tested a behavioral constraint that enforces the time constant of recovery from inactivation. As discussed earlier, it would be rather difficult to calculate this quantity analytically. Instead, we use a surrogate value, extracted from a simulation in response to a two-pulse voltage-clamp protocol. As shown in Fig. 2 B (run V), we inactivate the channels with a brief voltage pulse, let them recover for 50 ms, and then apply a second pulse to test how many channels have recovered. The recovered fraction is defined as the maximum open probability reached during the second voltage pulse relative to the first pulse:

Thus, if we want to enforce a specific recovery time constant *τ*_{R}, we can calculate the corresponding $f R $ for a recovery interval of arbitrary duration *t* and use that $f R $ value in the behavioral constraint:

Our test model predicts a recovered fraction $f R $ of ∼0.43 with a recovery interval *t* = 50 ms, at −80 mV, but we constrained it to 0.8. The constraint equation and the corresponding cost function component are the following:

#### Optimizing the model

We illustrate the performance of the algorithms with six optimization runs, each implementing a different set of constraints, as described in Fig. 2 B. Together, these examples test the full range of constraints that the algorithms are designed to handle, as they are likely to occur in practical modeling applications: linear equality and inequality parameter constraints and model behavior and properties. Furthermore, we test all types of model parameters, as defined in the companion paper: rate constant parameters, multiplicative factors (*a*_{1}), and external parameters (*N*_{C}). The true parameter values, as well as the initial and the estimated values obtained in each optimization run, are given in Table 1.

In run I, we enforced only equality linear parameter constraints (Eq. 83). The cost function that was minimized by the optimizer had the following expression:

where $F 1 D $, $F 2 D $, and $F 3 D $ are the cost components corresponding to the data shown in Fig. 3: time-course traces, activation curve, and availability curve, respectively. Each of these data components is the sum of square differences between the data and the prediction of the model, normalized by the total number of data points. The time-course component was also normalized to the peak current, as follows:

where *N*_{V} is the number of traces, *N*_{t} is the number of samples in each trace, *y*_{V,t} and *I*_{V,t} are the data point and the predicted current, respectively, at voltage *V* and time *t*, and *y*_{peak} is the largest negative peak current in the entire dataset. With these normalizations, all three data cost components take comparable values. We have not done it here but, in principle, one should further normalize the data to account for potentially different levels of noise, such as between the time course traces and the activation and availability curves. One possibility would be to multiply each cost component by a factor inversely proportional to its normalized variance, to ensure that less noisy datasets will be fitted more tightly by the model. This variance can be approximated through fitting each dataset with an appropriate mathematical function (e.g., a sum of exponentials for the time course data and a Boltzmann for the activation and inactivation curves).

In run II, we used the same conditions as for run I, but we added the inequality linear parameter constraints (Eqs. 90 and 91). In runs III through VI, we applied the same linear parameter constraints as in run II, but in each of these runs, we added different constraints that were implemented via the penalty mechanism: an arbitrary parameter constraint that restricts *N*_{C} to a range of values (run III) and behavioral constraints that enforce *P*_{O} (run IV), the recovered fraction *f*_{R} (run V), or both *P*_{O} and *f*_{R} simultaneously (run VI). In runs III through VI, the optimizer minimized a penalized cost function with the following expression:

where $F C $ stands for either $F 1 C $ (run III), $F 2 C $ (run IV), $F 3 C $ (run V), or $F 2 C +F 3 C $ (run VI).

The optimization results shown in Fig. 4 demonstrate the proper functioning of the algorithm with all types of constraints. To test the convergence of the optimizer, we intentionally chose starting parameters (Table 1) that generate prediction curves that deviate substantially from the data, as shown by the blue traces in Fig. 3. In all cases, the cost function virtually settled in ∼30 iterations (Fig. 4 A, left), after which most model parameters changed little (Fig. 4 B). For run I, the final parameter values are within ∼10% of the true values (Table 1), which is to be expected under these conditions (Milescu et al., 2005). For the other runs, the constraints push some of the parameters away from their true values, as intended. Although the final parameter values (Table 1) vary across the six runs, they all predict virtually identical fit curves, all represented by the red traces in Fig. 3 (A and B).

The effect of inequality linear parameter constraints can be observed by comparing runs I and II. In run I, the $k 4,3 1 $ parameter is unconstrained and meanders to values as large as +0.12 mV^{−1}, finally converging to a slightly positive value, even though the true value is slightly negative (−0.05 mV^{−1}). The convergence to a positive value for $k 4,3 1 $ is not a failure of the search engine but simply a result of the stochastic nature of the data. In run II, $k 4,3 1 $ is constrained to a negative range and, as expected, converges to a final value of zero. The convergence to a value that lies on the edge of the constrained range would suggest that this solution is suboptimal, compared with the solution found in run I. Indeed, the cost function value is nominally larger: 0.000533 for run II versus 0.000392 for run I, although the difference is imperceptible. The $k 2,1 1 $ parameter is also constrained with an inequality in run II. However, $k 2,1 1 $ hovers comfortably above its limit in run I, and, as expected, the constraint applied in run II has no effect.

In runs III through VI, the cost function is replaced by a penalized cost function, which adds penalty components (Eq. 73). In all of these cases, the penalty function quickly drops by four or five orders of magnitude during the optimization (Fig. 4 A, right). In run III, where the penalty mechanism enforces a range of values for *N*_{C}, the penalty function occasionally drops to zero (Fig. 4 A, right, orange trace), whenever *N*_{C} is within the allowed range and the constraint is exactly satisfied (Fig. 5 A). Although the initial value of *N*_{C} (3,000) was outside the acceptable range, the optimizer quickly brought *N*_{C} within the range, in just a few iterations. We find it interesting that the convergence value of *N*_{C} does not lie on the edge of its allowed range (6,000), as close as possible to the convergence value found in run II (5,500). This suggests the existence of multiple solutions that predict identical fits.

In runs IV through VI, the penalty mechanism was used to enforce equality relationships for *P*_{O} and *f*_{R}. Like with *N*_{C} in run III, the initial values of *P*_{O} and *f*_{R} were quite different from their enforced values. However, a few iterations were sufficient to bring *P*_{O} or *f*_{R} close to their enforced values, as illustrated in Fig. 5 (B and C, green and magenta traces). In contrast to run III, the penalty function approaches a small value but does not reach zero (Fig. 4 A, right, green, blue, and magenta traces). Accordingly, the enforced quantities hover in a small neighborhood centered on their enforced values (Fig. 5, B and C). The size of this neighborhood depends on the numerical value of the penalty parameter α_{p}: the larger the α_{p}, the smaller the neighborhood. In principle, enforcing the penalty might require several cycles, where each cycle increases the value of α_{p}, as illustrated in Fig. 1, and tightens the constraint. However, for these relatively simple optimization examples, we initialized the penalty factor as α_{0} = 1, which enforced the constraints tightly enough in a single penalty cycle.

As expected, adding these constraints that push *P*_{O} and *f*_{R} away from their true values also results in slightly suboptimal fits in runs IV through VI, compared with runs I through III. Furthermore, these constraints expose correlations between properties of the model (*P*_{O} and *f*_{R}) and certain model parameters. Thus, *P*_{O} is inversely correlated with *N*_{C}. Without any constraint, *P*_{O} and *N*_{C} are estimated as ∼0.42 and 5,100, respectively. In contrast, when *P*_{O} is constrained (runs IV and VI), the *N*_{C} estimate is lowered to 4,000 (Fig. 5 A). Vice versa, when *N*_{C} is constrained to a larger value, the estimated parameters predict a lower *P*_{O} (Fig. 5 B). Likewise, *f*_{R} is correlated with the rate of recovery from inactivation (the I_{4} to O_{3} transition). Thus, enforcing *f*_{R} to a larger value (0.8) than the true value (0.43) results in a smaller estimate for $k 4,3 0 $ and in a more negative estimate for $k 4,3 1 $ (Fig. 4 B, runs V and VI). Considering these potential correlations between different parameters or model properties, one should be careful not to apply contradictory constraints.

## Discussion

We have presented here a set of mathematical and computational tools that can be used to estimate kinetic mechanisms that explain new data but also satisfy user-defined prior knowledge. In part one of this study (Salari et al., 2018), we derived a procedure for enforcing explicit linear equality and inequality parameter relationships. Here, in part two, we introduced a procedure for enforcing arbitrary model properties and behaviors, as well as arbitrary parameter relationships. Together, these methods are capable of handling virtually all types of model constraints that are likely to arise in practical situations. To demonstrate our approach, we provided a step-by-step numerical example. Interested readers can use these examples to implement the constraint algorithms in their own software and to verify correctness. We also implemented these algorithms in the freely available QuB software, as maintained by our laboratory (Milescu, 2015).

### Compatibility with existing optimization frameworks

The procedures described here can be easily adapted into a typical optimization package. As illustrated by the workflow diagram in Fig. 1, only a few modifications would be required: adding a function that converts between free parameters and model parameters ($X \xaf k \u2192K k $) and vice versa, modifying the cost function to calculate and add the penalty, and implementing a schedule to progressively increase the penalty parameter. The first two modifications are trivial because every optimizer will have a callback function where the user writes custom code to calculate the cost function for a given set of free parameters. The third modification is potentially more involved, but a simple solution would be to increase the penalty parameter by hand and restart the optimizer with the parameter values obtained in the previous iteration.

### Constrained fitting versus multiobjective fitting

There is a certain similarity between constrained fitting and simply including those data that underlie the constraints into a more comprehensive dataset to be fitted. The second approach is generally described as multiobjective fitting (Druckmann et al., 2007; Bandyopadhyay and Saha, 2013). Although it is not a substitute for the reduction method that is used to enforce linear parameter constraints, it could be a substitute for the penalty method. As the name implies, in this case the optimizer would need to find a solution that satisfies multiple objectives (i.e., datasets). This is conceptually equivalent to constrained fitting, but there is also one important difference: in multiobjective fitting, the optimal solution found by the search engine may actually explain poorly each and all of the individual datasets, as long as it is the best overall compromise. Moreover, to find this compromise solution, one must choose a set of weighting factors that encode how much each dataset is worth to the model, which is not trivial.

In contrast, the constraining mechanism described in this study will give the highest priority to the constraints and satisfy them exactly (the linear parameter constraints, via the reduction method) or at least very closely (all other constraints, via the penalty method). Only after the constraints are satisfied will the model adapt to explain the data (in as much as it is possible). Nevertheless, as we explained in the paper, a certain margin of error can be built into the constraints to accommodate noise and potential artifacts, but the constraints will stay tightly within this margin. Then, one advantage to the constraint approach is that one can more easily detect when a model is incompatible with the data. Furthermore, one could also detect inconsistent knowledge, as signaled by incompatible constraints.

### Model behavior: To enforce or not?

The need for enforcing explicit parameter relationships is obvious, if only to consider microscopic reversibility or the ratio of sequential activation rates. However, it may be less clear to the reader why model behavior and properties need to be enforced. Why not derive them directly from the data? After all, once model parameters are estimated, they can be used to predict any model property or behavior. The problem resides in the potential lack of model and parameter identifiability. In an ideal case, the model would be uniquely identifiable, which means that no other topology exists that can explain the data equally well (Kienker, 1989; Bruno et al., 2005). Furthermore, the data would be noise and artifact free and the model parameters would be fully identifiable, which means that the model admits a unique solution and the optimizer is able to find it from the data. If this were the case, then it would make little sense to enforce a model behavior or property except to test the sensitivity of the parameters with respect to that behavior.

In reality, however, the true model may never be known, and the working model may be just one out of many equivalent topologies. Furthermore, the parameters may not be fully identifiable, either because the model admits multiple solutions (theoretical parameter identifiability) or because the data are corrupted by noise and artifacts that flatten the cost function surface (practical parameter identifiability; Milescu et al., 2005; Raue et al., 2009; Siekmann et al., 2012; Hines et al., 2014; Middendorf and Aldrich, 2017). Thus, estimating the kinetic mechanism from limited data may result in a parameter set that is just one out of many possible solutions and potentially one with poor predictive power.

This is actually the case with our numerical example: in all runs, the estimates obtained by the optimizer are close to the true values (Table 1), except when otherwise constrained (e.g., *N*_{C} in run III). However, the estimates differ across runs, even though the fits are virtually identical between runs and follow closely the data (Fig. 3, red lines). How can different sets of parameters produce the same solution? The explanation for this apparent contradiction is that the parameters are not uniquely identifiable given the reduced dataset. Clearly, adding more constraints or enforcing other model behaviors would improve parameter identifiability and would select only those parameter solutions that are compatible with that behavior. Furthermore, it would also improve model identifiability, making it easier to discover the correct model. Short of the true model, we would at least obtain more robust models, as nature always does.

## Acknowledgments

We thank the members of the Milescu laboratories for their constructive comments and suggestions.

This work was supported by the American Heart Association (grants 13SDG16990083 to L.S. Milescu and 13SDG14570024 to M. Milescu) and the Graduate Assistance in Areas of National Need Initiative/Department of Education (training grant fellowship to M.A. Navarro).

The authors declare no competing financial interests.

Author contributions: L.S. Milescu developed the mathematical and computational algorithms and implemented them in software. All authors contributed to designing and testing the algorithms and software and writing the manuscript.

Richard W. Aldrich served as editor.

## References

## Author notes

M.A. Navarro and A. Salari contributed equally to this paper.