An article with three goals, namely, to (1) provide the set of ideas and information needed to understand, at a basic level, the application of convolutional neural networks (CNNs) to analyze images in biology; (2) trace a path to adopting and adapting, at code level, the applications of machine learning (ML) that are freely available and potentially applicable in biology research; (3) by using as examples the networks described in the recent article by Ríos et al. (2024. https://doi.org/10.1085/jgp.202413595), add logic and clarity to their description.

### Introduction

This article adds to a large body of instructional materials on the implementation of neural network models available online (some of which are examined below, under Resources). It is built specifically for biologists interested in the fundamentals of the technique of neural networks (a subset of machine learning, which is, in turn, part of the wide field of AI) as much as in its practical operation. Assuming an awareness of quantitative modeling in this readership, the article starts by noting that neural networks are, in essence, mathematical models with adjustable parameters, to then focus on their unique and defining features. For instruction on practical aspects, the tutorial takes examples from the article by Ríos et al. (2024), recently published in this Journal, and should therefore aid its reading.

### The main ideas

#### Origins

Artificial intelligence (AI) as an alternative to human intelligence was first posited by Turing (1950), while the concept and first realizations of neural networks are attributed to McCulloch and Pitts (1943), who proposed a computational network of logical units as a model for neural activity. Models evolved from their early work were in a category sometimes called symbolic AI, a category that comprises much of what the biologists who read *JGP* normally do. For example, the early work on programming for objective detection of calcium sparks uses symbolic AI (e.g., Cheng et al., 1999; Iglesias et al., 2021). The original networks of McCulloch, Pitts, and their followers were just simplified representations of biological neuronal networks, models of the nervous system. By this criterion, the authors were doing neurobiological research. More recently, after advances largely attributed to Geoffrey Hinton (Rumelhart et al., 1986), computers started to “learn,” with outcomes reminiscent of the learning functions of animal brains.

These developments produced a curious inversion of terms: currently, most AI programmers do not “do” neurobiology, but instead use, learn from and are inspired by it. This flow from neuro to AI is not limited to software; a most striking example is the development of neuromorphic hardware (e.g., Ham et al., 2021), defined by individual chips that reproduce functions of neurons and even astrocytes. The multiple interactions between neurobiology and AI are explored by Surianarayanan et al. (2023).

#### Machine learning (ML)

Machine learning consists of algorithms—linked computations that use information, provided by the user, to improve their performance (a.k.a. learning). In image analysis (sometimes referred to as computer vision), the information takes the form of “training sets”: large groups of images paired with individual labels that establish the quality that the algorithm is asked to detect. In the *JGP* article by Ríos et al. (2024), the labels provided to the “semantic segmenter” models establish, for every pixel, where in a muscle cell it is located (I band, A band, etc.) or whether it lies within the image of a glycogen granule. For a simpler approach, the article also develops “categorical classifiers.” For that type of model, the labels refer to a whole image and state where in the cell it is located or how many granules it contains. It is only after the algorithm has been exposed to multiple images and gone through multiple learning iterations that it becomes capable of performing its intended function; in the examples, classifying pixels or images.

#### A first example of a neural network

Artificial neural networks (ANNs) are a class of machine learning devices with applications in a wide set of fields, from voice recognition to autonomous driving, from financial forecasting to protein folding. (Narcross [2021] provides a useful compilation of the many types of networks developed for machine learning and annotates the menagerie of acronyms that refer to them).

A defining characteristic of neural networks as used in AI is that they implement composite mathematical functions. They do so at multiple levels of functional compounding, or functions of functions. A simple example, modified from a popular online course by A. Ng (Ng, 2021) is in Fig. 1. The goal of this “Realtor’s Network” is to estimate a fair price for a home, based on its features. For the simplified example, four of the important features are represented within the green box at left, collectively identified as *X*: an array or “column vector” with four numeric terms (*x*_{1}, …, *x*_{4}). These terms are then combined to generate a second set of numbers. For instance, combining the area and number of bedrooms generates *f*, an estimate of the size of the family for which the home would be desirable. From the zip code and other information, the network derives *w*, for walkability of the neighborhood. Another combination of input features generates a number *q* for the quality of schools. These three variables are the output of a first layer of units, or, conventionally and aspirationally, “neurons.” In a second step, a neuron in a second layer combines these intermediate variables to generate a price, *p*, which undergoes a correction for the season—it is hard to sell in winter—to reach the final estimate, represented as $\hat{y}$ (“y-hat”), to distinguish it from the actual price, *y*. Each of these combinations requires factors and additive terms, the parameters of the model, also called “weights,” as they establish the influence that a given variable has on the final price or on an intermediate variable. (Stretching the neurophysiological analogy, the variable weights stand in for synaptic plasticity.) This is a three-layer network, reflected in the notation of parameters: *a*_{i}, *b*_{i}, *c*_{i}, for the successive layers, where i goes from 1 to 3.
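The flow just described can be sketched in a few lines of Python. All weights and feature values below are invented for illustration; the figure specifies none.

```python
# Hypothetical features x1..x4 of one home (invented values)
area, bedrooms, walk_score, school_score = 150.0, 3.0, 0.8, 0.7

# First layer: each "neuron" combines selected inputs
f = 0.01 * area + 0.5 * bedrooms     # family size for which the home suits
w = 1.0 * walk_score                 # walkability of the neighborhood
q = 1.0 * school_score               # quality of schools

# Second layer: combine the intermediate variables into a raw price
p = 2000.0 * f + 300.0 * w + 500.0 * q

# Final step: a seasonal correction yields the estimate y_hat
season_factor = 0.9                  # e.g., a winter market
y_hat = season_factor * p
```

Each multiplier above plays the role of a weight; in a real network these would not be hand-picked but adjusted during training.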

Training the network consists of comparing the estimates $\hat{y}$ with the actual prices (*y*). The parameter values are initially guesses, which must be modified to minimize the error of fit. In this case, the error (“loss” in AI parlance) can be defined as the average sum of squares of the individual differences $\hat{y} - y$, namely

$$L = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}_{i} - y_{i}\right)^{2} \qquad \text{(Eq. 1)}$$

The summation is extended to the *m* terms of the training group (the homes that have already been sold, which therefore have both their features, *x*_{i}, and sold price, *y*, known). This error is minimized by iterative corrections of parameters using the standard method of steepest descent, e.g., Yang (2019). A simple reminder of the method takes the first 10 min of our journal club presentation at https://www.youtube.com/watch?v=sYlcx3UEs34&ab_channel=EduardoRios.

The minimization can be pictured as a descent on the surface defined by the loss *L* in parameter space. In symbols, loss (Eq. 1) is a function *L*(*W*), where *W* is the array [*w*_{i}] of all adjustable parameter values. If there are N_{w} parameters, *W* can be referred to as a vector in the N_{w}-dimensional space of parameter values. The iterative correction consists of subtracting from *W* a vector in the same space, calculated by multiplying the gradient of the loss by a factor *r*, the “learning rate”:

$$W \leftarrow W - r\,\nabla L(W) \qquad \text{(Eq. 2)}$$

The gradient $\nabla L$ is the vector of derivatives of the loss with respect to each parameter. Successive corrections, together with a careful choice of learning rate—usually based on experience plus trial-and-error—lead to a minimum of *L*(*W*), where *W* is now the optimum set of parameter values. Note that unlike the parameters introduced earlier, the learning rate *r* does not enter the calculation of loss, and is therefore not subject to automatic correction, but it still participates in the optimization process. *r* is one of many “hyperparameters” that characterize the model or its implementation.
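As a concrete sketch, steepest descent can be run on the simplest possible model, a line fitted to invented data. The slope/intercept model, the data, and the learning rate are all illustrative choices, not taken from the article.

```python
import numpy as np

# Invented training set, generated by y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

W = np.array([0.0, 0.0])             # parameter vector [slope, intercept]
r = 0.02                             # learning rate (a hyperparameter)

for _ in range(5000):
    y_hat = W[0] * x + W[1]          # forward pass
    err = y_hat - y
    grad = np.array([2.0 * np.mean(err * x),   # dL/d(slope)
                     2.0 * np.mean(err)])      # dL/d(intercept)
    W = W - r * grad                 # the correction of Eq. 2
```

After enough iterations, `W` approaches [2, 1]; a learning rate chosen too large makes the same loop diverge instead.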

The Realtor’s model of Fig. 1 has many elements in common with current neural networks but differs from them in aspects of form and substance. First, the Realtor’s is a shallow network of just three layers of neurons (the mathematically inclined will realize that it is actually reducible to a one-layer model—an issue revisited later). Current practice favors instead “deep” ANNs, with hundreds of layers (hence “deep learning”). A more radical and ultimately beneficial difference between the Realtor’s model and current ANNs will become clear later.

#### Back-propagation of errors

Fig. 1 also traces the dependence of the estimate *ŷ* on parameter a_{1}: *ŷ* is a function of the variable *p*, generated in the second layer, which is in turn a function of *f*, generated in the first layer using parameters a_{1}, a_{2}, and features of the home. In symbols,

$$\hat{y} = \hat{y}\big(p\big(f(a_{1}, a_{2}, x_{1}, x_{2}),\, w,\, q\big)\big) \qquad \text{(Eq. 3)}$$

Every additional layer will add another functional nesting. Calculation of the derivative of the loss (Eq. 1) with respect to a given parameter requires that of *ŷ* with respect to the same parameter. These are calculated using the chain rule, which prescribes successive multiplications of partial derivatives. For example, the recipe for the derivative with respect to a_{1} is (in simplified notation)

$$\frac{\partial \hat{y}}{\partial a_{1}} = \frac{\partial \hat{y}}{\partial p} \cdot \frac{\partial p}{\partial f} \cdot \frac{\partial f}{\partial a_{1}} \qquad \text{(Eq. 4)}$$

The AI pioneers realized that there is a backward directionality in this process, indicated by the orange arrow in Fig. 1. While the concept can be traced back to earlier times, the already-mentioned article by Rumelhart et al. (1986) is credited with popularizing the idea and the descriptive term “back propagation of errors.” We start by realizing that a neural network operates via layers of neurons that pass information from left to right. As the succession of multiplicative terms in Eq. 4 makes clear, the analysis of errors, that is, the calculation of derivatives for correction, starts from *ŷ* and progresses from right to left, layer by layer (a_{1} ← *f* ← *p* ← *ŷ*). This realization allowed the systematic analysis of errors and optimization of networks with increasingly large numbers of layers. The term “deep learning” helped popularize them.
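The chain rule of Eq. 4 can be checked numerically on a toy nesting; the functions and values below are invented for illustration only.

```python
# Toy nesting y_hat(p(f(a1))), with invented one-variable functions
def f(a1):
    return 3.0 * a1 + 2.0            # "first layer," depends on parameter a1

def p(f_value):
    return 0.5 * f_value ** 2        # "second layer"

def y_hat(p_value):
    return 0.9 * p_value             # final (seasonal) correction

a1 = 1.0
# Chain rule: dy/da1 = (dy/dp) * (dp/df) * (df/da1)
chain = 0.9 * f(a1) * 3.0            # d(0.5 f^2)/df = f

# A central finite difference approximates the same derivative
h = 1e-6
numeric = (y_hat(p(f(a1 + h))) - y_hat(p(f(a1 - h)))) / (2.0 * h)
```

The two numbers agree, which is exactly the check automatic-differentiation frameworks perform when they are tested.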

Another difference between current neural networks and the Realtors’ example is represented in Fig. 2: rather than preselecting the information to be provided to neurons in the input layer (features *x*_{i}), as done in the Realtors model, all features are provided to every neuron, and similarly, the outputs of all neurons in layer 1 are provided to every neuron in layer 2, etc. Networks with this property are called “fully connected” or “dense.”

The combined introduction of large numbers of layers and their full connectivity was accompanied by (or perhaps required) abandoning any intent to attribute meaning to the intermediate variables and associated parameters. A classical modeler will see neural networks as quantitative models akin to polynomial fits, in which coefficients have a purely descriptive (rather than any “physical” or “mechanistic”) meaning. While this erasure of mechanism may look like a disadvantage (in the example, the models relinquish any notion of material features like “walkability” or “family size”), it actually liberated the process from the need for any physical insight, which in turn allowed applications to a vast realm of problems, in a purely empirical manner, based only on precedent, i.e., prior solutions, crucial for training. (This mechanistically agnostic stance leads logically to the dense, fully connected structure, which amounts to admitting total ignorance of the features of the image that are important to the classification, and in the successive layers the abandonment of any a priori distinction among the different nodes/neurons. As written above, full connectivity also frees the approach from the need for prior knowledge, adding flexibility to it and objectivity to its results. Convolutional approaches, described later, constitute a partial backtrack from full connectivity—a retreat with some basis in theory and solid support from experience).

### The operations of ANNs

#### Outline

As summarized above, deep learning networks for image classification are mathematical models that perform calculations on images for assignment to a class (out of a limited number of possibilities; say, cat or no-cat). The calculations result in a set of numbers that are converted to class probabilities, leading to the final classification. The construction of a working model usually entails a standard process: it starts with gathering a set of images used for “building,” which are first curated (to be representative of the target set), labeled (i.e., assigned to one of the classes), and separated into three subsets: training, validation, and testing. The model is first “fitted” to the training subset (i.e., its adjustable parameters are optimized by error minimization). This is an iterative process, the progress of which is monitored by the evolution of the statistical measure of error in the training set (“loss,” automatically minimized for parameter adjustment) and in the validation set (largely used to adjust the structure of the model and procedural aspects, represented by hyperparameters). The process results in a model structure and parameter values determined by both the training and validation sets. After this iterative process is concluded, i.e., after an acceptable measure of error is reached on the validation set, the performance of the model is measured in the test set, previously unknown to the model, as a way to appraise the reliability of the model predictions in general “production” mode.
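The three-way split just described can be sketched in a few lines. The filenames and the 70/15/15 proportions below are illustrative assumptions, not taken from the article.

```python
import random

# Stand-ins for a curated, labeled image set
images = [f"img_{i:04d}.png" for i in range(1000)]
random.seed(0)
random.shuffle(images)               # randomize before splitting

n = len(images)
cut1, cut2 = round(0.70 * n), round(0.85 * n)
train_set = images[:cut1]            # parameters fitted here by loss minimization
val_set = images[cut1:cut2]          # guides hyperparameter and structure choices
test_set = images[cut2:]             # held out until the final appraisal
```

The essential discipline is that the three subsets are disjoint, so the test set truly remains unknown to the model until the end.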

#### Input

ANNs operate on sets of images. In digital form (Fig. 2), images are usually rectangular arrays of values, representing light intensity in a scale of convenience, in most cases, the 256 integer values between 0 and 255. Color images are mostly represented in the three-color RGB format, with one array for each of red, green, and blue (Fig. 2). While digital cameras deliver megapixel images, we will start from 64 × 64 or 4,096 pixels per image for each color. The information provided to the input layer of the processing network is, therefore, a set of 12,288 intensity values (12,288 is n^{[0]}— the number of elements or features, in what is conventionally called layer 0). As illustrated in Fig. 2, images are first “flattened” to a column vector *X*^{[0]} of elements *x*_{i}, with i ranging from 0 to 12,287 in the figure (some programming languages may instead number arrays starting from 1 and ending at 12,288). The superindex ^{[0]} identifies this vector as the zeroth layer of the network. The models in Ríos et al. (2024) take as input images of two sizes: 256 × 256 pixels for the categorical classifiers (described in their Data S1 and illustrated here with Fig. 7) and 1,024 × 1,024 pixels for the segmentation models in the main article. In these cases, the input vector *X*^{[0]} consists of more than a million elements.
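With NumPy, the flattening just described is a one-liner; the image below is synthetic, and the normalization to [0, 1] is a common preprocessing choice rather than a prescription from the article.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 64 x 64 RGB image with 8-bit intensities (0-255)
image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

# Flatten to the input vector X[0] and normalize intensities to [0, 1]
x0 = image.reshape(-1).astype(np.float32) / 255.0
```

The resulting vector has 64 × 64 × 3 = 12,288 elements, the n^{[0]} of the text.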

#### One “neuron” defines multiple parameters

Going back to the Realtor’s example, it can be seen that the top neuron of the first layer defines two parameters, a_{1} and a_{2}, because it uses only the two top values of the input information vector. In the general fully connected case, neurons will use every input value to generate their output. This implies that every neuron of the input layer will have *n*^{[0]} + 1 parameters, as the calculation includes an additive term. Fig. 3 A illustrates the operation of a first layer of three neurons on the column vector *X*^{[0]} that represents the 64 × 64-pixel image from Fig. 2. Every neuron in the layer (numbered 0–2) will introduce 12,288 parameters *w*_{ij} (weights), where the first subindex represents the neuron (i = 0–2) and the second represents the element in the input vector (j = 0–12,287), plus the additive term. Similarly, every one of the *n*^{[2]} neurons of the second layer will introduce *n*^{[1]} + 1 parameters, where *n*^{[1]} equals the number of neurons of the first layer, etc.

#### Successive neuron layers perform matrix multiplication

The operation of each neuron, a sum of products that combines *n* numbers with *n* factors, is called the scalar or “dot” product of two vectors. For neuron i, with weights *w*_{ij} acting on inputs *x*_{j},

$$z_{i} = \sum_{j=1}^{n} w_{ij}\,x_{j} \qquad \text{(Eq. 5)}$$

(for simplicity, the additive terms are omitted here and below). When the operation is carried out by *n*^{[1]} neurons (those in layer 1—identified by superindex ^{[1]}—where each neuron has an index i between 1 and *n*^{[1]}), it corresponds by definition to the multiplication of vector *X*^{[0]} by a matrix *W*^{[1]} of *n*^{[1]} rows, each consisting of *n*^{[0]} weights *w*_{ij}:

$$Z^{[1]} = \begin{bmatrix} w_{11} & \cdots & w_{1n^{[0]}} \\ \vdots & & \vdots \\ w_{n^{[1]}1} & \cdots & w_{n^{[1]}n^{[0]}} \end{bmatrix} \begin{bmatrix} x_{1} \\ \vdots \\ x_{n^{[0]}} \end{bmatrix} = \begin{bmatrix} \sum_{j} w_{1j}\,x_{j} \\ \vdots \\ \sum_{j} w_{n^{[1]}j}\,x_{j} \end{bmatrix} \qquad \text{(Eq. 6)}$$

The first equality makes the weights in *W*^{[1]} and features in *X*^{[0]} explicit, and the second equality defines the matrix product as an array (conventionally in column) of “dot” products. There are three indexes in the expressions: j identifies the “feature” in the input vector and goes from 1 to *n*^{[0]}; i identifies the neuron, in this case between 1 and *n*^{[1]}; and the superindex identifies the layer, 1 in this case.

Successive layers (indexed by *l*) of fully connected *n*^{[l]} neurons are therefore representable by matrices of *n*^{[l-1]} columns and *n*^{[l]} rows; in other words, every layer operates like a matrix with as many rows as there are neurons in the layer and as many columns as there are neurons in the previous layer:

$$Z^{[l]} = W^{[l]}\,X^{[l-1]} \qquad \text{(Eq. 7)}$$

Eq. 7 represents the case illustrated in Fig. 3 C, where layer *l*, with three neurons, operates on vector *X*^{[l-1]}, the output of layer *l*-1, with *n*^{[l-1]} neurons. Note that the result, named *Z*^{[l]}, is a column vector with as many elements as there are neurons in layer *l*.

The nature of the operations of a neural network, thus representable as matrix multiplications, explains why computing with GPUs (graphics processing units or “graphic cards”) streamlines network calculations, as GPUs are natively suited for matrix operations. The operation of a neural network is not a series of matrix multiplications, however; for reasons given below, a non-linear operation on *Z*^{[l]} must follow every matrix multiplication to finally produce the layer “output,” a.k.a. “feature vector” *X*^{[l]}.
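In code, a fully connected layer therefore reduces to one matrix product. The sketch below uses illustrative sizes (a 12,288-element input feeding a layer of three neurons) and puts the additive terms back in.

```python
import numpy as np

rng = np.random.default_rng(1)
n_prev, n_l = 12288, 3                    # neurons feeding in, neurons in layer l

X_prev = rng.random(n_prev)               # output of layer l-1 (or the input vector)
W_l = rng.random((n_l, n_prev))           # one row of weights per neuron
b_l = rng.random(n_l)                     # additive terms, one per neuron

Z_l = W_l @ X_prev + b_l                  # the whole layer: one matrix product
```

Each entry of `Z_l` is the dot product of one weight row with the input vector, plus that neuron's additive term.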

#### The last layer performs classification

The chain of layers must ultimately produce the estimate *ŷ*, which is discrete: binary (cat or no-cat, 1 or 0) in the example of Fig. 2, or one of a set of locations, or granule counts, in the categorical classification applications by Ríos et al. (2024). Fig. 4 illustrates how a single neuron (constituting the last layer) can make the conversion. It simply adds all inputs (the elements of the last vector *Z*) and uses the sum (*z*) as an argument of a nonlinear function, the range of which is 0–1. A popular choice for the role is the sigmoid, *σ*(*z*) = 1/(1 + *e*^{−*z*}), represented in Fig. 4 B.

Then, if *σ*(*z*) is >0.5, the binary variable *ŷ* is given a value of 1 (“cat”), while if it is <0.5, *ŷ* will be assigned the value 0 (“no cat”). A function with similar properties, the hyperbolic tangent represented in Fig. 4 D, is often preferred for rather technical reasons. When the goal is to decide among multiple classes (say, granule count in Ríos et al. [2024]), the range of these functions (*σ* or *tanh*) can be divided into intervals, one for each class. (This simple design of the last layer for multiple classes has been superseded. A lucid explanation, by Geoffrey Hinton, of the now favored “softmax” approach and its advantages, can be found here: https://www.youtube.com/watch?v=PHP8beSz5o4&ab_channel=ArtificialIntelligence-AllinOne).
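The last-layer decision can be sketched directly; the values in `Z` below are invented.

```python
import numpy as np

def sigmoid(z):
    # Bounded nonlinearity with range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

Z = np.array([0.4, -0.1, 0.9])           # invented output of the previous layer
z = Z.sum()                              # the single last-layer neuron adds its inputs
y_hat = 1 if sigmoid(z) > 0.5 else 0     # threshold at 0.5: 1 = "cat", 0 = "no cat"
```

Because the sigmoid crosses 0.5 exactly at z = 0, thresholding its output at 0.5 is equivalent to testing the sign of the sum.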

#### Matrices are not enough

As mentioned earlier, the operations of neurons in successive layers cannot be only linear combinations (matrix multiplications). Why not? The simple reason is that matrix multiplication has the associative property, whereby the result of two successive matrix multiplications is equal to a single multiplication by the product of the matrices. As seen in Fig. 3, a layer of *n* neurons, representable by a matrix of dimensions *m* × *n*, has *m* × *n* parameters *w*_{ij}. Two layers in succession of *n* and *h* neurons, respectively, are representable by matrices of dimensions *m* × *n* and *n* × *h*, containing *m* × *n* + *n* × *h* parameters. However, their product is another matrix, of dimension *m* × *h*, often entailing a reduction in the number of parameters. The implication is that successive linear operations do not contribute additional independent parameters (which is the goal of the successive layers in a neural model). Therefore, every layer must include a nonlinear operation, such as *σ*(*z*), on the result of the multiplication. Unlike the last layer, which requires bounded functions like *σ* and *tanh*, the nonlinear functions used in hidden layers do not need to have a bounded range. Fig. 4 C illustrates a commonly used function that lacks an upper bound, named *ReLU* (for rectified linear unit). In sum, every layer must follow its matrix multiplication (of output *Z*^{[l]}) by a nonlinear operation, which yields the layer output *X*^{[l]}.
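The collapse of successive linear layers can be verified directly with small hand-picked matrices; the last line shows how a nonlinearity breaks the equivalence.

```python
import numpy as np

W1 = np.array([[1.0, -2.0],
               [3.0, 0.5]])              # "layer 1": 2 neurons on 2 inputs
W2 = np.array([[1.0, 1.0]])              # "layer 2": 1 neuron on 2 inputs
x = np.array([1.0, 1.0])

two_layers = W2 @ (W1 @ x)               # two successive linear layers
one_layer = (W2 @ W1) @ x                # a single equivalent matrix: same result

relu = lambda z: np.maximum(z, 0.0)      # rectified linear unit
with_relu = W2 @ relu(W1 @ x)            # the nonlinearity breaks the equivalence
```

Here `two_layers` and `one_layer` are identical, while inserting the ReLU between the two multiplications changes the answer.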

#### Convolutions in neural networks

For image analysis, the neural networks that work best include multiple convolutional stages, in which the “neuron” description is largely replaced by that of a “filter” or “kernel.” The reasons for and implications of this renaming will be considered later.

In image analysis, convolutions work on 2-D arrays, combining values of contiguous and typically small sets of pixels, to generate new 2-D arrays (or filtered images). A simple example is the convolution with a kernel consisting of a 3 × 3 array with all nine elements equal to 1/9 (Fig. 5). By definition, the convolution operation produces an output image where the value at every position in the original is replaced by the average of the original value and those of its eight neighboring pixels—the result of multiplying these nine values by the nine elements of the kernel and adding the results. This is the simplest form of a smoothing operation in two dimensions.
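The smoothing operation of Fig. 5 can be reproduced with a short NumPy sketch, written as explicit loops for clarity. (As in deep-learning libraries generally, the operation coded here is strictly a cross-correlation, conventionally called “convolution” in this field.)

```python
import numpy as np

kernel = np.full((3, 3), 1.0 / 9.0)      # nine equal weights: a smoothing filter

def conv2d(image, kernel):
    # "valid" convolution: slide the kernel over every position where it fits
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)   # a 4 x 4 image, as in Fig. 5
smoothed = conv2d(image, kernel)                   # 2 x 2 output: each entry a local average
```

Each output pixel is the average of a 3 × 3 neighborhood of the input, and the 4 × 4 input shrinks to 2 × 2 because the kernel is applied only where it fits entirely.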

The figure also illustrates a 3 × 3 kernel that marks edges. However, in CNNs, none of these filters with defined functions are applied. In agreement with the general rule of ANNs, CNNs eschew operations with physical meaning, such as those in the Realtor’s example, and allow parameters to vary as dictated by the minimization of loss during the training process.

Moreover, in CNNs, convolutions are typically operators on 3-D arrays or “volumes” (multidimensional arrays are commonly called “tensors”). The bottom panel in Fig. 5 illustrates the work of a filter on an RGB image (a 3-D tensor). The simple rule is that a kernel must have the same size in the third dimension (a.k.a. number of “channels”) as the image or tensor to which it is applied. Thus, filtering an RGB color image, with three channels, requires a three-channel filter (27 parameters), as shown. The 3-D convolution in the example consists of placing the filter in every possible position in the plane of the image and replacing, at every position, the three original values with a single number: the sum of 27 products of filter parameters and underlying pixel values.

As described, filters perform weighted sums, identical to those of individual neurons represented by Eq. 5. Therefore, convolutional filters are formally “neurons”; they differ from the fully connected ones by operating on small sets of nearby pixels, therefore extracting local features. Thus, the outputs from convolutional layers are appropriately named “filtered images” or “feature maps.” In contrast, fully connected neuron layers take data from every pixel or neuron in the previous layer and produce outputs of a more abstract nature, sometimes called “feature vectors” or “latent features.”

Convolutional filters, a.k.a. “kernels,” are small arrays; compared with fully connected neurons, their parameters are very few, resulting in models with notably fewer weights. The conservation of local features plus the economy of weights explains the preference for CNNs in image classification and segmentation tasks. Convolutional layers have an additional virtue: by contrast with dense layers, which require inputs of fixed dimensions, they can be applied flexibly, to 2-D arrays of unspecified size. Full, end-to-end, input-to-output flexibility is implemented in “fully convolutional networks,” or FCNs (Shelhamer et al., 2017), which, as the name indicates, only have convolutional components.

Practical convolutional layers consist of multiple different filters, each producing a different intermediate 2-D image. These images are stacked to constitute 3-D tensors that are passed as inputs to the next layer. The dimensions of the tensors vary, depending on the number and operation of filters, which can be defined with complete freedom. Fig. 5 illustrates how a convolution makes the first and second dimensions smaller (“downsampling”; the 4 × 4 original is reduced to 2 × 2). When dealing with larger images, a more effective way of reducing these dimensions is by skipping pixels in the convolution (a procedure known as “striding”). Conversely, it is possible to avoid the change in the first and second dimensions by the simple “padding” maneuver. An ingenious procedure that expands images (i.e., increases their *x*–*y* dimensions) will be described in the section about the semantic segmentation approach to image analysis.
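The size bookkeeping for striding and padding follows a standard formula; the image and kernel sizes below are illustrative.

```python
def conv_output_size(n, k, stride=1, pad=0):
    # Standard bookkeeping: floor((n + 2*pad - k) / stride) + 1
    return (n + 2 * pad - k) // stride + 1

fig5 = conv_output_size(4, 3)                        # plain convolution shrinks: 4 -> 2
halved = conv_output_size(1024, 3, stride=2, pad=1)  # striding downsamples: 1024 -> 512
same = conv_output_size(64, 3, pad=1)                # padding preserves size: 64 -> 64
```

A stride of 2 with one pixel of padding halves each spatial dimension, the typical downsampling step of encoder stages.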

Fig. 6 illustrates an actual CNN. In the example, as in most working networks, the strategy is to first “encode,” also called “downsample”; namely, decrease the image size (dimensions 1 and 2) from layer to layer, while increasing the number of channels (dimension 3), which is achieved by increasing the number of filters in successive layers. The theoretical justification for adding filters is that a reduction in image size without a corresponding increase in the numbers of filters would incur a loss of information. (This rule, however, is customarily violated—not followed strictly—as both *x* and *y* dimensions are halved, corresponding to a fourfold reduction in pixel numbers, while the number of filters is only doubled!).

The example consists of nine layers, with the input image conventionally indexed as layer 0.

The working categorical models in Data S1 of Ríos et al. (2024) consist of 175 layers. As in Fig. 6, only a few layers, the last or “top” ones, are fully connected (“dense”). Choosing a network configuration involves deciding on hyperparameters—numbers of layers, their mode of operation, and their numbers of elements—plus other aspects described later. While the regular parameters are optimized by automated iterative minimization of error, these hyperparameters are chosen by trial and error, guided by the collective experience of the field.

#### A variety of model architectures

As stated above, the components and modes of operation of CNNs are chosen at liberty. This freedom and the intense research activity in the field often lead to innovations that achieve improvements in predictive accuracy and/or reduction in the size of the models required for the same performance (with economies of computing hardware, energy consumption, or time). Four such advances were applied in the categorical classifier models assembled by Ríos et al. (2024). Here, we recapitulate their main features. Later in the tutorial, we will examine the semantic segmenters, which implement a radically different approach to classification.

The first practical model that used systematic backpropagation techniques for minimization of error was “LeNet 5,” by Yann LeCun and others (LeCun et al., 1989); it was developed and successfully used for the automatic recognition of handwritten ZIP codes in mail handled by the US Postal Service. This was followed by “AlexNet,” by Krizhevsky et al. (2012), which used a similar structure but improved over LeNet by adapting it to GPU-equipped computing. The sped-up calculations in turn allowed the use of a much larger model (60 million parameters, versus 60,000 in LeNet 5). Then came “ResNet50” (He et al., 2015), the architecture of choice in the categorical models of Ríos et al. (2024). ResNet introduced residual networks, characterized by connections that skip layers (say, from layer *l* the information can be passed directly to layer *l* + 10, which also receives the regular input from *l* + 9). This skipping technique allowed the implementation of very deep networks, including the 175-layer ResNet50, and the U-Net structure of semantic segmenters. Later models that became popular include “Inception” (Szegedy et al., 2015), which combines multiple types of filters in the same layer and includes innovative “1 × 1” or “network in network” convolutions (Lin et al., 2014, *Preprint*), and “VGG” in its “16” (Simonyan and Zisserman, 2015, *Preprint*) and “19” versions, characterized by a large increase in number of parameters and reliance on small 3 × 3 filters.

While all the models above strived for accuracy in standardized tasks, others sought economy of computing resources; thus “MobileNet” (Sandler et al., 2019, *Preprint*) reduced the memory required through simplified filtering of multiple channels (“depth-wise convolutions”) and other changes, resulting in a model that can be run on smartphones. Some of these architectures were tested by Ríos et al. (2024) for their categorical classifiers without noticeable improvement over the models based on ResNet50. U-Net (Ronneberger et al., 2015, *Preprint*), the structure of choice for the semantic segmenter models developed by Ríos et al. (2024), is a convolutional device that makes radical use of skipping connections. It was introduced for the identification of cancer tissue in histopathology images and found to work well with small training sets. More recently, many of the structures mentioned above have been modified to make them fully convolutional (FCNs [Shelhamer et al., 2017]), thereby achieving flexibility in input dimensions.

### Programming CNNs

One might expect that programming these CNN models, with parameters in the tens or hundreds of millions, would be prohibitively difficult. However, the enormous cohorts of programmers and researchers, together with the culture of open code and support by Google, Microsoft, and other corporations, have generated a wealth of resources that make the technology accessible to anybody with some coding experience. (Said openness does not apply to large corporate models, such as those that support “generative AI,” in which numbers are greater by orders of magnitude; e.g., a reported ~1.7 × 10^{12} parameters for GPT-4, a current version of ChatGPT.)

The environment in which most ANNs for image processing are built and run is Python (Van Rossum and Drake, 2009) at version 3.11.4 at the time of this writing.

The programs that implement the various models perform the basic functions of numerical model fitting, plus others necessary for handling massive data. Thus, they must input data (images), often reformat or “preprocess” them (for consistency and to comply with program requirements: recasting their size or pixel dimensions, as well as normalizing intensities to a set numerical range), then initialize the parameter values, define the model structure, and implement the iterative fitting routines.

After the backpropagation techniques were systematized (LeCun et al., 1989), conventional fitting of models adopted as a standard in every iteration a forward propagation stage—the calculation of the estimate *ŷ* using the input features (*X*) and current parameter or weight values—and a “backprop” stage, where derivatives of the loss and correction terms are calculated. These two steps, which move sequentially through the network layers in both directions, are followed in every iteration by a correction of parameters (“gradient descent,” Eq. 2) and, at times, manual adjustments of hyperparameters. Because training and prediction are carried out on image sets often too large for computer memory, the sets are fed continuously through “pipelines” that contain images with minimal formatting (for economy of transfer). In these pipelines, images are fed in batches, typically of 32 or 64 members. Loss is calculated and differentiated for gradient descent as averages over these batches, which are input sequentially until the training set is used in full, to lead to a new set of parameter values. Then this set is applied to calculate *ŷ* and accuracy for the validation set, thus completing an iteration or “epoch.” Multiple iterations are performed until a satisfactory accuracy is achieved on the validation set.
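The loop just described can be caricatured in a few lines: a one-layer model trained on invented data, with a batch size, learning rate, and epoch count that are arbitrary choices. A real pipeline would stream images from disk and score a separate validation set at each epoch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 20))       # 256 "images," 20 features each (invented)
y = (X[:, 0] > 0).astype(float)          # invented binary labels

W = np.zeros(20)                         # weights of a single output neuron
b = 0.0                                  # additive term
r, batch, epochs = 0.1, 32, 50

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(epochs):
    for start in range(0, len(X), batch):       # one full pass = one epoch
        xb, yb = X[start:start + batch], y[start:start + batch]
        y_hat = sigmoid(xb @ W + b)             # forward propagation
        err = y_hat - yb                        # backprop: gradient at the output
        W -= r * (xb.T @ err) / len(xb)         # gradient descent on the weights
        b -= r * err.mean()                     # ... and on the additive term

accuracy = float(np.mean((sigmoid(X @ W + b) > 0.5) == (y > 0.5)))
```

Even this toy loop shows the anatomy of training: batched forward passes, per-batch gradient corrections, and repeated epochs until accuracy is satisfactory.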

#### Tensorflow

Tensorflow (Abadi et al., 2016, *Preprint*) is an open-source, free library of modules that perform or facilitate tasks for machine learning and AI. First released in 2015 and maintained by Google, it is now in version 2. Originally developed as a Python library, it is no longer limited to that language. Basic information on its functions and installation can be found at https://www.tensorflow.org/overview.

#### Keras

Keras (https://keras.io) is also a library, dedicated to making the application of Tensorflow more user friendly. It claims to be “designed for human beings, not machines” (a “dig” at Tensorflow). The models described by Ríos et al. (2024) were built using Keras in a Tensorflow environment (a programming environment in which all Tensorflow utilities are available and can be invoked by simple calls).

### Two coding examples

In this nuts-and-bolts part of the tutorial, actual programs that accomplish a classification task are examined. The examples, taken from Ríos et al. (2024), are presented starting from the categorical classifiers, which are simpler, both to program and to train (as the needed labeling is also simpler). In the study cited, the categorical classifier approach ran into severe limitations, which led to the development of semantic segmenters that fully overcame them. The reasons for the problems in the categorical approach were identified and will be described here as well, as an additional learning opportunity.

#### A categorical classifier

The approach to identifying locations in EM images, implemented in the “Locations” categorical classifier of Ríos et al. (2024), started with subdivision of the original images into sub-images small enough to be contained fully within a single region (a procedure illustrated in Fig. 7, reprinted from Ríos et al., 2024, Data S1). As detailed in the original article, the sub-images were small enough for most to be fully contained in one of the regions of interest, thus reducing the determination of location to a simple image classification task. Their small area also limited the number of granules they contained to a maximum of 9 or 10, which simplified the task of the “Granules” classifier as well.
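The subdivision step can be sketched in a few lines of NumPy (function name, tile size, and the non-overlapping scheme are illustrative; the actual preprocessing of Ríos et al. [2024] is in their Data S1):

```python
import numpy as np

def subdivide(image, tile):
    """Split a 2-D image into non-overlapping square sub-images of side
    `tile`. Margins that do not fill a complete tile are discarded in this
    simplified sketch."""
    ny, nx = image.shape[0] // tile, image.shape[1] // tile
    return [image[i * tile:(i + 1) * tile, j * tile:(j + 1) * tile]
            for i in range(ny) for j in range(nx)]

em_image = np.zeros((1024, 1024))   # stand-in for a full EM image
subs = subdivide(em_image, 128)
print(len(subs), subs[0].shape)     # 64 sub-images of 128 × 128 pixels
```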

The code that implements “Locations” operating in training mode is listed in Box 1. A full version, which includes sections for working in all modes, as well as facilities for input and output, is shared on GitHub (Rios, 2023a) and in Data S1 of Ríos et al. (2024). The simplifications of coding afforded by Tensorflow and Keras should be evident in that the entire program fits on one page (including descriptive comments, in green font, marked by the initial symbol #). The example was written using Jupyter Notebook (https://jupyter.org), an interface that allows both editing and running of programs written in Python (or other languages). Other editors can be used (including Atom, https://atom-editor.cc, recommended and applied in Baylor’s textbook [Baylor, 2021]). Jupyter notebooks allow splitting the code into convenient cells that can be executed separately. The split can be quite arbitrary; in the example, the full program is presented in the six cells described below:

Code lines are executed sequentially; the partition into separately executable cells (1–6) simplifies description and “debugging.” Cell 1 defines the environment by loading or “importing” the necessary libraries. Cell 2 loads the set of labels of the training images. Cell 3 defines the pipeline that will introduce the training and validation images at execution time. Cell 4 defines the model (locations_model) as a Python function with arguments to be set later (see text) and assembles the layers of neurons that constitute it; the process is simplified by the use of a pretrained structure, ResNet50, as “base model.” Cell 5 redefines the model as “model2” by giving values to the Python function arguments, and completes the assembly by defining the calculation of loss, the measure of performance, and the correction coefficient “r” (learning rate). The final cell puts the process to work, in training mode in the example (set by the statement “history = model2.fit”); it also defines the number of iterations (“epochs”) and a buffer (“history”) for recording the procedure. The function of individual lines is described in comments (green font).

##### Cell 1

This cell brings into computer memory the libraries of commands, or modules, that accomplish the necessary tasks. Python is started in a “Tensorflow environment,” which must be preassembled on the programming computer (a task usually done by Anaconda, described later). Tensorflow must still be loaded (“imported”). The Keras libraries streamline the definition of network layers and their operations; they must be imported from the Tensorflow environment as well. “Image_dataset_from_directory,” or “idd,” implements the efficient Dataset presentation, or piping, of large amounts of data at execution time. “Dense” and “Flatten” are layer types; Keras allows them to be invoked by a simple call. Likewise “Sequential,” another feature of Keras, is a simple method for programming so-called linear layer structures (in which no skip connections are allowed). RandomFlip and RandomRotation are resources for enriching the training set (known as data augmentation). “Adam” is an algorithm (an “optimizer” in AI jargon) that implements gradient descent efficiently. “Plot_model” produces a graph-style depiction of the models.

##### Cell 2

This cell reads from a storage device the labels (or true classifier values, a.k.a. “Ground Truth”) of the subset of images used for training and validation. Labels of training images are produced by the trainer and must be ordered strictly in the sequence of the corresponding images, as their place in the sequence is the sole means of establishing correspondence.
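The order-based correspondence can be illustrated with a short sketch (the file names and the one-label-per-line format are hypothetical, chosen only to show the principle):

```python
import csv
import io

# Stand-in for the label file on disk: one integer label per row, strictly
# in the order of the corresponding (alphabetically sorted) training images.
label_file = io.StringIO("2\n0\n5\n")
labels = [int(row[0]) for row in csv.reader(label_file)]

image_files = sorted(["img_000.png", "img_001.png", "img_002.png"])

# Position in the sequence is the sole link between image and label
pairs = dict(zip(image_files, labels))
print(pairs["img_002.png"])   # 5
```

Any mismatch in ordering (e.g., an extra file, or a different sort convention) silently shifts every subsequent label, which is why the strict sequencing deserves emphasis.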

##### Cell 3

This cell implements the input of images. The high-level instruction “idd,” imported in Cell 1, builds a pipeline that passes images in the highly efficient “Dataset” format, as batches of 32 or 64 members, together with their labels, in a manner tuned to the capabilities of the computer system. Cell 3 is used in the training phase of the network, specifying a split, in this case of 0.75 and 0.25, for the fractions of images that will be used for training and validation, respectively. For actual classification (a.k.a. “prediction”) of images, a similar cell will instead specify “labels = None”; assigning the labels will be the task of the model. An implementation of a no-label dataset for prediction is shared in Ríos et al. (2024), Data S1, Section 2.2.
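What the pipeline does can be caricatured in plain Python (a hypothetical file list; the real “idd” additionally decodes, resizes, and shuffles the images themselves):

```python
# Split a list of 400 image files 75/25 into training and validation
# subsets, then group the training files into batches of 32, mimicking the
# bookkeeping that the Dataset pipeline performs at execution time.
files = [f"img_{i:04d}.png" for i in range(400)]

n_train = int(0.75 * len(files))
train_files, val_files = files[:n_train], files[n_train:]

batch_size = 32
batches = [train_files[i:i + batch_size]
           for i in range(0, len(train_files), batch_size)]

print(len(train_files), len(val_files), len(batches))   # 300 100 10
```

Note that the last batch is smaller than 32 (300 is not a multiple of 32); the real pipeline handles such partial batches transparently.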

##### Cell 4

This is the core of the program. It defines the actual model, namely every layer of the network (with its more than 23 million parameters), and provides the calculations (forward propagation through the network, backward calculation of errors and their derivatives, and iterative correction of parameters). Cell 4 is compact because the largest part of the network is summarized in the command between dashed lines. The command incorporates the user model into the predefined ResNet50 structure (the “base model”), thereby defining 169 of the 175 layers (all but the first three, which adapt the input images to the ResNet50 requirements, and the last three, which perform the classification sought). The statement “base_model.trainable = False” avoids adjusting (fitting) the weights of the base model, thereby limiting the adjustments to the parameters of the last layers (a component of Transfer Learning, described later with an example in Box 2). In the implementation represented, the optional operation of data augmentation is commented out (canceled) with the symbol #.

The code in Box 1 implements transfer learning by incorporating into the model structure (most of) the original ResNet50 with its parameters pretrained. Box 2 demonstrates modifications to the code in Box 1 that train additional layers. Notably, the statement “base_model.trainable = False” in Box 1, Cell 4, changes to True, which frees layers in the pretrained “base model” to have their parameters retrained for the new task. Training will start at the layer in position “fine_tune_at,” chosen here to be layer #160. For fine-tuning, the learning rate (*r* in Eq. 2) is made smaller, 0.25 of the initial rate in this example.

##### Cell 5

The previous cell defines the model (locations_model) as a function, which can be used in multiple ways defined by the values of its arguments. Cell 5 defines the way in which the function will be used (the “instantiation” in Python jargon), which may include, for example, data augmentation. The cell gives the name “model2” to this specific instance of the function; it includes a compiler stage that also sets the basic learning rate, defines the measure of error or “loss” that will be used (many formulations have been found that improve on the simple RMS error of Eq. 1), chooses the optimizer (Adam in this case, cf. note on Cell 1), and establishes a measure of performance, “accuracy,” that is passed to the trainer.

##### Cell 6

Launches the training procedure, also establishing the number of iterations over the training set (“epochs”). Despite its obscure form, the key command here is “history = model2.fit(…)”; it defines the goal (“fit,” which means training in this case) and the variable “history,” which will contain the measures of performance. A run of the program for actual classification (a.k.a. prediction), which of course can only be done after successful training, would be launched by the alternative command “history = model2.predict(…).”

##### Transfer learning

In the example code of Box 1, the crucial Cell 4 has two components that implement “transfer learning”: the use of existing network structures with weights already trained for object detection or image classification on large databases. In line 3 of the cell, the indication “weights = ‘imagenet’” invokes the ResNet50 neural network with its weights (parameters) already optimized (trained) on the vast Imagenet dataset. In the same line, the indication “include_top = False” removes the last (top) layers of the original network, those that perform classification, so that they can be replaced by layers instrumental to the user’s goal, location in this case.

The next line, “base_model.trainable = False,” instructs the program to adjust only the values of the parameters in the layers added by the user. As described in Ríos et al. (2024), this limited optimization stage of training resulted in accuracies in the 70% range for the Locations model. In a subsequent stage, additional layers, with their corresponding parameters, are freed for fitting. Box 2 shows the Cell 4 modifications that allow optimizing parameters within the base ResNet50 structure, starting at layer 160. This refined stage of fitting is usually carried out at a slower pace, set by a smaller learning rate *r*. Refined training led to an accuracy better than 80% in the Locations example.
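The two training stages can be summarized schematically in pure Python (the 175-layer count, the three final classification layers, and the cutoff at layer 160 follow the text; the initial learning rate of 0.001 is an assumed, typical default):

```python
# Stage 1: the whole 175-layer structure is frozen except the three
# classification layers added by the user at the top.
n_layers = 175
trainable = [False] * n_layers
for i in range(n_layers - 3, n_layers):   # the three user layers
    trainable[i] = True
print(sum(trainable))                      # 3 layers fitted in stage 1

# Stage 2 (fine-tuning): layers from position "fine_tune_at" onward are
# unfrozen, and the learning rate is reduced to 0.25 of the initial value.
fine_tune_at = 160
for i in range(fine_tune_at, n_layers):
    trainable[i] = True
r = 0.001 * 0.25                           # slower pace for fine-tuning
print(sum(trainable))                      # 15 layers fitted in stage 2
```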

##### Overfitting

Anyone who has implemented a quantitative model to describe noisy data must have encountered overparametrization and knows to keep the number of adjustable parameters safely below the number of data points. In AI, overparametrization is common, due to the customary use of models with millions of parameters. It reveals itself in the divergence of the accuracies attained in successive iteration epochs: while the accuracy of predictions on the training set of images (i.e., those provided to the program with their user-defined labels) continues to increase and may reach 100%, or perfection, that on the validation set (to which the program must assign labels, which are then compared with the correct ones) remains far from 100% and often decreases. The phenomenon, called overfitting, is a consequence of too many adjustable parameters chasing too few data points: the parameters adjust to describe quirks of the training data that are not shared by other members of the sample (“fitting the noise”), with a consequent increase in error on the validation set. That no further improvement was found in the performance of the example model by adding trainable layers before #160 is a clear indication of overfitting.
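A numerical caricature of the phenomenon, with invented accuracy values, also illustrates the common remedy of stopping training at the epoch where validation accuracy peaks:

```python
# Hypothetical accuracy histories illustrating overfitting: training
# accuracy keeps climbing toward 1.0 while validation accuracy stalls and
# then decays as the model starts "fitting the noise."
train_acc = [0.60, 0.72, 0.81, 0.88, 0.94, 0.98, 1.00]
val_acc   = [0.58, 0.68, 0.74, 0.78, 0.77, 0.75, 0.74]

# "Early stopping": keep the parameters from the epoch with the best
# validation accuracy instead of the last one.
best_epoch = max(range(len(val_acc)), key=lambda e: val_acc[e])
gap = train_acc[-1] - val_acc[-1]          # divergence at the final epoch
print(best_epoch, round(gap, 2))           # stop after epoch 3; gap 0.26
```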

A similar classifier model, “Granules,” is also available in GitHub (Rios, 2023b).

#### A semantic segmenter

In contrast to categorical classifiers, which classify images, semantic segmenters (SS; He et al., 2004, reviewed by Csurka et al. [2023], *Preprint*) classify every pixel of an image. In the implementations of Ríos et al. (2024), SS modules overcame the severe limitations found with the categorical classifiers. The structure represented in Fig. 8 was found suitable both for their Locations segmenter, which assigns every pixel to one of six classes (A band, I band, Z disk or mitochondria, intra-SR, near SR, and unclassifiable), and for their Granules segmenter, which assigns every pixel to one of two classes (in—belonging to a glycogen granule—or out).

The lower part of the structure, colored orange, is a popular simplified form of the original U-Net introduced by Ronneberger et al. (2015), *Preprint*. The top depicts adjustments adopted to match both the input dimensions of the images processed by Ríos et al. (2024) and the needed classifications. To visualize this more complex model structure efficiently, the representation of 3-D tensors as prisms, used in Fig. 6, is conventionally replaced by rectangular bars, in which the vertical length represents the identical *x* and *y* dimensions (the images are square) and the horizontal width represents the number of filters.

Because semantic segmenters do pixel-by-pixel classification, the dimensions of the end images (final arrays where every pixel is classified) must be identical to those of the input. To this end, the U-Net structure consists of two nearly symmetrical stages or arms: an “encoding” or “downsampling” arm, where images are progressively downsized and the information is retained in a progressively increasing number of filters (width of bars), followed by a “decoding” arm, which implements recovery of image size and a reciprocal reduction in the number of filters.

As shown in Fig. 8, network layers (bars) are first aligned horizontally in groups of three, referred to as convolutional or “conv” blocks. The structure can be described as a chain of conv blocks. An additional, essential aspect is the use of “skip connections” (represented by horizontal arrows), akin to the residual connections of ResNets. With these, the output of the second convolutional layer of every block is passed to the input of the corresponding upsampling block, where it is simply stacked as additional channels (the “concatenate” operation). The details of every block can be garnered from the diagram; they are also represented in a literal form in Data S2, Summary of Layers, of Ríos et al. (2024).

This pattern of two convolutional layers in sequence, followed by a reduction in spatial dimension with a corresponding increase in the numbers of filters (a.k.a. channels), is followed down to the bottom of the “U.” At this point, the sequence is reversed, symmetrically, and the spatial dimension is doubled in each three-layer block. The expansion to an image of greater size is done by means of the “transpose convolution,” a numerical concoction that is neither a convolution nor a filter, illustrated in Fig. 9. As stated, at every transposed conv block, the output images (a.k.a. feature maps) from the corresponding encoding block are stacked as additional channels, the “concatenate” operation. This empirically useful operation is justified in theory as a means to preserve and pass detailed spatial information during the encoding process, necessary because encoding is viewed as the extraction of global features by application of various filters, a process that may degrade the local information.
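The arithmetic of the transposed convolution can be made concrete with a small NumPy sketch (single channel, unit batch, no padding; a simplification of the operation illustrated in Fig. 9): each input pixel “stamps” the kernel, scaled by the pixel value, onto an enlarged output grid.

```python
import numpy as np

def transpose_conv2d(image, kernel, stride=2):
    """Transposed convolution of a single-channel image: each input pixel
    stamps `kernel`, scaled by the pixel's value, onto an output grid that
    is `stride` times larger in each spatial dimension (overlaps add)."""
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.zeros((h * stride + kh - stride, w * stride + kw - stride))
    for i in range(h):
        for j in range(w):
            out[i * stride:i * stride + kh,
                j * stride:j * stride + kw] += image[i, j] * kernel
    return out

x = np.arange(4.0).reshape(2, 2)   # a 2 × 2 feature map: [[0, 1], [2, 3]]
k = np.ones((2, 2))                # a 2 × 2 kernel of ones
y = transpose_conv2d(x, k)         # output is 4 × 4: spatial size doubles
print(y.shape)
```

With a 2 × 2 kernel and stride 2, the stamps tile the output without overlap, so each input pixel simply becomes a 2 × 2 block; larger kernels produce overlapping, summed stamps. In the network, the kernel values are trainable parameters, not ones.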

At the end of the U-Net network is the classification layer, represented by the dashed line at the top right in Fig. 8. The spatial dimensions there are 1,024 × 1,024 (the same as the input images) and the number of channels is equal to the number of classes, two for Granules and six for Locations. So, for every pixel in the original image, the model outputs a linear array or vector of values that are interpreted as the probabilities of, say, each of the six possible locations, and the predicted class is that with the highest value.
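The per-pixel decision can be sketched with NumPy on a toy 4 × 4 “image” with six channels (the softmax conversion of raw outputs to probabilities is assumed here; the values are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(1)
# Final-layer output for a tiny 4 x 4 "image": one 6-element vector per
# pixel, standing in for the six location classes.
logits = rng.normal(size=(4, 4, 6))

# Softmax turns each vector into six probabilities that sum to 1
e = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs = e / e.sum(axis=-1, keepdims=True)

# The predicted class of each pixel is the one with the highest probability
pred = probs.argmax(axis=-1)
print(pred.shape)          # (4, 4): one class label per pixel
```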

As was the case for the categorical classifiers, programming these stages is greatly simplified in the Tensorflow Keras environment. Box 3 shows the listing of the Granule model’s core. The listings of the segmenters are available in Data S2 of Ríos et al. (2024) and posted in GitHub (Rios, 2024a, 2024b).

The first two sections define the building blocks of the two arms of the U-Net structure: conv_block for encoding and upsampling_block for decoding. Conv_block consists of two convolutional layers and a Max Pooling layer, which halves the x and y dimensions. Conv_block returns two outputs: next_layer, which is passed to the next encoding block, and skip_connection, passed to the decoding arm (both have the same content, “conv”). Upsampling_block includes the transposed convolution, a concatenation with the stack of images transferred from the encoding arm, and two convolutional layers. The third section, with the same role as Cell 4 in the classifier example of Box 1, defines the model, consisting of eight encoding and eight decoding blocks, as the function unet_model. Note that the last layer, conv10, has only two filters (set by the argument “n_classes,” which was set to 2 in the first line). Thus, the layer produces two values per pixel, interpreted as probabilities that define whether the pixel is or is not in a granule.
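The bookkeeping of sizes and filter counts through the two arms can be verified in a few lines (the initial filter count of 16 is illustrative; the 1,024-pixel input size and the eight blocks per arm follow the text):

```python
# Shape bookkeeping down and up the "U": each encoding block halves x and y
# and doubles the number of filters; each decoding block does the reverse.
size, filters = 1024, 16
shapes_down = []
for _ in range(8):                  # eight encoding blocks
    shapes_down.append((size, size, filters))
    size, filters = size // 2, filters * 2

for _ in range(8):                  # eight decoding blocks restore the size
    size, filters = size * 2, filters // 2

n_classes = 2                       # Granules: in / out
print(shapes_down[0], (size, size, n_classes))
```

The symmetry guarantees that the classification layer has the same 1,024 × 1,024 spatial dimensions as the input, with one channel per class.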

Why did the categorical classifiers of Ríos et al. (2024) fail? Are segmenters intrinsically superior? The anonymous journal referees suggested a main reason, and the authors concurred. The Locations approach started with the subdivision of the original images into small sub-images (Fig. 7). While this approach simplifies the tasks of both classifiers, it also presents the models with just the sub-image, thereby removing the global context that allows an observer to place it easily in the cell. By contrast, the semantic segmenters operate on whole images and do not have this disadvantage. The deficit, therefore, is attributed to loss of context in preprocessing (the subdivision operation), rather than being intrinsic to the categorical classification models.

### Resources

Resources are available in multiple categories:

- (1)
Full-function online servers where Python programs can be written and run without the need for implementation on a local computer. Google’s Colab, or Colaboratory, is the standard: https://colab.research.google.com.

- (2)
Hyper-environments, which facilitate the installation and maintenance of Python and its subordinate libraries. Anaconda (https://anaconda.com) is a leading example. It is popular for its ability to check and automatically maintain compatibility among all the utilities that constitute a working environment, thus keeping up with the continuous evolution of versions and updates that characterizes the Python ecosystem. The Tensorflow environment where the example model programs were put together was built and maintained by Anaconda.

- (3)
Interpreters, which help write and run the programs after Python is installed. We mentioned Jupyter and Atom. IDLE (defined as a graphic user interface or GUI) is an additional alternative: https://docs.python.org/3/library/idle.html.

- (4)
Nested programming environments (a.k.a. platforms, or libraries—collections of modular utilities) use the strengths inherent in computing to automate and streamline the most cumbersome tasks, including the detailed calculation of derivatives needed for parameter optimization. Among these, we already mentioned Tensorflow and its subset Keras. Many programmers vouch for PyTorch (https://pytorch.org).

- (5)
Existing open code, freely available, which integrates these modules into working programs, already trained and tested to perform identification and classification tasks. These programs can be modified for different tasks with relative ease. A popular source and depository of code is GitHub, maintained by Microsoft (https://github.com).

- (6)
Large cloud storage and processing resources available for remote use, combined with large sets of images that can be used for training and testing. An example, used in Ríos et al. (2024), is ImageNet (https://www.image-net.org/). Harvard Dataverse (https://dataverse.harvard.edu) is a convenient free repository for research data, including large image sets, with clear rules for deposit and retrieval.

- (7)
Finally, these procedures are explained, at variable levels of specialization and clarity, by multiple dedicated free websites, user communities and for-profit online enterprises.

#### For learning

As stated before, programming requires some proficiency in the basic languages of AI, either Python or Java. I found the course imparted by Prof. Andrew Ng (Ng, 2021) useful, not just to learn the language, but as a gateway to the neural network courses offered by the same organization. As with most of these online resources, the courses require a paid subscription. A compelling alternative is Computational Cell Biology, the recently published book by Baylor (2021). In the book, Python is introduced and then applied in every chapter at increasing levels of complexity, largely for simulations in computational biophysics. A bonus is the instruction imparted by Prof. Baylor on fundamental topics of cellular excitability and signaling, from the perspective of a field leader.

For programming neural networks, again I mention the CNN course by A. Ng (Ng, 2021). In assembling this tutorial, I was greatly helped by his ideas and didactic strategies (notably the Realtors’ model as a step toward dense networks). But there are many other options. The central hubs of Python, Tensorflow, and Keras provide extensive descriptions and examples of their many functions but aim for completeness rather than clarity. For exploring the theory of AI and computer vision, rather than the practicalities of programming, I recommend the cycle of talks by G. Hinton, mentioned earlier for his explanation of the “softmax” approach to classification. Also at that level are the YouTube postings by Aurélien Géron; at the link below is his enjoyable explanation of alternative measures of loss, the variable minimized during network training—an explanation that starts from Information Theory and the ideas of Claude Shannon. https://www.youtube.com/watch?v=ErfnhcEV1O8&ab_channel=Aur%C3%A9lienG%C3%A9ron.

#### Support

There is an abundance of help, mostly free, reachable online. The Coursera organization supports a forum called Discourse, where questions are answered by students or by a number of experienced volunteers. There is also the unofficial community of Coursera students, always willing to help. Two warnings: first, the students number in the millions (not a typo), hence a call for assistance normally and immediately draws 100-plus answers, not always mutually consistent. Second, Coursera maintains a “no-answer-to-test-questions” honor code—necessary for the integrity of their certification system—which is routinely ignored by the informal forum.

The most popular free collaborative forum is Stack Overflow (https://stackoverflow.com/questions), notable for keeping a hierarchy of participants based on the quality of their contributions. However, they are not exactly welcoming to beginners, who are usually helped—but often with a sneer.

In all, this jumble of for-profit corporations, free tools, motivated individuals and ersatz collaborative groups amount to an ecosystem overwhelming in its richness. Navigate at your own risk and exhilaration.

## Data availability

All original data images used for training and predicting using the example models in this tutorial are deposited in Harvard Dataverse as follows: images for training of Locations and their label masks at Rios et al. (2024b); images for training of Granules and their label masks at Rios et al. (2024c), and all images predicted at Rios and Samsó (2024). A fourth dataset (Rios et al., 2024a) contains all analysis and graphics files (Sigmaplot and Excel) referred to in this tutorial. The annotated code for the four models described is shared in the GitHub repositories (Rios, 2023a, 2023b, 2024a, 2024b) and in Data S1–S3 of Ríos et al. (2024).

## Acknowledgments

Olaf S. Andersen served as editor.

The author is grateful to Montserrat Samsó, who generated the images studied in the example article and carefully edited this manuscript; to Paul Mielke (Deeplearning.AI) for mentoring on all things AI; to Sheila Riazi and her team at the Malignant Hyperthermia Investigation Unit (University Health Network, Toronto) for comments and stimulus; to Eshwar Tammineni (Rush University) for starting the study of glucose metabolism that motivated the AI approaches, to Carolina Figueroa and Carlo Manno (Rush University) for supporting this work in many ways; to Amira Klip (SickKids, Toronto) and Clara Franzini-Armstrong, for long-term collaboration, enlightenment and encouragement, and to Architect Lucas Rios, for inducing my first approach to AI.

This work was funded by grants from National Institute of Arthritis and Musculoskeletal and Skin Diseases (R01 AR 071381) to S. Riazi, E. Rios, and M. Fill (Rush University), National Institute of Arthritis and Musculoskeletal and Skin Diseases (R01 AR 072602) to E. Rios, S.L. Hamilton, S. Jung and F. Horrigan (Baylor College of Medicine), and National Center for Research Resources (S1055024707) to E. Rios and others.

Author contributions: E. Rios: Conceptualization, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Validation, Visualization, Writing—original draft, Writing—review & editing.

## References

## Author notes

Disclosures: The author declares no competing interests exist.