3 Parsimonious models through seed exploration

In the previous chapter, we developed graphical representations for model-agnostic XAI methods, primarily focusing on intrinsically interpretable models like tree-based algorithms. However, the rapid growth of artificial intelligence has firmly established deep neural networks as a dominant tool for complex tasks, from logical reasoning to creative endeavors. This shift presents a fundamental challenge: their decision processes are hidden within intricate combinations of linear and non-linear functions, making traditional XAI methods insufficient and giving rise to new fields such as mechanistic interpretability, aimed at untangling their internal workings.

At their core, neural networks learn by iterating over batches of data, updating a large number of internal parameters via back-propagation and gradient descent (Goodfellow-et-al-2016?). Their capacity is governed by hyperparameters, such as the number of layers and neurons, which determine the model’s complexity. Because the state-of-the-art tasks neural networks excel at often involve massive, complex datasets, the prevailing trend has been to build ever-larger models, creating a culture where complexity is often the default starting point.

This default, however, is problematic. Highly complex models are not only unnecessary but detrimental if a simpler solution exists. They are prone to overfitting, memorizing noise instead of learning true patterns and therefore generalizing poorly to new data. In these scenarios, simpler, parsimonious models are superior: they are more robust, computationally efficient, and, most importantly, transparent. (breiman2001statistical?) provides an early statistical perspective on algorithmic models that prioritize predictive accuracy to the detriment of interpretability.

Therefore, a critical gap exists between the “make it bigger” trend of mainstream deep learning and the foundational principle of model parsimony. This project directly addresses this gap. We investigate how parsimonious neural networks can be discovered for simple datasets. Our focus is on single-layer, fully connected neural networks, using them as a base to build a pedagogical tool that explores the impact of random seed and model size across various data-generating processes.

In the process of discovering simpler models, we uncovered clear, demonstrable insights into how parsimonious architectures can be found. The key takeaway messages from this project are:

  1. For a given neural architecture, a simple, parsimonious model with performance similar to that of a more complex model can exist, but it is not easy to find.
  2. Changing the random seed produces large variation in the accuracy of a simple model fit, but has little effect on the accuracy of a complex model fit.
  3. A complex model is stable but offers little interpretability. Persisting with random starts can yield a parsimonious, interpretable fitted model.

3.1 Background

3.1.1 Training neural networks

Figure 3.1: An example neural network structure

A neural network can be seen as a sequence of linear transformations combined with nonlinear functions called activations. The weights of these linear transformations are trained through a process of iterative adjustments to reduce the loss between the current prediction and the expected output, guided by an optimization algorithm. This minimization is achieved through repeated cycles of forward and backward propagation (Goodfellow-et-al-2016?).

The training begins with a forward pass. Suppose the input observation is \(x \in \mathbb{R}^{1 \times p}\), where \(p\) is the number of input features (in the given example \(p = 2\)). The input is fed into the network, and each neuron calculates its output from the weighted sum of its inputs passed through a non-linear activation function (for example, ReLU). In the given example there are \(4\) neurons, and each neuron has \(p\) weights associated with it; these are multiplied with the input vector of size \(p\) and summed, so that the layer produces a vector of size 4. Additionally, the layer has a bias term, also a vector of size 4, which is added to this output. More compactly, the action of the entire layer can be summarised as \(y = xW^T + b\), where \(W\) is a matrix with 4 rows and 2 columns and \(b\) is a vector of 4 values.
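
As a minimal sketch of this single-layer computation (assuming NumPy, and using the same dimensions as the example above: \(p = 2\) inputs and 4 neurons, with illustrative random values):

```python
import numpy as np

# One observation with p = 2 features (matching the example above)
x = np.array([[0.5, -1.2]])     # shape (1, 2)

# A layer of 4 neurons: each row of W holds one neuron's 2 weights
W = np.random.randn(4, 2)       # shape (4, 2)
b = np.random.randn(4)          # one bias per neuron, shape (4,)

# Linear part of the layer, y = x W^T + b, giving a vector of size 4
y = x @ W.T + b                 # shape (1, 4)

# Non-linear activation (ReLU) applied element-wise
a = np.maximum(y, 0.0)
```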

These outputs then propagate forward through the network until a final prediction is produced. This prediction is compared to the true target value, and the error is calculated through a loss function. Different loss functions are defined depending on the task and the structure of the predictions and ground truths. For example, the binary cross-entropy loss (bishop_pattern_2006?) is used for binary classification tasks, assuming the prediction lies in the range \([0,1]\) and the ground truth is binary.
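
For a single observation with true label \(y \in \{0, 1\}\) and predicted probability \(\hat{y} \in [0, 1]\), this loss takes the standard form

\[ L(y, \hat{y}) = -\left( y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \right) \]

so the loss is small when \(\hat{y}\) is close to \(y\) and grows rapidly as a confident prediction moves away from the true label.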

The heart of the learning process lies in the backward pass, also known as back-propagation. This step uses the chain rule of calculus to calculate the gradient of the loss function with respect to each weight in the network; essentially, it determines how much each weight contributes to the overall error. The gradient indicates the direction and magnitude of the change needed to reduce the loss. To update a weight matrix \(W\), we need the gradient of the loss with respect to \(W\) and a learning rate \(\alpha\), which controls how fast and how stably the weights are updated. Together these elements form the basic gradient descent update, in which the weights “descend” the loss surface towards a lower loss:

\[ W_{new} = W - \alpha \frac{\partial L}{\partial W} \]
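
As an illustration of the backward pass and this update rule, the following minimal NumPy sketch (a single sigmoid output neuron with illustrative values, not the project’s code) applies the chain rule to obtain \(\partial L/\partial W\) under the binary cross-entropy loss and performs one gradient descent step.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2])    # one observation with p = 2 features
y = 1.0                      # binary ground truth

W = np.array([0.1, -0.3])    # weights of a single output neuron
b = 0.0                      # bias
alpha = 0.1                  # learning rate

# Forward pass and binary cross-entropy loss
z = x @ W + b
y_hat = sigmoid(z)
loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Backward pass (chain rule): for sigmoid + cross-entropy, dL/dz = y_hat - y
dL_dz = y_hat - y
dL_dW = dL_dz * x            # dL/dW = dL/dz * dz/dW
dL_db = dL_dz

# Gradient descent update: step against the gradient to reduce the loss
W = W - alpha * dL_dW
b = b - alpha * dL_db
```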

An optimization algorithm, such as Stochastic Gradient Descent or Adam (kingma_adam:_2017?), is then employed to update the weights based on these calculated gradients. The optimizer iteratively adjusts the weights in the direction that minimizes the loss. The learning rate controls the size of these adjustments—a smaller learning rate leads to slower but more stable learning, while a larger learning rate can accelerate learning but may also cause instability.

The entire forward and backward process is repeated for numerous iterations, which are grouped into epochs. During each epoch the network sees the entire training dataset once; over many epochs it progressively refines its weights and improves its accuracy.
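
As a sketch of how these pieces fit together in practice (assuming PyTorch; the data, layer sizes, learning rate, number of epochs and seed are illustrative rather than the project’s settings), a minimal training loop for a single-hidden-layer network might look as follows.

```python
import torch
from torch import nn

torch.manual_seed(0)  # the random seed that determines the initial weights

# Illustrative data: 100 observations with p = 2 features and binary labels
X = torch.randn(100, 2)
y = (X[:, 0] + X[:, 1] > 0).float().unsqueeze(1)

# A single hidden layer with 4 neurons, as in the example structure above
model = nn.Sequential(nn.Linear(2, 4), nn.ReLU(), nn.Linear(4, 1))

loss_fn = nn.BCEWithLogitsLoss()                          # binary cross-entropy on logits
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(200):       # one epoch = one full pass over the training data
    optimizer.zero_grad()      # clear gradients from the previous step
    logits = model(X)          # forward pass
    loss = loss_fn(logits, y)  # compare prediction with the ground truth
    loss.backward()            # backward pass (back-propagation)
    optimizer.step()           # gradient descent update of the weights
```

Re-running the same loop with a different value in torch.manual_seed changes only the random initialisation of the weights, which is the kind of seed-to-seed variation this project sets out to explore.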

3.2 Motivation

The educational implications of seeing ever more complex models in the wild can be concerning. Students learn to equate model performance with complex neural architectures, thereby developing an intuitive tendency toward complexity. This bias persists into professional practice, where engineers default to complex solutions without considering simpler alternatives. The result is a field that consistently over-engineers solutions to problems that may have elegant, simple answers. It is therefore necessary that educators are made aware that parsimonious models can have differing fits depending on the randomness in the model. In addition, using complex models where a smaller parsimonious model is sufficient introduces a requirement for additional computational resources and can be prone to overfitting on smaller datasets. The motivation for this work is to challenge the assumption that increasingly complex models are always necessary for strong performance. In practice, not all tasks require deep and intricate architectures, and in some cases a well-initialised smaller model may perform just as well as a larger one.