Step-by-step modeling of a neural network for classification

I wanted to focus here on the modeling procedure of a neural network for classification in JavaScript. How do you choose the number of intermediate layers and their associated numbers of neurons? And above all: how do you interpret the results? These are the questions I asked myself and to which I would like to apply a methodology.

I will use my own neural network implementation in JavaScript (GitHub) as well as Python scripts to quickly generate and visualize data. After the explanation of the example below, we will go through the following points:

  1. Data generation
  2. Training
    1. Step 1: get the lowest error
    2. Step 2: generalize

Warning: as I keep learning everything as I go along, some remarks may eventually turn out to be wrong. The article is open to beginners like me who have trouble interpreting the impact of the topology of a neural network on its learning.

Go directly to the JavaScript demo on JSFiddle, also embedded further down the page.

The classification example

We want the neural network to learn to compare two numbers (a, b) and return 1 if the first is strictly greater than the second:

     $$f(a, b) = \begin{cases} 1 & \text{if } a > b \\ 0 & \text{otherwise} \end{cases}$$
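As a quick sanity check, here is what this target function looks like in Python; it will also be used to label the generated couples (the function name target is only my choice for this sketch):

    def target(a: float, b: float) -> int:
        """Return 1 if a is strictly greater than b, 0 otherwise."""
        return 1 if a > b else 0

    # A few quick checks
    assert target(0.3, -0.1) == 1
    assert target(-0.2, 0.4) == 0
    assert target(0.25, 0.25) == 0  # the ambiguous case a = b maps to 0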

I have chosen this simple problem so that the optimization procedure stays simple as well, keeping in mind that the conclusions will be applicable to more complicated classification problems.

Our artificial neural network will take two inputs and produce one output. We still have to find which topology and which hyperparameters to choose:

  • The learning rate
  • The number of intermediate layers — or hidden layers
  • The number of neurons per layer
  • The activation function of the neurons of the hidden layers

I. Data generation

The decisive element is not yet part of the list above. The training dataset that will be used to train the neural network must be chosen carefully if you want the learning to be successful. Likewise, a validation dataset will be used to make sure the neural network hasn't simply adapted itself to the values of the training set, in other words to avoid overfitting. The neural network does not learn from this dataset: we do not apply backpropagation on the validation set.

I chose to randomly generate the (a, b) couples between -0.5 and 0.5 rather than generating all the possible pairs (with 0.01 precision, for example), because in practice it is very rare to be able to foresee all possible combinations of inputs and to have an exhaustive dataset. Using a Python script, I then generate my (a, b) couples according to the following criteria:

  • Normalized between -0.5 and 0.5,
  • Uniform random draw centered in 0.

With Python and the NumPy library, we can use a, b = np.random.uniform(-0.5, 0.5, 2). Here are the results for different dataset sizes, with f(a, b) = 1 in red and f(a, b) = 0 in black:

Dataset representations of different sizes n (generated with Python).

The training set must be large enough for the neural network to have enough examples to understand what it is asked to approximate. The higher the number of neurons in the network, the more important it is to feed it with a larger training set, in order to prevent each weight from simply adjusting itself to a particular value of the dataset.

In our example, we see that for n=200 the demarcation is clearly not as explicit as for n=8000; the neural network will therefore have more difficulty classifying with this training set. It is recommended to distribute the sizes such that the training set represents 80% and the validation set 20% of the overall sample. In general, the values must be equally distributed over the two datasets (equal variance), but the validation set does not need to be as dense as the training set.

We will take n=8000 for the training and n=2000 for the validation. The code written for the generation and visualization is available on GitHub; a minimal sketch of the idea follows below.
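For reference, here is a minimal NumPy sketch of how such datasets can be generated and labeled; the helper name generate_dataset is mine and does not come from the repository:

    import numpy as np

    def generate_dataset(n, low=-0.5, high=0.5, seed=None):
        """Draw n couples (a, b) uniformly in [low, high] and label them with f(a, b)."""
        rng = np.random.default_rng(seed)
        ab = rng.uniform(low, high, size=(n, 2))      # uniform draw centered on 0
        labels = (ab[:, 0] > ab[:, 1]).astype(int)    # 1 if a > b, 0 otherwise
        return ab, labels

    # 80% / 20% split: 8000 couples for training, 2000 for validation
    train_x, train_y = generate_dataset(8000, seed=0)
    val_x, val_y = generate_dataset(2000, seed=1)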

II. Training

The JavaScript library will allow us to train the neural network and visualize the result directly in the browser. We choose the following hyper-parameters:

  • A low learning rate of 0.05. A value that comes up often is 0.1, but I prefer to have a little more precision here;
  • We can optimize the gradient descent with the Nesterov Accelerated Gradient algorithm, which allows a faster convergence towards a low error rate;
  • The activation function will be PReLU with α = 0.1, because ReLU currently seems to bring the best non-linearity but suffers from "dead neurons" that get stuck at zero (see the "dying ReLU problem" on datascience.stackexchange.com);
  • A window of 20 epochs is enough for this simple problem, given that our dataset is quite substantial for it.

Step 1: Converge to a low error rate

We first want to start by underestimating our problem, with a network that does not learn because its model is too simple (underfitting), and then add neurons and/or hidden layers to decrease the error rate. We can start with a single neuron on a single intermediate layer (2x1x1).
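To make the experiments easier to follow, here is a rough Keras stand-in for this setup, just to make the choices concrete (the demo itself uses my JavaScript library, whose API is different); the momentum value and the loss function are not fixed in this article, so the ones below are only placeholders:

    import tensorflow as tf
    from tensorflow.keras import layers, initializers, optimizers

    def build_model(hidden_sizes):
        """Build a 2 -> hidden layers -> 1 network with PReLU(alpha=0.1) activations."""
        inputs = tf.keras.Input(shape=(2,))
        x = inputs
        for size in hidden_sizes:
            x = layers.Dense(size)(x)
            x = layers.PReLU(alpha_initializer=initializers.Constant(0.1))(x)
        outputs = layers.Dense(1, activation="sigmoid")(x)   # output between 0 and 1
        model = tf.keras.Model(inputs, outputs)
        # Learning rate 0.05 with Nesterov accelerated gradient (momentum value assumed)
        sgd = optimizers.SGD(learning_rate=0.05, momentum=0.9, nesterov=True)
        model.compile(optimizer=sgd, loss="mse")              # loss assumed for this sketch
        return model

    # A single neuron on a single hidden layer (2x1x1);
    # train_x, train_y, val_x, val_y come from the generation sketch above
    model = build_model([1])
    model.fit(train_x, train_y, validation_data=(val_x, val_y), epochs=20, verbose=0)

Adding a neuron or a hidden layer then only means changing the hidden_sizes list, for example build_model([2, 2]) for the 2x2x2x1 topology discussed below.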

With an error around 0.05 and 0.04, the model already has an accuracy over 95%. With one more layer, the error falls to 0.03. With a second, it does not change much; but with two neurons on both layers, this time the error goes below 0.001!

The gray curve indicates the error on the training set, while the purple curve indicates the error on the validation set. The lower the error, the more precise the classification will be.

By interacting with the JavaScript demo, we realize that the classification is very reliable, except for the special case where a = b. Adding some values where a = b to our dataset should solve the problem. Note: if the validation curve did not follow the training curve, we would potentially have had overfitting.

You may ask yourself: what would have changed if we had increased the number of neurons in each layer instead of increasing the number of layers? According to my tests, to achieve the same performance as 2x2x2x1 with a single hidden layer, we need 6 neurons on it, with less certainty that the network converges to the right solution as quickly. With 10, this uncertainty disappears, but the computation takes more time and so we lose learning efficiency.

In conclusion, it appears that thanks to the non-linearity added by the intermediate layers, the neural network can learn faster and more easily than with a large number of neurons per layer. On the other hand, too much depth will have the opposite effect on such a simple problem.

Step 2: generalize

Now that we have a robust model, we would like to know whether our neural network has really learned to compare two numbers, and not only numbers between -0.5 and 0.5.

In fact, it would be sufficient to normalize each input a and b between -0.5 and 0.5 for the comparison to work with any number. But it is the idea of generalization that interests us, because the very purpose of artificial intelligence is ultimately to be able to abstract away from the training data in order to correctly evaluate data it has never seen.
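As an aside, one possible way to do such a normalization (my own sketch, not something used in the demo) is to rescale both numbers by a common positive factor, which preserves their order:

    def normalize_pair(a, b, half_range=0.5):
        """Rescale (a, b) into [-half_range, half_range] by a common positive factor,
        which preserves the ordering of a and b."""
        scale = max(abs(a), abs(b))
        if scale == 0:                  # a = b = 0: nothing to rescale
            return 0.0, 0.0
        factor = half_range / scale
        return a * factor, b * factor

    print(normalize_pair(1234.0, 42.0))   # (0.5, 0.017...): 1234 > 42 is preserved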

Test set visualization (n=600), which does not have any values between -0.5 and 0.5

In the experiment, you can change the interval up to [-5; 5] to find that the network cannot manage these values.

We will therefore generate a third dataset: the test dataset, which draws (a, b) couples in [-5; 5] while excluding any value in the open interval ]-0.5; 0.5[. Again, we will not apply backpropagation on this dataset.

Its size does not matter much, and since we do not want to slow down learning, 600 (a, b) couples are enough. The JSFiddle embedded below will let you visualize how the neural network reacts to the test set (red curve).
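Here is a NumPy sketch of how this test set can be drawn, by rejecting any component that falls inside the training interval (again, the helper name is mine):

    import numpy as np

    def generate_test_set(n, low=-5.0, high=5.0, excluded=0.5, seed=None):
        """Draw n couples (a, b) in [low, high], rejecting any component in ]-excluded; excluded[."""
        rng = np.random.default_rng(seed)
        values = []
        while len(values) < 2 * n:
            x = rng.uniform(low, high)
            if abs(x) >= excluded:      # keep only values outside the training range
                values.append(x)
        ab = np.array(values).reshape(n, 2)
        labels = (ab[:, 0] > ab[:, 1]).astype(int)
        return ab, labels

    test_x, test_y = generate_test_set(600, seed=2)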

With a 2x2x2x1 topology, the curve is simply not shown because the error is too large (greater than 1). By multiplying the number of neurons per hidden layer by 2, the curve appears but wobbles too much: we still cannot say that our network has generalized. Even with 6 or 12 neurons per layer, the result remains essentially the same, so increasing the number of neurons does not seem to be the solution.

The green curve represents the error on the test set. The lower the error, the better the learning generalizes.

However, with an extra layer we can once again combine the abstraction provided by a hidden layer with the possibilities offered by more neurons.

The topology 2x4x4x4x1 seems to be the one that allows the most generalization.
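With the build_model helper sketched earlier, trying this topology only means changing the list of hidden layer sizes (still a Keras stand-in, not the JavaScript demo itself):

    # 2x4x4x4x1: three hidden layers of 4 neurons each,
    # reusing build_model and the datasets from the sketches above
    model = build_model([4, 4, 4])
    model.fit(train_x, train_y, validation_data=(val_x, val_y), epochs=20, verbose=0)
    print("test error:", model.evaluate(test_x, test_y, verbose=0))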

Here is the result, which you can change yourself with JSFiddle:

You will notice that convergence to a minimal error is not automatic. The training time of 20 epochs is relatively short, and we could leave more time to the optimizer (Nesterov Accelerated Gradient) with 100 or 200 epochs, which I invite you to test.

As a conclusion, we see in this example that adding more neurons allows more combinations and so more possibilities, while adding intermediate layers helps to make sense of these possibilities.
