- Data generation
- Step 1: Converge to a low error rate
- Step 2: generalize
Warning: since I am learning everything as I go along, some remarks may eventually turn out to be wrong. This article is aimed at beginners like me who have trouble interpreting the impact of the topology of a neural network on its learning.
The classification example
We want the neural network to learn to compare two numbers, returning 1 if the first is strictly greater than the second:
I have chosen this simple problem so that the optimization procedure stays simple as well, keeping in mind that the conclusions will be applicable to more complicated classification problems.
Our artificial neural network will take two inputs and produce one output. We still have to find which topology and which hyperparameters to choose:
- The learning rate
- The number of intermediate layers — or hidden layers
- The number of neurons per layer
- The activation function of the neurons of the hidden layers
I. Data generation
The decisive element is not yet part of the list above. The training dataset used to train the neural network must be chosen correctly if you want the learning to be successful. Likewise, a validation dataset will be used to make sure the neural network has not simply adapted itself to the values of the training set, in other words to avoid overfitting. The neural network will not learn from this dataset: we do not apply backpropagation on the validation set.
I chose to randomly generate the couples between -0.5 and 0.5 rather than generating all the possible pairs (with 0.01 precision, for example), because in practice it is very rare to be able to foresee all possible combinations of inputs and to have an exhaustive dataset. Using a Python script, I then generate my couples according to the following criteria:
- Normalized between -0.5 and 0.5
- Uniform random draw centered in 0
With Python and the NumPy library, we can use a, b = np.random.uniform(-0.5, 0.5, 2). Here are the results for different dataset sizes, with one class in red and the other in black:
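The generation step can be sketched in a few lines; the function name and the seed below are mine, not the article's:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_couples(n):
    """Draw n couples (a, b) uniformly in [-0.5, 0.5],
    labelled 1 when a > b and 0 otherwise."""
    couples = rng.uniform(-0.5, 0.5, (n, 2))
    labels = (couples[:, 0] > couples[:, 1]).astype(int)
    return couples, labels

couples, labels = generate_couples(600)
```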
The size of the training set must be large enough for the neural network to have enough examples to understand what it is asked to approximate. The higher the number of neurons in our network, the more important it is to feed it with a larger training set, in order to prevent each weight from adjusting itself to a single value of the dataset.
In our example, we see that for the smaller set the demarcation is clearly not as explicit as for the larger one; the neural network will therefore have more difficulty classifying with this training. It is recommended to distribute the sizes such that the training set represents 80% and the validation set 20% of the overall sample. In general, the values must be equally distributed over the two datasets (equal variance), but the validation set does not need to be as dense as the training set.
We will split the couples accordingly between training and validation. The code written for the generation and visualization is available on GitHub.
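The 80/20 split can be sketched as follows; the overall sample size of 1,000 is an assumption, since the article's exact counts are not shown:

```python
import numpy as np

rng = np.random.default_rng(0)

# 1,000 couples as an assumed overall sample size
data = rng.uniform(-0.5, 0.5, (1000, 2))

# 80% for training, 20% for validation, as recommended above
split = int(0.8 * len(data))
train, validation = data[:split], data[split:]
```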
For the rest of the experiments, I fix the following choices:
- A low learning rate of 0.05; a value that comes up often is 0.1, but I prefer to have a little more precision here;
- We can optimize the gradient descent with the Nesterov Accelerated Gradient algorithm, which allows faster convergence towards a low error rate;
- The activation function will be PReLU, because ReLU currently seems to bring the best non-linearity but suffers from “dead neurons” when activations approach zero (see “dying ReLU problem” on datascience.stackexchange.com);
- A window of 20 epochs is enough for this simple problem and, given our dataset, quite sufficient.
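PReLU itself is simple to write down; a minimal NumPy sketch, with an arbitrary slope parameter since the article's exact value is not shown:

```python
import numpy as np

def prelu(x, alpha=0.05):
    """PReLU: identity for positive inputs, a small slope alpha for
    negative ones, so the gradient never vanishes entirely (unlike ReLU)."""
    return np.where(x > 0, x, alpha * x)
```

In a real network, alpha is a learned parameter rather than a fixed constant, which is what distinguishes PReLU from Leaky ReLU.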
Step 1: Converge to a low error rate
We first want to underestimate our problem and get a network that fails to learn because the model is too simple (underfitting), then add neurons and/or hidden layers to decrease the error rate. We can start with a single neuron on a single hidden layer:
With an error around 0.04, the model already has an accuracy over 95%. With one more layer, the error falls to 0.03; with a second, it does not change much. But with two neurons on both hidden layers, the error this time goes even lower.
You may ask yourself: what would have changed if we had increased the number of neurons in each layer instead of increasing the number of layers? According to my tests, to achieve the same performance as 2x2x2x1 with a single hidden layer, we need 6 neurons on it, with less certainty that the network converges to the right solution as quickly. With 10, this uncertainty disappears, but the computation takes more time and so we lose learning efficiency.
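One way to see why depth is cheaper than width here is to count parameters; a small helper of my own (the article does not include it):

```python
def parameter_count(layers):
    """Weights plus biases of a fully connected network whose layer
    sizes are listed from input to output."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layers, layers[1:]))

deep = parameter_count((2, 2, 2, 1))  # the 2x2x2x1 topology: 15 parameters
wide = parameter_count((2, 6, 1))     # one hidden layer of 6 neurons: 25 parameters
```

The deeper topology reaches comparable performance with noticeably fewer parameters to adjust.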
In conclusion, it appears that thanks to the non-linearity added by the intermediate layers, the neural network can learn faster and more easily than with a large number of neurons per layer. On the other hand, too much depth has the opposite effect on such a simple problem.
Step 2: generalize
Now that we have a robust model, we would like to know if our neural network has really learned to compare two numbers, and not only numbers between -0.5 and 0.5.
In fact, it is sufficient for us to normalize each input a and b between -0.5 and 0.5 for the comparison to work with any number. But it is the idea of generalization that interests us, because the very purpose of artificial intelligence is ultimately to abstract away from the training data in order to correctly evaluate data never seen before.
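The normalization trick mentioned above can be made concrete; this scheme is my own sketch, not the article's code:

```python
def normalize(a, b):
    """Map any pair of numbers into [-0.5, 0.5] while preserving
    their order, so the trained network can compare them."""
    m = max(abs(a), abs(b))
    if m == 0:
        return 0.0, 0.0
    return a / (2 * m), b / (2 * m)
```

Dividing both numbers by the same positive factor never changes which one is larger, which is why the comparison survives the rescaling.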
On the experiment, you can widen the interval to find that the network cannot manage these out-of-range values.
We will therefore generate a third dataset, the test dataset, which will draw couples from a wider interval, excluding any value of the training interval. Again, we will not apply backpropagation on this dataset.
Its size does not matter much and we do not want to slow down learning: 600 couples are enough. The JSFiddle below will allow you to visualize how the neural network reacts to the test set (red curve).
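Generating such a test set might look like this; the outer bound of 1.0 is an assumption, since the article elides the exact interval:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_outside(n, inner=0.5, outer=1.0):
    """Draw n values uniformly in [-outer, outer] while excluding
    the training interval [-inner, inner]."""
    magnitudes = rng.uniform(inner, outer, n)
    signs = rng.choice([-1.0, 1.0], n)
    return magnitudes * signs

# 600 couples, none of whose coordinates were seen during training
test_couples = np.stack([sample_outside(600), sample_outside(600)], axis=1)
```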
With the 2x2x2x1 topology, the curve is simply not shown because the error is too large (greater than 1). By doubling the number of neurons per hidden layer, the curve appears but wobbles too much: we still cannot say that our network has generalized. Even with 6 or 12 neurons per layer, the result remains essentially the same, so increasing the number of neurons does not seem to be the solution.
However, with an extra layer we can combine a second time the abstraction provided by a hidden layer with the possibilities offered by more neurons.
Here is the result, which you can tweak yourself on JSFiddle:
You will notice that the convergence to a minimal error is not automatic. The training window of 20 epochs is relatively short, and we could give the optimizer (Nesterov Accelerated Gradient) more time with 100 or 200 epochs, which I invite you to try.
In conclusion, we see in this example that adding more neurons allows more combinations and thus more possibilities, while adding intermediate layers helps to make sense of these possibilities.