I have been pondering all the things we have to manage to make models learn well - not only architecture, but also what activation functions we should use and what we do about so called regularization.

So, I thought I would (briefly) share a few things I’ve discovered, a few personal observations, and a few links to follow for anyone who wants to know more.

Since I am definitely not an expert in this (!) I will be more than happy to be corrected if I’ve made a mistake - and I hope others can add to this.

Here’s a cartoon set of dense layers showing the content of one typical neuron

And an illustration of what the three parts are doing - so I can discuss them.

The story is: inputs from the layer below are acted on by the **Weights and Bias**; the **Norm** element then adjusts the mean and standard deviation (statistical spread) of the resulting values; and finally the result of the normalisation is put through whichever **Activation function** is chosen (there are many; I just picked two typical examples of key kinds) before being passed to the next layer.
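To make the three steps concrete, here is a minimal NumPy sketch of one dense layer doing exactly that pipeline (the names `W`, `b`, `z` etc. are mine, not any framework's API, and the Norm step here is the simplest possible per-feature version):

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=(32, 8))        # batch of 32 inputs, 8 features each
W = rng.normal(size=(8, 4)) * 0.1   # weights
b = np.zeros(4)                     # bias

# 1. Weights and Bias
z = x @ W + b

# 2. Norm: shift to zero mean, scale to unit standard deviation (per feature)
z_norm = (z - z.mean(axis=0)) / (z.std(axis=0) + 1e-5)

# 3. Activation function (tanh, one of the "typical examples" in the picture)
out = np.tanh(z_norm)

print(out.shape)  # (32, 4): same batch, 4 outputs per example
```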

First off: why do we have activation functions? The basic answer is to introduce non-linearity: without them, a stack of dense layers collapses into a single linear transformation. But *which* one you pick is bound up with the vanishing/exploding gradient problem. The former effectively prevents learning at the bottom of deep models; the latter destabilises training entirely once triggered. Actually, it seems to *me* that functions like *tanh* and *sigmoid*, which make all really large and really small values look very similar, might have addressed the *exploding gradient*/stability problem, but actually created the vanishing gradient problem that some other technique then needs to address.
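You can see the vanishing-gradient effect numerically: tanh's derivative is 1 - tanh(x)^2, which is 1 at zero but collapses towards zero for large inputs, so there is almost nothing for backpropagation to work with there. A quick check:

```python
import numpy as np

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - np.tanh(x) ** 2

print(tanh_grad(0.0))   # 1.0: full gradient near zero, the "sweet spot"
print(tanh_grad(5.0))   # ~0.0002: the flat region, gradient has "vanished"
```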

Which activation function you use will depend on the kind of ML you are doing: **if you work with e.g. images**, where your input data has a fixed range (e.g. 0-255 RGB values, or 0-1 "normalised"), then **tanh or sigmoid are reasonable** choices. But if you are dealing with, say, regression, where there is no obvious scale up front, the tanh function might not work: large values in your regression will land in the flatter regions of the curve, and the gradients (hence the ability to learn from them) will be small. Similarly, if you have negative input values, you probably don't want the sigmoid function either

because it eliminates the negative sign! (NB you can see more clearly here the three different lines: the solid line is the activation function itself; the dotted line is the 1st derivative, i.e. how sensitive the activation is to changes in the input value; the dashed line is the 2nd derivative, for info only.)
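The sign-elimination point is easy to demonstrate: sigmoid squashes everything into (0, 1), so negative inputs come out positive, whereas tanh maps into (-1, 1) and keeps the sign:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-3.0, -1.0, 1.0, 3.0])
print(sigmoid(x))   # all outputs positive, even for the negative inputs
print(np.tanh(x))   # negative inputs stay negative
```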

FYI Jason Brownlee has a more comprehensive post on activation functions here. His activation recommendations (not just what is plausible/not unreasonable) for various model types are:

The Weights and Bias and the Norm need to be discussed together because, as you can see, Norm removes the bias from the step before: subtracting the batch mean cancels any constant shift. I have no idea whether any of the ML frameworks are optimised for efficiency so that they skip the now-redundant bias calculation when Norm is on. NB here I am talking about *Batch Normalisation* (which also acts as a regularisation method); there is also Layer Normalisation, and you can read about the differences here

This normalisation works to keep everything in the "sweet spot" of the activation function around x = 0, where the response of the activation is (ignoring sign for a moment) most nearly linear with respect to its input (which comes from Norm, if on, or directly from W&B).
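Here is a minimal sketch of the batch-normalisation step itself, covering only the training-time computation (the running statistics used at inference, described in the paper linked in the PS, are omitted for brevity). Note the learnable shift `beta`: Norm supplies its own bias after centring, which is exactly why the bias in the preceding W&B step becomes redundant.

```python
import numpy as np

def batch_norm(z, gamma, beta, eps=1e-5):
    mean = z.mean(axis=0)               # per-feature mean over the batch
    var = z.var(axis=0)                 # per-feature variance over the batch
    z_hat = (z - mean) / np.sqrt(var + eps)
    return gamma * z_hat + beta         # learnable rescale and re-shift

rng = np.random.default_rng(1)
z = rng.normal(loc=10.0, scale=3.0, size=(64, 4))  # well outside the sweet spot
out = batch_norm(z, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0))  # ~0: pulled back to the activation's sweet spot
```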

So that I don’t go on too long (there is a lot more that could be said about regularisation in general), the only other thing to mention briefly here is the use of Dropout. In PerceptiLabs you can specify the “keep probability” for each neuron, i.e. the probability that the neuron will actually be used (per batch? I wasn’t sure at first. Update: see below; @robertl confirmed it is per batch). The idea is that by literally ignoring or “dropping” some neurons from the calculations you prevent your model from overfitting, i.e. ~memorising the input.

What it does is force the model to “average” what it learns over multiple neurons whose weights are also shared with other things to be learned, i.e. it creates cross-coupling between weights, making it that much harder for the model to learn lots of independent pieces, so it has to ~generalise (just tagging @robertl here because I have mentioned something related to him before).
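For the curious, here is a sketch of the common “inverted dropout” formulation using a keep probability: a fresh random mask per batch, with the surviving activations scaled up by 1/keep_prob so the expected value is unchanged between training and inference. (This is the standard textbook version; I can’t vouch for PerceptiLabs implementing it exactly this way.)

```python
import numpy as np

def dropout(a, keep_prob, rng):
    mask = rng.random(a.shape) < keep_prob   # True = neuron kept this batch
    return a * mask / keep_prob              # rescale so the mean is preserved

rng = np.random.default_rng(2)
a = np.ones((1000, 10))
out = dropout(a, keep_prob=0.8, rng=rng)
print((out == 0).mean())  # roughly 0.2 of the activations were dropped
```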

Was this helpful? Let me know in the comments - I can do more (or less!) if you want.

**PS** For the technically/mathematically inclined/curious, the original paper on batch normalization is here in the arXiv

**PPS** I’m also very keen on Weight Decay (not yet available in PL) - again see another Jason Brownlee Machine Learning Mastery post. My question for anyone who knows: should we use Weight Decay *and* Dropout? Is there synergy (i.e. are both together better than either alone)?
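In case Weight Decay is unfamiliar: the idea is that each optimiser step shrinks the weights slightly towards zero, in addition to following the loss gradient. A minimal sketch in plain SGD terms (the decay value here is deliberately large to make the effect visible, not a recommended setting):

```python
import numpy as np

def sgd_step(w, grad, lr=0.1, weight_decay=0.01):
    # L2 weight decay: add weight_decay * w to the gradient before the update
    return w - lr * (grad + weight_decay * w)

w = np.array([2.0, -3.0])
# With a zero loss gradient, the decay term alone pulls the weights
# towards zero (by a factor of (1 - lr * weight_decay) per step).
for _ in range(1000):
    w = sgd_step(w, grad=np.zeros_like(w))
print(w)  # much smaller magnitudes than the starting values, signs preserved
```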