Info: Activation Functions & Regularisation - what and why

I have been pondering all the things we have to manage to make models learn well - not only the architecture, but also which activation functions we should use and what we do about so-called regularisation.

So, I thought I would (briefly) share a few things I’ve discovered, a few personal observations, and a few links to follow for anyone who wants to know more.

Since I am definitely not an expert in this (!) I will be more than happy to be corrected if I’ve made a mistake - and I hope others can add to this.

Here’s a cartoon set of dense layers showing the contents of one typical neuron:

[image: dense layers, with one neuron expanded to show its parts]

And an illustration of what the three parts are doing - so I can discuss them.

[image: the three parts of a neuron - Weights & Bias, Norm, Activation]

The story is: inputs from the layer below are acted on by the Weights and Bias; the Norm element then adjusts the mean and standard deviation (statistical spread) of the resulting values; finally, the normalised result is put through whichever Activation function is chosen (there are many - I just picked two typical examples of the key kinds) before being passed to the next layer.
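
For the code-minded, here is what that pipeline looks like as a bare Keras sketch (the layer sizes and the choice of tanh are just illustrative, not anything PL-specific):

```python
import tensorflow as tf

# Weights & Bias -> Norm -> Activation, spelled out as separate Keras layers
inputs = tf.keras.Input(shape=(16,))            # values arriving from the layer below
x = tf.keras.layers.Dense(32)(inputs)           # Weights and Bias: W.x + b, no activation yet
x = tf.keras.layers.BatchNormalization()(x)     # Norm: re-centres and re-scales over the batch
x = tf.keras.layers.Activation("tanh")(x)       # the chosen activation, applied last
model = tf.keras.Model(inputs, x)
model.summary()                                 # lists the Dense, BatchNormalization and Activation stages
```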

First off: why do we have activation functions at all? Fundamentally, to introduce non-linearity - without them a stack of dense layers collapses into a single linear transformation - but which activation you choose matters a great deal for the vanishing/exploding gradient problem. The former effectively prevents learning in the lower layers of deep models; the latter destabilises training entirely once triggered. Actually, it seems to me that squashing functions like tanh and sigmoid, which map all very large and very small values to nearly the same output, help with the exploding gradient/stability problem - but in doing so they created the vanishing gradient problem that some other technique then needs to address.

Which activation function you use will depend on the kind of ML you are doing. If you work with, e.g., images, your input data has a fixed range (e.g. 0-255 RGB values, or 0-1 “normalised”), so tanh or sigmoid are reasonable choices. But if you are dealing with, say, regression, where there is no obvious scale up front, tanh might not work well: large values in your regression will sit in the flatter regions of the curve, where the gradients (and hence the ability to learn from them) are small. Similarly, if you have negative input values, you probably don’t want the sigmoid function either

[image: sigmoid activation with its first and second derivatives]

because it eliminates the negative! (NB you can see the three different lines more clearly here: the solid line is the activation function itself, the dotted line is the 1st derivative - how sensitive the activation is to changes in the input value - and the dashed line is the 2nd derivative, for info only.)
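
If you prefer to see the “flat region” problem numerically rather than graphically, here is a small sketch (input values picked arbitrarily): the gradient of tanh and sigmoid is essentially zero once the input is large, whereas ReLU keeps a gradient of 1.

```python
import tensorflow as tf

x = tf.constant([0.1, 2.0, 10.0])
for name, fn in [("tanh", tf.math.tanh), ("sigmoid", tf.math.sigmoid), ("relu", tf.nn.relu)]:
    with tf.GradientTape() as tape:
        tape.watch(x)
        y = fn(x)
    # elementwise derivative of the activation at each input value
    print(name, tape.gradient(y, x).numpy())
# tanh and sigmoid gradients at x = 10.0 come out ~0 (the "vanishing" region); relu stays at 1.
```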

FYI Jason Brownlee has a more comprehensive post on activation functions here, with concrete recommendations (not just what is plausible/not unreasonable) for the various model types.

The Weights and Bias and the Norm need to be discussed together because, as you can see, Norm removes the effect of the bias from the step before. I have no idea whether any of the ML frameworks are optimised for efficiency so that they don’t do unnecessary bias calculations when Norm is on. NB here I am talking about Batch Normalisation (which also acts as a regularisation method) - there is also Layer Normalisation, and you can read about the differences here.
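
For anyone who likes to see it written out, here is why the bias becomes redundant, in the notation of the batch norm paper (μ_B and σ_B² are the batch mean and variance of the pre-normalisation value z; γ and β are the trainable scale and offset):

```latex
z = Wx + b, \qquad
\mathrm{BN}(z) = \gamma \, \frac{z - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta
```

Adding a constant b shifts μ_B by exactly b, so it cancels in the numerator z − μ_B; any constant offset the layer genuinely needs ends up being learned by β instead.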

This normalisation works to keep everything in the “sweet spot” of the activation function around x = 0, where the response of the activation is (ignoring sign for a moment) most nearly linear with respect to its input (which comes from Norm, if it is on, or directly from W&B).

So that I don’t go on too long (there is a lot more that could be said about regularisation in general), the only other thing to mention briefly here is the use of Dropout. In PerceptiLabs you can specify the “keep probability” for each neuron, i.e. the probability that the neuron will actually be used (per batch? I don’t know. Update: see below - @robertl confirmed it is per batch). The idea is that by literally ignoring or “dropping” some neurons from the calculations you prevent your model from overfitting, i.e. ~memorising the input.
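
As a concrete (hedged) illustration of the convention: in Keras the Dropout layer takes the drop rate rather than the keep probability, so a PL-style “keep probability” of 0.8 would correspond to rate = 0.2.

```python
import tensorflow as tf

keep_prob = 0.8                                      # PL-style "keep probability"
drop = tf.keras.layers.Dropout(rate=1.0 - keep_prob) # Keras wants the *drop* rate

x = tf.ones((4, 5))
print(drop(x, training=True))    # roughly 20% of values zeroed, survivors scaled up by 1/0.8
print(drop(x, training=False))   # at inference dropout does nothing
```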

What it does is force the model to “average” what it learns over multiple neurons whose weights are also shared with other things to be learned, i.e. it creates cross-coupling with other weights, making it that much harder for the model to learn lots of independent pieces, so it has to ~generalise (just tagging @robertl here because I have mentioned something related to this to him before).

Was this helpful? Let me know in the comments - I can do more (or less!) if you want.

PS For the technically/mathematically inclined/curious, the original paper on batch normalization is here on arXiv.

PPS I’m also very keen on Weight Decay (not yet available in PL) - again, see another Jason Brownlee Machine Learning Mastery post. My question for anyone who knows: should we use both Weight Decay and Dropout? Is there synergy (i.e. are both together better than either alone)?
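
Since it isn’t in PL yet, here is only a hedged Keras sketch of what it can look like: an L2 penalty on the kernel, which for plain SGD behaves like classic weight decay (for adaptive optimisers like Adam the decoupled “AdamW” variant is the closer match). The penalty strength is just a placeholder - and yes, you can combine this with Dropout in the same model; whether they are synergistic is exactly the open question.

```python
import tensorflow as tf

layer = tf.keras.layers.Dense(
    32,
    activation="relu",
    kernel_regularizer=tf.keras.regularizers.l2(1e-4),  # illustrative strength only
)
```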


Great post! :smiley:

As for the ML frameworks doing unnecessary bias calculations: according to this pytorch guide, to gain some speed one can safely disable the bias for a convolutional layer as long as it’s directly followed by a batchnorm layer. Does this hold true for Tensorflow, too? If so, one should be able to set the use_bias parameter to False for the same effect? :slight_smile:

Thanks @birdstream :slight_smile:

Thanks to your pytorch input I did a spot of googling: yes, tensorflow also has a use_bias argument for Dense, see here. No idea how much of a difference to compute efficiency it would make, but if batch norm is enabled it would make sense to set use_bias = False - every little helps.
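
For reference, the TF version of the suggestion would look something like this (sizes are placeholders):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, use_bias=False, input_shape=(28, 28, 1)),
    tf.keras.layers.BatchNormalization(),   # its beta takes over the job of the bias
    tf.keras.layers.Activation("relu"),
])
```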

@robertl Are you already doing this? (NB use_bias has probably been around for ages, but it does seem to be in TF 2.5.)

Opening the PL code editor for a conv layer reveals that use_bias is indeed set to True even when batchnorm is used. I feel I need to do some testing/benchmarking here :grin:


The bias and the batch norm offset (“beta” - these people have no imagination!) are both trainable parameters… not only is it inefficient to have the same thing twice, it is just conceivable to me that the two could also interact in unhelpful ways (e.g. if they don’t get updated at the same time - deliberately or accidentally - it could set up oscillations and create instability).

Let us know what you find out. I bet for small models it might be hard to measure a difference with GPU training, but it might be more significant for CPU-only training (though it wouldn’t surprise me if there were a single CPU instruction for multiply-and-add).

Awesome post @JulianSMoore! :smiley:

Just to chip in on the two points so far:
For the dropout, it should be per batch :slight_smile: If it was per iteration things would get very messy and per epoch would overexpose specific neurons.
For the bias, as @birdstream pointed out we don’t automatically disable it at this point, which is a great find! Added it as a task for us to take a look at


I was curious so I tried to set use_bias=False in the conv layer, but PL wouldn’t have it, because it expects the bias to be present and will throw an error. I had some success in simply passing tf.zeros() to the variables. I could train the model, but the training view got a little messed up - e.g. it wouldn’t show the map and such. I suspect I didn’t supply the right shape for the tensors maybe? :thinking:

You were probably one of those “meddling kids” the Scooby Doo villains kept complaining about, Joakim :wink:

I like the detective work :smiley: I guess use_bias=False fails because the backprop doesn’t know about it and still tries to train the bias… but no, hang on, it’s a TF graph, so why would PL be involved in that part?

You’ve just shone the torch on a door marked “Danger! Do not enter!” and I want in! :scream_cat:

I leave no stones unturned :joy:

It seems to me that, because the bias is missing, there is only one value to unpack, but the frontend expects two, so it fails on that part. It’s passed as something like kernel, bias = self.conv.weights. I tried passing tf.zeros() as the bias and it does train, but it messes up the training view a bit :sweat_smile:
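
For what it’s worth, a quick way to reproduce that failure mode outside PL (assuming standard Keras layers - I don’t know PL’s exact internals):

```python
import tensorflow as tf

conv = tf.keras.layers.Conv2D(8, 3, use_bias=False)
conv.build((None, 28, 28, 1))
print(len(conv.weights))              # 1: only the kernel, no bias variable
try:
    kernel, bias = conv.weights       # the two-way unpack the frontend seems to expect
except ValueError as e:
    print(e)                          # "not enough values to unpack..."
```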

I find myself lost behind that door already :joy:
