I just had a model (outside PL) explode, which was unexpected because exactly the same model had been run before with exactly the same data; it shows that model behaviour can be initialisation-dependent.
So, since I'm now looking at the use of clipnorm/clipvalue etc. (for those unfamiliar, these are kwargs on the abstract Optimizer class and are therefore common to all the Keras optimisers), I wondered how PL handles gradient explosion.
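For anyone who hasn't used those kwargs, here is a minimal pure-Python sketch of the two clipping strategies they correspond to (the function names are my own for illustration, not part of the Keras API):

```python
import math

def clip_by_norm(grads, max_norm):
    """What clipnorm does: if the gradient's L2 norm exceeds
    max_norm, rescale the whole vector so its norm equals max_norm;
    otherwise leave it untouched. Direction is preserved."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grads]
    return grads

def clip_by_value(grads, limit):
    """What clipvalue does: clamp each gradient component
    independently into [-limit, limit]. Direction may change."""
    return [max(-limit, min(limit, g)) for g in grads]

# An exploding gradient gets rescaled rather than applied as-is:
# [30, 40] has norm 50, so with max_norm=5 it is scaled down by 0.1.
clipped = clip_by_norm([30.0, 40.0], max_norm=5.0)
```

The norm-based version is usually preferred for explosion, since it caps the update size without distorting the gradient's direction.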
Does it apply gradient clipping in any way, or is random gradient explosion just something the user should be aware of? Or are there other mitigations applied?