@robertlundberg provided a copy of the YouTube mnist GAN demo with additional normalisation.
After updating to perceptilabs-gpu 0.11.7 (in my Anaconda env, with TF 1.15 and CUDA brought into the env by the Conda install of TF; Win 10 Home 64-bit, 20H2), I ran the model and noted the following:
- Memory usage increases until all RAM and all swap are consumed
- The GPU (1080 Ti) sits at roughly 10% utilisation according to the CUDA graph in Task Manager; the clock rate and temperature rise, but it isn't being pushed very hard
- There’s then a message in the browser window warning that I am using a lot of memory!
- → How can I limit it?
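For what it’s worth, in plain TF 1.15 the usual memory knobs live in the session config - though note these only constrain GPU memory, not the host RAM/swap growth described above, and I don’t know whether PerceptiLabs exposes them. A minimal sketch, assuming direct access to the session:

```python
import tensorflow as tf  # TF 1.15

# Standard TF 1.x options for constraining GPU memory.
# These do NOT limit host RAM - for that, reducing batch size is
# usually the only practical lever.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True                    # allocate VRAM on demand
config.gpu_options.per_process_gpu_memory_fraction = 0.5  # cap at ~50% of VRAM

with tf.Session(config=config) as sess:
    pass  # build and run the training graph here
```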
- After reaching about 50% of training, I paused and put the machine to sleep…
- On resume, there was no way to resume training
- I also have no idea how to reset the system, but I think that’s what happened here: training couldn’t be resumed, so it restarted from scratch
- It then crashed Chrome after reaching about 20%
- I was watching the generated digits and, TBH, almost nothing is happening to the patterns - almost as though the random noise isn’t random at all.
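If the suspicion is that the noise fed to the generator isn’t actually varying, a quick sanity check (a pure-Python sketch, nothing PerceptiLabs-specific) is to confirm that successive latent samples differ, while a fixed seed reproduces the same vector:

```python
import random

def sample_latent(dim, rng):
    # Draw a fresh Gaussian latent/noise vector for the generator input.
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

rng = random.Random()            # unseeded: each call should differ
z1 = sample_latent(8, rng)
z2 = sample_latent(8, rng)
assert z1 != z2                  # fresh noise each iteration

a = sample_latent(8, random.Random(42))
b = sample_latent(8, random.Random(42))
assert a == b                    # same seed reproduces the same vector
```

If the equivalent check inside the training loop showed identical noise every iteration, the static-looking output would be explained.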
It now seems less like a memory leak than before (I even saw memory usage go down for a while), but memory use seems unconstrained at best: 98% of 64 GB RAM plus a huge swapfile, almost all of it attributable to Python. I’m quite impressed that Win 10 didn’t complain or crash itself.
I still notice that the generated digits vary only imperceptibly from iteration to iteration over quite a long period, and looking at the gradients in various places they seem to have vanished, which could be why nothing much changes. (NB: the graph colours make min/max/average very hard to distinguish.)
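Flat-lining gradient curves are consistent with classic vanishing gradients: every saturating activation multiplies another factor of at most 0.25 into the chain-rule product. A toy illustration in plain Python (not the actual model, just the mechanism):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def chain_gradient(x, depth):
    # Gradient of sigmoid applied `depth` times, via the chain rule:
    # each layer contributes a factor sigmoid'(a) = s * (1 - s) <= 0.25,
    # so the product shrinks exponentially with depth.
    grad, a = 1.0, x
    for _ in range(depth):
        s = sigmoid(a)
        grad *= s * (1.0 - s)
        a = s
    return grad

print(chain_gradient(0.5, 2))    # a modest gradient
print(chain_gradient(0.5, 20))   # many orders of magnitude smaller
```

When the product collapses like this, weights barely move, which would match the near-static generated digits.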
I would really like a step-by-step setup guide for PerceptiLabs, TF, and CUDA that leads to a controlled, convergent model… can anyone help contribute one?
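In case it helps anyone assemble such a guide, this is roughly the env recipe I’ve been following; the version pins are my own assumptions from this thread, not an official recommendation:

```shell
# Sketch of a clean Anaconda env (versions assumed, not officially verified).
conda create -n perceptilabs python=3.7
conda activate perceptilabs
# Conda's tensorflow-gpu metapackage pulls matching CUDA/cuDNN into the env:
conda install tensorflow-gpu=1.15
pip install perceptilabs-gpu==0.11.7
# then launch PerceptiLabs as usual and open the localhost URL it prints
```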
Right now I have seen a really interesting concept, but I have been unable to get even the simplest model to work (run to the end without crashing, or produce sensible intermediate results), which is very frustrating.
It is also very slow: I’ve run small TF demos before, and training on less capable CPU-only setups has been much faster than what I’m currently seeing. I’d expect about 5 minutes for the mnist GAN, but an hour seems more likely (when it runs to completion at all).
Are the slowness, memory usage, GPU under-utilisation, and crashing all related, perhaps?
Clearly the team and other users are enjoying success with this - what’s gone wrong with my setup and approach?