Issue - model cannot run to completion & crashes (itself and/or Chrome)

@robertlundberg provided a copy of the YouTube MNIST GAN demo with additional normalisation.

After updating to perceptilabs-gpu 0.11.7 (in my Anaconda env with TF 1.15 and CUDA brought into the env by the conda install of TF; Win 10 64-bit, Home, 20H2), I ran the model and noted the following:

  • Memory usage increases until all RAM and all swap are consumed
  • GPU (1080 Ti) utilisation is below ~10% according to the CUDA graph in Task Manager; the clock rate goes up and the temperature rises, but it’s not being pushed very hard
  • There’s then a message in the browser window that I am using a lot of memory!
  • -> How can I limit it? (see the memory-cap sketch below this list)
  • After reaching about 50% of training, I paused and put the machine to sleep…
  • On resuming the machine, there’s no way to resume training
  • Relatedly, I have no idea how to reset the system, but I think that is what happened here: I couldn’t resume, so training restarted
  • Then it crashed Chrome after reaching about 20%
  • I was watching the generated digits and, TBH, almost nothing is happening to the patterns - almost as though the random input isn’t random at all.
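
For what it’s worth, the only memory knobs I know of are TF’s own, and I don’t know whether PerceptiLabs exposes a way to pass a session config through its UI, so the sketch below assumes a plain TF 1.15 rebuild of the GAN. It also only caps GPU memory rather than the host RAM/swap that is actually filling up here, but it is the standard TF 1.x mechanism:

```python
import tensorflow as tf  # TF 1.15

# Standard TF 1.x way to limit GPU memory allocation. This does NOT cap host
# RAM; it only controls how much of the 1080 Ti's memory the process grabs.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True                    # allocate GPU memory on demand
config.gpu_options.per_process_gpu_memory_fraction = 0.5  # or hard-cap at ~50% of the card

with tf.Session(config=config) as sess:
    # ... build and train the model inside this session ...
    pass
```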

It now seems less like a memory leak than before (I even saw memory usage go down for a while), but memory use seems at best unconstrained: 98% of 64 GB RAM and a huge swap file, most of which is due to Python. I’m quite impressed that Win 10 didn’t complain or crash itself.

I still notice that the generated digits vary only imperceptibly from iteration to iteration over quite a long period, and looking at the gradients in various places they seem to have vanished, which could be why nothing much seems to change. (NB: the graph colours make min/max/average very hard to distinguish.)
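
In case it helps anyone reproduce this outside the tool, here is a minimal, self-contained sketch of how I would check for vanishing gradients in TF 1.x; the tiny “generator” and loss below are stand-ins of my own, not anything from PerceptiLabs or the demo:

```python
import tensorflow as tf  # TF 1.15

tf.reset_default_graph()

# Toy stand-in for the generator; in the real GAN you would point `loss` and
# the variable scope at your own generator (these names are my assumptions).
with tf.variable_scope("generator"):
    z = tf.random.normal([32, 100])
    h = tf.layers.dense(z, 128, activation=tf.nn.sigmoid)
    fake = tf.layers.dense(h, 784, activation=tf.nn.sigmoid)
loss = tf.reduce_mean(tf.square(fake))  # placeholder loss, just for the demo

gen_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="generator")
grad_norm = tf.global_norm(tf.gradients(loss, gen_vars))  # one scalar, easier to read than the plots

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print("generator gradient norm:", sess.run(grad_norm))  # ~0 => gradients have vanished
```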

I would really like a step-by-step setup guide for TF, PerceptiLabs and CUDA that leads to a controlled and convergent model… can anyone help contribute to that?
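
As a possible first step for such a guide, this is the sanity check I would run after the conda install, just to confirm that TF can actually see CUDA and the 1080 Ti (standard TF 1.15 calls, nothing PerceptiLabs-specific):

```python
import tensorflow as tf  # TF 1.15; the conda install pulls in CUDA/cuDNN
from tensorflow.python.client import device_lib

# Confirm the conda-installed TF build sees the CUDA runtime and the GPU.
print("TF version:      ", tf.__version__)
print("Built with CUDA: ", tf.test.is_built_with_cuda())
print("GPU available:   ", tf.test.is_gpu_available(cuda_only=True))
print("Local devices:   ", [d.name for d in device_lib.list_local_devices()])
```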

Right now I have seen a really interesting concept, but I have been unable to get even the simplest model to work (i.e. run to the end without crashing, or produce sensible intermediate results), which is very frustrating.

It is also very slow; I’ve run small TF demos before on less capable, CPU-only setups, and training was much faster than I am currently experiencing. I’d expect about 5 minutes for the MNIST GAN, but an hour seems more likely (when it runs to completion).
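
One thing I plan to try, again in plain TF 1.15 rather than through PerceptiLabs, is logging device placement to see whether ops are actually landing on the GPU at all, since everything running on the CPU would explain both the slowness and the near-idle CUDA graph:

```python
import tensorflow as tf  # TF 1.15

# Log which device each op is placed on; the matmul below should show up
# as /device:GPU:0 in the output if the GPU is really being used.
config = tf.ConfigProto(log_device_placement=True)

a = tf.random.normal([1000, 1000])
b = tf.random.normal([1000, 1000])
c = tf.matmul(a, b)

with tf.Session(config=config) as sess:
    sess.run(c)  # placement lines are printed as the graph executes
```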

Are the slowness, memory usage, GPU under-usage, and crashing perhaps all related?

Clearly the team and other users are enjoying success with this - what’s gone wrong with my setup and approach?