Once upon a time re-running a model would cause memory to saturate.
Now it seems that after one run PL wishes instead to spare the GPU from further exertions and so only uses the CPU…
Whereas on run 1 the CPU was (on this graph) at ~20%, on run 2 it shot up immediately to almost 100% - here it is at 95%; elsewhere I was seeing 98%, 99%. I had to stop this run (obviously).
Which was sad because I had just edited the code on the training node to use “/checkpoint” rather than “//checkpoint” and was going to walk away for 40 mins. (I hope the model saved was viable…)
UPDATE: the model was NOT viable after restarting the server and reopening it. This is a problem - I can’t tell what’s wrong… I almost have to rebuild the entire thing every time I want to run something.
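For what it’s worth, a stray double slash like the one I just edited out usually comes from naively concatenating a base directory ending in “/” with a sub-path starting with “/”. A minimal sketch of a defensive join (the `clean_path` helper is hypothetical, not from my actual code):

```python
def clean_path(*parts: str) -> str:
    """Join POSIX-style path parts, collapsing accidental duplicate slashes.

    Assumes the result should be an absolute path.
    """
    cleaned = [p.strip("/") for p in parts]
    return "/" + "/".join(c for c in cleaned if c)

# Naive concatenation reproduces the bug I had to fix by hand:
print("/" + "/checkpoint")             # //checkpoint
print(clean_path("/", "/checkpoint"))  # /checkpoint
```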
UPDATE 2: even Ctrl-C’ing the server and restarting it would not persuade PL to use the GPU again. After two or three more failed attempts in PL, I checked with Jupyter notebook code that I could train on the GPU, and it ran just fine. (The only things I haven’t tried yet are exiting the Anaconda prompt window and/or rebooting the machine…)
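For anyone trying to reproduce this: the Jupyter check above was essentially asking whether CUDA is visible to the process at all. A minimal version of that sanity check, assuming torch is installed (the `pick_accelerator` helper is my own naming, not part of any library):

```python
import torch

def pick_accelerator() -> str:
    """Return "gpu" if CUDA is usable in this process, else "cpu"."""
    return "gpu" if torch.cuda.is_available() else "cpu"

# If this prints "cpu" in the same environment where PL refuses the GPU,
# the problem is the process/environment, not PL itself.
print(pick_accelerator())
```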
It is also effectively impossible to save the textile model and restore it: there is always a problem with the local data components. I need to disconnect them from the other components, disconnect them from their data sources, reconnect them to the data sources, and then relink the components. Every. Time.
Ctrl+F5 does not help. Running in Incognito window does not help. Clearing the cache does not help.
I am in danger of becoming ungenerous in my reporting… I have reached my frustration limit and am walking away for the rest of the day. Hopefully, tomorrow I shall be full of the joys of spring again.
I would like to think that I am stupidly overlooking something I should be doing… I wonder what it might be.
Any ideas? (On that, or on the reported issues.)