Textile ResNet model - cuDNN error?

Update 3 - Apparently resolved!

Thanks to Robert’s pip list and a line-by-line comparison with mine: it seems my installation (somehow) had tensorflow-estimator 2.4.1 and tensorboard 2.4.0. I’ve removed and reinstalled them, and now that I have 1.15.1 and 1.15.0 like you, the Textile model runs.

It is very interesting to note that a muddled environment could cause the cuDNN error and then the missing tensorflow.python.types error.
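For anyone hitting the same thing, a check along these lines might catch the mismatch faster than comparing pip lists by eye. It is only a sketch: the helper names `installed_versions` and `mismatched` are made up for illustration, and the assumption that every listed package should be in the 1.15 series comes from the versions mentioned above, not from any official compatibility table.

```python
try:
    from importlib import metadata  # Python 3.8+
except ImportError:
    # Older interpreters: assumes the importlib_metadata backport is installed
    import importlib_metadata as metadata

# Packages that were compared in this thread (tensorflow-gpu added as a guess)
PACKAGES = ("tensorflow", "tensorflow-gpu", "tensorflow-estimator", "tensorboard")


def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def mismatched(versions, expected_series="1.15"):
    """Return the installed packages whose version is outside expected_series."""
    return {
        name: version
        for name, version in versions.items()
        if version is not None
        and version != expected_series
        and not version.startswith(expected_series + ".")
    }


if __name__ == "__main__":
    versions = installed_versions()
    for name, version in versions.items():
        print(f"{name}: {version or 'not installed'}")
    for name, version in mismatched(versions).items():
        print(f"WARNING: {name} {version} is not in the 1.15 series")
```

Run it inside the activated environment, so that it reports on the same site-packages the model uses.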

Wasted most of the day on this and it was entirely my problem.

Update 2

It is more complicated than it seems… anyone who actually knows what’s going on and how to resolve this is welcome to join in.

I was using this code that was recommended as helping…

import tensorflow as tf  # TensorFlow 1.x API

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand
tf.keras.backend.set_session(tf.Session(config=config))

And everything was good. I commented it out. Everything was still good. I tried something else without it and it failed… I put the code in and it did not fix it… and the previously working code continued to work but 1-2 orders of magnitude slower… as though it wasn’t using the GPU at all.

I’m very confused.

In the Python Anaconda env with CUDA the Textile model does not suffer from the convolution issue, but does complain

ModuleNotFoundError("No module named 'tensorflow.python.types'")

I think I have noted that before. I’m very confused. Did I say that?

Update 1

There is a problem with the OS CUDA installation. Code that failed with the OS CUDA runs in an Anaconda environment that has its own CUDA installed. I expect the PerceptiLabs models will also be OK in that environment (but I have not yet tested that).

I will try to fix the OS installation (once I know what is wrong).

NB the NVIDIA driver had just been updated to 641.40 and the PC rebooted; the tf convolution test was run first with the OS CUDA (failed), then re-run in a different environment (env CUDA), where it ran.
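One way to see which CUDA and cuDNN libraries a given environment would pick up is to look for the DLLs on PATH. The sketch below assumes Windows with CUDA 10.0 and cuDNN 7.x, for which the runtime DLLs are named cudart64_100.dll and cudnn64_7.dll. The helper `find_on_path` is hypothetical, and PATH is only part of how Windows resolves DLLs, so treat the result as a hint rather than proof.

```python
import os

# DLL names assumed for CUDA 10.0 and cuDNN 7.x on Windows
CUDA_DLLS = ("cudart64_100.dll", "cudnn64_7.dll")


def find_on_path(filename, path=None):
    """Return every PATH directory containing `filename`, in search order."""
    if path is None:
        path = os.environ.get("PATH", "")
    hits = []
    for directory in path.split(os.pathsep):
        if directory and os.path.isfile(os.path.join(directory, filename)):
            hits.append(directory)
    return hits


if __name__ == "__main__":
    for dll in CUDA_DLLS:
        hits = find_on_path(dll)
        print(dll, "->", hits if hits else "not found on PATH")
```

Running it once in the failing environment and once in the working one should show whether the two are resolving different copies. More than one hit for the same DLL would suggest an OS copy and an env copy competing.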

Initial Report
When I try to run the model it says (see comments afterwards!)

Userland error in layer 1599466492197 [DeepLearningConv]. Line: 32
2 root error(s) found.
  (0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node DeepLearningConv_Convolution_1/Conv2D (defined at c:\users\julian\anaconda3\envs\perceptilabs-gpu\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) ]]
	 [[gradients/DeepLearningConv_Convolution_1/Conv2D_grad/Conv2DBackpropFilter/_415]]
  (1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node DeepLearningConv_Convolution_1/Conv2D (defined at c:\users\julian\anaconda3\envs\perceptilabs-gpu\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.

File "C:\Users\Julian\AppData\Local\Temp/training_script.py", line 2195, in main
  server.run(auto_start=auto_start)
File "C:\Users\Julian\AppData\Local\Temp/training_script.py", line 1781, in run, origin 1599467087463, line 394 [TrainNormal]
  self.init_layer(graph, mode)
File "C:\Users\Julian\AppData\Local\Temp/training_script.py", line 1547, in init_layer, origin 1599467087463, line 160 [TrainNormal]
  layer_output_tensors = build_graph(input_tensor, label_tensor)
File "C:\Users\Julian\AppData\Local\Temp/training_script.py", line 1541, in build_graph, origin 1599467087463, line 154 [TrainNormal]
  is_training=is_training
File "C:\Users\Julian\AppData\Local\Temp/training_script.py", line 297, in __call__, origin 1599466492197, line 32 [DeepLearningConv]
  y = tf.add(tf.nn.conv2d(x, W, strides=[1, self._stride, self._stride, 1], padding=self._padding), b)
File "c:\users\julian\anaconda3\envs\perceptilabs-gpu\lib\site-packages\tensorflow_core\python\ops\nn_ops.py", line 2010, in conv2d
  name=name)
File "c:\users\julian\anaconda3\envs\perceptilabs-gpu\lib\site-packages\tensorflow_core\python\ops\gen_nn_ops.py", line 1071, in conv2d
  data_format=data_format, dilations=dilations, name=name)
File "c:\users\julian\anaconda3\envs\perceptilabs-gpu\lib\site-packages\tensorflow_core\python\framework\op_def_library.py", line 794, in _apply_op_helper
  op_def=op_def)
File "c:\users\julian\anaconda3\envs\perceptilabs-gpu\lib\site-packages\tensorflow_core\python\util\deprecation.py", line 507, in new_func
  return func(*args, **kwargs)
File "c:\users\julian\anaconda3\envs\perceptilabs-gpu\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3357, in create_op
  attrs, op_def, compute_device)
File "c:\users\julian\anaconda3\envs\perceptilabs-gpu\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3426, in _create_op_internal
  op_def=op_def)
File "c:\users\julian\anaconda3\envs\perceptilabs-gpu\lib\site-packages\tensorflow_core\python\framework\ops.py", line 1748, in __init__

I put a Convolution in the mnist model and got the same error.

Interestingly this error has occurred to others in other circumstances, see github and github and superuser

Now, I am not using CUDA in the environment; this environment uses the OS-installed CUDA (CUDA 10.0, cuDNN 7.4.2), which is working as far as I can tell (see below)… but not for convolutions.

Questions

  • any idea how to resolve this specific issue?
  • any idea how to test/verify cuDNN initialisation generally?
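On the second question, one possible approach is a tiny convolution run outside PerceptiLabs: if cuDNN cannot initialise, the same "Failed to get convolution algorithm" error should show up here with far less surrounding machinery. This is a sketch, not an official diagnostic. The function name `conv_smoke_test` is made up, it goes through tf.compat.v1 so it should behave the same on TensorFlow 1.15 and 2.x, and it will happily run on the CPU if no GPU is visible, so a pass only means something if the GPU is actually being used.

```python
def conv_smoke_test():
    """Run one small conv2d in a fresh graph; return (ok, message)."""
    try:
        import tensorflow as tf
    except ImportError:
        return False, "tensorflow is not installed in this environment"

    config = tf.compat.v1.ConfigProto()
    # Assumption: grow GPU memory on demand instead of reserving it all up front
    config.gpu_options.allow_growth = True

    with tf.Graph().as_default():
        x = tf.random.normal([1, 28, 28, 1])  # one 28x28 single-channel image (NHWC)
        w = tf.random.normal([3, 3, 1, 8])    # eight 3x3 filters
        y = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding="SAME")
        try:
            with tf.compat.v1.Session(config=config) as sess:
                out = sess.run(y)
        except tf.errors.OpError as err:
            # A cuDNN initialisation failure surfaces here as an UnknownError
            return False, "conv2d failed: " + str(err.message)
    return True, "conv2d ran, output shape " + str(out.shape)


if __name__ == "__main__":
    ok, message = conv_smoke_test()
    print("OK" if ok else "FAILED", "-", message)
```

The warning lines TensorFlow prints before the result (about loading cudnn64_7.dll and friends) are usually the most informative part of the output.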

Why do I think CUDA, cuDNN are probably OK? (tests previously reported in other threads but brief note here)

  • IGNORE! Mathematica detects and uses GPU -> It also uses a local CUDA installation.
  • Separate Python tests of element-wise matrix multiplication (10^8 elements each) on the specified, detected GPU are fine.

Here’s some TF code I used…

import time

import numpy as np
import tensorflow as tf  # written against TensorFlow 1.x (uses tf.Session)

loops = 10
n = 10000
mat1 = np.random.random((n, n))
mat2 = np.random.random((n, n))

with tf.device(tf.test.gpu_device_name()):
    # NB tf.math.multiply is element-wise: n**2 multiplications per evaluation
    mmul = tf.math.multiply(mat1, mat2)
    startT = time.perf_counter()
    with tf.Session() as sess:
        for i in range(loops):
            result = mmul.eval()
    endT = time.perf_counter()

print(f"{loops * (n ** 2):,} multiplications done in {endT - startT:0.4f} seconds")
print(result)

(I simply cannot get that code to format right here… unless I put some text between the last bullet and the preformatted block. Just my luck)

I could not use the in-app github bug report (spinning wait circles preventing??) so here is the report I tried to file there

Report belonging to forum post Textile ResNet model - cuDNN error?

Userland error in layer 1599466492197 [DeepLearningConv]. Line: 30
ModuleNotFoundError("No module named 'tensorflow.python.types'")

File "C:\Users\Julian\AppData\Local\Temp/training_script.py", line 1781, in run, origin 1599467087463, line 394 [TrainNormal]
  self.init_layer(graph, mode)
File "C:\Users\Julian\AppData\Local\Temp/training_script.py", line 1547, in init_layer, origin 1599467087463, line 160 [TrainNormal]
  layer_output_tensors = build_graph(input_tensor, label_tensor)
File "C:\Users\Julian\AppData\Local\Temp/training_script.py", line 1541, in build_graph, origin 1599467087463, line 154 [TrainNormal]
  is_training=is_training
File "C:\Users\Julian\AppData\Local\Temp/training_script.py", line 295, in __call__, origin 1599466492197, line 30 [DeepLearningConv]
  W = tf.compat.v1.get_variable('W', shape = shape, initializer=  tf.contrib.layers.xavier_initializer())
File "c:\users\julian\anaconda3\envs\perceptilabs_tf1-15_gpu\lib\site-packages\tensorflow_core\python\util\lazy_loader.py", line 62, in __getattr__
  module = self._load()
File "c:\users\julian\anaconda3\envs\perceptilabs_tf1-15_gpu\lib\site-packages\tensorflow_core\python\util\lazy_loader.py", line 45, in _load
  module = importlib.import_module(self.__name__)
File "c:\users\julian\anaconda3\envs\perceptilabs_tf1-15_gpu\lib\importlib\__init__.py", line 127, in import_module
  return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  
File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  
File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  
File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  
File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  
File "c:\users\julian\anaconda3\envs\perceptilabs_tf1-15_gpu\lib\site-packages\tensorflow_core\contrib\__init__.py", line 39, in <module>
  from tensorflow.contrib import compiler
File "c:\users\julian\anaconda3\envs\perceptilabs_tf1-15_gpu\lib\site-packages\tensorflow_core\contrib\compiler\__init__.py", line 21, in <module>
  from tensorflow.contrib.compiler import jit
File "c:\users\julian\anaconda3\envs\perceptilabs_tf1-15_gpu\lib\site-packages\tensorflow_core\contrib\compiler\__init__.py", line 22, in <module>
  from tensorflow.contrib.compiler import xla
File "c:\users\julian\anaconda3\envs\perceptilabs_tf1-15_gpu\lib\site-packages\tensorflow_core\contrib\compiler\xla.py", line 22, in <module>
  from tensorflow.python.estimator import model_fn as model_fn_lib
File "c:\users\julian\anaconda3\envs\perceptilabs_tf1-15_gpu\lib\site-packages\tensorflow_core\python\estimator\model_fn.py", line 26, in <module>
  from tensorflow_estimator.python.estimator import model_fn
File "c:\users\julian\anaconda3\envs\perceptilabs_tf1-15_gpu\lib\site-packages\tensorflow_estimator\python\estimator\model_fn.py", line 29, in <module>
  from tensorflow.python.types import core

Hi @JulianSMoore,
Glad to hear that you solved it, although it sounds like it was a painful trip.

The spinning circles should not prevent GitHub bug reports, as the two are operations on different services (and they are threaded).
How did it look for you when you tried to report to GitHub?
Did the Post button simply not work or was there something else failing?

Post button didn’t work, that was all; I just naively assumed it was due to the unsaved state as I have previously posted successfully (at times when I was not aware of the spinning circles indicating incomplete action)

@robertl PS do I need to tag you in replies, or do you see that I have replied anyway? It’s not clear to me whether the reply action relates to the thread or the individual post.

I’ll double check on the spinning circles.

I’ll see your replies without you tagging me as well :slight_smile: