Training crash in the beginning due to thread errors

An unexpected crash occurs at the beginning of training with this error:

Traceback (most recent call last):
File “c:\python38\lib\site-packages\flask\”, line 1516, in full_dispatch_request
rv = self.dispatch_request()
File “c:\python38\lib\site-packages\flask\”, line 1502, in dispatch_request
return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
File “c:\python38\lib\site-packages\flask\”, line 84, in view
return current_app.ensure_sync(self.dispatch_request)(*args, **kwargs)
File “perceptilabs\endpoints\session\”, line 76, in perceptilabs.endpoints.session.base.SessionProxy.dispatch_request
File “perceptilabs\endpoints\session\”, line 154, in perceptilabs.endpoints.session.threaded_executor.ThreadedExecutor.send_request
File “perceptilabs\endpoints\session\”, line 175, in perceptilabs.endpoints.session.threaded_executor.ThreadedExecutor.get_task_info
File “perceptilabs\endpoints\session\”, line 81, in perceptilabs.endpoints.session.threaded_executor.TaskCache.get
File “perceptilabs\endpoints\session\”, line 91, in perceptilabs.endpoints.session.threaded_executor.TaskCache.get
File “perceptilabs\endpoints\session\”, line 125, in perceptilabs.endpoints.session.threaded_executor.ThreadedExecutor.start_task.run_task
File “perceptilabs\endpoints\session\”, line 114, in perceptilabs.endpoints.session.utils.run_kernel
File “c:\python38\lib\asyncio\”, line 608, in run_until_complete
return future.result()
File “perceptilabs\endpoints\session\”, line 98, in run
RuntimeError: Task failed!

Basically, I did everything as I usually do. I have already used PerceptiLabs for my projects without too many issues, but this time it just crashes.

The epoch/batch and other settings weren’t drastically modified and the code wasn’t customized either. It might have something to do with the images, but those are just classic JPEG files.

The output is set to 16 though.

Using latest versions of PerceptiLabs, CUDA, cuDNN, etc

Hi @max,
Welcome to the forum!

Sorry that you ran into an issue, let’s see if we can figure it out :slight_smile:
Do I understand that this is happening on a dataset which used to work?
Also, does this ever happen on the tutorial datasets or just custom ones?
And finally, if you run the tool as perceptilabs -v=3 (this enables more logging in the terminal), do you see any additional information in the terminal when you encounter this error?

All the best,

The dataset is probably ok, I mean, I already worked with these images using keras and sklearn in raw python, but it’s the first time I’ve tried it with PerceptiLabs.

The tutorial datasets and my other datasets work perfectly fine; this is the first time I’ve encountered this issue. I’m new to PerceptiLabs though.

Enabling verbose debug logs shows the exact same error without any other useful info, unfortunately.

Also I tried different versions of Python (3.8 and 3.7) but the result was still the same. I tried it only on Windows.

This also happens when I reduce my dataset from 2000 images to 200.

I would say that maybe some type of middleware can cause this.

I just read the logs, and right after the training loop and a bit before the error in question there is this:

c:\python38\lib\site-packages\sentry_sdk\integrations\ UserWarning: Creating resources inside a function passed to is not supported. Create each resource outside the function, and capture it inside the function to use it.
return old_run_func(self, *a, **kw)
2021-10-06 16:14:23,877 - INFO - - Request to endpoint ‘session_proxy’ took 0.0002s
INFO:perceptilabs.applogger:Request to endpoint ‘session_proxy’ took 0.0002s
2021-10-06 16:14:24,552 - ERROR - - Unexpected exception in CoreThread (issue origin: threading:932)
Traceback (most recent call last):
File “perceptilabs\”, line 37, in
File “perceptilabs\core_new\compatibility\”, line 41, in
File “perceptilabs\core_new\compatibility\”, line 58, in perceptilabs.core_new.compatibility.base.CompatibilityCore._run_trainer_threaded
File “perceptilabs\core_new\compatibility\”, line 64, in perceptilabs.core_new.compatibility.base.CompatibilityCore._run_trainer
File “perceptilabs\trainer\”, line 184, in run_stepwise
File “perceptilabs\trainer\”, line 275, in _loop_over_dataset
File “c:\python38\lib\site-packages\tensorflow\python\data\ops\”, line 761, in next
return self._next_internal()
File “c:\python38\lib\site-packages\tensorflow\python\data\ops\”, line 744, in _next_internal
ret = gen_dataset_ops.iterator_get_next(
File “c:\python38\lib\site-packages\tensorflow\python\ops\”, line 2727, in iterator_get_next
_ops.raise_from_not_ok_status(e, name)
File “c:\python38\lib\site-packages\tensorflow\python\framework\”, line 6897, in raise_from_not_ok_status
six.raise_from(core._status_to_exception(e.code, message), None)
File “”, line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot add tensor to the batch: number of elements does not match. Shapes are: [tensor]: [133,200,1], [batch]: [133,200,3] [Op:IteratorGetNext]

Looks like a dimension problem.

Looks like you found it, great! :slight_smile:
This happens when you have a mix of grayscale and RGB images in your dataset. A quick fix is to convert all your images to one or the other using some code like this:

import os
import cv2

folder = 'C:/Users/Robert/Documents/PerceptiLabs/Default/Water_Bodies_Dataset/Masks'
new_folder = 'C:/Users/Robert/Documents/PerceptiLabs/Default/Water_Bodies_Dataset/Grayscale'

# Make sure the output folder exists before writing to it
os.makedirs(new_folder, exist_ok=True)

files = sorted(os.listdir(folder))

for _file in files:
    # IMREAD_GRAYSCALE loads every image as 1 channel; use cv2.IMREAD_COLOR for 3 channels instead
    image = cv2.imread(os.path.join(folder, _file), cv2.IMREAD_GRAYSCALE)
    if image is None:  # skip files OpenCV cannot decode
        continue
    cv2.imwrite(os.path.join(new_folder, _file), image)

Hope that helps!

yeah it did the trick, so you were right, thanks
however, I converted everything to RGB rather than greyscale, because I would like to keep the colors
ideally, if PerceptiLabs detected such an issue and corrected it, or at least proposed a solution automatically, that would be great ^^
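
A rough sketch of the kind of pre-training check suggested here, assuming the dataset is a flat folder of JPEG files. It reads only the JPEG frame header, which declares the number of colour components (1 usually means grayscale, 3 usually means colour), so it does not need to decode every image. The names `jpeg_channels` and `audit_folder` are made up for illustration; this is not how PerceptiLabs works internally, and the channel count an image loader finally produces can depend on its decoding options.

```python
import struct
from pathlib import Path

# Start-of-frame markers carry the frame header.
# 0xC4 (DHT), 0xC8 (reserved) and 0xCC (DAC) are in the same range but are not SOF markers.
SOF_MARKERS = set(range(0xC0, 0xD0)) - {0xC4, 0xC8, 0xCC}


def jpeg_channels(path):
    """Return the component count declared in a JPEG header, or None if it cannot be read."""
    with open(path, "rb") as f:
        if f.read(2) != b"\xff\xd8":          # every JPEG starts with the SOI marker
            return None
        while True:
            if f.read(1) != b"\xff":          # EOF or malformed segment boundary
                return None
            marker = f.read(1)
            while marker == b"\xff":          # skip optional fill bytes
                marker = f.read(1)
            if not marker:
                return None
            code = marker[0]
            if code in SOF_MARKERS:
                # length(2) precision(1) height(2) width(2) components(1)
                header = f.read(8)
                return header[7] if len(header) == 8 else None
            if code in (0xDA, 0xD9):          # image data or end reached without a frame header
                return None
            if code == 0x01 or 0xD0 <= code <= 0xD7:
                continue                      # standalone markers have no length field
            raw_length = f.read(2)
            if len(raw_length) < 2:
                return None
            (length,) = struct.unpack(">H", raw_length)
            if length < 2:
                return None
            f.seek(length - 2, 1)             # skip the rest of this segment


def audit_folder(folder):
    """Group the JPEG file names in `folder` by declared channel count."""
    groups = {}
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() in {".jpg", ".jpeg"}:
            groups.setdefault(jpeg_channels(path), []).append(path.name)
    return groups
```

Calling `audit_folder("path/to/images")` returns a dictionary keyed by channel count. More than one key means the folder mixes image types, and the shorter list usually points at the files to convert. A `None` key collects files whose header could not be parsed.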

Hi @max

Good suggestion there about dataset consistency checks. I’m curious about where the issue actually occurred - often there’s a normalisation element in the TF graph, though I’m not sure what model you have, and one might have hoped that such issues would be caught here.

On the other hand, I could understand that examining all images before training could be inefficient - it would be better just to catch all such errors and give an explicit message.

Question for @robertl - is there a “fault finding” tree/process, or could one be created, starting from keyworded symptoms and giving specific check actions?

PL is not the source of the issue, but the easier it makes it to build models the more sanity checks/messages might be useful to users. If one is coding one expects to have to deal with many such little issues… what PL has done is to make many of them go away before we start :smiley: - but could it do more? I don’t think “diagnostics” need slow things down… just a bit more code to run on exceptions to provide a bit of interpretation?
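
To make the "bit more code to run on exceptions" idea concrete, here is a minimal sketch that only looks at the text of an exception and, for the batching error quoted earlier in this thread, turns it into a plainer hint. The function name `interpret` and the wording of the hints are my own invention, and the pattern is based solely on the one message shown above, so other TensorFlow versions may phrase it differently.

```python
import re

# Pattern taken from the error message quoted earlier in the thread.
SHAPE_MISMATCH = re.compile(
    r"Cannot add tensor to the batch: number of elements does not match\. "
    r"Shapes are: \[tensor\]: \[([\d, ]+)\], \[batch\]: \[([\d, ]+)\]"
)


def interpret(message):
    """Return a plain-language hint for a known error message, or None if unrecognised."""
    match = SHAPE_MISMATCH.search(message)
    if match is None:
        return None
    tensor_shape = [int(n) for n in match.group(1).split(",")]
    batch_shape = [int(n) for n in match.group(2).split(",")]
    same_size = tensor_shape[:-1] == batch_shape[:-1]
    if same_size and tensor_shape[-1] != batch_shape[-1]:
        # Same height and width, different last dimension: likely a channel mismatch.
        return (
            f"An item with {tensor_shape[-1]} channel(s) was added to a batch of "
            f"{batch_shape[-1]}-channel items. The dataset may mix grayscale and RGB images."
        )
    return (
        f"Items of shape {tensor_shape} and {batch_shape} cannot be batched together. "
        "Check that all inputs have the same dimensions."
    )
```

In use, something like `hint = interpret(str(exc))` inside an `except` block would be enough; the original traceback can still be logged unchanged when no hint is found.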

Hi @max, great suggestion!
We have something similar to that suggestion here:, where it would report the issue and which images had the issue. Still an early spec though.

@JulianSMoore, agreed that more diagnostics and better error reporting is a big part of the workflow and should be included in PL :slight_smile:
Would you explain your thinking on the “fault finding” tree/process a bit more?

Hi @robertl - “fault finding process” details? Sure :slight_smile:

List symptoms in the order they could occur, e.g. browser can’t find server, browser finds server but login doesn’t appear, login-appears but login doesn’t work, logged in but model hub is empty (when models should be present)…

It will become impossible to maintain that sequentiality, but then the issues are becoming more subtle and possibly less amenable to checklist-based fault-finding.

  1. Browser cannot connect to PL
    List possible causes and how to find out whether they apply, e.g.
    Possible causes: a) server startup failure, b) invalid port number, …

a) Browser Cannot Connect - diagnosis etc.

  • Return to the console, look for text containing “error” (then for some common errors, describe a fix?)
  • If error and fix not identified, ctrl-c the existing server and restart with “-v=3” debug option… look for “Warning”…

b) Invalid Port - diagnosis etc.

  • Make sure the PL servers are stopped (ctrl-c)
  • Windows: run netstat from the command line [more details…] if any of the following ports are found, there will be a conflict with PL… try to end the processes keeping those ports open and restart PL.
  • Linux…
  • MacOS…
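
As a platform-independent stand-in for the netstat step, a small sketch like the following could report whether something is already listening on a given local port. The function names are illustrative, and this thread does not say which ports PL needs, so the caller has to supply them.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if a TCP connection to host:port succeeds, i.e. something is listening."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


def busy_ports(ports, host="127.0.0.1"):
    """Return the subset of `ports` that already have a listener."""
    return [port for port in ports if port_in_use(port, host)]
```

Run it with the PL servers stopped: any port that `busy_ports` returns is held by some other process. It only detects listeners, so finding out which process owns the port still needs netstat or a similar tool.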

Non-specific diagnostic/corrective actions
1/ Try opening the PL URL in a private/incognito browser window
2/ After taking a copy of the database at this location [Win/Linux/Mac], delete the database and try restarting PL
3/ Try clearing the browser cache [warning: list of things NOT to delete because they won’t help and will create more work for the user… e.g. don’t delete logins]

NB Does the browser cache have a timeout? Is there any way (could there be?) within PL to “recreate” the database - after making a backup of course! (and ? restore from such a backup). Is there/could there be a way to clear the browser cache?

Q? How will cloud operations differ? Would the database, browser cache still be used locally - keeping only PL code etc. in the cloud?

Other Symptoms previously observed & worth noting with advice/ affected versions:

  • Component preview not shown on component (but exists in settings panel) = “Inconvenient, but model still runnable” etc.
  • Training becomes progressively slower

more on request - I hope you were just asking for an outline and not all the details :wink:
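
For what it is worth, the outline above could be held as plain data and searched by keyword. The sketch below is only an assumption about how such a tree might be structured: the class names, the keyword sets and the `lookup` scoring are all invented for illustration, and the entries are copied from the outline rather than from any PL documentation.

```python
from dataclasses import dataclass


@dataclass
class Cause:
    name: str
    checks: list      # ordered check actions for this cause


@dataclass
class Symptom:
    description: str
    keywords: set     # lower-case words used to match a user's description
    causes: list


FAULT_TREE = [
    Symptom(
        "Browser cannot connect to PL",
        {"browser", "connect", "server"},
        [
            Cause("Server startup failure", [
                'Return to the console and look for text containing "error"',
                "Stop the server with ctrl-c, restart with -v=3 and look for warnings",
            ]),
            Cause("Invalid port number", [
                "Make sure the PL servers are stopped (ctrl-c)",
                "Check whether another process is holding the ports PL needs",
            ]),
        ],
    ),
    Symptom("Training becomes progressively slower", {"training", "slow", "slower"}, []),
]


def lookup(query):
    """Return symptoms sharing at least one keyword with `query`, best match first."""
    words = set(query.lower().split())
    scored = [(len(words & s.keywords), i, s) for i, s in enumerate(FAULT_TREE)]
    scored.sort(key=lambda item: (-item[0], item[1]))   # ties keep the listed order
    return [symptom for score, _, symptom in scored if score > 0]
```

Keeping the entries in the order the symptoms would occur preserves the sequential checklist for early problems, while the keyword match gives a way in for the later, less ordered ones.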
