Why might the training randomly freeze?

Hello, I’ve managed to solve the previous one, but now the training freezes.

Epoch 35 has stayed the same for about 30 minutes. And if I stop Python, on the next run I get an error like:

Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node replica_1/model_1/conv_0/convolution}}]]
[[{{node training/Adam/gradients/replica_0/model_1/bnorm_96/FusedBatchNorm_grad/FusedBatchNormGrad}}]]

And I have to reinstall TensorFlow and clean the temp files. How can I avoid these freezes?

I’m on Win10, Python 3.7, tensorflow-gpu 1.13.1 and CUDA 10.1 with the corresponding cuDNN and Visual Studio 2019.

  • Firstly, I would advise you to update to the latest ImageAI version, v2.1.5, which includes a progress bar for your training.
    pip3 install imageai --upgrade
  • Secondly, what batch_size are you using?
  • What type of GPU are you using?

Yep, I have the latest ImageAI (2.1.5).
batch_size is 4, but it doesn’t matter: I get the same error even if it’s 2.
If I understand your question correctly, I have an RTX 2080 Ti.

I have recorded a little gif

@Akinori Kindly visit the discussion linked below on solving this.

Unfortunately it doesn’t work. I installed cudatoolkit as well as tensorflow-gpu in a conda environment, and the successful setup that page offers does not fit my case (I’ve tried CUDA v9.0 and 10.0; cuDNN is 7.6 in both). By the way, I managed to make it work with CPU TensorFlow, but that is not a workable scenario for me: it takes forever even with a very low number of batches and pictures.

Oh! I found a way to run the process without this error, using this code:

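from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

# Allocate GPU memory on demand instead of reserving it all at startup.
config = ConfigProto()
config.gpu_options.allow_growth = True
# Assumption: the config only takes effect once a session is created with it.
session = InteractiveSession(config=config)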

But unfortunately the process still freezes; it has been stuck like this for 5 minutes already:

those are the snippets right before the freeze

and one second after
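
Not a fix, but one possible way to see where the process is stuck when it freezes like this: Python's standard faulthandler module can dump the stack of every thread if a timer is not reset in time. The sketch below is only an illustration. arm_watchdog, disarm_watchdog, run_steps and step_fn are hypothetical names that are not part of ImageAI or TensorFlow, and step_fn merely stands in for one batch of training work.

```python
import faulthandler
import sys


def arm_watchdog(timeout_s=300, stream=sys.stderr):
    """(Re)start a timer that dumps all thread stacks after timeout_s seconds.

    A new call replaces the pending timer, so calling this once per batch
    works as a heartbeat: the dump only happens if a batch takes too long.
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=stream)


def disarm_watchdog():
    """Cancel the pending timer, if any."""
    faulthandler.cancel_dump_traceback_later()


def run_steps(step_fn, n_steps, timeout_s=300, stream=sys.stderr):
    """Hypothetical loop: step_fn(step) stands in for one batch of work."""
    try:
        for step in range(n_steps):
            arm_watchdog(timeout_s, stream)  # reset the timer before each step
            step_fn(step)
    finally:
        disarm_watchdog()
```

When the training is a single blocking call with no place for a heartbeat, a simpler variant is to call faulthandler.dump_traceback_later(600, repeat=True) once before starting it, which prints the stacks every 10 minutes whether or not the run is stuck. The dump shows only Python-level frames, so it narrows down which call is hanging rather than explaining why the GPU stopped responding.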