CUDA error out of memory: what to do - Ок! Компьютер

When I started training a neural network, it hit CUDA_ERROR_OUT_OF_MEMORY, but training continued without failing. I wanted the process to use only as much GPU memory as it actually needs, so I set gpu_options.allow_growth = True. The logs are as follows:

Running the nvidia-smi command afterwards showed:

After I commented out gpu_options.allow_growth = True, I trained the net again and everything was normal; CUDA_ERROR_OUT_OF_MEMORY did not appear. Finally, I ran the nvidia-smi command again, which showed:

I have two questions about this. Why did CUDA_ERROR_OUT_OF_MEMORY appear while the program kept running normally? And why did the memory usage become smaller after I commented out allow_growth = True?

Contents

6 Answers
Stee1Arm
pilat200
Comments
mkabatek commented Dec 28, 2017
Tottom commented Dec 30, 2017
mkabatek commented Dec 30, 2017
Tottom commented Dec 30, 2017
mkabatek commented Dec 30, 2017
Tottom commented Dec 30, 2017
Tottom commented Dec 30, 2017
Tottom commented Dec 31, 2017
remotetech commented Jan 10, 2018
Tottom commented Jan 10, 2018
remotetech commented Jan 10, 2018
remotetech commented Jan 11, 2018
Tottom commented Jan 11, 2018
Tottom commented Jan 11, 2018
raduvultur commented Jan 12, 2018
remotetech commented Jan 16, 2018
HoverDrive commented Feb 16, 2018
samywee commented Feb 17, 2018

6 Answers

In case it’s still relevant for someone, I encountered this issue when trying to run Keras/Tensorflow for the second time, after a first run was aborted. It seems the GPU memory is still allocated, and therefore cannot be allocated again. It was solved by manually ending all python processes that use the GPU, or alternatively, closing the existing terminal and running again in a new terminal window.
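To find leftover processes like the ones described above, one option is to ask nvidia-smi which processes currently hold GPU memory. The following is a minimal sketch, assuming nvidia-smi is on the PATH and that its `--query-compute-apps` CSV output contains one `pid, process_name, used_memory` row per process. The helper names `parse_compute_apps` and `python_gpu_processes` are hypothetical.

```python
import subprocess

# nvidia-smi query that lists compute processes as CSV rows without a header.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader",
]


def parse_compute_apps(csv_text):
    """Parse 'pid, process_name, used_memory' rows into a list of dicts."""
    rows = []
    for line in csv_text.splitlines():
        parts = line.split(",")
        if len(parts) < 3 or not parts[0].strip().isdigit():
            continue  # skip blank or unexpected lines
        rows.append({
            "pid": int(parts[0].strip()),
            # Re-join the middle fields in case the process path contains commas.
            "name": ",".join(parts[1:-1]).strip(),
            "used_memory": parts[-1].strip(),
        })
    return rows


def python_gpu_processes():
    """Return GPU compute processes whose name contains 'python'."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return [r for r in parse_compute_apps(result.stdout)
            if "python" in r["name"].lower()]
```

Calling `python_gpu_processes()` on the affected machine lists the candidate PIDs, which can then be ended manually as the answer suggests.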

By default, TensorFlow tries to allocate a fraction per_process_gpu_memory_fraction of the GPU memory to its process, to avoid costly memory management (see the GPUOptions comments).
This can fail and raise the CUDA_ERROR_OUT_OF_MEMORY warnings. I do not know what the fallback is in this case (either using CPU ops or allow_growth=True).
This can happen if another process is using the GPU at that moment (for instance, if you launch two processes running TensorFlow). The default behavior takes roughly 95% of the memory (see this answer).

When you use allow_growth = True, the GPU memory is not preallocated and can grow as you need it. This leads to smaller memory usage (as the default option is to use the whole memory) but decreases performance if not used properly, because it requires more complex memory handling (which is not the most efficient part of CPU/GPU interactions).
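The two allocation strategies described in this answer correspond to fields of TensorFlow's GPUOptions. The following is a minimal sketch, assuming TensorFlow is installed and that the TF 1.x-style session API is reached through `tf.compat.v1`. The helper names `gpu_option_kwargs` and `make_session` are hypothetical and not part of TensorFlow.

```python
def gpu_option_kwargs(allow_growth=True, memory_fraction=None):
    """Build keyword arguments for GPUOptions.

    allow_growth=True: do not preallocate; grow GPU memory on demand.
    memory_fraction: optional cap, as a fraction of each GPU's memory.
    """
    kwargs = {"allow_growth": bool(allow_growth)}
    if memory_fraction is not None:
        if not 0.0 < memory_fraction <= 1.0:
            raise ValueError("memory_fraction must be in (0, 1]")
        kwargs["per_process_gpu_memory_fraction"] = float(memory_fraction)
    return kwargs


def make_session(allow_growth=True, memory_fraction=None):
    """Create a session that uses the chosen GPU memory policy."""
    # Imported lazily so the helper above can be used without TensorFlow.
    import tensorflow as tf

    gpu_options = tf.compat.v1.GPUOptions(
        **gpu_option_kwargs(allow_growth, memory_fraction)
    )
    config = tf.compat.v1.ConfigProto(gpu_options=gpu_options)
    return tf.compat.v1.Session(config=config)
```

For example, `make_session(allow_growth=False, memory_fraction=0.4)` would ask for a fixed 40% share per GPU, which may be useful when two TensorFlow processes have to share one card.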

Stee1Arm

Newbie

Good evening.
Could you tell me what might be causing this:
I mine on NiceHash, and when it switches to the NeoScrypt algorithm, an "out of memory" error is thrown.
The miner restarts and throws the same error again, over and over, until it switches to a different algorithm.
When I enable NeoScrypt on only 2 cards, everything works fine; with more than 2, it crashes with the error.

Rig:
8x Palit GTX 1080 Ti JetStream 11GB
1x Corsair 1000W 80+ Gold RM1000i
3x Corsair 850W 80+ Gold RM850
Asus Prime Z270-A + G4400 3.3GHz 3MB
8GB DDR4 2400MHz
Kingston SSD 120GB
I set the page file to 16000 and to 24000; it did not help.

Experienced

pilat200

One of us

Comments

mkabatek commented Dec 28, 2017

Nicehash v2.0.1.5 Beta, Windows 10

Running on GeForce GTX 1070, CUDA 9.1.85, Nvidia 388.71

wrkr0-6 | CUDA error ‘out of memory’ in func ‘cuda_neoscrypt::init’ line 1258

wrkr1-7 | CUDA error DRIVER: ‘2’ in func ‘cudahelp::device_thread_init’ line 168
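When many workers report errors, it can help to extract the worker, error, function, and line number from each log line and count them per worker. The following is a rough sketch whose pattern is inferred only from the two lines quoted above, so the real log format is an assumption. The name `parse_cuda_error` is hypothetical.

```python
import re

# Pattern inferred from the quoted log lines; accepts straight or
# typographic quotes around the function name.
CUDA_ERROR_RE = re.compile(
    r"^(?P<worker>\S+)\s*\|\s*CUDA error\s+(?P<error>.+?)\s+in func\s+"
    r"[\u2018'](?P<func>[^\u2019']+)[\u2019']\s+line\s+(?P<line>\d+)\s*$"
)


def parse_cuda_error(log_line):
    """Return a dict describing the CUDA error, or None if the line does not match."""
    match = CUDA_ERROR_RE.match(log_line.strip())
    if match is None:
        return None
    # Drop quote characters so "'out of memory'" becomes "out of memory".
    error = re.sub("[\u2018\u2019']", "", match.group("error")).strip()
    return {
        "worker": match.group("worker"),
        "error": error,
        "func": match.group("func"),
        "line": int(match.group("line")),
    }
```

Feeding each log line through `parse_cuda_error` and grouping the results by `worker` shows whether the failures are concentrated on particular workers.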

Tottom commented Dec 30, 2017

Hi,
Same issue for me on 6x GeForce GTX 1070 and 2x 1070 Ti, CUDA 9.1.85, Nvidia 388.71:
CUDA error ‘out of memory’ in func ‘cuda_neoscrypt::init’ line 1258

mkabatek commented Dec 30, 2017

I think it has to do with the amount of RAM in the machine. I only have 4GB, and I read somewhere else that someone added 8GB of RAM to their system and it fixed the issue.

Should be fixable in software though.

Tottom commented Dec 30, 2017

I already have 8GB of memory in the system, and 32GB of swap. Maybe I should use less swap?

mkabatek commented Dec 30, 2017

They ADDED 8GB of memory, meaning they have 16GB now. Here is the original thread.

Tottom commented Dec 30, 2017

I see. Another poster says he can run 3x GTX 1070 on 8GB of memory; any more than that and he gets out of memory. So if I run 12x GTX 1070, I need 32GB of RAM?

I will only get 2x 8GB DIMMs in the new year 🙁 and will then confirm the results on my current 8-card rig.
Thanks!

Tottom commented Dec 30, 2017

I have changed my vMem to 96GB. Neoscrypt ran for a while without failing. Still monitoring.

Tottom commented Dec 31, 2017

So I tried 16GB of vMem and got the errors more frequently. With 64GB of vMem, the errors came about 20 hours later.

I have now upped my vMem to 98GB and have not had any further errors in the last 24 hours. I have 6x 1070 and 2x 1070 Ti cards. OC 150MHz on CPU and 500MHz on memory, and 65% TDP.

I have only 8GB of physical memory.

remotetech commented Jan 10, 2018

Nicehash v2.0.1.5 Beta, Windows 10
Algorithm neoscrypt
cuda error out of memory in func cuda_neoscrypt::init line 1258
cuda error driver: 2 in func cudahelp::device_thread_init line 168

Same issue here. I just upgraded to NiceHash v2.0.1.6 and will monitor to see if it is fixed; I am running the precise benchmark again now. I’m running 1070 Ti’s with 4GB of RAM.
I’m sure my virtual memory is set to auto. Are you guys having any luck increasing it?

Tottom commented Jan 10, 2018

Are you able to change your virtual memory at all? How many 1070 Ti’s 4GB are you running?

remotetech commented Jan 10, 2018

Hi Tottom, I’m running:

2 ZOTAC 1070 Ti 8GB
1 Gigabyte 1060 6GB

for a total of 3 GPU cards. My motherboard has a 4GB RAM stick and I have a 60GB SSD with Windows 10.

Yes, I can set the virtual memory manually, but there is not much space left to play with.

For now I have just disabled the Neoscrypt algorithm on all 3 GPUs, as I still have 13 other active algorithms that are working just fine.

Please let me know exactly what has worked to fix this Neoscrypt error for you guys, so I can make the adjustments and turn the Neoscrypt algorithm back on at some point.

remotetech commented Jan 11, 2018

Tottom, so your change of vMem to 98GB is still working with neoscrypt with no errors?

Please advise, as I will need to upgrade my SSD 🙂

Tottom commented Jan 11, 2018

remotetech, I was running 6x 1070 and 2x 1070 Ti at the time when the vMem fix helped stabilize neoscrypt. Apparently, 10 to 15GB of vMem per card is recommended for a 1070 8GB card. Other guys did not seem to experience this issue, and when they added more physical memory on the motherboard (16GB instead of 8GB), their issue was also resolved. If you have 40GB of vMem, it might work out for you, if you have enough space.

Another thought, before spending too much on another drive: start with 16GB of vMem and see how long it takes to give the memory error or restart excavator. Then add another 8GB, and repeat until you have a stable miner.

I have a 180GB drive and can only set 128GB at this time, and it also seems to be just not enough for 8x 1070 and 2x 1070 Ti: excavator runs for a few minutes, then restarts. I added a second disk and put the vMem on it; this did not fix the issue. So I think excavator does not like vMem on another disk or shared between disks? I am no expert on the application, but anything is possible.
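The per-card rule of thumb and the step-by-step tuning suggested in this comment can be written down as simple arithmetic. The following is a sketch only: the 10 to 15GB per card figure is the forum recommendation quoted above, not a verified requirement, and the function names are hypothetical.

```python
def vmem_range_gb(num_cards, low_per_card_gb=10, high_per_card_gb=15):
    """Return the (low, high) virtual memory estimate in GB for a rig.

    Defaults follow the 10 to 15GB per card rule of thumb from the thread.
    """
    if num_cards < 0:
        raise ValueError("num_cards must be non-negative")
    return num_cards * low_per_card_gb, num_cards * high_per_card_gb


def vmem_schedule_gb(start_gb=16, step_gb=8, limit_gb=128):
    """Page-file sizes to try in order: start, start + step, ... up to limit."""
    return list(range(start_gb, limit_gb + 1, step_gb))
```

For an 8-card rig, `vmem_range_gb(8)` gives 80 to 120GB, which brackets the 98GB setting reported as stable earlier in the thread. For 10 cards it gives 100 to 150GB, so the 128GB that could be set on the 180GB drive falls below the upper estimate.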

Tottom commented Jan 11, 2018

Just saw this comment from a fellow miner about Neoscrypt. Maybe it is a harsher fix to attempt?
From: Alex Thomas

«I had the same problem when I was mining neoscrypt and solved it like this:
format and reinstall all the NVIDIA drivers, the Gigabyte OC software, and the latest NiceHash version.
In the Gigabyte OC software, set all GPUs to -20% power.
Open the NiceHash software and run a normal benchmark.
Now it is all working perfectly; I can mine neoscrypt without CUDA errors.

Solved:
CUDA error ‘out of memory’ in func ‘cuda_neoscrypt::init’ line 1258»

raduvultur commented Jan 12, 2018

4x GTX 1070 (3 ASUS, 1 MSI) on Windows 10 with 4GB of RAM. After I got the error above, I increased the virtual memory to 64000 MB, and it has now been running with no errors for two hours.

remotetech commented Jan 16, 2018

I installed a new drive, went with 125GB of vMem, and turned neoscrypt back on. It has now been mining neoscrypt with no errors for about the last 10 minutes. Never got this far before. Thanks for all your help!

HoverDrive commented Feb 16, 2018

I’ve noticed a pattern here: everyone who has posted seems to have a mixture of GPUs in their mining rig. And so do I: one GTX 1080, one GTX 1070 Ti, three GTX 1070s, and one GTX 1060.

I found my issue to be optimization confusion. I’m using the NiceHash miner, and when I ran the original benchmark against my GPUs, the optimizer assumed that my three GTX 1070s were created equal. But I have two that are overclocked higher from the factory, and one that is lower-end and therefore has a slower GPU clock.

The optimizer picked one of the higher-end 1070s to run its optimization against, and simply assumed that the other two were identical cards. I eventually figured out that the error was occurring when the slower GTX 1070 was hit as hard as the higher-end GTX 1070s. This usually happened within seconds of starting the miner; sometimes it took minutes.

A possible solution would have been to overclock all three GTX 1070s to the same speed and re-run the optimizer. But what I ended up doing was replacing the slower 1070 with a GTX 1050 Ti from a different miner. The optimizer re-ran and now all seems to be working fine.

samywee commented Feb 17, 2018

I have a very similar problem. I somewhat agree with HoverDrive.

I too have a mix of GPUs (1070, 1060, 1050 Ti, 1050, and some are factory OC’d).

I run v2.0.1.10 (latest as of today). The error I get is «wrkrx-x | CUDA error ‘out of memory’ in func ‘cuda_neoscrypt::init’ line 1405».

I have a 10-GPU rig with 8GB of memory. A quick look at Task Manager shows memory is not a problem.

Is this a bug? Should I file a bug report? When I turn off neoscrypt or remove some of my cards, this rig seems to work. Is there a GPU limit to what NiceHash can handle? Windows 10, which I use, seems to recognize all the cards and uses the latest drivers from NVIDIA.

Source: computermaker.info