Problem 1: Deep learning code on a Linux server cannot run on the GPU.

Problem 2: Running nvidia-smi fails with: Failed to initialize NVML: Driver/library version mismatch

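Cause: the system automatically updated the NVIDIA driver packages, so the newly installed user-space libraries no longer match the kernel module that is still loaded.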

Analysis:

1. List the installed NVIDIA packages and confirm their versions

sudo dpkg --list | grep nvidia

2. Check the version of the loaded NVIDIA kernel module

cat /proc/driver/nvidia/version

3. Check the package install and upgrade history

grep nvidia /var/log/dpkg.log
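The raw grep output can be long. As a rough sketch, the log can be narrowed to upgrade events only, assuming the usual dpkg.log layout of date, time, action, package, old version, new version; `nvidia_upgrades` is a hypothetical helper name, not part of dpkg:

```shell
# Hypothetical helper: read dpkg.log lines on stdin and print
# "date time package old-version -> new-version" for NVIDIA upgrades only.
nvidia_upgrades() {
    awk '$3 == "upgrade" && $4 ~ /nvidia/ { print $1, $2, $4, $5, "->", $6 }'
}

# Only run against the real log when it is present and readable.
if [ -r /var/log/dpkg.log ]; then
    nvidia_upgrades < /var/log/dpkg.log
fi
```

An upgrade line dated shortly before nvidia-smi started failing is a strong hint that an automatic update replaced the driver.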

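Cross-checking these shows that the driver packages were updated automatically, which left the GPU driver unusable and made nvidia-smi fail.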
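The comparison can be scripted. This is a minimal sketch, assuming the kernel module is named `nvidia` and that `modinfo` reports the version of the module file the packages installed on disk; `nvrm_version` is a hypothetical helper name. Comparing the loaded module with the on-disk module is only a proxy for the driver/library mismatch, not an exact reproduction of the NVML check:

```shell
# Hypothetical helper: print the driver-style version (e.g. 525.125.06)
# found on the "NVRM version" line of the text read from stdin.
nvrm_version() {
    grep 'NVRM version' | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1
}

if [ -r /proc/driver/nvidia/version ]; then
    # Version of the module currently loaded in the running kernel.
    loaded=$(nvrm_version < /proc/driver/nvidia/version || true)
    # Version of the module file on disk, i.e. what the packages installed.
    ondisk=$(modinfo -F version nvidia 2>/dev/null || true)
    echo "loaded kernel module: ${loaded:-unknown}"
    echo "module on disk:       ${ondisk:-unknown}"
    if [ -n "$loaded" ] && [ -n "$ondisk" ] && [ "$loaded" != "$ondisk" ]; then
        echo "versions differ: the running module predates or postdates the installed driver"
    fi
fi
```

If the two versions differ, the driver was replaced on disk while the old module stayed loaded, which matches the upgrade history found in dpkg.log.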

Solution: disable automatic updates

sudo vim /etc/apt/apt.conf.d/50unattended-upgrades

Comment out the following two lines:
//"${distro_id}:${distro_codename}";
//"${distro_id}:${distro_codename}-security";
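The same edit can be made without opening an editor. This is a minimal sketch, assuming the two origin entries appear uncommented in the file with exactly this spelling; `comment_origins` is a hypothetical helper name, and the sketch writes a candidate copy under /tmp for review instead of changing the real file:

```shell
# Hypothetical helper: read the config on stdin and prefix the two
# Allowed-Origins entries with "//". Lines that are already commented,
# and other origins such as "-updates", are left untouched.
comment_origins() {
    sed -E 's|^([[:space:]]*)("\$\{distro_id\}:\$\{distro_codename\}(-security)?";)|\1//\2|'
}

conf=/etc/apt/apt.conf.d/50unattended-upgrades
if [ -r "$conf" ]; then
    # Write a candidate copy and show what would change.
    comment_origins < "$conf" > /tmp/50unattended-upgrades.new
    diff "$conf" /tmp/50unattended-upgrades.new || true
fi
```

After checking the diff, the candidate can be copied over the original with sudo cp, ideally after keeping a backup of the original file.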

Reboot the system so that the kernel module matching the installed driver is loaded

sudo reboot

Run it again:

nvidia-smi

Normal output:

(base) dino@jarvis:~$ nvidia-smi 
Wed Sep  6 01:43:33 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   57C    P0   107W / 350W |      0MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
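The re-check can also be scripted, for example to run it after every reboot. This is a rough sketch, assuming the error text is exactly the message quoted in Problem 2; `classify_smi` is a hypothetical helper name:

```shell
# Hypothetical helper: classify an nvidia-smi run.
# $1 is the exit status of nvidia-smi; stdin is its combined output.
classify_smi() {
    if grep -q 'Driver/library version mismatch'; then
        echo "mismatch"
    elif [ "$1" -eq 0 ]; then
        echo "ok"
    else
        echo "error"
    fi
}

# Only run the check on machines where nvidia-smi is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    out=$(nvidia-smi 2>&1) && status=0 || status=$?
    printf '%s\n' "$out" | classify_smi "$status"
fi
```

A result of mismatch means the loaded module and the installed libraries still disagree; error covers any other nvidia-smi failure.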

Ref: https://zhuanlan.zhihu.com/p/453955370