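This guide uses CentOS 7.7 with a Tesla P100 GPU as the example.
1. Preparing the base environment
Install pciutils, which provides the lspci command: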
yum install -y pciutils
Check that the machine has an Nvidia GPU:
lspci | grep -i nvidia
00:09.0 3D controller: NVIDIA Corporation GP100GL [Tesla P100 PCIe 12GB] (rev a1)
List of CUDA-capable GPUs: https://developer.nvidia.com/cuda-gpus
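When looking the card up in the supported-GPU list, it can help to pull the marketing name out of the lspci line instead of reading it by eye. The following is a minimal sketch; the function name gpu_model is hypothetical, and the pattern assumes the "NVIDIA Corporation ... [model]" layout seen in the output above.

```shell
# Prints the bracketed model name from lspci-style text on stdin.
# Lines that do not mention "NVIDIA Corporation" produce no output.
gpu_model() {
    sed -n 's/.*NVIDIA Corporation [^[]*\[\(.*\)\].*/\1/p'
}
```

Running `lspci | gpu_model` on the example machine should print Tesla P100 PCIe 12GB, which is the name to search for on the list.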
Check the architecture and the OS release:
uname -m && cat /etc/redhat-release
x86_64
CentOS Linux release 7.7.1908 (Core)
List of operating systems supported by CUDA: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#system-requirements
Update the system and install the build dependencies, including the kernel headers for the running kernel:
yum update -y
yum install -y wget vim gcc
yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r)
Docker 19.03 or later is required; see the reference link. For installation steps, see: Installing a Specific Version of Docker on CentOS 7.
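The version requirement above can be checked in a script rather than by eye. This is a sketch only: version_at_least is a hypothetical helper, and it assumes GNU sort with the -V (version sort) option, which CentOS 7 ships.

```shell
# Succeeds if version $2 is greater than or equal to version $1.
version_at_least() {
    # sort -V orders version strings; if the minimum sorts first
    # (or both are equal), the actual version is new enough.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}
```

A possible use is `version_at_least 19.03 "$(docker version --format '{{.Server.Version}}')"`, which fails when the running daemon is older than 19.03.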
2. Installing the GPU driver & CUDA
2.1 Disabling the default nouveau driver
Before disabling:
lsmod | grep nouveau
nouveau 1898794 0
mxm_wmi 13021 1 nouveau
wmi 21636 2 mxm_wmi,nouveau
video 24538 1 nouveau
i2c_algo_bit 13413 1 nouveau
ttm 96673 2 bochs_drm,nouveau
drm_kms_helper 186531 2 bochs_drm,nouveau
drm 456166 5 ttm,bochs_drm,drm_kms_helper,nouveau
Disable nouveau:
bash -c "echo blacklist nouveau > /etc/modprobe.d/blacklist-nvidia-nouveau.conf"
bash -c "echo options nouveau modeset=0 >> /etc/modprobe.d/blacklist-nvidia-nouveau.conf"
Rebuild the initramfs image:
mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
dracut /boot/initramfs-$(uname -r).img $(uname -r)
Reboot the system. After disabling:
lsmod | grep nouveau
(no output)
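The before/after check can also be scripted, for example to stop a provisioning run when nouveau is still active. The sketch below makes two assumptions: the helper names are hypothetical, and the module list is read from /proc/modules (the same data lsmod prints) so that a different file can be passed in for testing.

```shell
# Succeeds if nouveau appears in a /proc/modules-style file.
nouveau_loaded() {
    modules_file=${1:-/proc/modules}
    grep -q '^nouveau ' "$modules_file"
}

# Succeeds if the blacklist file written earlier contains the blacklist entry.
nouveau_blacklisted() {
    conf_file=${1:-/etc/modprobe.d/blacklist-nvidia-nouveau.conf}
    grep -q '^blacklist nouveau' "$conf_file"
}
```

If nouveau_blacklisted succeeds but nouveau_loaded also succeeds, the most likely causes are that the initramfs was not rebuilt or the machine has not been rebooted yet.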
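2.2 Installing the GPU driver
There are two ways to install the driver: from the ELRepo repository, or from a package downloaded from the Nvidia website.
Add the ELRepo repository: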
rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-2.el7.elrepo.noarch.rpm
Install nvidia-detect:
yum install -y nvidia-detect
Run nvidia-detect to check whether a matching kmod-nvidia version is available, then install the kmod-nvidia driver:
yum install -y kmod-nvidia
Alternatively, on the driver download page of the Nvidia website, look up the GPU model reported by lspci | grep -i nvidia and download the matching package:
wget http://cn.download.nvidia.com/tesla/440.64.00/nvidia-driver-local-repo-rhel7-440.64.00-1.0-1.x86_64.rpm
rpm -Uvh nvidia-driver-local-repo-rhel7-440.64.00-1.0-1.x86_64.rpm
You can also download and run the shell installer:
wget http://us.download.nvidia.com/tesla/440.64.00/NVIDIA-Linux-x86_64-440.64.00.run
chmod +x NVIDIA-Linux-x86_64-440.64.00.run
bash ./NVIDIA-Linux-x86_64-440.64.00.run
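The installer URL contains the driver version twice, once in the directory and once in the file name, and the two must agree. Deriving both from a single value avoids a mismatch. This is a sketch: driver_run_url is a hypothetical helper, and the URL layout is inferred from the download command above, so it may not hold for every release.

```shell
# Prints the .run installer URL for the given driver version.
# The URL layout is an assumption based on the wget command in this guide.
driver_run_url() {
    printf 'http://us.download.nvidia.com/tesla/%s/NVIDIA-Linux-x86_64-%s.run\n' "$1" "$1"
}
```

It could be used as `wget "$(driver_run_url 440.64.00)"`.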
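2.3 Installing CUDA
On the Nvidia developer cuda-toolkit-archive page, find the latest toolkit version. Follow the prompts on the page to select your operating system. These are the installation commands the page gives for CentOS 7.7: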
wget http://developer.download.nvidia.com/compute/cuda/10.2/Prod/local_installers/cuda-repo-rhel7-10-2-local-10.2.89-440.33.01-1.0-1.x86_64.rpm
sudo rpm -i cuda-repo-rhel7-10-2-local-10.2.89-440.33.01-1.0-1.x86_64.rpm
sudo yum clean all
sudo yum -y install nvidia-driver-latest-dkms cuda
sudo yum -y install cuda-drivers
2.4 Verifying the installation
After rebooting the machine, check that the Nvidia driver and CUDA were installed successfully:
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82 Driver Version: 440.82 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... Off | 00000000:00:09.0 Off | 0 |
| N/A 35C P0 27W / 250W | 0MiB / 12198MiB | 6% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
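For automated checks, the driver and CUDA versions can be extracted from the banner line of the output above. The sketch below uses a hypothetical helper name, smi_versions, and assumes the banner keeps the "Driver Version: ... CUDA Version: ..." wording shown here.

```shell
# Prints "<driver version> <CUDA version>" parsed from nvidia-smi text on stdin.
# Lines without both fields produce no output.
smi_versions() {
    sed -n 's/.*Driver Version: *\([0-9.]*\).*CUDA Version: *\([0-9.]*\).*/\1 \2/p'
}
```

For the output above, `nvidia-smi | smi_versions` should print 440.82 10.2. Empty output suggests the driver did not load.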
3. Installing nvidia-docker
nvidia-docker provides GPU acceleration support inside Docker. Add the repository and install the packages:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
yum install -y nvidia-container-runtime nvidia-container-toolkit nvidia-docker2
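The distribution variable in the first command decides which repository file is downloaded, so it is worth seeing what it evaluates to. The sketch below wraps the same expression in a hypothetical function, distribution_id, that accepts an alternative os-release path for testing; on CentOS 7 it should yield centos7.

```shell
# Prints ID followed by VERSION_ID from an os-release style file.
distribution_id() {
    os_release=${1:-/etc/os-release}
    # Source the file in a subshell so its variables do not leak into the caller.
    ( . "$os_release" && printf '%s%s\n' "$ID" "$VERSION_ID" )
}
```

If the printed value does not correspond to a directory published under https://nvidia.github.io/nvidia-docker/, the curl command will not fetch a usable repo file.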
Edit /etc/docker/daemon.json and add the following:
{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
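On a fresh host the runtime entry can be written from a script instead of an editor. This is a sketch under assumptions: write_nvidia_runtime_config is a hypothetical helper, and it deliberately refuses to touch an existing file, because overwriting would drop any other daemon settings already present.

```shell
# Writes the nvidia runtime entry to the given path.
# Returns 1 without changing anything if the file already exists.
write_nvidia_runtime_config() {
    target=${1:-/etc/docker/daemon.json}
    if [ -e "$target" ]; then
        echo "$target already exists; merge the runtimes entry by hand" >&2
        return 1
    fi
    cat > "$target" <<'EOF'
{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
}
```

When the file already exists, add the "runtimes" key to the existing JSON object manually.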
Restart Docker so that the new runtime takes effect:
systemctl restart docker
Verify that a container can see the GPU, using the --gpus flag:
docker run --gpus all nvidia/cuda:10.0-base nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82 Driver Version: 440.82 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... Off | 00000000:00:09.0 Off | 0 |
| N/A 36C P0 26W / 250W | 0MiB / 12198MiB | 6% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Using the nvidia runtime:
docker run --runtime=nvidia nvidia/cuda:10.0-base nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82 Driver Version: 440.82 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... Off | 00000000:00:09.0 Off | 0 |
| N/A 36C P0 26W / 250W | 0MiB / 12198MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Using the nvidia-docker wrapper:
nvidia-docker run nvidia/cuda:10.0-base nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82 Driver Version: 440.82 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... Off | 00000000:00:09.0 Off | 0 |
| N/A 35C P0 26W / 250W | 0MiB / 12198MiB | 6% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
4. References
- https://github.com/NVIDIA/nvidia-docker