How to Install GPU Drivers on CentOS

January 4, 2023

This guide uses CentOS 7.7 with a Tesla P100 GPU as the example.

1. Preparing the Base Environment

  • Install the lspci command
yum install -y pciutils
  • Check whether the GPU supports CUDA
lspci | grep -i nvidia

00:09.0 3D controller: NVIDIA Corporation GP100GL [Tesla P100 PCIe 12GB] (rev a1)

List of CUDA-capable GPUs: https://developer.nvidia.com/cuda-gpus

  • Check whether the operating system supports CUDA
uname -m && cat /etc/redhat-release

x86_64
CentOS Linux release 7.7.1908 (Core)

List of operating systems supported by CUDA: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#system-requirements

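  • Install system packages
yum update -y
# If the update installed a new kernel, reboot first so that $(uname -r) reports the kernel the driver will run on.
yum install -y wget vim gcc
yum install -y kernel-devel-$(uname -r) kernel-headers-$(uname -r)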
  • Install Docker

Docker 19.03 or later is required (see the reference link). For installation steps, see "Installing a Specific Version of Docker on CentOS 7".
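
The GPU and kernel-devel checks in this section can be wrapped in two small helpers. This is a minimal sketch, not part of the original procedure: the function names has_nvidia_gpu and has_matching_kernel_devel are made up for illustration, and both read text on stdin so they can be tried against saved output before running them on a real machine.

```shell
#!/bin/sh
# Hypothetical preflight helpers; the names are illustrative only.

# Succeeds if lspci-style output on stdin mentions an NVIDIA device.
has_nvidia_gpu() {
    grep -qi 'nvidia'
}

# Succeeds if the package list on stdin (one package per line, as printed
# by "rpm -qa") contains kernel-devel for the given kernel release.
# $1: kernel release string, normally "$(uname -r)".
has_matching_kernel_devel() {
    grep -qxF "kernel-devel-$1"
}
```

Typical use would be `lspci | has_nvidia_gpu` and `rpm -qa | has_matching_kernel_devel "$(uname -r)"`. The second check matters because the driver's kernel module is built against the headers of one specific kernel.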

2. Installing the GPU Driver and CUDA

2.1 Disabling the Default nouveau Driver

Before blacklisting:

lsmod | grep nouveau

nouveau              1898794  0
mxm_wmi                13021  1 nouveau
wmi                    21636  2 mxm_wmi,nouveau
video                  24538  1 nouveau
i2c_algo_bit           13413  1 nouveau
ttm                    96673  2 bochs_drm,nouveau
drm_kms_helper        186531  2 bochs_drm,nouveau
drm                   456166  5 ttm,bochs_drm,drm_kms_helper,nouveau

Disable nouveau:

bash -c "echo blacklist nouveau > /etc/modprobe.d/blacklist-nvidia-nouveau.conf"
bash -c "echo options nouveau modeset=0 >> /etc/modprobe.d/blacklist-nvidia-nouveau.conf"
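
The two echo commands can also be folded into one write. The sketch below is only an illustration: write_nouveau_blacklist is a made-up name, and the target path is a parameter so the function can be tried on a scratch file before it touches /etc/modprobe.d.

```shell
#!/bin/sh
# Hypothetical helper: writes both nouveau directives to the file in $1.
# The file is overwritten, so repeated runs always leave exactly two lines.
write_nouveau_blacklist() {
    printf '%s\n' 'blacklist nouveau' 'options nouveau modeset=0' > "$1"
}
```

On the real system it would be called as `write_nouveau_blacklist /etc/modprobe.d/blacklist-nvidia-nouveau.conf`, run as root.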

Rebuild the initramfs image:

mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
dracut /boot/initramfs-$(uname -r).img $(uname -r)

Reboot the system. After blacklisting:

lsmod | grep nouveau

(no output)
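
For scripted checks, a plain grep is slightly loose because it also matches modules that merely list nouveau in their "used by" column. A minimal sketch of a stricter test, using a made-up function name nouveau_loaded, compares only the first column of lsmod output:

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if lsmod-style output on stdin has a
# row whose module name (first column) is exactly "nouveau".
nouveau_loaded() {
    awk '$1 == "nouveau" { found = 1 } END { exit !found }'
}
```

After the reboot, `lsmod | nouveau_loaded` is expected to fail (non-zero exit status) if the blacklist took effect.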

2.2 Installing the GPU Driver

There are two ways to install the driver:

  • Option 1: install the kmod-nvidia driver

Add the ELRepo repository:

rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-2.el7.elrepo.noarch.rpm

Install nvidia-detect:

yum install -y nvidia-detect

Check whether a matching kmod-nvidia version is available:

nvidia-detect -v

Install the kmod-nvidia driver:

yum install -y kmod-nvidia
  • Option 2: download the driver from the NVIDIA website

On the driver download page of the NVIDIA website, look up the GPU model reported by lspci | grep -i nvidia.

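wget http://cn.download.nvidia.com/tesla/440.64.00/nvidia-driver-local-repo-rhel7-440.64.00-1.0-1.x86_64.rpm
rpm -Uvh nvidia-driver-local-repo-rhel7-440.64.00-1.0-1.x86_64.rpm
# The RPM above only registers a local repository; the driver packages are then installed from it.
yum install -y cuda-drivers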

Alternatively, download and run the shell installer:

wget http://us.download.nvidia.com/tesla/440.64.00/NVIDIA-Linux-x86_64-440.64.00.run
chmod +x NVIDIA-Linux-x86_64-440.64.00.run
bash ./NVIDIA-Linux-x86_64-440.64.00.run
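
In the runfile download URL the driver version appears twice, once as the directory and once in the file name, and the two have to agree. A small sketch that derives both from a single version string avoids a mismatch. The function names are made up, and the URL layout is inferred from the links in this article, so treat it as an assumption that NVIDIA may change.

```shell
#!/bin/sh
# Hypothetical helpers; URL layout assumed from the download links above.

# Prints the runfile name for a driver version such as 440.64.00.
nvidia_runfile_name() {
    printf 'NVIDIA-Linux-x86_64-%s.run\n' "$1"
}

# Prints the download URL, using the same version for directory and file.
nvidia_runfile_url() {
    printf 'http://us.download.nvidia.com/tesla/%s/%s\n' "$1" "$(nvidia_runfile_name "$1")"
}
```

With these, the download step becomes `wget "$(nvidia_runfile_url 440.64.00)"`.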

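2.3 Installing CUDA

On the NVIDIA Developer cuda-toolkit-archive page, find the latest toolkit release and select your operating system as prompted. For CentOS 7.7 the page gives the following commands: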

wget http://developer.download.nvidia.com/compute/cuda/10.2/Prod/local_installers/cuda-repo-rhel7-10-2-local-10.2.89-440.33.01-1.0-1.x86_64.rpm
sudo rpm -i cuda-repo-rhel7-10-2-local-10.2.89-440.33.01-1.0-1.x86_64.rpm
sudo yum clean all
sudo yum -y install nvidia-driver-latest-dkms cuda
sudo yum -y install cuda-drivers

2.4 Verifying the Installation

After rebooting the machine, check whether NVIDIA CUDA was installed successfully.

nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 440.82       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:09.0 Off |                    0 |
| N/A   35C    P0    27W / 250W |      0MiB / 12198MiB |      6%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
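
When this check runs in a script, the driver and CUDA versions can be pulled out of the header row of the table. The sketch below assumes the header format shown above; the function names smi_driver_version and smi_cuda_version are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers that parse the nvidia-smi banner from stdin.

# Prints the value after "Driver Version:", e.g. 440.82.
smi_driver_version() {
    sed -n 's/.*Driver Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Prints the value after "CUDA Version:", e.g. 10.2.
smi_cuda_version() {
    sed -n 's/.*CUDA Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}
```

Usage would be `nvidia-smi | smi_driver_version`. For the driver version alone, `nvidia-smi --query-gpu=driver_version --format=csv,noheader` gives the value directly without parsing the table.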

3. Installing nvidia-docker

nvidia-docker adds support for GPU acceleration inside Docker containers.

  • Install nvidia-docker
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
yum install -y nvidia-container-runtime nvidia-container-toolkit nvidia-docker2
  • Add the new runtime

Edit /etc/docker/daemon.json and add the following:

{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
  • Restart Docker for the change to take effect
systemctl restart docker
  • Verify the installation, here with the --gpus flag:
docker run --gpus all nvidia/cuda:10.0-base nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 440.82       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:09.0 Off |                    0 |
| N/A   36C    P0    26W / 250W |      0MiB / 12198MiB |      6%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

With the nvidia runtime selected explicitly:

docker run --runtime=nvidia nvidia/cuda:10.0-base nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 440.82       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:09.0 Off |                    0 |
| N/A   36C    P0    26W / 250W |      0MiB / 12198MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

With the nvidia-docker wrapper:

nvidia-docker run nvidia/cuda:10.0-base nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 440.82       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:09.0 Off |                    0 |
| N/A   35C    P0    26W / 250W |      0MiB / 12198MiB |      6%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
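
Two details of the steps above are easy to check in a script: what the $distribution variable expands to, and whether the installed Docker meets the 19.03 minimum, which is the release that introduced the --gpus flag. The following is a minimal sketch with made-up function names; the os-release path is a parameter so the first function can be tried against a copy of the file.

```shell
#!/bin/sh
# Hypothetical helpers; names are illustrative only.

# Prints ID and VERSION_ID from an os-release file, concatenated
# (for example "centos7"). $1: file path, default /etc/os-release.
# The body runs in a subshell so the sourced variables do not leak.
distribution_id() (
    . "${1:-/etc/os-release}"
    printf '%s%s\n' "$ID" "$VERSION_ID"
)

# Succeeds if the Docker version in $1 (for example "19.03.5") is 19.03
# or later.
docker_supports_gpus_flag() (
    major=${1%%.*}
    rest=${1#*.}
    minor=${rest%%.*}
    minor=${minor#0}   # "03" -> "3", "0" -> ""
    [ "$major" -gt 19 ] || { [ "$major" -eq 19 ] && [ "${minor:-0}" -ge 3 ]; }
)
```

The version string can be obtained with `docker version --format '{{.Server.Version}}'`. On older Docker releases, the --runtime=nvidia form or the nvidia-docker wrapper shown above are the alternatives.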

4. References

  • https://github.com/NVIDIA/nvidia-docker
