This post describes how to deploy Stable Diffusion cost-effectively on AWS Inferentia chips. Compared with Nvidia GPU instances of equivalent performance, AWS Inferentia can cut inference costs by up to 70%. If you would rather test a GPU instance, see "Quickly deploy the Stable Diffusion WebUI on AWS" [AWS EC2 GPU instances].
Environment preparation
Name | Version |
---|---|
Ubuntu Server | Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04) 20230817 |
Python3 | 3.10.x |
Stable Diffusion | pretrained_sd2_512_inference |
AI image generation has traditionally been seen as an Nvidia GPU-only workload, but AWS Inferentia handles it as well, and for experienced AI developers it can deliver better results at a lower cost. The EC2 instance I chose has a 128 GB SSD volume and a public IP, which makes remote SSH access convenient and lets the instance serve Stable Diffusion directly.
Installation
You can refer to AWS Neuron, PyTorch Neuron (torch-neuronx) Setup, and pretrained_sd2_512_inference.
Create an inf2 instance
Launch an Inferentia instance with a public IP. I chose the inf2.8xlarge instance type together with the Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04) 20230817 image, which comes with the Neuron drivers preinstalled. Select inf2.8xlarge as the instance type, enable public access, and set the disk size to 128 GB, as sketched below.
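If you prefer to script the launch instead of using the console, the boto3 sketch below shows the idea. The AMI ID, key pair name, and security group ID are placeholders you must replace with your own values (the Neuron DLAMI ID differs per region), and it assumes a default-VPC subnet that auto-assigns a public IP.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Placeholders: look up the "Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04)"
# AMI ID for your region and substitute your own key pair / security group.
response = ec2.run_instances(
    ImageId="ami-xxxxxxxxxxxxxxxxx",
    InstanceType="inf2.8xlarge",
    KeyName="my-key-pair",
    SecurityGroupIds=["sg-xxxxxxxxxxxxxxxxx"],
    MinCount=1,
    MaxCount=1,
    # 128 GB root volume, as used in this walkthrough
    BlockDeviceMappings=[{
        "DeviceName": "/dev/sda1",
        "Ebs": {"VolumeSize": 128, "VolumeType": "gp3"},
    }],
)
print(response["Instances"][0]["InstanceId"])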
Install Stable Diffusion
Update neuronx to the latest version
# Update OS packages
sudo apt-get update -y
# Update OS headers
sudo apt-get install linux-headers-$(uname -r) -y
# Install git
sudo apt-get install git -y
# Update Neuron Driver
sudo apt-get install aws-neuronx-dkms=2.* -y
# Update Neuron Runtime
sudo apt-get install aws-neuronx-collectives=2.* -y
sudo apt-get install aws-neuronx-runtime-lib=2.* -y
# Update Neuron Tools
sudo apt-get install aws-neuronx-tools=2.* -y
# Add PATH
export PATH=/opt/aws/neuron/bin:$PATH
Install the dependencies and Jupyter Notebook
# Activate Python venv
source /opt/aws_neuron_venv_pytorch/bin/activate
# Install Jupyter notebook kernel
pip install ipykernel
python -m ipykernel install --user --name aws_neuron_venv_pytorch --display-name "Python (torch-neuronx)"
pip install jupyter notebook
pip install environment_kernels
# Set pip repository pointing to the Neuron repository
python -m pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com
# Update Neuron Compiler and Framework
python -m pip install --upgrade neuronx-cc==2.* torch-neuronx torchvision
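Before moving on to Stable Diffusion, you can optionally verify that the Neuron compiler and torch-neuronx work by tracing a tiny model. This is just a smoke test I add here, not part of the official setup steps; run it with python inside the activated venv.
# Optional smoke test: compile a tiny model to confirm Neuron is functional
import torch
import torch_neuronx

model = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.ReLU()).eval()
example = torch.rand(1, 4)

neuron_model = torch_neuronx.trace(model, example)  # compiles for a NeuronCore
print(neuron_model(example))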
Enable remote access to Jupyter Notebook
# Generate the configuration
jupyter-lab --generate-config
# Set a password
jupyter server password
# Start jupyter-lab and allow remote access
jupyter-lab --ip 0.0.0.0 --port 8888 --no-browser
Note: configure your AWS security group to open port 8888 and allow access only from your own IP.
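If you would rather add that rule from code than from the console, here is a boto3 sketch; the security group ID and the source IP are placeholders for your own values.
import boto3

ec2 = boto3.client("ec2")
# Placeholders: your security group ID and your public IP in /32 form
ec2.authorize_security_group_ingress(
    GroupId="sg-xxxxxxxxxxxxxxxxx",
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 8888,
        "ToPort": 8888,
        "IpRanges": [{"CidrIp": "203.0.113.10/32", "Description": "Jupyter access from my IP only"}],
    }],
)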
Open Jupyter Notebook in your browser
Open a new Python 3 notebook
Install the dependencies
!pip install diffusers==0.14.0 transformers==4.30.2 accelerate==0.16.0 safetensors==0.3.1 matplotlib
Import the dependencies
import os
os.environ["NEURON_FUSE_SOFTMAX"] = "1"
import torch
import torch.nn as nn
import torch_neuronx
import numpy as np
from matplotlib import pyplot as plt
from matplotlib import image as mpimg
import time
import copy
from IPython.display import clear_output
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.models.unet_2d_condition import UNet2DConditionOutput
from diffusers.models.cross_attention import CrossAttention
# Define datatype
DTYPE = torch.bfloat16
clear_output(wait=False)
Define the helper classes and functions
class UNetWrap(nn.Module):
    def __init__(self, unet):
        super().__init__()
        self.unet = unet

    def forward(self, sample, timestep, encoder_hidden_states, cross_attention_kwargs=None):
        out_tuple = self.unet(sample, timestep, encoder_hidden_states, return_dict=False)
        return out_tuple

class NeuronUNet(nn.Module):
    def __init__(self, unetwrap):
        super().__init__()
        self.unetwrap = unetwrap
        self.config = unetwrap.unet.config
        self.in_channels = unetwrap.unet.in_channels
        self.device = unetwrap.unet.device

    def forward(self, sample, timestep, encoder_hidden_states, cross_attention_kwargs=None):
        sample = self.unetwrap(sample, timestep.to(dtype=DTYPE).expand((sample.shape[0],)), encoder_hidden_states)[0]
        return UNet2DConditionOutput(sample=sample)

class NeuronTextEncoder(nn.Module):
    def __init__(self, text_encoder):
        super().__init__()
        self.neuron_text_encoder = text_encoder
        self.config = text_encoder.config
        self.dtype = text_encoder.dtype
        self.device = text_encoder.device

    def forward(self, emb, attention_mask=None):
        return [self.neuron_text_encoder(emb)['last_hidden_state']]

# Optimized attention
def get_attention_scores(self, query, key, attn_mask):
    dtype = query.dtype

    if self.upcast_attention:
        query = query.float()
        key = key.float()

    # Check for square matmuls
    if query.size() == key.size():
        attention_scores = custom_badbmm(
            key,
            query.transpose(-1, -2)
        )
        if self.upcast_softmax:
            attention_scores = attention_scores.float()
        attention_probs = attention_scores.softmax(dim=1).permute(0, 2, 1)
        attention_probs = attention_probs.to(dtype)
    else:
        attention_scores = custom_badbmm(
            query,
            key.transpose(-1, -2)
        )
        if self.upcast_softmax:
            attention_scores = attention_scores.float()
        attention_probs = attention_scores.softmax(dim=-1)
        attention_probs = attention_probs.to(dtype)

    return attention_probs

def custom_badbmm(a, b):
    bmm = torch.bmm(a, b)
    scaled = bmm * 0.125
    return scaled

def decode_latents(self, latents):
    latents = latents.to(torch.float)
    latents = 1 / self.vae.config.scaling_factor * latents
    image = self.vae.decode(latents).sample
    image = (image / 2 + 0.5).clamp(0, 1)
    image = image.cpu().permute(0, 2, 3, 1).float().numpy()
    return image
Compile the models with Neuron so they can run on the Inferentia chips. The first compilation takes several minutes, so be patient...
# For saving compiler artifacts
COMPILER_WORKDIR_ROOT = 'sd2_compile_dir_512'
# Model ID for SD version pipeline
model_id = "stabilityai/stable-diffusion-2-1-base"
# --- Compile UNet and save ---
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=DTYPE)
# Replace original cross-attention module with custom cross-attention module for better performance
CrossAttention.get_attention_scores = get_attention_scores
# Apply double wrapper to deal with custom return type
pipe.unet = NeuronUNet(UNetWrap(pipe.unet))
# Only keep the model being compiled in RAM to minimize memory pressure
unet = copy.deepcopy(pipe.unet.unetwrap)
del pipe
# Compile unet
sample_1b = torch.randn([1, 4, 64, 64], dtype=DTYPE)
timestep_1b = torch.tensor(999, dtype=DTYPE).expand((1,))
encoder_hidden_states_1b = torch.randn([1, 77, 1024], dtype=DTYPE)
example_inputs = sample_1b, timestep_1b, encoder_hidden_states_1b
unet_neuron = torch_neuronx.trace(
    unet,
    example_inputs,
    compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'unet'),
    compiler_args=["--model-type=unet-inference", "--enable-fast-loading-neuron-binaries"]
)
# Enable asynchronous and lazy loading to speed up model load
torch_neuronx.async_load(unet_neuron)
torch_neuronx.lazy_load(unet_neuron)
# save compiled unet
unet_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'unet/model.pt')
torch.jit.save(unet_neuron, unet_filename)
# delete unused objects
del unet
del unet_neuron
# --- Compile CLIP text encoder and save ---
# Only keep the model being compiled in RAM to minimize memory pressure
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=DTYPE)
text_encoder = copy.deepcopy(pipe.text_encoder)
del pipe
# Apply the wrapper to deal with custom return type
text_encoder = NeuronTextEncoder(text_encoder)
# Compile text encoder
# This is used for indexing a lookup table in torch.nn.Embedding,
# so using random numbers may give errors (out of range).
emb = torch.tensor([[49406, 18376, 525, 7496, 49407, 0, 0, 0, 0, 0,
                     0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                     0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                     0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                     0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                     0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                     0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                     0, 0, 0, 0, 0, 0, 0]])
text_encoder_neuron = torch_neuronx.trace(
    text_encoder.neuron_text_encoder,
    emb,
    compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder'),
    compiler_args=["--enable-fast-loading-neuron-binaries"]
)
# Enable asynchronous loading to speed up model load
torch_neuronx.async_load(text_encoder_neuron)
# Save the compiled text encoder
text_encoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder/model.pt')
torch.jit.save(text_encoder_neuron, text_encoder_filename)
# delete unused objects
del text_encoder
del text_encoder_neuron
# --- Compile VAE decoder and save ---
# Only keep the model being compiled in RAM to minimize memory pressure
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float32)
decoder = copy.deepcopy(pipe.vae.decoder)
del pipe
# Compile vae decoder
decoder_in = torch.randn([1, 4, 64, 64], dtype=torch.float32)
decoder_neuron = torch_neuronx.trace(
    decoder,
    decoder_in,
    compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'vae_decoder'),
    compiler_args=["--enable-fast-loading-neuron-binaries"]
)
# Enable asynchronous loading to speed up model load
torch_neuronx.async_load(decoder_neuron)
# Save the compiled vae decoder
decoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_decoder/model.pt')
torch.jit.save(decoder_neuron, decoder_filename)
# delete unused objects
del decoder
del decoder_neuron
# --- Compile VAE post_quant_conv and save ---
# Only keep the model being compiled in RAM to minimize memory pressure
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float32)
post_quant_conv = copy.deepcopy(pipe.vae.post_quant_conv)
del pipe
# Compile vae post_quant_conv
post_quant_conv_in = torch.randn([1, 4, 64, 64], dtype=torch.float32)
post_quant_conv_neuron = torch_neuronx.trace(
    post_quant_conv,
    post_quant_conv_in,
    compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'vae_post_quant_conv'),
)
# Enable asynchronous loading to speed up model load
torch_neuronx.async_load(post_quant_conv_neuron)
# Save the compiled vae post_quant_conv
post_quant_conv_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_post_quant_conv/model.pt')
torch.jit.save(post_quant_conv_neuron, post_quant_conv_filename)
# delete unused objects
del post_quant_conv
del post_quant_conv_neuron
When compilation succeeds, you will see a Compiler status PASS message.
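As an optional sanity check (not part of the original notebook), you can confirm that all four compiled artifacts were written to disk before moving on:
# Optional sanity check: all four compiled models should exist on disk
import os

COMPILER_WORKDIR_ROOT = 'sd2_compile_dir_512'
for sub in ['unet', 'text_encoder', 'vae_decoder', 'vae_post_quant_conv']:
    path = os.path.join(COMPILER_WORKDIR_ROOT, sub, 'model.pt')
    status = 'OK' if os.path.exists(path) else 'MISSING'
    print(f"{path}: {status}")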
Load the compiled models and generate images
# --- Load all compiled models ---
COMPILER_WORKDIR_ROOT = 'sd2_compile_dir_512'
model_id = "stabilityai/stable-diffusion-2-1-base"
text_encoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder/model.pt')
decoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_decoder/model.pt')
unet_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'unet/model.pt')
post_quant_conv_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_post_quant_conv/model.pt')
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=DTYPE)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
# Replaces StableDiffusionPipeline's decode_latents method with our custom decode_latents method defined above.
StableDiffusionPipeline.decode_latents = decode_latents
# Load the compiled UNet onto two neuron cores.
pipe.unet = NeuronUNet(UNetWrap(pipe.unet))
device_ids = [0,1]
pipe.unet.unetwrap = torch_neuronx.DataParallel(torch.jit.load(unet_filename), device_ids, set_dynamic_batching=False)
# Load other compiled models onto a single neuron core.
pipe.text_encoder = NeuronTextEncoder(pipe.text_encoder)
pipe.text_encoder.neuron_text_encoder = torch.jit.load(text_encoder_filename)
pipe.vae.decoder = torch.jit.load(decoder_filename)
pipe.vae.post_quant_conv = torch.jit.load(post_quant_conv_filename)
# Run pipeline
prompt = ["a photo of an astronaut riding a horse on mars",
"sonic on the moon",
"elvis playing guitar while eating a hotdog",
"saved by the bell",
"engineers eating lunch at the opera",
"panda eating bamboo on a plane",
"A digital illustration of a steampunk flying machine in the sky with cogs and mechanisms, 4k, detailed, trending in artstation, fantasy vivid colors",
"kids playing soccer at the FIFA World Cup"
]
# First do a warmup run so all the asynchronous loads can finish
image_warmup = pipe(prompt[0]).images[0]
plt.title("Image")
plt.xlabel("X pixel scaling")
plt.ylabel("Y pixels scaling")
total_time = 0
for x in prompt:
    start_time = time.time()
    image = pipe(x).images[0]
    total_time = total_time + (time.time() - start_time)
    image.save("image.png")
    image = mpimg.imread("image.png")
    # clear_output(wait=True)
    plt.imshow(image)
    plt.show()
print("Average time: ", np.round((total_time/len(prompt)), 2), "seconds")
In this example we passed in a set of prompts, and the pipeline produced an image for each one.
The prompt "a photo of an astronaut riding a horse on mars" generated an image of an astronaut riding a horse.
When we change the prompt to "Portrait of renaud sechan, pen and ink, intricate line drawings, by craig mullins, ruan jia, kentaro miura, greg rutkowski, loundraw", the pipeline produces a matching pen-and-ink portrait.
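For reference, a single-prompt call with the adjusted prompt looks like the sketch below; num_inference_steps and guidance_scale are standard diffusers pipeline arguments you can tune, and the output file name is arbitrary.
# Generate one image with the adjusted prompt (pipe is already loaded above)
prompt = ("Portrait of renaud sechan, pen and ink, intricate line drawings, "
          "by craig mullins, ruan jia, kentaro miura, greg rutkowski, loundraw")
image = pipe(prompt, num_inference_steps=25, guidance_scale=7.5).images[0]
image.save("portrait.png")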
About AWS custom silicon
AWS has built several families of high-performance custom chips for cloud-native workloads and uses them at scale with very positive results. The four main families are:
Name | Website | Purpose |
---|---|---|
Nitro | AWS Nitro System | Virtualization, security, networking, and storage offload; improves overall EC2 I/O performance, virtualization efficiency, and security |
Graviton | AWS Graviton processors | General-purpose server CPUs based on the ARM64 instruction set, widely used for server workloads and now holding the largest share of the ARM server market (unit price roughly 20% lower than x86_64 with better performance, about 34% better price-performance); nearly all AWS customers use Graviton-backed managed services or run their own services on Graviton instances |
Inferentia | AWS Inferentia | Dedicated inference chip for deep learning workloads; the chip used in this post |
Trainium | AWS Trainium | Dedicated training chip for deep learning model training |
In machine learning, Trainium and Inferentia are a natural pairing, handling large-scale model training and inference respectively. Because they are purpose-built for machine learning at scale, they are more efficient, cheaper, and more power-efficient. Most mainstream models can be adapted with the Neuron SDK to run AI/ML workloads efficiently.
The inf2.8xlarge used here is powered by the second-generation Inferentia chip (launched in 2023); the Inf2 instance family offers up to 2.3 petaflops of DL performance, up to 384 GB of total accelerator memory, and 9.8 TB/s of memory bandwidth. The AWS Neuron SDK integrates natively with popular machine learning frameworks such as PyTorch and TensorFlow, so you can keep your existing frameworks and application code when deploying on Inf2. Developers can use Inf2 instances with the AWS Deep Learning AMIs, AWS Deep Learning Containers, or managed services such as Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), and Amazon SageMaker.
At the heart of Amazon EC2 Inf2 instances is the AWS Inferentia2 device, each of which contains two NeuronCores-v2. Each NeuronCore-v2 is an independent, heterogeneous compute unit with four main engines: Tensor, Vector, Scalar, and GPSIMD. The Tensor engine is optimized for matrix operations. The Scalar engine is optimized for element-wise operations such as the ReLU (rectified linear unit) function. The Vector engine is optimized for non-element-wise vector operations, including batch normalization and pooling. The diagram below shows the internal architecture of the AWS Inferentia2 device.
AWS Inferentia2 supports a wide range of data types, including FP32, TF32, BF16, FP16, and UINT8, so you can choose the most suitable type for your workload. It also supports the new configurable FP8 (cFP8) data type, which is especially relevant for large models because it reduces their memory footprint and I/O requirements.
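In this walkthrough the data type is chosen up front by setting DTYPE = torch.bfloat16 before tracing. Alternatively, the neuronx-cc compiler can auto-cast an FP32 model at compile time via its --auto-cast/--auto-cast-type options; the sketch below shows how such flags would be passed through torch_neuronx.trace (check the Neuron compiler CLI reference for the values supported by your SDK version).
# Sketch: keep the model in FP32 and ask the Neuron compiler to cast matmuls to BF16.
# --auto-cast / --auto-cast-type are neuronx-cc options; verify the supported values
# in the Neuron compiler documentation for your SDK version.
import torch
import torch_neuronx

fp32_model = torch.nn.Linear(64, 64).eval()   # stand-in for a real FP32 model
example = torch.rand(1, 64)

neuron_model = torch_neuronx.trace(
    fp32_model,
    example,
    compiler_args=["--auto-cast=matmult", "--auto-cast-type=bf16"],
)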
AWS Inferentia2 embeds a general-purpose digital signal processor (DSP) that enables dynamic execution, so control-flow operators do not need to be unrolled or executed on the host. AWS Inferentia2 also supports dynamic input shapes, which is critical for models whose input tensor sizes are not known in advance, such as models that process text.
AWS Inferentia2 supports custom operators written in C++. Neuron Custom C++ Operators let users write C++ custom operators that run natively on NeuronCores. Using the standard PyTorch custom operator programming interface, you can migrate CPU custom operators to Neuron and implement new experimental operators, all without any deep knowledge of the NeuronCore hardware.
Inf2 instances are the first inference-optimized instances on Amazon EC2 to support distributed inference over direct, ultra-high-speed connections between chips (NeuronLink v2). NeuronLink v2 uses collective communications operators such as all-reduce to run high-performance inference pipelines across all chips.
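NeuronLink and collective communications matter most for multi-chip deployments; within a single Inferentia2 device, the simpler mechanism this post relies on is torch_neuronx.DataParallel, which replicates a compiled model across the listed NeuronCores and splits inputs along the batch dimension. A standalone sketch, assuming the UNet compiled earlier in this post:
# Sketch: run the compiled UNet on both NeuronCores of one Inferentia2 device.
# Assumes the unet/model.pt artifact produced in the compilation step above.
import os
import torch
import torch_neuronx

unet = torch.jit.load(os.path.join('sd2_compile_dir_512', 'unet', 'model.pt'))

# One replica per NeuronCore; dynamic batching is disabled because the model
# was traced with a fixed batch size of 1.
unet_dp = torch_neuronx.DataParallel(unet, device_ids=[0, 1], set_dynamic_batching=False)

# A batch of 2 (as produced by classifier-free guidance) is split one item per core.
sample = torch.randn([2, 4, 64, 64], dtype=torch.bfloat16)
timestep = torch.full((2,), 999.0, dtype=torch.bfloat16)
hidden_states = torch.randn([2, 77, 1024], dtype=torch.bfloat16)
output = unet_dp(sample, timestep, hidden_states)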
Neuron SDK
AWS Neuron is the SDK that optimizes the performance of complex neural network models executed on AWS Inferentia and Trainium. It includes a deep learning compiler, a runtime, and tools that integrate natively with popular frameworks such as TensorFlow and PyTorch, and it comes preinstalled in the AWS Deep Learning AMIs and Deep Learning Containers so customers can quickly start running high-performance, cost-effective inference.
The Neuron compiler accepts machine learning models in multiple formats (TensorFlow, PyTorch, XLA HLO) and optimizes them to run on Neuron devices. It is invoked from within the machine learning framework: the Neuron framework plugin sends the model to the compiler, which produces a NEFF file (Neuron Executable File Format) that the Neuron runtime then loads onto the Neuron device.
The Neuron runtime consists of a kernel driver and C/C++ libraries that provide APIs for accessing Inferentia and Trainium devices. The Neuron ML framework plugins for TensorFlow and PyTorch use the Neuron runtime to load and run models on NeuronCores. The runtime loads compiled deep learning models (NEFF files) onto Neuron devices and is optimized for high throughput and low latency.
Inf2 instances can be used to run popular applications such as text summarization, code generation, video and image generation, speech recognition, and personalization.
References:
- Maximize Stable Diffusion performance and lower inference costs with AWS Inferentia2
- Run the GPT-J-6B large language model with Amazon EC2 Inf2 instances
- AI image generation with AWS Inferentia2 on Amazon SageMaker