vLLM: An LLM Inference and Serving Library

Ops News 2023-11-03 醒在深海的猫


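vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast:

State-of-the-art serving throughput
Efficient management of attention key and value memory with PagedAttention
Continuous batching of incoming requests
Optimized CUDA kernels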
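
The idea behind paged management of key and value memory can be illustrated with a toy allocator. This is only a sketch of the general concept, not vLLM's implementation: the class name `ToyPagedKVCache`, the block size of 4 tokens, and the request ids are all assumptions made up for illustration, and no real tensors are stored. Each sequence keeps a block table that maps its logical blocks to fixed-size physical blocks, so memory is reserved one block at a time rather than for the maximum possible length up front.

```python
from typing import Dict, List

BLOCK_SIZE = 4  # tokens per block; illustrative value, not vLLM's setting


class ToyPagedKVCache:
    """Toy block-table allocator (hypothetical); bookkeeping only, no tensors."""

    def __init__(self, num_blocks: int) -> None:
        # Physical blocks that are not owned by any sequence.
        self.free_blocks: List[int] = list(range(num_blocks))
        # Per-sequence block table: logical block index -> physical block id.
        self.block_tables: Dict[str, List[int]] = {}
        # Number of tokens stored so far for each sequence.
        self.num_tokens: Dict[str, int] = {}

    def append_token(self, seq_id: str) -> int:
        """Reserve a slot for one more token; return the physical block used."""
        table = self.block_tables.setdefault(seq_id, [])
        count = self.num_tokens.get(seq_id, 0)
        if count % BLOCK_SIZE == 0:
            # The last block is full (or none exists yet): take a new one.
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop())
        self.num_tokens[seq_id] = count + 1
        return table[-1]

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.num_tokens.pop(seq_id, None)


cache = ToyPagedKVCache(num_blocks=8)
for _ in range(6):
    cache.append_token("request-a")  # 6 tokens occupy 2 blocks
for _ in range(3):
    cache.append_token("request-b")  # 3 tokens occupy 1 block
print(cache.block_tables)
cache.free("request-a")  # its blocks are reusable by other requests at once
print(len(cache.free_blocks))
```

Because blocks are returned to the pool as soon as a request finishes, a scheduler that admits new requests between decoding steps (continuous batching) can reuse that memory immediately.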

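vLLM is flexible and easy to use:

Seamless integration with popular Hugging Face models
High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
Tensor parallelism support for distributed inference
Streaming outputs
OpenAI-compatible API server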
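
Since the server speaks the OpenAI API, a client can be written with nothing but the standard library. The sketch below assumes a server is already running locally on port 8000 (for example, started with `python -m vllm.entrypoints.openai.api_server --model gpt2`) and that it exposes the `/v1/completions` endpoint; the address, the helper names `build_completion_request` and `complete`, and the parameter defaults are assumptions for illustration.

```python
import json
import urllib.request

# Assumed local address of a running vLLM OpenAI-compatible server.
BASE_URL = "http://localhost:8000/v1"


def build_completion_request(model, prompt, n=1, max_tokens=32, temperature=0.8):
    """Build (but do not send) a POST request for the completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "n": n,  # n > 1 asks for several samples of the same prompt
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(model, prompt, **kwargs):
    """Send the request and return the generated texts (needs a live server)."""
    request = build_completion_request(model, prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.loads(response.read().decode("utf-8"))
    return [choice["text"] for choice in body["choices"]]


# Example call, commented out because it requires a running server:
# print(complete("gpt2", "San Francisco is a", n=2))
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any network traffic happens.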

vLLM seamlessly supports many Hugging Face models, including the following architectures:

Aquila & Aquila2 (BAAI/AquilaChat2-7B, BAAI/AquilaChat2-34B, BAAI/Aquila-7B, BAAI/AquilaChat-7B, etc.)
Baichuan (baichuan-inc/Baichuan-7B, baichuan-inc/Baichuan-13B-Chat, etc.)
BLOOM (bigscience/bloom, bigscience/bloomz, etc.)
Falcon (tiiuae/falcon-7b, tiiuae/falcon-40b, tiiuae/falcon-rw-7b, etc.)
GPT-2 (gpt2, gpt2-xl, etc.)
GPT BigCode (bigcode/starcoder, bigcode/gpt_bigcode-santacoder, etc.)
GPT-J (EleutherAI/gpt-j-6b, nomic-ai/gpt4all-j, etc.)
GPT-NeoX (EleutherAI/gpt-neox-20b, databricks/dolly-v2-12b, stabilityai/stablelm-tuned-alpha-7b, etc.)
InternLM (internlm/internlm-7b, internlm/internlm-chat-7b, etc.)
LLaMA & LLaMA-2 (meta-llama/Llama-2-70b-hf, lmsys/vicuna-13b-v1.3, young-geng/koala, openlm-research/open_llama_13b, etc.)
Mistral (mistralai/Mistral-7B-v0.1, mistralai/Mistral-7B-Instruct-v0.1, etc.)
MPT (mosaicml/mpt-7b, mosaicml/mpt-30b, etc.)
OPT (facebook/opt-66b, facebook/opt-iml-max-30b, etc.)
Qwen (Qwen/Qwen-7B, Qwen/Qwen-7B-Chat, etc.)
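
Any of the model names above can be passed to vLLM's offline inference interface. The following is a minimal sketch, assuming vLLM is installed and a supported GPU is available; the wrapper `generate_texts` and its default values are hypothetical, while `LLM`, `SamplingParams`, and `generate` are the library's own entry points. Setting `n` above 1 requests parallel sampling, and `tensor_parallel_size` above 1 shards the model across that many GPUs.

```python
from typing import List


def generate_texts(
    prompts: List[str],
    model: str = "gpt2",
    n: int = 1,
    tensor_parallel_size: int = 1,
) -> List[List[str]]:
    """Return n sampled completions for each prompt (hypothetical helper)."""
    # Imported lazily so this file can be loaded on a machine without vLLM.
    from vllm import LLM, SamplingParams

    # Sampling settings are illustrative, not recommended values.
    sampling_params = SamplingParams(n=n, temperature=0.8, top_p=0.95, max_tokens=32)
    llm = LLM(model=model, tensor_parallel_size=tensor_parallel_size)
    request_outputs = llm.generate(prompts, sampling_params)
    # Each request output holds one completion object per requested sample.
    return [[completion.text for completion in out.outputs] for out in request_outputs]


# Example call, commented out because it downloads the model and needs a GPU:
# print(generate_texts(["Hello, my name is"], model="gpt2", n=2))
```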
