vLLM

What is vLLM?

vLLM, short for Virtual Large Language Model, is an open-source library designed to enhance the efficiency and scalability of large language model (LLM) inference and serving. Developed by the Sky Computing Lab at UC Berkeley, vLLM has become a community-driven project with contributions from both academia and industry.
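In Python, the main entry point for offline use is the LLM class, which loads a model and runs batched generation directly. A minimal offline-inference sketch (the model name and sampling settings below are only illustrative; any supported Hugging Face causal LM can be substituted):

# Minimal sketch of offline inference with vLLM's Python API.
from vllm import LLM, SamplingParams

prompts = [
    "Explain continuous batching in one sentence.",
    "What is PagedAttention?",
]
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)

# Loads the model weights onto the GPU and allocates the KV cache.
llm = LLM(model="facebook/opt-125m")

# generate() batches all prompts together and returns one result per prompt.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)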

What Can You Build With vLLM?

Typical use cases include:

  • Production model serving for applications like chatbots, virtual assistants, code generators, and customer support bots.
  • Real-time interactive AI applications where response speed and scalability are critical.
  • APIs for LLM-based features integrated into mobile apps, enterprise systems, or SaaS products.
  • Batch and offline inference workloads (for analytics, summarization, indexing, etc.).
  • Research environments, where hardware efficiency makes it practical to experiment with large models without huge costs.

Installation

$ pip install -U vllm --extra-index-url https://download.pytorch.org/whl/cu130
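A quick sanity check that the install succeeded and picked up a CUDA-enabled PyTorch (a minimal sketch, assuming a GPU is visible on the machine):

# Verify that vLLM imports and a CUDA device is available.
import torch
import vllm

print("vLLM version:", vllm.__version__)
print("CUDA available:", torch.cuda.is_available())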

Use an existing PyTorch installation

$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ python use_existing_torch.py
$ pip install -r requirements/build.txt
$ pip install --no-build-isolation -e .

Installation on AGX Thor

  1. Install CUDA Toolkit 13.0 if necessary
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/sbsa/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-13-0
sudo apt-get install python3-dev
  2. Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
  3. Create a virtual environment
uv venv ~/venv/vllm --python 3.12
source ~/venv/vllm/bin/activate
  4. Install PyTorch
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
  5. Install xgrammar, Triton, and FlashInfer
uv pip install xgrammar triton flashinfer-python --prerelease=allow
  6. Clone and install vLLM from source
git clone --recursive https://github.com/vllm-project/vllm.git
cd vllm
python3 use_existing_torch.py
uv pip install -r requirements/build.txt
uv pip install --no-build-isolation -e .
  7. Export environment variables
export TORCH_CUDA_ARCH_LIST=11.0a # Thor
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
  8. Free cached memory
sudo sysctl -w vm.drop_caches=3
  9. Download the tiktoken encodings and run the GPT-OSS 120B model
mkdir -p tiktoken_encodings
wget -O tiktoken_encodings/o200k_base.tiktoken "https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken"
wget -O tiktoken_encodings/cl100k_base.tiktoken "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"
export TIKTOKEN_ENCODINGS_BASE=${PWD}/tiktoken_encodings

Optionally, use mxfp8 activation for MoE. This is faster, but carries a higher risk of accuracy degradation:

export VLLM_USE_FLASHINFER_MXFP4_MOE=1

Run the model:

uv run vllm serve "openai/gpt-oss-120b" --async-scheduling --port 8000 --host 0.0.0.0 --trust_remote_code --swap-space 16 --max-model-len 32000 --tensor-parallel-size 1 --max-num-seqs 1024 --gpu-memory-utilization 0.7
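The server exposes an OpenAI-compatible API, so any OpenAI client library can talk to it. A minimal client sketch using the openai Python package (the endpoint and model name mirror the serve command above; the API key is a placeholder since no key is configured):

# Send a chat request to the vLLM server started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # no key required by default

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Give a one-sentence summary of vLLM."}],
    max_tokens=128,
)
print(response.choices[0].message.content)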