vLLM

What is vLLM?

vLLM, short for Virtual Large Language Model, is an open-source library designed to enhance the efficiency and scalability of large language model (LLM) inference and serving. Developed by the Sky Computing Lab at UC Berkeley, vLLM has become a community-driven project with contributions from both academia and industry.
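In Python, the main entry point for offline use is the LLM class, which loads a model and runs batched generation directly. A minimal offline-inference sketch (the model name and sampling settings below are only illustrative; any supported Hugging Face causal LM can be substituted):

# Minimal sketch of offline inference with vLLM's Python API.
from vllm import LLM, SamplingParams

prompts = [
    "Explain continuous batching in one sentence.",
    "What is PagedAttention?",
]
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)

# Loads the model weights onto the GPU and allocates the KV cache.
llm = LLM(model="facebook/opt-125m")

# generate() batches all prompts together and returns one result per prompt.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)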

What Can You Build With vLLM?

Typical use cases include:

  • Production model serving for applications like chatbots, virtual assistants, code generators, and customer support bots.
  • Real-time interactive AI applications where response speed and scalability are critical.
  • APIs for LLM-based features integrated into mobile apps, enterprise systems, or SaaS products.
  • Batch and offline inference workloads (for analytics, summarization, indexing, etc.).
  • Research environments, where hardware efficiency makes it practical to experiment with large models without huge costs.

Installation

$ pip install -U vllm --extra-index-url https://download.pytorch.org/whl/cu130
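A quick sanity check that the install succeeded and picked up a CUDA-enabled PyTorch (a minimal sketch, assuming a GPU is visible on the machine):

# Verify that vLLM imports and a CUDA device is available.
import torch
import vllm

print("vLLM version:", vllm.__version__)
print("CUDA available:", torch.cuda.is_available())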

Use an existing PyTorch installation

$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ python use_existing_torch.py
$ pip install -r requirements/build.txt
$ pip install --no-build-isolation -e .

Installation on AGX Thor

  1. Install CUDA Toolkit 13.0 if necessary
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/sbsa/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-13-0
sudo apt-get install python3-dev
  2. Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
  3. Create a virtual environment
uv venv ~/venv/vllm --python 3.12
source ~/venv/vllm/bin/activate
  4. Install PyTorch
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
  5. Install xgrammar, Triton, and FlashInfer
uv pip install xgrammar triton flashinfer-python --prerelease=allow
  6. Clone and install vLLM from source
git clone --recursive https://github.com/vllm-project/vllm.git
cd vllm
python3 use_existing_torch.py
uv pip install -r requirements/build.txt
uv pip install --no-build-isolation -e .
  7. Export environment variables
export TORCH_CUDA_ARCH_LIST=11.0a # Thor
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
  8. Free cached memory
sudo sysctl -w vm.drop_caches=3
  9. Download the tiktoken encodings and run the GPT-OSS 120B model
mkdir -p tiktoken_encodings
wget -O tiktoken_encodings/o200k_base.tiktoken "https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken"
wget -O tiktoken_encodings/cl100k_base.tiktoken "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"
export TIKTOKEN_ENCODINGS_BASE=${PWD}/tiktoken_encodings

Optionally, use mxfp8 activation for MoE. This is faster, but carries a higher risk of accuracy degradation:

export VLLM_USE_FLASHINFER_MXFP4_MOE=1

Run the model:

uv run vllm serve "openai/gpt-oss-120b" --async-scheduling --port 8000 --host 0.0.0.0 --trust_remote_code --swap-space 16 --max-model-len 32000 --tensor-parallel-size 1 --max-num-seqs 1024 --gpu-memory-utilization 0.7
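The server exposes an OpenAI-compatible API, so any OpenAI client library can talk to it. A minimal client sketch using the openai Python package (the endpoint and model name mirror the serve command above; the API key is a placeholder since no key is configured):

# Send a chat request to the vLLM server started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # no key required by default

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Give a one-sentence summary of vLLM."}],
    max_tokens=128,
)
print(response.choices[0].message.content)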