vLLM
VLLM
To begin, start a vLLM server. If your vLLM version is > 0.12.0, be sure to specify `--structured-outputs-config.backend guidance`.
```shell
vllm serve RedHatAI/gemma-3-12b-it-quantized.w4a16 --host 0.0.0.0 \
    --port 8000 \
    --enable-prefix-caching \
    --max-model-len 8000 \
    --structured-outputs-config.backend guidance \
    --gpu_memory_utilization 0.8 \
    --enable-prompt-tokens-details
```
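Once running, the server exposes an OpenAI-compatible HTTP API at the host and port above. As a minimal sketch (assuming the defaults from the command above), this is the shape of a chat-completions request you could POST to it:

```python
import json

# Assumption: host/port match the `vllm serve` command above.
BASE_URL = "http://localhost:8000/v1"

# Request body for the OpenAI-compatible /chat/completions endpoint.
payload = {
    "model": "RedHatAI/gemma-3-12b-it-quantized.w4a16",
    "messages": [{"role": "user", "content": "Say hello."}],
    "max_tokens": 32,
}

# POST this as JSON to f"{BASE_URL}/chat/completions",
# e.g. with `requests` or the `openai` client pointed at BASE_URL.
print(json.dumps(payload, indent=2))
```

This is the same endpoint the `VLLM` class below talks to via its `base_url` parameter.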
Bases: `ModelBase`

Class for vLLM endpoints.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model_name_or_path` | | Name of the model | *required* |
| `base_url` | `str \| None` | Base URL for HTTP requests. Defaults to `"http://localhost:8000/v1/"` | `None` |
Examples:

```python
from blendsql.models import VLLM

model = VLLM("RedHatAI/gemma-3-12b-it-quantized.w4a16", base_url="http://localhost:8000/v1/")
```
Source code in blendsql/models/vllm.py