News
force_first_config
4+ hour ago (23+ words) vLLM docs Skip Triton autotuning under VLLM_TRITON_FORCE_FIRST_CONFIG. Install the Autotuner.run replacement. Return whether the first-valid-config patch is currently installed....
nemotron_v3
9+ hour, 7+ min ago (75+ words) vLLM docs The Nemotron 3 Super model uses the same tool call and reasoning format as Qwen3 ( / + XML). This config reuses qwen3_config with a distinct name. When enable_thinking=False or force_nonempty_content=True and content is empty, reasoning and content are swapped. Nemotron V3 parser: same…...
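The swap rule in the excerpt above can be sketched as follows. This is a minimal illustration, not the vLLM parser: the function name `maybe_swap` is hypothetical, and the grouping of the condition as "(thinking disabled or non-empty content forced) and content is empty" is an assumed reading of the excerpt.

```python
from typing import Optional, Tuple


def maybe_swap(
    reasoning: Optional[str],
    content: Optional[str],
    *,
    enable_thinking: bool = True,
    force_nonempty_content: bool = False,
) -> Tuple[Optional[str], Optional[str]]:
    """Return (reasoning, content), swapped when content is empty.

    Assumed reading of the rule: the swap applies only when thinking is
    disabled or non-empty content is forced, and content is empty or None.
    """
    if (not enable_thinking or force_nonempty_content) and not content:
        # The parsed text landed in the reasoning slot; move it to content.
        return content, reasoning
    return reasoning, content
```

With thinking disabled and empty content, text parsed as reasoning is surfaced as content; in every other case both fields pass through unchanged.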
diffusion - vLLM
4+ day, 11+ hour ago (108+ words) diffusion vLLM docs Configuration for discrete diffusion (dLLM) models. Configuration for discrete diffusion language models (dLLMs). dLLMs generate tokens via iterative denoising over a fixed-length canvas rather than left-to-right autoregressive decoding. They reuse the speculative-decoding data path…...
harmony - vLLM
4+ day, 18+ hour ago (24+ words) harmony vLLM docs Parse Harmony output from token IDs. Tool calls are always extracted regardless of enable_auto_tools. Callers must decide whether to surface them....
per_token_group_fp8_quant
5+ day, 1+ hour ago (66+ words) vLLM docs Pick the best pre-tuned config for the given input shape. - Find the closest hidden_size among available configs (exact match preferred). - Find the closest group_size among available configs (exact match preferred). - Among the num_tokens values tuned for that hidden_size and group_size, pick the…...
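The first two selection steps listed above can be sketched as a nearest-value lookup. This is a sketch under stated assumptions, not the vLLM implementation: the names `closest` and `candidate_num_tokens` and the table layout (a dict keyed by `(hidden_size, group_size)`) are hypothetical, and group_size is searched only among entries that share the chosen hidden_size. The final num_tokens rule is truncated in the excerpt, so the sketch stops at returning the candidates.

```python
from typing import Dict, List, Tuple

# Hypothetical table layout: (hidden_size, group_size) -> {num_tokens: config}
ConfigTable = Dict[Tuple[int, int], Dict[int, dict]]


def closest(target: int, available: List[int]) -> int:
    """Return the available value nearest to target; an exact match wins."""
    if target in available:
        return target
    return min(available, key=lambda v: abs(v - target))


def candidate_num_tokens(
    configs: ConfigTable, hidden_size: int, group_size: int
) -> Tuple[int, int, List[int]]:
    """Resolve hidden_size, then group_size, then list the tuned num_tokens."""
    hidden_sizes = sorted({h for h, _ in configs})
    best_hidden = closest(hidden_size, hidden_sizes)
    # Assumption: only group sizes tuned for the chosen hidden_size are considered.
    group_sizes = sorted({g for h, g in configs if h == best_hidden})
    best_group = closest(group_size, group_sizes)
    return best_hidden, best_group, sorted(configs[(best_hidden, best_group)])
```

For example, with entries tuned for hidden sizes 4096 and 7168, a request for hidden_size 7000 resolves to the 7168 entries because that value is nearer.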
Optimization and Tuning
7+ mon, 1+ week ago (1562+ words) This guide covers optimization strategies and performance tuning for vLLM V1. Running out of memory? Consult this guide on how to conserve memory. vLLM provides 4 optimization levels (-O0, -O1, -O2, -O3) that allow users to trade off startup time for performance: For more…...
KV Offloading Usage Guide
6+ day, 21+ hour ago (263+ words) The Offloading Connector currently supports CUDA, ROCm, and XPU only. Two specs are available, selected by the spec_name key in kv_connector_extra_config: Only the CPU primary tier has direct GPU access. Secondary tiers cannot read from or write to GPU memory; all GPU↔secondary…...
fused_qk_norm_rope - vLLM
1+ week, 2+ hour ago (93+ words) Fused QK-RMSNorm + (partial) RoPE + gate copy Triton kernel. Currently used by the Qwen3.5 attention path (attn_output_gate with NeoX-style partial RoPE). The unfused reference sequence is split -> Gemma RMSNorm -> RoPE -> gate chunk; this collapses it into a single Triton…...
deepep_v2 - vLLM
1+ week, 6+ hour ago (102+ words) deepep_v2 vLLM docs Prepare/Finalize using DeepEP v2 Elastic Buffer (unified API). Supports two modes controlled by the use_cudagraph constructor arg: Decode mode (use_cudagraph=True): - do_expand=False, do_cpu_sync=False - Tokens returned in original order with recv_topk_idx (global IDs) - Worst-case tensor allocation; padding rows zeroed via…...
routed_experts
1+ week, 6+ hour ago (340+ words) Container for routed expert weights and execution logic. Compute the hidden dimension index from the shard (intermediate) dimension and tensor rank. For 2D weight tensors the two data dims are (0, 1). For 3D tensors with an expert dimension at dim 0, they…...