News

vLLM docs
docs.vllm.ai > en > latest > api > vllm > triton_utils > force_first_config

force_first_config

4+ hour ago  (23+ words) Skip Triton autotuning under VLLM_TRITON_FORCE_FIRST_CONFIG. Install the Autotuner.run replacement. Return whether the first-valid-config patch is currently installed...
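The snippet describes three pieces: an environment switch, a replacement for Autotuner.run, and a query for whether the patch is installed. A minimal sketch of that monkey-patching pattern follows. It is not vLLM's or Triton's code: the `Autotuner` class here is a hypothetical stand-in, the function names `install_patch`, `is_installed`, and `maybe_install` are invented for illustration, treating the value `"1"` as "enabled" is an assumption, and the validity check implied by "first-valid-config" is omitted.

```python
import os


class Autotuner:
    """Hypothetical stand-in for an autotuner; NOT Triton's real class."""

    def __init__(self, configs, kernel):
        self.configs = configs
        self.kernel = kernel

    def run(self, *args):
        # Stand-in for benchmarking: pick the config with the lowest cost.
        best = min(self.configs, key=lambda cfg: cfg["cost"])
        return self.kernel(best, *args)


_original_run = None  # holds the unpatched method once the patch is installed


def install_patch():
    """Replace Autotuner.run with a version that skips the search."""
    global _original_run
    if _original_run is not None:
        return  # already installed; keep the call idempotent
    _original_run = Autotuner.run

    def run_first_config(self, *args):
        # Use the first listed config without benchmarking the others.
        return self.kernel(self.configs[0], *args)

    Autotuner.run = run_first_config


def is_installed():
    """Return whether the replacement is currently installed."""
    return _original_run is not None


def maybe_install(env=os.environ):
    # Assumption: the variable is considered set when its value is "1".
    if env.get("VLLM_TRITON_FORCE_FIRST_CONFIG") == "1":
        install_patch()
    return is_installed()
```

Keeping the original method in a module-level variable makes the install idempotent and leaves room for an uninstall step, which this sketch does not include.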

vLLM docs
docs.vllm.ai > en > latest > api > vllm > parser > nemotron_v3

nemotron_v3

9+ hour, 7+ min ago  (75+ words) The Nemotron 3 Super model uses the same tool call and reasoning format as Qwen3 ( / + XML). This config reuses qwen3_config with a distinct name. When enable_thinking=False or force_nonempty_content=True and content is empty, reasoning and content are swapped. Nemotron V3 parser: same…
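The swap rule in the snippet can be sketched as a small function. This is an illustration only, not the parser's code: the function name `maybe_swap` is hypothetical, and the grouping of the condition is an assumption, since the sentence can be read as "(A or B) and content is empty" or as "A or (B and content is empty)". The sketch takes the first reading.

```python
def maybe_swap(reasoning, content, enable_thinking=True,
               force_nonempty_content=False):
    """Return (reasoning, content), swapped when the rule applies.

    Assumed reading of the rule:
    (enable_thinking is False OR force_nonempty_content is True)
    AND content is empty.
    """
    should_swap = (not enable_thinking or force_nonempty_content) and not content
    if should_swap:
        # The text that was parsed as reasoning becomes the visible content.
        return content, reasoning
    return reasoning, content


# With thinking disabled and empty content, the reasoning text is surfaced.
print(maybe_swap("The answer is 4.", "", enable_thinking=False))
```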

vLLM docs
docs.vllm.ai > en > latest > api > vllm > config > diffusion

diffusion - vLLM

4+ day, 11+ hour ago  (108+ words) Configuration for discrete diffusion language models (dLLMs). dLLMs generate tokens via iterative denoising over a fixed-length canvas rather than left-to-right autoregressive decoding. They reuse the speculative-decoding data path…

vLLM docs
docs.vllm.ai > en > latest > api > vllm > parser > harmony

harmony - vLLM

4+ day, 18+ hour ago  (24+ words) Parse Harmony output from token IDs. Tool calls are always extracted regardless of enable_auto_tools. Callers must decide whether to surface them...

vLLM docs
docs.vllm.ai > en > latest > api > vllm > kernels > helion > ops > per_token_group_fp8_quant

per_token_group_fp8_quant

5+ day, 1+ hour ago  (66+ words) Pick the best pre-tuned config for the given input shape. - Find the closest hidden_size among available configs (exact match preferred). - Find the closest group_size among available configs (exact match preferred). - Among the num_tokens values tuned for that hidden_size and group_size, pick the…
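The first two selection steps can be sketched as a nearest-match lookup. This is a hedged illustration, not the kernel's code: the table layout `{(hidden_size, group_size): {num_tokens: config}}`, the helper names `closest` and `candidates_for_shape`, and the tie-break toward the smaller value are all assumptions. The third step is truncated in the snippet, so the sketch stops at returning the num_tokens candidates rather than guessing the final rule.

```python
def closest(target, candidates):
    """Pick the candidate nearest to target.

    An exact match has distance 0, so it always wins. Ties are broken
    toward the smaller value (an assumption for this sketch).
    """
    return min(candidates, key=lambda c: (abs(c - target), c))


def candidates_for_shape(configs, hidden_size, group_size):
    """Resolve hidden_size, then group_size, and return what remains.

    configs is a hypothetical table:
    {(hidden_size, group_size): {num_tokens: config}}
    """
    h = closest(hidden_size, {key[0] for key in configs})
    # Only consider group sizes that were tuned for the chosen hidden size.
    g = closest(group_size, {key[1] for key in configs if key[0] == h})
    return h, g, configs[(h, g)]


table = {
    (4096, 128): {1: "cfg_a", 64: "cfg_b"},
    (4096, 64): {1: "cfg_c"},
    (8192, 128): {1: "cfg_d"},
}
print(candidates_for_shape(table, 5000, 128))
```

Resolving group_size only among entries that share the chosen hidden_size guarantees that the final key exists in the table.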

vLLM docs
docs.vllm.ai > en > latest > configuration > optimization

Optimization and Tuning

7+ mon, 1+ week ago  (1562+ words) This guide covers optimization strategies and performance tuning for vLLM V1. Running out of memory? Consult this guide on how to conserve memory. vLLM provides 4 optimization levels (-O0, -O1, -O2, -O3) that allow users to trade off startup time for performance: For more…

vLLM docs
docs.vllm.ai > en > latest > features > kv_offloading_usage

KV Offloading Usage Guide

6+ day, 21+ hour ago  (263+ words) The Offloading Connector currently supports CUDA, ROCm, and XPU only. Two specs are available, selected by the spec_name key in kv_connector_extra_config: Only the CPU primary tier has direct GPU access. Secondary tiers cannot read from or write to GPU memory; all GPU↔secondary…

Google News
docs.vllm.ai > en > latest > api > vllm > model_executor > layers > fused_qk_norm_rope

fused_qk_norm_rope - vLLM

1+ week, 2+ hour ago  (93+ words) Fused QK-RMSNorm + (partial) RoPE + gate copy Triton kernel. Currently used by the Qwen3.5 attention path (attn_output_gate with NeoX-style partial RoPE). The unfused reference sequence is split -> Gemma RMSNorm -> RoPE -> gate chunk; this collapses it into a single Triton…

vLLM docs
docs.vllm.ai > en > latest > api > vllm > model_executor > layers > fused_moe > prepare_finalize > deepep_v2

deepep_v2 - vLLM

1+ week, 6+ hour ago  (102+ words) Prepare/Finalize using DeepEP v2 Elastic Buffer (unified API). Supports two modes controlled by the use_cudagraph constructor arg: Decode mode (use_cudagraph=True): - do_expand=False, do_cpu_sync=False - Tokens returned in original order with recv_topk_idx (global IDs) - Worst-case tensor allocation; padding rows zeroed via…

vLLM docs
docs.vllm.ai > en > latest > api > vllm > model_executor > layers > fused_moe > routed_experts

routed_experts

1+ week, 6+ hour ago  (340+ words) Container for routed expert weights and execution logic. Compute the hidden dimension index from the shard (intermediate) dimension and tensor rank. For 2D weight tensors the two data dims are (0, 1). For 3D tensors with an expert dimension at dim 0, they…
