News
Mastering Agentic Techniques: AI Agent Customization
1+ hour, 15+ min ago (1518+ words) Autonomous AI agents are taking on all types of work for businesses: routing logistics fleets, triaging support tickets, generating code, and orchestrating multistep workflows. How do you take a general-purpose model and make it excel at your specific task? Customization…...
Add a Specialized Deep Research Skill to Agent Harnesses
23+ hour, 40+ min ago (852+ words) Teams building these agents must ground them in enterprise data, connecting data sources, routing queries, managing authentication, tuning prompts, evaluating outputs, and preserving source attribution. NVIDIA AI-Q packages this work into an open-source deep research blueprint that can be exposed…...
Mastering Agentic Techniques: AI Agent Evaluation
1+ day, 4+ hour ago (542+ words) This post explains the key differences between model and agent evaluation and walks through five practical tips for evaluating AI agents as production systems. This evaluation approach focuses on trajectories, tools, and outcomes, not just model scores. While model and agent…...
How the NVIDIA Vera Rubin Platform is Solving Agentic AI's Scale-Up Problem
6+ day, 4+ hour ago (151+ words) Agentic inference has fundamentally changed the runtime dynamics of inference workloads by introducing non-deterministic trajectories—actions, observations…...
Transform Video Into Instantly Searchable, Actionable Intelligence with AI Agents and Skills
1+ week, 1+ day ago (1291+ words) In this post, you will learn how to use the new VSS skills with coding agents to automate VSS deployment and integration into custom applications, followed by a deep dive into the technology behind VSS 3. Continue reading to learn how…...
How to Eliminate Pipeline Friction in AI Model Serving
1+ week, 2+ day ago (1238+ words) This post provides actionable best practices for eliminating the most common sources of friction in AI model serving pipelines. The results are concrete: APIs respond faster under real traffic. Each GPU carries more requests. Scaling up for peak hours is…...
Improving Bash Generation in Small Language Models with Grammar-Constrained Decoding
1+ week, 5+ day ago (984+ words) Constrained decoding is a technique that modifies the sampling process in autoregressive language model generation. At each generation step, the model produces logits as normal, but before a token is selected, a grammar is applied to change the distribution (often…...
Streaming Tokens and Tools: Multi-Turn Agentic Harness Support in NVIDIA Dynamo
1+ week, 5+ day ago (1302+ words) An agentic exchange must preserve a structured interaction: assistant turns interleave reasoning with one or more tool calls, and subsequent user turns return the corresponding tool results to the model context. Reasoning replay is model- and turn-dependent: some reasoning should…...
Achieving Peak System and Workload Efficiency on NVIDIA GB200 NVL72 with Slurm Block Scheduling
1+ week, 6+ day ago (1375+ words) NVIDIA GB200 NVL72 introduces a fundamentally new way to build GPU clusters by extending NVIDIA NVLink coherence across an entire rack. This design enables exascale performance, but it also changes the assumptions that many scheduling systems were built on. As a result,…...
Model Quantization: Post-Training Quantization Using NVIDIA Model Optimizer
1+ week, 6+ day ago (746+ words) Model quantization is an effective method to reduce VRAM usage and improve inference performance on consumer devices such as NVIDIA GeForce RTX GPUs. By lowering computational and memory requirements while preserving model quality, quantization helps AI models run more…...