<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.3.2">Jekyll</generator><link href="https://gatech-sysml.github.io/preview/pr-43/feed.xml" rel="self" type="application/atom+xml" /><link href="https://gatech-sysml.github.io/preview/pr-43/" rel="alternate" type="text/html" /><updated>2026-05-12T21:41:28+00:00</updated><id>https://gatech-sysml.github.io/preview/pr-43/feed.xml</id><title type="html">Systems for AI Lab</title><subtitle>The System for AI Lab (SAIL) at Georgia Tech, led by Prof. Alexey Tumanov, specializes in advancing systems support and resource management for machine learning (ML) to democratize large-scale AI systems. Our research encompasses the entire AI infrastructure stack, from foundational system design to the development of efficient ML training and inference algorithms. By focusing on managing the complete ML lifecycle, SAIL aims to enhance accessibility and efficiency in AI technologies.</subtitle><entry><title type="html">Agentic Workloads for Inference Evaluation</title><link href="https://gatech-sysml.github.io/preview/pr-43/2026/03/17/agentic-workloads.html" rel="alternate" type="text/html" title="Agentic Workloads for Inference Evaluation" /><published>2026-03-17T00:00:00+00:00</published><updated>2026-05-12T21:37:14+00:00</updated><id>https://gatech-sysml.github.io/preview/pr-43/2026/03/17/agentic-workloads</id><content type="html" xml:base="https://gatech-sysml.github.io/preview/pr-43/2026/03/17/agentic-workloads.html"><![CDATA[<p><strong>TL;DR.</strong> If your benchmark is a short chat loop, you may be measuring the
wrong workload regime. Agentic workloads turn one task into long-lived, branching,
bursty sessions with heavy prefix reuse, which reshapes cache behavior,
scheduler fairness, memory pressure, and even the fleet size you think you
need. This post shows how to model those workloads and benchmark them
reproducibly.</p>

<p>Experiment setup, results, OpenClaw telemetry, sessions and more are
available in the
<a href="https://github.com/chus-chus/blogpost_agentic_workloads">GitHub repo</a>.</p>

<h2 id="introduction">Introduction</h2>

<h3 id="the-evaluation-problem">The evaluation problem</h3>

<p>Inference systems have a large configuration space. New optimizations ship
quickly, and each one interacts with the others in ways that are hard to predict. To
measure the quality of a particular configuration, you benchmark it. To compare inference systems
against each other, you benchmark them under the same conditions.</p>

<p>Several popular inference benchmarking projects do this: nightly benchmarks across systems,
reporting which one is fastest or most efficient. <a href="#artificial-analysis">Artificial Analysis</a> and
<a href="#inferencemax">InferenceX</a> are useful examples of this benchmarking style. But what exactly do we mean when
declaring a system better or worse? An inference system might be
excellent at bursty, short-context workloads, while subpar at long-context ones.
Another might shine with quantized models on specific hardware. The
definition of “good” depends entirely on the workload. To test an inference system
before deployment, we need to understand how it behaves under representative workloads.</p>

<p>However, inference systems are complex software, so reasoning about how a workload interacts with all of the relevant components or optimizations<sup id="note-ref-1"><a href="#note-1">1</a></sup> is not straightforward. Benchmarks therefore simplify: the common policy is to run independent requests or, at most, linear conversations, where I send a message, the model responds, I append the response to the history, and send another. A few such conversations may also run in parallel.</p>

<p>Simple workloads are useful precisely because they are simple: they let us isolate variables. But they might not test the full range of interactions between inference-system components. We could say that simple workloads are unit tests, while agentic workloads are integration tests.</p>

<h3 id="the-workload-gap">The workload gap</h3>

<p>The result is a growing disconnect between what we benchmark, and therefore use as a reference,
and what actually runs in production. Many prominent
LLM applications today are agentic systems rather than the simple chatbots of the 2023 era. OpenClaw
and Claude Code run sessions with parallel tool calls, growing context and subagent
delegation (<a href="#openclaw-subagents">OpenClaw subagents</a>, <a href="#claude-code-subagents">Claude Code subagents</a>).
The workload they place on an inference system does not look like a short linear conversation or independent random requests.</p>

<p>IID requests and agentic sessions are different regimes for an inference system.
Agentic workloads stress prefix caching across long sessions, memory
management under bursty traffic, and scheduling fairness when
sessions have wildly different context sizes. None of this shows up in simple
benchmarks. Evaluating on the wrong workload might lead to wrong conclusions, and wrong
conclusions might cost real money.</p>

<h3 id="why-not-just-run-a-real-agent">Why not just run a real agent?</h3>

<p>So, when we want to fully evaluate the performance of an inference system, we do not just want to test it against the simple, traditional workloads, but also against agentic ones. Naively, one might think: just run OpenClaw against the inference system, give it some tasks, measure the timings. This does not work for rigorous evaluation, though:</p>

<ul>
  <li><strong>Reproducibility.</strong> LLM outputs are non-deterministic. The same task produces different tool-call sequences on different runs. The workload itself changes between experiments, making A/B comparisons impossible.</li>
  <li><strong>Control.</strong> It is hard to isolate variables. It is also hard to test scenarios that deviate from the simplest case, like what happens when fan-out increases from 2 to 8, or when think time between requests grows. With a real agent, you cannot control that. With a benchmark framework you can just change a parameter.</li>
  <li><strong>Instrumentation.</strong> A benchmark framework measures time to first token, time between tokens, cache hit rates, etc. at the right granularity, without instrumenting someone else’s code.</li>
</ul>

<p>What we want is to replicate the <em>shape</em> of agentic workloads: the structure,
the timing, and the distributions, without running an agent. That, in turn,
requires a benchmarking framework that generates or reads these workloads and allows you to measure what is needed.</p>

<h3 id="what-this-post-does">What this post does</h3>

<p>To build such a benchmark, we first need a description of agentic workloads that
is general enough to apply across implementations, but still precise enough to predict
what an inference system sees. We do not really care whether “the agent writes
code” or “searches the web”, but rather about the trace model: a graph of
inference requests, their input and output lengths, how much prefix each
request shares with the previous one, and how much time passes between
dependent requests.</p>

<p>I use OpenClaw as a reference, an open source agentic system with broad adoption (and
hype)<sup id="note-ref-2"><a href="#note-2">2</a></sup>.
The principles I extract apply to many agentic systems like Claude
Code, because they roughly share the same high-level execution patterns of tool use,
result appending, and subagent delegation (see <a href="#claude-code-subagents">Claude Code subagents</a>).</p>

<p>Each principle corresponds to one part of this statistical description:
request-graph topology (length and branching), prefix reuse, input and output-length heterogeneity,
and inter-request timing. For each, I connect the trace statistic to the inference system
in two ways. First-order consequences are direct changes in work: more
prefills, more decode tokens, larger fresh token tails, or more concurrent
sessions. Second-order consequences are what those first-order changes might do to
cache retention, scheduling, fairness, batching, and memory pressure. I aim to present:</p>

<ol>
  <li>How to roughly describe agentic traces statistically: as session graphs plus distributions over token counts, waits, shared prefixes, and branching.</li>
  <li>How to replicate them in a benchmark: by measuring those distributions from real traces and generating matching synthetic sessions.</li>
  <li>Why it matters: inference systems behave differently under agentic load, and evaluating on the wrong workload might not give you the full picture.</li>
</ol>

<h2 id="prerequisites">Prerequisites</h2>

<p>Before we get into OpenClaw, let us set up our shared vocabulary and execution
model. If you work with inference systems, this will be familiar; if you do not,
this will be a quick overview.</p>

<h3 id="inference-basics">Inference basics</h3>

<p>An inference system handles user requests. When a request arrives, the system
computes and saves the keys and values of the input tokens: the prefill.
Then, it generates output tokens one at a time: the decode. In practice, the
system serves many concurrent requests, juggling scheduling, CPU/GPU overlap, and memory.
All optimizations on top of this, like advanced KV-cache policies, chunking, prefill-decode disaggregation, speculative decoding, and more, are generally strategies to make the prefill, decode, or both faster or more efficient under concurrency (<a href="#orca">Orca</a>, <a href="#distserve">DistServe</a>, <a href="#sarathi-serve">Sarathi-Serve</a>, <a href="#pagedattention">PagedAttention</a>).</p>

<h3 id="measuring-inference-performance">Measuring inference performance</h3>

<p>When we evaluate an inference system, we care about how fast it does prefills
and decodes across a workload.<sup id="note-ref-3"><a href="#note-3">3</a></sup> The core metrics are:</p>

<ul>
  <li><strong>TTFT</strong>: time to first token. How long until the first output token arrives after submitting a request. Measures prefill speed.</li>
  <li><strong>TBT</strong>: time between tokens. The interval between consecutive output tokens. Measures decode speed.</li>
  <li><strong>TPOT</strong>: time per output token. Mean TBT across a request.</li>
  <li><strong>E2E latency</strong>: total time from request submission to last output token.</li>
  <li><strong>Throughput (token or request)</strong>: it may mean system-level tokens/s or requests/s. Some also quote a per-request token rate. In this post I name the specific quantity each time.</li>
</ul>
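<p>The per-request metrics above can all be derived from the arrival timestamps of a streamed response. The following is a minimal sketch, assuming timestamps in seconds and one timestamp per output token (or per chunk, if the system streams chunks). The function name <code class="language-plaintext highlighter-rouge">stream_metrics</code> and the example timestamps are illustrative and do not come from any particular benchmarking tool.</p>

```python
from statistics import mean


def stream_metrics(submit_time, token_times):
    """Derive per-request latency metrics from token arrival timestamps (seconds)."""
    if not token_times:
        raise ValueError("need at least one output token")
    # TTFT: delay between submitting the request and the first output token
    ttft = token_times[0] - submit_time
    # TBT: gap between each pair of consecutive output tokens
    tbt = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    # TPOT: mean TBT across the request (0.0 when only one token was produced)
    tpot = mean(tbt) if tbt else 0.0
    # E2E latency: request submission to last output token
    e2e = token_times[-1] - submit_time
    return {"ttft": ttft, "tbt": tbt, "tpot": tpot, "e2e": e2e}


# Hypothetical request submitted at t=0 that streams four tokens
metrics = stream_metrics(0.0, [0.25, 0.30, 0.35, 0.45])
print(metrics)
```

<p>System-level throughput is not a per-request quantity: it would be computed by counting tokens or completed requests across all sessions over a wall-clock window.</p>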

<p>Everything else, like cost per token or energy per token, derives from these plus
hardware and pricing data. We can measure TTFT and TBT because modern inference
APIs stream tokens back, so we can observe each one as it arrives. Some systems
batch output tokens into chunks for efficiency, so for system-agnostic evaluation we prefer TTFC (time to
first chunk) and TBC (time between chunks)<sup id="note-ref-4"><a href="#note-4">4</a></sup>. These capture the same idea at a slightly
coarser granularity (see, for example, <a href="#openai-streaming">OpenAI streaming responses</a>).</p>

<h3 id="what-is-a-session">What is a session?</h3>

<p>Throughout this post, I use the term <em>session</em> rather than <em>conversation</em>. A
conversation is always linear: user, assistant, user, assistant. A session is
the generalization: a graph of requests with dependencies.</p>

<p>In this framework, an independent request is a session with one node. A linear conversation is a
session where nodes form a chain, one after the other. An agentic session can be a chain too, but
it can also be a DAG: when an agent spawns subagents, each runs its own chain of inference requests
in parallel, as in the simplified DAG shown later.</p>
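<p>One way to make this definition concrete is a small data structure. This is a minimal sketch, not the representation used by any specific framework: the class names <code class="language-plaintext highlighter-rouge">Request</code> and <code class="language-plaintext highlighter-rouge">Session</code>, their fields, and the token counts are all hypothetical.</p>

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    """One inference request: a node in the session graph."""
    request_id: int
    input_tokens: int
    output_tokens: int
    parents: list = field(default_factory=list)  # ids of requests this one waits for


@dataclass
class Session:
    """A graph of requests; edges point from a request to its parents."""
    requests: dict = field(default_factory=dict)

    def add(self, request):
        # Parents must already exist, which keeps the graph acyclic
        for parent in request.parents:
            if parent not in self.requests:
                raise KeyError(f"unknown parent {parent}")
        self.requests[request.request_id] = request

    def depth(self):
        """Length of the longest dependency chain, counted in requests."""
        longest = {}
        for rid, req in self.requests.items():  # insertion order puts parents first
            longest[rid] = 1 + max((longest[p] for p in req.parents), default=0)
        return max(longest.values(), default=0)

    def is_chain(self):
        """True when the requests form a single path with no branches or joins."""
        children = {rid: 0 for rid in self.requests}
        for req in self.requests.values():
            if len(req.parents) > 1:
                return False
            for parent in req.parents:
                children[parent] += 1
        roots = [r for r in self.requests.values() if not r.parents]
        return len(roots) == 1 and all(count in (0, 1) for count in children.values())


# An independent request: a session with one node
independent = Session()
independent.add(Request(0, 500, 200))

# A linear conversation: each request depends on the previous one
conversation = Session()
conversation.add(Request(0, 500, 200))
conversation.add(Request(1, 1200, 200, parents=[0]))
conversation.add(Request(2, 1900, 200, parents=[1]))

# An agentic DAG: the parent delegates to two subagents and resumes with both results
agentic = Session()
agentic.add(Request(0, 500, 200))
agentic.add(Request(1, 800, 150, parents=[0]))      # subagent A, step 1
agentic.add(Request(2, 1100, 150, parents=[1]))     # subagent A, step 2
agentic.add(Request(3, 800, 150, parents=[0]))      # subagent B, step 1
agentic.add(Request(4, 2000, 300, parents=[2, 3]))  # parent continues

print(independent.depth(), conversation.depth(), agentic.depth())
```

<p>In this sketch only the agentic session fails the <code class="language-plaintext highlighter-rouge">is_chain</code> check, because request 0 has two children and request 4 has two parents.</p>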

<h3 id="the-agentic-loop">The agentic loop</h3>

<p>Imagine we have an OpenClaw instance running. We are communicating with it via
our preferred interface, having a back-and-forth conversation with it, and we
ask it to perform a particular task. When the user message arrives, OpenClaw
starts a loop made up of several stages, which we can roughly categorize
as Compose, Infer, Check, Execute, and Append. First, it composes the new
message from history plus new context and sends it to the LLM for inference. Then it
checks whether the LLM response is a final answer or a request to call tools
first. For example, the model might need to read a file, run a command, or search the
web. If so, the system executes the tool calls concurrently and appends all
results to history. This loop usually repeats until the model decides it has
completed the user’s request.</p>

<p><em>A typical agentic loop.</em></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">asyncio</span>

<span class="k">async</span> <span class="k">def</span> <span class="nf">prompt</span><span class="p">(</span><span class="n">user_input</span><span class="p">,</span> <span class="n">session</span><span class="p">):</span>
  <span class="n">session</span><span class="p">.</span><span class="n">history</span><span class="p">.</span><span class="n">append</span><span class="p">({</span><span class="s">"role"</span><span class="p">:</span> <span class="s">"user"</span><span class="p">,</span> <span class="s">"content"</span><span class="p">:</span> <span class="n">user_input</span><span class="p">})</span>

  <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
    <span class="c1"># One LLM inference request
</span>    <span class="n">response</span> <span class="o">=</span> <span class="n">call_llm</span><span class="p">(</span>
      <span class="n">system</span><span class="o">=</span><span class="n">session</span><span class="p">.</span><span class="n">system_prompt</span><span class="p">,</span>
      <span class="n">tools</span><span class="o">=</span><span class="n">session</span><span class="p">.</span><span class="n">tool_definitions</span><span class="p">,</span>
      <span class="n">messages</span><span class="o">=</span><span class="n">session</span><span class="p">.</span><span class="n">history</span><span class="p">,</span>
    <span class="p">)</span>
    <span class="n">session</span><span class="p">.</span><span class="n">history</span><span class="p">.</span><span class="n">append</span><span class="p">({</span><span class="s">"role"</span><span class="p">:</span> <span class="s">"assistant"</span><span class="p">,</span> <span class="s">"content"</span><span class="p">:</span> <span class="n">response</span><span class="p">.</span><span class="n">content</span><span class="p">})</span>

    <span class="k">if</span> <span class="n">response</span><span class="p">.</span><span class="n">stop_reason</span> <span class="o">==</span> <span class="s">"end_turn"</span><span class="p">:</span>
      <span class="k">return</span> <span class="n">response</span>

    <span class="c1"># Make tool calls concurrently (execute_tool is assumed to be a coroutine), then append results
</span>    <span class="n">results</span> <span class="o">=</span> <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">gather</span><span class="p">(</span><span class="o">*</span><span class="p">(</span>
      <span class="n">execute_tool</span><span class="p">(</span><span class="n">tc</span><span class="p">.</span><span class="n">name</span><span class="p">,</span> <span class="n">tc</span><span class="p">.</span><span class="n">arguments</span><span class="p">)</span> <span class="k">for</span> <span class="n">tc</span> <span class="ow">in</span> <span class="n">response</span><span class="p">.</span><span class="n">tool_calls</span>
    <span class="p">))</span>
    <span class="k">for</span> <span class="n">tool_call</span><span class="p">,</span> <span class="n">result</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">response</span><span class="p">.</span><span class="n">tool_calls</span><span class="p">,</span> <span class="n">results</span><span class="p">):</span>
      <span class="n">session</span><span class="p">.</span><span class="n">history</span><span class="p">.</span><span class="n">append</span><span class="p">({</span>
          <span class="s">"role"</span><span class="p">:</span> <span class="s">"tool_result"</span><span class="p">,</span>
          <span class="s">"tool_use_id"</span><span class="p">:</span> <span class="n">tool_call</span><span class="p">.</span><span class="nb">id</span><span class="p">,</span>
          <span class="s">"content"</span><span class="p">:</span> <span class="n">result</span><span class="p">.</span><span class="n">content</span><span class="p">,</span>
      <span class="p">})</span>
    <span class="c1"># Loop back. Next iteration includes all tool results
</span></code></pre></div></div>

<p>So, each iteration of this loop is one inference request. An important
observation is that most tool calls (file reads, shell commands) do not involve
any LLM calls: they execute locally and return text. Some tools do call models
(image generation, LLM-backed web search), but those typically hit external
APIs, not the inference system serving the main agent.<sup id="note-ref-5"><a href="#note-5">5</a></sup> This means a single session without subagents produces
a linear chain of requests to the inference system under evaluation.</p>

<p>On top of this core loop, OpenClaw also has an outer loop that handles
infrastructure events like context overflow. This outer loop does not
appreciably change the steady state of the workload.</p>

<p>These are the core mechanics of the agentic loop. In the next section, I treat
the properties induced by this loop as trace statistics.</p>

<h2 id="an-agentic-workload">An agentic workload</h2>

<p>Now that we know why agentic evaluations are important, and are familiar with
the basics of inference evaluation, what a session is, and how the agentic loop
works, we can start characterizing an agentic workload.</p>

<p>For the purposes of inference evaluation, an agentic workload is a set of
session graphs. Each graph represents inference requests as nodes and
dependencies as edges. Each node
carries quantities such as the number of input and output tokens, and
each edge carries a delay and a history inheritance relationship. A
benchmark does not need to replay exact tool semantics; it needs to reproduce
the distributions of these quantities. The principles below are the dominant
terms in that description. I extract them from real OpenClaw<sup id="note-ref-6"><a href="#note-6">6</a></sup> telemetry based
on real sessions.</p>

<h3 id="request-expansion">Request expansion</h3>

<p>The agentic loop in the pseudocode above already gives us the first principle: one user task expands into
a sequence of dependent inference requests. Most tool calls do not add requests
directly to the evaluated inference system, but they do trigger another LLM
call once their results are appended to history. A think+act+observe
cycle can therefore turn one human request into many inference requests.</p>

<p>Statistically, the quantity we care about is not “user turns” but the
distribution of inference requests per user task, or equivalently the depth of
these dependency chains. In the trace underlying Case study 1, 3 user interventions
expand into 130 inference requests, which is at the low-to-medium end of what agents produce today.
I expect agentic loops to soon generate thousands to
tens of thousands of inference requests per user task.</p>

<p>The first-order consequence is that end-to-end latency compounds across many
sequential prefills and decodes, not one. The second-order consequence is that
the scheduler sees long-lived dependent chains rather than IID requests, which
changes batching opportunities and how long session state remains live.</p>
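<p>The compounding is easy to see with a little arithmetic. In the sketch below, the 130 requests and 3 user interventions come from the trace discussed in this post. Everything else is a hypothetical placeholder: the per-step TTFT, output length, wait time, and TPOT are round numbers chosen for illustration, and decode time is approximated as output tokens times TPOT.</p>

```python
def expansion_factor(inference_requests, user_interventions):
    """Average number of inference requests generated per user intervention."""
    return inference_requests / user_interventions


def chain_latency(steps, tpot):
    """Total latency of a dependent chain, where each step must finish before the next starts.

    Each step is a tuple (ttft_seconds, output_tokens, wait_seconds).
    """
    return sum(ttft + output_tokens * tpot + wait for ttft, output_tokens, wait in steps)


factor = expansion_factor(130, 3)

# Hypothetical steady state: 0.2 s TTFT, 200 output tokens, 30 ms wait, 10 ms TPOT
steps = [(0.2, 200, 0.03)] * 130
total = chain_latency(steps, tpot=0.01)

print(f"expansion factor: {factor:.1f}x, chain latency: {total:.1f} s")
```

<p>Under these assumed numbers a step that takes a little over two seconds in isolation adds up to several minutes across the chain, which is why per-request latency alone understates what the user experiences.</p>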

<h3 id="stateful-prefix-reuse">Stateful prefix reuse</h3>

<p>In the pseudocode above, we can see how every inference request includes the full
conversation history. The model needs to see everything that happened before to
produce a coherent next step: all user messages, assistant responses, tool
results and other content. Implementation-wise, this produces monotonic context
growth. Statistically, the more fundamental quantity is prefix overlap between
consecutive requests.</p>

<p>Each request in the chain is therefore strictly larger than the previous
one: turn N’s input contains everything from turns 1 through N-1, plus
whatever new context was added. Every request is assembled as
follows (approximate numbers for Pi agents):</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[system_prompt]      order of 10^4 tokens
[tool_definitions]   order of 10^3 to 10^4 tokens
[message_history]    grows with each turn
[new_input]          latest user message, tool results and other content
</code></pre></div></div>

<p>We can write an approximation of the input token count <code class="language-plaintext highlighter-rouge">n</code> for turn N as:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>n_N = S + D + sum_{i=1}^{N} (U_i + A_i + R_i + X_i)
</code></pre></div></div>

<p>where <code class="language-plaintext highlighter-rouge">S</code> is the system prompt, <code class="language-plaintext highlighter-rouge">D</code> the tool definitions, <code class="language-plaintext highlighter-rouge">U_i</code> the user input
at turn <code class="language-plaintext highlighter-rouge">i</code>, <code class="language-plaintext highlighter-rouge">A_i</code> the assistant response, <code class="language-plaintext highlighter-rouge">R_i</code> the tool results (zero if no
tools were called that turn), and <code class="language-plaintext highlighter-rouge">X_i</code> other content such as injected or
synthetic history turns<sup id="note-ref-7"><a href="#note-7">7</a></sup>. The key observation is not only that input grows,
but that request <code class="language-plaintext highlighter-rouge">N</code> and request <code class="language-plaintext highlighter-rouge">N+1</code> share almost all of their tokens as <code class="language-plaintext highlighter-rouge">N</code> increases.</p>
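<p>To see how quickly the shared fraction approaches one, the approximation above can be evaluated directly. This is a minimal sketch with hypothetical values: a 10,000-token system prompt, 5,000 tokens of tool definitions, and a constant 700 new tokens per turn standing in for <code class="language-plaintext highlighter-rouge">U_i + A_i + R_i + X_i</code>. It also assumes history is append-only, so request <code class="language-plaintext highlighter-rouge">N</code> is a strict prefix of request <code class="language-plaintext highlighter-rouge">N+1</code>.</p>

```python
from itertools import accumulate


def input_tokens_per_turn(system, tools, new_tokens_per_turn):
    """n_N = S + D + sum_{i=1..N} new_i, with new_i = U_i + A_i + R_i + X_i."""
    return [system + tools + running for running in accumulate(new_tokens_per_turn)]


def prefix_overlap(sizes):
    """Fraction of request N+1 that was already present in request N."""
    return [previous / current for previous, current in zip(sizes, sizes[1:])]


# Hypothetical session: S = 10k, D = 5k, 700 new tokens on each of 101 turns
sizes = input_tokens_per_turn(10_000, 5_000, [700] * 101)
overlap = prefix_overlap(sizes)

print(f"turn 1 -> 2:     {overlap[0]:.3f}")
print(f"turn 100 -> 101: {overlap[99]:.3f}")
```

<p>With these assumed numbers the overlap starts at roughly 96% and exceeds 99% by turn 100. A larger scaffold or smaller per-turn additions push it higher still.</p>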

<figure class="figure">
  <a class="figure-image" aria-label="Illustrative context growth over turns. For a single user intervention at turn 1, context grows until +250k. Model output, tool results and other events are accumulated in history, accounting for the majority of the context.">
    <img src="/preview/pr-43/images/posts/agentic-workloads/context_growth_over_turns_opaque.png" style="
        width: 760px;
        max-height: unset;
      " alt="Illustrative context growth over turns. For a single user intervention at turn 1, context grows until +250k. Model output, tool results and other events are accumulated in history, accounting for the majority of the context." loading="lazy" onerror="this.src = '/preview/pr-43/images/fallback.svg'; this.onerror = null;" />
  </a>
  
    <figcaption class="figure-caption">
      Illustrative context growth over turns. For a single user intervention at turn 1, context grows until +250k. Model output, tool results and other events are accumulated in history, accounting for the majority of the context.

    </figcaption>
  
</figure>

<h4 id="why-this-matters-for-the-inference-system">Why this matters for the inference system</h4>

<p>This is arguably the dominant property of agentic workloads. The prefix
overlap<sup id="note-ref-8"><a href="#note-8">8</a></sup> between consecutive requests is very
large, and it can reach 90-99% of the input. As the context horizon of LLMs
grows, this overlap will tend toward 100%.</p>

<p>On top of this, there is the <strong>constant scaffold</strong> formed by the system prompt and tool
definitions. This is an extremely high-value target for caching, as in most cases
it is repeated across all requests and sessions.</p>

<p>The first-order consequence is on prefill work. An inference system that
exploits this via prefix caching<sup id="note-ref-9"><a href="#note-9">9</a></sup> only needs to prefill the new tokens at each turn
(see <a href="#vllm-prefix-caching">vLLM prefix caching</a>).
One that does not exploit it recomputes the entire growing history from
scratch.</p>
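<p>The size of this gap can be estimated by counting prefilled tokens over a whole chain. The sketch below is an idealized model, not a description of how vLLM or any other system implements prefix caching: it assumes each prompt extends the previous one, that cached prefixes are never evicted, and that caching works at token granularity. The prompt sizes are hypothetical (a 15,000-token scaffold plus 700 new tokens per turn over 130 turns).</p>

```python
def total_prefill_tokens(prompt_sizes, prefix_cache):
    """Count tokens that must be prefilled across a chain of growing prompts."""
    if not prefix_cache:
        # Every request recomputes its full prompt from scratch
        return sum(prompt_sizes)
    total, cached = 0, 0
    for size in prompt_sizes:
        # Only the tokens beyond the previously cached prefix are new work
        total += size - cached
        cached = size
    return total


# Hypothetical chain: 130 turns, each adding 700 tokens on top of a 15k scaffold
prompt_sizes = [15_000 + 700 * turn for turn in range(1, 131)]

without_cache = total_prefill_tokens(prompt_sizes, prefix_cache=False)
with_cache = total_prefill_tokens(prompt_sizes, prefix_cache=True)

print(without_cache, with_cache, round(without_cache / with_cache, 1))
```

<p>In this idealized setting the uncached system prefills about 7.9 million tokens and the cached one about 106 thousand, a ratio of roughly 75x. Without caching the total grows quadratically with chain depth, while with caching it grows linearly.</p>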

<p>The second-order consequence is on cache policy. Once the workload is dominated by prefix reuse,
cache placement, eviction, and offloading strategies start to dominate TTFC differences between
systems that both use prefix caching and serve the same model. If
you evaluate an inference system on independent requests, where there is no
prefix to cache, you never observe this regime.</p>

<h4 id="case-study-1-multi-turn-sessions">Case study 1: multi-turn sessions</h4>

<p>To make the workload regime concrete, we start with a simple first
case study: one real multi-turn coding trace and a synthetic workload derived
from it. The goal here is to establish the basic prefix-reuse regime
that the later case studies build on.</p>

<ol>
  <li>First, I ask OpenClaw 26.3.2 with GPT-5.1-Codex-Mini to implement a web app for interactive exploration of LLMs via interpretability methods. I cap the total inference time at about 15 minutes.<sup id="note-ref-10"><a href="#note-10">10</a></sup></li>
  <li>Then, I measure statistical properties of the resulting trace: token counts, timings, prefix reuse, etc.</li>
  <li>Next, I generate a synthetic workload from the trace that mimics the agentic pattern just described.</li>
  <li>Finally, I compare that workload with and without prefix caching.</li>
</ol>

<p>To run and measure all benchmarks, I use <a href="https://github.com/project-vajra/veeksha">Veeksha</a>
v0.2.2, the open source benchmarking framework for LLM inference systems we developed. It supports sessions as graphs of requests with dependencies, configurable
timings, prefix caching simulations, replaying real-world workloads,
microbenchmarks, and more.</p>

<p><strong>Trace analysis</strong></p>

<p>When the agent is stopped, we obtain an OpenClaw trace that looks like this:</p>

<ul>
  <li>1 linear chain of inference requests</li>
  <li>130 requests in total generated from 3 user interactions, an expansion factor of roughly 43x</li>
  <li>Median fresh input and output length of 490 and 214 tokens, respectively.<sup id="note-ref-11"><a href="#note-11">11</a></sup> Every pair of requests is roughly the model deciding to call a tool and then observing the result. Interestingly, we do not see the model deciding to call a batch of tools at once.</li>
  <li>A median waiting time of 32 ms between requests, measured after the previous request finished (the mean is 6 s, skewed by 2 slow interventions)</li>
  <li>A used context length of 117k tokens</li>
  <li>A total of 8.2 million token cache reads</li>
</ul>

<p><strong>The synthetic workload</strong></p>

<p>We now have the first empirical parameters of the trace: chain depth,
per-request token counts, wait times, and prefix reuse. Let us now measure the
actual inference performance numbers with similar sessions. Here is the
approximate configuration for the multi-turn workload. We set the
parameters to approximate the trace characteristics above based on the medians.
Take a moment to read it, as it will help you understand the workload and the
rest of the experiments.</p>

<p><em>Synthetic workload configuration for the multi-turn sessions with prefix caching.</em></p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Q: how are sessions generated?</span>
<span class="na">session_generator</span><span class="pi">:</span>
  <span class="na">type</span><span class="pi">:</span> <span class="s">synthetic</span> <span class="c1"># synthetically, with a linear (chain) shape</span>
  <span class="na">session_graph</span><span class="pi">:</span>
    <span class="na">type</span><span class="pi">:</span> <span class="s">linear</span>
    <span class="na">num_request_generator</span><span class="pi">:</span> <span class="c1"># each session has between 110 and 150 turns</span>
      <span class="na">type</span><span class="pi">:</span> <span class="s">uniform</span>
      <span class="na">min</span><span class="pi">:</span> <span class="m">110</span>
      <span class="na">max</span><span class="pi">:</span> <span class="m">150</span>
    <span class="na">request_wait_generator</span><span class="pi">:</span>
      <span class="na">type</span><span class="pi">:</span> <span class="s">poisson</span>
      <span class="na">arrival_rate</span><span class="pi">:</span> <span class="m">31.25</span> <span class="c1"># turns wait a mean of 32ms before being dispatched</span>
  <span class="na">channels</span><span class="pi">:</span> <span class="c1"># each turn introduces between 400 and 600 new tokens from the user...</span>
    <span class="pi">-</span> <span class="na">type</span><span class="pi">:</span> <span class="s">text</span>
      <span class="na">body_length_generator</span><span class="pi">:</span>
        <span class="na">type</span><span class="pi">:</span> <span class="s">uniform</span>
        <span class="na">min</span><span class="pi">:</span> <span class="m">400</span>
        <span class="na">max</span><span class="pi">:</span> <span class="m">600</span>
  <span class="na">output_spec</span><span class="pi">:</span> <span class="c1"># ... and 150 to 250 new tokens from the assistant</span>
    <span class="na">text</span><span class="pi">:</span>
      <span class="na">output_length_generator</span><span class="pi">:</span>
        <span class="na">type</span><span class="pi">:</span> <span class="s">uniform</span>
        <span class="na">min</span><span class="pi">:</span> <span class="m">150</span>
        <span class="na">max</span><span class="pi">:</span> <span class="m">250</span>

<span class="c1"># Q: how are sessions dispatched to the inference system?</span>
<span class="na">traffic_scheduler</span><span class="pi">:</span>
  <span class="na">type</span><span class="pi">:</span> <span class="s">concurrent</span> <span class="c1"># with a concurrency-based scheduler...</span>
  <span class="na">target_concurrent_sessions</span><span class="pi">:</span> <span class="m">1</span> <span class="c1"># ...that allows one session at a time</span>

<span class="c1"># We dispatch 5 sessions in total</span>
<span class="na">runtime</span><span class="pi">:</span>
  <span class="na">max_sessions</span><span class="pi">:</span> <span class="m">5</span>

<span class="na">seed</span><span class="pi">:</span> <span class="m">77</span>
</code></pre></div></div>
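<p>To make the configuration easier to read, here is a rough Python sketch of what sampling one such session could look like. It is not Veeksha's implementation. It assumes that the <code class="language-plaintext highlighter-rouge">poisson</code> wait generator draws exponentially distributed waits with rate 31.25 per second, that uniform bounds are inclusive, and that each prompt is the full previous context plus the new input. The function name <code class="language-plaintext highlighter-rouge">sample_session</code> and the dictionary keys are hypothetical, and the random stream will not match the framework's even with the same seed.</p>

```python
import random


def sample_session(rng):
    """Draw one linear session from the distributions in the configuration above."""
    turns = rng.randint(110, 150)  # num_request_generator: uniform, inclusive bounds assumed
    context = 0                    # tokens accumulated in history so far
    requests = []
    for _ in range(turns):
        wait_s = rng.expovariate(31.25)    # mean wait of 1 / 31.25 = 0.032 s
        new_input = rng.randint(400, 600)  # fresh input tokens this turn
        output = rng.randint(150, 250)     # tokens generated by the assistant
        prompt = context + new_input
        requests.append({
            "wait_s": wait_s,
            "prompt_tokens": prompt,
            "cached_prefix": context,      # what an ideal prefix cache could reuse
            "output_tokens": output,
        })
        context = prompt + output          # the output joins the history for the next turn
    return requests


rng = random.Random(77)
sessions = [sample_session(rng) for _ in range(5)]  # max_sessions: 5

for session in sessions:
    print(len(session), session[-1]["prompt_tokens"])
```

<p>Using the means of these distributions (130 turns, 500 new input tokens, 200 output tokens), the final prompt of a session lands at roughly 90k tokens under these assumptions.</p>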

<p>I run the above workload independently against two Qwen3.5-35B-A3B
replicas (thinking disabled), each one running on vLLM 0.17.1 and an H100 GPU. Replica A uses the default
prefix cache configuration, while replica B has it disabled.</p>

<p><strong>What is happening?</strong></p>

<ul>
  <li>With the default prefix cache, TTFC stays in the sub-second regime even as the session approaches 100k prompt tokens.</li>
  <li>With prefix caching disabled, the full prompt has to be recomputed on every turn, so TTFC rises into the multi-second regime as context grows.</li>
</ul>

<figure class="figure">
  <a class="figure-image" aria-label="TTFC versus total prompt tokens for the same synthetic requests (log scale on the y-axis). With the default prefix cache, median TTFC grows from roughly 60 ms to 0.33 s across the session. With prefix caching disabled, median TTFC rises from roughly 0.1 s to about 10 s by the time the prompt reaches 100k tokens.">
    <img src="/preview/pr-43/images/posts/agentic-workloads/case_study_1_ttfc_scaling_opaque.png" style="
        width: 640px;
        max-height: unset;
      " alt="TTFC versus total prompt tokens for the same synthetic requests (log scale on the y-axis). With the default prefix cache, median TTFC grows from roughly 60 ms to 0.33 s across the session. With prefix caching disabled, median TTFC rises from roughly 0.1 s to about 10 s by the time the prompt reaches 100k tokens." loading="lazy" onerror="this.src = '/preview/pr-43/images/fallback.svg'; this.onerror = null;" />
  </a>
  
    <figcaption class="figure-caption">
      TTFC versus total prompt tokens for the same synthetic requests (log scale on the y-axis). With the default prefix cache, median TTFC grows from roughly 60 ms to 0.33 s across the session. With prefix caching disabled, median TTFC rises from roughly 0.1 s to about 10 s by the time the prompt reaches 100k tokens.

    </figcaption>
  
</figure>
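<p>A rough way to see why the two curves diverge is to count how many tokens have to be prefilled on each turn. The sketch below is only an illustration under strong assumptions: an idealized cache that never evicts, model outputs folded into the per-turn fresh tokens, and a hypothetical session of 100 turns with 1000 new tokens each. The function name is made up for this example.</p>

```python
from itertools import accumulate


def prefill_tokens_per_turn(fresh_tokens, prefix_cache=True):
    """Tokens the engine must prefill on each turn of one linear session.

    Idealized model: with caching on, the whole previous prompt is a cache
    hit; with caching off, the full prompt is recomputed on every turn.
    """
    if prefix_cache:
        return list(fresh_tokens)
    return list(accumulate(fresh_tokens))


# Hypothetical session: 100 turns, each adding 1000 new tokens (100k total).
turns = [1000] * 100
cached = prefill_tokens_per_turn(turns, prefix_cache=True)
uncached = prefill_tokens_per_turn(turns, prefix_cache=False)

print(cached[-1], uncached[-1])    # 1000 100000
print(sum(cached), sum(uncached))  # 100000 5050000
```

<p>Under these assumptions the uncached session prefills about 50 times more tokens in total. The measured cached curve still grows with context length, so fresh tokens are clearly not the only cost; the sketch only captures the first-order gap.</p>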

<p><strong>Takeaway</strong></p>

<p>This first case study mainly establishes the regime. Agentic workloads are
long-lived, stateful traces with very high prefix reuse, so cache handling
quickly becomes a dominant factor in latency. Once that basic effect is in
place, the more interesting questions are the ones taken up in the next
sections: how token heterogeneity, bursty timing, and branching change the
behavior of systems that already exploit prefix reuse.</p>

<h3 id="token-count-heterogeneity">Token-count heterogeneity</h3>

<p>Prefix reuse does not mean the amount of new work per step is constant. At any
point in an agentic loop, the model might append a small memory lookup, a medium
shell output, a huge file read, or a large batch of tool results. There are
other context injection events too, like subagents sending summaries and
artifacts to the parent agent. Similarly, the distribution of output tokens in
an agentic workload is dictated by a variety of events. Many inference requests
return small messages, where the model selects tools or acknowledges results;
these usually stem from intermediate control events in the agentic loop above. Other requests generate larger answers:
turn ends, where the model modifies artifacts or responds to the user, and context overflows, where the model needs to
compact the full history.</p>

<p>Statistically, the quantities that matter are the incremental input size between
consecutive requests, that is, the number of fresh, non-cached tokens added on
top of the shared prefix, and the number of output tokens generated per request.</p>
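<p>As a minimal sketch of the first quantity, the fresh input size can be computed as the part of a prompt that is not covered by the longest common prefix with the previous request. The token ids below are hypothetical, and a real engine may match against any cached prefix rather than only the previous request of the session.</p>

```python
def shared_prefix_len(prev_tokens, next_tokens):
    """Length of the longest common prefix of two token-id sequences."""
    n = 0
    for a, b in zip(prev_tokens, next_tokens):
        if a != b:
            break
        n += 1
    return n


def fresh_input_tokens(prev_tokens, next_tokens):
    """Tokens of the next prompt that are not covered by the shared prefix."""
    return len(next_tokens) - shared_prefix_len(prev_tokens, next_tokens)


# Hypothetical token ids for three consecutive requests of one session.
request_1 = [1, 2, 3, 4]
request_2 = [1, 2, 3, 4, 9, 9, 7]  # appends three tokens to the history
request_3 = [1, 2, 8, 8]           # diverges after two tokens

print(fresh_input_tokens(request_1, request_2))  # 3
print(fresh_input_tokens(request_2, request_3))  # 2
```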

<p>In real agentic traces both distributions are broad and usually heavy-tailed,
as shown in the two figures below.
Most steps add a modest number of tokens and generate short outputs, but a small
number create very large bursts.</p>

<figure class="figure">
  <a class="figure-image" aria-label="Empirical vs fitted distributions of new input tokens for the trace in Case study 1. Cropped to the 95th percentile (the max value is around 20000 tokens). Most steps add just a few new input tokens, but a small number add very large bursts. I tested lognormal, Weibull, gamma, exponential, Pareto, normal and inverse Gaussian distributions, and found that the inverse Gaussian fits best.">
    <img src="/preview/pr-43/images/posts/agentic-workloads/new_tokens_fit_p95_linear_opaque.png" style="
        width: 589px;
        max-height: unset;
      " alt="Empirical vs fitted distributions of new input tokens for the trace in Case study 1. Cropped to the 95th percentile (the max value is around 20000 tokens). Most steps add just a few new input tokens, but a small number add very large bursts. I tested lognormal, Weibull, gamma, exponential, Pareto, normal and inverse Gaussian distributions, and found that the inverse Gaussian fits best." loading="lazy" onerror="this.src = '/preview/pr-43/images/fallback.svg'; this.onerror = null;" />
  </a>
  
    <figcaption class="figure-caption">
      Empirical vs fitted distributions of new input tokens for the trace in Case study 1. Cropped to the 95th percentile (the max value is around 20000 tokens). Most steps add just a few new input tokens, but a small number add very large bursts. I tested lognormal, Weibull, gamma, exponential, Pareto, normal and inverse Gaussian distributions, and found that the inverse Gaussian fits best.

    </figcaption>
  
</figure>

<p>Performing the same fitting experiment on output tokens also yields the inverse Gaussian as
the best fit for our empirical data in the multi-turn workload in Case study 1.</p>

<figure class="figure">
  <a class="figure-image" aria-label="Empirical vs fitted distributions of generated output tokens for the trace in Case study 1. Cropped to the 95th percentile (the max value is around 14000 tokens). The same family of distributions was tested here as for the input tokens, with the same best fit.">
    <img src="/preview/pr-43/images/posts/agentic-workloads/output_tokens_fit_p95_linear_opaque.png" style="
        width: 589px;
        max-height: unset;
      " alt="Empirical vs fitted distributions of generated output tokens for the trace in Case study 1. Cropped to the 95th percentile (the max value is around 14000 tokens). The same family of distributions was tested here as for the input tokens, with the same best fit." loading="lazy" onerror="this.src = '/preview/pr-43/images/fallback.svg'; this.onerror = null;" />
  </a>
  
    <figcaption class="figure-caption">
      Empirical vs fitted distributions of generated output tokens for the trace in Case study 1. Cropped to the 95th percentile (the max value is around 14000 tokens). The same family of distributions was tested here as for the input tokens, with the same best fit.

    </figcaption>
  
</figure>
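<p>For readers who want to repeat the inverse Gaussian part of this fit on their own traces, the distribution has a closed-form maximum-likelihood estimate. The sketch below is not the fitting code behind the figures and does not compare candidate distributions; the sample values are hypothetical.</p>

```python
def fit_inverse_gaussian(samples):
    """Closed-form maximum-likelihood fit of an inverse Gaussian.

    Returns (mean, shape). Samples must be strictly positive and not all
    equal, otherwise the shape estimate is undefined.
    """
    n = len(samples)
    mean = sum(samples) / n
    shape = n / sum(1.0 / x - 1.0 / mean for x in samples)
    return mean, shape


# Hypothetical per-request token counts: many small steps, one large burst.
tokens = [50, 100, 200, 400, 4000]
mean, shape = fit_inverse_gaussian(tokens)
print(f"mean={mean:.0f} shape={shape:.1f}")
```

<p>A shape that is small relative to the mean, as in this toy sample, indicates a strongly right-skewed distribution.</p>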

<p>The first-order consequence is that both prefill and decode work are
heterogeneous across the trace rather than roughly constant per turn. Two
requests with similar total context length can have very different prefill costs
depending on how large the fresh tail is, and some generations are much longer
than others. The second-order consequences are different for the two phases.
Large fresh token bursts create prefill interference: they occupy prefill
capacity for longer, which can perturb batching and worsen tail latency for
other sessions sharing the system. Long generations remain active for longer,
which extends the lifetime of their KV cache and changes batch characteristics.
Depending on the inference engine, this can affect metrics such as throughput, completion
latency, TBT, or fairness under mixed workloads.</p>

<p>For benchmarking, the direct consequence is that we should not model either side
with smooth average increments per turn, or sample from uniform distributions.
For example, in Veeksha’s spec shown above, this means we change
<code class="language-plaintext highlighter-rouge">text.body_length_generator</code> and <code class="language-plaintext highlighter-rouge">text.output_length_generator</code> from <code class="language-plaintext highlighter-rouge">uniform</code> to:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">body_length_generator</span><span class="pi">:</span>
  <span class="na">type</span><span class="pi">:</span> <span class="s">inverse_gaussian</span>
  <span class="na">mean</span><span class="pi">:</span> <span class="s">m</span>
  <span class="na">shape</span><span class="pi">:</span> <span class="s">s</span> <span class="c1"># controls dispersion; lower -&gt; heavier tailed</span>
</code></pre></div></div>

<p>Here:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">m</code> is about 815 for input tokens and about 615 for output tokens</li>
  <li><code class="language-plaintext highlighter-rouge">s</code> is about 200 for input tokens and about 145 for output tokens</li>
</ul>
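<p>To get a feel for what these parameters imply, the sketch below draws samples with the Michael-Schucany-Haas method using only the standard library. It assumes that <code class="language-plaintext highlighter-rouge">mean</code> and <code class="language-plaintext highlighter-rouge">shape</code> are the usual inverse Gaussian parameters (often written mu and lambda); whether Veeksha uses exactly this parameterization is an assumption here, and the function name is hypothetical.</p>

```python
import math
import random


def sample_inverse_gaussian(mean, shape, rng):
    """Draw one inverse Gaussian sample (Michael-Schucany-Haas method)."""
    y = rng.gauss(0.0, 1.0) ** 2
    x = (mean
         + mean * mean * y / (2.0 * shape)
         - mean / (2.0 * shape)
         * math.sqrt(4.0 * mean * shape * y + (mean * y) ** 2))
    # Accept the smaller root with probability mean / (mean + x).
    if rng.random() <= mean / (mean + x):
        return x
    return mean * mean / x


rng = random.Random(77)
m, s = 815.0, 200.0  # approximate input-token parameters quoted above
draws = sorted(sample_inverse_gaussian(m, s, rng) for _ in range(20_000))

print("sample mean:", round(sum(draws) / len(draws)))
print("median:", round(draws[len(draws) // 2]))
print("p95:", round(draws[int(0.95 * len(draws))]))
print("theoretical std:", round(math.sqrt(m ** 3 / s)))
```

<p>With these values the theoretical standard deviation is about 1645 tokens, roughly twice the mean, and the median sits well below the mean. That is the heavy-tailed regime described above: mostly small increments with occasional very large ones.</p>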

<p><strong>Prefix invalidation</strong></p>

<p>In modern agentic systems, a compaction event is triggered when the context length reaches its limit. The event issues a request asking the model to summarize the history, and a new session is then created with the summary as fresh context. This effectively invalidates the prefix of the original session.</p>

<p>We can model this event using the previous notes on prefill and decode heterogeneity, because it creates moderately sized decodes (for summarization) and prefills (for the new session with fresh summary), which fit within that heavy-tail description.</p>

<h3 id="bursty-timing">Bursty timing</h3>

<p>Agentic systems are being given more and more ways to interact with the world: file reads and writes, program execution, API calls to other systems, computer use, and soon more autonomous real-world interaction and task navigation. We can consider these as tool calls in the agentic loop above. The idle time between subsequent requests in a session is dictated mainly by three factors:</p>

<ol>
  <li>If it is a user turn, how long the user takes to respond</li>
  <li>If it is a tool turn, the nature of the tool. A test suite might take minutes, a file read might take tens of milliseconds.</li>
  <li>Dispatch inefficiencies of the agentic system.</li>
</ol>

<p>In the context of session graphs, I define the property <code class="language-plaintext highlighter-rouge">wait_after_ready</code> of a node (request) as the time
between completion of the last parent request and the dispatch time of the node. If we look at its
distribution in our sample trace, we see that almost 80% of the waits are less than 100 ms, while the upper
tail is heavy, with 6% being larger than 10 seconds. This effect is similar to that of the input and output token distributions.</p>

<p>Again, this heavy-tail effect has implications beyond workload shape. During idle periods of a session, its KV state stays unused. This in turn increases the chance that, due to memory pressure and cache policies, at least part of the cache will no longer be resident by the time the next request of the session is dispatched. The session will then have to pay a recomputation cost. This is not necessarily bad, as it might be the correct global decision; the point is that it creates a cache-allocation tradeoff, thus affecting other local properties.</p>
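<p>The <code class="language-plaintext highlighter-rouge">wait_after_ready</code> definition above can be sketched as follows. The trace layout and field names (<code class="language-plaintext highlighter-rouge">parents</code>, <code class="language-plaintext highlighter-rouge">dispatched_at</code>, <code class="language-plaintext highlighter-rouge">completed_at</code>) are hypothetical and chosen only for this example.</p>

```python
def wait_after_ready(node, nodes):
    """Seconds between the last parent's completion and this node's dispatch.

    Returns None for root requests, which have no parents.
    """
    parents = nodes[node]["parents"]
    if not parents:
        return None
    ready_at = max(nodes[p]["completed_at"] for p in parents)
    return nodes[node]["dispatched_at"] - ready_at


# Hypothetical trace: request id -> parents and timestamps in seconds.
trace = {
    "a": {"parents": [], "dispatched_at": 0.0, "completed_at": 2.0},
    "b": {"parents": ["a"], "dispatched_at": 2.05, "completed_at": 3.0},
    "c": {"parents": ["a"], "dispatched_at": 2.05, "completed_at": 6.0},
    "d": {"parents": ["b", "c"], "dispatched_at": 18.0, "completed_at": 19.0},
}

waits = {n: wait_after_ready(n, trace) for n in trace}
observed = [w for w in waits.values() if w is not None]
short_share = sum(w < 0.1 for w in observed) / len(observed)
print(f"share of waits under 100 ms: {short_share:.2f}")
```

<p>Request <code class="language-plaintext highlighter-rouge">d</code> has two parents, so it only becomes ready when the slower one completes; its 12-second wait is the kind of tail event that leaves KV state idle.</p>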

<p><strong>Fidelity on synthetic workloads</strong></p>

<p>While the empirical distribution of wait times roughly matches that of the tokens, a best-fit analysis tells us that it is not well described by a single, smooth distribution; a spike+tail description fits best, as shown in the annex figure. I measured another trace, arguably more complete and representative, and got similar results. So, does this mean that a benchmarking framework should support sampling wait times according to complex spike+tail generators? I argue that with synthetic workloads, we care more about preserving clarity and the broad operational regime. An inverse Gaussian or lognormal distribution is enough for that.</p>

<p>If we care about absolute fidelity, a better option is to <strong>replay traces</strong>, preserving every detail of the original workload instead of approximating it. This option is especially useful for those who already have a lot of production traffic.<sup id="note-ref-12"><a href="#note-12">12</a></sup></p>

<h3 id="session-branching">Session branching</h3>

<p>Until now, we have been describing characteristics of linear sessions. A big component of agentic workloads, though, is how agents can spawn subagents. As agentic capabilities improve, we will likely see deeper and deeper hierarchical delegation of work, with each subagent focused on some particular task.</p>

<p>In practice, there are many ways to implement subagents and their reporting strategies.<sup id="note-ref-13"><a href="#note-13">13</a></sup> In the case of the OpenClaw harness, the subagent flow is:</p>

<ol>
  <li>Agent decides to spawn a subagent. It does so by calling the <code class="language-plaintext highlighter-rouge">sessions_spawn</code> tool, which asynchronously spawns a subagent.</li>
  <li>Subagent starts with its own system prompt plus the task description from the parent as context. It does not inherit the full context.</li>
  <li>When finished, the subagent announces the results back to the parent. The parent receives a message with the subagent’s output.</li>
  <li>Nesting depth (sub-subagents and more) and number of allowed spawns per agent are configurable, as well as max concurrency (see <a href="#openclaw-session-tools">OpenClaw session tools</a> and <a href="#openclaw-subagents">OpenClaw subagents</a>).</li>
</ol>

<p>In the context of the DAG that is an agentic session, this means that any node can have multiple children or parents. It is not a linear chain anymore. Fan-out and fan-in degrees can be bigger than 1, which creates dependencies between requests in a way that we did not have before. It also introduces context inheritance dynamics between nodes (not all nodes inherit the full history).</p>

<figure class="figure">
  <a class="figure-image" aria-label="A simplified DAG session. One request has a fan-out degree of 3 (subsession spawn), and one has a fan-in degree of 3 (subsessions reporting back). This is the shape used in Case studies 2 and 3.">
    <img src="/preview/pr-43/images/posts/agentic-workloads/case-study-2-dag_opaque.png" style="
        width: 523px;
        max-height: unset;
      " alt="A simplified DAG session. One request has a fan-out degree of 3 (subsession spawn), and one has a fan-in degree of 3 (subsessions reporting back). This is the shape used in Case studies 2 and 3." loading="lazy" onerror="this.src = '/preview/pr-43/images/fallback.svg'; this.onerror = null;" />
  </a>
  
    <figcaption class="figure-caption">
      A simplified DAG session. One request has a fan-out degree of 3 (subsession spawn), and one has a fan-in degree of 3 (subsessions reporting back). This is the shape used in Case studies 2 and 3.

    </figcaption>
  
</figure>

<p>To illustrate this, I ask OpenClaw to produce a high-quality knowledge graph and analysis of all major AI frameworks – a naturally modular task, as each subagent can be dedicated to researching some part of a particular framework. After about five minutes, OpenAI rate limits are reached, at which point the session looks like this:</p>

<ul>
  <li>575 requests total</li>
  <li>35 sessions (spawns)</li>
  <li>about 3.2M total input tokens</li>
  <li>about 150k total output tokens</li>
  <li>44 requests deep on the longest path</li>
  <li>25 requests wide at the maximum width</li>
  <li>a maximum fan-in and fan-out degree of 2</li>
</ul>

<p>Take a look at the resulting session graph in the annex.</p>

<p>When building an agentic benchmark, we need to consider details such as branching factor, depth and length of child sessions, and history inheritance ratios. Higher branching factors usually mean much higher request concurrency, with both obvious and subtler implications. They directly affect the total pressure of the workload on the inference system’s memory, compute, and scheduling state.</p>
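<p>These structural properties can be measured directly from the edges of a session graph. The sketch below uses a small hypothetical DAG with a 3-way spawn and join (not the exact graph from the figure) and one possible definition of width: the largest number of requests at the same level, where a request's level is its longest distance from a root.</p>

```python
from collections import Counter


def dag_stats(edges):
    """Max fan-out, max fan-in, depth and width of a session DAG.

    `edges` is a list of (parent, child) pairs. Depth counts the requests on
    the longest path; width is the largest number of requests per level.
    """
    children, parents, nodes = {}, {}, set()
    for p, c in edges:
        children.setdefault(p, []).append(c)
        parents.setdefault(c, []).append(p)
        nodes.update((p, c))

    level = {}

    def node_level(n):
        # Roots sit at level 1; other nodes one below their deepest parent.
        if n not in level:
            level[n] = 1 + max(
                (node_level(p) for p in parents.get(n, [])), default=0
            )
        return level[n]

    for n in nodes:
        node_level(n)
    return {
        "max_fan_out": max(len(v) for v in children.values()),
        "max_fan_in": max(len(v) for v in parents.values()),
        "depth": max(level.values()),
        "width": max(Counter(level.values()).values()),
    }


# Hypothetical session: r1 spawns three two-request subsessions that join at r2.
edges = [
    ("r0", "r1"),
    ("r1", "a1"), ("r1", "b1"), ("r1", "c1"),
    ("a1", "a2"), ("b1", "b2"), ("c1", "c2"),
    ("a2", "r2"), ("b2", "r2"), ("c2", "r2"),
]
stats = dag_stats(edges)
print(stats)
```

<p>For this toy graph the result is a fan-out and fan-in of 3, a depth of 5 and a width of 3. The recursive level computation is fine for small graphs; very deep traces would need an iterative topological pass instead.</p>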

<h4 id="case-study-2">Case study 2</h4>

<p>The observed OpenClaw trace is rich in structure, with nested spawning but limited measured fan-out before rate limiting.
For controlled experiments I replace it with the simplified 3-way DAG shown above, which isolates the effect of branching.</p>

<p>I now compare two workloads whose traces contain the same total number of fresh tokens but differ in shape.
Workload A (“linear workload”) is a sequence of short linear sessions, while workload
B (“DAG workload”) is a sequence of the DAG sessions shown above.
Over a full pass of the traces, they both have the same total number of new input and output tokens, so from the
application’s perspective they do the same amount of work. In the timed replay, however, I keep the session
arrival rate fixed and allow the traces to wrap. This does not imply identical inference work over
time: the DAG workload may, for example, decode at longer effective context lengths, induce different cache
patterns, and wrap around faster. The point is that even when traces
look similar under the same user token budget and are dispatched at a fixed session rate, session topology and replay setup
can change reported performance if they are not accounted for.</p>

<p>I run both workloads independently against the same system from a fresh start, at a shared session arrival rate
of 0.18 sessions/s. The linear workload uses 30 sessions of 5 requests each, while the DAG workload uses 10
sessions of 15 requests each with a 3-way fan-out and fan-in.<sup id="note-ref-14"><a href="#note-14">14</a></sup> <code class="language-plaintext highlighter-rouge">wait_after_ready</code> is always 0.
I create the traces synthetically and replay them for 300 seconds with Veeksha’s <code class="language-plaintext highlighter-rouge">timed_synthetic_session</code> trace
session generator (see annex). Because the run uses wrap mode, the 10-session DAG trace wraps around faster than
the 30-session linear trace at the same session arrival rate. The model and system are the same as in Case study 1.</p>
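<p>The wrap behavior follows from simple arithmetic on the numbers above. The sketch below assumes sessions are dispatched at exactly the nominal arrival rate for the full 300 seconds, which a real run only approximates; the function name is hypothetical.</p>

```python
def replay_summary(sessions_in_trace, requests_per_session, arrival_rate, duration_s):
    """Back-of-the-envelope numbers for a timed replay in wrap mode."""
    dispatched_sessions = arrival_rate * duration_s
    return {
        "requests_per_pass": sessions_in_trace * requests_per_session,
        "seconds_per_pass": sessions_in_trace / arrival_rate,
        "passes": dispatched_sessions / sessions_in_trace,
        "nominal_requests_per_s": arrival_rate * requests_per_session,
    }


linear = replay_summary(30, 5, arrival_rate=0.18, duration_s=300)
dag = replay_summary(10, 15, arrival_rate=0.18, duration_s=300)

for name, summary in (("linear", linear), ("dag", dag)):
    print(name, {k: round(v, 2) for k, v in summary.items()})
```

<p>Both traces hold 150 requests per pass, but under these assumptions the DAG trace completes a pass in about 56 seconds and wraps 5.4 times, while the linear trace needs about 167 seconds and wraps 1.8 times. At the same session arrival rate, the DAG workload also nominally issues three times as many requests per second (2.7 versus 0.9).</p>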

<figure class="figure">
  <a class="figure-image" aria-label="Same total fresh tokens in the traces, different effective work under timed replay. Left: ECDF of total prompt tokens. Middle: TTFC ECDF. Right: end-to-end latency ECDF. Even though both traces have the same total new input and output tokens per full pass, the DAG workload shifts mass into the longer-context regime, moving the latency curves.">
    <img src="/preview/pr-43/images/posts/agentic-workloads/case_study_2_shape_ecdfs_opaque.png" style="
        width: 760px;
        max-height: unset;
      " alt="Same total fresh tokens in the traces, different effective work under timed replay. Left: ECDF of total prompt tokens. Middle: TTFC ECDF. Right: end-to-end latency ECDF. Even though both traces have the same total new input and output tokens per full pass, the DAG workload shifts mass into the longer-context regime, moving the latency curves." loading="lazy" onerror="this.src = '/preview/pr-43/images/fallback.svg'; this.onerror = null;" />
  </a>
  
    <figcaption class="figure-caption">
      Same total fresh tokens in the traces, different effective work under timed replay. Left: ECDF of total prompt tokens. Middle: TTFC ECDF. Right: end-to-end latency ECDF. Even though both traces have the same total new input and output tokens per full pass, the DAG workload shifts mass into the longer-context regime, moving the latency curves.

    </figcaption>
  
</figure>

<p>With the same total fresh tokens in the traces, the DAG workload has 116.7% higher TTFC p99 and 77.2% higher E2E
p95 than the linear workload. Its mean prompt length is also 20.2%
longer. See the annex below for the full
comparison.</p>

<p>The previous figure also shows why looking
only at prompt length p95 is misleading here: the p95 total prompt length is
effectively unchanged, but the median rises from 2100 to 2899 tokens, so
much more of workload B spends time in the long-context regime. B also induces
burstier concurrency: while session arrival is 0.18 sessions per second,
the scheduler sees far more simultaneous active decodes because DAG sessions contain
more requests with some degree of dispatch parallelism. And even though B has a 39% higher prefix
cache hit rate than A, that is not enough to compensate for the longer contexts and the burstier concurrency, so
the observed TTFC increases.</p>

<figure class="figure">
  <a class="figure-image" aria-label="Duration-weighted decode overlap. For each x-axis value k, the y-axis shows the share of total decode time spent with at least k simultaneous decode requests. The linear run never exceeds 6 simultaneous active decodes. The DAG run reaches 41, and spends more than 60% of decode-active time at 10 or more simultaneous decode requests.">
    <img src="/preview/pr-43/images/posts/agentic-workloads/case_study_2_decode_overlap_opaque.png" style="
        width: 700px;
        max-height: unset;
      " alt="Duration-weighted decode overlap. For each x-axis value k, the y-axis shows the share of total decode time spent with at least k simultaneous decode requests. The linear run never exceeds 6 simultaneous active decodes. The DAG run reaches 41, and spends more than 60% of decode-active time at 10 or more simultaneous decode requests." loading="lazy" onerror="this.src = '/preview/pr-43/images/fallback.svg'; this.onerror = null;" />
  </a>
  
    <figcaption class="figure-caption">
      Duration-weighted decode overlap. For each x-axis value k, the y-axis shows the share of total decode time spent with at least k simultaneous decode requests. The linear run never exceeds 6 simultaneous active decodes. The DAG run reaches 41, and spends more than 60% of decode-active time at 10 or more simultaneous decode requests.

    </figcaption>
  
</figure>
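<p>One way to compute a curve like this from per-request decode intervals is a sweep over start and end events. The sketch below is one reading of the metric (share of decode-active wall-clock time with at least k simultaneous decodes) and is not the code used for the figure; the intervals are hypothetical.</p>

```python
from collections import defaultdict


def decode_overlap_share(intervals):
    """Share of decode-active wall-clock time with at least k active decodes.

    `intervals` holds one (start, end) pair per request, in seconds.
    Returns {k: share} for k from 1 to the maximum observed concurrency.
    """
    events = []
    for start, end in intervals:
        events.append((start, 1))
        events.append((end, -1))
    events.sort()  # at equal times, ends (-1) sort before starts (+1)

    time_at = defaultdict(float)  # concurrency level -> seconds spent there
    active, prev_t = 0, 0.0
    for t, delta in events:
        if active and t > prev_t:
            time_at[active] += t - prev_t
        active += delta
        prev_t = t

    total = sum(time_at.values())
    return {
        k: sum(s for c, s in time_at.items() if c >= k) / total
        for k in range(1, max(time_at) + 1)
    }


# Hypothetical decode intervals for three requests.
shares = decode_overlap_share([(0.0, 4.0), (1.0, 3.0), (2.0, 6.0)])
print(shares)
```

<p>In this toy example the system is decode-active for 6 seconds: all of that time has at least one decode running, half of it has at least two, and one sixth has all three.</p>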

<p>Session topology and trace wrapping can strongly influence benchmarking results if they are not studied properly beforehand.
Inference systems, however, are not usually provisioned for a fixed total token budget.</p>

<h4 id="case-study-3">Case study 3</h4>

<p>Case study 2 briefly shows that when evaluating capacity, one needs to be vigilant about proper workload alignment.
Case study 3 asks: after tuning each workload to its own SLO frontier, how much healthy work can each shape sustain on the same hardware?
I tune two inference deployments on the previous DAG and linear workloads, using the same system and model, provisioning each one
by maximizing the healthy normalized request rate <code class="language-plaintext highlighter-rouge">rho</code>. A healthy run is defined by <code class="language-plaintext highlighter-rouge">TTFC p95 &lt;= 0.75 s</code>,
<code class="language-plaintext highlighter-rouge">TBC p95 &lt;= 75 ms</code> and <code class="language-plaintext highlighter-rouge">error rate &lt; 2%</code>.
<code class="language-plaintext highlighter-rouge">rho_l</code> is the normalized request rate with the linear workload as reference, and <code class="language-plaintext highlighter-rouge">rho_d</code> is the same quantity for the DAG workload.
<code class="language-plaintext highlighter-rouge">rho_l*</code> and <code class="language-plaintext highlighter-rouge">rho_d*</code> are the maximum healthy values found in the respective searches.</p>
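<p>The health criterion and the frontier selection can be written down compactly. The sketch below is an illustration only: the run records and their field names are hypothetical, the numbers are not the measurements from this post, and the search simply takes the largest healthy value on a grid.</p>

```python
def is_healthy(ttfc_p95_s, tbc_p95_ms, error_rate):
    """SLO check: TTFC p95 at most 0.75 s, TBC p95 at most 75 ms, errors under 2%."""
    return ttfc_p95_s <= 0.75 and tbc_p95_ms <= 75.0 and error_rate < 0.02


def frontier(runs):
    """Largest rho among healthy runs, or None if no run is healthy."""
    healthy = [
        r["rho"]
        for r in runs
        if is_healthy(r["ttfc_p95_s"], r["tbc_p95_ms"], r["error_rate"])
    ]
    return max(healthy, default=None)


# Hypothetical search-grid results.
runs = [
    {"rho": 5.0, "ttfc_p95_s": 0.30, "tbc_p95_ms": 68.0, "error_rate": 0.0},
    {"rho": 5.5, "ttfc_p95_s": 0.35, "tbc_p95_ms": 74.9, "error_rate": 0.0},
    {"rho": 6.0, "ttfc_p95_s": 0.41, "tbc_p95_ms": 81.2, "error_rate": 0.0},
]
print(frontier(runs))  # 5.5
```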

<p>In both workloads, each request always has 500 fresh input tokens and asks for 300 output tokens,
so each deployment sees a fresh input rate of <code class="language-plaintext highlighter-rouge">500 * rho</code> tokens/s and a requested output rate of <code class="language-plaintext highlighter-rouge">300 * rho</code> tokens/s for
all values of <code class="language-plaintext highlighter-rouge">rho</code> in the search grid.
Session dispatch rates are thus:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">rho_l / 5</code> sessions/s for the linear workload</li>
  <li><code class="language-plaintext highlighter-rouge">rho_d / 15</code> sessions/s for the DAG workload</li>
</ul>
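<p>As a small worked example, the relations above can be evaluated for any <code class="language-plaintext highlighter-rouge">rho</code>. The sketch uses the frontier values reported in the results (5.50 and 6.67); the function name is hypothetical.</p>

```python
def offered_load(rho, requests_per_session, fresh_in=500, out=300):
    """Token and session rates implied by a normalized request rate rho."""
    return {
        "fresh_input_tok_s": fresh_in * rho,
        "output_tok_s": out * rho,
        "sessions_s": rho / requests_per_session,
    }


linear = offered_load(5.50, requests_per_session=5)
dag = offered_load(6.67, requests_per_session=15)

for name, load in (("linear", linear), ("dag", dag)):
    print(name, {k: round(v, 3) for k, v in load.items()})
```

<p>At those values the linear deployment sees about 2750 fresh input tokens per second from 1.1 sessions per second, while the DAG deployment sees about 3335 fresh input tokens per second from roughly 0.44 sessions per second.</p>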

<p>I compare both deployments and show how it is possible to underprovision or overprovision when reference workloads do not match production traffic.</p>

<p><strong>Results</strong></p>

<p>If the real traffic is DAG-shaped but the fleet was sized on a linear reference, the linear-tuned fleet is roughly 21% larger than needed, which corresponds to about 17.5% slack capacity. The experiments yield <code class="language-plaintext highlighter-rouge">rho_l* = 5.50</code> and <code class="language-plaintext highlighter-rouge">rho_d* = 6.67</code>: with the exact same configuration and hardware, the DAG workload sustains about <code class="language-plaintext highlighter-rouge">1.21x</code> as much useful work under the same SLO regime. Both workloads hit TBC p95 as the binding SLO constraint at almost identical absolute values (74.9 ms vs 74.6 ms), so the frontier is decode-bound in both cases.</p>
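<p>The two percentages come from the same ratio read in two directions. A short check, using only the frontier values quoted above:</p>

```python
rho_l_star, rho_d_star = 5.50, 6.67

# Work the DAG workload sustains relative to the linear workload.
capacity_ratio = rho_d_star / rho_l_star
# How much larger a linear-sized fleet is than DAG traffic requires.
oversize = capacity_ratio - 1.0
# Share of the DAG-capable capacity that a linear-sized fleet leaves unused.
slack = 1.0 - rho_l_star / rho_d_star

print(f"{capacity_ratio:.2f}x, {oversize:.1%} larger, {slack:.1%} slack")
# 1.21x, 21.3% larger, 17.5% slack
```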

<p>The reason the DAG workload can absorb more load is prefix reuse. At its own frontier, DAG achieves a prefix-cache hit rate of 0.652 versus 0.338 for linear at its frontier, and a prompt reuse ratio of 0.744 versus 0.597. DAG’s contexts are longer and its TTFC is consistently higher than linear’s, but the high prefix reuse means most of that context is served from cache, freeing decode capacity. The binding constraint is TBC, not TTFC, so the prefill savings translate into room for a higher request rate.</p>

<p>To confirm this, I replay the DAG workload at the linear frontier rate <code class="language-plaintext highlighter-rouge">rho_l* = 5.50</code>. The DAG workload is comfortable: all SLOs are met with headroom. TBC and E2E p95 are lower at 57.6 ms (vs 74.9 ms for linear) and 8.40 s (vs 11.61 s for linear), respectively. TPOT throughput rises from 35.6 to 47.0 tok/s. The system is underloaded.</p>

<figure class="figure">
  <a class="figure-image" aria-label="Frontier behavior at the optimal normalized request rates `rho_l* = 5.50` and `rho_d* = 6.67`. Linear TTFC (left) is lower, consistent with its shorter context lengths. DAG has a better mid-tail distribution and reaches a similar p95 at a higher load, consistent with higher cache reuse (right).">
    <img src="/preview/pr-43/images/posts/agentic-workloads/case_study_3_frontier_metric_facets_opaque.png" style="
        width: 760px;
        max-height: unset;
      " alt="Frontier behavior at the optimal normalized request rates `rho_l* = 5.50` and `rho_d* = 6.67`. Linear TTFC (left) is lower, consistent with its shorter context lengths. DAG has a better mid-tail distribution and reaches a similar p95 at a higher load, consistent with higher cache reuse (right)." loading="lazy" onerror="this.src = '/preview/pr-43/images/fallback.svg'; this.onerror = null;" />
  </a>
  
    <figcaption class="figure-caption">
      Frontier behavior at the optimal normalized request rates <code class="language-plaintext highlighter-rouge">rho_l* = 5.50</code> and <code class="language-plaintext highlighter-rouge">rho_d* = 6.67</code>. Linear TTFC (left) is lower, consistent with its shorter context lengths. DAG has a better mid-tail distribution and reaches a similar p95 at a higher load, consistent with higher cache reuse (right).

    </figcaption>
  
</figure>

<p>The annex tables below collect the full numbers.</p>

<h2 id="conclusion">Conclusion</h2>

<h3 id="simple-and-agentic-workloads-you-need-both">Simple and agentic workloads (you need both)</h3>

<p>Simple workloads are useful for measuring raw prefill or
decode performance and for isolating confounding variables. Agentic workloads
reveal deeper effects in inference systems, like cache
retention under bursty traffic, scheduling fairness under mixed concurrency,
memory pressure from long-lived sessions, and the combined effects of request
expansion, branching and prefix invalidation.</p>

<h3 id="putting-it-all-together">Putting it all together</h3>

<p>A single agentic session is usually a DAG of parallel inference chains
with partial history inheritance. One user task
expands into many inference requests, consecutive requests share most of their prefix, fresh
input and output sizes vary widely, waits are bursty, and occasional compaction
events invalidate the prefix.</p>

<p>The key properties are:</p>

<ul>
  <li>Request expansion: the think-act-observe loop turns one user task into a long chain of dependent requests.</li>
  <li>Stateful prefix reuse: full-history appends make consecutive requests share most of their prefix, though compaction or partial-history handoffs can reset or reduce that reuse.</li>
  <li>Token-count heterogeneity: tool results, summaries and final answers create broad fresh-input and output distributions, including compaction events that generate summary decodes and prefill restarts.</li>
  <li>Bursty timing: tool latency, user think time and dispatch overhead create broad <code class="language-plaintext highlighter-rouge">wait_after_ready</code> gaps.</li>
  <li>Session branching: <code class="language-plaintext highlighter-rouge">sessions_spawn</code> turns one chain into a DAG with fan-out, width and partial history inheritance, while repeated agent and subagent scaffolds create reuse opportunities across sessions.</li>
</ul>

<p>In conclusion:</p>

<ol>
  <li>Agentic traces are structured as session graphs plus distributions over token counts, waits, prefix reuse, invalidations and branching.</li>
  <li>We can benchmark them without running a real agent by measuring those distributions and generating synthetic sessions from them. Replaying traces is also useful.</li>
  <li>This matters because inference systems behave differently under agentic load, so the wrong workload can give the wrong conclusion.</li>
</ol>

<h3 id="where-to-go-from-here">Where to go from here</h3>

<p>If you want to try these ideas on your own inference system, the
Veeksha <a href="https://github.com/project-vajra/veeksha">repository</a> and
<a href="https://project-vajra.github.io/veeksha">documentation</a> are a good
starting point. Thank you for reading.</p>

<p><em>Experiment results, OpenClaw telemetry and more are available in the
<a href="https://github.com/chus-chus/blogpost_agentic_workloads">GitHub repo</a>.</em></p>

<h3 id="acknowledgements">Acknowledgements</h3>

<p>I’d like to thank Souradeep Bera and Elton Pinto for their helpful feedback on this post, as well as the entire Vajra Project team for their support and contributions to Veeksha’s development.</p>

<h2 id="notes">Notes</h2>

<ol>
  <li><a id="note-1"></a> For example, what are the engine’s policies for KV cache management or prefill and decode scheduling? <a href="#note-ref-1">↩</a></li>
  <li><a id="note-2"></a> I use it not only because it is open source; it is also representative of systems with subagent spawning, parallel execution, and context management. <a href="#note-ref-2">↩</a></li>
  <li><a id="note-3"></a> In this post I focus on text-only requests because most agentic workloads, at the time of writing, are text-only. Requests can also be multimodal, in which case the relevant metrics could change. For example, in the case of an audio response, the time to first audio matters more than just TTFT. <a href="#note-ref-3">↩</a></li>
  <li><a id="note-4"></a> As of now, inference systems report streaming information differently, and there is not a standard way of seeing the number of output tokens in output chunks. Counting the number of tokens in a chunk accurately is not always possible due to tokenization mismatches. <a href="#note-ref-4">↩</a></li>
  <li><a id="note-5"></a> For example, OpenClaw’s web search can call Gemini, Perplexity, or other providers. These are external requests that just look like timing gaps from our inference system’s perspective. <a href="#note-ref-5">↩</a></li>
  <li><a id="note-6"></a> Technically, OpenClaw does not implement the agentic loop. According to the docs, it is a “… gateway for Pi agents”. So we are actually talking about Pi agents running on OpenClaw. <a href="#note-ref-6">↩</a></li>
  <li><a id="note-7"></a> For example, subagent summaries that get injected into the parent’s context. <a href="#note-ref-7">↩</a></li>
  <li><a id="note-8"></a> Defined as the ratio of the shared consecutive tokens, from the start, to the total tokens in the input. <a href="#note-ref-8">↩</a></li>
  <li><a id="note-9"></a> Prefix caching means reusing the KV cache computed for request <code class="language-plaintext highlighter-rouge">N</code> when processing request <code class="language-plaintext highlighter-rouge">N+1</code>. Since the shared prefix is identical, the system only needs to compute KV entries for the new tokens. This requires complex cache management. Hybrid transformer + Mamba, sparse attention, or other lower-cache-footprint models just decrease the slope of the memory requirement. <a href="#note-ref-9">↩</a></li>
  <li><a id="note-10"></a> Note that the trace statistics in this post probably underestimate what power users and more advanced agentic harnesses generate. <a href="#note-ref-10">↩</a></li>
  <li><a id="note-11"></a> Means 1570 and 612, inflated by a few long-context requests. <a href="#note-ref-11">↩</a></li>
  <li><a id="note-12"></a> Veeksha can do this for a variety of trace types, like agentic ones directly from Claude Code or OpenClaw, while preserving the DAG, token, and timing distributions of the workload. <a href="#note-ref-12">↩</a></li>
  <li><a id="note-13"></a> Should agents communicate across hierarchies? Only to their parents? Peers? <a href="#note-ref-13">↩</a></li>
  <li><a id="note-14"></a> The workload spec, results and Veeksha config are collected in the annex below. <a href="#note-ref-14">↩</a></li>
</ol>
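
<p>The prefix reuse ratio from note 8 can be sketched in a few lines. This is a minimal illustration, not Veeksha’s implementation: the function name <code class="language-plaintext highlighter-rouge">prefix_reuse_ratio</code>, the token lists, and the choice to compare a request against the previous request in the same session are all assumptions made for the example.</p>

```python
def prefix_reuse_ratio(previous_tokens, current_tokens):
    """Share of current_tokens covered by the prefix it has in common with previous_tokens."""
    if not current_tokens:
        return 0.0  # convention chosen for this sketch: an empty input has no reuse
    shared = 0
    # Walk both sequences from the start and stop at the first mismatch.
    for prev_tok, cur_tok in zip(previous_tokens, current_tokens):
        if prev_tok != cur_tok:
            break
        shared += 1
    return shared / len(current_tokens)


# Hypothetical token IDs: request N+1 extends request N with three new tokens.
request_n = [101, 7, 7, 42, 9]
request_n_plus_1 = [101, 7, 7, 42, 9, 13, 5, 88]
print(prefix_reuse_ratio(request_n, request_n_plus_1))  # 5 shared tokens out of 8
```

<p>Only the leading run of identical tokens counts, so a single changed token early in the prompt drops the ratio to near zero even if the rest of the prompt is unchanged.</p>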

<h2 id="references">References</h2>

<ul>
  <li><a id="inferencemax"></a><a href="https://inferencex.semianalysis.com">InferenceX (formerly InferenceMAX)</a></li>
  <li><a id="artificial-analysis"></a><a href="https://artificialanalysis.ai/evaluations">Artificial Analysis Evaluations</a></li>
  <li><a id="claude-code-subagents"></a><a href="https://docs.anthropic.com/en/docs/claude-code/subagents">Anthropic: Claude Code Docs - Create custom subagents</a></li>
  <li><a id="openclaw-session-tools"></a><a href="https://docs.openclaw.ai/concepts/session-tool">OpenClaw Docs - Session Tools</a></li>
  <li><a id="openclaw-subagents"></a><a href="https://docs.openclaw.ai/tools/subagents">OpenClaw Docs - Subagents</a></li>
  <li><a id="pagedattention"></a><a href="https://arxiv.org/abs/2309.06180">Kwon et al. (2023): Efficient Memory Management for Large Language Model Serving with PagedAttention</a></li>
  <li><a id="orca"></a><a href="https://www.usenix.org/system/files/osdi22-yu.pdf">Yu et al. (2022): Orca: A Distributed Serving System for Transformer-Based Generative Models</a></li>
  <li><a id="distserve"></a><a href="https://openreview.net/forum?id=sNifYctwnP">Zhong et al. (2024): DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving</a></li>
  <li><a id="sarathi-serve"></a><a href="https://arxiv.org/abs/2403.02310">Agrawal et al. (2024): Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve</a></li>
  <li><a id="openai-streaming"></a><a href="https://platform.openai.com/docs/guides/streaming-responses">OpenAI: Streaming API responses</a></li>
  <li><a id="vllm-prefix-caching"></a><a href="https://docs.vllm.ai/en/stable/design/prefix_caching.html">vLLM Project: Automatic Prefix Caching</a></li>
</ul>

<h2 id="annex">Annex</h2>

<figure class="figure">
  <a class="figure-image" aria-label="Real session structure extracted from the trace in the session-branching section. Open in a new tab for high-resolution exploration. Only 2 subagents return results because I hit rate limits quickly.">
    <img src="/preview/pr-43/images/posts/agentic-workloads/branching_dag_openclaw.png" style="
        width: 650px;
        max-height: 1150px;
      " alt="Real session structure extracted from the trace in the session-branching section. Open in a new tab for high-resolution exploration. Only 2 subagents return results because I hit rate limits quickly." loading="lazy" onerror="this.src = '/preview/pr-43/images/fallback.svg'; this.onerror = null;" />
  </a>
  
    <figcaption class="figure-caption">
      Real session structure extracted from the trace in the session-branching section. Open in a new tab for high-resolution exploration. Only 2 subagents return results because I hit rate limits quickly.

    </figcaption>
  
</figure>

<figure class="figure">
  <a class="figure-image" aria-label="Empirical CDF of `wait_after_ready` for the OpenClaw trace in Case study 1, with log-log axes. The dashed line marks the 100 ms threshold used for the Pareto tail fit. We see the initial spike and the slower heavy tail.">
    <img src="/preview/pr-43/images/posts/agentic-workloads/wait_after_ready_pareto_tail_ccdf_opaque.png" style="
        width: 680px;
        max-height: unset;
      " alt="Empirical CDF of `wait_after_ready` for the OpenClaw trace in Case study 1, with log-log axes. The dashed line marks the 100 ms threshold used for the Pareto tail fit. We see the initial spike and the slower heavy tail." loading="lazy" onerror="this.src = '/preview/pr-43/images/fallback.svg'; this.onerror = null;" />
  </a>
  
    <figcaption class="figure-caption">
      Empirical CDF of <code class="language-plaintext highlighter-rouge">wait_after_ready</code> for the OpenClaw trace in Case study 1, with log-log axes. The dashed line marks the 100 ms threshold used for the Pareto tail fit. We see the initial spike and the slower heavy tail.

    </figcaption>
  
</figure>

<h3 id="annex-case-study-2">Annex: case study 2</h3>

<p><em>Workload specification of Case study 2 runs.</em></p>

<table>
  <thead>
    <tr>
      <th>Property</th>
      <th>Linear workload</th>
      <th>DAG workload</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Shared session arrival rate</td>
      <td><code class="language-plaintext highlighter-rouge">0.18</code> sessions/s</td>
      <td><code class="language-plaintext highlighter-rouge">0.18</code> sessions/s</td>
    </tr>
    <tr>
      <td>Sessions</td>
      <td><code class="language-plaintext highlighter-rouge">30</code></td>
      <td><code class="language-plaintext highlighter-rouge">10</code></td>
    </tr>
    <tr>
      <td>Requests per session</td>
      <td><code class="language-plaintext highlighter-rouge">5</code></td>
      <td><code class="language-plaintext highlighter-rouge">15</code></td>
    </tr>
    <tr>
      <td>Total requests</td>
      <td><code class="language-plaintext highlighter-rouge">150</code></td>
      <td><code class="language-plaintext highlighter-rouge">150</code></td>
    </tr>
    <tr>
      <td>Fresh input tokens per request</td>
      <td><code class="language-plaintext highlighter-rouge">500</code></td>
      <td><code class="language-plaintext highlighter-rouge">500</code></td>
    </tr>
    <tr>
      <td>Output tokens per request</td>
      <td><code class="language-plaintext highlighter-rouge">300</code></td>
      <td><code class="language-plaintext highlighter-rouge">300</code></td>
    </tr>
    <tr>
      <td>Total new input tokens</td>
      <td><code class="language-plaintext highlighter-rouge">75000</code></td>
      <td><code class="language-plaintext highlighter-rouge">75000</code></td>
    </tr>
    <tr>
      <td>Total output tokens</td>
      <td><code class="language-plaintext highlighter-rouge">45000</code></td>
      <td><code class="language-plaintext highlighter-rouge">45000</code></td>
    </tr>
  </tbody>
</table>

<p><em>Sample Veeksha configuration for Case study 2 DAG runs.</em></p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">session_generator</span><span class="pi">:</span>
  <span class="na">type</span><span class="pi">:</span> <span class="s">trace</span>
  <span class="na">trace_file</span><span class="pi">:</span> <span class="s">traces/workload_shape/workload_b_dag.jsonl</span>
  <span class="na">wrap_mode</span><span class="pi">:</span> <span class="no">true</span>
  <span class="na">flavor</span><span class="pi">:</span>
    <span class="na">type</span><span class="pi">:</span> <span class="s">timed_synthetic_session</span>
    <span class="na">page_size</span><span class="pi">:</span> <span class="m">16</span>

<span class="na">seed</span><span class="pi">:</span> <span class="m">42</span>
<span class="na">output_dir</span><span class="pi">:</span> <span class="s">benchmark_output/workload_shape_case_study/qwen/dag</span>

<span class="na">server</span><span class="pi">:</span> <span class="kt">!include</span> <span class="s">shared/server_h100_qwen3_5_35b_a3b.yml</span>
<span class="na">traffic_scheduler</span><span class="pi">:</span> <span class="kt">!include</span> <span class="s">shared/rate_traffic.yml</span>
<span class="na">client</span><span class="pi">:</span> <span class="kt">!include</span> <span class="s">shared/client_qwen3_5_nonthinking.yml</span>
<span class="na">runtime</span><span class="pi">:</span> <span class="kt">!include</span> <span class="s">shared/runtime_qwen.yml</span>
<span class="na">evaluators</span><span class="pi">:</span> <span class="kt">!include</span> <span class="s">shared/evaluators.yml</span>
<span class="na">trace_recorder</span><span class="pi">:</span> <span class="kt">!include</span> <span class="s">shared/trace_recorder.yml</span>
</code></pre></div></div>

<p><em>Observed comparison for Case study 2.</em></p>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Linear</th>
      <th>DAG</th>
      <th>DAG vs. linear (relative increase)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Mean total prompt tokens per request</td>
      <td><code class="language-plaintext highlighter-rouge">2094.3</code></td>
      <td><code class="language-plaintext highlighter-rouge">2517.5</code></td>
      <td><code class="language-plaintext highlighter-rouge">20.2%</code></td>
    </tr>
    <tr>
      <td>TTFC p99</td>
      <td><code class="language-plaintext highlighter-rouge">0.205s</code></td>
      <td><code class="language-plaintext highlighter-rouge">0.444s</code></td>
      <td><code class="language-plaintext highlighter-rouge">116.7%</code></td>
    </tr>
    <tr>
      <td>E2E p95</td>
      <td><code class="language-plaintext highlighter-rouge">3.054s</code></td>
      <td><code class="language-plaintext highlighter-rouge">5.410s</code></td>
      <td><code class="language-plaintext highlighter-rouge">77.2%</code></td>
    </tr>
    <tr>
      <td>TBC p99</td>
      <td><code class="language-plaintext highlighter-rouge">11.1 ms</code></td>
      <td><code class="language-plaintext highlighter-rouge">52.4 ms</code></td>
      <td><code class="language-plaintext highlighter-rouge">372.1%</code></td>
    </tr>
    <tr>
      <td>vLLM prefix-cache hit rate</td>
      <td><code class="language-plaintext highlighter-rouge">0.495</code></td>
      <td><code class="language-plaintext highlighter-rouge">0.690</code></td>
      <td><code class="language-plaintext highlighter-rouge">39.4%</code></td>
    </tr>
  </tbody>
</table>
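
<p>The last column of the table above reads as the relative increase of the DAG value over the linear one. The sketch below recomputes it from the rounded values shown in the table; the helper name <code class="language-plaintext highlighter-rouge">relative_change_pct</code> is an assumption for illustration, and small differences in the last digit are expected because the reported percentages were presumably computed from unrounded measurements.</p>

```python
# (metric, linear, dag, reported percent change) copied from the table above
rows = [
    ("Mean total prompt tokens per request", 2094.3, 2517.5, 20.2),
    ("TTFC p99 (s)", 0.205, 0.444, 116.7),
    ("E2E p95 (s)", 3.054, 5.410, 77.2),
    ("TBC p99 (ms)", 11.1, 52.4, 372.1),
    ("vLLM prefix-cache hit rate", 0.495, 0.690, 39.4),
]


def relative_change_pct(linear, dag):
    """Percent increase of the DAG value relative to the linear baseline."""
    return (dag - linear) / linear * 100.0


recomputed = {name: relative_change_pct(lin, dag) for name, lin, dag, _ in rows}
for name, lin, dag, reported in rows:
    print(f"{name}: recomputed {recomputed[name]:.1f}%, reported {reported}%")
```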

<h3 id="annex-case-study-3">Annex: case study 3</h3>

<p><em>Case study 3 frontier and first-failure points under the shared p95-based SLO regime.</em></p>

<table>
  <thead>
    <tr>
      <th>Quantity</th>
      <th>Linear workload</th>
      <th>DAG workload</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Max healthy normalized request rate <code class="language-plaintext highlighter-rouge">rho*</code></td>
      <td><code class="language-plaintext highlighter-rouge">5.50</code></td>
      <td><code class="language-plaintext highlighter-rouge">6.67</code></td>
    </tr>
    <tr>
      <td>First failing normalized request rate</td>
      <td><code class="language-plaintext highlighter-rouge">5.51</code></td>
      <td><code class="language-plaintext highlighter-rouge">6.68</code></td>
    </tr>
    <tr>
      <td>First limiting SLOs</td>
      <td><code class="language-plaintext highlighter-rouge">TBC p95 &gt; 75ms</code></td>
      <td><code class="language-plaintext highlighter-rouge">TBC p95 &gt; 75ms</code></td>
    </tr>
  </tbody>
</table>

<p><em>Observed frontier metrics and same-budget replay metrics for Case study 3.</em></p>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Linear at <code class="language-plaintext highlighter-rouge">rho_l* = 5.50</code></th>
      <th>DAG replay at <code class="language-plaintext highlighter-rouge">rho_l* = 5.50</code></th>
      <th>DAG at <code class="language-plaintext highlighter-rouge">rho_d* = 6.67</code></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>TTFC p95</td>
      <td><code class="language-plaintext highlighter-rouge">0.386s</code></td>
      <td><code class="language-plaintext highlighter-rouge">0.426s</code></td>
      <td><code class="language-plaintext highlighter-rouge">0.508s</code></td>
    </tr>
    <tr>
      <td>E2E p95</td>
      <td><code class="language-plaintext highlighter-rouge">11.61s</code></td>
      <td><code class="language-plaintext highlighter-rouge">8.40s</code></td>
      <td><code class="language-plaintext highlighter-rouge">9.69s</code></td>
    </tr>
    <tr>
      <td>TBC p95</td>
      <td><code class="language-plaintext highlighter-rouge">74.9 ms</code></td>
      <td><code class="language-plaintext highlighter-rouge">57.6 ms</code></td>
      <td><code class="language-plaintext highlighter-rouge">74.6 ms</code></td>
    </tr>
    <tr>
      <td>TPOT throughput</td>
      <td><code class="language-plaintext highlighter-rouge">35.6</code> tok/s</td>
      <td><code class="language-plaintext highlighter-rouge">47.0</code> tok/s</td>
      <td><code class="language-plaintext highlighter-rouge">39.8</code> tok/s</td>
    </tr>
    <tr>
      <td>Prefix-cache hit rate</td>
      <td><code class="language-plaintext highlighter-rouge">0.338</code></td>
      <td><code class="language-plaintext highlighter-rouge">0.686</code></td>
      <td><code class="language-plaintext highlighter-rouge">0.652</code></td>
    </tr>
    <tr>
      <td>Prompt reuse ratio</td>
      <td><code class="language-plaintext highlighter-rouge">0.597</code></td>
      <td><code class="language-plaintext highlighter-rouge">0.743</code></td>
      <td><code class="language-plaintext highlighter-rouge">0.744</code></td>
    </tr>
  </tbody>
</table>
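
<p>The deployment ratios in the next table are consistent with simple ratios of the two frontier rates reported above. The sketch below reproduces them under that assumption, and under the further assumption that fleet size scales inversely with the sustainable per-replica request rate; the variable names are illustrative.</p>

```python
rho_linear = 5.50  # max healthy normalized request rate, linear workload
rho_dag = 6.67     # max healthy normalized request rate, DAG workload

useful_work_ratio = rho_dag / rho_linear  # DAG / linear useful work at the frontier
fleet_fraction = rho_linear / rho_dag     # fleet needed if real traffic is DAG-shaped
slack = 1.0 - fleet_fraction              # unused share of a linear-tuned fleet

print(f"useful-work ratio: {useful_work_ratio:.2f}x")
print(f"fleet fraction:    {fleet_fraction:.3f}x")
print(f"slack capacity:    {slack:.1%}")
```

<p>With these inputs the three values round to 1.21x, 0.825x, and 17.5%. The linear-tuned fleet size relative to DAG need is the same number as the useful-work ratio.</p>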

<p><em>Derived deployment implications for Case study 3.</em></p>

<table>
  <thead>
    <tr>
      <th>Comparative quantity</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>DAG replay at <code class="language-plaintext highlighter-rouge">rho_l*</code> meets all SLOs</td>
      <td><code class="language-plaintext highlighter-rouge">yes</code></td>
    </tr>
    <tr>
      <td>DAG / linear useful-work ratio at the frontier</td>
      <td><code class="language-plaintext highlighter-rouge">1.21x</code></td>
    </tr>
    <tr>
      <td>Fleet fraction needed if real traffic is DAG-shaped</td>
      <td><code class="language-plaintext highlighter-rouge">0.825x</code></td>
    </tr>
    <tr>
      <td>Slack capacity in the linear-tuned fleet</td>
      <td><code class="language-plaintext highlighter-rouge">17.5%</code></td>
    </tr>
    <tr>
      <td>Linear-tuned fleet size relative to DAG need</td>
      <td><code class="language-plaintext highlighter-rouge">1.21x</code></td>
    </tr>
  </tbody>
</table>]]></content><author><name>chus-antonanzas</name></author><summary type="html"><![CDATA[Why simple chat benchmarks are not enough for inference performance evaluation, and how to model agentic workloads with branching, prefix reuse, bursty timing, token heterogeneity, and reproducible synthetic sessions.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://gatech-sysml.github.io/preview/pr-43/images/posts/agentic-workloads/header.png" /><media:content medium="image" url="https://gatech-sysml.github.io/preview/pr-43/images/posts/agentic-workloads/header.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Introducing CompOFA</title><link href="https://gatech-sysml.github.io/preview/pr-43/2021/04/28/compofa.html" rel="alternate" type="text/html" title="Introducing CompOFA" /><published>2021-04-28T00:00:00+00:00</published><updated>2026-05-12T21:37:14+00:00</updated><id>https://gatech-sysml.github.io/preview/pr-43/2021/04/28/compofa</id><content type="html" xml:base="https://gatech-sysml.github.io/preview/pr-43/2021/04/28/compofa.html"><![CDATA[<h1 id="introduction">Introduction</h1>

<p>If you’ve trained deep learning models, you know the process can take hours or days (weeks?) and thousands of dollars’ worth of computation. With increasing use of DNNs in common production, this problem only gets bigger – they need to be used on diverse deployment targets with widely varying latency constraints, based on hardware capabilities and application requirements. Designing DNN architectures that maximize accuracy under these constraints adds another degree of complexity requiring manual expertise and/or neural architecture search (NAS) – which are even slower and costlier than training. Clearly, repeating these processes for every deployment target is not scalable and therefore, solving this problem is essential for making DNNs easier to use in real deployment.</p>

<figure class="figure">
  <a class="figure-image" aria-label="figure link">
    <img src="/preview/pr-43/images/posts/compofa/model-family.png" style="
        width: 400px;
        max-height: unset;
      " alt="figure image" loading="lazy" onerror="this.src = '/preview/pr-43/images/fallback.svg'; this.onerror = null;" />
  </a>
  
</figure>

<p>In <em>CompOFA</em>, we propose a faster, more cost-effective technique to build model families that support multiple deployment platforms. Using insights from model design and system deployment, we build upon the current best methods that take 40-50 GPU days of computation and make their training and searching processes <strong>faster by 2x and 216x</strong>, respectively – all while building a family of equally efficient and diverse models!</p>

<h1 id="how-its-done-today">How it’s done today</h1>

<figure class="figure">
  <a class="figure-image" aria-label="Conventional, individual training">
    <img src="/preview/pr-43/images/posts/compofa/naive-training.png" style="
        width: 400px;
        max-height: unset;
      " alt="Conventional, individual training" loading="lazy" onerror="this.src = '/preview/pr-43/images/fallback.svg'; this.onerror = null;" />
  </a>
  
    <figcaption class="figure-caption">
      Conventional, individual training

    </figcaption>
  
</figure>

<p>The prevailing norm today is to build individual neural networks. We design and train single monolithic DNNs with a fixed accuracy and latency measure (or computational complexity, energy usage, etc.). Both designing efficient architectures and training on production-grade datasets require computation worth several GPU hours with slow turnaround, expensive hardware, and expertise in ML and systems. In 2019, a study estimated the carbon emissions of one well-known NAS technique to be 283 metric tons – nearly 60 times what an average human emits in a year! Thus it is simply unscalable to continue this trend of designing and training individual DNNs for deployment.</p>

<figure class="figure">
  <a class="figure-image" aria-label="Once-For-All (OFA): co-trained model families">
    <img src="/preview/pr-43/images/posts/compofa/ofa.png" style="
        width: 400px;
        max-height: unset;
      " alt="Once-For-All (OFA): co-trained model families" loading="lazy" onerror="this.src = '/preview/pr-43/images/fallback.svg'; this.onerror = null;" />
  </a>
  
    <figcaption class="figure-caption">
      Once-For-All (OFA): co-trained model families

    </figcaption>
  
</figure>

<p><em>Once-For-All (OFA)</em> proposed to address this problem via weight-shared sub-networks of a larger network. These sub-networks of varying sizes had diverse accuracy and latency measures and could be trained simultaneously (rather than one by one). After this one-time training, one can independently search for and extract a sub-network with optimal accuracy for a given deployment. Hence, OFA significantly improved scalability over the naïve method. But at 40-50 GPU days of training time, OFA remained expensive and required special training &amp; searching procedures for its huge search space of $10^{19}$ models.</p>

<p>In <em>CompOFA</em>, we find insights that speed up OFA’s training and searching methodologies, while making it easier to use.</p>

<h1 id="compofa">CompOFA</h1>

<p>OFA built a model search space by slicing smaller subnetworks from a larger network – by choosing subsets of its layers (depth), channels (width), resolution, etc. This choice was made independently at each layer, contributing to a combinatorial explosion of $10^{10}-10^{19}$ models! These models don’t come free – training so many of them together needs a slow, phased approach. After training, the search also requires building special accuracy and latency estimators.</p>

<p><strong>But do we need such a large search space?</strong></p>

<ul>
  <li>
    <p><strong>Are all these models efficient? No!</strong> Many of these subnetworks are of dimensions that are suboptimal, lying well below the optimal accuracy-latency tradeoff.</p>
  </li>
  <li>
    <p><strong>Are all these models different enough? No!</strong> Imagine $10^{19}$ networks where the smallest and largest differ in latency by just 100ms – this fine granularity is too small to matter for real hardware.</p>
  </li>
</ul>

<p>In CompOFA, we ask whether we can identify and focus our attention just on models that are close to optimal, at a sufficient granularity. After all, it’s not common practice to treat these model dimensions as independent or random – we often vary dimensions like depth and width together, i.e., in a compound fashion. This common intuition is backed by empirical studies like <a href="https://arxiv.org/abs/1905.11946">EfficientNet</a> and <a href="https://arxiv.org/abs/2003.13678">RegNet</a>, which showed that there are optimal relations between these dimensions.</p>

<figure class="figure">
  <a class="figure-image" aria-label="CompOFA reduces combinatorial explosion of the search space by exploiting the same direction of growth of accuracy and latency.">
    <img src="/preview/pr-43/images/posts/compofa/dw-grid.png" style="
        width: 400px;
        max-height: unset;
      " alt="CompOFA reduces combinatorial explosion of the search space by exploiting the same direction of growth of accuracy and latency." loading="lazy" onerror="this.src = '/preview/pr-43/images/fallback.svg'; this.onerror = null;" />
  </a>
  
    <figcaption class="figure-caption">
      CompOFA reduces combinatorial explosion of the search space by exploiting the same direction of growth of accuracy and latency.

    </figcaption>
  
</figure>

<p>Inspired by this, CompOFA uses a simple but powerful heuristic – <strong>choose models that grow depth and width together</strong>. This makes our search space much more tractable, but still just as efficient and diverse for actual use.</p>
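
<p>To get a feel for how much coupling depth and width shrinks the space, here is a rough counting sketch. The specific settings (5 units, depths of 2 to 4 layers, 3 kernel sizes and 3 width ratios per layer) are assumptions based on my reading of the OFA paper, not values stated in this post, and the compound count is only an illustration of the heuristic, not a statement of CompOFA’s exact configuration.</p>

```python
# Assumed OFA-style space (not stated in this post): 5 units, each with
# depth in {2, 3, 4}, and per layer 3 kernel sizes x 3 width ratios.
UNITS = 5
DEPTHS = (2, 3, 4)
KERNEL_CHOICES = 3
WIDTH_CHOICES = 3

# Independent choices: every layer of every unit picks kernel and width freely.
per_layer = KERNEL_CHOICES * WIDTH_CHOICES
per_unit_independent = sum(per_layer ** depth for depth in DEPTHS)
independent_space = per_unit_independent ** UNITS

# Compound heuristic (illustrative): depth and width move together, so each
# unit picks one of len(DEPTHS) coupled settings, with the kernel size fixed.
compound_space = len(DEPTHS) ** UNITS

print(f"independent choices: {independent_space:.2e}")
print(f"compound choices:    {compound_space}")
```

<p>Under these assumptions the independent space is on the order of $10^{19}$, in line with the figure quoted above, while the coupled space has only a few hundred entries.</p>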

<p>In our paper, we show that we can train this model family in <strong>half the time</strong> and all at once, without a slow phased approach. After training, we can search for models <strong>216x</strong> faster, and without the time and effort to build special estimators.</p>

<center>

<!-- | \*\*Metric\*\*               |   \*\*OFA\*\* | \*\*CompOFA\*\* |  \*\*Savings\*\* |
|--------------------------|----------:|------------:|-------------:|
| \*\*Train Time\*\* \(GPU hrs) |     978.3 |       493.5 |       \*\*2x\*\* |
| \*\*Train Cost\*\* \(USD)     |     \$2.4k |       \$1.2k |       \*\*2x\*\* |
| \*\*CO2 emissions\*\* \(lbs)  |       277 |         128 |       \*\*2x\*\* |
| \*\*Search Time\*\*          | 4.5 hours |  75 seconds |     \*\*216x\*\* | -->

<figure class="figure">
  <a class="figure-image" aria-label="figure link">
    <img src="/preview/pr-43/images/posts/compofa/table.png" style="
        width: 400px;
        max-height: unset;
      " alt="figure image" loading="lazy" onerror="this.src = '/preview/pr-43/images/fallback.svg'; this.onerror = null;" />
  </a>
  
</figure>



</center>

<p>Despite these savings, CompOFA does not compromise on its original goal. It’s able to extract networks for multiple latency targets on distinct hardware types, and match the existing SOTA in both optimality and range of its models.</p>

<figure class="figure">
  <a class="figure-image" aria-label="CompOFA generates efficient model families for diverse hardware -- from mobile phones to GPUs">
    <img src="/preview/pr-43/images/posts/compofa/pareto-results.png" style="
        width: 400px;
        max-height: unset;
      " alt="CompOFA generates efficient model families for diverse hardware -- from mobile phones to GPUs" loading="lazy" onerror="this.src = '/preview/pr-43/images/fallback.svg'; this.onerror = null;" />
  </a>
  
    <figcaption class="figure-caption">
      CompOFA generates efficient model families for diverse hardware – from mobile phones to GPUs

    </figcaption>
  
</figure>

<h1 id="learn-more">Learn more</h1>

<p>CompOFA improves the speed, cost, and usability of jointly training models for many deployment targets. By highlighting insights on model design and system deployment, we try to address an important problem for real-world usability of DNNs.</p>

<p>To learn more, please check out our <a href="https://arxiv.org/abs/2104.12642">paper</a> and <a href="https://iclr.cc/media/PosterPDFs/ICLR%202021/2c3ddf4bf13852db711dd1901fb517fa.png">poster</a> at ICLR 2021! Our code and pretrained models are also available on our <a href="https://github.com/gatech-sysml/compofa">GitHub repository</a>.</p>

<h2 id="citation">Citation</h2>

<div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@inproceedings</span><span class="p">{</span><span class="nl">compofa-iclr21</span><span class="p">,</span>
  <span class="na">author</span>    <span class="p">=</span> <span class="s">{Manas Sahni and Shreya Varshini and Alind Khare and
               Alexey Tumanov}</span><span class="p">,</span>
  <span class="na">title</span>     <span class="p">=</span> <span class="s">{CompOFA: Compound Once-For-All Networks for Faster Multi-Platform Deployment}</span><span class="p">,</span>
  <span class="na">month</span>     <span class="p">=</span> <span class="s">{May}</span><span class="p">,</span>
  <span class="na">booktitle</span> <span class="p">=</span> <span class="s">{Proc. of the 9th International Conference on Learning Representations}</span><span class="p">,</span>
  <span class="na">series</span> <span class="p">=</span> <span class="s">{ICLR '21}</span><span class="p">,</span>
  <span class="na">year</span> <span class="p">=</span> <span class="s">{2021}</span><span class="p">,</span>
  <span class="na">url</span>       <span class="p">=</span> <span class="s">{https://openreview.net/forum?id=IgIk8RRT-Z}</span>
<span class="p">}</span>
</code></pre></div></div>]]></content><author><name>alind-khare</name></author><summary type="html"><![CDATA[Fast & Efficient Training of Once-For-All (OFA) models.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://github.com/gatech-sysml/CompOFA/raw/main/figures/overview.png" /><media:content medium="image" url="https://github.com/gatech-sysml/CompOFA/raw/main/figures/overview.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry></feed>