
Prompt Caching

Prompt caching reduces latency and cost by reusing computed context from previous requests. When the beginning of your prompt matches a cached prefix, Stacknet skips re-processing those tokens.

How it works

```
Request 1:  [System prompt + context] + [User message A]
            Full computation → cache prefix stored

Request 2:  [System prompt + context] + [User message B]
            ┌───────────┴───────────┐
            │ Cached prefix         │ New tokens only
            │ (skip computation)    │ (computed fresh)
            └───────────┬───────────┘
                 Faster response
```
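The prefix-matching step above can be sketched as follows. This is an illustrative model, not Stacknet's internal implementation; the function name and token representation are assumptions for the example:

```typescript
// Illustrative sketch: given the token sequence stored by a previous
// request and a new request's tokens, count how many leading tokens
// match and can therefore be served from cache.
function cachedPrefixLength(cached: string[], request: string[]): number {
  let n = 0;
  while (n < cached.length && n < request.length && cached[n] === request[n]) {
    n++;
  }
  return n;
}

// Request 1 populates the cache; request 2 shares its prefix.
const request1 = ["SYS", "CTX", "CTX", "userA"];
const request2 = ["SYS", "CTX", "CTX", "userB"];

const hit = cachedPrefixLength(request1, request2);
console.log(hit); // 3 — only "userB" is computed fresh
```

Only the leading run of identical tokens counts: as soon as the sequences diverge, everything after that point must be recomputed.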

What gets cached

| Cacheable                | Not cacheable                        |
| ------------------------ | ------------------------------------ |
| System prompts           | Dynamic user messages                |
| Few-shot examples        | Streaming deltas                     |
| Tool definitions         | Random/temperature-dependent outputs |
| Static context documents | Cross-node state                     |

Usage

Caching is automatic — no special configuration required. The network transparently caches and reuses prefixes when beneficial.

Optimizing for cache hits

Structure your prompts with static content first:

```typescript
const response = await client.chat.completions.create({
  model: 'preview',
  messages: [
    // Static — will be cached after first request
    { role: 'system', content: 'You are a helpful assistant specialized in...' },
    // Static context — also cached
    { role: 'user', content: 'Here is the reference document: [long document]' },
    { role: 'assistant', content: 'I have read the document.' },
    // Dynamic — only this part changes between requests
    { role: 'user', content: userQuestion } // Changes each request
  ]
})
```
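To see why static-first ordering matters, consider how much of two consecutive requests actually matches. A sketch, using a hypothetical helper that is not part of the client library:

```typescript
// Illustrative: count how many leading messages two chat requests share.
// Only this leading run of identical messages can be reused from cache,
// so dynamic content should always come last.
type Message = { role: string; content: string };

function sharedMessagePrefix(a: Message[], b: Message[]): number {
  let n = 0;
  while (
    n < a.length && n < b.length &&
    a[n].role === b[n].role && a[n].content === b[n].content
  ) {
    n++;
  }
  return n;
}

// The static part is identical across requests.
const staticPart: Message[] = [
  { role: 'system', content: 'You are a helpful assistant specialized in...' },
  { role: 'user', content: 'Here is the reference document: [long document]' },
  { role: 'assistant', content: 'I have read the document.' },
];

const reqA = [...staticPart, { role: 'user', content: 'Question A' }];
const reqB = [...staticPart, { role: 'user', content: 'Question B' }];

console.log(sharedMessagePrefix(reqA, reqB)); // 3 — the entire static part is reusable
```

If the dynamic question were placed first instead, the shared prefix would be zero and nothing could be reused.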

Cache indicators

The response includes cache metadata:

```json
{
  "usage": {
    "prompt_tokens": 1500,
    "completion_tokens": 200,
    "prompt_tokens_details": {
      "cached_tokens": 1200
    }
  }
}
```
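You can use this metadata to track how effective your prompt structure is. A minimal sketch, assuming the field names shown above; the ratio calculation is illustrative:

```typescript
// Shape of the usage object from the response metadata above.
interface Usage {
  prompt_tokens: number;
  completion_tokens: number;
  prompt_tokens_details: { cached_tokens: number };
}

// Fraction of the prompt that was served from cache rather than recomputed.
function cacheHitRatio(usage: Usage): number {
  return usage.prompt_tokens_details.cached_tokens / usage.prompt_tokens;
}

const usage: Usage = {
  prompt_tokens: 1500,
  completion_tokens: 200,
  prompt_tokens_details: { cached_tokens: 1200 },
};

console.log(cacheHitRatio(usage)); // 0.8 — 80% of the prompt tokens were cached
```

A ratio that drops between requests usually means the static prefix changed, for example because a timestamp or session ID was injected near the top of the prompt.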

Distributed caching

Across Stacknet, caching operates at the node level:

  • Each aISP node maintains its own semantic cache shard
  • Warm affinity routing preferentially routes repeat requests to the same node
  • Cache hits improve aISP efficiency and reduce user costs
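Warm affinity routing can be pictured as mapping a request's static prefix to a node deterministically, so repeat requests land where the cache is already warm. A hypothetical sketch; the hash function and node list are illustrative, not Stacknet's actual routing scheme:

```typescript
// Simple rolling hash over the static prefix (illustrative only).
function hashPrefix(prefix: string): number {
  let h = 0;
  for (const ch of prefix) {
    h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  }
  return h;
}

// Map a prefix to a node: the same prefix always picks the same node,
// so that node's cache shard stays warm for repeat requests.
function pickNode(prefix: string, nodes: string[]): string {
  return nodes[hashPrefix(prefix) % nodes.length];
}

const nodes = ['node-a', 'node-b', 'node-c'];
const prefix = 'SYSTEM PROMPT + reference document';

// Deterministic: repeat requests with the same prefix route identically.
console.log(pickNode(prefix, nodes) === pickNode(prefix, nodes)); // true
```

Real deployments would also account for node load and failure, typically with consistent hashing so that adding or removing a node disturbs as few prefix-to-node assignments as possible.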