
Conversation State Management

Advanced · 22 min read

The Chat App That Forgot Everything

A developer ships their first AI chat feature. Users love it. Then someone reports: "The AI forgets what I said two messages ago." The developer checks the code — they're sending only the latest user message to the API. No conversation history. Each request is a clean slate.

Here's the thing most people miss about LLMs: they have zero memory. Every single API call is stateless. The model doesn't "remember" your conversation. That continuity you experience in ChatGPT? It's an illusion, carefully maintained by the frontend sending the entire conversation history with every request.

This means conversation state management isn't a nice-to-have — it's the core architecture problem of any AI-powered frontend.

Mental Model

Think of an LLM like a brilliant consultant with amnesia. Every time you call them, you hand them a written transcript of everything you've discussed so far, plus your new question. They read the whole thing, write a response, and immediately forget everything. Your job as the frontend engineer is to maintain that transcript — adding to it, trimming it when it gets too long, and making sure it contains exactly the right context for each call.

The Message Data Model

Every conversation is an array of messages. Each message has a specific shape:

interface Message {
  id: string
  role: 'system' | 'user' | 'assistant' | 'tool'
  content: string
  createdAt: Date
  // Tool-related fields
  toolInvocations?: ToolInvocation[]
  // Metadata for your app
  metadata?: Record<string, unknown>
}

The role field is the most important piece. It tells the model who said what:

  • system — Instructions that shape the model's behavior. "You are a helpful coding tutor." The user never sees this.
  • user — What the human typed.
  • assistant — What the model responded.
  • tool — Results from tool calls (function calling). More on this later.

In practice, you'll often want to track additional metadata per message — token counts, latency, model version, whether the message was edited. The model doesn't see this metadata, but your UI needs it.

interface MessageWithMeta extends Message {
  tokenCount?: number
  model?: string
  latencyMs?: number
  isEdited?: boolean
}

Quiz: Why does every API request to an LLM need the full conversation history?

The Messages Array — Append-Only by Design

The conversation state is an array of messages, and it follows an append-only pattern. You never mutate existing messages in place. You never delete from the middle. You only append.

// Correct: append-only updates
function addMessage(
  messages: Message[],
  newMessage: Message
): Message[] {
  return [...messages, newMessage]
}

// The full cycle for one exchange:
let messages: Message[] = [
  { id: '1', role: 'system', content: 'You are a coding tutor.', createdAt: new Date() }
]

// User sends a message
messages = addMessage(messages, {
  id: '2',
  role: 'user',
  content: 'Explain closures in JavaScript',
  createdAt: new Date()
})

// Send entire array to API, get response
const response = await callLLM(messages)

// Append the assistant's response
messages = addMessage(messages, {
  id: '3',
  role: 'assistant',
  content: response.content,
  createdAt: new Date()
})

Why append-only? Three reasons:

  1. Immutability — React needs new array references to trigger re-renders. Mutating in place breaks change detection.
  2. History integrity — The model's responses depend on the exact sequence of prior messages. Editing earlier messages retroactively would make later responses nonsensical.
  3. Debuggability — When something goes wrong, you can replay the exact sequence of messages that led to a bad response.
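
In React, this append-only discipline maps naturally onto a reducer. A minimal framework-free sketch (chatReducer and ChatAction are illustrative names, not SDK APIs); plug it into a component with useReducer(chatReducer, initialMessages):

```typescript
type ChatAction<M> = { type: 'append'; message: M }

// Pure reducer: every append returns a brand-new array reference and leaves
// the previous one untouched, so React's reference check always sees a change.
function chatReducer<M>(state: M[], action: ChatAction<M>): M[] {
  if (action.type === 'append') {
    return [...state, action.message]
  }
  return state // unknown/future actions leave state unchanged
}
```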

Quiz: Why should the messages array follow an append-only pattern instead of allowing edits to previous messages?

Token Budget Management

Here's the problem nobody thinks about until it bites them: conversations grow, but context windows don't.

Every model has a maximum context window — the total number of tokens it can process in a single request. GPT-4o handles 128K tokens. Claude handles 200K. Sounds like a lot, until your user has a 50-message conversation about a complex codebase, each message including code snippets. Suddenly you're bumping up against the limit, and the API starts rejecting requests or — worse — silently truncating your messages.
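
Every truncation strategy below takes a countTokens callback. Exact counts require a real tokenizer library, but for budget decisions a rough heuristic with a safety margin often suffices. A sketch (the 4-characters-per-token ratio is a rule of thumb for English text, and estimateTokens is an illustrative name):

```typescript
interface Countable {
  role: string
  content: string
}

// Rough estimate: ~4 characters per token for English text, plus a small
// fixed overhead per message for role framing. For exact counts, swap in
// a real tokenizer; this approximation is fine for budget decisions if
// you leave headroom.
function estimateTokens(msg: Countable): number {
  return Math.ceil(msg.content.length / 4) + 4
}
```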

You need a token budget strategy. There are three main approaches:

Sliding Window

Drop the oldest messages to stay within budget. Simple, but you lose context:

function slidingWindow(
  messages: Message[],
  maxTokens: number,
  countTokens: (msg: Message) => number
): Message[] {
  const system = messages.filter(m => m.role === 'system')
  const conversation = messages.filter(m => m.role !== 'system')

  let totalTokens = system.reduce((sum, m) => sum + countTokens(m), 0)
  const kept: Message[] = []

  // Walk backward from newest, keep until budget is full
  for (let i = conversation.length - 1; i >= 0; i--) {
    const tokens = countTokens(conversation[i])
    if (totalTokens + tokens > maxTokens) break
    totalTokens += tokens
    kept.unshift(conversation[i])
  }

  return [...system, ...kept]
}

Notice the system message is always preserved — you never drop it. The sliding window walks backward from the newest message, keeping as many recent messages as the budget allows.

Summarization

When you hit the budget, summarize older messages into a compressed system message:

async function summarizeOldMessages(
  messages: Message[],
  maxTokens: number,
  countTokens: (msg: Message) => number
): Promise<Message[]> {
  const totalTokens = messages.reduce((sum, m) => sum + countTokens(m), 0)

  if (totalTokens <= maxTokens) return messages

  // Split: keep recent messages, summarize older ones
  const splitPoint = Math.floor(messages.length * 0.6)
  const oldMessages = messages.slice(1, splitPoint) // skip system
  const recentMessages = messages.slice(splitPoint)

  const summary = await callLLM([
    {
      id: crypto.randomUUID(),
      role: 'system',
      content: 'Summarize this conversation concisely, preserving key facts and decisions.',
      createdAt: new Date()
    },
    ...oldMessages
  ])

  return [
    messages[0], // original system message
    {
      id: crypto.randomUUID(),
      role: 'system',
      content: `Previous conversation summary: ${summary.content}`,
      createdAt: new Date()
    },
    ...recentMessages
  ]
}

This costs an extra API call but preserves important context better than pure truncation.

Hybrid: Sliding Window + Key Message Pinning

The most production-ready approach. Keep recent messages via sliding window, but "pin" important messages (user-marked, or messages containing decisions/requirements):

function hybridTruncation(
  messages: Message[],
  maxTokens: number,
  countTokens: (msg: Message) => number
): Message[] {
  const system = messages.filter(m => m.role === 'system')
  const pinned = messages.filter(m => m.metadata?.pinned)
  const rest = messages.filter(
    m => m.role !== 'system' && !m.metadata?.pinned
  )

  let budget = maxTokens
  budget -= system.reduce((s, m) => s + countTokens(m), 0)
  budget -= pinned.reduce((s, m) => s + countTokens(m), 0)

  // Reuse slidingWindow via a placeholder system message, then strip it
  const windowed = slidingWindow(
    [{ id: '_', role: 'system', content: '', createdAt: new Date() }, ...rest],
    budget,
    countTokens
  ).filter(m => m.id !== '_')

  // Re-interleave pinned and windowed messages in chronological order
  const body = [...pinned, ...windowed].sort(
    (a, b) => a.createdAt.getTime() - b.createdAt.getTime()
  )
  return [...system, ...body]
}

Strategy                  | Pros                                        | Cons                                          | Best For
Sliding Window            | Simple, predictable, no extra API calls     | Loses old context entirely                    | Casual chat, quick Q&A
Summarization             | Preserves key context from old messages     | Extra API call cost, summary can lose nuance  | Long technical discussions
Hybrid (Window + Pinning) | Keeps critical context while trimming noise | More complex state management                 | Production apps with important decisions

Quiz: In a sliding window truncation strategy, which messages should you NEVER drop?

AI State vs UI State

The Vercel AI SDK introduces a powerful pattern for managing conversation state: splitting it into two distinct layers.

AI State is the serializable data you send to the model. It's the message history — roles, content, tool calls, tool results. It's JSON-serializable, can be stored in a database, and can be loaded back to resume a conversation. The model sees this.

UI State is the React elements you render on screen. It's the chat bubbles, the loading spinners, the inline code editors, the image previews. The model never sees this. It can contain React components, event handlers, refs — anything that React can render but JSON can't serialize.

// AI State: what the model sees (serializable)
type AIState = {
  messages: {
    role: 'user' | 'assistant' | 'system' | 'tool'
    content: string
    toolInvocations?: ToolInvocation[]
  }[]
}

// UI State: what the user sees (React elements)
type UIState = {
  id: string
  display: React.ReactNode // could be <WeatherCard />, <CodeBlock />, anything
}[]

Why split them? Because what the user sees and what the model needs are fundamentally different things.

When the model calls a tool to get weather data and returns { temperature: 72, condition: "sunny" }, the AI State stores the raw JSON. But the UI State renders a beautiful WeatherCard component with an animated sun icon and a temperature gauge. The model doesn't need to know about the sun icon. The user doesn't need to see raw JSON.

// In a Server Action using Vercel AI SDK:
async function sendMessage(input: string) {
  'use server'

  const aiState = getMutableAIState()

  // Update AI State (serializable)
  aiState.update([
    ...aiState.get(),
    { role: 'user', content: input }
  ])

  const result = await streamUI({
    model: openai('gpt-4o'),
    messages: aiState.get(),
    tools: {
      getWeather: {
        parameters: z.object({ city: z.string() }),
        generate: async function* ({ city }) {
          yield <LoadingSpinner />
          const weather = await fetchWeather(city)

          // AI State gets the raw data
          aiState.done([
            ...aiState.get(),
            { role: 'tool', content: JSON.stringify(weather) }
          ])

          // UI State gets a rich component
          return <WeatherCard data={weather} />
        }
      }
    }
  })

  return result
}

This split is what makes generative UI possible — the model can trigger rendering of arbitrary React components without those components ever touching the serialization layer.

Common Trap

Never put React elements in AI State. It's tempting to store everything in one place, but React nodes aren't serializable — you can't save them to a database, send them over the network, or restore them on page reload. Keep AI State as plain JSON. Keep UI State as React elements. If you mix them, persistence and hydration will break in subtle, hard-to-debug ways.
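
A cheap dev-time guard can catch accidental mixing early. This sketch walks the state and rejects the values JSON.stringify silently drops (functions such as event handlers, undefined, symbols); it is a helper assumed for this article's pattern, not part of the AI SDK:

```typescript
// Recursively reject values that JSON.stringify would silently drop,
// so non-serializable data never sneaks into AI State.
function assertSerializable(value: unknown, path = 'aiState'): void {
  if (value === null) return
  const t = typeof value
  if (t === 'function' || t === 'symbol' || t === 'undefined') {
    throw new Error(`${path} contains a non-serializable ${t}`)
  }
  if (t === 'object') {
    for (const [key, child] of Object.entries(value as Record<string, unknown>)) {
      assertSerializable(child, `${path}.${key}`)
    }
  }
}
```

Calling this before aiState.update or aiState.done in development builds turns a subtle hydration bug into an immediate, located error.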

Multi-Turn with Tool Calls

Tool calling (also called function calling) adds complexity to the message array. A single "turn" can now involve multiple back-and-forth messages between the model and your tools, all happening before the user sees a response.

Here's what the message flow looks like:

const messages = [
  // System instructions
  { role: 'system', content: 'You can look up weather and book flights.' },

  // User asks something
  { role: 'user', content: 'What is the weather in Tokyo and book me a flight there?' },

  // Assistant decides to call TWO tools
  {
    role: 'assistant',
    content: null,
    tool_calls: [
      { id: 'call_1', type: 'function', function: { name: 'getWeather', arguments: '{"city":"Tokyo"}' } },
      { id: 'call_2', type: 'function', function: { name: 'bookFlight', arguments: '{"destination":"Tokyo"}' } }
    ]
  },

  // Tool results come back (one per tool call)
  { role: 'tool', tool_call_id: 'call_1', content: '{"temp": 22, "condition": "cloudy"}' },
  { role: 'tool', tool_call_id: 'call_2', content: '{"confirmation": "FL-8842", "departure": "2026-04-15"}' },

  // Assistant synthesizes the tool results into a response
  {
    role: 'assistant',
    content: 'Tokyo is currently 22°C and cloudy. I have booked flight FL-8842 departing April 15th.'
  }
]

Notice the pattern: the assistant message with tool_calls and the corresponding tool result messages must always appear together in the array. If you truncate messages and accidentally separate a tool call from its result, the model gets confused — it sees a call was made but no result came back.

function safeTruncate(
  messages: Message[],
  maxTokens: number,
  countTokens: (msg: Message) => number
): Message[] {
  // Truncate first, then repair any tool call/result pairs the cut separated
  const truncated = slidingWindow(messages, maxTokens, countTokens)

  // Verify: every tool_call has its matching tool result
  const toolCallIds = new Set<string>()
  const toolResultIds = new Set<string>()

  for (const msg of truncated) {
    if (msg.role === 'assistant' && msg.tool_calls) {
      for (const call of msg.tool_calls) {
        toolCallIds.add(call.id)
      }
    }
    if (msg.role === 'tool' && msg.tool_call_id) {
      toolResultIds.add(msg.tool_call_id)
    }
  }

  // If any tool call is missing its result (or vice versa), remove the orphan
  return truncated.filter(msg => {
    if (msg.role === 'assistant' && msg.tool_calls) {
      return msg.tool_calls.every(c => toolResultIds.has(c.id))
    }
    if (msg.role === 'tool') {
      return toolCallIds.has(msg.tool_call_id)
    }
    return true
  })
}

Quiz: What happens if you truncate a conversation and a tool call message loses its corresponding tool result message?

System Prompts — Where They Live

The system prompt is the first message in your array, with role: 'system'. It's the invisible hand that shapes every response.

const systemMessage: Message = {
  id: 'system-1',
  role: 'system',
  content: `You are a senior JavaScript tutor.
Rules:
- Explain concepts using real-world analogies
- Always show code examples
- If the user's code has a bug, don't just fix it — explain why it's wrong
- Never use var, always use const or let
- Keep responses under 500 words unless the user asks for more detail`,
  createdAt: new Date()
}

Some production patterns for system prompts:

Dynamic system prompts — update the system message based on context. If the user switches from "beginner mode" to "advanced mode," swap the system prompt:

function buildSystemPrompt(userLevel: 'beginner' | 'advanced'): string {
  const base = 'You are a JavaScript tutor.'
  const levelInstructions = userLevel === 'beginner'
    ? 'Explain everything step by step. Avoid jargon. Use simple analogies.'
    : 'Be concise. Assume strong JS knowledge. Focus on edge cases and internals.'

  return `${base}\n${levelInstructions}`
}

Multi-section system prompts — for complex apps, structure the system prompt with clear sections:

const systemPrompt = `
## Role
You are a code review assistant for a React/TypeScript codebase.

## Context
The user is working on: ${projectDescription}
Tech stack: ${techStack.join(', ')}

## Rules
- Flag any use of 'any' type
- Suggest performance improvements for components over 100 lines
- Always check for missing error boundaries

## Output Format
Use markdown. Start with a severity rating: Critical / Warning / Info.
`

The system prompt counts against your token budget. A 2000-token system prompt means 2000 fewer tokens available for conversation history. Keep it focused.

Should you use multiple system messages or one? Technically, most APIs support multiple messages with role: 'system' placed at different points in the conversation. OpenAI's models treat them all as system instructions. Anthropic's Claude treats the first system message specially (it goes in a system parameter) and subsequent ones as regular messages. For maximum compatibility, stick with a single system message at the start. If you need to inject context mid-conversation (like updated user preferences), append it as a user message with a clear label: "System note: user has switched to advanced mode."
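
That mid-conversation injection is worth standardizing as a helper. A minimal sketch using just the role/content shape (injectSystemNote is an illustrative name; a real app would also set id and createdAt):

```typescript
// Append a clearly labeled context update as a regular user message,
// leaving the single system message at the start untouched.
function injectSystemNote<M extends { role: string; content: string }>(
  messages: M[],
  note: string
): (M | { role: 'user'; content: string })[] {
  return [...messages, { role: 'user', content: `System note: ${note}` }]
}
```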

Thread Management

Real chat apps don't have one conversation. They have dozens. Each conversation is a thread — an independent messages array with its own ID, title, and metadata.

interface Thread {
  id: string
  title: string
  createdAt: Date
  updatedAt: Date
  messages: Message[]
  metadata?: {
    model?: string
    systemPrompt?: string
    tokenCount?: number
  }
}

interface ThreadStore {
  threads: Map<string, Thread>
  activeThreadId: string | null
}

The core operations:

function createThread(systemPrompt: string): Thread {
  return {
    id: crypto.randomUUID(),
    title: 'New Chat',
    createdAt: new Date(),
    updatedAt: new Date(),
    messages: [{
      id: crypto.randomUUID(),
      role: 'system',
      content: systemPrompt,
      createdAt: new Date()
    }]
  }
}

function switchThread(
  store: ThreadStore,
  threadId: string
): ThreadStore {
  return {
    ...store,
    activeThreadId: threadId
  }
}

function getActiveMessages(store: ThreadStore): Message[] {
  if (!store.activeThreadId) return []
  return store.threads.get(store.activeThreadId)?.messages ?? []
}

Auto-titling — generate a thread title from the first exchange:

async function generateTitle(messages: Message[]): Promise<string> {
  const firstExchange = messages
    .filter(m => m.role === 'user' || m.role === 'assistant')
    .slice(0, 2)

  if (firstExchange.length === 0) return 'New Chat'

  const response = await callLLM([
    {
      role: 'system',
      content: 'Generate a short title (max 6 words) for this conversation. Return only the title, nothing else.'
    },
    ...firstExchange
  ])

  return response.content
}

Persistence — threads need to survive page reloads. Store them in localStorage for client-only apps, or in a database for production:

function persistThreads(store: ThreadStore): void {
  const serializable = {
    activeThreadId: store.activeThreadId,
    // Map isn't JSON-serializable, so flatten it to a plain object first.
    // Note: Date fields become ISO strings and must be revived on load.
    threads: Object.fromEntries(store.threads.entries())
  }
  localStorage.setItem('chat-threads', JSON.stringify(serializable))
}
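
Loading is the subtle half: JSON.parse hands back plain objects and ISO date strings, so the Map and the Date fields must be revived. A sketch (deserializeThreads and the Stored* types are illustrative names assumed here, not a library API):

```typescript
interface StoredMessage {
  id: string
  role: string
  content: string
  createdAt: string // Dates become ISO strings after a JSON round trip
}

interface StoredThread {
  id: string
  title: string
  createdAt: string
  updatedAt: string
  messages: StoredMessage[]
}

interface RevivedThread {
  id: string
  title: string
  createdAt: Date
  updatedAt: Date
  messages: { id: string; role: string; content: string; createdAt: Date }[]
}

// Rebuild the Map and revive the Date fields that JSON.stringify flattened.
function deserializeThreads(raw: string | null) {
  const threads = new Map<string, RevivedThread>()
  if (!raw) return { threads, activeThreadId: null as string | null }

  const parsed = JSON.parse(raw) as {
    activeThreadId: string | null
    threads: Record<string, StoredThread>
  }

  for (const [id, t] of Object.entries(parsed.threads)) {
    threads.set(id, {
      ...t,
      createdAt: new Date(t.createdAt),
      updatedAt: new Date(t.updatedAt),
      messages: t.messages.map(m => ({ ...m, createdAt: new Date(m.createdAt) }))
    })
  }

  return { threads, activeThreadId: parsed.activeThreadId }
}
```

Usage: const store = deserializeThreads(localStorage.getItem('chat-threads')).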

The useChat Hook — Anatomy

The Vercel AI SDK's useChat hook is the most widely used abstraction for managing conversation state in React. It handles the entire message lifecycle — the messages array, input state, streaming, error handling, and API communication — in a single hook.

'use client'

import { useChat } from 'ai/react'

function ChatInterface() {
  const {
    messages,      // Message[] — the conversation history
    input,         // string — current input field value
    handleInputChange, // (e: ChangeEvent) => void
    handleSubmit,  // (e: FormEvent) => void — sends the message
    isLoading,     // boolean — true while streaming
    error,         // Error | undefined
    stop,          // () => void — abort the current stream
    reload,        // () => void — regenerate the last response
    append,        // (message: Message) => void — programmatically add a message
    setMessages,   // (messages: Message[]) => void — replace entire array
  } = useChat({
    api: '/api/chat',
    initialMessages: [],
    onFinish: (message) => {
      // Called when assistant response is complete
      console.log('Finished:', message.content)
    },
    onError: (error) => {
      console.error('Chat error:', error)
    }
  })

  return (
    <form onSubmit={handleSubmit}>
      <div>
        {messages.map(m => (
          <div key={m.id}>
            <strong>{m.role}:</strong> {m.content}
          </div>
        ))}
      </div>
      {error && <p>Something went wrong. Try again.</p>}
      <input
        value={input}
        onChange={handleInputChange}
        placeholder="Type a message..."
        disabled={isLoading}
      />
      <button type="submit" disabled={isLoading}>
        {isLoading ? 'Thinking...' : 'Send'}
      </button>
      {isLoading && (
        <button type="button" onClick={stop}>
          Stop
        </button>
      )}
    </form>
  )
}

What useChat does under the hood:

  1. Manages the messages array — appends user messages when you submit, appends assistant messages as they stream in
  2. Handles streaming — connects to your API endpoint, reads the streaming response, and updates messages progressively as tokens arrive
  3. Tracks loading state — isLoading is true from the moment you submit until the stream ends
  4. Exposes escape hatches — setMessages lets you replace the entire array (for loading a thread from a database), append lets you add messages programmatically (for suggested prompts), stop lets you abort mid-stream

The API route it connects to is a standard Next.js route handler:

// app/api/chat/route.ts
import { openai } from '@ai-sdk/openai'
import { streamText } from 'ai'

export async function POST(request: Request) {
  const { messages } = await request.json()

  const result = streamText({
    model: openai('gpt-4o'),
    messages
  })

  return result.toDataStreamResponse()
}

The entire messages array gets posted to this endpoint on every submit. The route streams back the response. The hook appends it. That's the full loop.

Quiz: What does useChat's setMessages function allow you to do?

Putting It All Together

In a production AI chat app, these concepts combine into a layered architecture:

Thread Management (create, switch, delete, list)
    |
    v
Message Array (append-only, immutable updates)
    |
    v
Token Budget (sliding window / summarization / pinning)
    |
    v
AI State / UI State Split (serializable vs renderable)
    |
    v
API Layer (useChat, streaming, tool execution)

The thread manager holds multiple conversations. Each conversation has a messages array. Before sending to the API, the token budget layer trims the array to fit the context window. The AI State / UI State split ensures the model gets clean JSON while users see rich React components. And the API layer handles the actual network call and streaming.
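
As a sketch, the request-preparation path through those layers can be collapsed into a single function: trim the active thread's messages to budget, then strip app-only fields down to the role/content payload the API actually needs. The trimming here is a compact inline version of the sliding window shown earlier, and prepareRequest is an illustrative name, not an SDK API:

```typescript
interface Msg {
  role: 'system' | 'user' | 'assistant' | 'tool'
  content: string
}

// Trim to budget (system messages always kept, then newest-first for the
// rest), then reduce each message to the role/content payload for the API.
function prepareRequest(
  messages: Msg[],
  maxTokens: number,
  countTokens: (m: Msg) => number
): { role: string; content: string }[] {
  const system = messages.filter(m => m.role === 'system')
  const rest = messages.filter(m => m.role !== 'system')

  let used = system.reduce((s, m) => s + countTokens(m), 0)
  const kept: Msg[] = []
  for (let i = rest.length - 1; i >= 0; i--) {
    const cost = countTokens(rest[i])
    if (used + cost > maxTokens) break
    used += cost
    kept.unshift(rest[i])
  }

  return [...system, ...kept].map(({ role, content }) => ({ role, content }))
}
```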

What developers do: Sending only the latest user message to the API
Why it fails: LLMs are stateless. Without the full history, the model has zero context about what was discussed before. It will respond as if this is the first message in the conversation.
What they should do: Always send the full messages array (or a properly truncated version)

What developers do: Mutating messages in place with array.push()
Why it fails: React compares references to detect changes. Mutating an array in place keeps the same reference, so React won't re-render. You'll see stale UI that doesn't show new messages.
What they should do: Create new arrays with the spread operator: [...messages, newMessage]

What developers do: Storing React components in AI State
Why it fails: AI State must be serializable for persistence and transmission. React nodes contain functions, refs, and circular references that break JSON.stringify. Use the AI State / UI State split pattern.
What they should do: Keep AI State as plain JSON. Put React elements in UI State only.

What developers do: Truncating messages without checking for orphaned tool calls
Why it fails: A tool call without its result (or a result without its call) creates an inconsistent history that confuses the model. It may hallucinate results, retry calls, or produce incoherent responses.
What they should do: Always keep tool call messages and their result messages together as atomic pairs

Key Rules

  1. LLMs are stateless — you must send the full conversation history with every API request
  2. Messages arrays are append-only with immutable updates for React compatibility
  3. System messages are always preserved during truncation — never drop them
  4. Token budget strategies: sliding window (simple), summarization (preserves context), or hybrid with pinning (production-grade)
  5. AI State is serializable JSON for the model. UI State is React elements for the user. Never mix them.
  6. Tool calls and their results are atomic pairs — never truncate one without the other
  7. useChat manages messages, streaming, input state, loading, and errors in a single hook