Conversation State Management
The Chat App That Forgot Everything
A developer ships their first AI chat feature. Users love it. Then someone reports: "The AI forgets what I said two messages ago." The developer checks the code — they're sending only the latest user message to the API. No conversation history. Each request is a clean slate.
Here's the thing most people miss about LLMs: they have zero memory. Every single API call is stateless. The model doesn't "remember" your conversation. That continuity you experience in ChatGPT? It's an illusion, carefully maintained by the frontend sending the entire conversation history with every request.
This means conversation state management isn't a nice-to-have — it's the core architecture problem of any AI-powered frontend.
Think of an LLM like a brilliant consultant with amnesia. Every time you call them, you hand them a written transcript of everything you've discussed so far, plus your new question. They read the whole thing, write a response, and immediately forget everything. Your job as the frontend engineer is to maintain that transcript — adding to it, trimming it when it gets too long, and making sure it contains exactly the right context for each call.
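In code, "maintaining the transcript" is nothing more exotic than resending an ever-growing array. A minimal sketch (the payload shape here is an illustrative assumption, not a specific provider's API):

```typescript
// Illustrative: every request body carries the entire transcript plus
// the new question. Nothing is remembered server-side between calls.
interface ChatMessage {
  role: 'system' | 'user' | 'assistant'
  content: string
}

function buildRequestBody(history: ChatMessage[], question: string): string {
  // Full history + new question, serialized for every single call
  return JSON.stringify({
    messages: [...history, { role: 'user', content: question }]
  })
}
```

The important part is what's absent: no session ID, no server-side memory handle. The array is the memory.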
The Message Data Model
Every conversation is an array of messages. Each message has a specific shape:
interface Message {
id: string
role: 'system' | 'user' | 'assistant' | 'tool'
content: string
createdAt: Date
// Tool-related fields
toolInvocations?: ToolInvocation[]
// Metadata for your app
metadata?: Record<string, unknown>
}
The role field is the most important piece. It tells the model who said what:
- system — Instructions that shape the model's behavior. "You are a helpful coding tutor." The user never sees this.
- user — What the human typed.
- assistant — What the model responded.
- tool — Results from tool calls (function calling). More on this later.
In practice, you'll often want to track additional metadata per message — token counts, latency, model version, whether the message was edited. The model doesn't see this metadata, but your UI needs it.
interface MessageWithMeta extends Message {
tokenCount?: number
model?: string
latencyMs?: number
isEdited?: boolean
}
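Attaching that metadata is a one-liner at the point where a response finishes. A sketch, where annotate is a hypothetical helper rather than SDK API:

```typescript
// Hypothetical helper: stamp a finished assistant message with UI-only
// metadata. Strip these fields before sending the array back to the
// model's API, since the model never sees them.
function annotate<T extends object>(
  message: T,
  startedAt: number,
  model: string
): T & { latencyMs: number; model: string } {
  return { ...message, latencyMs: Date.now() - startedAt, model }
}
```

The generic keeps whatever message shape you already use; only the two metadata fields are added.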
The Messages Array — Append-Only by Design
The conversation state is an array of messages, and it follows an append-only pattern. You never mutate existing messages in place. You never delete from the middle. You only append.
// Correct: append-only updates
function addMessage(
messages: Message[],
newMessage: Message
): Message[] {
return [...messages, newMessage]
}
// The full cycle for one exchange:
let messages: Message[] = [
{ id: '1', role: 'system', content: 'You are a coding tutor.', createdAt: new Date() }
]
// User sends a message
messages = addMessage(messages, {
id: '2',
role: 'user',
content: 'Explain closures in JavaScript',
createdAt: new Date()
})
// Send entire array to API, get response
const response = await callLLM(messages)
// Append the assistant's response
messages = addMessage(messages, {
id: '3',
role: 'assistant',
content: response.content,
createdAt: new Date()
})
Why append-only? Three reasons:
- Immutability — React needs new array references to trigger re-renders. Mutating in place breaks change detection.
- History integrity — The model's responses depend on the exact sequence of prior messages. Editing earlier messages retroactively would make later responses nonsensical.
- Debuggability — When something goes wrong, you can replay the exact sequence of messages that led to a bad response.
Token Budget Management
Here's the problem nobody thinks about until it bites them: conversations grow, but context windows don't.
Every model has a maximum context window — the total number of tokens it can process in a single request. GPT-4o handles 128K tokens. Claude handles 200K. Sounds like a lot, until your user has a 50-message conversation about a complex codebase, each message including code snippets. Suddenly you're bumping up against the limit, and the API starts rejecting requests or — worse — silently truncating your messages.
You need a token budget strategy. There are three main approaches:
Sliding Window
Drop the oldest messages to stay within budget. Simple, but you lose context:
function slidingWindow(
messages: Message[],
maxTokens: number,
countTokens: (msg: Message) => number
): Message[] {
const system = messages.filter(m => m.role === 'system')
const conversation = messages.filter(m => m.role !== 'system')
let totalTokens = system.reduce((sum, m) => sum + countTokens(m), 0)
const kept: Message[] = []
// Walk backward from newest, keep until budget is full
for (let i = conversation.length - 1; i >= 0; i--) {
const tokens = countTokens(conversation[i])
if (totalTokens + tokens > maxTokens) break
totalTokens += tokens
kept.unshift(conversation[i])
}
return [...system, ...kept]
}
Notice the system message is always preserved — you never drop it. The sliding window walks backward from the newest message, keeping as many recent messages as the budget allows.
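The countTokens parameter is deliberately left abstract above. As a stand-in, a rough characters-per-token heuristic is enough for budgeting; a sketch (this is NOT a real tokenizer, and the window and headroom numbers are illustrative):

```typescript
// Stand-in for countTokens: roughly 4 characters per token is a common
// heuristic for English text. Production code would use the model's
// real tokenizer (e.g. a library such as tiktoken for OpenAI models).
function approxTokenCount(msg: { content: string }): number {
  return Math.ceil(msg.content.length / 4)
}

// Leave headroom for the model's reply when setting the budget:
const CONTEXT_WINDOW = 128_000   // e.g. GPT-4o
const RESPONSE_HEADROOM = 4_000  // reserve space for the answer
const PROMPT_BUDGET = CONTEXT_WINDOW - RESPONSE_HEADROOM
```

Before each request, you would then call slidingWindow(messages, PROMPT_BUDGET, approxTokenCount).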
Summarization
When you hit the budget, summarize older messages into a compressed system message:
async function summarizeOldMessages(
messages: Message[],
maxTokens: number,
countTokens: (msg: Message) => number
): Promise<Message[]> {
const totalTokens = messages.reduce((sum, m) => sum + countTokens(m), 0)
if (totalTokens <= maxTokens) return messages
// Split: keep recent messages, summarize older ones
const splitPoint = Math.floor(messages.length * 0.6)
const oldMessages = messages.slice(1, splitPoint) // skip system
const recentMessages = messages.slice(splitPoint)
const summary = await callLLM([
{
role: 'system',
content: 'Summarize this conversation concisely, preserving key facts and decisions.'
},
...oldMessages
])
return [
messages[0], // original system message
{
id: crypto.randomUUID(),
role: 'system',
content: `Previous conversation summary: ${summary.content}`,
createdAt: new Date()
},
...recentMessages
]
}
This costs an extra API call but preserves important context better than pure truncation.
Hybrid: Sliding Window + Key Message Pinning
The most production-ready approach. Keep recent messages via sliding window, but "pin" important messages (user-marked, or messages containing decisions/requirements):
function hybridTruncation(
messages: Message[],
maxTokens: number,
countTokens: (msg: Message) => number
): Message[] {
const system = messages.filter(m => m.role === 'system')
const pinned = messages.filter(m => m.metadata?.pinned)
const rest = messages.filter(
m => m.role !== 'system' && !m.metadata?.pinned
)
let budget = maxTokens
budget -= system.reduce((s, m) => s + countTokens(m), 0)
budget -= pinned.reduce((s, m) => s + countTokens(m), 0)
// Reuse slidingWindow on the remaining budget. The empty placeholder
// message satisfies slidingWindow's "system message first" contract
// and is stripped back out afterwards.
const windowed = slidingWindow(
[{ id: '_', role: 'system', content: '', createdAt: new Date() }, ...rest],
budget,
countTokens
).filter(m => m.id !== '_')
// Pinned messages are hoisted ahead of the windowed ones, trading
// strict chronological order for guaranteed inclusion.
return [...system, ...pinned, ...windowed]
}
| Strategy | Pros | Cons | Best For |
|---|---|---|---|
| Sliding Window | Simple, predictable, no extra API calls | Loses old context entirely | Casual chat, quick Q&A |
| Summarization | Preserves key context from old messages | Extra API call cost, summary can lose nuance | Long technical discussions |
| Hybrid (Window + Pinning) | Keeps critical context while trimming noise | More complex state management | Production apps with important decisions |
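The pinned flag the hybrid strategy reads is plain message metadata. A hypothetical helper to toggle it, immutable in keeping with the append-only rule:

```typescript
// Toggle metadata.pinned on one message without mutating anything.
// A new array and a new message object are created; everything else
// is reused by reference.
interface Pinnable {
  id: string
  metadata?: Record<string, unknown>
}

function togglePin<T extends Pinnable>(messages: T[], id: string): T[] {
  return messages.map(m =>
    m.id === id
      ? { ...m, metadata: { ...m.metadata, pinned: !m.metadata?.pinned } }
      : m
  )
}
```

Your UI can expose this as a pin icon on each bubble, or you can set the flag automatically when a message matches heuristics like "contains a decision".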
AI State vs UI State
The Vercel AI SDK introduces a powerful pattern for managing conversation state: splitting it into two distinct layers.
AI State is the serializable data you send to the model. It's the message history — roles, content, tool calls, tool results. It's JSON-serializable, can be stored in a database, and can be loaded back to resume a conversation. The model sees this.
UI State is the React elements you render on screen. It's the chat bubbles, the loading spinners, the inline code editors, the image previews. The model never sees this. It can contain React components, event handlers, refs — anything that React can render but JSON can't serialize.
// AI State: what the model sees (serializable)
type AIState = {
messages: {
role: 'user' | 'assistant' | 'system' | 'tool'
content: string
toolInvocations?: ToolInvocation[]
}[]
}
// UI State: what the user sees (React elements)
type UIState = {
id: string
display: React.ReactNode // could be <WeatherCard />, <CodeBlock />, anything
}[]
Why split them? Because what the user sees and what the model needs are fundamentally different things.
When the model calls a tool to get weather data and returns { temperature: 72, condition: "sunny" }, the AI State stores the raw JSON. But the UI State renders a beautiful WeatherCard component with an animated sun icon and a temperature gauge. The model doesn't need to know about the sun icon. The user doesn't need to see raw JSON.
// In a Server Action using Vercel AI SDK:
async function sendMessage(input: string) {
'use server'
const aiState = getMutableAIState()
// Update AI State (serializable)
aiState.update([
...aiState.get(),
{ role: 'user', content: input }
])
const result = await streamUI({
model: openai('gpt-4o'),
messages: aiState.get(),
tools: {
getWeather: {
parameters: z.object({ city: z.string() }),
generate: async function* ({ city }) {
yield <LoadingSpinner />
const weather = await fetchWeather(city)
// AI State gets the raw data
aiState.done([
...aiState.get(),
{ role: 'tool', content: JSON.stringify(weather) }
])
// UI State gets a rich component
return <WeatherCard data={weather} />
}
}
}
})
// streamUI returns an object; the renderable UI stream is on .value
return result.value
}
This split is what makes generative UI possible — the model can trigger rendering of arbitrary React components without those components ever touching the serialization layer.
Never put React elements in AI State. It's tempting to store everything in one place, but React nodes aren't serializable — you can't save them to a database, send them over the network, or restore them on page reload. Keep AI State as plain JSON. Keep UI State as React elements. If you mix them, persistence and hydration will break in subtle, hard-to-debug ways.
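A cheap guard against that failure mode: walk the state during serialization and fail fast on anything JSON can't represent. A sketch (the function and error message are assumptions, not SDK API):

```typescript
// Hypothetical guard: use a JSON.stringify replacer to inspect every
// value before persisting. Functions and symbols (both present inside
// React elements) trigger an explicit error instead of being silently
// dropped; circular references make JSON.stringify throw on its own.
function assertSerializable<T>(state: T): T {
  const json = JSON.stringify(state, (_key, value) => {
    if (typeof value === 'function' || typeof value === 'symbol') {
      throw new Error('AI State contains non-serializable fields')
    }
    return value
  })
  return JSON.parse(json) as T
}
```

Run it once before every database write; a loud error at save time is far easier to debug than a silently corrupted conversation on reload.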
Multi-Turn with Tool Calls
Tool calling (also called function calling) adds complexity to the message array. A single "turn" can now involve multiple back-and-forth messages between the model and your tools, all happening before the user sees a response.
Here's what the message flow looks like:
const messages = [
// System instructions
{ role: 'system', content: 'You can look up weather and book flights.' },
// User asks something
{ role: 'user', content: 'What is the weather in Tokyo and book me a flight there?' },
// Assistant decides to call TWO tools
{
role: 'assistant',
content: null,
tool_calls: [
{ id: 'call_1', type: 'function', function: { name: 'getWeather', arguments: '{"city":"Tokyo"}' } },
{ id: 'call_2', type: 'function', function: { name: 'bookFlight', arguments: '{"destination":"Tokyo"}' } }
]
},
// Tool results come back (one per tool call)
{ role: 'tool', tool_call_id: 'call_1', content: '{"temp": 22, "condition": "cloudy"}' },
{ role: 'tool', tool_call_id: 'call_2', content: '{"confirmation": "FL-8842", "departure": "2026-04-15"}' },
// Assistant synthesizes the tool results into a response
{
role: 'assistant',
content: 'Tokyo is currently 22°C and cloudy. I have booked flight FL-8842 departing April 15th.'
}
]
Notice the pattern: the assistant message with tool_calls and the corresponding tool result messages must always appear together in the array. If you truncate messages and accidentally separate a tool call from its result, the model gets confused — it sees a call was made but no result came back.
function safeTruncate(
messages: Message[],
maxTokens: number,
countTokens: (msg: Message) => number
): Message[] {
// Never split a tool call from its results
const truncated = slidingWindow(messages, maxTokens, countTokens)
// Verify: every tool_call has its matching tool result
const toolCallIds = new Set<string>()
const toolResultIds = new Set<string>()
for (const msg of truncated) {
if (msg.role === 'assistant' && msg.tool_calls) {
for (const call of msg.tool_calls) {
toolCallIds.add(call.id)
}
}
if (msg.role === 'tool' && msg.tool_call_id) {
toolResultIds.add(msg.tool_call_id)
}
}
// If any tool call is missing its result (or vice versa), remove the orphan
return truncated.filter(msg => {
if (msg.role === 'assistant' && msg.tool_calls) {
return msg.tool_calls.every(c => toolResultIds.has(c.id))
}
if (msg.role === 'tool') {
return toolCallIds.has(msg.tool_call_id)
}
return true
})
}
System Prompts — Where They Live
The system prompt is the first message in your array, with role: 'system'. It's the invisible hand that shapes every response.
const systemMessage: Message = {
id: 'system-1',
role: 'system',
content: `You are a senior JavaScript tutor.
Rules:
- Explain concepts using real-world analogies
- Always show code examples
- If the user's code has a bug, don't just fix it — explain why it's wrong
- Never use var, always use const or let
- Keep responses under 500 words unless the user asks for more detail`,
createdAt: new Date()
}
Some production patterns for system prompts:
Dynamic system prompts — update the system message based on context. If the user switches from "beginner mode" to "advanced mode," swap the system prompt:
function buildSystemPrompt(userLevel: 'beginner' | 'advanced'): string {
const base = 'You are a JavaScript tutor.'
const levelInstructions = userLevel === 'beginner'
? 'Explain everything step by step. Avoid jargon. Use simple analogies.'
: 'Be concise. Assume strong JS knowledge. Focus on edge cases and internals.'
return `${base}\n${levelInstructions}`
}
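Swapping the system prompt is the one deliberate exception to append-only: you replace the first message rather than appending a new one. A sketch, assuming (as elsewhere in this chapter) the system message sits at index 0:

```typescript
// Immutably replace the system prompt at index 0; all other messages
// are reused by reference, so React change detection still works.
interface PromptMessage {
  role: 'system' | 'user' | 'assistant'
  content: string
}

function replaceSystemPrompt(
  messages: PromptMessage[],
  newPrompt: string
): PromptMessage[] {
  return messages.map((m, i) =>
    i === 0 && m.role === 'system' ? { ...m, content: newPrompt } : m
  )
}
```

When the user flips modes, you would call this with the output of buildSystemPrompt('advanced').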
Multi-section system prompts — for complex apps, structure the system prompt with clear sections:
const systemPrompt = `
## Role
You are a code review assistant for a React/TypeScript codebase.
## Context
The user is working on: ${projectDescription}
Tech stack: ${techStack.join(', ')}
## Rules
- Flag any use of 'any' type
- Suggest performance improvements for components over 100 lines
- Always check for missing error boundaries
## Output Format
Use markdown. Start with a severity rating: Critical / Warning / Info.
`
The system prompt counts against your token budget. A 2000-token system prompt means 2000 fewer tokens available for conversation history. Keep it focused.
Should you use multiple system messages or one? Provider behavior differs. OpenAI's API accepts messages with role: 'system' at any point in the conversation and treats them all as system instructions. Anthropic's API doesn't accept system-role messages in the messages array at all; the system prompt goes in a separate top-level system parameter, which is why SDK adapters typically hoist your first system message there. For maximum compatibility, stick with a single system message at the start. If you need to inject context mid-conversation (like updated user preferences), append it as a user message with a clear label: "System note: user has switched to advanced mode."
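That labeled-user-message pattern can be wrapped in a tiny helper; a sketch with illustrative names:

```typescript
// Inject updated context mid-conversation as a clearly labeled user
// message rather than a second system message, for maximum
// cross-provider compatibility.
function injectSystemNote(
  messages: { role: string; content: string }[],
  note: string
): { role: string; content: string }[] {
  return [...messages, { role: 'user', content: `System note: ${note}` }]
}
```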
Thread Management
Real chat apps don't have one conversation. They have dozens. Each conversation is a thread — an independent messages array with its own ID, title, and metadata.
interface Thread {
id: string
title: string
createdAt: Date
updatedAt: Date
messages: Message[]
metadata?: {
model?: string
systemPrompt?: string
tokenCount?: number
}
}
interface ThreadStore {
threads: Map<string, Thread>
activeThreadId: string | null
}
The core operations:
function createThread(systemPrompt: string): Thread {
return {
id: crypto.randomUUID(),
title: 'New Chat',
createdAt: new Date(),
updatedAt: new Date(),
messages: [{
id: crypto.randomUUID(),
role: 'system',
content: systemPrompt,
createdAt: new Date()
}]
}
}
function switchThread(
store: ThreadStore,
threadId: string
): ThreadStore {
return {
...store,
activeThreadId: threadId
}
}
function getActiveMessages(store: ThreadStore): Message[] {
if (!store.activeThreadId) return []
return store.threads.get(store.activeThreadId)?.messages ?? []
}
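A delete operation rounds out the core set, following the same immutable style (types are re-declared minimally here so the snippet stands alone):

```typescript
// Minimal shapes mirroring the Thread / ThreadStore interfaces above
interface ThreadLike { id: string }
interface StoreLike {
  threads: Map<string, ThreadLike>
  activeThreadId: string | null
}

// Delete a thread without mutating the store; if the deleted thread
// was active, clear the selection so the UI can fall back gracefully.
function deleteThread(store: StoreLike, threadId: string): StoreLike {
  const threads = new Map(store.threads)  // copy, don't mutate
  threads.delete(threadId)
  return {
    threads,
    activeThreadId:
      store.activeThreadId === threadId ? null : store.activeThreadId
  }
}
```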
Auto-titling — generate a thread title from the first exchange:
async function generateTitle(messages: Message[]): Promise<string> {
const firstExchange = messages
.filter(m => m.role === 'user' || m.role === 'assistant')
.slice(0, 2)
if (firstExchange.length === 0) return 'New Chat'
const response = await callLLM([
{
role: 'system',
content: 'Generate a short title (max 6 words) for this conversation. Return only the title, nothing else.'
},
...firstExchange
])
return response.content
}
Persistence — threads need to survive page reloads. Store them in localStorage for client-only apps, or in a database for production:
function persistThreads(store: ThreadStore): void {
const serializable = {
activeThreadId: store.activeThreadId,
// Map isn't JSON-serializable; convert it to a plain object first
threads: Object.fromEntries(store.threads)
}
localStorage.setItem('chat-threads', JSON.stringify(serializable))
}
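Loading is the trickier half: JSON.stringify turns Dates into ISO strings and the Map into a plain object, so restoring the store means reviving both. A sketch with illustrative names:

```typescript
// Stored (post-JSON) shapes: dates are ISO strings here
interface StoredMessage {
  id: string
  role: string
  content: string
  createdAt: string
}
interface StoredThread {
  id: string
  title: string
  createdAt: string
  updatedAt: string
  messages: StoredMessage[]
}

// Parse the persisted store, rebuilding the Map and converting every
// ISO date string back into a Date object.
function reviveThreads(raw: string | null) {
  const threads = new Map<string, {
    id: string
    title: string
    createdAt: Date
    updatedAt: Date
    messages: { id: string; role: string; content: string; createdAt: Date }[]
  }>()
  if (!raw) return { activeThreadId: null as string | null, threads }
  const parsed = JSON.parse(raw) as {
    activeThreadId: string | null
    threads: Record<string, StoredThread>
  }
  for (const [id, t] of Object.entries(parsed.threads)) {
    threads.set(id, {
      ...t,
      createdAt: new Date(t.createdAt),
      updatedAt: new Date(t.updatedAt),
      messages: t.messages.map(m => ({ ...m, createdAt: new Date(m.createdAt) }))
    })
  }
  return { activeThreadId: parsed.activeThreadId, threads }
}
```

In the browser you'd call reviveThreads(localStorage.getItem('chat-threads')) on startup.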
The useChat Hook — Anatomy
The Vercel AI SDK's useChat hook is the most widely used abstraction for managing conversation state in React. It handles the entire message lifecycle — the messages array, input state, streaming, error handling, and API communication — in a single hook.
'use client'
import { useChat } from 'ai/react'
function ChatInterface() {
const {
messages, // Message[] — the conversation history
input, // string — current input field value
handleInputChange, // (e: ChangeEvent) => void
handleSubmit, // (e: FormEvent) => void — sends the message
isLoading, // boolean — true while streaming
error, // Error | undefined
stop, // () => void — abort the current stream
reload, // () => void — regenerate the last response
append, // (message: Message) => void — programmatically add a message
setMessages, // (messages: Message[]) => void — replace entire array
} = useChat({
api: '/api/chat',
initialMessages: [],
onFinish: (message) => {
// Called when assistant response is complete
console.log('Finished:', message.content)
},
onError: (error) => {
console.error('Chat error:', error)
}
})
return (
<form onSubmit={handleSubmit}>
<div>
{messages.map(m => (
<div key={m.id}>
<strong>{m.role}:</strong> {m.content}
</div>
))}
</div>
{error && <p>Something went wrong. Try again.</p>}
<input
value={input}
onChange={handleInputChange}
placeholder="Type a message..."
disabled={isLoading}
/>
<button type="submit" disabled={isLoading}>
{isLoading ? 'Thinking...' : 'Send'}
</button>
{isLoading && (
<button type="button" onClick={stop}>
Stop
</button>
)}
</form>
)
}
What useChat does under the hood:
- Manages the messages array — appends user messages when you submit, appends assistant messages as they stream in
- Handles streaming — connects to your API endpoint, reads the streaming response, and updates messages progressively as tokens arrive
- Tracks loading state — isLoading is true from the moment you submit until the stream ends
- Exposes escape hatches — setMessages lets you replace the entire array (for loading a thread from a database), append lets you add messages programmatically (for suggested prompts), stop lets you abort mid-stream
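For instance, resuming a persisted thread means converting stored messages back into the hook's expected shape and handing them to setMessages. A sketch with illustrative names:

```typescript
// Persisted shape: dates are ISO strings after a JSON round-trip
interface StoredChatMessage {
  id: string
  role: 'system' | 'user' | 'assistant'
  content: string
  createdAt: string
}

// Convert stored messages back into the shape useChat expects
function toChatMessages(stored: StoredChatMessage[]) {
  return stored.map(m => ({ ...m, createdAt: new Date(m.createdAt) }))
}

// Inside a component (illustrative):
//   const { setMessages } = useChat({ api: '/api/chat' })
//   setMessages(toChatMessages(thread.messages))
```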
The API route it connects to is a standard Next.js route handler:
// app/api/chat/route.ts
import { openai } from '@ai-sdk/openai'
import { streamText } from 'ai'
export async function POST(request: Request) {
const { messages } = await request.json()
const result = streamText({
model: openai('gpt-4o'),
messages
})
return result.toDataStreamResponse()
}
The entire messages array gets posted to this endpoint on every submit. The route streams back the response. The hook appends it. That's the full loop.
Putting It All Together
In a production AI chat app, these concepts combine into a layered architecture:
Thread Management (create, switch, delete, list)
|
v
Message Array (append-only, immutable updates)
|
v
Token Budget (sliding window / summarization / pinning)
|
v
AI State / UI State Split (serializable vs renderable)
|
v
API Layer (useChat, streaming, tool execution)
The thread manager holds multiple conversations. Each conversation has a messages array. Before sending to the API, the token budget layer trims the array to fit the context window. The AI State / UI State split ensures the model gets clean JSON while users see rich React components. And the API layer handles the actual network call and streaming.
| Mistake | Why it breaks | What to do instead |
|---|---|---|
| Sending only the latest user message to the API | LLMs are stateless. Without the full history, the model has zero context about what was discussed before and responds as if this were the first message. | Always send the full messages array (or a properly truncated version) |
| Mutating messages in place with array.push() | React compares references to detect changes. Mutating in place keeps the same reference, so React won't re-render and the UI goes stale. | Create new arrays with the spread operator: [...messages, newMessage] |
| Storing React components in AI State | AI State must be serializable for persistence and transmission. React nodes contain functions, refs, and circular references that break JSON.stringify. | Keep AI State as plain JSON; put React elements in UI State only |
| Truncating messages without checking for orphaned tool calls | A tool call without its result (or a result without its call) creates an inconsistent history. The model may hallucinate results, retry calls, or produce incoherent responses. | Keep tool call messages and their result messages together as atomic pairs |
1. LLMs are stateless — you must send the full conversation history with every API request
2. Messages arrays are append-only with immutable updates for React compatibility
3. System messages are always preserved during truncation — never drop them
4. Token budget strategies: sliding window (simple), summarization (preserves context), or hybrid with pinning (production-grade)
5. AI State is serializable JSON for the model. UI State is React elements for the user. Never mix them.
6. Tool calls and their results are atomic pairs — never truncate one without the other
7. useChat manages messages, streaming, input state, loading, and errors in a single hook