
Error Codes Reference

This reference covers all error codes that can be returned by Relay, when they occur, and how to handle them.

Error Code Table

| Code | HTTP Status | Retryable | Meaning | Recommended Action |
|------|-------------|-----------|---------|--------------------|
| AGENT_OFFLINE | 503 | Yes | Agent not currently connected | Retry with exponential backoff |
| NOT_ALLOWED | 403 | No | App not allowlisted for agent | Check allowlist, contact admin |
| AGENT_NOT_FOUND | 404 | No | Agent does not exist in org | Verify agent_id, check registration |
| AGENT_TIMEOUT | 504 | Maybe | Agent took too long to respond | Retry once, then escalate |
| AGENT_ERROR | 500 | Maybe | Agent returned an error | Check agent logs, retry |
| PAYLOAD_TOO_LARGE | 413 | No | Event payload exceeds 64KB | Reduce payload size, retry |
| RATE_LIMITED | 429 | Yes | Event rate limit exceeded | Respect rate limit window, retry |
| INVALID_MESSAGE | 400 | No | Message malformed or missing fields | Fix message structure, retry |
| RELAY_INTERNAL_ERROR | 500 | Yes | Unexpected error in Relay | Retry with exponential backoff, then contact support |
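If you handle these codes in Python, you might keep the table as data so retry decisions live in one place. The sketch below is illustrative only: `ErrorInfo`, `ERROR_CATALOG`, and `max_retries_for` are hypothetical names, not part of a Relay SDK, and the retry budgets (1 for "Maybe", 3 for "Yes") are assumptions drawn from the AGENT_TIMEOUT guidance and the decision tree later on this page. RELAY_INTERNAL_ERROR is included from its detailed section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorInfo:
    http_status: int
    retryable: str  # "yes", "no", or "maybe", as in the table


# Hypothetical lookup covering every code on this page (not a Relay API)
ERROR_CATALOG = {
    "AGENT_OFFLINE": ErrorInfo(503, "yes"),
    "NOT_ALLOWED": ErrorInfo(403, "no"),
    "AGENT_NOT_FOUND": ErrorInfo(404, "no"),
    "AGENT_TIMEOUT": ErrorInfo(504, "maybe"),
    "AGENT_ERROR": ErrorInfo(500, "maybe"),
    "PAYLOAD_TOO_LARGE": ErrorInfo(413, "no"),
    "RATE_LIMITED": ErrorInfo(429, "yes"),
    "INVALID_MESSAGE": ErrorInfo(400, "no"),
    "RELAY_INTERNAL_ERROR": ErrorInfo(500, "yes"),
}


def max_retries_for(code: str) -> int:
    """Assumed retry budget: 0 for non-retryable, 1 for "maybe", 3 for "yes"."""
    info = ERROR_CATALOG.get(code)
    if info is None or info.retryable == "no":
        return 0  # Unknown or non-retryable codes are never retried
    return 1 if info.retryable == "maybe" else 3
```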

Detailed Error Explanations

AGENT_OFFLINE

When it occurs: The target agent is not currently connected to Relay.

HTTP Status: 503 (Service Unavailable)

Is it retryable? Yes

Recommended action:

  1. Retry after 1-2 seconds
  2. Show user: "Athena is offline. Retrying..."
  3. After 2-3 retries, inform user to try again later
  4. Do not retry more than 5 times
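```python
if error_code == "AGENT_OFFLINE":
    # Retry with exponential backoff: 2s, 4s, 8s
    for attempt in range(3):
        await asyncio.sleep(2 ** (attempt + 1))
        result = await send_event(ws, agent_id, thread_id, payload)
        if result.get("type") != "error":
            break  # Agent is reachable again
    else:
        # All retries failed; show_user() is a placeholder for your UI layer
        show_user("Athena is currently offline. Please try again in a moment.")
```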
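The loop above waits 2s, 4s, then 8s. If you want to enforce the five-retry ceiling from step 4 in one place, a delay generator is one option. This is a sketch under assumptions: `backoff_delays` is a hypothetical helper, the 30-second cap is an arbitrary choice, and the optional "full jitter" mode (a random delay between zero and the computed value) is a common technique for spreading out reconnecting clients, not something Relay documents.

```python
import random


def backoff_delays(max_attempts=5, base=2.0, cap=30.0, jitter=False, rng=random.random):
    """Hypothetical helper: yield one delay (seconds) per retry attempt.

    Delays follow base * 2**attempt, capped at `cap`. With jitter=True each
    delay is scaled by rng() (uniform in [0, 1)), i.e. "full jitter".
    """
    for attempt in range(max_attempts):
        delay = min(cap, base * (2 ** attempt))
        yield delay * rng() if jitter else delay
```

Inside a retry loop, `for delay in backoff_delays(max_attempts=3): await asyncio.sleep(delay)` reproduces the 2s, 4s, 8s schedule above.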

Possible causes:

  • Agent server crashed or went offline
  • Network connectivity issue between agent and Relay
  • Agent is being redeployed
  • Agent reached max concurrent connections

What the user sees: "Athena is currently offline. Please try again in a moment."


NOT_ALLOWED

When it occurs: Your app is not allowlisted for this agent.

HTTP Status: 403 (Forbidden)

Is it retryable? No

Recommended action:

  1. Do NOT retry
  2. Log as a permission error with app_id and agent_id
  3. Disable the agent in your UI
  4. Show user: "This agent is not available for your organization"
  5. Contact your Relay admin to request allowlist access
```python
if error_code == "NOT_ALLOWED":
    # Permission error: log it and do not retry
    logger.error(f"Permission denied: {app_id} not allowed for {agent_id}")
    print("Contact your admin to enable access to this agent")
    # Disable the agent in the UI (disable_agent_button() is app-specific)
    disable_agent_button()
```

Possible causes:

  • Agent operator has allowlist enabled but didn't add your app
  • Your app was removed from allowlist
  • Wrong agent_id specified

What the user sees: "You don't have access to this agent. Contact your administrator."
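Because NOT_ALLOWED will not resolve until an admin changes the allowlist, your app can remember the rejection and skip further sends instead of contacting Relay again. A minimal in-memory sketch, assuming a single process; `AgentAccessRegistry` is a hypothetical name, and a real app might persist this state or refresh it when the allowlist changes.

```python
class AgentAccessRegistry:
    """Hypothetical helper: remembers (app_id, agent_id) pairs that returned NOT_ALLOWED."""

    def __init__(self):
        self._blocked = set()

    def record_not_allowed(self, app_id, agent_id):
        # Call this in the NOT_ALLOWED branch instead of retrying
        self._blocked.add((app_id, agent_id))

    def can_send(self, app_id, agent_id):
        return (app_id, agent_id) not in self._blocked

    def clear(self, app_id, agent_id):
        # Call after an admin confirms the app was added to the allowlist
        self._blocked.discard((app_id, agent_id))
```

Check `can_send(app_id, agent_id)` before sending, and call `record_not_allowed` alongside `disable_agent_button()`.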


AGENT_TIMEOUT

When it occurs: The agent took too long to respond (exceeded timeout window).

HTTP Status: 504 (Gateway Timeout)

Is it retryable? Maybe

Recommended action:

  1. Retry once after 2-5 seconds
  2. If it times out again, escalate to infrastructure team
  3. Limit total retry attempts to 1-2
  4. Show user: "Agent is slow. Please try again."
```python
if error_code == "AGENT_TIMEOUT":
    # Retry once after a short pause
    await asyncio.sleep(3)
    result = await send_event(ws, agent_id, thread_id, payload)

    if result.get("code") == "AGENT_TIMEOUT":
        # Timed out twice: give up and return a user-facing message
        return {"error": "Agent is not responding", "user_message": "Please try again in a moment"}
```

Possible causes:

  • Agent is overloaded with other requests
  • Agent's AI model is slow (large response)
  • Agent crashed or hung mid-processing
  • Network latency

What the user sees: "Athena is taking longer than expected. Please try again."
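Relay reports AGENT_TIMEOUT when its own timeout window passes, but if a response never arrives at all (for example, because the agent hung mid-processing), your client can also enforce a local deadline. The sketch below wraps any coroutine-returning send function in `asyncio.wait_for`. `send_with_deadline`, the 30-second default, and the error-shaped return value are assumptions, chosen so the result can flow into the same `result.get("code") == "AGENT_TIMEOUT"` check used above.

```python
import asyncio


async def send_with_deadline(send, *args, timeout=30.0):
    """Hypothetical helper: await send(*args), giving up after `timeout` seconds.

    On timeout, returns a dict shaped like a Relay error (an assumed shape)
    so callers can handle local and Relay-side timeouts the same way.
    """
    try:
        return await asyncio.wait_for(send(*args), timeout=timeout)
    except asyncio.TimeoutError:
        return {"type": "error", "code": "AGENT_TIMEOUT"}
```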


PAYLOAD_TOO_LARGE

When it occurs: Your event payload exceeds the 64KB limit.

HTTP Status: 413 (Payload Too Large)

Is it retryable? No

Recommended action:

  1. Do NOT retry with same payload
  2. Reduce payload size
  3. Remove non-essential fields (e.g., full conversation history)
  4. Truncate large text fields
  5. Retry with trimmed payload
```python
import json


class PayloadTooLargeError(Exception):
    """Placeholder: raise from your send_event() wrapper when Relay returns PAYLOAD_TOO_LARGE."""


def json_size(obj):
    """Size of obj serialized as JSON, in UTF-8 bytes."""
    return len(json.dumps(obj).encode("utf-8"))


def trim_payload(payload, max_size_bytes=64 * 1024):
    """Reduce payload to fit within the size limit."""
    if json_size(payload) <= max_size_bytes:
        return payload

    # Shallow copy so the caller's payload is not mutated
    trimmed = payload.copy()

    # Drop older comments, keep only the 5 most recent
    if "comments" in trimmed and len(trimmed["comments"]) > 5:
        trimmed["comments"] = trimmed["comments"][-5:]

    # Truncate long text fields to 500 characters
    if "description" in trimmed:
        trimmed["description"] = trimmed["description"][:500]

    # Drop a nested "payload" field, if present
    trimmed.pop("payload", None)

    new_size = json_size(trimmed)
    if new_size <= max_size_bytes:
        return trimmed

    raise ValueError(f"Payload too large: {new_size} bytes")


# Usage (inside an async function)
try:
    await send_event(ws, agent_id, thread_id, payload)
except PayloadTooLargeError:
    trimmed = trim_payload(payload)
    await send_event(ws, agent_id, thread_id, trimmed)
```
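`trim_payload` truncates `description` to 500 characters, but its size check counts UTF-8 bytes, and non-ASCII characters take 2 to 4 bytes each. If you prefer to budget text fields in bytes, a helper like the hypothetical `truncate_utf8` below cuts at a byte limit without leaving half of a multi-byte character at the end.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Hypothetical helper: shorten text so its UTF-8 encoding fits in max_bytes."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # Cut at the byte limit; errors="ignore" drops a partial trailing character
    return encoded[:max_bytes].decode("utf-8", errors="ignore")
```

For example, `trimmed["description"] = truncate_utf8(trimmed["description"], 500)`.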

Possible causes:

  • Payload includes entire conversation history
  • Payload includes large attachments/images as base64
  • Task has many comments or large descriptions
  • Debug/metadata fields included unnecessarily
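Base64 encodes every 3 bytes as 4 characters, so attachments grow by about a third. A 48 KB (49,152-byte) image encodes to exactly 65,536 bytes, the entire 64KB budget, before any other field is counted. The hypothetical helpers below compute this overhead so you can reject or resize attachments before building the payload; they assume standard padded base64.

```python
def base64_size(raw_bytes: int) -> int:
    """Hypothetical helper: length of the padded base64 encoding of raw_bytes bytes."""
    # Every started group of 3 input bytes becomes 4 output characters
    return 4 * ((raw_bytes + 2) // 3)


def max_raw_size(limit_bytes: int) -> int:
    """Hypothetical helper: largest input whose padded base64 encoding fits in limit_bytes."""
    # Each whole 4-character block carries 3 bytes of input
    return (limit_bytes // 4) * 3
```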

What the user sees: (This is usually a backend error and is not shown to the user)


RATE_LIMITED

When it occurs: Your app has exceeded its per-minute event limit.

HTTP Status: 429 (Too Many Requests)

Is it retryable? Yes, but respect the limit

Recommended action:

  1. Immediately stop sending new events
  2. Queue events internally
  3. Resume sending at a lower rate (respecting the limit)
  4. Wait at least 10 seconds before retrying
  5. Use exponential backoff starting at 10 seconds
```python
import asyncio
import json
import time


class RateLimitError(Exception):
    """Placeholder: raise from your transport layer when Relay returns RATE_LIMITED."""


class RateLimitedClient:
    def __init__(self, events_per_minute=60):
        self.events_per_minute = events_per_minute
        self.min_interval = 60 / events_per_minute  # Seconds between sends
        self.event_queue = asyncio.Queue()
        self.last_sent_time = float("-inf")  # No event sent yet
        # Ensures only one coroutine drains the queue at a time
        self._drain_lock = asyncio.Lock()

    async def send_event(self, ws, agent_id, thread_id, payload):
        """Queue the event, then send queued events while respecting the rate limit."""
        await self.event_queue.put((agent_id, thread_id, payload))
        await self.process_queue(ws)

    async def process_queue(self, ws):
        """Drain the queue, spacing sends at least min_interval apart."""
        async with self._drain_lock:
            while not self.event_queue.empty():
                agent_id, thread_id, payload = await self.event_queue.get()

                # Wait if needed to respect the rate limit
                time_since_last = time.monotonic() - self.last_sent_time
                if time_since_last < self.min_interval:
                    await asyncio.sleep(self.min_interval - time_since_last)

                event = {
                    "type": "event",
                    "agent_id": agent_id,
                    "thread_id": thread_id,
                    "payload": payload,
                }

                try:
                    await ws.send(json.dumps(event))
                    self.last_sent_time = time.monotonic()
                except RateLimitError:
                    # Re-queue the event and back off for 10 seconds
                    await self.event_queue.put((agent_id, thread_id, payload))
                    await asyncio.sleep(10)
```

Possible causes:

  • Your app is sending events too fast
  • Testing/load testing without rate limit awareness
  • Buggy loop sending duplicate events
  • Sudden traffic spike

What the user sees: (Usually handled transparently with queueing)
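`RateLimitedClient` spaces sends evenly (`min_interval = 60 / events_per_minute`), so it never bursts. If you would rather allow short bursts while still staying under N events per minute, a sliding-window log is a common alternative. This sketch assumes Relay counts events over a rolling 60-second window; this page does not say whether the window is rolling or fixed, so confirm against Rate Limits. `SlidingWindowLimiter` is a hypothetical name, and it is meant to be used from a single sending coroutine.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Hypothetical helper: at most `limit` sends in any rolling `window`-second span.

    The clock is injectable so the limiter can be tested without sleeping.
    """

    def __init__(self, limit=60, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self._sent = deque()  # Timestamps of recent sends, oldest first

    def wait_time(self):
        """Seconds to wait before the next send is allowed (0.0 if allowed now)."""
        now = self.clock()
        # Forget sends that have left the window
        while self._sent and now - self._sent[0] >= self.window:
            self._sent.popleft()
        if len(self._sent) < self.limit:
            return 0.0
        # Wait until the oldest send in the window expires
        return self.window - (now - self._sent[0])

    def record_send(self):
        self._sent.append(self.clock())
```

Before each send: compute `delay = limiter.wait_time()`, `await asyncio.sleep(delay)` if it is non-zero, then `await ws.send(...)` and call `limiter.record_send()`.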

Check your rate limit: See Rate Limits for your organization's limits.


INVALID_MESSAGE

When it occurs: Event is malformed or missing required fields.

HTTP Status: 400 (Bad Request)

Is it retryable? No

Recommended action:

  1. Do NOT retry
  2. Log the malformed event
  3. Fix the event structure
  4. Validate before sending
```python
import json


def validate_event(agent_id, thread_id, payload):
    """Validate an event before sending it to Relay."""
    if not isinstance(agent_id, str) or not agent_id.strip():
        raise ValueError("agent_id must be a non-empty string")

    if not isinstance(thread_id, str) or not thread_id.strip():
        raise ValueError("thread_id must be a non-empty string")

    if not isinstance(payload, dict):
        raise ValueError("payload must be a dict")

    # Measure UTF-8 bytes, not characters, to match the 64KB limit
    if len(json.dumps(payload).encode("utf-8")) > 64 * 1024:
        raise ValueError("payload exceeds 64KB")

    return True


# Usage (inside an async function)
try:
    validate_event(agent_id, thread_id, payload)
    await send_event(ws, agent_id, thread_id, payload)
except ValueError as e:
    logger.error(f"Invalid event: {e}")
```

Common validation errors:

  • agent_id is empty or null
  • thread_id is empty or null
  • payload is not a dict
  • Event JSON is malformed
  • Required field is missing
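`validate_event` checks field types and size but not whether the payload can be serialized at all, which is one way to end up with malformed event JSON. The hypothetical `build_event` below builds the same envelope that `RateLimitedClient` sends and turns serialization failures into `ValueError` before anything reaches Relay. Rejecting NaN and Infinity is a precaution: Python's `json` module emits them by default, but they are not valid JSON, so a strict parser may refuse them.

```python
import json


def build_event(agent_id, thread_id, payload):
    """Hypothetical helper: validate inputs and return the serialized event envelope."""
    for name, value in (("agent_id", agent_id), ("thread_id", thread_id)):
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"{name} must be a non-empty string")
    if not isinstance(payload, dict):
        raise ValueError("payload must be a dict")

    event = {"type": "event", "agent_id": agent_id, "thread_id": thread_id, "payload": payload}
    try:
        # TypeError: unserializable values (sets, bytes, datetimes, ...)
        # ValueError: circular references, or NaN/Infinity with allow_nan=False
        return json.dumps(event, allow_nan=False)
    except (TypeError, ValueError) as exc:
        raise ValueError(f"event is not valid JSON: {exc}") from exc
```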

What the user sees: (Usually a backend error)


RELAY_INTERNAL_ERROR

When it occurs: An unexpected error occurred in Relay (bug or infrastructure issue).

HTTP Status: 500 (Internal Server Error)

Is it retryable? Yes

Recommended action:

  1. Retry with exponential backoff
  2. After 3-5 failed attempts, escalate to support
  3. Include event_id in bug report
```python
if error_code == "RELAY_INTERNAL_ERROR":
    for attempt in range(3):
        delay = 2 ** attempt  # 1s, 2s, 4s
        await asyncio.sleep(delay)
        result = await send_event(ws, agent_id, thread_id, payload)

        if result.get("type") != "error":
            return result

    # Still failing: escalate to support with the event_id
    logger.critical(f"Relay error persists for event {event_id}")
    notify_support(f"event_id={event_id}, app_id={app_id}")
```

Possible causes:

  • Bug in Relay code
  • Database connection issue
  • Infrastructure outage
  • Temporary service degradation

What the user sees: "Service temporarily unavailable. Please try again shortly."


Error Handling Strategy

Decision Tree

```text
Is it retryable?
├─ No: Log error, inform user, don't retry
└─ Yes: How many retries already?
   ├─ 0-2: Retry with exponential backoff
   └─ 3+: Give up, inform user
```
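The decision tree reduces to a small pure function, which makes the retry policy easy to unit test. A sketch only: `next_step` and its return values are illustrative names, and the default `max_retries=3` matches the tree's 3+ cutoff.

```python
def next_step(retryable: bool, retries_so_far: int, max_retries: int = 3) -> str:
    """Hypothetical helper: return "fail", "retry", or "give_up" per the decision tree."""
    if not retryable:
        return "fail"  # Log error, inform user, don't retry
    if retries_so_far < max_retries:
        return "retry"  # With the default, 0-2 retries so far: back off and retry
    return "give_up"  # Retry budget spent: inform user
```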

Sample Error Handler

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class RelayErrorHandler:
    RETRYABLE_ERRORS = {
        "AGENT_OFFLINE",
        "AGENT_TIMEOUT",
        "RATE_LIMITED",
        "RELAY_INTERNAL_ERROR",
    }
    # AGENT_TIMEOUT should be retried only once (see AGENT_TIMEOUT above)
    MAX_RETRIES_BY_CODE = {"AGENT_TIMEOUT": 1}

    async def handle_error(self, error_code, event_id, max_retries=3):
        if error_code not in self.RETRYABLE_ERRORS:
            # Non-retryable: log and report without retrying
            logger.error(f"Non-retryable error {error_code} for {event_id}")
            return {"retryable": False, "code": error_code}

        max_retries = min(max_retries, self.MAX_RETRIES_BY_CODE.get(error_code, max_retries))

        # Retryable: try again with backoff
        for attempt in range(max_retries):
            delay = self.calculate_backoff(attempt, error_code)
            logger.info(f"Retrying after {delay}s (attempt {attempt + 1}/{max_retries})")

            await asyncio.sleep(delay)
            result = await self.retry_event(event_id)

            if result.get("type") != "error":
                return {"retryable": True, "retried": True, "success": True}

        # Max retries exceeded
        logger.error(f"Max retries exceeded for {event_id}")
        return {"retryable": True, "retried": True, "success": False}

    def calculate_backoff(self, attempt, error_code):
        if error_code == "RATE_LIMITED":
            return 10 * 2 ** attempt  # 10s, 20s, 40s, ... for rate limits
        return 2 ** attempt  # 1s, 2s, 4s, 8s, ...

    async def retry_event(self, event_id):
        """Re-send the event identified by event_id; implement for your app."""
        raise NotImplementedError
```

Best Practices

Do

  • Check error code in all responses
  • Retry only retryable errors
  • Use exponential backoff
  • Log all errors with context
  • Validate payloads before sending

Don't

  • Retry permission errors
  • Retry invalid events
  • Use fixed delays
  • Ignore rate limits
  • Retry indefinitely

Support

If you encounter persistent errors not covered here, contact support@relay.ckgworks.com with:

  • Error code
  • Event ID (if available)
  • Timestamp
  • Steps to reproduce
  • Logs