Error Codes Reference
This reference covers all error codes that can be returned by Relay, when they occur, and how to handle them.
Error Code Table
| Code | HTTP Status | Retryable | Meaning | Recommended Action |
|---|---|---|---|---|
| AGENT_OFFLINE | 503 | Yes | Agent not currently connected | Retry with exponential backoff |
| NOT_ALLOWED | 403 | No | App not allowlisted for agent | Check allowlist, contact admin |
| AGENT_NOT_FOUND | 404 | No | Agent does not exist in org | Verify agent_id, check registration |
| AGENT_TIMEOUT | 504 | Maybe | Agent took too long to respond | Retry once, then escalate |
| AGENT_ERROR | 500 | Maybe | Agent returned an error | Check agent logs, retry |
| PAYLOAD_TOO_LARGE | 413 | No | Event payload exceeds 64KB | Reduce payload size, then resend |
| RATE_LIMITED | 429 | Yes | Event rate limit exceeded | Respect rate limit window, retry |
| INVALID_MESSAGE | 400 | No | Message malformed or missing fields | Fix message structure, then resend |
| RELAY_INTERNAL_ERROR | 500 | Yes | Unexpected error inside Relay | Retry with exponential backoff, then escalate to support |
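The table can also be kept as a lookup in client code so that retry decisions are made in one place. The following is a minimal sketch, not part of Relay's API: `ErrorPolicy`, `ERROR_POLICIES`, and `is_retryable` are hypothetical names, and treating "Maybe" codes as retryable by default is an assumption you may want to change.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorPolicy:
    http_status: int
    retryable: str  # "yes", "no", or "maybe", mirroring the Retryable column


# Hypothetical lookup built from this reference; not provided by Relay.
ERROR_POLICIES = {
    "AGENT_OFFLINE": ErrorPolicy(503, "yes"),
    "NOT_ALLOWED": ErrorPolicy(403, "no"),
    "AGENT_NOT_FOUND": ErrorPolicy(404, "no"),
    "AGENT_TIMEOUT": ErrorPolicy(504, "maybe"),
    "AGENT_ERROR": ErrorPolicy(500, "maybe"),
    "PAYLOAD_TOO_LARGE": ErrorPolicy(413, "no"),
    "RATE_LIMITED": ErrorPolicy(429, "yes"),
    "INVALID_MESSAGE": ErrorPolicy(400, "no"),
    # Documented under Detailed Error Explanations.
    "RELAY_INTERNAL_ERROR": ErrorPolicy(500, "yes"),
}


def is_retryable(code, allow_maybe=True):
    """Return True if the code may be retried; unknown codes are never retried."""
    policy = ERROR_POLICIES.get(code)
    if policy is None:
        return False
    if policy.retryable == "maybe":
        return allow_maybe
    return policy.retryable == "yes"
```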
Detailed Error Explanations
AGENT_OFFLINE
When it occurs: The target agent is not currently connected to Relay.
HTTP Status: 503 (Service Unavailable)
Is it retryable? Yes
Recommended action:
- Retry after 1-2 seconds
- Show user: "Athena is offline. Retrying..."
- After 2-3 retries, inform user to try again later
- Do not retry more than 5 times
if error_code == "AGENT_OFFLINE":
    # Retry with exponential backoff
    for attempt in range(3):
        await asyncio.sleep(2 ** (attempt + 1))  # 2s, 4s, 8s
        result = await send_event(ws, agent_id, thread_id, payload)
        if result.get("type") != "error":
            break
Possible causes:
- Agent server crashed or went offline
- Network connectivity issue between agent and Relay
- Agent is being redeployed
- Agent reached max concurrent connections
What the user sees: "Athena is currently offline. Please try again in a moment."
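The recommended actions for this error combine a hard retry cap with a change in the user-facing message after the first few attempts. One way to express that is a small generator that yields the delay and message for each permitted retry. This is only a sketch: `offline_retry_plan`, the 2-second base delay, and the choice to switch messages after three retries are assumptions, not Relay behavior.

```python
MAX_OFFLINE_RETRIES = 5   # hard cap: do not retry more than 5 times
NOTIFY_AFTER_RETRIES = 3  # upper end of the "2-3 retries" guidance


def offline_retry_plan(agent_name="Athena", base_delay=2.0):
    """Yield (delay_seconds, user_message) for each permitted retry."""
    for attempt in range(MAX_OFFLINE_RETRIES):
        delay = base_delay * 2 ** attempt  # doubles on every attempt
        if attempt < NOTIFY_AFTER_RETRIES:
            message = f"{agent_name} is offline. Retrying..."
        else:
            message = f"{agent_name} is currently offline. Please try again in a moment."
        yield delay, message
```

A caller would sleep for each delay, show the message, and stop iterating as soon as a send succeeds.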
NOT_ALLOWED
When it occurs: Your app is not allowlisted for this agent.
HTTP Status: 403 (Forbidden)
Is it retryable? No
Recommended action:
- Do NOT retry
- Log as a permission error with app_id and agent_id
- Disable the agent in your UI
- Show user: "This agent is not available for your organization"
- Contact your Relay admin to request allowlist access
if error_code == "NOT_ALLOWED":
    logger.error(f"Permission denied: {app_id} not allowed for {agent_id}")
    print("Contact your admin to enable access to this agent")
    # Disable agent in UI
    disable_agent_button()
Possible causes:
- Agent operator has allowlist enabled but didn't add your app
- Your app was removed from allowlist
- Wrong agent_id specified
What the user sees: "You don't have access to this agent. Contact your administrator."
AGENT_TIMEOUT
When it occurs: The agent took too long to respond (exceeded timeout window).
HTTP Status: 504 (Gateway Timeout)
Is it retryable? Maybe
Recommended action:
- Retry once after 2-5 seconds
- If it times out again, escalate to infrastructure team
- Limit total retry attempts to 1-2
- Show user: "Agent is slow. Please try again."
if error_code == "AGENT_TIMEOUT":
    # Retry once
    await asyncio.sleep(3)
    result = await send_event(ws, agent_id, thread_id, payload)
    if result.get("code") == "AGENT_TIMEOUT":
        # Failed twice, give up
        return {"error": "Agent is not responding", "user_message": "Please try again in a moment"}
Possible causes:
- Agent is overloaded with other requests
- Agent's AI model is slow (large response)
- Agent crashed or hung mid-processing
- Network latency
What the user sees: "Athena is taking longer than expected. Please try again."
PAYLOAD_TOO_LARGE
When it occurs: Your event payload exceeds the 64KB limit.
HTTP Status: 413 (Payload Too Large)
Is it retryable? No
Recommended action:
- Do NOT retry with same payload
- Reduce payload size
- Remove non-essential fields (e.g., full conversation history)
- Truncate large text fields
- Retry with trimmed payload
import json

def trim_payload(payload, max_size_bytes=64 * 1024):
    """Reduce payload to fit within size limit"""
    current_size = len(json.dumps(payload).encode('utf-8'))
    if current_size <= max_size_bytes:
        return payload
    # Try removing largest fields (shallow copy, so the caller's dict is untouched)
    trimmed = payload.copy()
    # Remove full comment history, keep only recent
    if "comments" in trimmed and len(trimmed["comments"]) > 5:
        trimmed["comments"] = trimmed["comments"][-5:]
    # Truncate long text fields
    if "description" in trimmed:
        trimmed["description"] = trimmed["description"][:500]
    # Drop a nested "payload" field entirely, if present
    if "payload" in trimmed:
        del trimmed["payload"]
    new_size = len(json.dumps(trimmed).encode('utf-8'))
    if new_size <= max_size_bytes:
        return trimmed
    raise ValueError(f"Payload too large: {new_size} bytes")

# Usage (assumes send_event raises PayloadTooLargeError on a 413 response)
try:
    await send_event(ws, agent_id, thread_id, payload)
except PayloadTooLargeError:
    trimmed = trim_payload(payload)
    await send_event(ws, agent_id, thread_id, trimmed)
Possible causes:
- Payload includes entire conversation history
- Payload includes large attachments/images as base64
- Task has many comments or large descriptions
- Debug/metadata fields included unnecessarily
What the user sees: (This is usually a backend error, not shown to user)
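Before trimming, it helps to know which fields account for most of the size. The sketch below measures each top-level field after JSON serialization. It assumes the 64KB limit applies to the UTF-8 encoded JSON payload, which this reference does not state explicitly; `payload_size_bytes` and `largest_fields` are hypothetical helper names.

```python
import json

MAX_PAYLOAD_BYTES = 64 * 1024  # the 64KB limit described above


def payload_size_bytes(value):
    """Size of a value once serialized to UTF-8 encoded JSON."""
    return len(json.dumps(value).encode("utf-8"))


def largest_fields(payload, top_n=3):
    """Return up to top_n (field, size_in_bytes) pairs, largest first."""
    sizes = {key: payload_size_bytes(value) for key, value in payload.items()}
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)[:top_n]
```

Logging the result of `largest_fields(payload)` when a send is rejected points directly at the fields worth truncating or dropping.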
RATE_LIMITED
When it occurs: Your app has exceeded its per-minute event limit.
HTTP Status: 429 (Too Many Requests)
Is it retryable? Yes, but respect the limit
Recommended action:
- Immediately stop sending new events
- Queue events internally
- Resume sending at a lower rate (respecting the limit)
- Wait at least 10 seconds before retrying
- Use exponential backoff starting at 10 seconds
import asyncio
import json
import time

class RateLimitedClient:
    def __init__(self, events_per_minute=60):
        self.events_per_minute = events_per_minute
        self.min_interval = 60 / events_per_minute
        self.event_queue = asyncio.Queue()
        self.last_sent_time = 0

    async def send_event(self, ws, agent_id, thread_id, payload):
        """Queue event and send respecting rate limit"""
        await self.event_queue.put((agent_id, thread_id, payload))
        await self.process_queue(ws)

    async def process_queue(self, ws):
        """Process queue at rate limit"""
        while not self.event_queue.empty():
            agent_id, thread_id, payload = await self.event_queue.get()
            # Wait if needed to respect rate limit
            time_since_last = time.time() - self.last_sent_time
            if time_since_last < self.min_interval:
                await asyncio.sleep(self.min_interval - time_since_last)
            event = {
                "type": "event",
                "agent_id": agent_id,
                "thread_id": thread_id,
                "payload": payload
            }
            try:
                await ws.send(json.dumps(event))
                self.last_sent_time = time.time()
            except RateLimitError:
                # RateLimitError is assumed to be raised by your client on a 429.
                # Re-queue and wait
                await self.event_queue.put((agent_id, thread_id, payload))
                await asyncio.sleep(10)
Possible causes:
- Your app is sending events too fast
- Testing/load testing without rate limit awareness
- Buggy loop sending duplicate events
- Sudden traffic spike
What the user sees: (Usually handled transparently with queueing)
Check your rate limit: See Rate Limits for your organization's limits.
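The guidance to back off exponentially starting at 10 seconds can be computed up front as a schedule of waits. This is a minimal sketch: `rate_limit_delays` is a hypothetical helper, and the 120-second cap and default of four attempts are assumptions rather than documented Relay values.

```python
def rate_limit_delays(max_attempts=4, base_delay=10.0, max_delay=120.0):
    """Return the wait in seconds before each retry, doubling from base_delay up to max_delay."""
    return [min(base_delay * 2 ** attempt, max_delay) for attempt in range(max_attempts)]
```

With the defaults this produces waits of 10, 20, 40, and 80 seconds. The cap keeps a long outage from pushing waits into many minutes.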
INVALID_MESSAGE
When it occurs: Event is malformed or missing required fields.
HTTP Status: 400 (Bad Request)
Is it retryable? No
Recommended action:
- Do NOT retry
- Log the malformed event
- Fix the event structure
- Validate before sending
import json

def validate_event(agent_id, thread_id, payload):
    """Validate event before sending"""
    if not isinstance(agent_id, str) or not agent_id.strip():
        raise ValueError("agent_id must be non-empty string")
    if not isinstance(thread_id, str) or not thread_id.strip():
        raise ValueError("thread_id must be non-empty string")
    if not isinstance(payload, dict):
        raise ValueError("payload must be a dict")
    # Measure encoded bytes, not characters, to match the 64KB limit
    if len(json.dumps(payload).encode("utf-8")) > 64 * 1024:
        raise ValueError("payload exceeds 64KB")
    return True

# Usage
try:
    validate_event(agent_id, thread_id, payload)
    await send_event(ws, agent_id, thread_id, payload)
except ValueError as e:
    logger.error(f"Invalid event: {e}")
Common validation errors:
- agent_id is empty or null
- thread_id is empty or null
- payload is not a dict
- Event JSON is malformed
- Required field is missing
What the user sees: (Usually a backend error)
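When logging a malformed event, it is often more useful to report every problem at once than to stop at the first. The sketch below checks a raw JSON string against the common validation errors listed above. The required field names are assumed from the event shape used elsewhere in this reference (`type`, `agent_id`, `thread_id`, `payload`), and `find_event_problems` is a hypothetical helper.

```python
import json

# Assumed from the event shape used in this reference; adjust to your schema.
REQUIRED_FIELDS = ("type", "agent_id", "thread_id", "payload")


def find_event_problems(raw_event):
    """Return a list of problems; an empty list means the event looks valid."""
    try:
        event = json.loads(raw_event)
    except json.JSONDecodeError as exc:
        return [f"malformed JSON: {exc.msg}"]
    if not isinstance(event, dict):
        return ["event must be a JSON object"]
    problems = [
        f"missing required field: {name}"
        for name in REQUIRED_FIELDS
        if name not in event
    ]
    for name in ("agent_id", "thread_id"):
        value = event.get(name)
        if name in event and (not isinstance(value, str) or not value.strip()):
            problems.append(f"{name} is empty or null")
    if "payload" in event and not isinstance(event["payload"], dict):
        problems.append("payload is not a dict")
    return problems
```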
RELAY_INTERNAL_ERROR
When it occurs: An unexpected error occurred in Relay (bug or infrastructure issue).
HTTP Status: 500 (Internal Server Error)
Is it retryable? Yes
Recommended action:
- Retry with exponential backoff
- After 3-5 failed attempts, escalate to support
- Include event_id in bug report
if error_code == "RELAY_INTERNAL_ERROR":
    for attempt in range(3):
        delay = 2 ** attempt  # 1s, 2s, 4s
        await asyncio.sleep(delay)
        result = await send_event(ws, agent_id, thread_id, payload)
        if result.get("type") != "error":
            return result
    # Still failing, contact support
    logger.critical(f"Relay error persists for event {event_id}")
    notify_support(f"event_id={event_id}, app_id={app_id}")
Possible causes:
- Bug in Relay code
- Database connection issue
- Infrastructure outage
- Temporary service degradation
What the user sees: "Service temporarily unavailable. Please try again shortly."
Error Handling Strategy
Decision Tree
┌─ Is it retryable?
│  ├─ No: Log error, inform user, don't retry
│  │
│  └─ Yes: How many retries already?
│     ├─ 0-2: Retry with exponential backoff
│     │
│     └─ 3+: Give up, inform user
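The decision tree maps directly onto a small function. This is a sketch under stated assumptions: `next_action` and its return values are hypothetical, and the set of retryable codes is taken from the sample error handler in this reference.

```python
RETRYABLE_CODES = {
    "AGENT_OFFLINE",
    "AGENT_TIMEOUT",
    "RATE_LIMITED",
    "RELAY_INTERNAL_ERROR",
}
MAX_RETRIES = 3


def next_action(error_code, retries_so_far):
    """Walk the decision tree and return 'fail', 'retry', or 'give_up'."""
    if error_code not in RETRYABLE_CODES:
        return "fail"      # log error, inform user, don't retry
    if retries_so_far < MAX_RETRIES:
        return "retry"     # 0-2 retries so far: retry with exponential backoff
    return "give_up"       # 3 or more retries: give up, inform user
```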
Sample Error Handler
import asyncio
import logging

logger = logging.getLogger(__name__)

class RelayErrorHandler:
    RETRYABLE_ERRORS = {
        "AGENT_OFFLINE",
        "AGENT_TIMEOUT",
        "RATE_LIMITED",
        "RELAY_INTERNAL_ERROR"
    }

    async def handle_error(self, error_code, event_id, max_retries=3):
        if error_code not in self.RETRYABLE_ERRORS:
            # Non-retryable
            logger.error(f"Non-retryable error {error_code} for {event_id}")
            return {"retryable": False, "code": error_code}
        # Retryable - try again
        for attempt in range(max_retries):
            delay = self.calculate_backoff(attempt, error_code)
            logger.info(f"Retrying after {delay}s (attempt {attempt+1}/{max_retries})")
            await asyncio.sleep(delay)
            # retry_event is assumed to be implemented by your client
            result = await self.retry_event(event_id)
            if result.get("type") != "error":
                return {"retryable": True, "retried": True, "success": True}
        # Max retries exceeded
        logger.error(f"Max retries exceeded for {event_id}")
        return {"retryable": True, "retried": True, "success": False}

    def calculate_backoff(self, attempt, error_code):
        if error_code == "RATE_LIMITED":
            return 10 * 2 ** attempt  # Start at 10s for rate limit: 10s, 20s, 40s, ...
        return 2 ** attempt  # 1s, 2s, 4s, 8s, ...
Best Practices
Do
- Check error code in all responses
- Retry only retryable errors
- Use exponential backoff
- Log all errors with context
- Validate payloads before sending
Don't
- Retry permission errors
- Retry invalid events
- Use fixed delays
- Ignore rate limits
- Retry indefinitely
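The first and fourth "Do" items, checking the error code in every response and logging with context, can share one helper. The sketch below assumes error responses look like `{"type": "error", "code": "..."}`, as the snippets in this reference suggest. `check_response` and the `relay_client` logger name are hypothetical.

```python
import logging

logger = logging.getLogger("relay_client")


def check_response(response, **context):
    """Return the error code if the response is an error, otherwise None.

    Any keyword arguments (app_id, agent_id, event_id, ...) are logged as context.
    """
    if response.get("type") != "error":
        return None
    code = response.get("code", "UNKNOWN")
    logger.error("Relay error %s (context: %s)", code, context)
    return code
```

Calling `check_response(result, app_id=app_id, agent_id=agent_id)` after every send gives each log line enough detail to trace the failure.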
Support
If you encounter persistent errors not covered here, contact support@relay.ckgworks.com with:
- Error code
- Event ID (if available)
- Timestamp
- Steps to reproduce
- Logs