Optimize Performance

A client escalation about slowness is one of the most stressful moments in an outsourcing engagement. The app felt fine in staging, but under real traffic the API response times tripled and users started complaining. Sun Agent Kit’s performance agents profile your application systematically — identifying N+1 queries, missing indexes, blocking I/O, and memory leaks — and then generate the specific, targeted fixes your codebase needs, not a generic list of “consider caching.”

Overview

Goal: Diagnose the root cause of a performance regression, implement targeted fixes, and validate improvement
Time: 30–60 minutes (vs 4–12 hours of manual profiling and trial-and-error)
Agents used: debugger, scout, implementer, tester
Commands: /sk:debug, /sk:scout, /sk:cook, /sk:test

Prerequisites

  • Sun Agent Kit installed and authenticated (installation guide)
  • Access to application logs or a log file with slow request traces
  • Database access for query analysis (read-only credentials are sufficient)
  • A reproducible scenario — URL path, payload, and approximate request rate that triggers slowness
  • Optional: APM data export (Datadog, New Relic, or application-level timing logs)

Step-by-Step Workflow

Step 1: Profile the application and identify bottlenecks

Start by giving the agent the symptom — a slow endpoint URL or a log excerpt — and let it locate the source.

/sk:debug "API endpoint GET /api/products is returning in 3-8 seconds. Normal was 200ms. Traffic increased 3x last week."

What happens: The agent:

  1. Parses the symptom description and identifies the affected endpoint and expected baseline
  2. Traces the request path from controller through services to database queries
  3. Analyzes query patterns looking for N+1 queries, missing indexes, and unbounded result sets
  4. Checks for caching layers and identifies synchronous I/O in hot paths
  5. Produces a root cause analysis with prioritized findings

Step 2: Scout for other performance issues across the codebase

Widen the search to catch latent problems before they become the next escalation.

/sk:scout "N+1 query, SELECT *, missing pagination, synchronous I/O, no cache, unbounded loop"

What happens: The agent:

  1. Scans the codebase for common performance anti-patterns
  2. Prioritizes findings by impact (HIGH for N+1 in hot paths, LOW for admin views)
  3. Generates a fix roadmap saved to plans/reports/

Step 3: Fix N+1 queries with eager loading

The agent rewrites the affected queries to use joins or eager loading, matching the ORM idioms already in your codebase.

/sk:cook "Fix N+1 queries in ProductController, OrderController, and UserController — use eager loading per the ORM conventions in this project"

What happens: The agent:

  1. Detects the ORM in use (Eloquent, Prisma, TypeORM, etc.) and applies idiomatic eager loading
  2. Rewrites queries to include related models in a single database call
  3. Adds pagination where results were previously unbounded
  4. Runs tests to verify no regressions

Step 4: Add database indexes for high-traffic columns

The agent reads your migration history, identifies missing indexes on foreign keys and frequently filtered columns, and generates the migration files.

/sk:cook "Add missing database indexes for performance — analyze the schema and generate migrations"

What happens: The agent:

  1. Analyzes the schema for foreign keys, WHERE clause columns, and ORDER BY columns without indexes
  2. Generates migration files with the appropriate index definitions
  3. Estimates the impact of each index on common query patterns

Step 5: Implement a caching layer

Add Redis caching to the highest-traffic endpoints, with appropriate TTLs and cache invalidation hooks.

/sk:cook "Add Redis caching to GET /api/products and GET /api/categories — 5 minute TTL, invalidate on product or category write"

What happens: The agent:

  1. Checks Redis configuration in the project environment
  2. Applies cache-aside pattern to the specified endpoints with parameterized cache keys
  3. Adds cache invalidation hooks on write operations (create, update, delete)
  4. Writes tests for cache hit, cache miss, and invalidation scenarios

Step 6: Validate the improvements

Run the test suite and verify the optimizations work correctly.

/sk:test "run all tests and verify no regressions from performance optimizations"

What happens: The agent runs the full test suite including the new cache and query tests, and reports any failures.

Complete Example: API Response Time Degraded After Traffic Spike

Scenario

A Japanese e-commerce client’s product catalog API was performing fine during the pilot — a few hundred registered users, fast response times. The official launch brought thousands of users in 48 hours. By day two, the client’s PM is in your Slack: “Product pages take 5–8 seconds to load. Some requests time out completely. Please fix immediately.” Your team has never seen the production environment under real load. You have no APM tooling configured.

Commands chained

# 1. Diagnose — give the agent the symptom
/sk:debug "Product catalog API (GET /api/v1/products) degraded from 200ms to 5-8s after launch. Timeouts occurring under load. No APM data available."

# 2. Full performance audit to find all related issues
/sk:scout "N+1 query, missing index, missing pagination, synchronous I/O, no cache, unbounded loop"

# 3. Fix the highest-impact issues first
/sk:cook "Fix N+1 queries in product catalog — use eager loading per ORM conventions"
/sk:cook "Add database indexes on foreign keys and frequently filtered columns"

# 4. Add caching layer
/sk:cook "Add Redis caching to product list and product detail endpoints — tag-based invalidation on product write events"

# 5. Add pagination to prevent unbounded result sets
/sk:cook "Add cursor-based pagination to GET /api/v1/products — default page size 20, max 100"

# 6. Test everything
/sk:test "run full test suite after performance optimizations"
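The cursor-based pagination requested in command #5 can be sketched like this (an in-memory list stands in for the real query; the function and field names are illustrative, not generated output):

```python
# Cursor pagination sketch: return up to `page_size` items whose id is
# past the cursor, plus the cursor for the next page. Enforces the
# 20-default / 100-max limits from the example command.
PRODUCTS = [{"id": i, "name": f"p{i}"} for i in range(1, 251)]

def list_products(cursor=0, page_size=20):
    page_size = min(page_size, 100)                    # cap at the maximum
    items = [p for p in PRODUCTS if p["id"] > cursor][:page_size]
    next_cursor = items[-1]["id"] if len(items) == page_size else None
    return {"items": items, "next_cursor": next_cursor}

first = list_products()
print(len(first["items"]), first["next_cursor"])       # 20 20
second = list_products(cursor=first["next_cursor"])
print(second["items"][0]["id"])                        # 21
```

Unlike offset pagination, the cursor keeps late pages cheap: the database never has to skip past thousands of earlier rows, which is what makes unbounded catalogs survivable under load.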

Result

Within a couple of hours after the escalation, the API is back to fast response times under load. The client receives a report explaining the specific root causes and the changes made. The PM responds: “Thank you for the fast response. Please add monitoring so we catch this earlier next time.” That is a follow-up scope conversation, not a crisis.

Time Comparison

| Task | Manual | With Sun Agent Kit |
|---|---|---|
| Reading code to find slow queries | 60 min | minutes |
| Identifying N+1 patterns manually | 45 min | minutes |
| Researching and writing index migrations | 30 min | minutes |
| Designing and implementing cache layer | 90 min | minutes |
| Writing cache invalidation logic | 30 min | minutes |
| Writing client performance report | 30 min | minutes |
| **Total** | ~5 hours | under 30 minutes |

Performance Targets Table

Use these targets as acceptance criteria when closing a performance ticket:

| Endpoint Type | Acceptable p50 | Acceptable p95 | Maximum p99 |
|---|---|---|---|
| List (paginated, cached) | < 100ms | < 200ms | < 500ms |
| List (paginated, no cache) | < 300ms | < 600ms | < 1,000ms |
| Detail (single record) | < 80ms | < 150ms | < 300ms |
| Write (create/update) | < 200ms | < 400ms | < 800ms |
| Search (full-text) | < 500ms | < 1,000ms | < 2,000ms |
| Report / aggregation | < 2,000ms | < 5,000ms | < 10,000ms |
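The targets above are percentiles, so closing a ticket means computing p50/p95/p99 from your own timing logs. A minimal way to do that (nearest-rank method, assuming you have already extracted per-request latencies in milliseconds):

```python
# Nearest-rank percentile: the ceil(pct/100 * N)-th smallest sample.
def percentile(samples, pct):
    ordered = sorted(samples)
    rank = max(1, -(-len(ordered) * pct // 100))   # ceiling division
    return ordered[int(rank) - 1]

# Example latencies (ms) pulled from a timing log
latencies = [80, 90, 95, 110, 120, 140, 180, 220, 450, 900]
print(percentile(latencies, 50))   # 120
print(percentile(latencies, 95))   # 900
```

In this sample the p50 meets the uncached-list target but the p95 blows past it — a reminder that averages hide exactly the tail this table is designed to catch.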

Best Practices

1. Always measure before you optimize — get a baseline number ✅

The agent asks for a symptom description before generating any fix. If you skip profiling and jump to “add Redis everywhere,” you may cache the wrong thing and mask the real problem. Require a concrete metric: “3,200ms average on GET /api/products under 50 concurrent users” is a diagnosis; “it feels slow” is not.

2. Fix N+1 queries before adding caching ✅

Caching a slow query is not a fix — it is hiding the problem. An N+1 that fires hundreds of queries per request will still fire them every time the cache misses, and under write-heavy workloads the cache will miss constantly. Fix the query first, then add caching to reduce database load further.

3. Do not cache database writes to speed up response time ❌

It is tempting to “cache” a write operation by returning success immediately and processing asynchronously. This is a message queue pattern, not a cache — and implementing it incorrectly produces data loss. If you need async processing, use /sk:cook "add job queue" to scaffold a proper queue integration, not an ad-hoc in-memory cache.

4. Do not set a single TTL for all cached data ❌

Product names change infrequently — a 10-minute TTL is fine. A user’s cart changes on every interaction — caching it for 10 minutes causes lost-cart bugs. The agent defaults to conservative TTLs per entity type; review and adjust them based on your write frequency before deploying.
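One concrete way to keep TTLs reviewable is a single per-entity map rather than scattered magic numbers. The values below are illustrative assumptions, not kit defaults — tune them against your actual write frequency:

```python
# Hypothetical per-entity TTL map (seconds); 0 means "do not cache".
CACHE_TTL_SECONDS = {
    "product": 600,    # names/prices change infrequently
    "category": 3600,  # near-static taxonomy
    "cart": 0,         # changes on every interaction
    "inventory": 30,   # stock counts must stay fresh
}

def ttl_for(entity):
    # Conservative fallback for entities not listed explicitly
    return CACHE_TTL_SECONDS.get(entity, 60)
```

With the map in one place, a reviewer can sanity-check every TTL against write frequency in a single glance before deploy.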

Troubleshooting

Problem: After adding indexes, queries are not faster

Solution: Run EXPLAIN ANALYZE on the slow query to confirm the index is being used. Postgres and MySQL will sometimes ignore a new index if statistics are stale. Run ANALYZE products; (Postgres) or ANALYZE TABLE products; (MySQL) to update statistics. If the query planner still ignores the index, the agent can help redesign the index to better match the query’s WHERE and ORDER BY clauses.

Problem: Redis caching works in development but causes stale data in production

Solution: The most common cause is missing cache invalidation on write paths you did not anticipate. Run /sk:scout "find all write operations on the products table" to find all places that write to the table — every one of them needs the cache invalidation hook. The agent will list any that were missed.

Problem: Load test shows high error rate even after fixes

Solution: High error rates under load usually indicate connection pool exhaustion, not query speed. Check your database connection pool size (DB_POOL_SIZE in .env). A common misconfiguration is leaving the pool at the default of 5 connections on a server handling 100+ concurrent requests. Set it to (2 × CPU cores) + number of disks as a starting point.
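The starting-point formula above is simple enough to sanity-check in a line of code (the helper name is ours, and the formula is a heuristic starting point, not a guarantee):

```python
# Pool-size heuristic: (2 x CPU cores) + number of disks.
import os

def suggested_pool_size(cpu_cores, disk_count=1):
    return 2 * cpu_cores + disk_count

print(suggested_pool_size(8, 2))                   # 18
print(suggested_pool_size(os.cpu_count() or 1))    # varies by host
```

Note the suggestion scales with the database server's cores, not with concurrent requests — a pool far larger than this usually just moves the queue into the database.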

Problem: Performance is fine under load test but slow for real users

Solution: Load tests using the same endpoint and payload do not capture real-world variability. Run /sk:debug "slow for specific users" and check whether slowness correlates with users who have large datasets (many orders, many products in cart). The fix is usually adding a user_id index or scoping queries that are accidentally loading cross-user data.

Next Steps

  • Code Review — Catch performance anti-patterns during code review before they reach production
  • Refactor Legacy Code — Structural refactoring often resolves systemic performance issues that point fixes cannot reach
  • Security Audit & Scan — Run alongside performance work to ensure optimizations do not introduce security regressions

Key takeaway: Performance escalations feel urgent because they are — and the fastest path from a slow API back to fast response times is systematic profiling followed by targeted fixes, not guessing and hoping.