Study Guide: Alex for SRE & On-Call Engineers
Your reference for applying Alex to incident response, observability design, runbook authoring, postmortems, and reliability engineering practice. Ready-to-run prompts built around the hard parts of keeping systems running, not SRE interview prep.
What This Guide Is Not
This is not a habit formation guide (see Self-Study Guide for that). This is a domain use-case library — the specific ways Alex supports site reliability engineering and on-call work.
Where to Practice These Prompts
Every prompt in this guide works with any AI assistant you already use — GitHub Copilot, ChatGPT, Claude, Gemini, or others. The prompts are the skill; the tool is just where you type them. If you already have a preferred tool, start there.
For the deepest experience, the Alex VS Code extension (free) was built for these workflows. It understands SRE and on-call context, lets you save what works with /saveinsight, and keeps your study guide and exercises right inside the editor where you already work.
You don’t need a specific tool to benefit. You need the discipline of reaching for AI when the work is genuinely hard — not just when it’s repetitive.
Core Principle for SRE & On-Call Engineers
Reliability engineering is the discipline of building and operating systems that fail gracefully. The hardest part is not preventing failures — failures are inevitable. The hardest part is building systems and processes that detect failures quickly, contain their blast radius, and recover without heroics. The SRE who relies on AI effectively is not one who automates blindly; it is one who uses AI to think more clearly under pressure and document more honestly after.
Your primary discipline with Alex: use it to structure your reasoning during incidents (when your brain is in fight-or-flight), design observability before you need it, and write postmortems that produce systemic improvements rather than blame.
The Seven Use Cases
1. Incident Response and Triage
The SRE’s incident challenge: The first 15 minutes of an incident determine its trajectory. Actions taken in panic — restarting services without checking logs, rolling back without verifying the rollback target, or making multiple changes simultaneously — can mask the root cause and extend the outage. The SRE who handles incidents well slows down enough to form hypotheses even under pressure.
Prompt pattern:
Active incident:
Symptoms: [what is broken — error rates, latency, availability].
Scope: [which services, regions, users are affected].
Timeline: [when it started, when detected, what has been tried].
Recent changes: [deployments, config changes, infra changes in the last 24–72 hours].
Current hypothesis: [what I think is wrong — or "no idea yet"].
Help me:
1. Generate 3 hypotheses ranked by probability given these symptoms
2. Design the diagnostic test for each hypothesis (not "check the logs" — be specific)
3. Identify what NOT to do — actions that could mask the cause or extend the outage
4. Draft the status update for stakeholders (factual, no speculation)
Follow-up prompts:
My first hypothesis was wrong. Here is what I found: [evidence]. Update the hypothesis ranking.
We have mitigated the immediate impact. What evidence should I preserve for the postmortem before it ages out of logs?
Try this now: Your payment service is returning 502 errors for 15% of requests. It started 20 minutes ago. The last deployment was 6 hours ago. CloudWatch shows increased latency on the database connection pool. Paste those facts into the incident prompt and ask for ranked hypotheses. At 2 AM with adrenaline running, a structured list of what to check next is worth more than any runbook.
2. Postmortem Writing
The SRE’s postmortem challenge: Postmortems that assign blame produce cover-ups. Postmortems that are too gentle produce inaction. The useful postmortem is honest about what happened, specific about contributing factors, and generates action items that address systemic causes — not just the trigger.
Prompt pattern:
Write a postmortem for an incident:
Timeline: [ordered sequence of events from trigger to resolution].
Root cause: [the technical cause].
Contributing factors: [organizational, process, and design factors that enabled the root cause].
Impact: [duration, scope, data loss, customer impact, SLA breach].
Detection: [how it was found — monitoring, customer report, accident].
Response: [what was done, in what order, by whom].
What went well: [honest assessment].
What went poorly: [equally honest assessment].
Help me:
1. Structure this as a blameless postmortem (people did reasonable things given their information)
2. Separate the trigger (what started it) from the contributing factors (why it got bad)
3. Generate action items that are specific, assigned, and time-bounded
4. Identify the detection or containment improvements that would have reduced impact
Follow-up prompts:
We have done 5 postmortems in 3 months with similar contributing factors. What systemic issue connects them?
Our action items from postmortems keep getting deprioritized. How do we present these to leadership as business risks?
3. Runbook Authoring and Maintenance
The SRE’s runbook challenge: Runbooks are written after incidents and abandoned before the next one. The runbook that exists but is wrong is more dangerous than no runbook at all — because the on-call engineer follows it, trusts it, and loses time when it does not match the current system. Good runbooks are living documents with clear ownership and expiration dates.
Prompt pattern:
I need to write a runbook for [scenario: service restart, failover, data recovery, scaling, certificate rotation].
System: [what it is, how it works, key components].
When to use: [symptoms or conditions that trigger this runbook].
Prerequisites: [access, credentials, tools needed].
Last incident: [what happened when this was needed — include the gotchas].
Help me:
1. Structure this as step-by-step instructions that work at 3 AM under pressure
2. Include the verification step after each action (how do I know it worked?)
3. Add the "stop and escalate" conditions (when should the on-call person get help?)
4. Include rollback instructions for each step that could make things worse
5. Add the freshness date — when should this runbook be reviewed?
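The verification-and-rollback structure above can be sketched in code. This is a minimal illustration, not a real automation framework: the `RunbookStep` and `execute` names are hypothetical, and the callables stand in for whatever your actual actions, health checks, and rollbacks are. The point is the shape — every action carries its own "how do I know it worked?" check and its own undo.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RunbookStep:
    """One runbook action paired with its verification and rollback."""
    name: str
    action: Callable[[], None]    # what to do
    verify: Callable[[], bool]    # how do I know it worked?
    rollback: Callable[[], None]  # how to undo it if things get worse

def execute(steps: List[RunbookStep]) -> None:
    done: List[RunbookStep] = []
    for step in steps:
        step.action()
        if not step.verify():
            # Stop-and-escalate condition: roll back this step and all
            # prior ones, newest first, then get a human involved.
            for prior in reversed(done + [step]):
                prior.rollback()
            raise RuntimeError(f"step '{step.name}' failed verification; escalate")
        done.append(step)
```

Even if your runbooks stay as prose, writing each step in this shape (action, verification, rollback) is a useful authoring checklist.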
4. Observability Design
The SRE’s observability challenge: Monitoring that generates 500 alerts per day produces alert fatigue. Monitoring that misses the one alert that matters produces outages. The discipline of observability is not adding more dashboards — it is designing signals that distinguish “something is wrong” from “noise” and building the context needed to diagnose problems without asking “what changed?”
Prompt pattern:
I need to design observability for [service/system].
Architecture: [components, dependencies, communication patterns].
SLAs/SLOs: [what reliability we are committed to].
Current monitoring: [what exists — metrics, logs, traces, synthetics].
Gap: [what has been missed in past incidents — what we wished we had].
Help me:
1. Define the SLIs that actually measure user experience (not just CPU and memory)
2. Design alert thresholds that balance sensitivity (catch problems) with specificity (avoid noise)
3. Build the diagnostic dashboard layout: from symptom → component → root cause
4. Identify the logs and traces needed to answer the question "what happened?" without guessing
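The sensitivity-versus-specificity balance in point 2 is often implemented with error-budget burn rates. A sketch, assuming a request-based availability SLI; the function names are illustrative, and 14.4 is the commonly cited burn-rate threshold for a 1-hour window against a 30-day budget (a sustained 14.4x burn consumes about 2% of the monthly budget in one hour).

```python
def availability_sli(good: int, total: int) -> float:
    """Fraction of requests that met the user-facing success criterion."""
    return good / total if total else 1.0

def burn_rate(sli: float, slo: float) -> float:
    """How fast the error budget is being consumed; 1.0 means exactly on budget."""
    error_budget = 1.0 - slo       # e.g. 0.001 for a 99.9% SLO
    observed_error = 1.0 - sli
    return observed_error / error_budget

def should_page(burn_1h: float, burn_5m: float, threshold: float = 14.4) -> bool:
    # Multiwindow heuristic: page only when the short window confirms the
    # long window, which suppresses pages for brief, self-healing spikes.
    return burn_1h > threshold and burn_5m > threshold
```

The same structure ports directly to Prometheus or CloudWatch alert expressions; the code just makes the logic explicit.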
5. Capacity Planning and Scaling
The SRE’s capacity challenge: Capacity planning is predicting the future with incomplete data. The failure mode is either over-provisioning (expensive) or under-provisioning (outages). The SRE who plans capacity well understands not just current usage but growth patterns, traffic spikes, and the degradation modes that appear before hard failure.
Prompt pattern:
I need to plan capacity for [service/infrastructure].
Current utilization: [CPU, memory, storage, network — with trends].
Growth pattern: [linear / exponential / seasonal / event-driven].
Known upcoming events: [launches, campaigns, migrations that will change load].
Failure mode: [what breaks first when capacity is exceeded — and is it graceful?].
Budget constraints: [what we can spend on capacity].
Help me:
1. Project when we hit capacity limits under current growth
2. Identify the resource that becomes the bottleneck first (it is not always what you think)
3. Design the auto-scaling policy (if applicable) with appropriate thresholds
4. Plan the capacity review cadence — how often should we re-evaluate?
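Point 1 is straightforward arithmetic once you commit to a growth model. A minimal sketch for the compound-growth case (the function name and the 80% headroom default are assumptions; linear or seasonal growth needs a different model):

```python
import math

def weeks_until_limit(current_pct: float, weekly_growth_pct: float,
                      limit_pct: float = 80.0) -> float:
    """Weeks until utilization crosses the limit under compound growth.

    current_pct and limit_pct are utilization percentages; weekly_growth_pct
    is percent growth per week. Returns 0 if already over the limit and
    infinity if there is no growth.
    """
    if current_pct >= limit_pct:
        return 0.0
    if weekly_growth_pct <= 0:
        return math.inf
    growth = 1 + weekly_growth_pct / 100
    return math.log(limit_pct / current_pct) / math.log(growth)
```

For example, a resource at 40% utilization growing 5% per week reaches an 80% ceiling in roughly 14 weeks, which is the number your capacity review cadence should be shorter than.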
6. Chaos Engineering and Reliability Testing
The SRE’s testing challenge: You do not know if a system is resilient until you break it intentionally. The failure mode of chaos engineering is either not doing it (and learning during production incidents) or doing it without a hypothesis (and generating noise without learning).
Prompt pattern:
I want to design a chaos experiment for [system/service].
Hypothesis: [I believe {system} can tolerate {failure} with {impact threshold}].
Blast radius: [what is at risk if the hypothesis is wrong].
Steady-state metrics: [what "normal" looks like — the control baseline].
Safety controls: [how to abort the experiment if it goes wrong].
Environment: [production / staging / isolated].
Help me:
1. Validate the hypothesis is specific and falsifiable
2. Design the experiment with clear start, measurement, and stop conditions
3. Identify what could go wrong with the experiment itself (experiment risk, not system risk)
4. Plan the rollback if the experiment causes unexpected impact
5. Define what we learn regardless of outcome — both "system is resilient" and "system is fragile" are valuable
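The experiment shape described above (steady-state check, injection, measurement, guaranteed abort) can be expressed as a small harness. This is an illustrative sketch, not a real chaos tool; `ChaosExperiment` and `run` are hypothetical names, and the callables stand in for your actual metric queries, fault injection, and safety controls.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ChaosExperiment:
    hypothesis: str
    steady_state: Callable[[], bool]  # is the control baseline healthy?
    inject: Callable[[], None]        # introduce the failure
    abort: Callable[[], None]         # safety control: restore the system

def run(exp: ChaosExperiment, checks: int = 3) -> bool:
    """True if steady state held under the injected failure (hypothesis supported)."""
    if not exp.steady_state():
        # Never start an experiment against an unhealthy baseline.
        raise RuntimeError("refusing to start: system not in steady state")
    exp.inject()
    try:
        for _ in range(checks):
            if not exp.steady_state():
                return False  # hypothesis falsified: the system is fragile here
        return True
    finally:
        exp.abort()  # always restore, whatever the outcome
```

Note the `finally`: the abort path runs on both outcomes and on crashes in the harness itself, which is the property point 4 above asks for.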
7. SRE Program and Toil Reduction
The SRE’s toil challenge: Toil — repetitive, manual, automatable work that scales with the system — is the tax that prevents SREs from doing engineering work. The discipline is not just automating toil; it is measuring it, prioritizing it against other work, and saying no to operational requests that would create more.
Prompt pattern:
I need to reduce toil in [area/process].
Current manual work: [what the on-call team does repeatedly — ticket types, manual steps, recurring fixes].
Frequency: [how often each task occurs].
Duration: [how long each instance takes].
Risk of manual execution: [what goes wrong when a human does this — errors, order-of-operations, fatigue].
Help me:
1. Categorize toil by: automatable now / needs design / needs organizational change
2. Calculate the ROI — hours saved vs. automation effort — for the top 5 toil items
3. Design the automation with appropriate human-in-the-loop checkpoints
4. Build the metric to track toil reduction over time (so leadership sees the value)
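The ROI calculation in point 2 is simple enough to keep in a spreadsheet, but making it a function keeps the team honest about units. A sketch (the function name is illustrative):

```python
def toil_roi(freq_per_week: float, minutes_per_run: float,
             automation_hours: float) -> tuple:
    """Hours of manual work saved per week, and payback period in weeks."""
    hours_saved_per_week = freq_per_week * minutes_per_run / 60
    payback_weeks = (automation_hours / hours_saved_per_week
                     if hours_saved_per_week else float("inf"))
    return hours_saved_per_week, payback_weeks
```

For example, a task run 10 times a week at 15 minutes per run, costing 20 engineering hours to automate, saves 2.5 hours per week and pays back in 8 weeks. Anything with a payback period shorter than the quarter is usually an easy sell to leadership.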
What Great Looks Like
After consistent use, you should notice:
- Incidents are handled more calmly — structured reasoning replaces panicked trial-and-error
- Postmortems produce real systemic changes, not just “be more careful” action items
- Runbooks are current, verified, and actually useful at 3 AM
- Observability catches problems before customers report them
- Toil decreases measurably and the team spends more time on engineering
The SRE who will thrive in an AI-augmented environment is not the one who automates the most. It is the one who reasons most clearly under pressure, documents most honestly after incidents, and builds systems where failure is a manageable event rather than a crisis.
Your AI toolkit: These prompts work in ChatGPT, Claude, Copilot, Gemini — and in the Alex VS Code extension, which was designed around them. Start with whatever you have. The skill transfers across all of them.
Your First Week Back: Practice Plan
| Day | Task | Time |
|---|---|---|
| Day 1 | Write a runbook for your most common on-call scenario | 25 min |
| Day 2 | Design SLIs for one service using the Observability pattern | 25 min |
| Day 3 | Rewrite your most recent postmortem using the blameless pattern | 20 min |
| Day 4 | Calculate toil for your team’s top 5 manual tasks | 20 min |
| Day 5 | Save three reusable prompt patterns with /saveinsight | 10 min |
Month 2–3: Advanced Applications
Incident Pattern Archive
Capture patterns from incidents to speed future diagnosis:
/saveinsight title="Incident pattern: [symptom]" insight="Symptom: [what was observed]. Root cause: [what actually broke]. Contributing factors: [systemic issues]. Detection gap: [what we missed]. First diagnostic step: [what to check next time]." tags="sre,incident,pattern"
Toil Reduction Tracker
Track automation investments and their payoff:
/saveinsight title="Toil reduction: [task]" insight="Manual effort: [hours/week]. Automation approach: [what was built]. Effort to automate: [hours]. Payback period: [weeks]. Residual toil: [what still requires humans]." tags="sre,toil,automation"
Continue your practice: Self-Study Guide — the 30/60/90-day habit guide.
Show the world you've mastered using AI in site reliability. Add your certificate to LinkedIn.
Alex was a co-author of two books — a documentary biography and a work of fiction. Both explore human-AI collaboration from angles the workshop only touches.